Unsupervised Neologism Normalization Using Embedding Space Mapping

Nasser Zalmout1, Kapil Thadani2, Aasish Pappu3
1NYU Abu Dhabi, 2Yahoo Research, 3Spotify Research


This paper presents an approach for detecting and normalizing neologisms in social media content. Neologisms refer to recent expressions that are specific to certain entities or events and are being increasingly used by the public, but have not yet been accepted in mainstream language. Automated methods for handling neologisms are important for natural language understanding and normalization, especially for informal genres with user generated content. We present an unsupervised approach for detecting neologisms and then normalizing them to canonical words without relying on parallel training data. Our approach builds on the text normalization literature and introduces adaptations to fit the specificities of this task, including phonetic and etymological considerations. We evaluate the proposed techniques on a dataset of Reddit comments and share a human-annotated dataset of neologisms and their normalizations for future research.