An In-depth Analysis of the Effect of Lexical Normalization on the Dependency Parsing of Social Media

Rob van der Goot
University of Groningen


Abstract

Existing natural language processing systems have often been designed with standard texts in mind. However, when these tools are used on the substantially different texts from social media, their performance drops dramatically. One solution is to translate social media data to standard language before processing it; this is also called normalization. It is well known that this improves performance for many natural language processing tasks on social media data. However, little is known about which types of normalization replacements have the most effect. Furthermore, it is unknown what the weaknesses of existing lexical normalization systems are in an extrinsic setting. In this paper, we analyze the effect of manual as well as automatic lexical normalization on dependency parsing. After our analysis, we conclude that there is still room for improvement for lexical normalization systems and that small annotation differences are important to take into consideration when exploiting normalization in a pipeline setup.