Normalization of Indonesian-English Code-Mixed Twitter Data

Anab Maulana Barik, Rahmad Mahendra, Mirna Adriani
Universitas Indonesia


Abstract

Twitter is an excellent source of data for NLP research, as it offers a tremendous amount of useful textual information. Because tweets are written informally, nonstandard words and code-mixing (combining multiple languages in a single tweet) are common in Twitter data. Several studies have addressed nonstandard words or code-mixing, but to the best of our knowledge, no study has addressed these problems for Indonesian-English code-mixed data. In this study, we create a pipeline to normalize Indonesian-English code-mixed data, comprising four modules: tokenization, language identification, lexical normalization, and translation. To initiate the task of normalizing code-mixed data, especially for the Indonesian-English language pair, we also created 501 corpora of Indonesian-English code-mixed gold-standard data, including a gold standard for each of the four modules in our pipeline.