Lexical Normalisation for English Tweets

User-generated content (UGC) such as the text of Twitter messages is notoriously varied in content and composition, often containing ungrammatical sentence structures, non-standard words and domain-specific entities. Accuracy declines have been observed over UGC data in many NLP tasks (Gimpel et al. 2011; Liu et al. 2011), motivating the need for methods which normalise the content before NLP tools are applied.

This shared task focuses on text normalisation, aiming to normalise non-standard words in English Twitter messages to their canonical forms: correcting non-standard spellings (e.g., toook for took), expanding informal abbreviations (e.g., tmrw for tomorrow), and normalising phonetic substitutions (e.g., 4eva for forever). Part of the motivation for the shared task is that text normalisation has been embraced as a research task by the NLP community, but has been limited in scope and has suffered from a lack of large-scale datasets. We aim to broaden the scope of the task and enable the development and benchmarking of text normalisation approaches over a much larger dataset than is currently available.

Tasks

To make the task of text normalisation tractable, this shared task focuses on context-sensitive lexical normalisation of English Twitter messages, under the following constraints:

  1. Non-standard words (NSWs) are normalised to one or more canonical English words based on a pre-defined lexicon. For instance, l o v e should be normalised to love (many-to-one normalisation), tmrw to tomorrow (one-to-one normalisation), and cu to see you (one-to-many normalisation). Additionally, IBM should be left untouched as it is in the lexicon and in its canonical form, and the informal lol should be expanded to laughing out loud.
  2. Non-standard words may be either out-of-vocabulary (OOV) tokens (e.g., tmrw for tomorrow) or in-vocabulary (IV) tokens (e.g., wit for with in I will come wit you).
  3. Only alphanumeric tokens (e.g., 2, 4eva and tmrw) and apostrophes used in contractions (e.g., yoou've) are considered for normalisation. Tokens including hyphens, single quotes and other types of contractions should be ignored.
  4. Domain-specific entities are ignored even if they are in non-standard forms, e.g., #ttyl and @nyc.
  5. It is possible for a tweet to have no non-standard tokens but still require normalisation (e.g. our example of wit above), and also for the tweet to require no normalisation whatsoever.
  6. Proper nouns shall be left untouched, even if they are not in the given lexicon (e.g., Twitter).
  7. All normalisations should use American spelling (e.g., tokenize rather than tokenise).
  8. In cases where human annotators have been unable to determine whether a given token is a non-standard word or a standard (in-vocabulary) form, we have chosen to be conservative and leave the token unchanged.
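To make the constraints above concrete, the following is a minimal dictionary-lookup baseline sketch. The lexicon and variant dictionary are toy examples invented for illustration, not part of the task data, and the baseline ignores context entirely (so it cannot handle the context-sensitive cases, such as deciding whether wit should become with, or the many-to-one case l o v e):

```python
# Toy variant dictionary: non-standard form -> canonical form(s).
# One-to-many normalisations map to a multi-word string (e.g. "cu" -> "see you").
VARIANTS = {
    "tmrw": "tomorrow",          # one-to-one normalisation
    "cu": "see you",             # one-to-many normalisation
    "lol": "laughing out loud",  # informal abbreviation expansion
    "jst": "just",
    "wit": "with",               # an IV token; a real system must decide from context
}

def normalise(tokens):
    """Lowercase each token and replace known non-standard forms.

    Returns one output string per input token, matching the task's
    requirement that output aligns with input token-by-token.
    """
    out = []
    for tok in tokens:
        low = tok.lower()
        out.append(VARIANTS.get(low, low))  # unknown and IV tokens pass through
    return out

print(normalise(["Jst", "read", "a", "tweet", "lol"]))
```

A real system would additionally consult the pre-defined lexicon, detect many-to-one groups such as l o v e, and leave proper nouns and domain-specific entities (#ttyl, @nyc) untouched.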

A more detailed set of annotation guidelines is available on the shared task website.

For your convenience and for consistency of evaluation, we have pre-tokenised the tweets and provided them in JSON format. The training data file is a JSON list in which each item represents a tweet. A tweet is a JSON dict containing four fields: index (the ID for annotation), tid (tweet ID), input (a list of case-sensitive tokens to be normalised), and output (a list of normalised tokens, in lowercase). The test data for evaluation follows the same format as the training data, but does NOT include the output fields: your task is to automatically predict them. Note that all tokens in the output field for a given tweet should be in lowercase.

A mock-up sample JSON object is provided below for illustrative purposes: Jst, lol and l o v e are normalised to just, laughing out loud and love, respectively. Note that in the many-to-one case, the first token of the group carries the full normalised form and the remaining tokens map to empty strings.
{
  "tid": "971011879910802432",
  "index": "1064",
  "input": ["Jst", "read", "a", "tweet", "lol", "and", "l", "o", "v", "e", "it"],
  "output": ["just", "read", "a", "tweet", "laughing out loud", "and", "love", "", "", "", "it"]
}
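The files can be read with any standard JSON library. The sketch below (the file name in the comment is hypothetical) shows how the input/output alignment works, including the empty strings produced by many-to-one normalisations:

```python
import json

# A sample in the same shape as the training file (a JSON list of tweet dicts).
sample = '''[
  {"tid": "971011879910802432", "index": "1064",
   "input":  ["Jst", "read", "a", "tweet", "lol", "and", "l", "o", "v", "e", "it"],
   "output": ["just", "read", "a", "tweet", "laughing out loud", "and", "love", "", "", "", "it"]}
]'''

tweets = json.loads(sample)  # for the real file: json.load(open("train_data.json"))
for tweet in tweets:
    # Alignment is strictly positional: one output entry per input token.
    assert len(tweet["input"]) == len(tweet["output"])
    for raw, norm in zip(tweet["input"], tweet["output"]):
        if norm == "":
            continue  # trailing piece of a many-to-one normalisation (e.g. "o", "v", "e")
        print(f"{raw} -> {norm}")
```

For test data, the output field is absent, and your system must produce it in exactly this aligned, lowercase form.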

For evaluation, we will use the metrics of precision, recall and F1 over normalisations. Two categories of submission will be accepted: constrained mode (cm), in which systems may use only the provided training data, and unconstrained mode (um), in which external resources may additionally be used.
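As an illustration of how token-level precision, recall and F1 can be computed (a sketch only: it assumes a token counts as "normalised" whenever its output differs from the lowercased input, and the official scorer may differ in detail):

```python
def prf(inputs, gold, pred):
    """Token-level precision/recall/F1 over normalisations.

    Precision = correct normalisations / normalisations proposed by the system.
    Recall    = correct normalisations / normalisations in the gold standard.
    """
    sys_norm = correct = gold_norm = 0
    for raw, g, p in zip(inputs, gold, pred):
        low = raw.lower()
        if g != low:
            gold_norm += 1          # gold normalised this token
        if p != low:
            sys_norm += 1           # system normalised this token
            if p == g:
                correct += 1        # ... and got it right
    precision = correct / sys_norm if sys_norm else 0.0
    recall = correct / gold_norm if gold_norm else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Example: the system fixes "tmrw" but misses "lol" and wrongly changes "read".
p, r, f = prf(["Jst", "read", "tmrw", "lol"],
              ["just", "read", "tomorrow", "laughing out loud"],
              ["just", "red", "tomorrow", "lol"])
```

In this example the system proposes three normalisations of which two are correct, and the gold standard contains three, giving precision, recall and F1 of 2/3 each.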

Please submit your results to lexnorm2015@gmail.com as email attachments. The attachments should be plain JSON files, named by team name and submission category. For instance, a registered team named "demo" competing in both categories would provide the following two files as attachments: "demo.cm.json" and "demo.um.json". You may submit multiple times per category, but only your final submission before the deadline will be used for evaluation.


Resources


Important Dates

Note: All deadlines are at 11:59PM Pacific Daylight Time.

Organisers


References

  1. Aw, AiTi, Min Zhang, Juan Xiao, and Jian Su. 2006. A phrase-based statistical model for SMS text normalization. In Proceedings of COLING/ACL 2006, 33–40, Sydney, Australia.
  2. Beaufort, Richard, Sophie Roekhaut, Louise-Amélie Cougnon, and Cédrick Fairon. 2010. A hybrid rule/model-based finite-state framework for normalizing SMS messages. In Proceedings of the 48th Annual Meeting of the ACL (ACL 2010), 770–779, Uppsala, Sweden.
  3. Choudhury, Monojit, Rahul Saraf, Vijit Jain, Animesh Mukherjee, Sudeshna Sarkar, and Anupam Basu. 2007. Investigation and modeling of the structure of texting language. International Journal on Document Analysis and Recognition 10, 157–174.
  4. Chrupala, Grzegorz. 2014. Normalizing tweets with edit scripts and recurrent neural embeddings. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL 2014), 680–686, Baltimore, USA.
  5. Contractor, Danish, Tanveer A. Faruquie, and L. Venkata Subramaniam. 2010. Unsupervised cleansing of noisy text. In Proceedings of the 23rd International Conference on Computational Linguistics (COLING 2010), 189–196, Beijing, China.
  6. Cook, Paul, and Suzanne Stevenson. 2009. An unsupervised model for text message normalization. In Proceedings of the Workshop on Computational Approaches to Linguistic Creativity (CALC ’09), 71–78, Boulder, USA.
  7. Foster, Jennifer. 2010. “cba to check the spelling” investigating parser performance on discussion forum posts. In Proceedings of Human Language Technologies: The 11th Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL HLT 2010), 381–384, Los Angeles, USA.
  8. Gouws, Stephan, Dirk Hovy, and Donald Metzler. 2011a. Unsupervised mining of lexical variants from noisy text. In Proceedings of the First Workshop on Unsupervised Learning in NLP, 82–90, Edinburgh, UK.
  9. Han, Bo, and Timothy Baldwin. 2011. Lexical normalisation of short text messages: Makn sens a #twitter. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT 2011), 368–378, Portland, USA.
  10. Han, Bo, Paul Cook, and Timothy Baldwin. 2012. Automatically constructing a normalisation dictionary for microblogs. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL 2012), 421–432, Jeju, Republic of Korea.
  11. Han, Bo, Paul Cook, and Timothy Baldwin. 2013. Lexical normalisation of short text messages. ACM Transactions on Intelligent Systems and Technology 4(1), 5:1–5:27.
  12. Hassan, Hany, and Arul Menezes. 2013. Social text normalization using contextual graph random walks. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL 2013), 1577–1586, Sofia, Bulgaria.
  13. Kobus, Catherine, François Yvon, and Géraldine Damnati. 2008. Normalizing SMS: are two metaphors better than one? In Proceedings of the 22nd International Conference on Computational Linguistics (COLING 2008), 441–448, Manchester, UK.
  14. Li, Chen, and Yang Liu. 2012. Improving text normalization using character-blocks based models and system combination. In Proceedings of the 24th International Conference on Computational Linguistics (COLING 2012), 1587–1602, Mumbai, India.
  15. Ling, Wang, Chris Dyer, Alan W Black, and Isabel Trancoso. 2013. Paraphrasing 4 microblog normalization. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP 2013), 73–84, Seattle, USA.
  16. Liu, Fei, Fuliang Weng, and Xiao Jiang. 2012. A broad-coverage normalization system for social media language. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (ACL 2012), 1035–1044, Jeju Island, Korea.
  17. Liu, Fei, Fuliang Weng, Bingqing Wang, and Yang Liu. 2011a. Insertion, deletion, or substitution? Normalizing text messages without pre-categorization nor supervision. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL HLT 2011), 71–76, Portland, USA.
  18. Pennell, Deana, and Yang Liu. 2011a. A character-level machine translation approach for normalization of SMS abbreviations. In Proceedings of the 5th International Joint Conference on Natural Language Processing (IJCNLP 2011), 974–982, Chiang Mai, Thailand.
  19. Pennell, Deana, and Yang Liu. 2011b. Toward text message normalization: Modeling abbreviation generation. In Proceedings of 2011 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP ’11), 5364–5367, Prague, Czech Republic.
  20. Porta, Jordi, and José-Luis Sancho. 2013. Word normalization in Twitter using finite-state transducers. In Proceedings of the Tweet Normalization Workshop co-located with the 29th Conference of the Spanish Society for Natural Language Processing (SEPLN 2013), volume 1086, 49–53, Madrid, Spain.
  21. Sproat, Richard, Alan W. Black, Stanley Chen, Shankar Kumar, Mari Ostendorf, and Christopher Richards. 2001. Normalization of non-standard words. Computer Speech and Language 15, 287–333.
  22. Wang, Pidong, and Hwee Tou Ng. 2013. A beam-search decoder for normalization of social media text with application to machine translation. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT 2013), 471–481, Atlanta, USA.
  23. Xu, Wei, Alan Ritter, and Ralph Grishman. 2013. Gathering and generating paraphrases from Twitter with application to normalization. In Proceedings of the Sixth Workshop on Building and Using Comparable Corpora, 121–128, Sofia, Bulgaria.
  24. Xue, Zhenzhen, Dawei Yin, and Brian D. Davison. 2011. Normalizing microtext. In Proceedings of the AAAI-11 Workshop on Analyzing Microtext, 74–79, San Francisco, USA.
  25. Yang, Yi, and Jacob Eisenstein. 2013. A log-linear model for unsupervised text normalization. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP 2013), 61–72, Seattle, USA.
  26. Zhang, Congle, Tyler Baldwin, Howard Ho, Benny Kimelfeld, and Yunyao Li. 2013. Adaptive parser-centric text normalization. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL 2013), 1159–1168, Sofia, Bulgaria.
  27. Zhu, Conghui, Jie Tang, Hang Li, Hwee Tou Ng, and Tiejun Zhao. 2007. A unified tagging approach to text normalization. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics (ACL 2007), 688–695, Prague, Czech Republic.