For this task, participants are asked to develop a system that performs lexical normalization: the conversion of non-canonical texts to their canonical equivalent form. In particular, this task includes data from 12 languages.
Social media provides a rich source of information. It is notoriously noisy, but also interesting because of its fast pace, informal nature and the vast amount of available data. The creative language use found on social media introduces many difficulties for existing natural language processing tools. One way to overcome these issues is to "normalize" this data to a more canonical register before processing it. In this task we focus on lexical normalization, which means that replacements are done on the word level.
social ppl r troublesome → Social people are troublesome
Previous work on normalization has mostly been monolingual, where a wide variety of approaches, datasets, and evaluation metrics were used. For this shared task we combined existing datasets with new ones, and converted them to the same format.
We use a tab-separated format, with pre-tokenized data. Each word is on its own line, and sentence boundaries are indicated with an empty line. The normalization is displayed in the second column:
social	Social
ppl	people
r	are
troublesome	troublesome
Some of the languages include annotation for word splits and merges. When a word is split, the normalization column includes a white-space character, and for a merge the normalization is only included for the first word:
if	If
i	i
have	have
a	a
head	headache
ache	
tomorro	tomorrow
ima	i'm going to
be	be
pissed	pissed
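For illustration, a minimal Python sketch of a reader for this format (the function name and the handling of the empty merge column are our own assumptions about how one might load the files; the official repository may provide its own reader):

```python
def read_norm_file(path):
    """Read a tab-separated normalization file into a list of sentences.

    Each sentence is a list of (original, normalization) pairs. An empty
    normalization column (as for "ache" above) means the word is merged
    into the previous one; a normalization containing a space (as for
    "ima") means the word is split.
    """
    sentences, current = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:  # empty line marks a sentence boundary
                if current:
                    sentences.append(current)
                    current = []
                continue
            parts = line.split("\t")
            orig = parts[0]
            norm = parts[1] if len(parts) > 1 else ""
            current.append((orig, norm))
    if current:  # file may not end with an empty line
        sentences.append(current)
    return sentences
```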
We use Error Reduction Rate (ERR), which is word-level accuracy normalized for the number of replacements in the dataset (van der Goot, 2019). The formula for ERR is:
ERR = (TP − FP)/(TP + FN)
Where TP, FP, TN, FN are defined as follows:
- TN: Annotators did not normalize, system did not normalize
- FP: Annotators did not normalize, system normalized
- FN: Annotators normalized, but the system did not find the correct normalization. This could be because it kept the original word, or proposed a wrong candidate.
- TP: Annotators normalized, system normalized correctly
Note that a word which should be normalized, but is normalized to the wrong candidate, only counts as a FN. Every input token contributes exactly one count to TP, FP, TN, or FN. It should also be noted that our evaluation script is case-sensitive, even though some of the datasets do not include capitalization corrections. For a more in-depth discussion of the evaluation of normalization, and ERR in particular, we refer to Chapter 5 of Normalization and Parsing Algorithms for Uncertain Input (van der Goot, 2019).
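To make the definitions concrete, here is a small sketch of how ERR could be computed from aligned original, gold, and predicted words (illustrative only; the official evaluation script remains the reference implementation):

```python
def err(origs, golds, preds):
    """Error Reduction Rate: (TP - FP) / (TP + FN), computed case-sensitively."""
    tp = fp = tn = fn = 0
    for orig, gold, pred in zip(origs, golds, preds):
        if gold == orig:          # annotators did not normalize
            if pred == orig:
                tn += 1
            else:
                fp += 1
        else:                     # annotators normalized
            if pred == gold:
                tp += 1
            else:                 # kept the original word or proposed a wrong candidate
                fn += 1
    return (tp - fp) / (tp + fn)
```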
For the final ranking, we use the macro-average ERR over all languages. It should be noted that the official winner will be the highest-ranking open-source system.
We provide two simple baselines: leave-as-is (LAI), which keeps every word unchanged, and most-frequent-replacement (MFR), which replaces each word with its most frequent replacement in the training data (see the MFR-ERR column in the table below, and the sketch after this paragraph).
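A minimal sketch of the most-frequent-replacement idea (our own illustration, not the official baseline script), assuming the training data has been loaded into (original, normalization) pairs as above:

```python
from collections import Counter, defaultdict

def train_mfr(pairs):
    """Count replacements per word and keep the most frequent one."""
    counts = defaultdict(Counter)
    for orig, norm in pairs:
        counts[orig][norm] += 1
    return {orig: c.most_common(1)[0][0] for orig, c in counts.items()}

def apply_mfr(mfr, words):
    """Replace each word by its most frequent replacement; leave unseen words untouched."""
    return [mfr.get(word, word) for word in words]
```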
Our dataset consists of the following languages (two of them are language pairs, which include code-switched data):
| Language | Data from | Original source | Size (#words) | 1-n/n-1 | Caps | %normed | MFR-ERR |
|---|---|---|---|---|---|---|---|
| Croatian | Twitter | Ljubešić et al., 2017 [bib] | 75,276 | - | + | 8.98 | 35.41 |
| Danish | Twitter/Arto | Plank et al., 2020 [bib] | 11,816 | + | + | 8.66 | 41.69 |
| Dutch | Twitter/sms/forum | Schuur, 2020 [bib] | 23,053 | + | + | 26.49 | 29.97 |
| English | Twitter | Baldwin et al., 2015 [bib] | 73,806 | + | - | 6.90 | 61.88 |
| German | Twitter | Sidarenka et al., 2013 [bib] | 25,157 | + | + | 8.90 | 60.00 |
| Indonesian-English | Twitter | Barik et al., 2019 [bib] | 23,124 | + | - | 12.16 | 62.91 |
| Italian | Twitter | van der Goot et al., 2020 [bib] | 14,641 | + | + | 7.36 | 15.90 |
| Serbian | Twitter | Ljubešić et al., 2017 [bib] | 91,738 | - | + | 7.73 | 43.86 |
| Slovenian | Twitter | Erjavec et al., 2017 [bib] | 75,276 | - | + | 15.66 | 54.34 |
| Spanish | Twitter | Alegria et al., 2013 [bib] | 13,827 | - | - | 7.69 | 21.33 |
| Turkish | Twitter | Çolakoğlu et al., 2019 [bib] | 7,949 | - | + | 36.60 | 15.38 |
| Turkish-German | Twitter | van der Goot & Çetinoğlu [bib] | 16,546 | + | + | 24.25 | 15.59 |
The 1-n/n-1 column indicates whether words are split and/or merged in the annotation, and the Caps column indicates whether capitalization is corrected. It should be noted that there are differences in annotation guidelines as well as in the filtering criteria used during the creation of these datasets, which might hinder cross-lingual learning. We attempted to resolve some of these annotation differences automatically, and some of them manually, but did not have the resources for a full re-annotation.
The current version of the data can be downloaded here. We encourage all the participants to notify the organizers of any disagreements with the annotation. We will take these into account, and improve the dataset until the 1st of June 2021. Please forward such cases to multilexnorm@gmail.com .
We also created word embeddings (skip-gram) trained on large amounts of Twitter data, as well as word unigram and bigram counts. This data is available from: https://robvanderg.github.io/blog/twit_embeds.htm
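If the embeddings are distributed in the standard word2vec format (an assumption; please check the page above for the exact file names and format), they can be loaded with gensim, for example:

```python
from gensim.models import KeyedVectors

# "twitter.skipgram.txt" is a placeholder; use the actual file name from the download page.
vectors = KeyedVectors.load_word2vec_format("twitter.skipgram.txt", binary=False)
print(vectors.most_similar("ppl", topn=5))
```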
Update 25-08-2021: Test data is now available: https://bitbucket.org/robvanderg/multilexnorm/src/master/test-eval/, and the online evaluation platform is online: https://competitions.codalab.org/competitions/34355
The main contact point for the shared task is https://groups.google.com/u/2/g/multilexnorm
Less public matters can be communicated to multilexnorm@gmail.com