In our MultiLexNorm2 task, we emphasize non-Indo-European languages such as Thai, Vietnamese, and Indonesian. Participants are asked to develop a system that performs lexical normalization: the conversion of non-canonical texts to their canonical form. In total, the task includes data from 15 languages.
Social media provides a rich source of information. It is notoriously noisy, but also interesting because of its fast pace, informal nature, and the vast amount of available data. The creative language use found on social media introduces many difficulties for existing natural language processing tools. One way to overcome these issues is to "normalize" this data to a more canonical register before processing it. In this task we focus on lexical normalization, which means that replacements are done at the word level. For example:
```
social ppl r troublesome
Social people are troublesome
```
Previous work on normalization has mostly been monolingual, using a wide variety of approaches, datasets, and evaluation metrics. For this shared task, we combined existing datasets with new ones and converted them to a common format.
We use a tab-separated format with pre-tokenized data. Each word is on its own line, and sentence boundaries are indicated by an empty line. The original word is in the first column and its normalization in the second:

```
social	Social
ppl	people
r	are
troublesome	troublesome
```
Some of the languages include annotation for word splits and merges. When a word is split, the normalization column contains a white-space character; when words are merged, the normalization is only included for the first word and the second column is left empty for the rest:

```
if	If
i	i
have	have
a	a
head	headache
ache	
tomorro	tomorrow
ima	i'm going to
be	be
pissed	pissed
```
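To make the format concrete, here is a minimal reading sketch in Python. The `read_norm_file` helper is hypothetical, assuming exactly the two-column layout above, with an empty second column for merged continuation words:

```python
def read_norm_file(path):
    """Read a tab-separated normalization file into sentences.

    Each sentence is a list of (original, normalization) pairs;
    sentences are separated by empty lines. Merged continuation
    words carry an empty string in the normalization column.
    """
    sentences, current = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:  # empty line = sentence boundary
                if current:
                    sentences.append(current)
                    current = []
            else:
                orig, norm = line.split("\t")
                current.append((orig, norm))
    if current:  # file may not end with an empty line
        sentences.append(current)
    return sentences
```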
We encourage all participants to notify the organizers of any disagreements with the annotation. We will take these into account and improve the dataset until January 1, 2025. Please forward such cases to multilexnorm2@gmail.com.
We use Error Reduction Rate (ERR), which is word-level accuracy normalized for the number of replacements in the dataset (van der Goot, 2019). The formula for ERR is:
ERR = (TP − FP)/(TP + FN)
where TP, FP, TN, and FN are defined as follows:

- TN: annotators did not normalize, system did not normalize
- FP: annotators did not normalize, system normalized
- FN: annotators normalized, but the system did not find the correct normalization, either because it kept the original word or because it proposed a wrong candidate
- TP: annotators normalized, system normalized correctly
Note that a word which should be normalized but is normalized to the wrong candidate counts only as an FN. Every input token contributes to exactly one of TP, FP, TN, or FN. It should also be noted that our evaluation script is case-sensitive, even though some of the datasets do not include capitalization corrections. For a more in-depth discussion of the evaluation of normalization, and ERR in particular, we refer to Chapter 5 of Normalization and Parsing Algorithms for Uncertain Input.
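For illustration, here is a minimal sketch of the metric under the definitions above. The `err` function is hypothetical; it assumes the gold and predicted normalizations are already aligned with the input tokens, and compares strings case-sensitively, as the evaluation script does:

```python
def err(originals, golds, preds):
    """Error Reduction Rate: (TP - FP) / (TP + FN).

    originals: the raw input tokens
    golds: the annotators' normalizations
    preds: the system's normalizations
    """
    tp = fp = fn = 0
    for orig, gold, pred in zip(originals, golds, preds):
        if gold == orig:       # annotators did not normalize
            if pred != orig:
                fp += 1        # system normalized anyway (FP)
        else:                  # annotators normalized
            if pred == gold:
                tp += 1        # correct normalization found (TP)
            else:
                fn += 1        # kept original or wrong candidate (FN)
    return (tp - fp) / (tp + fn)
```

For example, `err(["ppl"], ["people"], ["people"])` returns 1.0, while a system that leaves everything as-is scores 0.0 by construction.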
The final ranking is determined using the macro-average ERR, with two distinct winners:
We provide two simple baselines: leave-as-is (LAI), which never changes the input and therefore has an ERR of 0 by construction, and most frequent replacement (MFR), which replaces each word with its most frequent normalization in the training data (its scores are listed as MFR-ERR in the table below).
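A sketch of what the MFR baseline could look like, assuming sentences in the (original, normalization) pair format produced by a reader like the one above (function names are hypothetical):

```python
from collections import Counter, defaultdict

def train_mfr(train_sentences):
    """Count normalizations per word and keep the most frequent one."""
    counts = defaultdict(Counter)
    for sentence in train_sentences:
        for orig, norm in sentence:
            counts[orig][norm] += 1
    return {orig: c.most_common(1)[0][0] for orig, c in counts.items()}

def apply_mfr(replacements, tokens):
    """Replace each token by its most frequent replacement;
    unseen words are left as-is."""
    return [replacements.get(tok, tok) for tok in tokens]
```

Because the normalization column is treated as an opaque string, splits such as "i'm going to" come along for free; merged words (with their empty continuation columns) would need special handling that this sketch omits.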
Our dataset consists of the following languages (two of them are language pairs; these include code-switched data):
| Language | Data from | Original Source | Size (#words) | 1-n/n-1 | Caps | %normed | MFR-ERR |
|---|---|---|---|---|---|---|---|
| *New languages* | | | | | | | |
| Thai | | Limkonchotiwat et al. [bib] | 3,380,879 | + | - | 4.83 | 51.19 |
| Vietnamese | Facebook/TikTok | Nguyen, Le, & Nguyen, 2024 [bib] | 96,322 | - | - | 16.08 | 73.95 |
| Indonesian | | Kurnia & Yulianti, 2020 [bib] | 48,716 | - | + | 47.47 | 58.94 |
| *Original languages* | | | | | | | |
| Croatian | | Ljubešić et al., 2017 [bib] | 75,276 | - | + | 8.98 | 35.41 |
| Danish | Twitter/Arto | Plank et al., 2020 [bib] | 11,816 | + | + | 8.66 | 41.69 |
| Dutch | Twitter/sms/forum | Schuur, 2020 [bib] | 23,053 | + | + | 26.49 | 29.97 |
| English | | Baldwin et al., 2015 [bib] | 73,806 | + | - | 6.90 | 61.88 |
| German | | Sidarenka et al., 2013 [bib] | 25,157 | + | + | 8.90 | 60.00 |
| Indonesian-English | | Barik et al., 2019 [bib] | 23,124 | + | - | 12.16 | 62.91 |
| Italian | | van der Goot et al., 2020 [bib] | 14,641 | + | + | 7.36 | 15.90 |
| Serbian | | Ljubešić et al., 2017 [bib] | 91,738 | - | + | 7.73 | 43.86 |
| Slovenian | | Erjavec et al., 2017 [bib] | 75,276 | - | + | 15.66 | 54.34 |
| Spanish | | Alegria et al., 2013 [bib] | 13,827 | - | - | 7.69 | 21.33 |
| Turkish | | Çolakoğlu et al., 2019 [bib] | 7,949 | - | + | 36.60 | 15.38 |
| Turkish-German | | van der Goot & Çetinoğlu [bib] | 16,546 | + | + | 24.25 | 15.59 |
The 1-n/n-1 column indicates whether words are split and/or merged in the annotation, and the Caps column indicates whether capitalization is corrected. It should be noted that the datasets were created with differing annotation guidelines and filtering criteria, which might hinder cross-lingual learning. We harmonized some of these annotation differences automatically and some manually, but did not have the resources for a full re-annotation.
| Event | Date |
|---|---|
| | Nov 15, 2024 |
| | Jan 07, 2025 |
| | Jan 25, 2025 |
| | Feb 07, 2025 |
| | Feb 25, 2025 |
| | Mar 01, 2025 |
| | Mar 10, 2025 |
| | May 03, 2025 (TBD) |
The main contact point for the shared task is Discord.
Less public matters can be communicated to multilexnorm2@gmail.com.