W-NUT 2026 Shared Task: Multi-lingual Lexical Normalization

MultiLexNorm2026

In our MultiLexNorm2026 task, we emphasize non-Indo-European languages, such as Thai, Vietnamese, Korean, Japanese and Indonesian. Participants are asked to develop a system that performs lexical normalization: the conversion of non-canonical texts to their canonical equivalent form. In particular, this task includes data from 17 languages.

Registration: Form
Communication: Discord
Data & Code: GitHub, Demo
Leaderboard: CodaBench

Timeline

Date	Event
June 10th	Test data (Final phase)
June 20th	Final Evaluation
July 11th	Paper deadline
August 8th	Paper Acceptance
August 15th	Camera ready
October 28th	Workshop

Lexical Normalization

Social media provides a rich source of information. It is notoriously noisy, but also interesting because of its fast pace, informal nature and the vast amount of available data. The creative language use found on social media introduces many difficulties for existing natural language processing tools. One way to overcome these issues is to `normalize' this data to a more canonical register before processing it. In this task we focus on lexical normalization, which means that replacements are done on the word level.

social  ppl     r    troublesome
Social  people  are  troublesome

Previous work on normalization has mostly been mono-lingual, where a wide variety of approaches, datasets, and evaluation metrics where used. For this shared task we combined existing datasets with new ones, and converted them to the same format.

Data Format

We use a tab-separated format, with pre-tokenized data. Each word is on one line, and sentence boundaries are indicated with an empty line. The normalization is displayed in the first column:

social          Social
ppl             people
r               are
troublesome     troublesome

Some of the languages include annotation for word splits and merges. When a word is split, the normalization column includes a white-space character, and with a merge the normalization is only included for the first word:

if              If
i               i
have            have
a               a
head            headache
ache
tomorro         tomorrow
ima             i'm going to
be              be
pissed          pissed

We encourage all the participants to notify the organizers of any disagreements with the annotation. We will take these into account, and improve the dataset. Please forward any such cases to the communication channel or to an organizer.

Evaluation Metric

We use Error Reduction Rate (ERR), which is word-level accuracy normalized for the number of replacements in the dataset (van der Goot, 2019). The formula for ERR is:

ERR = (TP - FP) / (TP + FN)

Where TP, FP, TN, FN are defined as follows:

TN = Annotators did not normalize, system did not normalize
FP = Annotators did not normalize, system normalized
FN = Annotators normalized, but system did not find the correct normalization.
     This could be because it kept the original word, or proposed a wrong candidate.
TP = Annotators normalized, systems normalized correctly

Note that a word which should be normalized, but is normalized to the wrong candidate, is only a FN. Every input token represents exactly one point in TP, FP, TN, FN. It should also be noted that our evaluation script is case-sensitive, even though some of the datasets do not include capitalization corrections. For a more in depth discussion about evaluation of normalization and ERR in particular we refer to Chapter 5 of Normalization and Parsing Algorithms for Uncertain Input.

Final Ranking

Final rankings are computed using macro-average ERR, weighted at 50% for new languages and 50% for original languages, resulting in two winners:

One winner averaged across the newly added languages
One overall winner averaged across all languages

Note: The official winner will be the highest-ranking open-source system.

Baselines

We provide two simple baselines:

Leave-As-Is (LAI): Simply uses the input as output. Word level accuracy would be equal to the % of words that are not normalized, and ERR is 0.0.
Most-frequent-Replacement (MFR): Uses the most frequent replacement based on the training data. If the input word is not present in the training data, it returns the input word.

Languages

Our dataset consists of the following languages (two of them are language-pairs, these include code-switched data). Please refer to the Multinorm++ paper for more details.

Language	Data from	Size (#words)	1-n/n-1	Caps	%normed	MFR-ERR	Original Source
Added new languages
Indonesian	Instagram	48,716	-	+	47.47	59.75	Kurnia & Yulianti, 2020 [bib]
Japanese	Twitter	95,416	+	-	7.03	6.32	Tomoyuki, Risa, and Naoki
Korean	dcinside	16,577	+	-	7.54	6.35	Yumin Kim, Jimin Lee, and Hwanhee Lee
Thai	Twitter	200,915	+	-	3.99	42.77	Limkonchotiwat et al. [bib]
Vietnamese	Facebook/TikTok	128,685	+	-	15.98	75.77	Nguyen, Le, & Nguyen, 2024 [bib]
Original languages
Croatian	Twitter	89,052	-	+	8.16	41.53	Ljubešić et al, 2017 [bib]
Danish	Twitter/Arto	20,206	+	+	9.09	49.68	Plank et al, 2020 [bib]
Dutch	Twitter/sms/forum	21,657	+	+	28.84	39.39	Schuur, 2020 [bib]
English	Twitter	73,806	+	-	7.62	66.57	Baldwin et al, 2015 [bib]
German	Twitter	24,948	+	+	17.39	34.35	Sidarenka et al, 2013 [bib]
Indonesian-English	Twitter	23,124	+	-	13.93	61.51	Barik et al, 2019 [bib]
Italian	Twitter	14,641	+	+	7.01	16.83	van der Goot et al, 2020 [bib]
Serbian	Twitter	91,738	-	+	7.88	45.19	Ljubešić et al, 2017 [bib]
Slovenian	Twitter	75,276	-	+	14.93	58.70	Erjavec et al, 2017 [bib]
Spanish	Twitter	13,824	-	-	7.48	25.57	Alegria et al, 2013 [bib]
Turkish	Twitter	8,082	-	+	36.83	14.53	Çolakoğlu et al, 2019 [bib]
Turkish-German	Twitter	16,508	+	+	25.59	22.09	van der Goot & Çetinoglu [bib]

The 1-n/n-1 column indicates whether words are split and or merged in the annotation, and the caps column indicates whether capitalization is corrected. It should be noted that there were annotation guidelines differences as well as filtering criteria during creation of these datasets, which might hinder cross-lingual learning. This is already quite visible in the "1-n/n-1" and "Caps" columns — there are still some differences. We attempted to converge some of these annotation differences automatically, and some of them manually, but did not have the resources to do a full re-annotation. In general we follow the guidelines from the 2021 shared task.

Some known peculiarities in the data include: 1) Japanese, insertion is annotated (which is not included in the other languages); 2) for Korean, the data is sampled so that each word is unique, this leads to an MFR baseline score of 0.0.

You can use any pre-trained models for the shared task. Additional data is allowed, except for lexicon-normalized datasets in the target languages. If you want to use additional target language lexicon-normalization data, you must share the data with other teams. We can help by adding it to the shared task.