W-NUT 2020 Shared Task: Entity and Relation Recognition over Wet Lab Protocols

Entity and Relation Recognition over Wet Lab Protocols

For this task, participants are asked to develop systems that automatically identify actions, entities, and their relations from lab instructions. See here for a live demo of the task.

Data is released on June 8, 2020!

Official evaluation for the Named Entity Recognition Task will be between Aug 31 and Sep 4, 2020.
Official evaluation for the Relation Extraction Task will be between Sep 9 and Sep 15, 2020.
(register here to participate).

There is a mailing list for future announcements.

Intro

Lab protocols specify steps in performing a lab procedure. They are noisy, dense, and domain-specific. Automatic or semi-automatic conversion of protocols into machine-readable format benefits biological research. In this task, system entries are invited for the identification of events and relations in these lab protocols.

      How to Make a 0.5M TCEP Stock Solution
      Weigh 5.73 g of TCEP.
      Add 35 ml of cold molecular biology grade water to the vial, and dissolve the TCEP.
      This resulting solution is very acidic, with an approximate pH of 2.5.

For WNUT-2020, wet lab data has been annotated for events and relations using BRAT. Submissions should be made in BRAT format. Lab instructions come from protocols.io. An example paper working with this noisy and dense format is N18-2016.

Task Organizers

Jeniya Tabassum (Ohio State University)
Wei Xu (Georgia Tech)
Alan Ritter (Georgia Tech)

Important Dates

Data available: June 8, 2020
NER Evaluation window: Aug 31 - Sep 4
RE Evaluation window: Sep 9 - Sep 15
System description papers submitted: Sep 22
Papers reviewed: Oct 2
Papers camera ready: Oct 9
Workshop day: November 19

Data format

CONLL format:

Protocols are represented in CONLL format. In this format, each line of the file has the following form:

<word>+"\t"+<NE>

The end of sentence is marked with an empty line.
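
As a minimal sketch of reading this format, the function below groups tab-separated lines into sentences, splitting on empty lines. The name `read_conll` is hypothetical, and the tag strings in the sample (such as `B-Action`) are placeholders chosen for illustration; the actual tag inventory should be taken from the released data.

```python
from typing import Iterable, List, Tuple

Sentence = List[Tuple[str, str]]


def read_conll(lines: Iterable[str]) -> List[Sentence]:
    """Group tab-separated (word, tag) lines into sentences split on empty lines."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # An empty line marks the end of the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        # The tag is the last tab-separated field on the line.
        word, tag = line.rsplit("\t", 1)
        current.append((word, tag))
    if current:
        # The input may not end with an empty line.
        sentences.append(current)
    return sentences


# Placeholder tags, for illustration only.
sample = "Weigh\tB-Action\nTCEP\tB-Reagent\n\nAdd\tB-Action\n"
sentences = read_conll(sample.split("\n"))
print(len(sentences))  # 2
```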

The standoff format:

In the standoff format, each text document in the dataset is accompanied by a corresponding annotation file. The two are associated by a simple file naming convention, wherein their base name (file name without the file extension) is the same: for example, the file protocol_30.ann contains annotations for the file protocol_30.txt.
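
The naming convention can be applied mechanically. The sketch below pairs file names by base name; `pair_by_base_name` is a hypothetical helper, and it assumes that a text file without a matching annotation file should simply be skipped.

```python
from pathlib import PurePath
from typing import Dict, Iterable, Tuple


def pair_by_base_name(file_names: Iterable[str]) -> Dict[str, Tuple[str, str]]:
    """Map each base name to its (.txt, .ann) pair, skipping unpaired files."""
    texts: Dict[str, str] = {}
    anns: Dict[str, str] = {}
    for name in file_names:
        path = PurePath(name)
        if path.suffix == ".txt":
            texts[path.stem] = name
        elif path.suffix == ".ann":
            anns[path.stem] = name
    return {base: (texts[base], anns[base]) for base in sorted(texts) if base in anns}


names = ["protocol_30.txt", "protocol_30.ann", "protocol_31.txt"]
pairs = pair_by_base_name(names)
# protocol_31 is left out because it has no .ann file in this list.
print(pairs)
```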

Within the document, individual annotations are connected to specific spans of text through character offsets. For example, in a document beginning "Weigh 5.73 g of TCEP." the text "Weigh" is identified by the offset range 0..5. (All offsets are indexed from 0 and include the character at the start offset but exclude the character at the end offset.)
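
This half-open convention matches Python string slicing, so a span can be recovered directly from its offsets. The snippet below is a small illustration on the example sentence; `span_text` is a hypothetical helper, and it assumes the document has been decoded to a string so that offsets count characters rather than bytes.

```python
def span_text(document: str, start: int, end: int) -> str:
    """Return the text covered by the half-open offset range [start, end)."""
    return document[start:end]


text = "Weigh 5.73 g of TCEP."

weigh = span_text(text, 0, 5)
tcep = span_text(text, 16, 20)
print(weigh)  # Weigh
print(tcep)   # TCEP

# With an exclusive end offset, the span length is end - start.
assert len(weigh) == 5 - 0
```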

Text file:

Text files are expected to have the file extension .txt and contain the text of the original protocol input into the system.

The protocol texts are stored in plain text files encoded using UTF-8 (an extension of ASCII; plain ASCII texts work too). The protocol texts contain newlines, each line indicating a single step in the protocol. The first line is always the protocol's name/title, as shown in the example above.
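
A short sketch of splitting a protocol text into its title and steps under the layout just described. The helper name `split_protocol` is hypothetical; dropping blank lines is an assumption, and a real file would be read with `encoding="utf-8"`.

```python
from typing import List, Tuple


def split_protocol(text: str) -> Tuple[str, List[str]]:
    """Return (title, steps): the first line is the title, each later line a step."""
    lines = text.split("\n")
    title = lines[0]
    # Blank lines (for example a trailing newline) are not treated as steps.
    steps = [line for line in lines[1:] if line.strip()]
    return title, steps


protocol = (
    "How to Make a 0.5M TCEP Stock Solution\n"
    "Weigh 5.73 g of TCEP.\n"
    "Add 35 ml of cold molecular biology grade water to the vial, and dissolve the TCEP.\n"
)
title, steps = split_protocol(protocol)
print(title)
print(len(steps))  # 2
```

Note that character offsets in the annotation file refer to the whole document, so any per-line processing needs to keep track of where each line starts.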

Annotation file:

Annotations are stored in files with the .ann file extension. The various annotation types that may be contained in these files are discussed in the following.

All annotations follow the same basic structure: Each line contains one annotation, and each annotation is given an ID that appears first on the line, separated from the rest of the annotation by a single TAB character. The rest of the structure varies by annotation type.

Examples of annotations for an entity "Reagent" (T1), an event trigger "Action" (T2), an event (E1) and a relation (R2) are shown below.

T1 Reagent 111 130 fresh tissue sample
T2 Action 199 204 Weigh
E1 Action:T2 Acts-on:T4
T3 Amount 219 227 50-100mg
T4 Reagent 228 235 tissues
T5 Action 289 294 mince
E2 Action:T5 Acts-on:T6 Using:T10 Site:T8
T6 Reagent 299 305 tissue
T7 Modifier 311 328 very small pieces
R2 Mod-Link Arg1:E2 Arg2:T7

Our annotated dataset uses only three types of annotation, distinguished by the first letter of the annotation ID:

T: text-bound annotation
R: relation
E: event
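
Because the kind of annotation is encoded in the first letter of its ID, the lines of an annotation file can be sorted into kinds before any further parsing. The following is a sketch only: `group_by_kind` and the `KINDS` table are hypothetical names, and raising an error on any other prefix is an assumption based on the statement that only these three kinds occur.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

# ID prefix -> annotation kind, as listed above.
KINDS = {"T": "text-bound", "R": "relation", "E": "event"}


def group_by_kind(ann_lines: Iterable[str]) -> Dict[str, List[str]]:
    """Group annotation lines by the first letter of their ID."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for raw in ann_lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        # The ID is everything before the first TAB.
        ann_id = line.split("\t", 1)[0]
        kind = KINDS.get(ann_id[:1])
        if kind is None:
            raise ValueError(f"unexpected annotation ID: {ann_id!r}")
        groups[kind].append(line)
    return dict(groups)


ann_lines = [
    "T2\tAction 199 204\tWeigh",
    "E1\tAction:T2 Acts-on:T4",
    "T4\tReagent 228 235\ttissues",
    "R2\tMod-Link Arg1:E2 Arg2:T7",
]
groups = group_by_kind(ann_lines)
print({kind: len(lines) for kind, lines in groups.items()})
```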

Detailed descriptions of each of these types of annotations are given below.

Text bound annotations:

Text-bound annotations are an important category of annotation related to both entity and event annotations. Text-bound annotation identifies a specific span of text and assigns it a type.

T1 Reagent 111 130 fresh tissue sample
T2 Action 199 204 Weigh

All text-bound annotations follow the same structure. As in all annotations, the ID occurs first and is delimited from the rest of the line with a TAB character. The primary annotation is given as a SPACE-separated triple (type, start-offset, end-offset). The start-offset is the index of the first character of the annotated span in the text (".txt" file), i.e. the number of characters in the document preceding it. The end-offset is the index of the first character after the annotated span. Thus, the character in the end-offset position is not included in the annotated span. For reference, the text spanned by the annotation is included, separated by a TAB character.
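
The structure described above can be parsed with two splits: one on TAB and one on SPACE. This is a minimal sketch, not the organizers' parser; `TextBound` and `parse_text_bound` are hypothetical names, and it assumes every span is a single contiguous offset pair.

```python
from typing import NamedTuple


class TextBound(NamedTuple):
    ann_id: str
    label: str
    start: int
    end: int
    text: str


def parse_text_bound(line: str) -> TextBound:
    """Parse 'ID<TAB>type start end<TAB>text' into a TextBound record."""
    # At most two splits, so the reference text is kept whole.
    ann_id, triple, text = line.rstrip("\n").split("\t", 2)
    label, start, end = triple.split(" ")
    return TextBound(ann_id, label, int(start), int(end), text)


t1 = parse_text_bound("T1\tReagent 111 130\tfresh tissue sample")
print(t1.label, t1.start, t1.end)  # Reagent 111 130

# The reference text should be exactly as long as the span it annotates.
assert t1.end - t1.start == len(t1.text)
```

Checking the reference text against the slice of the ".txt" file at the same offsets is a cheap way to catch offset errors before submitting.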

Entity annotations:

Each entity annotation has a unique ID and is defined by type (e.g. Reagent or Amount) and the span of characters containing the entity mention (represented as a "start end" offset pair).

T1 Reagent 111 130 fresh tissue sample
T3 Amount 219 227 50-100mg
T4 Reagent 228 235 tissues

Each line contains one text-bound annotation identifying the entity mention in text.

Event annotations:

Each event annotation has a unique ID and is defined by type, event trigger (the text stating the event) and arguments. We only have one type of event in our dataset, which is "Action".

T2 Action 199 204 Weigh
E1 Action:T2 Acts-on:T4

The event triggers, annotations marking the word or words stating each event, are text-bound annotations and their format is identical to that for entities. (The IDs of triggers occupy the same space as the IDs of entities, and these must not overlap.)

As for all annotations, the event ID occurs first, separated by a TAB character. The event trigger is specified as TYPE:ID and identifies the event type and its trigger through the ID. By convention, the event type is specified both in the trigger annotation and the event annotation. The event trigger is separated from the event arguments by SPACE. The event arguments are a SPACE-separated set of ROLE:ID pairs, where ROLE is one of the event- and task-specific argument roles (e.g. Acts-on, Creates, Site, etc) and the ID identifies the entity or event filling that role. Note that several events can share the same trigger and that while the event trigger should be specified first, the event arguments can appear in any order.
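
An event line can be parsed along the same lines. The sketch below uses the hypothetical names `Event` and `parse_event` and keeps the arguments as an ordered list of (role, ID) pairs, which is one possible choice given that arguments may appear in any order.

```python
from typing import List, NamedTuple, Tuple


class Event(NamedTuple):
    ann_id: str
    label: str
    trigger: str
    arguments: List[Tuple[str, str]]


def parse_event(line: str) -> Event:
    """Parse 'ID<TAB>TYPE:TRIGGER ROLE:ID ROLE:ID ...' into an Event record."""
    ann_id, body = line.rstrip("\n").split("\t", 1)
    # The trigger comes first; the remaining SPACE-separated items are arguments.
    trigger_part, *arg_parts = body.split()
    label, trigger = trigger_part.split(":", 1)
    arguments = []
    for part in arg_parts:
        role, target = part.split(":", 1)
        arguments.append((role, target))
    return Event(ann_id, label, trigger, arguments)


e2 = parse_event("E2\tAction:T5 Acts-on:T6 Using:T10 Site:T8")
print(e2.trigger)    # T5
print(e2.arguments)  # [('Acts-on', 'T6'), ('Using', 'T10'), ('Site', 'T8')]
```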

Relation annotations:

Binary relations have a unique ID and are defined by their type (e.g. Measure, Mod-Link) and their arguments. Relation arguments are always identified simply as Arg1 and Arg2.

R2 Mod-Link Arg1:E2 Arg2:T7

The format is similar to that applied for events, with the exception that the annotation does not identify a specific piece of text expressing the relation ("trigger"): the ID is separated by a TAB character, and the relation type and arguments by SPACE.
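
A relation line can be handled the same way. In this sketch, `Relation` and `parse_relation` are hypothetical names; the code relies only on the statement above that the arguments are always labelled Arg1 and Arg2.

```python
from typing import NamedTuple


class Relation(NamedTuple):
    ann_id: str
    label: str
    arg1: str
    arg2: str


def parse_relation(line: str) -> Relation:
    """Parse 'ID<TAB>TYPE Arg1:ID Arg2:ID' into a Relation record."""
    ann_id, body = line.rstrip("\n").split("\t", 1)
    label, *arg_parts = body.split()
    # Look the arguments up by name rather than by position.
    args = dict(part.split(":", 1) for part in arg_parts)
    return Relation(ann_id, label, args["Arg1"], args["Arg2"])


r2 = parse_relation("R2\tMod-Link Arg1:E2 Arg2:T7")
print(r2.label, r2.arg1, r2.arg2)  # Mod-Link E2 T7
```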

Baseline

A CRF-based system is included as the baseline for this task. Source code for the NER baseline: https://github.com/jeniyat/WNUT_2020_NER

A Maximum Entropy system is included as the baseline for the relation extraction task. Source code for the RE baseline: https://github.com/jeniyat/WNUT_2020_RE

Downloads

As with a typical NLP shared task, the training data is released ahead of time. During the evaluation period, unlabelled data is released; entrants run their systems on it and upload the output, and the score on this output is the official result. Unlike typical NLP shared tasks, we release no dev partition, but instead invite participants to cross-validate over a larger training set; this should give a more stable picture of performance during system development (P19-1267).
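
One way to set up such cross-validation is sketched below, using only the standard library. The function name `k_fold`, the choice of five folds, the fixed seed, and the protocol names are all illustrative assumptions. Splitting by whole protocol rather than by sentence is a design choice meant to keep steps from the same protocol out of both sides of a split.

```python
import random
from typing import Iterator, List, Sequence, Tuple


def k_fold(items: Sequence[str], k: int = 5, seed: int = 0) -> Iterator[Tuple[List[str], List[str]]]:
    """Yield (train, held_out) splits so that each item is held out exactly once."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    # Deal the shuffled items into k folds of near-equal size.
    folds = [shuffled[i::k] for i in range(k)]
    for i, held_out in enumerate(folds):
        train = [item for j, fold in enumerate(folds) if j != i for item in fold]
        yield train, held_out


# Hypothetical protocol names, for illustration only.
protocols = [f"protocol_{n}" for n in range(10)]
splits = list(k_fold(protocols, k=5))
for train, held_out in splits:
    # No protocol appears on both sides of a split.
    assert not set(train) & set(held_out)
print(len(splits))  # 5
```

Averaging the metric over the held-out folds then gives the development-time estimate.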

Source data: https://github.com/jeniyat/WNUT_2020_NER

Submit results

Submission of NER predictions

You need to submit your model predictions on the new protocols by September 4, 2020 (AoE), with a brief description (<= 280 characters) of your model, using the output submission form.

Submission Instructions

You are required to submit your model predictions in a zip file. The name of the zipped file must be in the following format:

    <team_name>.zip  
    [e.g., ‘OSU_NLP’ team must submit the predictions in ‘OSU_NLP.zip’]

The zipped file should contain the predictions on these 111 protocols. You can submit your predictions in either format: CONLL or standoff.

If you choose to submit predictions in standoff format, the required directory structure is shown below.

  OSU_NLP/
      ├── protocol_0623.ann
      ├── protocol_0623.txt
      ├── protocol_0624.ann
      ├── protocol_0624.txt
      ├── protocol_0625.ann
      ├── protocol_0625.txt
      ├── protocol_0626.ann
      ├── ...

If you choose to submit predictions in CONLL format, the required directory structure is shown below.

  OSU_NLP/
      ├── protocol_0623_conll.txt
      ├── protocol_0624_conll.txt
      ├── protocol_0625_conll.txt
      ├── protocol_0626_conll.txt
      ├── ...
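
The archive can be assembled with Python's standard `zipfile` module. The sketch below is one possible way to do it, not an official packaging script: `build_submission` is a hypothetical helper, and it assumes the archive should contain a single top-level folder named after the team, as the directory listings above suggest. The demo writes throwaway files to a temporary directory in place of real predictions.

```python
import tempfile
import zipfile
from pathlib import Path


def build_submission(pred_dir: Path, team_name: str, out_dir: Path) -> Path:
    """Zip every file in pred_dir under a top-level <team_name>/ folder."""
    archive = out_dir / f"{team_name}.zip"
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(pred_dir.iterdir()):
            if path.is_file():
                zf.write(path, arcname=f"{team_name}/{path.name}")
    return archive


# Demo with placeholder content; real predictions would replace these files.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    preds = root / "preds"
    preds.mkdir()
    (preds / "protocol_0623.txt").write_text("Weigh 5.73 g of TCEP.\n", encoding="utf-8")
    (preds / "protocol_0623.ann").write_text("T1\tAction 0 5\tWeigh\n", encoding="utf-8")
    archive = build_submission(preds, "OSU_NLP", root)
    archive_name = archive.name
    with zipfile.ZipFile(archive) as zf:
        names = zf.namelist()

print(archive_name)  # OSU_NLP.zip
print(names)
```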

Technical System Papers

We strongly encourage every participating team to submit a system description paper. Papers will be reviewed through a single-blind peer-review process. Accepted papers will be published in the EMNLP workshop proceedings, and the authors will be invited (though not required) to present the paper at the virtual WNUT workshop.

You may submit a paper of up to 8 pages (plus extra pages for references). The submission does not need to be anonymized; include your name and affiliation in the submitted paper. The title of the paper must follow the format: "<team_name> at WNUT 2020 Shared Task-1: <paper_title>". For example, "OSU-NLP at WNUT 2020 Shared Task-1: A Log-linear Model for Wet Lab Entity Extraction" if the team name is "OSU-NLP".

The submission should conform to EMNLP 2020 style guidelines and needs to be submitted at the SoftConf link.

The deadline to submit the system paper is Sep 22, 2020 (AoE).

You may find examples of shared-task system description papers in the previous WNUT proceedings at https://www.aclweb.org/anthology/volumes/W15-43
