Geolocation Prediction in Twitter

With ever-increasing numbers of people interacting with social media, social data has become a gold mine of insights into the people, opinions and events of the world. Perhaps the greatest insights come when that data is partitioned into meaningful sub-populations, and one of the most obvious dimensions for doing so is geography. In many social platforms, however, geographical information is missing, incomplete or inaccessible. This greatly restricts the utility of social data for location-related applications such as regional sentiment analysis, local event detection, and geographically-bounded marketing and advertising. This shared task focuses on predicting geographical location (i.e., geotagging) from Twitter text data. The task offers a benchmark dataset for comparing different geotagging methods, and also sheds light on how to extend geotagging from social media to more general domains.

Shared Task

Task Specifications
The shared task is presented as a multiclass classification problem: you will be given a list of mutually exclusive classes (e.g. metropolitan city centres). You will also be given training/dev data based on this class representation. Your goal is to predict the class label for each item in the test dataset.
The shared task will be carried out on two levels:
Task Settings
User-level (each user has a unique class label)
Training: 1 million users
Dev/Test: 10,000 users each

Message-level (each tweet has a unique class label)
Training: tweets from 1 million users (in the user-level setting)
Dev/Test: 10,000 tweets each (different from user-level dev and test data)

Datasets and Evaluation Metric

Submission Specifications

The shared task will focus on English tweets. For both the user- and message-level tasks, you will be provided with compressed public Tweet JSON data sourced from the Twitter streaming API. Due to Twitter's terms of service, we can only provide tweet IDs, so you will need to register a Twitter developer account and download the data yourself. Downloader scripts will be provided.

Note: Author and co-author information must accompany each submission. An author can join only one team, and each team can submit a maximum of 3 results per level. A team can have a maximum of 5 co-authors.

Evaluation
  • Classification accuracy
  • Median error distance
  • Mean error distance
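
As a rough illustration of how the three metrics above could be computed, here is a minimal sketch. It rests on assumptions not stated in this description: each class is represented by a (latitude, longitude) centre, error distance is the great-circle (haversine) distance in kilometres between the centres of the gold and predicted classes, and the names `haversine_km`, `evaluate` and `centres` are hypothetical. The official scorer may differ.

```python
import math
from statistics import mean, median

# Assumption: mean Earth radius in kilometres, used for great-circle distance.
EARTH_RADIUS_KM = 6371.0


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def evaluate(gold, predicted, centres):
    """Return accuracy, median and mean error distance for class predictions.

    gold, predicted: equal-length lists of class labels.
    centres: hypothetical mapping from class label to a (lat, lon) centre.
    """
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and equal length")
    pairs = list(zip(gold, predicted))
    accuracy = sum(g == p for g, p in pairs) / len(pairs)
    # Assumption: error is measured between class centres, so a correct
    # prediction contributes a distance of zero.
    errors = [haversine_km(centres[g], centres[p]) for g, p in pairs]
    return {
        "accuracy": accuracy,
        "median_km": median(errors),
        "mean_km": mean(errors),
    }
```

Accuracy rewards only exact class matches, while the two distance metrics give partial credit to predictions that land in a nearby class. The median is less sensitive than the mean to a few very distant errors.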

Timeline

All dates are based on 11:59 PM Pacific Standard Time.
  • Release of training/dev data: 15 August 2016
  • Release of test data: 15 September 2016
  • Submission of runs: 17 September 2016
  • Shared task results and gold labels for test data: 18 September 2016
  • System description papers due: 04 October 2016
  • Notification: 18 October 2016
  • Camera ready due: 30 October 2016
  • Workshop date: Sunday, 11 December 2016

Submission

All submissions should conform to COLING 2016 style guidelines. Please remove author information from your papers. However, since this is a system description paper, you don't need to make the references totally anonymous if you are describing previously published work that is highly related. The page limit is the same as for the main workshop, 8 pages + 2 references, though you don't need to fill this, and four pages is fine if that's enough to describe your work.

Please submit your papers at https://www.softconf.com/coling2016/WNUT/, and select the track Geolocation Shared Task Papers.

Shared Task Organizers