CORPUS

TRAINING DATA

CORPUS DESCRIPTION

We will use the NewsCom-TOX corpus as the dataset. It consists of approximately 4,357 comments posted in response to articles from Spanish online newspapers (ABC, elDiario.es, El Mundo, NIUS, etc.) and discussion forums (such as Menéame) between August 2017 and July 2020. The articles were manually selected on the basis of their controversial subject matter, their potential toxicity, and the number of comments posted (at least 50). We used a keyword-based approach to search for articles related mainly to immigration.


The comments were collected in the same order in which they appear in the comment thread on the website. The author (anonymized) and the date and time at which each comment was posted were also retrieved. The number of comments per article ranges from 65 to 359. On average, approximately 30% of the comments are toxic.


We first annotate each comment as either ‘toxic’ or ‘not toxic’, and then assign a level of toxicity (‘toxicity_level_0=not toxic’, ‘toxicity_level_1=mildly toxic’, ‘toxicity_level_2=toxic’ or ‘toxicity_level_3=very toxic’) to those first annotated as toxic. In addition to whether the comment is toxic and its level of toxicity, we also annotate the following features: argumentation, constructiveness, stance, target, stereotype, sarcasm, mockery, insult, improper language, aggressiveness and intolerance. All these features (or categories) have binary values except the toxicity level.
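
As an illustration of the annotation scheme above, the following is a minimal Python sketch of one annotated comment. The class name, the field names and the rule that a comment is toxic exactly when its level is above 0 are assumptions made for this example; the actual data format is the one defined in the task guidelines.

```python
from dataclasses import dataclass, field

# Binary features named in the corpus description.
# The snake_case identifiers are hypothetical, not the official column names.
BINARY_FEATURES = (
    "argumentation", "constructiveness", "stance", "target", "stereotype",
    "sarcasm", "mockery", "insult", "improper_language", "aggressiveness",
    "intolerance",
)

# Toxicity levels as listed in the corpus description.
TOXICITY_LEVELS = {0: "not toxic", 1: "mildly toxic", 2: "toxic", 3: "very toxic"}


@dataclass
class CommentAnnotation:
    """One annotated comment (hypothetical structure for illustration)."""
    comment_id: str
    toxic: bool
    toxicity_level: int
    features: dict = field(default_factory=dict)

    def __post_init__(self):
        if self.toxicity_level not in TOXICITY_LEVELS:
            raise ValueError(f"unknown toxicity level: {self.toxicity_level}")
        # Assumption: level 0 corresponds to 'not toxic', levels 1-3 to 'toxic'.
        if self.toxic != (self.toxicity_level > 0):
            raise ValueError("toxic flag and toxicity level are inconsistent")
        unknown = set(self.features) - set(BINARY_FEATURES)
        if unknown:
            raise ValueError(f"unknown features: {sorted(unknown)}")
        # Features that are not given default to 0; all values must be binary.
        for name in BINARY_FEATURES:
            if self.features.setdefault(name, 0) not in (0, 1):
                raise ValueError(f"feature {name!r} must be 0 or 1")


example = CommentAnnotation("c1", toxic=True, toxicity_level=2,
                            features={"insult": 1, "sarcasm": 1})
print(TOXICITY_LEVELS[example.toxicity_level], example.features["insult"])
```

Validating at construction time keeps inconsistent records, such as a non-toxic comment with a toxicity level of 3, out of the data.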

Each comment is annotated in parallel by three annotators, and an inter-annotator agreement test is carried out once all the comments on an article have been annotated. Disagreements are then discussed by the annotators and a senior annotator until agreement is reached. The annotation team consists of two expert linguists and two trained annotators, who are linguistics students.
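
The description does not say which agreement statistic is used. As one possible illustration for three parallel annotators, the sketch below computes Fleiss' kappa over the labels given to each comment. The choice of Fleiss' kappa, the function name and the input layout are assumptions made for this example.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each rated by the same number of raters.

    `ratings` is a list of per-item label lists, e.g. [[1, 1, 0], [0, 0, 0]].
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number of ratings")

    categories = sorted({label for item in ratings for label in item})
    counts = [Counter(item) for item in ratings]

    # Observed agreement: mean proportion of agreeing rater pairs per item.
    per_item = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    observed = sum(per_item) / n_items

    # Expected agreement from the overall share of each category.
    shares = [sum(c[cat] for c in counts) / (n_items * n_raters)
              for cat in categories]
    expected = sum(share ** 2 for share in shares)

    if expected == 1.0:
        return 1.0  # every rating fell into a single category
    return (observed - expected) / (1 - expected)


# Three annotators, binary toxic (1) / not toxic (0) labels per comment.
labels = [[1, 1, 1], [0, 0, 0], [1, 1, 0], [0, 0, 0]]
print(round(fleiss_kappa(labels), 3))
```

Computing the statistic per article, once all of its comments are annotated, would match the workflow described above.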

A detailed description of the data (annotation scheme, data format, etc.) will soon be available in the task guidelines.

We will provide participants with 80% of the NewsCom-TOX corpus for training their models; the remaining 20% will be used for testing.
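
How the 80/20 partition is drawn is not specified here. As a rough sketch, a reproducible random split at the comment level could look like the following. The function name, the fixed seed and the comment-level granularity (rather than, say, splitting by article) are assumptions made for this example.

```python
import random


def split_corpus(comments, train_fraction=0.8, seed=0):
    """Shuffle the comments with a fixed seed and cut them into train/test."""
    shuffled = list(comments)
    random.Random(seed).shuffle(shuffled)  # local RNG, global state untouched
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical comment identifiers standing in for the real corpus.
comment_ids = [f"comment_{i}" for i in range(100)]
train, test = split_corpus(comment_ids)
print(len(train), len(test))
```

Participants may want to apply the same idea to carve a validation set out of the training portion they receive.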

To avoid any conflict with the sources of the comments regarding their Intellectual Property Rights (IPR), the data will be sent privately to each participant interested in the task. The corpus will be available for research purposes only.