EVALUATION MEASURES

Each participating team will initially have access only to the training data. Later, the unlabelled test data will also be released (see the section Important dates). After the assessment, the labels for the test data will also be released.


Toxicity detection task: This is a binary classification problem (toxic/not toxic) evaluated with the F1 measure (the harmonic mean of precision and recall) over the toxic class. The test data set is a random sample of comments, so the classes are not balanced: the toxic class is less frequent than the not toxic class. For this reason, under F1, true positives (correctly detected toxic comments) have more effect on the evaluation than false positives (not toxic comments classified as toxic).
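As an illustration of how F1 over the toxic class behaves, the following minimal sketch computes it from two label lists. The function name f1_toxic and the toy labels are hypothetical and are not part of the official evaluation script.

```python
def f1_toxic(gold, pred, positive=1):
    """F1 over the positive (toxic) class for two equal-length label lists."""
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy data (hypothetical): 1 = toxic, 0 = not toxic.
gold = [1, 0, 0, 0, 1, 0, 0, 0]
pred = [1, 0, 1, 0, 0, 0, 0, 0]
score = f1_toxic(gold, pred)

# Correctly rejected not toxic comments (true negatives) never enter the formula,
# so appending more of them leaves the score unchanged.
score_more_negatives = f1_toxic(gold + [0] * 10, pred + [0] * 10)
```

Note that true negatives do not appear in either precision or recall, which is why the measure concentrates on the less frequent toxic class.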

Toxicity level detection task: Comments must be classified into four ordered classes (0='not toxic', 1='mildly toxic', 2='toxic' and 3='very toxic'). Unlike in traditional classification problems, the relative ordering between classes is significant. The official metric for the main system ranking will be the Closeness Evaluation Metric (CEM) (Amigó et al., 2020). CEM is specifically defined for ordinal classification tasks. As with accuracy-based metrics, exact matches are rewarded. As with ranking metrics, the correct ordering of levels is rewarded. As with metrics based on mean error, larger errors are penalized to a greater extent. Because the classes in the data set are imbalanced, correct toxicity detections have more effect than incorrect labeling of not toxic comments, and the detection of highly toxic comments is especially significant.
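The sketch below illustrates one reading of CEM for ordinal classification. It assumes the definition in Amigó et al. (2020) as we understand it: the proximity between an assigned class a and a gold class b is minus the base-2 logarithm of the share of gold items lying between them (half of class a plus every class from a towards b, b included), and the score is the total proximity achieved divided by the total proximity of a perfect output. The function name cem_ord and the toy labels are hypothetical; the paper and the official evaluation script remain the reference.

```python
import math
from collections import Counter


def cem_ord(gold, pred, classes=(0, 1, 2, 3)):
    """Sketch of CEM for ordinal labels; class counts are taken from the gold labels."""
    counts = Counter(gold)
    total = len(gold)

    def proximity(a, b):
        # Gold items "between" a and b: half of class a, plus every class
        # from a towards b (b included, a excluded).
        lo, hi = min(a, b), max(a, b)
        between = sum(counts[c] for c in classes if lo <= c <= hi and c != a)
        return -math.log2((counts[a] / 2 + between) / total)

    achieved = sum(proximity(p, g) for g, p in zip(gold, pred))
    ideal = sum(proximity(g, g) for g in gold)
    return achieved / ideal


# Toy data (hypothetical): one 'very toxic' comment is mislabeled.
gold_levels = [0, 0, 0, 0, 1, 1, 2, 3]
near_miss = [0, 0, 0, 0, 1, 1, 2, 2]   # predicted 2 instead of 3
far_miss = [0, 0, 0, 0, 1, 1, 2, 0]    # predicted 0 instead of 3

cem_perfect = cem_ord(gold_levels, gold_levels)
cem_near = cem_ord(gold_levels, near_miss)
cem_far = cem_ord(gold_levels, far_miss)
```

Under these assumptions a perfect output scores 1, and confusing a very toxic comment with a not toxic one costs more than confusing it with a toxic one, because many more gold items lie between levels 0 and 3 than between levels 2 and 3.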

In addition, for the level detection task, we will provide evaluation results with Rank Biased Precision (RBP) (Moffat et al., 2008) and the Pearson coefficient. Each usage scenario of a system calls for a different metric. For instance, CEM fits a text highlighting problem. RBP is appropriate when retrieving highly toxic comments from large texts. A high linear correlation (Pearson coefficient) is necessary when comparing the average toxicity of different sources (e.g. journals, websites, etc.).
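To make the two additional metrics concrete, here is a minimal sketch of the Pearson coefficient and of RBP in its standard form, (1 - p) times the sum of gain_i * p^(i-1) over ranks i. How the organizers turn toxicity levels into a ranking and into gains is not stated here, so the helper rbp_from_levels is an assumption: it ranks comments by predicted level and uses the gold level divided by 3 as the gain. The persistence value p is likewise a hypothetical choice.

```python
import math


def pearson(x, y):
    """Pearson linear correlation between two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one list is constant
    return cov / math.sqrt(var_x * var_y)


def rbp(gains, p=0.8):
    """Rank Biased Precision for gains in [0, 1], listed in rank order."""
    return (1 - p) * sum(g * p ** i for i, g in enumerate(gains))


def rbp_from_levels(gold, pred, p=0.8, max_level=3):
    # ASSUMPTION: rank by predicted level (highest first, ties keep input order)
    # and use gold level / max_level as the gain at each rank.
    order = sorted(range(len(pred)), key=lambda i: -pred[i])
    return rbp([gold[i] / max_level for i in order], p)


# Toy data (hypothetical).
gold_levels = [0, 3, 1, 0]
pred_levels = [0, 3, 2, 0]
correlation = pearson(gold_levels, pred_levels)
rbp_score = rbp_from_levels(gold_levels, pred_levels, p=0.5)
```

RBP weights the top of the ranking most heavily, which matches the retrieval scenario, while the Pearson coefficient only asks that predicted and gold levels move together linearly.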

Both tasks will share the same system output format, consisting of comment_id/toxicitylevel_number pairs. In the first task (binary toxicity detection), any level other than 0 will be considered toxic.
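The following sketch shows how a single run could serve both tasks. The exact file layout is not specified here, so the tab-separated, one-pair-per-line layout, the function names and the comment identifiers are all assumptions made for illustration.

```python
def parse_run(text):
    """Parse 'comment_id<TAB>level' lines (assumed layout) into a dict."""
    levels = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        comment_id, level = line.split("\t")
        level = int(level)
        if level not in (0, 1, 2, 3):
            raise ValueError(f"invalid toxicity level for {comment_id}: {level}")
        levels[comment_id] = level
    return levels


def to_binary(levels):
    """Task 1 view of a run: any level other than 0 counts as toxic (1)."""
    return {comment_id: int(level != 0) for comment_id, level in levels.items()}


# Hypothetical run with four comments.
run_text = "c1\t0\nc2\t2\nc3\t3\nc4\t1\n"
run_levels = parse_run(run_text)
run_binary = to_binary(run_levels)
```

With this mapping, a team submits toxicity levels once and the binary labels for the detection task are derived from them.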

 

RESULTS

The global results considering all submitted runs are included in the following Excel files:

Next, a ranking per team (selecting their best scoring output) is shown for both subtasks:

 

RESULTS - SUBTASK 1

 

RESULTS - SUBTASK 2