## Iterations
We decided to work in iterations because this allows us to validate and improve the annotation process and guidelines as we go. Each iteration has its own goals.
### Iteration 1
In this iteration, our goal was to validate and refine our annotation process, which we were applying for the first time. Two annotators labeled the data: the first was a volunteer, and the second was the author of the dataset. The volunteer provided useful feedback that helped us adjust the annotation process. Where the labels conflicted, the researcher's labels took precedence over the volunteer's, as the researcher had fixed some mistakes in the annotation process.
#### Inter-Rater Reliability
In this iteration, we did not run the inter-rater reliability analysis because we made changes and alignments to the process while the iteration was in progress.
Profiling Report
### Iteration 2
In the second iteration, we introduced contract workers to do the annotations. The annotators were trained by the author of the dataset, as described in the Qualified annotators section.
#### Inter-Rater Reliability
As described in the Inter-Rater Reliability section, we evaluate the reliability of the annotators using several coefficients.

We run the analysis both by treating the task as a single multi-label problem and by treating each label as a binary problem.
**Multi-Label Problem**
For the full set of toxicity labels, we calculate Krippendorff's alpha (using the MASI distance) and the Percent Agreement.
- Krippendorff's alpha: 0.1962 (slight agreement)
- Percent Agreement: 0.1877
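For reference, the snippet below is a minimal sketch of this computation using NLTK, which provides both an `AnnotationTask` implementation of Krippendorff's alpha and the MASI distance. The annotator IDs, text IDs, and label sets are toy values, not data from the project:

```python
# Hypothetical example: Krippendorff's alpha over label sets with the
# MASI distance, using NLTK. The (annotator, item, labels) triples are
# toy data for illustration only.
from nltk.metrics import masi_distance
from nltk.metrics.agreement import AnnotationTask

# Labels are frozensets because MASI compares sets of labels.
data = [
    ("ann1", "text1", frozenset({"insult", "profanity_obscene"})),
    ("ann2", "text1", frozenset({"insult"})),
    ("ann3", "text1", frozenset({"insult", "ideology"})),
    ("ann1", "text2", frozenset({"racism"})),
    ("ann2", "text2", frozenset({"racism"})),
    ("ann3", "text2", frozenset({"racism"})),
]

task = AnnotationTask(data=data, distance=masi_distance)
print(f"Krippendorff's alpha (MASI): {task.alpha():.4f}")
```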
**Binary Problem**
| Feature / metric | Percent Agreement | Krippendorff's alpha | Gwet's AC1 | Comments |
|---|---|---|---|---|
| `is_offensive` | 0.7277 | 0.0595 | 0.7750 | |
| `is_targeted` | 0.1610 | -0.1348 | -0.1029 | [1] |
| `targeted_type` | 0.0641 | 0.2461 | 0.4978 | [1] |
| `toxic_spans` | 0.1220 | 0.2709 | N/A | |
| `health` | 0.9760 | 0.0447 | 0.9837 | |
| `ideology` | 0.7647 | 0.3019 | 0.7976 | [3] |
| `insult` | 0.4713 | 0.0895 | 0.425 | [3] |
| `lgbtqphobia` | 0.9453 | 0.5583 | 0.9603 | |
| `other_lifestyle` | 0.9860 | 0.0824 | 0.9906 | |
| `physical_aspects` | 0.9463 | 0.3272 | 0.9622 | |
| `profanity_obscene` | 0.6837 | 0.0850 | 0.726 | [3] |
| `racism` | 0.9750 | 0.2564 | 0.9829 | |
| `religious_intolerance` | 1.0 | 1.0 | 1.0 | [2] |
| `sexism` | 0.8753 | 0.1721 | 0.9076 | |
| `xenophobia` | 0.9673 | 0.0732 | 0.9777 | |
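One reason to report Gwet's AC1 alongside Krippendorff's alpha is that alpha is depressed for highly skewed labels (e.g., `health` above: 0.9760 Percent Agreement but an alpha of only 0.0447), while AC1 stays close to the observed agreement. As a reference, here is a self-contained sketch of AC1 for one binary label, written from Gwet's published formula; the ratings are made-up values, not project data:

```python
# Sketch of Gwet's AC1 for one binary label, from the published formula
# (Gwet, 2008). Toy data: five texts, three annotators each,
# 1 = offensive, 0 = not offensive.
from collections import Counter

def gwet_ac1(ratings, categories=(0, 1)):
    """ratings: one list per item, containing one rating per annotator."""
    n = len(ratings)
    q = len(categories)
    # Observed agreement: average pairwise agreement per item.
    pa = sum(
        sum(c * (c - 1) for c in Counter(item).values())
        / (len(item) * (len(item) - 1))
        for item in ratings
    ) / n
    # Chance agreement from the overall category prevalences.
    pi = {
        k: sum(item.count(k) / len(item) for item in ratings) / n
        for k in categories
    }
    pe = sum(p * (1 - p) for p in pi.values()) / (q - 1)
    return (pa - pe) / (1 - pe)

print(gwet_ac1([[1, 1, 1], [0, 0, 1], [1, 1, 1], [0, 0, 0], [1, 1, 0]]))
# ~0.4872 for this toy data
```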
#### Comments
- [1] The question that originates the `is_targeted` and `targeted_type` features is optional; it should be answered only if the text is targeted. It looks like annotator 126 didn't understand this and marked everything as targeted.
- [2] We don't have any text tagged with `religious_intolerance` by our annotators.
- [3] Disregarding [1] and [2], the most inconsistent annotations are in the `ideology`, `insult`, and `profanity_obscene` labels.
#### Conclusions
One of the annotators misunderstood the annotation guidelines, which resulted in inconsistent `is_targeted` and `targeted_type` labels.

Regarding the toxicity labels, we noticed that cases in which all annotators agree on the annotation are rare, leading to a high rate of disagreement and, consequently, a low Krippendorff's alpha. The labels with the highest disagreement are `insult`, `ideology`, and `profanity_obscene`.
We will review the annotation guidelines with the annotators before the next iteration.
Profiling Report
### Iteration 3
In the third iteration, we retrained the annotators using the output of the previous iteration. One of the annotators was replaced, and their replacement was trained by the author of the dataset.
#### Inter-Rater Reliability
As described in the Inter-Rater Reliability section, we evaluate the reliability of the annotators using several coefficients.

We run the analysis both by treating the task as a single multi-label problem and by treating each label as a binary problem.
**Multi-Label Problem**
For the full set of toxicity labels, we calculate Krippendorff's alpha (using the MASI distance) and the Percent Agreement.
- Krippendorff's alpha: 0.4653 (moderate agreement)
- Percent Agreement: 0.2758
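Note that the multi-label Percent Agreement (0.2758) is much lower than most of the per-label numbers below, presumably because a text only counts as an agreement when all annotators choose exactly the same set of labels. Below is a toy sketch of that strict-set reading; this is our assumption about the computation, not the project's exact script:

```python
# Hypothetical sketch: strict percent agreement over label sets.
# An item counts as agreement only if every annotator produced
# exactly the same set of toxicity labels.
def percent_agreement(items):
    agreed = sum(
        1 for annotations in items
        if len({frozenset(a) for a in annotations}) == 1
    )
    return agreed / len(items)

items = [
    [{"insult"}, {"insult"}, {"insult"}],            # full agreement
    [{"racism"}, {"racism", "insult"}, {"racism"}],  # disagreement
]
print(percent_agreement(items))  # 0.5
```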
**Binary Problem**
| Feature / metric | Percent Agreement | Krippendorff's alpha | Gwet's AC1 | Comments |
|---|---|---|---|---|
| `is_offensive` | 0.6509 | 0.1777 | 0.6754 | |
| `is_targeted` | 0.3551 | 0.1072 | 0.1709 | |
| `targeted_type` | 0.1975 | 0.4887 | 0.6300 | |
| `toxic_spans` | 0.1757 | 0.4427 | N/A | |
| `health` | 0.9700 | 0.2641 | 0.9794 | |
| `ideology` | 0.8670 | 0.4728 | 0.8934 | |
| `insult` | 0.5488 | 0.3317 | 0.4531 | |
| `lgbtqphobia` | 0.9613 | 0.6393 | 0.9722 | |
| `other_lifestyle` | 0.9787 | 0.4683 | 0.9854 | |
| `physical_aspects` | 0.9560 | 0.4160 | 0.9691 | |
| `profanity_obscene` | 0.7089 | 0.4894 | 0.6870 | |
| `racism` | 0.9913 | 0.3781 | 0.9942 | |
| `religious_intolerance` | 1.0 | 1.0 | 1.0 | [1] |
| `sexism` | 0.9550 | 0.1566 | 0.9689 | |
| `xenophobia` | 0.9847 | 0.2980 | 0.9896 | |
#### Comments
- [1] We don't have any text tagged with `religious_intolerance` by our annotators.
#### Conclusions
In this iteration, the annotations were more consistent, which led to better agreement between the annotators: Krippendorff's alpha for the toxicity labels increased from 0.1962 to 0.4653.
Profiling Report
### Iteration 4
In the fourth iteration, we asked the annotators to label a larger number of texts following the same guidelines as the previous iterations. We set the deadline to October 4th, 2022 (± one month).
#### Inter-Rater Reliability
As described in the Inter-Rater Reliability section, we evaluate the reliability of the annotators using several coefficients.

We run the analysis both by treating the task as a single multi-label problem and by treating each label as a binary problem.
**Multi-Label Problem**
For the full set of toxicity labels, we calculate Krippendorff's alpha (using the MASI distance) and the Percent Agreement.
- Krippendorff's alpha: 0.4424 (moderate agreement)
- Percent Agreement: 0.2769
**Binary Problem**
| Feature / metric | Percent Agreement | Krippendorff's alpha | Gwet's AC1 | Comments |
|---|---|---|---|---|
| `is_offensive` | 0.5847 | 0.2174 | 0.5716 | |
| `is_targeted` | 0.4253 | 0.1825 | 0.2790 | |
| `targeted_type` | 0.2223 | 0.4840 | 0.5756 | |
| `toxic_spans` | 0.2249 | 0.4760 (MASI distance) | N/A | |
| `health` | 0.9800 | 0.1424 | 0.9865 | |
| `ideology` | 0.8531 | 0.2909 | 0.8863 | |
| `insult` | 0.4938 | 0.2923 | 0.3549 | |
| `lgbtqphobia` | 0.9550 | 0.4901 | 0.9681 | |
| `other_lifestyle` | 0.9705 | 0.2239 | 0.9798 | |
| `physical_aspects` | 0.9570 | 0.3623 | 0.9700 | |
| `profanity_obscene` | 0.7436 | 0.5530 | 0.7233 | |
| `racism` | 0.9940 | 0.2481 | 0.9960 | |
| `religious_intolerance` | 1.0 | 1.0 | 1.0 | [1] |
| `sexism` | 0.9640 | 0.1880 | 0.9753 | |
| `xenophobia` | 0.9905 | 0.3840 | 0.9936 | |
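The `toxic_spans` row uses the MASI distance because spans are naturally compared as sets. Below is a minimal sketch of one plausible reading, where each annotator's spans are flattened to the set of character offsets they marked; the offsets are invented for illustration:

```python
# Hypothetical example: comparing two annotators' toxic spans as sets
# of character offsets with the MASI distance from NLTK.
from nltk.metrics import masi_distance

span_a = set(range(10, 25))  # annotator 1 marked characters 10-24
span_b = set(range(12, 25))  # annotator 2 marked characters 12-24
print(f"MASI distance: {masi_distance(span_a, span_b):.4f}")
```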
#### Comments
- [1] We don't have any text tagged with `religious_intolerance` by our annotators in this iteration.
#### Conclusions
As in the previous iteration, we had more consistent annotations, which led to better agreement between the annotators.