Mini Project
Measuring Annotation Quality: A Mini Evaluation Project
How I used Cohen’s Kappa and F1 to diagnose disagreement and refine a pain descriptor guideline.
1. Context
Annotation quality depends on consistency between annotators. Metrics make that visible. I ran a small test on a draft pain descriptor scheme to check that different people could apply the labels the same way.
2. Mini Dataset
I built a small list of pain expressions annotated with two compatible schemes: a metaphor-based layer and a clinical layer. Two annotators labelled the same items independently.
| Item | Annotator A | Annotator B |
| --- | --- | --- |
| feels like burning needles | heat | heat |
| stabbing in the lower back | sharp | sharp |
| throbbing like a drum | rhythmic | pressure |
| dull constant ache | pressure | pressure |
| like fire spreading | heat | sharp |
Labels shown are from the metaphor-based layer for clarity.
3. Results
Agreement
- Cohen’s Kappa: 0.53 (moderate)
- Interpretation: inconsistent boundaries for one category
F1 per class
- heat: 0.67
- sharp: 0.67
- rhythmic: 0.00
- pressure: 0.80
Takeaway: the class with very low F1 revealed confusion about its definition; agreement improved once the definition was clarified.
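To make the numbers above less abstract, here is a minimal sketch of how Cohen’s Kappa and per-class F1 fall out of a two-annotator table like the one shown. It uses only the five sample rows, so the values it prints will not match the figures from the full pilot set.

```python
from collections import Counter

# The five sample rows from the table above (metaphor-based layer only).
annotator_a = ["heat", "sharp", "rhythmic", "pressure", "heat"]
annotator_b = ["heat", "sharp", "pressure", "pressure", "sharp"]
n = len(annotator_a)

# Observed agreement: share of items where both annotators chose the same label.
p_o = sum(a == b for a, b in zip(annotator_a, annotator_b)) / n

# Expected agreement: chance of agreeing if each annotator kept their own label
# distribution but assigned labels independently of the other.
counts_a, counts_b = Counter(annotator_a), Counter(annotator_b)
labels = sorted(set(counts_a) | set(counts_b))
p_e = sum((counts_a[label] / n) * (counts_b[label] / n) for label in labels)

kappa = (p_o - p_e) / (1 - p_e)
print(f"Cohen's Kappa: {kappa:.2f}")

# Per-class F1, treating annotator A as the reference and B as the comparison.
for label in labels:
    tp = sum(a == label and b == label for a, b in zip(annotator_a, annotator_b))
    fp = sum(a != label and b == label for a, b in zip(annotator_a, annotator_b))
    fn = sum(a == label and b != label for a, b in zip(annotator_a, annotator_b))
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    print(f"F1 ({label}): {f1:.2f}")
```

An F1 of 0.00 for a class simply means the two annotators never both applied that label to the same item, which is exactly the kind of definitional confusion the takeaway describes.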
4. What I Changed
- Clarified the rhythmic definition and added examples that distinguish pulsing from intermittent ache.
- Added a short decision rule: if unsure between rhythmic and pressure, choose pressure unless there is explicit cyclical language.
- Updated the shared sheet and ran a quick calibration session.
5. Retest
Agreement
- Cohen’s Kappa: 0.81, stable across a new batch
F1 per class
- rhythmic: increased to 0.78; other classes remained high
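One simple way to verify that kind of stability is to compute Kappa separately for each batch. Below is a minimal sketch, assuming the exported rows carry a hypothetical batch field next to the two annotators’ labels (the sklearn.metrics helper is the one described in the tools section below).

```python
from collections import defaultdict

from sklearn.metrics import cohen_kappa_score

def kappa_per_batch(rows):
    """Compute Cohen's Kappa separately for each annotation batch.

    rows: dicts with hypothetical "batch", "annotator_a" and "annotator_b" keys.
    """
    batches = defaultdict(lambda: ([], []))
    for row in rows:
        labels_a, labels_b = batches[row["batch"]]
        labels_a.append(row["annotator_a"])
        labels_b.append(row["annotator_b"])
    return {batch: cohen_kappa_score(a, b) for batch, (a, b) in batches.items()}
```

A batch whose score drops noticeably below the others is a natural candidate for the next calibration session.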
6. 🔧 Tools I Use for Evaluation
I usually export the annotated data as a simple CSV table (one row per item, one column per annotator’s labels), then calculate agreement scores. For small pilot sets I run the calculations in Python using `sklearn.metrics`, for example `cohen_kappa_score` for inter-annotator agreement and `f1_score` to check per-class performance.
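Here is a minimal sketch of that workflow, assuming the export is a CSV with one labelled column per annotator; the file name and column names are placeholders rather than a fixed format.

```python
import csv

from sklearn.metrics import cohen_kappa_score, f1_score

# Placeholder file and column names; adjust to the actual export.
with open("pain_descriptors_pilot.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))

labels_a = [row["annotator_a"] for row in rows]
labels_b = [row["annotator_b"] for row in rows]

# Overall inter-annotator agreement.
kappa = cohen_kappa_score(labels_a, labels_b)
print(f"Cohen's Kappa: {kappa:.2f}")

# Per-class F1, treating one annotator as the reference.
classes = sorted(set(labels_a) | set(labels_b))
per_class = f1_score(labels_a, labels_b, labels=classes, average=None, zero_division=0)
for cls, score in zip(classes, per_class):
    print(f"F1 ({cls}): {score:.2f}")
```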
For larger or ongoing projects, I use the metrics built directly into the annotation platform. Tools like Prodigy, Label Studio, Doccano or LightTag include agreement reports in the interface, so you can review Cohen’s Kappa or F1 scores without leaving the tool. After checking the metrics, I run short calibration sessions and update the guidelines until agreement improves.
7. Reflection
This mini project shows how I move from disagreement to evidence and then to improvement. Metrics do not replace judgement. They make it easier to focus revision where it matters and to show stakeholders that a guideline is reliable at scale.
See the companion pieces: Annotation in Practice, Two Mini Case Studies and From Conversation Analysis to Data Annotation.