Mini Project

Measuring Annotation Quality: A Mini Evaluation Project

How I used Cohen’s Kappa and F1 to diagnose disagreement and refine a pain descriptor guideline.

Sep 2025

1. Context

Annotation quality depends on consistency between annotators. Metrics make that visible. I ran a small test on a draft pain descriptor scheme to check that different people could apply the labels the same way.

2. Mini Dataset

I built a small list of pain expressions with two compatible labelling schemes: a metaphor-based layer and a clinical layer. Two annotators labelled the same items independently.

Item                         Annotator A   Annotator B
feels like burning needles   heat          heat
stabbing in the lower back   sharp         sharp
throbbing like a drum        rhythmic      pressure
dull constant ache           pressure      pressure
like fire spreading          heat          sharp

Labels shown are from the metaphor-based layer for clarity.

3. Results

Agreement

  • Cohen’s Kappa, 0.53 (moderate; see the formula below)
  • Interpretation, inconsistent boundaries for one category (rhythmic)
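
For reference, Cohen’s Kappa compares observed agreement with the agreement the two annotators would reach by chance, given their individual label distributions:

$$\kappa = \frac{p_o - p_e}{1 - p_e}$$

where p_o is the proportion of items labelled identically and p_e is the expected chance agreement. A value of 0.53 sits roughly halfway between chance-level and perfect agreement, which is why it is usually read as moderate.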

F1 per class

  • heat, 0.67
  • sharp, 0.67
  • rhythmic, 0.00
  • pressure, 0.80

Takeaway, the class with the lowest F1, rhythmic, revealed confusion about its definition; agreement improved once the definition was clarified. The quick cross-tabulation sketched below shows where the disagreements sit.
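
As a small sketch of how I locate that confusion (using only the five sample rows from the table above, so it is illustrative rather than the full analysis), a pandas cross-tabulation of the two label columns shows exactly which pairs of labels get mixed up:

```python
import pandas as pd

# Metaphor-layer labels for the five sample rows shown in the table above.
annotator_a = pd.Series(["heat", "sharp", "rhythmic", "pressure", "heat"], name="Annotator A")
annotator_b = pd.Series(["heat", "sharp", "pressure", "pressure", "sharp"], name="Annotator B")

# Rows = Annotator A, columns = Annotator B; off-diagonal cells are disagreements,
# e.g. the item A labelled rhythmic that B labelled pressure.
print(pd.crosstab(annotator_a, annotator_b))
```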

4. What I Changed

Based on these results, I rewrote the guideline definition of the lowest-scoring class, rhythmic, clarified how it differs from pressure, and ran a short calibration session with both annotators before the retest.

5. Retest

Agreement

Cohen’s Kappa, 0.81, and stable across a new batch of items.

F1 per class

rhythmic increased to 0.78. Other classes remained high.

6. 🔧 Tools I Use for Evaluation

I usually export the annotated data as a simple CSV table (one row per item, one column per annotator’s labels), then calculate agreement scores. For small pilot sets I run the calculations in Python using sklearn.metrics — for example cohen_kappa_score for inter-annotator agreement and f1_score to check per-class performance.
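
Below is a minimal sketch of that calculation, assuming the two annotators’ labels have already been pulled out of the CSV into parallel Python lists; here I simply type in the five sample rows from the mini dataset, so the printed numbers are illustrative rather than a reproduction of the scores reported above.

```python
from sklearn.metrics import cohen_kappa_score, f1_score

# Two parallel label lists, one per annotator (normally read from the exported CSV).
annotator_a = ["heat", "sharp", "rhythmic", "pressure", "heat"]
annotator_b = ["heat", "sharp", "pressure", "pressure", "sharp"]

# Chance-corrected inter-annotator agreement (symmetric, no "ground truth" needed).
kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's Kappa: {kappa:.2f}")

# Per-class F1, treating Annotator A as the reference side for illustration.
classes = ["heat", "sharp", "rhythmic", "pressure"]
scores = f1_score(annotator_a, annotator_b, labels=classes, average=None, zero_division=0)
for name, score in zip(classes, scores):
    print(f"F1 ({name}): {score:.2f}")
```

A class whose F1 collapses towards zero, as rhythmic did, is the cue to revisit its definition in the guideline.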

For larger or ongoing projects, I use the metrics built directly into the annotation platform. Tools like Prodigy, Label Studio, Doccano or LightTag include agreement reports in the interface, so you can review Cohen’s Kappa or F1 scores without leaving the tool. After checking the metrics, I run short calibration sessions and update the guidelines until agreement improves.

7. Reflection

This mini project shows how I move from disagreement to evidence and then to improvement. Metrics do not replace judgement. They make it easier to focus revision where it matters and to show stakeholders that a guideline is reliable at scale.

See the companion pieces, Annotation in Practice, Two Mini Case Studies and From Conversation Analysis to Data Annotation.