Crowdsourced labels from Amazon Mechanical Turk have a 20-30% disagreement rate because annotators interpret instructions differently

You create a sentiment labeling task on Amazon Mechanical Turk: 'Label this product review as positive, negative, or neutral.' You pay $0.05 per label and collect 3 labels per review for majority voting. The result: on 30% of reviews, the 3 annotators disagree. 'The product is okay but the shipping was terrible': is that positive (the product is okay), negative (the shipping was terrible), or neutral (mixed)? Each annotator has a defensible interpretation. You add more detailed guidelines: 'Focus on the product, not the shipping.' Disagreement drops to 20%. You add examples. Now 15%. You can never get below 10-15%, because natural language is genuinely ambiguous: there is no single correct label for the edge cases.

So what? If your training data has 15-20% label noise, your model's theoretical accuracy ceiling is 80-85%. You cannot train a 95%-accurate classifier on 85%-accurate labels. Worse, you do not know which labels are wrong, because the disagreements are randomly distributed across the dataset. Cleaning the data requires expert review of every disagreed-upon example, which costs 5-10x more than the initial labeling. Most teams accept the noise and wonder why their model plateaus at 80% accuracy.

Why does this persist? Crowdworkers are paid pennies per task, spend 5-10 seconds per label, and have no domain expertise, so they optimize for speed, not quality. Quality-control mechanisms (gold questions, agreement scores, qualification tests) help but cannot solve the fundamental ambiguity of natural language. Expert labeling is 10-50x more expensive ($1-5 per label vs. $0.05-0.20). The economics of ML demand large datasets, which demand cheap labeling, which demands crowdworkers, which introduces noise.
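To make the mechanics concrete, here is a minimal sketch of the majority-voting setup described above. The review IDs and labels are hypothetical; the sketch resolves each review by strict majority vote and measures the disagreement rate (the fraction of reviews where the annotators do not all agree):

```python
from collections import Counter

# Hypothetical annotations: 3 crowdworker labels per review, per the setup above.
annotations = {
    "rev-001": ["positive", "positive", "neutral"],
    "rev-002": ["positive", "negative", "neutral"],   # one vote each: no majority
    "rev-003": ["negative", "negative", "negative"],  # unanimous
}

def majority_label(labels):
    """Return the label chosen by a strict majority, or None if there is none."""
    top, count = Counter(labels).most_common(1)[0]
    return top if count > len(labels) / 2 else None

resolved = {rid: majority_label(labs) for rid, labs in annotations.items()}
needs_review = [rid for rid, lab in resolved.items() if lab is None]

# Disagreement = not all annotators gave the same label.
disagreement_rate = sum(len(set(labs)) > 1 for labs in annotations.values()) / len(annotations)

print(resolved)           # {'rev-001': 'positive', 'rev-002': None, 'rev-003': 'negative'}
print(needs_review)       # ['rev-002']
print(disagreement_rate)  # ~0.67 on this toy sample; the post reports ~30% in practice
```

Note that with 3 annotators and 3 classes, a review can end up with no majority at all (one vote each), which is exactly the kind of edge case that needs expert adjudication.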
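The accuracy-ceiling claim can be checked with a quick simulation. Assuming the label noise is random and uncorrelated with the model's errors (an assumption, consistent with "randomly distributed" above), even a hypothetically perfect model measured against 15%-noisy labels scores about 85%:

```python
import random

random.seed(0)
LABELS = ["positive", "negative", "neutral"]
NOISE_RATE = 0.15  # assumed, matching the 15-20% noise figure above

# 100k examples with known true labels; corrupt a random 15% of the
# "crowdsourced" labels by swapping in a different class.
true = [random.choice(LABELS) for _ in range(100_000)]
noisy = [
    random.choice([l for l in LABELS if l != t]) if random.random() < NOISE_RATE else t
    for t in true
]

# A hypothetical perfect model predicts the true label every time, yet
# measured against the noisy labels its accuracy caps out near 1 - NOISE_RATE.
measured = sum(p == y for p, y in zip(true, noisy)) / len(true)
print(f"perfect model, measured accuracy: {measured:.3f}")  # ~0.850
```

The training-time effect is messier (models can partially average out random noise), but the evaluation ceiling alone explains the plateau: you cannot measure 95% accuracy against 85%-accurate labels.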

Evidence

Snow et al. (2008), 'Cheap and Fast — But is it Good?', showed that AMT labels are roughly 80-85% as good as expert labels. Krippendorff's alpha for typical AMT tasks runs 0.4-0.7 (1.0 = perfect agreement), while the academic standard for 'good' agreement, alpha > 0.8, is rarely achieved on subjective NLP tasks. Median AMT worker pay is $2-6/hour (Hara et al. 2018). Scale AI and Surge AI charge 5-20x more than AMT for higher quality, but still report 5-10% disagreement on subjective tasks.
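For reference, Krippendorff's alpha can be computed with the third-party `krippendorff` Python package (a tooling choice assumed here, not something the post specifies); the reliability matrix below is hypothetical:

```python
import numpy as np
import krippendorff  # third-party: pip install krippendorff

# Hypothetical reliability matrix: one row per annotator, one column per review.
# Values code the labels (0=positive, 1=negative, 2=neutral); np.nan = not rated.
reliability_data = np.array([
    [0,      1, 2, 0, 1, np.nan],
    [0,      1, 0, 0, 2, 1],
    [np.nan, 1, 2, 0, 1, 1],
])

alpha = krippendorff.alpha(
    reliability_data=reliability_data,
    level_of_measurement="nominal",  # sentiment classes are unordered categories
)
print(f"Krippendorff's alpha: {alpha:.2f}")
```

On the post's numbers, an alpha of 0.4-0.7 signals exactly the weak-to-moderate agreement that the alpha > 0.8 standard is meant to rule out.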
