Insight Article · 5 min read

How to Scale NLP Text Labeling Without Hiring In-House

Outsource NLP labeling tasks: sentiment, entity extraction, intent classification. Tools, team structure, quality metrics, cost comparison.


What Are NLP Labeling Tasks and Why Are They Hard to Scale?

NLP (natural language processing) labelling teaches models to understand text. Common tasks include:

  • Sentiment analysis: "Is this review positive, negative, or neutral?"
  • Named entity recognition (NER): "Find all people, organisations, and locations in this text."
  • Intent classification: "Is this customer query about billing, returns, or general feedback?"
  • Topic modelling: "Which category does this article belong to?"
  • Semantic similarity: "Do these two sentences mean the same thing?"

Why is it hard to scale? Volume is the first reason. A single NLP project can require 10,000–100,000+ labelled sentences. Human labellers are slow. A single labeller can annotate 200–400 sentences per day, depending on task complexity. To complete 100,000 sentences in 8 weeks, you'd need 6–10 full-time labellers. Second, consistency is hard. Language is ambiguous. One labeller might mark a sentence as "neutral" while another marks it "slightly positive." Maintaining agreement across a team is labour-intensive. Third, domain expertise matters. Labelling medical text requires someone who understands healthcare terminology.
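The sizing arithmetic above is easy to reproduce as a back-of-envelope calculation. The sketch below is illustrative (the function name and the 5-day working week are our assumptions, not from the article):

```python
import math

def labellers_needed(total_items: int, weeks: int, items_per_day: int,
                     days_per_week: int = 5) -> int:
    """Estimate full-time labellers required to finish a labelling job on time.

    Assumes each labeller sustains `items_per_day` on every working day.
    """
    working_days = weeks * days_per_week
    daily_target = total_items / working_days
    return math.ceil(daily_target / items_per_day)

# 100,000 sentences in 8 weeks (40 working days -> 2,500 sentences/day):
team_fast = labellers_needed(100_000, weeks=8, items_per_day=400)  # 7 labellers
team_slow = labellers_needed(100_000, weeks=8, items_per_day=250)  # 10 labellers
```

At the faster end of the 200–400 sentences/day range this gives roughly 7–10 labellers, in line with the 6–10 estimate above.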

Real-World Example: E-Commerce Reviews

A UK e-commerce platform needed 50,000 product reviews labelled for sentiment (positive, negative, neutral) and aspect categories (product quality, shipping, customer service). They hired 3 in-house labellers. After 6 weeks, inter-rater agreement (Cohen's Kappa) was only 0.71—below the 0.80 threshold. Cost: £12,000 in labour. Outsourcing the same 50,000 reviews to Kenya took 4 weeks, cost £2,500, and achieved Kappa 0.89.

NLP Task Types and Quality Metrics

Different NLP tasks have different complexities and quality measurement methods.

Comparison

Task Type | Complexity | UK Cost per 1,000 | Kenya Cost per 1,000
Sentiment (3-way: pos/neg/neutral) | Low | £60 | £15
Sentiment (5-way: very pos to very neg) | Medium | £120 | £30
Named Entity Recognition (NER) | High | £200 | £50
Intent Classification (5–10 categories) | Low | £80 | £20
Topic Classification (20+ categories) | Medium | £150 | £40
Semantic Similarity (0–5 scale) | High | £180 | £45

Understanding Quality Metrics

  • Cohen's Kappa measures agreement between two annotators on categorical tasks, corrected for chance agreement. Range: −1 to 1 (0 = chance-level); above 0.80 is good. Example: 200 sentences with both raters agreeing on 180 gives Kappa ≈ 0.80 (the exact value depends on the label distribution).
  • F1 Score measures precision and recall on token-level tasks (like NER). Range: 0–1; above 0.85 is strong. Example: on an NER task, correctly finding 85% of entities with few false positives gives F1 ≈ 0.88.
  • Spearman Correlation measures agreement on ordinal scales. Range: −1 to 1; above 0.70 is acceptable. Example: two raters scoring semantic similarity on a 1–5 scale correlate at 0.75.
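Cohen's Kappa is straightforward to compute yourself. Here is a minimal plain-Python sketch (scikit-learn's `cohen_kappa_score` gives the same result if you prefer a library):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's Kappa: observed agreement corrected for chance agreement."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # Observed agreement: fraction of items where both raters chose the same label.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected (chance) agreement from each rater's marginal label distribution.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(counts_a[label] * counts_b[label]
              for label in counts_a.keys() | counts_b.keys()) / (n * n)
    return (p_o - p_e) / (1 - p_e)

labels_a = ["pos", "pos", "neg", "neu", "pos", "neg"]
labels_b = ["pos", "neu", "neg", "neu", "pos", "neg"]
kappa = cohens_kappa(labels_a, labels_b)  # 5/6 raw agreement -> kappa = 0.75
```

Note that raw agreement (5 of 6 here, i.e. 83%) overstates quality; Kappa discounts the agreement the two raters would reach by chance alone.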

Annotation Guidelines and Consistency

Consistency is built on clear guidelines. A vague guideline produces inconsistent labels. Here's how to create strong guidelines:

The 5-Part Guideline Framework

Part 1: Definition. What does each label mean? Example: "Positive sentiment = the reviewer recommends the product or expresses satisfaction." Not: "Positive = they said anything good." Be precise.

Part 2: Examples. Provide 3–5 real examples per label. Example: ✓ "Perfect fit, excellent quality" = Positive. ✓ "Good but expensive" = Mixed (use this only if your scheme includes a Mixed category). ✓ "Arrived damaged" = Negative.

Part 3: Edge Cases. What about ambiguous examples? Example: "It's okay" = Neutral (not positive, even though "okay" sounds acceptable). "Better than expected" = Positive (exceeds baseline). Spell this out.

Part 4: Context Rules. Are there any special cases? Example: "If a review mentions that a refund was granted, the reviewer is usually satisfied despite the initial complaint. Mark as Positive." This helps labellers make judgment calls consistently.

Part 5: Forbidden Patterns. What labels should NEVER apply? Example: "Never label a review as Positive just because the reviewer is polite. Mark based on actual product satisfaction." Prevents common mistakes.
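Guidelines are easier to enforce when they are machine-readable, so a tool or onboarding script can surface them. A sketch of the five-part framework as a data structure (the schema is our own invention, not a standard):

```python
# Hypothetical guideline entry following the 5-part framework above.
POSITIVE_GUIDELINE = {
    "label": "Positive",
    "definition": "The reviewer recommends the product or expresses satisfaction.",
    "examples": ["Perfect fit, excellent quality", "Better than expected"],
    "edge_cases": {
        "It's okay": "Neutral, not Positive",
        "Better than expected": "Positive (exceeds baseline)",
    },
    "context_rules": [
        "A granted refund usually signals satisfaction despite the initial complaint.",
    ],
    "forbidden": [
        "Never mark Positive just because the reviewer is polite.",
    ],
}

REQUIRED_PARTS = {"definition", "examples", "edge_cases", "context_rules", "forbidden"}

def validate_guideline(entry: dict) -> bool:
    """Check that a guideline entry covers all five framework parts."""
    return REQUIRED_PARTS <= entry.keys()
```

Running `validate_guideline` over every label before calibration catches guidelines that skip edge cases or forbidden patterns, the two parts teams most often omit.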

Example: NER Guideline for Medical Text

  • Define: PERSON (actual person, not hypothetical), ORGANISATION (hospital, pharma company, university), CONDITION (disease, symptom), MEDICATION (drug name, brand name, abbreviation).
  • Example: "John Smith was diagnosed with diabetes and prescribed Metformin" → PERSON: John Smith; CONDITION: diabetes; MEDICATION: Metformin; ORGANISATION: (none).
  • Edge case: if the patient is referred to only as "the patient" or "he/she", do NOT label as PERSON. This prevents over-labelling pronouns.
  • Result: Kappa 0.86+ across 10 labellers.
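Under the hood, most annotation tools store NER labels as character spans over the raw text. A minimal sketch of that representation for the example sentence (the tuple format is illustrative, not any specific tool's schema; offsets are end-exclusive and verified by slicing):

```python
text = "John Smith was diagnosed with diabetes and prescribed Metformin"

# (start, end, label) with end-exclusive character offsets into `text`.
spans = [
    (0, 10, "PERSON"),       # text[0:10]  == "John Smith"
    (30, 38, "CONDITION"),   # text[30:38] == "diabetes"
    (54, 63, "MEDICATION"),  # text[54:63] == "Metformin"
]

def extract(text: str, spans):
    """Return the surface string for each span, for sanity-checking offsets."""
    return [(text[start:end], label) for start, end, label in spans]
```

A quick `extract` pass over outsourced batches catches off-by-one offsets before they poison the training data.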

Tools: Prodigy, Doccano, Label Studio

Your choice of tool affects speed, cost, and ease of implementation.

Comparison

Tool | Best For | Key Features | Cost
Prodigy | Active learning, rapid iteration | Weak supervision, smart ranking, easy API | £50–300/mo
Doccano | Open-source, budget teams | Free, lightweight, simple UI | £0
Label Studio | Flexible labelling workflows | Multi-type support (text, image, audio), templates | £100–500/mo
Labelbox | Enterprise-scale projects | Managed labellers, QA, analytics | £500–2,000+/mo

For UK teams outsourcing to Kenya, we recommend Doccano (free, self-hosted) or Prodigy (simple API, shared access). Deploy on an AWS EC2 instance or similar server that both the UK and Kenya teams can access. This gives you control and transparency.

Team Structure and Scaling

A typical NLP labelling project requires coordination across several roles:

Comparison

Role | Responsibility | UK Annual Cost
NLP Project Lead | Guideline creation, quality audits, vendor management, stakeholder updates | £30,000–£40,000
NLP Labellers (team of 5) | Tag text per guidelines, flag ambiguities, log questions | £40,000–£60,000 (5 × £8–12k)
QA Reviewer (0.5 FTE) | Agreement measurement, spot audits, retraining, edge case resolution | £15,000–£20,000
Domain Expert (ad hoc) | Guideline refinement, difficult cases, validation | £10,000–£15,000

Total annual cost (UK in-house, 5 labellers): £95,000–£135,000. Total annual cost (Kenya outsourced): £21,000–£33,500. Saving: 70–80%.

Quality Control: Cohesion, Calibration, and Audits

Consistency doesn't happen by accident. It requires structured QA.

Three-Tier QA Framework

Tier 1: Pre-Work Calibration. Before labelling production data, have 3–5 labellers independently label 50–100 calibration samples. Measure Kappa. If below 0.80, discuss discrepancies, refine guidelines, retry. Once everyone reaches 0.80+, they start production work.

Tier 2: Ongoing Inter-Rater Agreement (IRA) Measurement. Once a week, have a random 3–4% of labelled samples re-scored by a second labeller (or by the QA reviewer). Measure Kappa. If it drops below 0.80, investigate why. Is the guideline unclear? Did a labeller drift? Retrain as needed.

Tier 3: Spot-Check Audits. Every 2 weeks, the QA reviewer manually reviews 2–3% of labelled data. Check for common errors (misinterpreted labels, skipped sentences, copy-paste mistakes). Log feedback and retrain specific labellers if needed.
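The tier-2 sampling step can be sketched in a few lines (the 3.5% rate and the fixed seed are illustrative choices, not requirements):

```python
import random

def weekly_qa_sample(item_ids, rate=0.035, seed=None):
    """Draw a random ~3-4% subset of labelled items for second-pass re-scoring."""
    rng = random.Random(seed)  # seeding makes the audit sample reproducible
    k = max(1, round(len(item_ids) * rate))
    return sorted(rng.sample(list(item_ids), k))

# Week 1: 50,000 labelled items -> 1,750 re-scored at a 3.5% rate.
to_rescore = weekly_qa_sample(range(50_000), rate=0.035, seed=1)
```

Compute Kappa on the re-scored subset each week; if it drops below 0.80, escalate per the tier-2 steps above.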

Example: 50k Sentiment Label Audit

50,000 sentences, 3-way sentiment (pos/neg/neutral), 5 labellers.

  • Calibration: 100 samples labelled independently by all 5; Kappa 0.84 (good).
  • Week 1 audit (1,500 samples re-scored): Kappa 0.86 (stable).
  • Week 3 audit: Kappa 0.81 (slight drift; guidelines clarified).
  • Final audit (5,000 samples, 10% quality check): Kappa 0.83 overall.

Result: high-quality training data.

Key takeaways

  • NLP labelling tasks (sentiment, NER, intent) require 10,000–100,000+ labels; outsourcing cuts labour costs by 70–80%.
  • Quality depends on clear guidelines (5 parts: definition, examples, edge cases, context rules, forbidden patterns), not just rater selection.
  • Quality metrics: Cohen's Kappa (agreement on categories, target > 0.80), F1 score (token-level tasks, target > 0.85), Spearman correlation (ordinal scales, target > 0.70).
  • Tool selection: Doccano (free, self-hosted) or Prodigy (active learning) work well for remote teams; deploy on a shared server for UK + Kenya access.
  • Team structure: 1 project lead, 5 labellers, 0.5 QA reviewer = £21–33k/year in Kenya vs. £95–135k/year in the UK.
  • QA requires calibration (pre-work agreement checks), ongoing IRA measurement (weekly samples), and spot audits (bi-weekly manual review).


Written by

Treba Research

Treba editorial team — expert analysis on outsourcing, compliance, and building distributed UK–Kenya teams.



Scale Your NLP Labelling Today: 50,000+ sentences monthly.

Domain expertise in healthcare, legal, ecommerce, and finance. Kappa 0.85+ guaranteed.