Insight Article · 5 min read

How to Scale NLP Text Labeling Without Hiring In-House

Outsource NLP labeling tasks: sentiment, entity extraction, intent classification. Tools, team structure, quality metrics, cost comparison.


What Are NLP Labeling Tasks and Why Are They Hard to Scale?

NLP (natural language processing) labelling teaches models to understand text. Common tasks include:

  • Sentiment analysis: "Is this review positive, negative, or neutral?"
  • Named entity recognition (NER): "Find all people, organisations, and locations in this text."
  • Intent classification: "Is this customer query about billing, returns, or general feedback?"
  • Topic modelling: "Which category does this article belong to?"
  • Semantic similarity: "Do these two sentences mean the same thing?"

Why is it hard to scale? Volume is the first reason. A single NLP project can require 10,000–100,000+ labelled sentences. Human labellers are slow. A single labeller can annotate 200–400 sentences per day, depending on task complexity. To complete 100,000 sentences in 8 weeks, you'd need 6–10 full-time labellers. Second, consistency is hard. Language is ambiguous. One labeller might mark a sentence as "neutral" while another marks it "slightly positive." Maintaining agreement across a team is labour-intensive. Third, domain expertise matters. Labelling medical text requires someone who understands healthcare terminology.
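The sizing arithmetic above is easy to reproduce as a back-of-envelope calculation. The sketch below is illustrative (the function name and the 5-day working week are our assumptions, not from the article):

```python
import math

def labellers_needed(total_items: int, weeks: int, items_per_day: int,
                     days_per_week: int = 5) -> int:
    """Estimate full-time labellers required to finish a labelling job on time.

    Assumes each labeller sustains `items_per_day` on every working day.
    """
    working_days = weeks * days_per_week
    daily_target = total_items / working_days
    return math.ceil(daily_target / items_per_day)

# 100,000 sentences in 8 weeks (40 working days -> 2,500 sentences/day):
team_fast = labellers_needed(100_000, weeks=8, items_per_day=400)  # 7 labellers
team_slow = labellers_needed(100_000, weeks=8, items_per_day=250)  # 10 labellers
```

At the faster end of the 200–400 sentences/day range this gives roughly 7–10 labellers, in line with the 6–10 estimate above.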

Real-World Example: E-Commerce Reviews

A UK e-commerce platform needed 50,000 product reviews labelled for sentiment (positive, negative, neutral) and aspect categories (product quality, shipping, customer service). They hired 3 in-house labellers. After 6 weeks, inter-rater agreement (Cohen's Kappa) was only 0.71—below the 0.80 threshold. Cost: £12,000 in labour. Outsourcing the same 50,000 reviews to Kenya took 4 weeks, cost £2,500, and achieved Kappa 0.89.

NLP Task Types and Quality Metrics

Different NLP tasks have different complexities and quality measurement methods.

Comparison

Task Type | Complexity | UK Cost per 1,000 | Kenya Cost per 1,000
Sentiment (3-way: pos/neg/neutral) | Low | £60 | £15
Sentiment (5-way: very pos to very neg) | Medium | £120 | £30
Named Entity Recognition (NER) | High | £200 | £50
Intent Classification (5–10 categories) | Low | £80 | £20
Topic Classification (20+ categories) | Medium | £150 | £40
Semantic Similarity (0–5 scale) | High | £180 | £45

Understanding Quality Metrics

  • Cohen's Kappa measures agreement between two annotators on categorical tasks, corrected for chance agreement. Range: −1 to 1 (0 = chance-level); above 0.80 is good. Example: 200 sentences with both raters agreeing on 180 gives Kappa ≈ 0.80 (the exact value depends on the label distribution).
  • F1 Score measures precision and recall on token-level tasks (like NER). Range: 0–1; above 0.85 is strong. Example: on an NER task, correctly finding 85% of entities with few false positives gives F1 ≈ 0.88.
  • Spearman Correlation measures agreement on ordinal scales. Range: −1 to 1; above 0.70 is acceptable. Example: two raters scoring semantic similarity on a 1–5 scale correlate at 0.75.
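Cohen's Kappa is straightforward to compute yourself. Here is a minimal plain-Python sketch (scikit-learn's `cohen_kappa_score` gives the same result if you prefer a library):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's Kappa: observed agreement corrected for chance agreement."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # Observed agreement: fraction of items where both raters chose the same label.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected (chance) agreement from each rater's marginal label distribution.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(counts_a[label] * counts_b[label]
              for label in counts_a.keys() | counts_b.keys()) / (n * n)
    return (p_o - p_e) / (1 - p_e)

labels_a = ["pos", "pos", "neg", "neu", "pos", "neg"]
labels_b = ["pos", "neu", "neg", "neu", "pos", "neg"]
kappa = cohens_kappa(labels_a, labels_b)  # 5/6 raw agreement -> kappa = 0.75
```

Note that raw agreement (5 of 6 here, i.e. 83%) overstates quality; Kappa discounts the agreement the two raters would reach by chance alone.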

Annotation Guidelines and Consistency

Consistency is built on clear guidelines. A vague guideline produces inconsistent labels. Here's how to create strong guidelines:

The 5-Part Guideline Framework

Part 1: Definition. What does each label mean? Example: "Positive sentiment = the reviewer recommends the product or expresses satisfaction." Not: "Positive = they said anything good." Be precise.

Part 2: Examples. Provide 3–5 real examples per label. Example: ✓ "Perfect fit, excellent quality" = Positive. ✓ "Good but expensive" = Mixed (use this only if your scheme includes a Mixed category). ✓ "Arrived damaged" = Negative.

Part 3: Edge Cases. What about ambiguous examples? Example: "It's okay" = Neutral (not positive, even though "okay" sounds acceptable). "Better than expected" = Positive (exceeds baseline). Spell this out.

Part 4: Context Rules. Are there any special cases? Example: "If a review mentions that a refund was granted, the reviewer is usually satisfied despite the initial complaint. Mark as Positive." This helps labellers make judgment calls consistently.

Part 5: Forbidden Patterns. What labels should NEVER apply? Example: "Never label a review as Positive just because the reviewer is polite. Mark based on actual product satisfaction." Prevents common mistakes.
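Guidelines are easier to enforce when they are machine-readable, so a tool or onboarding script can surface them. A sketch of the five-part framework as a data structure (the schema is our own invention, not a standard):

```python
# Hypothetical guideline entry following the 5-part framework above.
POSITIVE_GUIDELINE = {
    "label": "Positive",
    "definition": "The reviewer recommends the product or expresses satisfaction.",
    "examples": ["Perfect fit, excellent quality", "Better than expected"],
    "edge_cases": {
        "It's okay": "Neutral, not Positive",
        "Better than expected": "Positive (exceeds baseline)",
    },
    "context_rules": [
        "A granted refund usually signals satisfaction despite the initial complaint.",
    ],
    "forbidden": [
        "Never mark Positive just because the reviewer is polite.",
    ],
}

REQUIRED_PARTS = {"definition", "examples", "edge_cases", "context_rules", "forbidden"}

def validate_guideline(entry: dict) -> bool:
    """Check that a guideline entry covers all five framework parts."""
    return REQUIRED_PARTS <= entry.keys()
```

Running `validate_guideline` over every label before calibration catches guidelines that skip edge cases or forbidden patterns, the two parts teams most often omit.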

Example: NER Guideline for Medical Text

  • Define: PERSON (actual person, not hypothetical), ORGANISATION (hospital, pharma company, university), CONDITION (disease, symptom), MEDICATION (drug name, brand name, abbreviation).
  • Example: "John Smith was diagnosed with diabetes and prescribed Metformin" → PERSON: John Smith; CONDITION: diabetes; MEDICATION: Metformin; ORGANISATION: (none).
  • Edge case: if the patient is referred to only as "the patient" or "he/she", do NOT label as PERSON. This prevents over-labelling pronouns.
  • Result: Kappa 0.86+ across 10 labellers.
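Under the hood, most annotation tools store NER labels as character spans over the raw text. A minimal sketch of that representation for the example sentence (the tuple format is illustrative, not any specific tool's schema; offsets are end-exclusive and verified by slicing):

```python
text = "John Smith was diagnosed with diabetes and prescribed Metformin"

# (start, end, label) with end-exclusive character offsets into `text`.
spans = [
    (0, 10, "PERSON"),       # text[0:10]  == "John Smith"
    (30, 38, "CONDITION"),   # text[30:38] == "diabetes"
    (54, 63, "MEDICATION"),  # text[54:63] == "Metformin"
]

def extract(text: str, spans):
    """Return the surface string for each span, for sanity-checking offsets."""
    return [(text[start:end], label) for start, end, label in spans]
```

A quick `extract` pass over outsourced batches catches off-by-one offsets before they poison the training data.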

Tools: Prodigy, Doccano, Label Studio

Your choice of tool affects speed, cost, and ease of implementation.

Comparison

Tool | Best For | Key Features | Cost
Prodigy | Active learning, rapid iteration | Weak supervision, smart ranking, easy API | £50–300/mo
Doccano | Open-source, budget teams | Free, lightweight, simple UI | £0
Label Studio | Flexible labelling workflows | Multi-type support (text, image, audio), templates | £100–500/mo
Labelbox | Enterprise-scale projects | Managed labellers, QA, analytics | £500–2,000+/mo

For UK teams outsourcing to Kenya, we recommend Doccano (free, self-hosted) or Prodigy (simple API, shared access). Deploy on an AWS EC2 instance or similar server that both the UK and Kenya teams can access. This gives you control and transparency.

Team Structure and Scaling

A typical NLP labelling project requires coordination across several roles:

Comparison

Role | Responsibility | UK Annual Cost
NLP Project Lead | Guideline creation, quality audits, vendor management, stakeholder updates | £30,000–£40,000
NLP Labellers (team of 5) | Tag text per guidelines, flag ambiguities, log questions | £40,000–£60,000 (5 × £8–12k)
QA Reviewer (0.5 FTE) | Agreement measurement, spot audits, retraining, edge case resolution | £15,000–£20,000
Domain Expert (ad hoc) | Guideline refinement, difficult cases, validation | £10,000–£15,000

Total annual cost (UK in-house, 5 labellers): £95,000–£135,000. Total annual cost (Kenya outsourced): £21,000–£33,500. Saving: 70–80%.

Quality Control: Cohesion, Calibration, and Audits

Consistency doesn't happen by accident. It requires structured QA.

Three-Tier QA Framework

Tier 1: Pre-Work Calibration. Before labelling production data, have 3–5 labellers independently label 50–100 calibration samples. Measure Kappa. If below 0.80, discuss discrepancies, refine guidelines, retry. Once everyone reaches 0.80+, they start production work.

Tier 2: Ongoing Inter-Rater Agreement (IRA) Measurement. Once a week, have a random 3–4% of labelled samples re-scored by a second labeller (or by the QA reviewer). Measure Kappa. If it drops below 0.80, investigate why. Is the guideline unclear? Did a labeller drift? Retrain as needed.

Tier 3: Spot-Check Audits. Every 2 weeks, the QA reviewer manually reviews 2–3% of labelled data. Check for common errors (misinterpreted labels, skipped sentences, copy-paste mistakes). Log feedback and retrain specific labellers if needed.
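The tier-2 sampling step can be sketched in a few lines (the 3.5% rate and the fixed seed are illustrative choices, not requirements):

```python
import random

def weekly_qa_sample(item_ids, rate=0.035, seed=None):
    """Draw a random ~3-4% subset of labelled items for second-pass re-scoring."""
    rng = random.Random(seed)  # seeding makes the audit sample reproducible
    k = max(1, round(len(item_ids) * rate))
    return sorted(rng.sample(list(item_ids), k))

# Week 1: 50,000 labelled items -> 1,750 re-scored at a 3.5% rate.
to_rescore = weekly_qa_sample(range(50_000), rate=0.035, seed=1)
```

Compute Kappa on the re-scored subset each week; if it drops below 0.80, escalate per the tier-2 steps above.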

Example: 50k Sentiment Label Audit

50,000 sentences, 3-way sentiment (pos/neg/neutral), 5 labellers.

  • Calibration: 100 samples labelled independently by all 5; Kappa 0.84 (good).
  • Week 1 audit (1,500 samples re-scored): Kappa 0.86 (stable).
  • Week 3 audit: Kappa 0.81 (slight drift; guidelines clarified).
  • Final audit (5,000 samples, 10% quality check): Kappa 0.83 overall.

Result: high-quality training data.

Key takeaways

  • NLP labelling tasks (sentiment, NER, intent) require 10,000–100,000+ labels; outsourcing cuts labour costs by 70–80%.
  • Quality depends on clear guidelines (5 parts: definition, examples, edge cases, context rules, forbidden patterns), not just rater selection.
  • Quality metrics: Cohen's Kappa (agreement on categories, target > 0.80), F1 score (token-level tasks, target > 0.85), Spearman correlation (ordinal scales, target > 0.70).
  • Tool selection: Doccano (free, self-hosted) or Prodigy (active learning) work well for remote teams; deploy on a shared server for UK + Kenya access.
  • Team structure: 1 project lead, 5 labellers, 0.5 QA reviewer = £21–33k/year in Kenya vs. £95–135k/year in the UK.
  • QA requires calibration (pre-work agreement checks), ongoing IRA measurement (weekly samples), and spot audits (bi-weekly manual review).


Written by

Treba Research

Treba editorial team — expert analysis on outsourcing, compliance, and building distributed UK–Kenya teams.



Scale Your NLP Labelling Today: 50,000+ sentences monthly.

Domain expertise in healthcare, legal, ecommerce, and finance. Kappa 0.85+ guaranteed.