What Is Search Relevance Evaluation and Why Does It Matter?
Search relevance evaluation measures whether a search engine returns useful results. A user types a query (e.g., "black leather women's shoes, size 7"). The search engine returns 10 results. A rater evaluates each result on a scale: Relevant (exactly what the user wanted), Partially Relevant (useful but not perfect), Not Relevant (wrong product). This feedback teaches the ranking algorithm what good results look like.
Why does it matter? Search is the primary way ecommerce and SaaS users navigate platforms. A 1% improvement in search relevance can drive a 2–5% increase in conversion, because users find what they want faster; a search engine with poor relevance drives users away. For internal enterprise search (legal discovery, knowledge management), relevance is critical for compliance and efficiency.
The challenge: relevance is subjective. What's "relevant" to one rater might be "partially relevant" to another. And the volume is massive. A single search index can have millions of query-result pairs. To train a ranking model, you might need 10,000–100,000+ ratings. That's 4–6 full-time raters for 12 weeks.
Real-World Impact: SaaS Product Search
A UK SaaS company providing project management software had poor search. Users typed "assign task to John" and got results about task templates, not actual tasks. They hired a search relevance team to rate 20,000 query-result pairs. NDCG improved from 0.68 to 0.75 (+10%). User search adoption increased 18%. But the rating work took 6 weeks, 4 raters, and cost £12,000. Outsourcing would have cost £2,400.
Types of Search Relevance Tasks
Search relevance evaluation comes in several forms:
| Task Type | Complexity | UK Cost per 1,000 Queries | Kenya Cost per 1,000 Queries |
|---|---|---|---|
| Query-Result Rating (3-scale: relevant/partial/not) | Low | £80 | £20 |
| Query-Result Rating (5-scale: perfect/excellent/good/fair/bad) | Medium | £140 | £35 |
| A/B Test Evaluation (compare search results from two versions) | Medium | £120 | £30 |
| NDCG Scoring (calculate ranking quality metric) | High | £160 | £40 |
| Relevance Grading (medical, legal domain-specific) | Very High | £250 | £60 |
Cost differences reflect complexity and domain expertise. General ecommerce search ("Is this shoe relevant to the query?") is straightforward. Medical search ("Is this diagnosis relevant given the patient's symptoms?") requires healthcare knowledge.
Calibration and Quality Control in Search Relevance
Consistency in search relevance ratings is built on the same framework as NLP annotation: clear guidelines, calibration samples, inter-rater agreement, and spot audits.
The 4-Step Calibration Protocol
Step 1: Detailed Rating Guidelines. Define what "Relevant," "Partially Relevant," and "Not Relevant" mean for your specific search. Example: "Relevant = the product exactly matches the query. Partially Relevant = the product is close but missing a key attribute (e.g., size or colour). Not Relevant = the product is in a different category." Include 10–15 example query-result pairs with explanations.
Step 2: Pre-Work Calibration. Have 3–4 raters independently rate 50–100 calibration samples. Measure agreement (Fleiss' Kappa for multi-rater, or pairwise correlation for pair ranking). Target: Kappa > 0.75 for search relevance (slightly lower than annotation because relevance is more subjective).
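Below is a minimal sketch of the agreement check, assuming calibration ratings are collected into a samples × raters matrix; the data values are hypothetical and the calculation uses statsmodels.

```python
# Pre-work calibration check: Fleiss' kappa across 4 raters.
# Requires statsmodels (pip install statsmodels).
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Hypothetical calibration data: rows = samples, columns = raters,
# values = label indices (0 = Not Relevant, 1 = Partial, 2 = Relevant).
ratings = np.array([
    [2, 2, 2, 1],   # sample 1: three raters said Relevant, one Partial
    [0, 0, 1, 0],
    [2, 2, 2, 2],
    [1, 1, 0, 1],
])

# aggregate_raters converts the raw matrix into per-category counts per sample.
counts, _ = aggregate_raters(ratings)
kappa = fleiss_kappa(counts)
print(f"Fleiss' kappa: {kappa:.2f}")  # discuss and recalibrate if below 0.75
```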
Step 3: Production Rating. Once calibrated, raters begin production work. Every 500–1,000 ratings, re-assess agreement on a random sample. If agreement dips, discuss and recalibrate.
Step 4: Weekly Audits. A QA lead spot-checks 5–10% of ratings. Look for systematic errors (e.g., "Rater X always marks size-mismatch as Not Relevant" vs. "Rater Y always marks as Partially Relevant"). Investigate and retrain.
Measuring Quality: NDCG, Precision, Correlation
Three metrics matter: (1) Fleiss' Kappa measures inter-rater agreement; target > 0.75. (2) NDCG (Normalised Discounted Cumulative Gain) measures ranking quality if you're evaluating a ranking system, calculated from the ratings; target a 5–10% NDCG improvement after the ratings are fed back into the ranker. (3) Spearman correlation: if two independent raters rate the same queries, their correlation should be > 0.70.
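A minimal sketch of the two-rater correlation check, assuming both raters scored the same queries on the same numeric scale (the values below are hypothetical):

```python
# Two-rater consistency check via Spearman rank correlation. Requires scipy.
from scipy.stats import spearmanr

rater_a = [3, 2, 3, 1, 2, 3, 1]  # hypothetical ratings, higher = more relevant
rater_b = [3, 2, 2, 1, 2, 3, 2]

rho, p_value = spearmanr(rater_a, rater_b)
print(f"Spearman rho: {rho:.2f} (target > 0.70), p = {p_value:.3f}")
```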
Tools and Workflow for Search Relevance Evaluation
Most search relevance evaluation is done on custom platforms or simple tools:
| Tool/Platform | Best For | Key Features | Cost |
|---|---|---|---|
| Quaere (open-source) | Small-scale search relevance | Simple web UI, local storage, no dependencies | Free |
| Labelbox | Enterprise-scale evaluation | Batch operations, inter-rater agreement tracking, analytics | £500–2,000/mo |
| Custom web form (simple) | One-off projects, tight timeline | Minimal setup, CSV export, basic analytics | £0–500 dev cost |
| Karpukh (search-specific) | Search ranking optimization | Query-result interface, NDCG calculation, rater dashboards | £1,000–3,000/mo |
Typical Workflow
1. Extract queries and top-10 results from your search index.
2. Create a batch of 1,000–2,000 query-result pairs (see the sketch after this list).
3. Deploy to raters via a simple interface (dropdown: Relevant / Partial / Not Relevant).
4. Raters complete ratings (typical rate: 300–500 queries per day).
5. Collect results, calculate Kappa and NDCG.
6. Feed ratings back into your ranking model.
7. Measure improvement in an A/B test.
8. Repeat.
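A minimal sketch of steps 1–2, building a rating batch as a CSV. The `run_search()` helper below is a hypothetical stand-in for your real search client:

```python
import csv

# Hypothetical stand-in for your search index; replace with a call to your
# real client (Elasticsearch, Algolia, etc.).
def run_search(query, top_k=10):
    return [{"id": f"demo-{i}", "title": f"Result {i} for '{query}'"}
            for i in range(1, top_k + 1)]

queries = ["black leather women's shoes size 7", "assign task to John"]

with open("rating_batch.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["query", "position", "result_id", "result_title", "rating"])
    for query in queries:
        for position, result in enumerate(run_search(query), start=1):
            # "rating" is left blank for the rater to fill in
            writer.writerow([query, position, result["id"], result["title"], ""])
```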
Team Structure and Cost Model
A typical search relevance project involves multiple roles:
| Role | Responsibility | UK Annual Cost |
|---|---|---|
| Search Relevance Lead | Guideline creation, rater onboarding, QA audits, NDCG analysis, model feedback | £28,000–£36,000 |
| Search Raters (team of 3–5) | Rate query-result pairs, flag edge cases, provide explanations (if required) | £16,000–£22,000 (3–5 × £4–5.5k) |
| Data Analyst (part-time, 0.3 FTE) | NDCG calculation, inter-rater agreement measurement, statistical analysis | £9,000–£12,000 |
| Product Manager (oversight, 0.2 FTE) | Success metric definition, A/B test design, feedback loop to ML team | £8,000–£10,000 |
Total annual cost (UK in-house, 4 raters): £61,000–£80,000. Total annual cost (Kenya outsourced): £13,000–£18,100. Saving: 70–80%.
From Ratings to Ranking: Closing the Loop
Collecting ratings is half the work. Using them to improve your search is the other half.
Step 1: Aggregate Ratings
If multiple raters rate the same query-result pair, use majority voting or average rating. For pair ranking ("Is A better than B?"), calculate win rates per result.
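A minimal sketch of majority-vote aggregation, assuming each (query, result) pair was rated by several raters; the pairs and labels are hypothetical:

```python
from collections import Counter

# Hypothetical ratings: each (query_id, result_id) pair rated by three raters.
ratings_by_pair = {
    ("q1", "r1"): ["Relevant", "Relevant", "Partially Relevant"],
    ("q1", "r2"): ["Not Relevant", "Partially Relevant", "Not Relevant"],
}

for pair, labels in ratings_by_pair.items():
    # most_common(1) returns the label with the most votes
    majority_label, votes = Counter(labels).most_common(1)[0]
    print(pair, "->", majority_label, f"({votes}/{len(labels)} votes)")
```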
Step 2: Calculate NDCG and Identify Failures
NDCG measures ranking quality: (1) compute the discounted cumulative gain (DCG) of the ideal ranking, with results sorted by rating (most relevant first), where each result's gain is discounted by its position so early positions count more; (2) compute the DCG of your search engine's actual ranking; (3) NDCG = actual DCG / ideal DCG, a score between 0 and 1. Then identify query-result pairs where your ranking disagrees with the raters; these are learning opportunities.
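A minimal sketch of NDCG@10 from graded ratings, using the common rel / log2(i + 1) discount for 1-indexed position i; the grades and their ordering below are hypothetical:

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: position pos is 0-indexed, so the
    discount log2(pos + 2) equals log2(i + 1) for 1-indexed rank i."""
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances))

def ndcg(relevances):
    ideal = dcg(sorted(relevances, reverse=True))  # best possible ordering
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Ratings for your engine's actual top-10, in ranked order
# (2 = Relevant, 1 = Partially Relevant, 0 = Not Relevant).
actual = [2, 1, 2, 0, 1, 0, 0, 2, 1, 0]
print(f"NDCG@10: {ndcg(actual):.2f}")
```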
Step 3: Feature Analysis
Why did the raters prefer result B over result A? Was it the product title? The review rating? The price? Have a domain expert (product manager) analyse the top N disagreements. Identify which features correlate with "relevant" ratings.
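A minimal sketch of this analysis, assuming each rated result carries hypothetical features such as review score and title match; averaging features per label hints at which ones matter:

```python
import pandas as pd

# Hypothetical rated results with two candidate ranking features.
df = pd.DataFrame({
    "label":        ["Relevant", "Relevant", "Partially Relevant", "Not Relevant"],
    "review_stars": [4.8, 4.5, 4.6, 3.2],
    "title_match":  [1, 1, 0, 0],   # 1 = query term appears in the product title
})

# Per-label feature averages: large gaps suggest features worth boosting on.
print(df.groupby("label").mean(numeric_only=True))
```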
Step 4: Re-Rank or Re-Train
Two options: (a) Adjust ranking rules (heuristics). E.g., "Boost products with 4.5+ star reviews." Quick, interpretable. (b) Train a learning-to-rank (LTR) model. Use ratings as training labels. More sophisticated, but requires ML expertise. Most companies start with (a), then move to (b).
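A minimal sketch of option (a), a heuristic boost applied on top of the base ranking score; the field names and boost factors are illustrative assumptions, not a prescribed recipe:

```python
def boosted_score(result):
    """Adjust the base retrieval score with simple, interpretable rules."""
    score = result["base_score"]
    if result.get("review_stars", 0) >= 4.5:
        score *= 1.2          # boost highly reviewed products
    if not result.get("in_stock", True):
        score *= 0.5          # demote out-of-stock items
    return score

results = [
    {"id": "a", "base_score": 1.0, "review_stars": 4.7, "in_stock": True},
    {"id": "b", "base_score": 1.1, "review_stars": 3.9, "in_stock": False},
]
reranked = sorted(results, key=boosted_score, reverse=True)
print([r["id"] for r in reranked])  # -> ['a', 'b']
```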
Step 5: A/B Test and Measure
Deploy the improved ranking to 10% of users. Measure: search adoption (% of users who search), click-through rate, conversion rate, user satisfaction. If metrics improve, roll out to 100%.
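A minimal sketch of checking whether an observed conversion lift is statistically significant, using a two-proportion z-test with hypothetical counts:

```python
# Requires statsmodels (pip install statsmodels).
from statsmodels.stats.proportion import proportions_ztest

conversions = [920, 1010]    # control vs. treatment (hypothetical)
sessions = [20000, 20000]    # sessions per arm

z_stat, p_value = proportions_ztest(conversions, sessions)
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")
# Roll out to 100% only if the lift is positive and p is below your threshold.
```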
Key takeaways
• Search relevance evaluation (query-result rating, NDCG scoring) is critical for ecommerce, SaaS, and enterprise search; outsourcing cuts costs by 70–80%.
• UK raters cost £16–22k annually; Kenya-based raters cost £4–6.5k annually.
• Task types range from 3-scale (Relevant/Partial/Not) to 5-scale and domain-specific (medical, legal).
• Quality calibration: detailed guidelines (10–15 examples), pre-work agreement checks (Kappa > 0.75), weekly audits, and inter-rater correlation > 0.70.
• Tools: Quaere (free, self-hosted), Labelbox (enterprise), or a custom web form; deploy on a shared server for UK + Kenya access.
• Workflow: extract queries → rate pairs (300–500/day per rater) → calculate NDCG → identify failures → adjust ranking rules or train an LTR model → A/B test.
• Team structure: 1 lead, 3–5 raters, 0.3 FTE data analyst, 0.2 FTE product manager = £13–18k/year in Kenya vs. £61–80k/year in UK.
Written by
Treba Research
Treba editorial team — expert analysis on outsourcing, compliance, and building distributed UK–Kenya teams.

