Insight Article · 5 min read

Outsourced Search Relevance Tuning: A Practical Guide

How to outsource search relevance tuning: query rating, NDCG scoring, and A/B evaluation, plus team structure, costs, and an implementation guide for SaaS platforms.


What Is Search Relevance Evaluation and Why Does It Matter?

Search relevance evaluation measures whether a search engine returns useful results. A user types a query (e.g., "black leather women's shoes, size 7"). The search engine returns 10 results. A rater evaluates each result on a scale: Relevant (exactly what the user wanted), Partially Relevant (useful but not perfect), Not Relevant (wrong product). This feedback teaches the ranking algorithm what good results look like.

Why does it matter? Search is the primary way ecommerce and SaaS users navigate platforms. A 1% improvement in search relevance can drive a 2–5% increase in conversion, because users find what they want faster. A search engine with poor relevance drives users away. For internal enterprise search (legal discovery, knowledge management), relevance is critical for compliance and efficiency.

The challenge: relevance is subjective. What's "relevant" to one rater might be "partially relevant" to another. And the volume is massive. A single search index can have millions of query-result pairs. To train a ranking model, you might need 10,000–100,000+ ratings. That's 4–6 full-time raters for 12 weeks.

A UK SaaS company providing project management software had poor search. Users typed "assign task to John" and got results about task templates, not actual tasks. They hired a search relevance team to rate 20,000 query-result pairs. NDCG improved from 0.68 to 0.75 (+10%). User search adoption increased 18%. But the rating work took 6 weeks, 4 raters, and cost £12,000. Outsourcing would have cost £2,400.

Types of Search Relevance Tasks

Search relevance evaluation comes in several forms:

| Task Type | Complexity | UK Cost per 1,000 Queries | Kenya Cost per 1,000 Queries |
| --- | --- | --- | --- |
| Query-Result Rating (3-scale: relevant/partial/not) | Low | £80 | £20 |
| Query-Result Rating (5-scale: perfect/excellent/good/fair/bad) | Medium | £140 | £35 |
| A/B Test Evaluation (compare search results from two versions) | Medium | £120 | £30 |
| NDCG Scoring (calculate ranking quality metric) | High | £160 | £40 |
| Relevance Grading (medical, legal domain-specific) | Very High | £250 | £60 |

Cost differences reflect complexity and domain expertise. General ecommerce search ("Is this shoe relevant to the query?") is straightforward. Medical search ("Is this diagnosis relevant given the patient's symptoms?") requires healthcare knowledge.

Calibration and Quality Control in Search Relevance

Consistency in search relevance ratings is built on the same framework as NLP annotation: clear guidelines, calibration samples, inter-rater agreement, and spot audits.

The 4-Step Calibration Protocol

Step 1: Detailed Rating Guidelines. Define what "Relevant," "Partially Relevant," and "Not Relevant" mean for your specific search. Example: "Relevant = the product exactly matches the query. Partially Relevant = the product is close but missing a key attribute (e.g., size or colour). Not Relevant = the product is in a different category." Include 10–15 example query-result pairs with explanations.

Step 2: Pre-Work Calibration. Have 3–4 raters independently rate 50–100 calibration samples. Measure agreement (Fleiss' Kappa for multi-rater, or pairwise correlation for pair ranking). Target: Kappa > 0.75 for search relevance (slightly lower than annotation because relevance is more subjective).
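The agreement check in Step 2 can be computed without any dependencies. Below is a minimal Python sketch of Fleiss' Kappa for a batch of calibration samples; the sample data and the R/P/N labels are illustrative, not from a real project.

```python
from collections import Counter

def fleiss_kappa(ratings):
    """ratings: one list per item, each containing the same number of
    rater labels, e.g. [["R", "R", "P"], ["N", "N", "N"], ...]."""
    n = len(ratings[0])  # raters per item
    categories = sorted({label for item in ratings for label in item})
    counts = [[Counter(item)[c] for c in categories] for item in ratings]
    # Observed agreement: mean per-item agreement P_i
    p_i = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in counts]
    p_bar = sum(p_i) / len(p_i)
    # Chance agreement P_e from the category marginals
    total = len(ratings) * n
    p_j = [sum(row[j] for row in counts) / total for j in range(len(categories))]
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1 - p_e)

# Example: 4 calibration items, 3 raters, 3-scale labels (R/P/N)
sample = [["R", "R", "R"], ["R", "R", "P"], ["N", "N", "N"], ["P", "P", "N"]]
print(round(fleiss_kappa(sample), 3))
```

A result below the 0.75 target on the calibration batch is the signal to revise the guidelines and re-run before any production rating starts.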

Step 3: Production Rating. Once calibrated, raters begin production work. Every 500–1,000 ratings, re-assess agreement on a random sample. If agreement dips, discuss and recalibrate.

Step 4: Weekly Audits. A QA lead spot-checks 5–10% of ratings. Look for systematic errors (e.g., "Rater X always marks size-mismatch as Not Relevant" vs. "Rater Y always marks as Partially Relevant"). Investigate and retrain.

Measuring Quality: NDCG, Precision, Correlation

Three metrics matter: (1) Fleiss' Kappa measures inter-rater agreement; target > 0.75. (2) NDCG (Normalised Discounted Cumulative Gain) measures ranking quality if you're evaluating a ranking system, and is calculated from the ratings; target a 5–10% NDCG improvement after feeding ratings back into the ranker. (3) Spearman correlation: if two independent raters rate the same queries, their rank correlation should be > 0.70.
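The Spearman check is just Pearson correlation on tie-averaged ranks. A stdlib sketch follows (in practice you might reach for `scipy.stats.spearmanr` instead); the rating lists are hypothetical.

```python
def spearman(a, b):
    """Spearman rank correlation between two equal-length rating lists."""
    def ranks(xs):
        # Assign ranks, averaging over ties
        order = sorted(range(len(xs)), key=lambda i: xs[i])
        r = [0.0] * len(xs)
        i = 0
        while i < len(xs):
            j = i
            while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
                j += 1
            avg = (i + j) / 2 + 1  # average rank of the tied run
            for k in range(i, j + 1):
                r[order[k]] = avg
            i = j + 1
        return r

    ra, rb = ranks(a), ranks(b)
    n = len(a)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    sa = sum((x - ma) ** 2 for x in ra) ** 0.5
    sb = sum((y - mb) ** 2 for y in rb) ** 0.5
    return cov / (sa * sb)

# Two raters scoring the same 5 results on a 0/1/2 scale (invented data)
rater_a = [2, 1, 0, 2, 1]
rater_b = [2, 1, 1, 2, 0]
print(round(spearman(rater_a, rater_b), 2))
```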

Tools and Workflow for Search Relevance Evaluation

Most search relevance evaluation is done on custom platforms or simple tools:

| Tool/Platform | Best For | Key Features | Cost |
| --- | --- | --- | --- |
| Quaere (open-source) | Small-scale search relevance | Simple web UI, local storage, no dependencies | Free |
| Labelbox | Enterprise-scale evaluation | Batch operations, inter-rater agreement tracking, analytics | £500–2,000/mo |
| Custom web form (simple) | One-off projects, tight timeline | Minimal setup, CSV export, basic analytics | £0–500 dev cost |
| Karpukh (search-specific) | Search ranking optimization | Query-result interface, NDCG calculation, rater dashboards | £1,000–3,000/mo |

Typical Workflow

1. Extract queries and top-10 results from your search index.
2. Create a batch of 1,000–2,000 query-result pairs.
3. Deploy to raters via a simple interface (dropdown: Relevant / Partial / Not Relevant).
4. Raters complete ratings (typical rate: 300–500 queries per day).
5. Collect results, calculate Kappa and NDCG.
6. Feed ratings back into your ranking model.
7. Measure improvement in an A/B test.
8. Repeat.
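Step 2 of the workflow (building a rating batch) can be as simple as emitting a CSV with an empty rating column for raters to fill in. A minimal sketch, with hypothetical queries and SKUs:

```python
import csv
import io

# Hypothetical (query, result_id, result_title) triples pulled from the index
pairs = [
    ("black leather shoes", "sku-101", "Women's Black Leather Loafers"),
    ("black leather shoes", "sku-204", "Brown Suede Boots"),
]

def write_rating_batch(pairs, fh):
    """Write query-result pairs as a CSV batch with an empty 'rating'
    column (Relevant / Partial / Not Relevant) for raters to complete."""
    writer = csv.writer(fh)
    writer.writerow(["query", "result_id", "result_title", "rating"])
    for query, result_id, title in pairs:
        writer.writerow([query, result_id, title, ""])

buf = io.StringIO()
write_rating_batch(pairs, buf)
print(buf.getvalue().splitlines()[0])
```

The same file, once returned with the rating column filled, feeds directly into the Kappa and NDCG calculations in step 5.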

Team Structure and Cost Model

A typical search relevance project involves multiple roles:

| Role | Responsibility | UK Annual Cost |
| --- | --- | --- |
| Search Relevance Lead | Guideline creation, rater onboarding, QA audits, NDCG analysis, model feedback | £28,000–£36,000 |
| Search Raters (team of 3–5) | Rate query-result pairs, flag edge cases, provide explanations (if required) | £16,000–£22,000 (3–5 × £4–5.5k) |
| Data Analyst (part-time, 0.3 FTE) | NDCG calculation, inter-rater agreement measurement, statistical analysis | £9,000–£12,000 |
| Product Manager (oversight, 0.2 FTE) | Success metric definition, A/B test design, feedback loop to ML team | £8,000–£10,000 |

Total annual cost (UK in-house, 4 raters): £61,000–£80,000. Total annual cost (Kenya outsourced): £13,000–£18,100. Saving: 70–80%.

From Ratings to Ranking: Closing the Loop

Collecting ratings is half the work. Using them to improve your search is the other half.

Step 1: Aggregate Ratings

If multiple raters rate the same query-result pair, use majority voting or average rating. For pair ranking ("Is A better than B?"), calculate win rates per result.
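A minimal majority-vote aggregator might look like the following sketch. The tie-breaking rule (fall back to the more cautious label) and the vote data are illustrative assumptions, not a fixed convention.

```python
from collections import Counter

def aggregate(ratings_by_pair):
    """Majority vote per query-result pair; ties resolve to the more
    cautious label via a fixed priority order (an assumed policy)."""
    priority = ["Not Relevant", "Partially Relevant", "Relevant"]
    out = {}
    for pair, votes in ratings_by_pair.items():
        counts = Counter(votes)
        best = max(counts.values())
        # First label in priority order that reached the top vote count
        winner = next(label for label in priority if counts.get(label) == best)
        out[pair] = winner
    return out

# Invented votes: 3 raters on one pair, 2 raters on another
votes = {
    ("black leather shoes", "sku-101"): ["Relevant", "Relevant", "Partially Relevant"],
    ("black leather shoes", "sku-204"): ["Not Relevant", "Partially Relevant"],
}
print(aggregate(votes))
```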

Step 2: Calculate NDCG and Identify Failures

NDCG measures ranking quality: (1) Ideal ranking: if results were sorted by rating (most relevant first), what would the discounted cumulative gain (DCG) be? (2) Actual ranking: what is the DCG of your current search order, where each result's gain is discounted by its position? (3) NDCG = actual DCG / ideal DCG. Identify query-result pairs where your ranking disagrees with raters. These are learning opportunities.
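A short sketch of that calculation, using the standard log2 position discount and assumed gain values (Relevant = 2, Partial = 1, Not = 0):

```python
import math

def dcg(gains):
    # Discounted cumulative gain: position i (0-based) is discounted by log2(i + 2)
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg(rated_gains, k=10):
    """rated_gains: relevance gains in the order the search returned
    results, e.g. Relevant=2, Partial=1, Not=0. NDCG = DCG / ideal DCG."""
    actual = dcg(rated_gains[:k])
    ideal = dcg(sorted(rated_gains, reverse=True)[:k])
    return actual / ideal if ideal > 0 else 0.0

# The search put a Partial result above a Relevant one, so NDCG < 1
print(round(ndcg([1, 2, 0, 0]), 3))
```

Pairs that drag NDCG down, where the actual order inverts the rated order, are exactly the disagreements worth handing to a domain expert in the next step.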

Step 3: Feature Analysis

Why did the raters prefer result B over result A? Was it the product title? The review rating? The price? Have a domain expert (product manager) analyse the top N disagreements. Identify which features correlate with "relevant" ratings.

Step 4: Re-Rank or Re-Train

Two options: (a) Adjust ranking rules (heuristics). E.g., "Boost products with 4.5+ star reviews." Quick, interpretable. (b) Train a learning-to-rank (LTR) model. Use ratings as training labels. More sophisticated, but requires ML expertise. Most companies start with (a), then move to (b).
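Option (a) can be a one-line scoring tweak. The sketch below boosts products with 4.5+ star reviews by a fixed bonus before re-sorting; the field names and the 0.2 weight are illustrative assumptions, and in practice the weight would be tuned against the ratings.

```python
REVIEW_BOOST = 0.2  # illustrative bonus; tune against rated data

def rerank(results):
    """Heuristic re-rank: add a fixed bonus to the base relevance score
    for products with 4.5+ star reviews, then sort descending."""
    def score(r):
        bonus = REVIEW_BOOST if r["stars"] >= 4.5 else 0.0
        return r["base_score"] + bonus
    return sorted(results, key=score, reverse=True)

# Invented catalogue entries: the boost flips the order
results = [
    {"id": "sku-204", "base_score": 0.71, "stars": 3.9},
    {"id": "sku-101", "base_score": 0.65, "stars": 4.8},
]
print([r["id"] for r in rerank(results)])
```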

Step 5: A/B Test and Measure

Deploy the improved ranking to 10% of users. Measure: search adoption (% of users who search), click-through rate, conversion rate, user satisfaction. If metrics improve, roll out to 100%.
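To judge whether the treatment bucket's conversion lift is real rather than noise, a two-proportion z-test is a common choice. The counts below are invented for illustration; |z| > 1.96 corresponds roughly to p < 0.05 two-sided.

```python
import math

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """z-statistic for the difference in conversion rates between
    control (a) and the treatment bucket (b), using pooled variance."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Illustrative counts: 3.0% control vs 3.53% treatment conversion
z = two_proportion_z(900, 30_000, 1060, 30_000)
print(round(z, 2))
```

A significant positive z on conversion, alongside flat or improved click-through and adoption, is the go signal for the full rollout.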

Key takeaways

1. Search relevance evaluation (query-result rating, NDCG scoring) is critical for ecommerce, SaaS, and enterprise search; outsourcing cuts costs by 70–80%. UK raters cost £16–22k annually; Kenya-based raters cost £4–6.5k annually.

2. Task types range from 3-scale (Relevant/Partial/Not) to 5-scale and domain-specific (medical, legal). Quality calibration: detailed guidelines (10–15 examples), pre-work agreement checks (Kappa > 0.75), weekly audits, and inter-rater correlation > 0.70. Tools: Quaere (free, self-hosted), Labelbox (enterprise), or a custom web form deployed on a shared server for UK + Kenya access.

3. Workflow: extract queries → rate pairs (300–500/day per rater) → calculate NDCG → identify failures → adjust ranking rules or train an LTR model → A/B test. Team structure: 1 lead, 3–5 raters, 0.3 FTE data analyst, 0.2 FTE product manager = £13–18k/year in Kenya vs. £61–80k/year in UK.


Written by

Treba Research

Treba editorial team — expert analysis on outsourcing, compliance, and building distributed UK–Kenya teams.

