What Is Search Relevance Evaluation and Why Does It Matter?
Search relevance evaluation measures whether a search engine returns useful results. A user types a query (e.g., "black leather women's shoes, size 7"). The search engine returns 10 results. A rater evaluates each result on a scale: Relevant (exactly what the user wanted), Partially Relevant (useful but not perfect), Not Relevant (wrong product). This feedback teaches the ranking algorithm what good results look like.
Why does it matter? Search is the primary way ecommerce and SaaS users navigate platforms. A 1% improvement in search relevance can drive a 2–5% increase in conversion, because users find what they want faster; a search engine with poor relevance drives users away. For internal enterprise search (legal discovery, knowledge management), relevance is critical for compliance and efficiency.
The challenge: relevance is subjective. What's "relevant" to one rater might be "partially relevant" to another. And the volume is massive. A single search index can have millions of query-result pairs. To train a ranking model, you might need 10,000–100,000+ ratings. That's 4–6 full-time raters for 12 weeks.
Real-World Impact: SaaS Product Search
A UK SaaS company providing project management software had poor search. Users typed "assign task to John" and got results about task templates, not actual tasks. They hired a search relevance team to rate 20,000 query-result pairs. NDCG improved from 0.68 to 0.75 (+10%). User search adoption increased 18%. But the rating work took 6 weeks, 4 raters, and cost £12,000. Outsourcing would have cost £2,400.
Types of Search Relevance Tasks
Search relevance evaluation comes in several forms:
| Task Type | Complexity | UK Cost per 1,000 Queries | Kenya Cost per 1,000 Queries |
|---|---|---|---|
| Query-Result Rating (3-scale: relevant/partial/not) | Low | £80 | £20 |
| Query-Result Rating (5-scale: perfect/excellent/good/fair/bad) | Medium | £140 | £35 |
| A/B Test Evaluation (compare search results from two versions) | Medium | £120 | £30 |
| NDCG Scoring (calculate ranking quality metric) | High | £160 | £40 |
| Relevance Grading (medical, legal domain-specific) | Very High | £250 | £60 |
Cost differences reflect complexity and domain expertise. General ecommerce search ("Is this shoe relevant to the query?") is straightforward. Medical search ("Is this diagnosis relevant given the patient's symptoms?") requires healthcare knowledge.
Calibration and Quality Control in Search Relevance
Consistency in search relevance ratings is built on the same framework as NLP annotation: clear guidelines, calibration samples, inter-rater agreement, and spot audits.
The 4-Step Calibration Protocol
Step 1: Detailed Rating Guidelines. Define what "Relevant," "Partially Relevant," and "Not Relevant" mean for your specific search. Example: "Relevant = the product exactly matches the query. Partially Relevant = the product is close but missing a key attribute (e.g., size or colour). Not Relevant = the product is in a different category." Include 10–15 example query-result pairs with explanations.
Step 2: Pre-Work Calibration. Have 3–4 raters independently rate 50–100 calibration samples. Measure agreement (Fleiss' Kappa for multi-rater, or pairwise correlation for pair ranking). Target: Kappa > 0.75 for search relevance (slightly lower than annotation because relevance is more subjective).
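Below is a minimal sketch of the agreement check, assuming calibration ratings are collected into a samples × raters matrix; the data values are hypothetical and the calculation uses statsmodels.

```python
# Pre-work calibration check: Fleiss' kappa across 4 raters.
# Requires statsmodels (pip install statsmodels).
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Hypothetical calibration data: rows = samples, columns = raters,
# values = label indices (0 = Not Relevant, 1 = Partial, 2 = Relevant).
ratings = np.array([
    [2, 2, 2, 1],   # sample 1: three raters said Relevant, one Partial
    [0, 0, 1, 0],
    [2, 2, 2, 2],
    [1, 1, 0, 1],
])

# aggregate_raters converts the raw matrix into per-category counts per sample.
counts, _ = aggregate_raters(ratings)
kappa = fleiss_kappa(counts)
print(f"Fleiss' kappa: {kappa:.2f}")  # discuss and recalibrate if below 0.75
```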
Step 3: Production Rating. Once calibrated, raters begin production work. Every 500–1,000 ratings, re-assess agreement on a random sample. If agreement dips, discuss and recalibrate.
Step 4: Weekly Audits. A QA lead spot-checks 5–10% of ratings. Look for systematic errors (e.g., "Rater X always marks size-mismatch as Not Relevant" vs. "Rater Y always marks as Partially Relevant"). Investigate and retrain.
Measuring Quality: NDCG, Precision, Correlation
Three metrics matter: (1) Fleiss' Kappa measures inter-rater agreement; target > 0.75. (2) NDCG (Normalised Discounted Cumulative Gain) measures ranking quality if you're evaluating a ranking system, calculated from the ratings; target a 5–10% NDCG improvement after the ratings are fed back into the ranker. (3) Spearman correlation: if two independent raters rate the same queries, their correlation should be > 0.70.
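A minimal sketch of the two-rater correlation check, assuming both raters scored the same queries on the same numeric scale (the values below are hypothetical):

```python
# Two-rater consistency check via Spearman rank correlation. Requires scipy.
from scipy.stats import spearmanr

rater_a = [3, 2, 3, 1, 2, 3, 1]  # hypothetical ratings, higher = more relevant
rater_b = [3, 2, 2, 1, 2, 3, 2]

rho, p_value = spearmanr(rater_a, rater_b)
print(f"Spearman rho: {rho:.2f} (target > 0.70), p = {p_value:.3f}")
```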
Tools and Workflow for Search Relevance Evaluation
Most search relevance evaluation is done on custom platforms or simple tools:
| Tool/Platform | Best For | Key Features | Cost |
|---|---|---|---|
| Quaere (open-source) | Small-scale search relevance | Simple web UI, local storage, no dependencies | Free |
| Labelbox | Enterprise-scale evaluation | Batch operations, inter-rater agreement tracking, analytics | £500–2,000/mo |
| Custom web form (simple) | One-off projects, tight timeline | Minimal setup, CSV export, basic analytics | £0–500 dev cost |
| Karpukh (search-specific) | Search ranking optimization | Query-result interface, NDCG calculation, rater dashboards | £1,000–3,000/mo |
Typical Workflow
1. Extract queries and top-10 results from your search index.
2. Create a batch of 1,000–2,000 query-result pairs (see the sketch after this list).
3. Deploy to raters via a simple interface (dropdown: Relevant / Partial / Not Relevant).
4. Raters complete ratings (typical rate: 300–500 queries per day).
5. Collect results, calculate Kappa and NDCG.
6. Feed ratings back into your ranking model.
7. Measure improvement in an A/B test.
8. Repeat.
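A minimal sketch of steps 1–2, building a rating batch as a CSV. The `run_search()` helper below is a hypothetical stand-in for your real search client:

```python
import csv

# Hypothetical stand-in for your search index; replace with a call to your
# real client (Elasticsearch, Algolia, etc.).
def run_search(query, top_k=10):
    return [{"id": f"demo-{i}", "title": f"Result {i} for '{query}'"}
            for i in range(1, top_k + 1)]

queries = ["black leather women's shoes size 7", "assign task to John"]

with open("rating_batch.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["query", "position", "result_id", "result_title", "rating"])
    for query in queries:
        for position, result in enumerate(run_search(query), start=1):
            # "rating" is left blank for the rater to fill in
            writer.writerow([query, position, result["id"], result["title"], ""])
```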
Team Structure and Cost Model
A typical search relevance project involves multiple roles:
| Role | Responsibility | UK Annual Cost |
|---|---|---|
| Search Relevance Lead | Guideline creation, rater onboarding, QA audits, NDCG analysis, model feedback | £28,000–£36,000 |
| Search Raters (team of 3–5) | Rate query-result pairs, flag edge cases, provide explanations (if required) | £16,000–£22,000 (3–5 × £4–5.5k) |
| Data Analyst (part-time, 0.3 FTE) | NDCG calculation, inter-rater agreement measurement, statistical analysis | £9,000–£12,000 |
| Product Manager (oversight, 0.2 FTE) | Success metric definition, A/B test design, feedback loop to ML team | £8,000–£10,000 |
Total annual cost (UK in-house, 4 raters): £61,000–£80,000. Total annual cost (Kenya outsourced): £13,000–£18,100. Saving: 70–80%.
From Ratings to Ranking: Closing the Loop
Collecting ratings is half the work. Using them to improve your search is the other half.
Step 1: Aggregate Ratings
If multiple raters rate the same query-result pair, use majority voting or average rating. For pair ranking ("Is A better than B?"), calculate win rates per result.
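A minimal sketch of majority-vote aggregation, assuming each (query, result) pair was rated by several raters; the pairs and labels are hypothetical:

```python
from collections import Counter

# Hypothetical ratings: each (query_id, result_id) pair rated by three raters.
ratings_by_pair = {
    ("q1", "r1"): ["Relevant", "Relevant", "Partially Relevant"],
    ("q1", "r2"): ["Not Relevant", "Partially Relevant", "Not Relevant"],
}

for pair, labels in ratings_by_pair.items():
    # most_common(1) returns the label with the most votes
    majority_label, votes = Counter(labels).most_common(1)[0]
    print(pair, "->", majority_label, f"({votes}/{len(labels)} votes)")
```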
Step 2: Calculate NDCG and Identify Failures
NDCG measures ranking quality: (1) compute the discounted cumulative gain (DCG) of the ideal ranking, with results sorted by rating (most relevant first), where each result's gain is discounted by its position so early positions count more; (2) compute the DCG of your search engine's actual ranking; (3) NDCG = actual DCG / ideal DCG, a score between 0 and 1. Then identify query-result pairs where your ranking disagrees with the raters; these are learning opportunities.
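A minimal sketch of NDCG@10 from graded ratings, using the common rel / log2(i + 1) discount for 1-indexed position i; the grades and their ordering below are hypothetical:

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: position pos is 0-indexed, so the
    discount log2(pos + 2) equals log2(i + 1) for 1-indexed rank i."""
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances))

def ndcg(relevances):
    ideal = dcg(sorted(relevances, reverse=True))  # best possible ordering
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Ratings for your engine's actual top-10, in ranked order
# (2 = Relevant, 1 = Partially Relevant, 0 = Not Relevant).
actual = [2, 1, 2, 0, 1, 0, 0, 2, 1, 0]
print(f"NDCG@10: {ndcg(actual):.2f}")
```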
Step 3: Feature Analysis
Why did the raters prefer result B over result A? Was it the product title? The review rating? The price? Have a domain expert (product manager) analyse the top N disagreements. Identify which features correlate with "relevant" ratings.
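A minimal sketch of this analysis, assuming each rated result carries hypothetical features such as review score and title match; averaging features per label hints at which ones matter:

```python
import pandas as pd

# Hypothetical rated results with two candidate ranking features.
df = pd.DataFrame({
    "label":        ["Relevant", "Relevant", "Partially Relevant", "Not Relevant"],
    "review_stars": [4.8, 4.5, 4.6, 3.2],
    "title_match":  [1, 1, 0, 0],   # 1 = query term appears in the product title
})

# Per-label feature averages: large gaps suggest features worth boosting on.
print(df.groupby("label").mean(numeric_only=True))
```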
Step 4: Re-Rank or Re-Train
Two options: (a) Adjust ranking rules (heuristics). E.g., "Boost products with 4.5+ star reviews." Quick, interpretable. (b) Train a learning-to-rank (LTR) model. Use ratings as training labels. More sophisticated, but requires ML expertise. Most companies start with (a), then move to (b).
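A minimal sketch of option (a), a heuristic boost applied on top of the base ranking score; the field names and boost factors are illustrative assumptions, not a prescribed recipe:

```python
def boosted_score(result):
    """Adjust the base retrieval score with simple, interpretable rules."""
    score = result["base_score"]
    if result.get("review_stars", 0) >= 4.5:
        score *= 1.2          # boost highly reviewed products
    if not result.get("in_stock", True):
        score *= 0.5          # demote out-of-stock items
    return score

results = [
    {"id": "a", "base_score": 1.0, "review_stars": 4.7, "in_stock": True},
    {"id": "b", "base_score": 1.1, "review_stars": 3.9, "in_stock": False},
]
reranked = sorted(results, key=boosted_score, reverse=True)
print([r["id"] for r in reranked])  # -> ['a', 'b']
```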
Step 5: A/B Test and Measure
Deploy the improved ranking to 10% of users. Measure: search adoption (% of users who search), click-through rate, conversion rate, user satisfaction. If metrics improve, roll out to 100%.
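A minimal sketch of checking whether an observed conversion lift is statistically significant, using a two-proportion z-test with hypothetical counts:

```python
# Requires statsmodels (pip install statsmodels).
from statsmodels.stats.proportion import proportions_ztest

conversions = [920, 1010]    # control vs. treatment (hypothetical)
sessions = [20000, 20000]    # sessions per arm

z_stat, p_value = proportions_ztest(conversions, sessions)
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")
# Roll out to 100% only if the lift is positive and p is below your threshold.
```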
Key takeaways
• Search relevance evaluation (query-result rating, NDCG scoring) is critical for ecommerce, SaaS, and enterprise search; outsourcing cuts costs by 70–80%.
• UK raters cost £16–22k annually; Kenya-based raters cost £4–6.5k annually.
• Task types range from 3-scale (Relevant/Partial/Not) to 5-scale and domain-specific (medical, legal).
• Quality calibration: detailed guidelines (10–15 examples), pre-work agreement checks (Kappa > 0.75), weekly audits, and inter-rater correlation > 0.70.
• Tools: Quaere (free, self-hosted), Labelbox (enterprise), or a custom web form; deploy on a shared server for UK + Kenya access.
• Workflow: extract queries → rate pairs (300–500/day per rater) → calculate NDCG → identify failures → adjust ranking rules or train an LTR model → A/B test.
• Team structure: 1 lead, 3–5 raters, 0.3 FTE data analyst, 0.2 FTE product manager = £13–18k/year in Kenya vs. £61–80k/year in UK.
Written by
Treba Research
Treba editorial team — expert analysis on outsourcing, compliance, and building distributed UK–Kenya teams.

