The Dirty Data Problem
Gartner's 2023 Data & Analytics Survey found that poor data quality costs organisations an average of 20% of revenue. For a company with £100m in revenue, that's £20m a year lost to missed sales, inefficient marketing spend, failed analytics, and broken business logic.
Where does dirty data come from? Manually entered customer information with typos. Duplicate records from merged systems. Inconsistent date formats (01/05/2024 vs 2024-05-01 vs May 1, 2024). Missing values. Outdated records. Spreadsheet imports with encoding errors.
The business impact is cascading: your CRM shows the same customer multiple times, wasting marketing budget on duplicates. Your analytics dashboard shows incorrect revenue totals. Your data warehouse fails validation rules. Your ML models train on garbage and produce garbage.
Data cleansing—removing duplicates, standardising formats, fixing missing values, enriching records—is preventative medicine. But it's labour-intensive, so companies outsource it.
What Data Cleansing Involves
Deduplication
Identifying and merging duplicate records. Simple if records match exactly (same name, email, phone). Harder when there are typos, variations, or partial matches. Tools use fuzzy matching algorithms to detect likely duplicates; humans review and merge.
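As a rough illustration of the fuzzy-matching step, the sketch below scores every pair of records with Python's standard-library `difflib.SequenceMatcher` and flags pairs above a threshold for human review. The field names (`name`, `email`), the 0.85 threshold, and the sample records are assumptions for illustration; dedicated tools use more sophisticated matching, and comparing every pair does not scale to large datasets.

```python
from difflib import SequenceMatcher
from itertools import combinations

def normalise(record):
    """Lower-case and strip the fields used for matching (field names are assumed)."""
    return " ".join(str(record.get(k, "")).strip().lower() for k in ("name", "email"))

def similarity(a, b):
    """Return a 0..1 similarity ratio between two records."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()

def likely_duplicates(records, threshold=0.85):
    """Yield (i, j, score) for record pairs scoring at or above the threshold."""
    for (i, a), (j, b) in combinations(enumerate(records), 2):
        score = similarity(a, b)
        if score >= threshold:
            yield i, j, round(score, 2)

# Illustrative sample data, not taken from any real system
records = [
    {"name": "John Doe", "email": "john.doe@example.com"},
    {"name": "Jon Doe", "email": "john.doe@example.com"},
    {"name": "Alice Smith", "email": "alice@example.org"},
]
pairs = list(likely_duplicates(records))
for i, j, score in pairs:
    print(f"Review records {i} and {j} (similarity {score})")
```

The output is a review queue rather than an automatic merge, which mirrors the split described above: the algorithm proposes, a person decides.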
Standardisation
Converting inconsistent data into a standard format. Phone numbers: (020) 1234 5678 → 02012345678. Dates: 01/05/2024 vs May 1, 2024 → 2024-05-01. Names: JOHN DOE vs john doe → John Doe. Addresses: St vs St. vs Street → Street.
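The conversions above can be sketched with the Python standard library. This is a minimal example, not a production rule set: the list of accepted date formats is an assumption, and slash-separated dates are read day-first (UK style), which would be wrong for US-formatted sources.

```python
import re
from datetime import datetime

# Accepted input formats; reading 01/05/2024 as day/month/year is an assumption
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%B %d, %Y")

def standardise_phone(raw):
    """Keep digits only, e.g. '(020) 1234 5678' becomes '02012345678'."""
    return re.sub(r"\D", "", raw)

def standardise_date(raw):
    """Try each known format; return ISO 8601 (YYYY-MM-DD) or None if none match."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None

def standardise_name(raw):
    """Collapse repeated whitespace and title-case the result."""
    return " ".join(raw.split()).title()

print(standardise_phone("(020) 1234 5678"))
print(standardise_date("01/05/2024"), standardise_date("May 1, 2024"))
print(standardise_name("JOHN DOE"))
```

Returning `None` for unparseable dates keeps bad values visible for review instead of silently guessing. Simple title-casing also mishandles names such as "McDonald", so name rules usually need exceptions.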
Validation
Checking data against rules. Email must contain @. Phone number must be 11 digits. Postal code must match UK format. Date of birth must be reasonable (not 1800). Records failing validation are flagged or corrected.
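A rule set like this might be expressed as a function that returns the names of the rules a record fails. The field names are hypothetical, the postcode pattern is a simplified approximation of the UK format, and "reasonable" date of birth is assumed here to mean an age between 0 and 120 years.

```python
import re
from datetime import date

# Simplified UK postcode pattern; real-world validation has more special cases
UK_POSTCODE = re.compile(r"^[A-Z]{1,2}\d[A-Z\d]?\s?\d[A-Z]{2}$", re.IGNORECASE)

def validate(record, today=None):
    """Return a list of rule names the record fails (an empty list means valid)."""
    today = today or date.today()
    failures = []
    if "@" not in record.get("email", ""):
        failures.append("email")
    # Count digits only, so spaces and brackets in the phone number are ignored
    if len(re.sub(r"\D", "", record.get("phone", ""))) != 11:
        failures.append("phone")
    if not UK_POSTCODE.match(record.get("postcode", "").strip()):
        failures.append("postcode")
    dob = record.get("date_of_birth")
    # Assumption: a plausible birth date is within the last 120 years and not in the future
    if dob is None or dob > today or dob.year < today.year - 120:
        failures.append("date_of_birth")
    return failures

record = {
    "email": "jane.doe-at-example.com",
    "phone": "020 1234 5678",
    "postcode": "SW1A 1AA",
    "date_of_birth": date(1800, 1, 1),
}
print(validate(record))
```

Returning every failed rule, rather than stopping at the first, makes it easier to decide whether a record should be flagged for review or corrected automatically.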
Enrichment
Adding missing data. If a customer record has a name but no email, look it up. If a company record has a name but no industry classification, append one. Enrichment services (Clearbit, Hunter, RocketReach) integrate data from public sources.
Removal of Invalid Records
Identifying and removing records that can't be salvaged: fake emails (test@example.com), test data, spam entries, deceased customers (for some use cases).
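One way to sketch this filter is shown below. The placeholder domains are the ones reserved for documentation use (example.com, example.org, example.net); the name hints and field names are hypothetical and deliberately crude, so a real rule set would need tuning against your own data.

```python
# Domains reserved for documentation and testing
PLACEHOLDER_DOMAINS = {"example.com", "example.org", "example.net"}
# Hypothetical hints that a record is test data; adjust to your own conventions
TEST_NAME_HINTS = ("test", "asdf", "dummy")

def is_unsalvageable(record):
    """True if the record uses a placeholder email domain or looks like a test entry."""
    email = record.get("email", "").strip().lower()
    domain = email.rpartition("@")[2]
    name = record.get("name", "").strip().lower()
    return domain in PLACEHOLDER_DOMAINS or any(name.startswith(h) for h in TEST_NAME_HINTS)

def split_records(records):
    """Partition into (kept, removed) so removals can be audited rather than silently dropped."""
    kept = [r for r in records if not is_unsalvageable(r)]
    removed = [r for r in records if is_unsalvageable(r)]
    return kept, removed

records = [
    {"name": "Test User", "email": "test@example.com"},
    {"name": "Alice Smith", "email": "alice@acme.co.uk"},
]
kept, removed = split_records(records)
print(len(kept), "kept,", len(removed), "removed")
```

Keeping the removed records in a separate list gives reviewers a chance to catch false positives before anything is deleted.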
Tools and Platforms
Enterprise Platforms
- Talend: Industry-leading data integration and quality platform. Supports complex transformations, fuzzy matching, deduplication. Expensive (£50k–£200k+/year) but powerful for large-scale operations.
- Informatica: Cloud-based data integration and governance. Similar to Talend; strong in enterprise environments.
- SAS Data Management: Comprehensive suite; focus on large organisations and regulatory compliance.
Open-Source Tools
- OpenRefine: Lightweight and free; runs locally with a browser-based interface. Excellent for SMB data cleaning; can handle deduplication, standardisation, and basic validation. Limited scalability (not ideal for >10m records).
- Python libraries (Pandas, Dedupe, RecordLinkage): Code-first approach. Flexible but requires engineering resources.
- dbt (data build tool): Modern data transformation framework; popular in analytics engineering teams.
Enrichment Services
- Clearbit: Company and person data enrichment; integrates with Salesforce and Marketo.
- Hunter.io: Email finding and verification.
- RocketReach: B2B contact database; good for finding decision-makers.
ROI Calculation Framework
To determine whether outsourcing data cleansing makes financial sense, model the costs and benefits:
Cost of Dirty Data (Annual)
- Lost revenue from duplicate marketing spend: if 10% of your database is duplicates and you spend £500k/year on marketing, £50k is wasted on duplicate contacts.
- Failed analytics: time spent investigating data quality issues instead of acting on insights. Estimate at £30k–£50k/year for a data-driven organisation.
- Failed ML models: time retraining models that were poisoned by bad data. £20k–£50k/year.
- Operational errors: decisions based on incorrect data (wrong customer count, wrong revenue forecast). £50k–£200k/year.
- Total annual cost of dirty data: £150k–£350k for a mid-sized company.
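The total above is the sum of the low and high ends of each line item. A short sketch of that model, using the figures from this list as inputs (substitute your own estimates):

```python
def duplicate_spend(annual_marketing_spend, duplicate_rate):
    """Marketing spend attributable to duplicate contacts."""
    return annual_marketing_spend * duplicate_rate

dup = duplicate_spend(500_000, 0.10)

# (low, high) estimates in GBP per year, taken from the list above
cost_items = {
    "duplicate marketing spend": (dup, dup),
    "failed analytics": (30_000, 50_000),
    "failed ML models": (20_000, 50_000),
    "operational errors": (50_000, 200_000),
}

total_low = sum(low for low, _ in cost_items.values())
total_high = sum(high for _, high in cost_items.values())
print(f"Annual cost of dirty data: £{total_low:,.0f} to £{total_high:,.0f}")
```

Keeping each item as a range makes it clear that the total is an estimate with wide error bars, dominated here by the operational-errors line.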
Cost of Outsourcing Data Cleansing
One-time initial cleansing of your entire database: £10k–£30k depending on size and complexity.
Ongoing maintenance (monthly or quarterly): £2k–£5k/month to keep data clean as new records enter.
ROI Calculation
If outsourcing costs £15k upfront plus £3k/month (£36k/year), the year 1 total is £51k. If dirty data costs you £200k/year and cleansing recovers all of it, the net benefit is £149k (£200k benefit − £51k cost) and year 1 ROI is about 292% (£149k ÷ £51k).
Break-even is typically 2–3 months. Most organisations see positive ROI within 6 months.
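The calculation can be written as two small functions. This is a simplified model: it assumes the full benefit is recovered and accrues evenly from the first month, which is optimistic. With a ramp-up period, break-even moves later than the function suggests.

```python
def first_year_roi(annual_benefit, upfront_cost, monthly_cost):
    """Return (total_cost, net_benefit, roi_percent) for the first 12 months."""
    total_cost = upfront_cost + 12 * monthly_cost
    net_benefit = annual_benefit - total_cost
    return total_cost, net_benefit, 100 * net_benefit / total_cost

def break_even_months(annual_benefit, upfront_cost, monthly_cost):
    """Months until cumulative benefit covers cumulative cost (benefit assumed to accrue evenly)."""
    monthly_benefit = annual_benefit / 12
    if monthly_benefit <= monthly_cost:
        return None  # never breaks even on these inputs
    return upfront_cost / (monthly_benefit - monthly_cost)

# Inputs from the example above: £200k annual benefit, £15k upfront, £3k/month
total_cost, net_benefit, roi = first_year_roi(200_000, 15_000, 3_000)
months = break_even_months(200_000, 15_000, 3_000)
print(f"Year 1 cost £{total_cost:,}, net benefit £{net_benefit:,}, ROI {roi:.0f}%")
print(f"Break-even after {months:.1f} months")
```

Running the functions with your own benefit estimate, including a pessimistic one, is a quick way to see how sensitive the business case is to that single input.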
Cost Comparison: Outsourcing vs In-House vs Automation
| Line Item | UK (London) | Treba (Nairobi) | Saving |
|---|---|---|---|
| Data Analyst (1 FTE) | £35,000–£45,000 | £9,000–£11,000 | 75% saving |
| Data Quality Engineer (1 FTE) | £50,000–£65,000 | £12,000–£15,000 | 76% saving |
| Data Validation Specialist (2 FTE) | £50,000–£60,000 | £12,000–£16,000 | 75% saving |
| Annual tool cost (Talend or similar) | £50,000–£200,000 | Not required (included in service) | Saving varies |
| Total annual cost (4 FTE + tools) | £185,000–£370,000 | £33,000–£42,000 + tooling | 78% saving |
Outsourcing vs Automation
Automation (e.g., OpenRefine scripts or Talend workflows) works well for recurring, predictable data quality issues. Setup time is 2–4 weeks; ongoing maintenance is minimal. Cost: £5k–£15k setup, £0–£2k/month maintenance.
Outsourcing works best for initial bulk cleansing, complex deduplication with human review, and one-off projects. The cost per hour is higher, but the upfront investment is lower and no engineering resource is required.
Optimal approach: Hybrid. Use outsourcing for initial bulk cleansing (one-time), then implement automated rules for ongoing maintenance.
When to Outsource Data Cleansing
- You have >100k records and no in-house data engineering team.
- Dirty data is directly costing you money (duplicate marketing spend, failed analytics).
- Data quality is slowing down a critical project (e.g., CRM migration, analytics implementation).
- You lack the engineering resources to build and maintain automated cleansing pipelines.
- Your data complexity is high: multiple sources, inconsistent formats, complex deduplication rules.
Team Structure and Workflow
| Role | Responsibility | Typical Cost (Kenya) |
|---|---|---|
| Data Quality Manager (In-house) | Owns strategy, reviews cleansing rules, approves output, drives data governance. | £12,000–£15,000 |
| Senior Data Analyst | Designs cleansing logic, supervises complex deduplication, handles exceptions. | £10,000–£13,000 |
| Data Cleansing Specialist (2–3) | Execute deduplication, standardisation, validation, enrichment. | £7,000–£9,000 each |
| QA Validator | Sample-checks cleaned data for accuracy, flags issues, documents errors. | £5,000–£7,000 |
Workflow: An internal stakeholder (product, finance, sales) submits a data cleansing request via the ticket system with specifications (dataset size, data quality issues, timeline). The offshore team scopes the work, provides an estimate, and begins cleansing. Progress is tracked weekly. The final output is sample-checked by the internal data quality manager. Approved data is delivered; rejected data is reworked.
Real Cost Example
UK SaaS company, £5m ARR, 500k customer records in Salesforce.
Current state: 15% of records are duplicates (wasting £75k/year in marketing spend). Missing email addresses in 20% of records (hindering outreach). Inconsistent company names (harming analytics).
Option 1: Hire in-house data engineer. Cost: £50k/year. Time to hire: 3 months. Time to build cleansing pipeline: 2 months. Total time to first clean data: 5 months. Risk: engineer leaves after 1 year.
Option 2: Outsource to Kenya. Cost: £12k upfront cleansing + £2k/month maintenance (£36k/year total). Time to first clean data: 4 weeks. Ongoing quality: guaranteed SLA.
Result: Outsourcing saves £14k in year 1 (£50k in-house salary vs £12k + £24k = £36k outsourcing) and delivers clean data about 4 months sooner (4 weeks vs 5 months).
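"Current state" figures like the duplicate and missing-email percentages in this example would typically come from profiling the dataset before scoping the work. A minimal sketch, assuming records are dictionaries and using exact matches on a single hypothetical key (`email`); it will undercount duplicates that differ by typos or formatting:

```python
from collections import Counter

def profile(records, key="email"):
    """Return the share of records missing `key` and the share that repeat an earlier value."""
    total = len(records)
    if total == 0:
        return {"missing_rate": 0.0, "duplicate_rate": 0.0}
    values = [str(r.get(key) or "").strip().lower() for r in records]
    missing = sum(1 for v in values if not v)
    counts = Counter(v for v in values if v)
    duplicates = sum(n - 1 for n in counts.values())  # extra copies beyond the first
    return {"missing_rate": missing / total, "duplicate_rate": duplicates / total}

# Illustrative sample data only
sample = [
    {"email": "a@acme.co.uk"},
    {"email": "A@acme.co.uk "},
    {"email": "b@acme.co.uk"},
    {"email": None},
    {"email": ""},
]
stats = profile(sample)
print(f"Missing: {stats['missing_rate']:.0%}, duplicates: {stats['duplicate_rate']:.0%}")
```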
Key takeaways
- Dirty data costs organisations 20% of revenue (Gartner 2023).
- Data cleansing covers deduplication, standardisation, validation, enrichment, and invalid record removal.
- Tools range from enterprise (Talend: £50k–£200k/year) to open-source (OpenRefine: free).
- Outsourcing costs 75–78% less than in-house teams; ROI is typically positive in 2–3 months.
- Hybrid approach works best: outsource initial bulk cleansing, then automate ongoing maintenance.
Written by
Treba Research
Treba editorial team — expert analysis on outsourcing, compliance, and building distributed UK–Kenya teams.

