Data Quality Monitoring
Data Quality Monitoring is the continuous, automated detection of anomalies in data: freshness lapses, volume spikes or drops, schema changes, distribution shifts, and broken referential integrity. Tools like Monte Carlo, Anomalo, Bigeye, and Soda apply ML to baseline 'normal' for each dataset and alert when something deviates. The discipline differs from manual data quality testing in two ways: (1) it covers data you didn't think to test (anomaly detection finds the unknown unknowns), and (2) it runs continuously, not just in CI/CD.
KnowMBA POV: most companies invest heavily in pipeline reliability monitoring (did the job run?) and almost nothing in DATA reliability monitoring (was the data the job produced correct?). The latter causes far more silent business damage.
The Trap
The trap is treating data quality monitoring as a tooling purchase. Buy Monte Carlo, deploy on 800 tables, get 200 alerts/day, mute everything within 3 weeks. The hard work is curating: which 50 tables actually matter, what defines 'normal' for each, who gets paged, and what the runbook is. Without that discipline, the platform creates alert fatigue and the underlying problem (silent data corruption) persists.
What to Do
Roll out monitoring in three waves: (1) Tier-1 only: pick the 20-50 tables that feed executive dashboards, billing, ML models, or public-facing products. Define explicit checks (uniqueness, not-null, freshness, referential integrity) for each. (2) Anomaly detection on tier-1: turn on ML-based volume/distribution alerts only on the curated set. (3) Tier-2 broader rollout: only after tier-1 alerts are tuned and acted on consistently. Skipping straight to 'monitor everything' is how you get alert fatigue.
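A minimal sketch of what 'explicit checks' can look like: plain SQL assertions run from Python, where each query returns a violation count and zero means the check passes. The table and column names (orders, customers, order_id, customer_id) are hypothetical, and real deployments would express the same checks in dbt, Soda, or the monitoring tool's own config. (Freshness is covered separately under Pro Tips.)

```python
# Explicit tier-1 checks as SQL assertions. Each query counts
# violations; a result of 0 means the check passes.
# Table/column names are hypothetical placeholders.
import sqlite3

CHECKS = {
    "orders_pk_unique":
        "SELECT COUNT(*) - COUNT(DISTINCT order_id) FROM orders",
    "orders_customer_not_null":
        "SELECT COUNT(*) FROM orders WHERE customer_id IS NULL",
    "orders_fk_integrity":
        "SELECT COUNT(*) FROM orders o "
        "LEFT JOIN customers c ON o.customer_id = c.customer_id "
        "WHERE c.customer_id IS NULL",
}

def run_checks(conn: sqlite3.Connection) -> dict[str, bool]:
    """Run every check; True means the assertion held (0 violations)."""
    return {name: conn.execute(sql).fetchone()[0] == 0
            for name, sql in CHECKS.items()}
```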
In Practice
Monte Carlo Data, founded in 2019, pioneered the 'data observability' category by applying SRE-style monitoring concepts to data: freshness, volume, schema, distribution, and lineage as the 'five pillars.' Customers including Fox, Vimeo, and Credit Karma deployed Monte Carlo to detect data incidents before downstream consumers noticed. By 2024, Monte Carlo was joined by Anomalo (ML-first), Bigeye (open standards), and Soda (developer-friendly OSS); the category became standard for data-mature enterprises, but adoption maturity varies wildly even among companies that bought a tool.
Pro Tips
- 01
The single highest-impact monitor is FRESHNESS on tier-1 tables. 'The dashboard is showing yesterday's number as today's' is by far the most common silent failure, and it's also the easiest to detect.
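A minimal freshness check is little more than comparing a table's last load time against its expected cadence. This sketch assumes each tier-1 table exposes a max(updated_at)-style timestamp and an agreed SLA; both names are illustrative, not a specific tool's API.

```python
# Freshness check: stale when no new data has landed within the SLA.
from datetime import datetime, timedelta, timezone

def is_stale(last_loaded_at: datetime, sla: timedelta,
             now: datetime | None = None) -> bool:
    """True if the table has not received new data within its SLA."""
    now = now or datetime.now(timezone.utc)
    return now - last_loaded_at > sla

# Example: a daily dashboard feed expected to land within 26 hours.
stale = is_stale(datetime(2024, 5, 1, 3, 0, tzinfo=timezone.utc),
                 sla=timedelta(hours=26),
                 now=datetime(2024, 5, 2, 9, 0, tzinfo=timezone.utc))
print(stale)  # True: 30 hours since the last load exceeds the 26-hour SLA
```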
- 02
Severity-tier your alerts. P0 = customer-facing data wrong (page on-call). P1 = exec dashboard wrong (ticket within 1 business hour). P2 = analyst-impacting (ticket within 1 business day). Without severity tiers, every alert is treated the same and the team burns out.
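One lightweight way to encode the tiers is a routing table mapping each severity to a destination and a response SLA, mirroring the P0/P1/P2 scheme above. Channel names here are hypothetical placeholders.

```python
# Hypothetical severity-tier routing: each tier gets a channel and
# a response SLA so no alert is ambiguous about who acts, and when.
from dataclasses import dataclass

@dataclass(frozen=True)
class Tier:
    channel: str       # where the alert is delivered
    response_sla: str  # how fast someone must act

ROUTING = {
    "P0": Tier("pagerduty-oncall", "page immediately"),
    "P1": Tier("data-incidents ticket queue", "1 business hour"),
    "P2": Tier("data-incidents ticket queue", "1 business day"),
}

def route(table: str, tier: str) -> str:
    t = ROUTING[tier]
    return f"{table}: notify {t.channel} ({t.response_sla})"

print(route("billing.invoices", "P0"))
```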
- 03
Track three metrics monthly: MTTD (Mean Time to Detect), MTTR (Mean Time to Resolve), and 'Incidents Caught Before Stakeholder Reported.' The third is the leading indicator that monitoring is actually working.
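A sketch of computing all three metrics from a simple incident log. The record fields (occurred_at, detected_at, resolved_at, detected_by) are assumptions for illustration, not any particular tool's schema.

```python
# Monthly reliability metrics from an incident log (fields assumed).
from datetime import datetime
from statistics import mean

incidents = [
    {"occurred_at": datetime(2024, 4, 1, 2), "detected_at": datetime(2024, 4, 1, 4),
     "resolved_at": datetime(2024, 4, 1, 9), "detected_by": "monitor"},
    {"occurred_at": datetime(2024, 4, 9, 0), "detected_at": datetime(2024, 4, 11, 0),
     "resolved_at": datetime(2024, 4, 12, 0), "detected_by": "stakeholder"},
]

# MTTD: occurrence -> detection. MTTR: detection -> resolution. Hours.
mttd = mean((i["detected_at"] - i["occurred_at"]).total_seconds() for i in incidents) / 3600
mttr = mean((i["resolved_at"] - i["detected_at"]).total_seconds() for i in incidents) / 3600
# The leading indicator: share of incidents the monitoring caught first.
caught_first = sum(i["detected_by"] == "monitor" for i in incidents) / len(incidents)

print(f"MTTD: {mttd:.1f}h, MTTR: {mttr:.1f}h, caught before stakeholder: {caught_first:.0%}")
```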
Myth vs Reality
Myth
"If our pipelines are reliable, our data is reliable"
Reality
Pipeline reliability (did the job complete?) and data reliability (was the output correct?) are different problems. A pipeline can run successfully and produce silently corrupt data: wrong joins, a stale upstream source, or schema drift the pipeline silently coerced through. You need separate monitoring for each.
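To make the distinction concrete, here is a minimal sketch of a schema-drift check, one of the failure modes a 'green' pipeline can smuggle through. The column-metadata shape (column name mapped to a type string) is an assumption for illustration.

```python
# Schema-drift check: diff a table's current columns against a
# stored baseline and report additions, removals, and type changes.
def schema_drift(baseline: dict[str, str], current: dict[str, str]) -> dict[str, list[str]]:
    return {
        "added":   sorted(set(current) - set(baseline)),
        "removed": sorted(set(baseline) - set(current)),
        "retyped": sorted(c for c in baseline.keys() & current.keys()
                          if baseline[c] != current[c]),
    }

print(schema_drift({"id": "int", "amount": "decimal"},
                   {"id": "int", "amount": "varchar", "currency": "varchar"}))
# {'added': ['currency'], 'removed': [], 'retyped': ['amount']}
```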
Myth
"Data tests in dbt are sufficient"
Reality
dbt tests catch known failure modes you thought to write. They don't catch schema drift in upstream sources, distribution shifts in input data, or volume anomalies. Tests plus observability is the right combo; either alone leaves blind spots.
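As an illustration of what observability adds on top of tests, here is a toy volume-anomaly check: flag today's row count when it falls far outside the recent baseline. The 3-sigma threshold and synthetic history are illustrative defaults, not any vendor's actual method.

```python
# Volume-anomaly check a rule-based test would miss: the pipeline
# "succeeds" but today's row count collapses versus the baseline.
from statistics import mean, stdev

def volume_anomaly(history: list[int], today: int, z: float = 3.0) -> bool:
    """True if today's row count is more than z std devs from baseline."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return today != mu
    return abs(today - mu) > z * sigma

# 30 days of ~100k rows/day, then a silent 60% drop.
baseline = [100_000 + d * 37 for d in range(30)]
print(volume_anomaly(baseline, 40_000))  # True: flagged despite a green run
```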
Knowledge Check
A finance dashboard has been showing the wrong revenue number for 11 days. The pipeline ran successfully every day. The error was caught when the CFO noticed in board prep. What is the highest-leverage fix to prevent this class of incident?
Industry benchmarks
Is your number good?
Calibrate against real-world tiers. Use these ranges as targets, not absolutes.
Data Incident MTTD (Tier-1 Datasets)
Business-critical data assets with downstream consumers (dashboards, ML, operational systems)
Elite: < 1 hour
Strong: 1-8 hours
Acceptable: 8 hours - 2 days
Poor: > 2 days (often stakeholder-reported)
Source (hypothetical): Monte Carlo State of Data Quality 2024, plus KnowMBA practitioner interviews
Real-world cases
Companies that lived this.
Verified narratives with the numbers that prove (or break) the concept.
Monte Carlo Data
2019-Present
Monte Carlo founded the 'data observability' category in 2019 with a platform built around the five pillars: freshness, volume, schema, distribution, and lineage. By 2024, Monte Carlo had hundreds of enterprise customers including Fox, Vimeo, Credit Karma, and PagerDuty, and the data observability category had grown to ~$500M annually with multiple competitors. Monte Carlo's customer reports consistently show 80%+ reduction in data incident MTTD after deployment.
Founded: 2019
Marquee Customers: Fox, Vimeo, Credit Karma, PagerDuty
Typical MTTD Reduction: 80%+
Category Size (2024): ~$500M annual
Treating data like a production system, with monitoring, on-call, and incident response, produces measurable reliability gains. The discipline is a force multiplier on every other data investment.
Anomalo
2018-Present
Anomalo took a different approach to data quality: ML-first anomaly detection on every column of every important table, with no manual rule writing. Customers including Block (Square), Discover, and BuzzFeed deployed Anomalo to catch quality issues that rule-based testing missed. The product validated that the 'unknown unknowns' problem in data quality is real and ML-tractable, and that tooling choice should match the team's preferred operating model (rule-heavy vs. ML-driven).
Founded: 2018
Approach: ML-first, no manual rules
Marquee Customers: Block, Discover, BuzzFeed
Different DQ tools optimize for different operating models. ML-first (Anomalo) is best when you have massive table coverage and no time to write rules; rule-first (Soda, Bigeye) is best when you want explicit, auditable controls. Choose based on team preference, not vendor claims.
Decision scenario
Data Quality Monitoring Rollout
You're head of data at a 600-person company. The CFO just discovered a 9-day-old reporting error that misled board prep. CEO wants 'this to never happen again.' You're evaluating Monte Carlo, Anomalo, and Soda. Annual budget approved: $200K. Your team has 4 data engineers + 6 analytics engineers. You have ~1,200 tables in your warehouse, of which ~80 are 'tier-1.'
Total Tables: 1,200
Tier-1 Tables: 80
Current MTTD: ~7 days (stakeholder-reported)
Approved Budget: $200K/year
DE + AE Team Size: 10
Decision 1
Your VP of Engineering wants you to deploy on all 1,200 tables to maximize coverage. Your most senior data engineer says start with just the 80 tier-1 tables and tune carefully. The vendor sales engineer says they can deploy on all 1,200 in week one.
- Maximum coverage: deploy on all 1,200 tables in the first month for breadth.
- Tier-1 first (Optimal): deploy on 80 tables with explicit checks plus tuned anomaly detection; expand only after 90 days of clean signal.
Beyond the concept
Turn Data Quality Monitoring into a live operating decision.
Use this concept as the framing layer, then move into a diagnostic if it maps directly to a current bottleneck.