Data Observability
Data Observability is the practice of monitoring data pipelines and datasets the same way SRE teams monitor production software, across five pillars: freshness (when did this dataset last update?), volume (how many rows landed?), schema (did the columns change?), distribution (do the values look normal?), and lineage (what depends on this?). The goal is to detect data incidents (a pipeline silently breaking, a schema change upstream, a distribution shift) BEFORE the CFO emails asking why the dashboard shows $0 revenue. Mature data orgs treat broken pipelines as production incidents, with on-call rotations, runbooks, MTTR targets, and post-mortems. Without observability, the average data team learns about a broken pipeline from a furious stakeholder; with it, they learn from an automated alert hours or days earlier.
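The five pillars reduce to simple per-dataset checks. A minimal sketch of the first two, freshness and volume, assuming illustrative thresholds and in-memory inputs rather than a real warehouse connection:

```python
from datetime import datetime, timedelta
from statistics import mean, stdev

def freshness_ok(last_loaded_at: datetime, max_age: timedelta, now: datetime) -> bool:
    """Freshness pillar: did the dataset update recently enough?"""
    return now - last_loaded_at <= max_age

def volume_ok(todays_rows: int, history: list[int], z_threshold: float = 3.0) -> bool:
    """Volume pillar: is today's row count within z_threshold std devs of history?"""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return todays_rows == mu
    return abs(todays_rows - mu) / sigma <= z_threshold

now = datetime(2024, 5, 1, 7, 0)
print(freshness_ok(datetime(2024, 5, 1, 5, 30), timedelta(hours=6), now))  # True: loaded 1.5h ago
print(volume_ok(40, [1000, 1020, 980, 1010, 995]))  # False: suspicious volume drop
```

Schema and distribution checks follow the same pattern: compare today's observation against an expectation or a rolling history, and alert on deviation.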
The Trap
The trap is treating data observability as a tooling purchase ('we bought Monte Carlo, we're observable now') without the operational discipline behind it. The tool generates 200 alerts a day, the team mutes them, and within 8 weeks observability is dead. The other trap is observing everything equally: a startup applying enterprise-grade row-count anomaly detection to each of its 4,000 tables, including the 3,200 that nobody queries. Observability has cost (tool fees, engineer time tuning thresholds, alert fatigue), and that cost must be matched to the business criticality of each dataset. The most expensive failure is observability theater: dashboards showing 'data health: 98%' while the three datasets the CEO actually depends on are silently broken.
What to Do
Build a tiered observability program. Tier 1 (CEO/board reports, financial close, billing): full observability with freshness SLAs, row-count anomaly detection, schema change alerts, distribution drift, and an on-call rotation. Tier 2 (operational dashboards used daily): freshness + schema monitoring. Tier 3 (long-tail datasets): basic freshness check only or none. Define data SLAs for Tier 1 datasets (e.g., 'orders.fact must land by 6 AM with >99% completeness'). Establish a data on-call rotation with documented runbooks. Track MTTR for data incidents the way engineering tracks software MTTR. Most importantly: publish a public 'data status page' so stakeholders learn about incidents from the data team, not the other way around.
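The tiering above can be expressed as a small policy table. A sketch, assuming hypothetical monitor names and dataset identifiers:

```python
# Hypothetical tier policy, mirroring the three tiers described above.
TIER_POLICY = {
    1: {"freshness_sla", "row_count_anomaly", "schema_alerts", "distribution_drift", "on_call"},
    2: {"freshness", "schema_alerts"},
    3: {"freshness"},  # or nothing at all for never-queried tables
}

def monitors_for(dataset: str, tier_map: dict[str, int]) -> set[str]:
    """Look up which monitors a dataset gets; untiered datasets get none."""
    tier = tier_map.get(dataset)
    return TIER_POLICY.get(tier, set())

tiers = {"orders.fact": 1, "ops.daily_active_users": 2}
print(monitors_for("orders.fact", tiers))            # full Tier 1 monitor set
print(monitors_for("staging.tmp_export_2019", tiers))  # set(): long tail, unmonitored
```

The point of encoding the policy as data is that adding a dataset forces an explicit tiering decision, instead of defaulting everything to maximum (and maximally noisy) coverage.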
Formula
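The numbers used throughout this page reduce to two working definitions: MTTD, the average gap between an incident starting and an alert (or a stakeholder) detecting it, and alert precision, the share of alerts that map to a real incident requiring action (target >70%, per the benchmark given under Myth vs Reality). A runnable sketch, using a hypothetical incident log:

```python
from datetime import datetime

def mttd_hours(incidents: list[tuple[datetime, datetime]]) -> float:
    """MTTD = mean of (detected_at - started_at) across incidents, in hours."""
    deltas = [(detected - started).total_seconds() / 3600
              for started, detected in incidents]
    return sum(deltas) / len(deltas)

def alert_precision(actionable_alerts: int, total_alerts: int) -> float:
    """Share of alerts that corresponded to a real incident; target > 0.70."""
    return actionable_alerts / total_alerts

log = [(datetime(2024, 5, 1, 6), datetime(2024, 5, 1, 6, 30)),
       (datetime(2024, 5, 2, 6), datetime(2024, 5, 2, 7, 30))]
print(mttd_hours(log))          # 1.0 hour on average
print(alert_precision(18, 24))  # 0.75: healthy
```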
In Practice
Monte Carlo Data, the data observability category leader, publishes case studies showing what mature observability looks like. Vimeo, after deploying observability, reduced data incident detection time from days (when stakeholders complained) to <1 hour (when alerts fired), and reduced MTTR from ~2 weeks to ~2 days. They credit the program with an ~80% reduction in 'time to data trust' with stakeholders. The decisive move was not the tool; it was establishing a data on-call rotation, runbooks, and public SLAs for the top 50 'tier 1' datasets out of thousands. Untiered observability would have generated alert fatigue; tiered observability focused engineering attention where it mattered.
Pro Tips
- 01
The first metric to track is MTTD (Mean Time To Detection), not MTTR. If your team learns about incidents from stakeholders, your MTTD is days. The goal is to drive MTTD below 1 hour through automated alerts before pursuing any other improvement.
- 02
Schema change is the #1 source of silent data incidents. An upstream engineer renames a column, the pipeline silently writes nulls, and no row-count or freshness check fires. Schema enforcement at the contract layer (data contracts), combined with schema change alerts, catches what the other pillars miss.
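A renamed column is exactly the case a column-level diff catches. A minimal sketch, assuming hypothetical column-to-type maps pulled from a data contract and from the live warehouse:

```python
def schema_diff(expected: dict[str, str], observed: dict[str, str]) -> dict[str, list]:
    """Compare expected vs observed column->type maps; any difference should alert."""
    return {
        "missing": sorted(set(expected) - set(observed)),
        "added": sorted(set(observed) - set(expected)),
        "retyped": sorted(col for col in set(expected) & set(observed)
                          if expected[col] != observed[col]),
    }

# The classic silent incident: an upstream rename of amount_usd -> amount.
expected = {"order_id": "bigint", "amount_usd": "numeric", "created_at": "timestamp"}
observed = {"order_id": "bigint", "amount": "numeric", "created_at": "timestamp"}
print(schema_diff(expected, observed))
# {'missing': ['amount_usd'], 'added': ['amount'], 'retyped': []}
```

Row counts and freshness are unchanged in this scenario, which is why only a schema-level check fires.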
- 03
Publish a data status page (internal-only, modeled on a status.io-style SaaS status page) that shows real-time freshness, incidents, and post-mortems for tier-1 datasets. This single move transforms data trust because stakeholders feel informed instead of surprised, and it forces the data team to be honest about incidents.
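The status page can start as something this small. A sketch, assuming hypothetical dataset names and a plain-text rendering of per-dataset freshness results:

```python
def render_status(checks: dict[str, bool]) -> str:
    """Turn per-dataset freshness results into a simple internal status page."""
    lines = ["DATA STATUS (tier-1 datasets)"]
    for dataset, fresh in sorted(checks.items()):
        state = "OK" if fresh else "INCIDENT - investigating"
        lines.append(f"  {dataset:<20} {state}")
    return "\n".join(lines)

print(render_status({"orders.fact": True, "billing.invoices": False}))
```

Even a cron job posting this into a Slack channel delivers the core benefit: stakeholders hear about incidents from the data team first.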
Myth vs Reality
Myth
“Data observability tools detect all data quality issues”
Reality
Observability detects pipeline-level issues (freshness, volume, schema, basic distribution). It does not detect business-logic issues like 'we're double-counting refunds' or 'this metric definition disagrees with the CFO's spreadsheet'. Those require data testing (dbt tests, expectations) and definition governance. Observability is necessary but not sufficient; companies that buy a tool and skip the testing/contracts work still ship bad numbers.
Myth
“More alerts = better observability”
Reality
More alerts = guaranteed alert fatigue and a desensitized on-call rotation. Mature observability has fewer, higher-signal alerts on tier-1 datasets only. The benchmark for healthy observability is the alert-to-incident ratio: >70% of alerts should correspond to a real incident requiring action. Below 30%, the team will start ignoring everything, including the real ones.
Try it
Run the numbers.
Pressure-test the concept against your own knowledge: answer the challenge or try the live scenario.
Knowledge Check
Your data team learned at 4 PM that the morning 'orders.fact' table never updated โ the CFO discovered it while preparing a board slide. You currently have no observability tooling. What is the highest-ROI first move?
Industry benchmarks
Is your number good?
Calibrate against real-world tiers. Use these ranges as targets, not absolutes.
Data Incident Mean Time to Detection (MTTD)
Industry benchmarks across data observability platform customers (2023-2024):
Mature data ops: <30 minutes
Good: 30 min - 4 hours
Average: 4 - 24 hours
Poor (stakeholders find it first): >1 day
Source: https://www.montecarlodata.com/blog-data-downtime/
Real-world cases
Companies that lived this.
Verified narratives with the numbers that prove (or break) the concept.
Vimeo (Monte Carlo customer)
2021-2023
Vimeo's data team was experiencing the classic pattern: stakeholders learning about broken dashboards before the data team did. After deploying Monte Carlo for data observability, they tiered their thousands of datasets, set freshness SLAs on tier-1, established a data on-call rotation with runbooks, and integrated alerts into PagerDuty. Within 6 months, MTTD dropped from days to under an hour for tier-1 incidents. MTTR dropped from ~2 weeks to ~2 days. Stakeholder trust (measured via internal NPS) climbed sharply. The decisive move was operational discipline (tiering, on-call, runbooks), with the tool as enabler, not solution.
MTTD Improvement
Days → <1 hour
MTTR Improvement
~2 weeks → ~2 days
Tier-1 Datasets Under SLA
Top ~50
Stakeholder Trust
Significant lift
The tool generates alerts; the on-call rotation and runbooks generate trust. Buying observability without operational discipline is observability theater.
Hypothetical: 400-person FinTech
2022
A FinTech bought a leading observability platform for $180K/year and enabled monitoring across all ~1,200 tables in their warehouse without tiering. Within 5 weeks the data on-call was receiving 100+ alerts/day, most on long-tail datasets nobody queried. The team built rules to suppress 80% of alerts. Two months later, a tier-1 revenue pipeline broke and the alert was suppressed by an over-broad rule. Stakeholders found the issue. Trust in the observability program collapsed. The platform was downgraded to a 'nice to have'. They rebuilt the program 14 months later with proper tiering.
Annual Tool Cost
$180K
Tables Monitored Initially
~1,200 (untiered)
Alerts/Day at Peak
100+
Alert Fatigue Outcome
Real incident missed
Untiered observability is alert-fatigue manufacturing. Without tiering and on-call discipline, the tool actively damages data trust by training the team to ignore alerts.
Airbnb
2018-present
Airbnb's internal Data Quality (DQ) tooling and Wall (data discovery + observability) platform applies tiered observability across thousands of datasets. Tier-1 datasets used for financial reporting and ML model training have strict SLAs, on-call rotations, and automated lineage-aware blast-radius alerts. When a tier-1 table breaks, downstream consumers are automatically notified. Airbnb published the architecture and operational model openly, treating data quality as a first-class engineering discipline with the same rigor as production software.
Datasets Tiered
Thousands across Tier 1-3
On-Call Rotation
Yes, with runbooks
Lineage-Aware Alerts
Auto-notify downstream consumers
Treated As
Production engineering discipline
Mature observability operates exactly like SRE for software: tiered, alert-tuned, on-call rotation, post-mortems, runbooks. The tool is incidental; the operating model is the moat.
Decision scenario
The Post-Incident Observability Pitch
You're the new Head of Data at a 600-person SaaS. Three months in, the orders pipeline silently broke for 36 hours, leading the CEO to present wrong numbers at a board meeting. The CEO has given you a $400K observability budget and 60 days to propose a plan. You manage 1,500 tables with a team of 6 data engineers.
Data Engineers
6
Total Tables
1,500
Budget
$400K
Recent Incident MTTD
36 hours
Tier Definitions Today
None
Decision 1
Two vendors are pitching: an enterprise observability platform ($300K/year) that promises 'monitoring across every table' and a lighter platform ($120K/year) focused on configurable tier-1 monitoring. Your CTO favors the enterprise option because 'we should monitor everything'. Your senior engineer favors tiering and the lighter platform.
Buy the enterprise platform and enable monitoring across all 1,500 tables: broad coverage demonstrates seriousness
Spend weeks 1-4 with the business defining 30 tier-1 datasets and SLAs. Buy the lighter platform ($120K). Configure full monitoring on tier-1, freshness-only on tier-2 (~150 datasets), nothing on the rest. Set up on-call rotation, runbooks, and a public status page in weeks 5-8. (Optimal)
Related concepts
Keep connecting.
The concepts that orbit this one, each sharpening the others.
Beyond the concept
Turn Data Observability into a live operating decision.
Use this concept as the framing layer, then move into a diagnostic if it maps directly to a current bottleneck.
Typical response time: 24h · No retainer required