
Data Observability

Data Observability is the practice of monitoring data pipelines and datasets the same way SRE teams monitor production software, across five pillars: freshness (when did this dataset last update?), volume (how many rows landed?), schema (did the columns change?), distribution (do the values look normal?), and lineage (what depends on this?). The goal is to detect data incidents (a pipeline silently breaking, a schema change upstream, a distribution shift) before the CFO emails asking why the dashboard shows $0 revenue. Mature data orgs treat broken pipelines as production incidents, with on-call rotations, runbooks, MTTR targets, and post-mortems. Without observability, the average data team learns about a broken pipeline from a furious stakeholder; with it, they learn from an automated alert hours or days earlier.

Also known as: Data Reliability Engineering, Pipeline Monitoring, Data Quality Monitoring, Data Incident Management, Data SRE
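
To make the first two pillars concrete, here is a minimal sketch of freshness and volume checks written as plain SQL probes. It assumes a DB-API compatible warehouse connection and a timezone-aware load timestamp column; the table name and thresholds are illustrative, not prescriptive.

from datetime import datetime, timedelta, timezone

FRESHNESS_SLA = timedelta(hours=6)      # dataset must have landed within the last 6 hours
MIN_EXPECTED_ROWS = 10_000              # rough lower bound for a normal daily load

def check_orders_fact(conn):
    """Run freshness and volume checks on a hypothetical orders_fact table."""
    alerts = []
    with conn.cursor() as cur:
        # Freshness: when did this dataset last update?
        cur.execute("SELECT MAX(loaded_at) FROM analytics.orders_fact")
        last_loaded = cur.fetchone()[0]
        if last_loaded is None or datetime.now(timezone.utc) - last_loaded > FRESHNESS_SLA:
            alerts.append(f"freshness: orders_fact last loaded at {last_loaded}")

        # Volume: how many rows landed today?
        cur.execute(
            "SELECT COUNT(*) FROM analytics.orders_fact WHERE loaded_at >= CURRENT_DATE"
        )
        row_count = cur.fetchone()[0]
        if row_count < MIN_EXPECTED_ROWS:
            alerts.append(f"volume: only {row_count} rows landed today")
    return alerts  # route non-empty results to the alerting channel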

The Trap

The trap is treating data observability as a tooling purchase ('we bought Monte Carlo, we're observable now') without the operational discipline behind it. The tool generates 200 alerts a day, the team mutes them, and within 8 weeks observability is dead. The other trap is observing everything equally: a startup applying enterprise-grade row-count anomaly detection to every one of its 4,000 tables, including the 3,200 that nobody queries. Observability has a cost (tool fees, engineer time tuning thresholds, alert fatigue), and that cost must be matched to the business criticality of each dataset. The most expensive failure is observability theater: dashboards showing 'data health: 98%' while the three datasets the CEO actually depends on are silently broken.

What to Do

Build a tiered observability program. Tier 1 (CEO/board reports, financial close, billing): full observability, meaning freshness SLAs, row-count anomaly detection, schema change alerts, distribution drift, and an on-call rotation. Tier 2 (operational dashboards used daily): freshness + schema monitoring. Tier 3 (long-tail datasets): a basic freshness check or none at all. Define data SLAs for Tier 1 datasets (e.g., 'orders.fact must land by 6 AM with >99% completeness'). Establish a data on-call rotation with documented runbooks. Track MTTR for data incidents the way engineering tracks software MTTR. Most importantly: publish a public 'data status page' so stakeholders learn about incidents from the data team, not the other way around.
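
One way to make the tiering explicit is to declare it as configuration. This is an illustrative sketch with hypothetical dataset names, owners, and SLAs; the point is that monitoring depth is declared per tier rather than applied uniformly to every table.

TIER_POLICIES = {
    1: {"checks": ["freshness", "volume", "schema", "distribution"], "on_call": True},
    2: {"checks": ["freshness", "schema"], "on_call": False},
    3: {"checks": ["freshness"], "on_call": False},
}

DATASETS = {
    "analytics.orders_fact": {
        "tier": 1,
        "freshness_deadline": "06:00",   # must land by 6 AM
        "min_completeness": 0.99,        # >99% of expected rows
        "owner": "data-platform-oncall",
    },
    "analytics.marketing_daily": {"tier": 2, "owner": "growth-analytics"},
    "staging.legacy_exports": {"tier": 3, "owner": "data-platform"},
}

def checks_for(dataset: str) -> list[str]:
    """Return which observability checks apply to a dataset, based on its tier."""
    return TIER_POLICIES[DATASETS[dataset]["tier"]]["checks"]

# checks_for("analytics.orders_fact") -> ["freshness", "volume", "schema", "distribution"]
# checks_for("staging.legacy_exports") -> ["freshness"]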

Formula

Data Trust = (Tier-1 Datasets Meeting SLA × % of Stakeholders Aware of Status) ÷ (1 + Active Unresolved Incidents). Trust collapses when stakeholders learn about issues from their dashboards rather than from the data team.
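
A worked example of the formula, with made-up inputs:

def data_trust(tier1_meeting_sla: int, pct_stakeholders_aware: float,
               active_unresolved_incidents: int) -> float:
    # Direct translation of the formula above; inputs here are illustrative only.
    return (tier1_meeting_sla * pct_stakeholders_aware) / (1 + active_unresolved_incidents)

# 28 of 30 tier-1 datasets meeting SLA, 90% of stakeholders aware of status,
# 1 active unresolved incident:
print(data_trust(28, 0.90, 1))   # 12.6
# Same datasets, but stakeholders find out from their dashboards (awareness drops)
# and two incidents sit unresolved:
print(data_trust(28, 0.40, 2))   # ~3.7, trust collapses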

In Practice

Monte Carlo Data, the data observability category leader, publishes case studies showing what mature observability looks like. Vimeo, after deploying observability, reduced data incident detection time from days (when stakeholders complained) to under an hour (when alerts fired), and reduced MTTR from ~2 weeks to ~2 days. They credit an ~80% reduction in 'time to data trust' with stakeholders. The decisive move was not the tool; it was establishing a data on-call rotation, runbooks, and public SLAs for the top 50 'tier 1' datasets out of thousands. Untiered observability would have generated alert fatigue; tiered observability focused engineering attention where it mattered.

Pro Tips

  • 01

    The first metric to track is MTTD (Mean Time To Detection), not MTTR. If your team learns about incidents from stakeholders, your MTTD is days. The goal is to drive MTTD below 1 hour through automated alerts before any other improvement.

  • 02

    Schema change is the #1 source of silent data incidents. An upstream engineer renames a column, the pipeline silently writes nulls, and no row count or freshness check fires. Schema enforcement at the contract layer (data contracts) plus schema change alerts catches what other observability misses; a minimal detection sketch follows these tips.

  • 03

    Publish a public data status page (internal-only, like a SaaS status.io page) that shows real-time freshness, incidents, and post-mortems for tier-1 datasets. This single move transforms data trust because stakeholders feel informed instead of surprised, and it forces the data team to be honest about incidents.
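
As referenced in tip 02, here is a minimal sketch of schema-change detection: snapshot a table's columns from information_schema and diff against the previous snapshot. The DB-API connection, the %s paramstyle, and the file-based snapshot store are assumptions made for illustration.

import json
from pathlib import Path

SNAPSHOT_DIR = Path("schema_snapshots")

def current_schema(conn, table: str) -> dict[str, str]:
    """Return {column_name: data_type} for a 'schema.table' string."""
    with conn.cursor() as cur:
        cur.execute(
            "SELECT column_name, data_type FROM information_schema.columns "
            "WHERE table_schema || '.' || table_name = %s",
            (table,),
        )
        return dict(cur.fetchall())

def detect_schema_change(conn, table: str) -> list[str]:
    snapshot_file = SNAPSHOT_DIR / f"{table}.json"
    previous = json.loads(snapshot_file.read_text()) if snapshot_file.exists() else {}
    current = current_schema(conn, table)

    changes = []
    for col in previous.keys() - current.keys():
        changes.append(f"column dropped or renamed: {col}")
    for col in current.keys() - previous.keys():
        changes.append(f"column added: {col}")
    for col in previous.keys() & current.keys():
        if previous[col] != current[col]:
            changes.append(f"type changed: {col} {previous[col]} -> {current[col]}")

    # Persist the new snapshot for the next run.
    SNAPSHOT_DIR.mkdir(exist_ok=True)
    snapshot_file.write_text(json.dumps(current))
    return changes  # non-empty means alert before the pipeline silently writes nulls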

Myth vs Reality

Myth

“Data observability tools detect all data quality issues”

Reality

Observability detects pipeline-level issues (freshness, volume, schema, basic distribution). It does not detect business-logic issues like 'we're double-counting refunds' or 'this metric definition disagrees with the CFO's spreadsheet'. Those require data testing (dbt tests, expectations) and definition governance. Observability is necessary but not sufficient: companies that buy a tool and skip the testing/contracts work still ship bad numbers.

Myth

“More alerts = better observability”

Reality

More alerts = guaranteed alert fatigue and a desensitized on-call rotation. Mature observability has fewer, higher-signal alerts on tier-1 datasets only. The benchmark for healthy observability is the alert-to-incident ratio: >70% of alerts should correspond to a real incident requiring action. Below 30%, the team will start ignoring everything, including the real ones.
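
A back-of-envelope check of both numbers (alert-to-incident ratio and MTTD) against a hypothetical month of alert and incident records:

from datetime import datetime, timedelta

alerts_fired = 120
alerts_tied_to_real_incidents = 90

ratio = alerts_tied_to_real_incidents / alerts_fired
print(f"alert-to-incident ratio: {ratio:.0%}")   # 75%, above the healthy >70% bar

# MTTD: time from when the data actually broke to when the team found out.
incidents = [
    (datetime(2024, 3, 4, 2, 0), datetime(2024, 3, 4, 2, 25)),    # caught by an alert
    (datetime(2024, 3, 18, 1, 0), datetime(2024, 3, 19, 9, 30)),  # found by a stakeholder
]
mttd = sum((detected - broke for broke, detected in incidents), timedelta()) / len(incidents)
print(f"MTTD: {mttd}")   # ~16.5 hours, dragged into the 'average' band by the stakeholder find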


Knowledge Check

Your data team learned at 4 PM that the morning 'orders.fact' table never updated; the CFO discovered it while preparing a board slide. You currently have no observability tooling. What is the highest-ROI first move?

Industry benchmarks

Is your number good?

Calibrate against real-world tiers. Use these ranges as targets, not absolutes.

Data Incident Mean Time to Detection (MTTD)

Industry benchmarks across data observability platform customers (2023-2024)

Mature data ops: <30 minutes
Good: 30 minutes - 4 hours
Average: 4 - 24 hours
Poor (stakeholders find it first): >1 day

Source: https://www.montecarlodata.com/blog-data-downtime/

Real-world cases

Companies that lived this.

Verified narratives with the numbers that prove (or break) the concept.


Vimeo (Monte Carlo customer) · 2021-2023 · Outcome: success

Vimeo's data team was experiencing the classic pattern: stakeholders learning about broken dashboards before the data team did. After deploying Monte Carlo for data observability, they tiered their thousands of datasets, set freshness SLAs on tier-1, established a data on-call rotation with runbooks, and integrated alerts into PagerDuty. Within 6 months, MTTD dropped from days to under an hour for tier-1 incidents. MTTR dropped from ~2 weeks to ~2 days. Stakeholder trust (measured via internal NPS) climbed sharply. The decisive move was operational discipline (tiering, on-call, runbooks), with the tool as enabler, not solution.

MTTD Improvement: days → <1 hour
MTTR Improvement: ~2 weeks → ~2 days
Tier-1 Datasets Under SLA: top ~50
Stakeholder Trust: significant lift

The tool generates alerts; the on-call rotation and runbooks generate trust. Buying observability without operational discipline is observability theater.


Hypothetical: 400-person FinTech · 2022 · Outcome: failure

A FinTech bought a leading observability platform for $180K/year and enabled monitoring across all ~1,200 tables in their warehouse without tiering. Within 5 weeks the data on-call was receiving 100+ alerts/day, most on long-tail datasets nobody queried. The team built rules to suppress 80% of alerts. Two months later, a tier-1 revenue pipeline broke and the alert was suppressed by an over-broad rule. Stakeholders found the issue. Trust in the observability program collapsed. The platform was downgraded to a 'nice to have'. They rebuilt the program 14 months later with proper tiering.

Annual Tool Cost: $180K
Tables Monitored Initially: ~1,200 (untiered)
Alerts/Day at Peak: 100+
Alert Fatigue Outcome: real incident missed

Untiered observability is alert-fatigue manufacturing. Without tiering and on-call discipline, the tool actively damages data trust by training the team to ignore alerts.


Airbnb · 2018-present · Outcome: success

Airbnb's internal Data Quality (DQ) tooling and Wall (data discovery + observability) platform apply tiered observability across thousands of datasets. Tier-1 datasets used for financial reporting and ML model training have strict SLAs, on-call rotations, and automated lineage-aware blast-radius alerts. When a tier-1 table breaks, downstream consumers are automatically notified. Airbnb published the architecture and operational model openly, treating data quality as a first-class engineering discipline with the same rigor as production software.

Datasets Tiered: thousands across Tiers 1-3
On-Call Rotation: yes, with runbooks
Lineage-Aware Alerts: auto-notify downstream consumers
Treated As: production engineering discipline

Mature observability operates exactly like SRE for software: tiered, alert-tuned, on-call rotation, post-mortems, runbooks. The tool is incidental; the operating model is the moat.
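
The lineage-aware, blast-radius notification described in this case can be sketched as a simple graph traversal. This is an illustration of the idea, not Airbnb's implementation; every table and owner name below is hypothetical.

from collections import deque

# table -> direct downstream tables/dashboards (hypothetical lineage)
LINEAGE = {
    "raw.orders": ["analytics.orders_fact"],
    "analytics.orders_fact": ["reporting.revenue_daily", "ml.churn_features"],
    "reporting.revenue_daily": ["dashboards.board_pack"],
}
OWNERS = {
    "reporting.revenue_daily": "finance-analytics",
    "ml.churn_features": "ml-platform",
    "dashboards.board_pack": "exec-reporting",
}

def blast_radius(broken_table: str) -> set[str]:
    """Breadth-first walk of the lineage graph: everything downstream of the break."""
    affected, queue = set(), deque([broken_table])
    while queue:
        for child in LINEAGE.get(queue.popleft(), []):
            if child not in affected:
                affected.add(child)
                queue.append(child)
    return affected

for dataset in blast_radius("analytics.orders_fact"):
    print(f"notify {OWNERS.get(dataset, 'unknown owner')}: upstream incident affects {dataset}")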


Decision scenario

The Post-Incident Observability Pitch

You're the new Head of Data at a 600-person SaaS. Three months in, the orders pipeline silently broke for 36 hours, leading the CEO to present wrong numbers at a board meeting. The CEO has given you a $400K observability budget and 60 days to propose a plan. You manage 1,500 tables with a team of 6 data engineers.

Data Engineers: 6
Total Tables: 1,500
Budget: $400K
Recent Incident MTTD: 36 hours
Tier Definitions Today: none

Decision 1

Two vendors are pitching: an enterprise observability platform ($300K/year) that promises 'monitoring across every table' and a lighter platform ($120K/year) focused on configurable tier-1 monitoring. Your CTO favors the enterprise option because 'we should monitor everything'. Your senior engineer favors tiering and the lighter platform.

Option A: Buy the enterprise platform and enable monitoring across all 1,500 tables; broad coverage demonstrates seriousness.
Outcome: Within 6 weeks, ~70 alerts/day, most non-actionable; engineers begin ignoring alerts. Within 12 weeks, a tier-1 billing pipeline breaks, the alert is buried, and the CFO finds the issue. The CEO is presented with the second incident in 5 months. Your tenure is in question. The platform is either ripped out or relegated to nice-to-have. Budget burned, trust collapsed.
MTTD: 36 hr → still ~24 hr (alert fatigue) · Stakeholder Trust: worse than before · Budget Recovery: difficult
Option B: Spend weeks 1-4 with the business defining 30 tier-1 datasets and SLAs. Buy the lighter platform ($120K). Configure full monitoring on tier-1, freshness-only on tier-2 (~150 datasets), nothing on the rest. Set up an on-call rotation, runbooks, and a public status page in weeks 5-8.
Outcome: By week 8, tier-1 monitoring is live, on-call is active, and the status page is published internally. In week 12, the first tier-1 incident is detected in 22 minutes (vs 36 hours) and resolved in 4 hours. The CFO sees the status page and trusts the team. Months 4-12: zero stakeholder-discovered tier-1 incidents. The CEO publicly cites the data team as a model. $280K of the $400K budget remains for additional engineers. The lighter platform proves the operating model matters more than tool surface area.
MTTD: 36 hr → <30 min · Stakeholder Trust: significantly improved · Budget Remaining: $280K of $400K


Beyond the concept

Turn Data Observability into a live operating decision.

Use this concept as the framing layer, then move into a diagnostic if it maps directly to a current bottleneck.
