Observability Strategy
Observability is the practice of instrumenting systems so their internal state can be inferred from external outputs. The three classical signals are metrics (numerical time series), logs (timestamped events), and traces (distributed request flow). The major commercial platforms are Datadog (broadest, most expensive), New Relic, Splunk, Dynatrace (APM-centric), Honeycomb (event-based, BubbleUp methodology), and Grafana Cloud (open-source-aligned). The open-source stack centers on Prometheus + Grafana + Loki + Tempo + OpenTelemetry. OpenTelemetry (the CNCF's second-largest project after Kubernetes) has become the standard instrumentation framework, letting organizations decouple instrumentation from the backend.

The KnowMBA POV: observability without ownership is just storage cost. Most enterprises buy Datadog or Splunk, ingest everything, build a few dashboards, and then discover that nobody actually USES the platform during incidents; engineers grep logs in their terminals because they can't navigate the platform fast enough. The missing discipline isn't tooling. It's service ownership, on-call rigor, SLO definition, and the practice of actually using observability data to drive decisions.
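A minimal sketch of that decoupling, using the OpenTelemetry Python SDK (the service name, span name, and attribute below are illustrative; the exporter defaults to a local OTLP collector):

```python
# pip install opentelemetry-sdk opentelemetry-exporter-otlp
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Application code depends only on the vendor-neutral OTel API.
# The exporter (here: OTLP to a local collector on :4317) is the swappable
# piece: point it at Datadog, Honeycomb, or Tempo without touching the app.
provider = TracerProvider(resource=Resource.create({"service.name": "checkout"}))
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout")
with tracer.start_as_current_span("charge-card") as span:
    span.set_attribute("order.value_usd", 42.50)  # hypothetical attribute
```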
The Trap
The trap is treating observability as a tooling decision rather than a practice decision. Three failure modes dominate:
(1) Cost explosion. Datadog, Splunk, and similar platforms charge by ingest volume. Without retention discipline, sampling, and selective instrumentation, observability bills grow 40-80% year over year and eventually rival the infrastructure they monitor. Datadog has multiple seven-figure customer stories and has faced public criticism over surprise bills, most famously the widely discussed (and never officially confirmed) report of a ~$65M Coinbase Datadog bill.
(2) Dashboard graveyard. Teams build hundreds of dashboards, ownership is unclear, half are stale, and during incidents nobody knows which ones to look at.
(3) No actual SLOs. Teams have metrics but no formal service-level objectives, no error budgets, and no decision framework for when to ship features versus invest in reliability.
The deeper trap: confusing data quantity with insight. Ingesting 10x more telemetry doesn't make systems 10x more observable; it makes them more expensive and harder to navigate.
What to Do
Six moves:
(1) Standardize on OpenTelemetry for instrumentation. This decouples your code from your backend choice and avoids vendor lock-in: whether you ship to Datadog, Honeycomb, or Grafana, OTel is the unification layer.
(2) Define SLOs per service before adding more dashboards. Without SLOs, observability is decoration. The Google SRE Workbook chapters on SLOs are the canonical reference.
(3) Set per-team observability budgets (cost AND signal volume) and make teams accountable for what they ingest. Datadog's per-team usage attribution exists for this reason.
(4) Implement aggressive sampling and retention policies: full fidelity for recent data (1-7 days), heavy sampling for older data, archival to cold storage for compliance. Most observability platforms support tail-based sampling for traces; a toy sketch follows this list.
(5) Build a small set of high-quality, runbook-linked dashboards rather than hundreds of one-offs. Dashboard quality > quantity.
(6) Hold blameless postmortems that explicitly evaluate observability. For every incident, ask 'what would have made this faster to detect or resolve?' and feed those learnings back into instrumentation.
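Move (4) in miniature: a toy tail-based sampler over simplified trace records. This is an illustration of the policy, not a vendor API; in production the OpenTelemetry Collector's tail_sampling processor plays this role.

```python
import random
from dataclasses import dataclass

@dataclass
class TraceSummary:
    trace_id: str
    has_error: bool
    duration_ms: float

def keep(t: TraceSummary, healthy_rate: float = 0.05,
         slow_ms: float = 1000.0) -> bool:
    """Tail-based sampling decides AFTER the whole trace is assembled:
    keep every error and every slow trace (the interesting failure modes),
    plus a small random sample of healthy traffic for baselines."""
    if t.has_error or t.duration_ms >= slow_ms:
        return True
    return random.random() < healthy_rate

# At a 5% healthy-trace rate, trace storage drops roughly 95% on a healthy
# service while every incident-relevant trace is retained at full fidelity.
```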
In Practice
Datadog has become the dominant enterprise observability platform: public market cap around $45B as of 2024, roughly $2.5B in revenue, and deployment across thousands of enterprises. The platform's growth and customer concentration produced industry-wide cost concerns: Datadog bills exceeding $10M/year are common at large engineering organizations, and one widely discussed (though never officially confirmed) story circulated about Coinbase incurring a multi-million-dollar Datadog bill from a single misconfigured ingestion pipeline. Honeycomb (Charity Majors, Christine Yen) built a counter-narrative around event-based observability and the limitations of metrics-first approaches, becoming influential in the observability discourse without the ingest-everything cost model. Grafana Labs built the largest open-source-aligned observability platform (Prometheus + Grafana + Loki + Tempo + Mimir), reaching unicorn valuation by 2022. The market has converged on a clearer view: observability tooling is excellent but operationally and financially heavy, and the discipline (SLOs, ownership, sampling) matters more than the choice of vendor.
Pro Tips
- 01
Datadog cost grows faster than infrastructure. Track 'observability cost as % of total infrastructure cost' as a KPI. Healthy: 5-15%. Concerning: 20-30%. Crisis: 30%+. Many organizations discover their Datadog bill is approaching parity with their AWS bill, at which point either the observability budget is wrong, the instrumentation is too verbose, or the platform choice is wrong for the workload.
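A back-of-envelope version of that KPI, with hypothetical spend figures:

```python
def observability_cost_ratio(observability_monthly: float,
                             infra_monthly: float) -> str:
    """Classify observability spend as a share of total infrastructure cost,
    using the bands from the tip above."""
    pct = 100 * observability_monthly / infra_monthly
    if pct < 15:
        tier = "healthy"
    elif pct < 30:
        tier = "concerning"
    else:
        tier = "crisis"
    return f"{pct:.1f}% of infra spend ({tier})"

# Hypothetical: $290K/month of Datadog against $900K/month of AWS.
print(observability_cost_ratio(290_000, 900_000))  # 32.2% of infra spend (crisis)
```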
- 02
OpenTelemetry adoption is the highest-leverage observability decision you can make. OTel decouples instrumentation from backend, meaning you can switch from Datadog to Grafana Cloud, or to Honeycomb, or run a hybrid, without re-instrumenting your code. The investment in OTel pays for itself the first time you renegotiate with your observability vendor.
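What the swap looks like in practice, sketched with the standard OTLP exporter environment variable from the OpenTelemetry specification (vendor endpoints in the comments are illustrative; check each vendor's docs):

```python
import os
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# The instrumented application never changes; operations re-point telemetry
# per environment via configuration, e.g.:
#   OTEL_EXPORTER_OTLP_ENDPOINT=https://api.honeycomb.io       (Honeycomb)
#   OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317     (self-hosted stack)
# The SDK reads this variable on its own; it is shown explicitly for clarity.
exporter = OTLPSpanExporter(
    endpoint=os.environ.get("OTEL_EXPORTER_OTLP_ENDPOINT", "http://localhost:4317")
)
```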
- 03
SLOs change observability from monitoring to decision-making. A team without SLOs has a wall of dashboards and no opinion about what's good or bad. A team with SLOs has a clear boundary: when the error budget is healthy, ship features; when it's burning, invest in reliability. The Google SRE Workbook's chapters on SLOs and error budgets are foundational reading.
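The arithmetic behind an error budget, as a sketch (the 99.9%-over-30-days case is the classic SRE Workbook example):

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed bad minutes for an availability SLO over a rolling window."""
    return (1 - slo) * window_days * 24 * 60

def budget_remaining(slo: float, bad_minutes: float, window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative means blown)."""
    budget = error_budget_minutes(slo, window_days)
    return (budget - bad_minutes) / budget

# 99.9% over 30 days allows ~43.2 minutes of unavailability.
print(round(error_budget_minutes(0.999), 1))       # 43.2
# After a 30-minute incident, ~31% of budget remains: keep shipping, carefully.
print(round(budget_remaining(0.999, 30), 2))       # 0.31
```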
Myth vs Reality
Myth
"More telemetry equals better observability"
Reality
Ingest volume is not the same as insight. Many organizations ingest terabytes of logs that nobody reads, generate millions of metric series that no dashboard queries, and produce traces at full fidelity that get sampled at query time anyway. The discipline is selective instrumentation aligned to specific failure modes and decisions, not maximum coverage.
Myth
"Buying Datadog (or any platform) solves observability"
Reality
Datadog (like Splunk, New Relic, and Dynatrace) is excellent platform tooling, but tooling doesn't address the practice gap. SLO definition, on-call rigor, blameless postmortems, runbook discipline, dashboard ownership, sampling policies: these are the work, and no platform purchase replaces them. The platform makes the practice scalable; the practice has to exist first.
Try it
Run the numbers.
Pressure-test the concept against your own knowledge: answer the challenge below.
Scenario Challenge
Your CFO sends you a panicked email: the Datadog bill went from $80K/month last year to $290K/month this quarter, with no commensurate growth in infrastructure. Engineering leadership defends the spend: 'we need this visibility for production reliability.' What's the right diagnostic and response?
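One way to start that diagnostic, sketched with hypothetical numbers: break the bill down by Datadog's pricing dimensions and flag anything growing far faster than the underlying infrastructure.

```python
# Hypothetical year-over-year bill breakdown by pricing dimension ($K/month).
# Host count (a proxy for infrastructure) grew ~20%; any dimension growing
# much faster than that points to an instrumentation or retention problem,
# not a scale problem.
last_year = {"hosts": 35, "custom_metrics": 15, "logs": 20, "apm_traces": 10}
this_year = {"hosts": 45, "custom_metrics": 95, "logs": 120, "apm_traces": 30}

infra_growth = 1.20  # assumed host growth over the same period
for dim in last_year:
    growth = this_year[dim] / last_year[dim]
    flag = "  <-- investigate" if growth > infra_growth * 1.5 else ""
    print(f"{dim:15s} {last_year[dim]:>4}K -> {this_year[dim]:>4}K ({growth:.1f}x){flag}")
```

Under these assumed numbers, custom metrics and logs are the runaway dimensions: the response is sampling and retention policy, not a platform rip-and-replace.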
Industry benchmarks
Is your number good?
Calibrate against real-world tiers. Use these ranges as targets, not absolutes.
Observability Spend as % of Infrastructure Cost
Cloud-native production environments:
Lean: < 8%
Healthy: 8-15%
High: 15-25%
Out of proportion: > 25%
Source: hypothetical composite from FinOps Foundation observations and platform vendor case studies
Mean Time to Recovery (MTTR), Mature Practice
DORA (Accelerate State of DevOps) MTTR tiers:
Elite (DORA top performers): < 1 hour
High: 1-24 hours
Medium: 1 day - 1 week
Low: > 1 week
Source: DORA Accelerate State of DevOps Report (annual)
Real-world cases
Companies that lived this.
Verified narratives with the numbers that prove (or break) the concept.
Datadog
2010-Present
Datadog launched in 2010 and grew into the dominant enterprise observability platform: public market cap around $45B as of 2024 and revenue of roughly $2.5B annually. Datadog's growth came from a comprehensive product (APM + infrastructure monitoring + log management + RUM + security) and an aggressive enterprise sales motion. The flip side: customer cost concerns. Reports of Datadog bills exceeding $10M annually became common at larger engineering organizations, and the platform drew criticism for pricing-model complexity (per-host, per-custom-metric, per-log-GB, per-trace-event, per-RUM-session, each with different rates and unit economics). One widely discussed (though never officially confirmed) story circulated in 2022-2023 about a major crypto company incurring a multi-million-dollar Datadog bill from misconfigured ingestion. The industry-wide pattern: Datadog delivers excellent capability and excellent margins simultaneously, and customer cost discipline must be operated as actively as feature adoption.
Founded
2010 (NYC)
Market Cap (2024)
~$45B
Annual Revenue
~$2.5B
Common Enterprise Bill
$1M-$10M+/year
Pricing Complexity
Multiple per-unit dimensions
Excellent observability tooling can produce excellent vendor margins through cost-model complexity. Organizations that deploy Datadog without per-team cost attribution, sampling discipline, and ingest budgeting routinely discover their bills grew faster than infrastructure. The platform's value is real; the cost discipline is mandatory.
Honeycomb
2016-Present
Honeycomb (founded by Charity Majors and Christine Yen, both ex-Facebook via the Parse acquisition) built an observability platform around 'observability 2.0' principles: high-cardinality, event-based data rather than pre-aggregated metrics; the BubbleUp methodology for outlier detection; and a focus on exploring unknown failure modes rather than monitoring known ones. Honeycomb's commercial scale is much smaller than Datadog's, but its intellectual influence has been disproportionate: 'Observability Engineering' (O'Reilly), written from this community, shaped how a generation of senior engineers think about telemetry. Charity Majors's writing on production excellence, ownership, and operational practice is canonical in the SRE community.
Founded
2016 (San Francisco)
Founders
Charity Majors, Christine Yen (both ex-Facebook)
Commercial Scale
Sub-Datadog, but profitable
Intellectual Influence
'Observability Engineering' (O'Reilly)
Observability is as much a thinking practice as a tooling category. Honeycomb's contribution was less commercial than intellectual: reframing observability around exploration of unknown failure modes rather than monitoring of known metrics. The mature engineering organization adopts the thinking even when the tool is something else.
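A toy sketch of the BubbleUp idea described above (my illustration, not Honeycomb's implementation): compare attribute-value frequencies between outlier events and the baseline population, and surface the attributes that differ most.

```python
from collections import Counter

def bubble_up(outliers: list[dict], baseline: list[dict]) -> list[tuple]:
    """Score how over-represented each attribute=value pair is among
    outlier events versus the baseline population."""
    def freqs(events):
        counts = Counter((k, v) for e in events for k, v in e.items())
        return {kv: n / len(events) for kv, n in counts.items()}
    out_f, base_f = freqs(outliers), freqs(baseline)
    scores = [(kv, out_f[kv] - base_f.get(kv, 0.0)) for kv in out_f]
    return sorted(scores, key=lambda s: -s[1])

# Hypothetical events: slow requests cluster on one customer and one AZ.
slow = [{"customer": "acme", "az": "us-east-1c"}] * 9 \
     + [{"customer": "zeta", "az": "us-east-1a"}]
fast = [{"customer": "zeta", "az": "us-east-1a"}] * 8 \
     + [{"customer": "acme", "az": "us-east-1c"}] * 2
print(bubble_up(slow, fast)[:2])
# customer=acme and az=us-east-1c dominate the slow population
```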