Post-Mortem Discipline
Post-Mortem Discipline is the organizational practice of running structured, blameless retrospectives after every significant incident, project, or change, and systematically converting findings into permanent process changes. Google's SRE Book codified the modern blameless post-mortem: the goal is not to assign blame but to identify systemic causes (the conditions that allowed an individual error to cause harm) and ship fixes. Etsy's debrief practice goes further, treating outages as learning opportunities and publishing internal post-mortems widely so the organization compounds lessons. The discipline has three layers: (1) Blameless investigation, (2) Action item ownership with deadlines, (3) Closed-loop verification that action items shipped. Without all three, post-mortems become organizational scar tissue: meetings that catalog what already broke without changing what comes next.
The Trap
The trap is the post-mortem theater cycle: a meeting happens, action items get logged, no one owns them with a deadline, and 70% never ship. Six months later, the same incident recurs and a new post-mortem is held. The action items list grows; the actual change rate is near zero. KnowMBA POV: post-mortems can only catalog what already broke. They are necessary but insufficient: pre-mortems uncover what post-mortems can only document after the damage is done. The second trap: blame creeping back in via euphemism ('the engineer who pushed the change' is blame disguised as fact). The moment blame enters, candor leaves and you stop hearing about real causes.
What to Do
Run post-mortems with: (1) Strict blamelessness: discuss roles and decisions, never name individuals as causes. Frame: 'given the information available at the time, why was this a reasonable decision?' (2) A time box of 5 business days from incident close, because memory degrades fast. (3) A written narrative document, not slides: narrative captures the causal chains that slides flatten. (4) 3-7 SMART action items, each with a named owner, a deadline, and a success criterion. (5) Action item review in a standing leadership forum at 30/60/90 days. (6) Internal publication so other teams compound the learning. Track 'action item ship rate' as a meta-metric: if it's below 70%, the post-mortem process is broken regardless of the meeting quality.
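The ship-rate meta-metric can be sketched in a few lines of Python. This is an illustrative sketch, not a prescribed tool: the `ActionItem` fields, the sample titles and owner labels, and the dates are all hypothetical, and the sketch assumes an item counts as shipped only if it landed on or before its committed deadline.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ActionItem:
    # Hypothetical record shape: one row per post-mortem action item.
    title: str
    owner: str
    deadline: date
    shipped_on: Optional[date] = None  # None means it has not shipped

    def shipped_on_time(self) -> bool:
        # Assumption: late or unshipped items both count against the rate.
        return self.shipped_on is not None and self.shipped_on <= self.deadline


def ship_rate(items: list[ActionItem]) -> float:
    """Fraction of action items shipped on or before their deadline."""
    if not items:
        raise ValueError("no action items to measure")
    return sum(item.shipped_on_time() for item in items) / len(items)


# Made-up sample data for illustration only.
items = [
    ActionItem("Add canary stage to deploys", "owner-a", date(2024, 3, 1), date(2024, 2, 20)),
    ActionItem("Alert on replica lag", "owner-b", date(2024, 3, 1), date(2024, 3, 15)),  # late
    ActionItem("Write rollback runbook", "owner-c", date(2024, 3, 8)),  # never shipped
    ActionItem("Rate-limit config pushes", "owner-d", date(2024, 3, 8), date(2024, 3, 8)),
]

rate = ship_rate(items)
print(f"ship rate: {rate:.0%}")
print("on target" if rate >= 0.70 else "below the 70% target")
```

With this sample data, two of the four items shipped on time, so the rate falls below the 70% line. Counting late items as misses is a design choice; a team could instead report on-time and eventual ship rates separately.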
In Practice
Google's SRE organization codified the blameless post-mortem in the public SRE Book (2016). After every significant incident, an SRE writes a post-mortem document with: timeline, root causes, contributing factors, action items with owners and deadlines, and lessons learned. Critically, post-mortems are reviewed openly by other SRE teams, making the company smarter as a system rather than per-team. The discipline is enforced by leadership: if action items don't ship, the post-mortem is reopened. Google estimates the practice has prevented many recurrences of incidents that, without action item follow-through, would have happened repeatedly. Etsy's similar practice (documented by John Allspaw) explicitly treats engineers as the second victim of incidents, not the cause, preserving the candor needed to find systemic causes.
Pro Tips
- 01
Track action item ship rate as a leading indicator. If your post-mortems generate 10 action items per incident and only 3 ship within 90 days, your post-mortem process is producing scar tissue, not change. Target: a 70%+ ship rate within the committed deadline.
- 02
Separate the post-mortem document (forensic, blameless, published) from the leadership accountability discussion (private, performance-related, with HR). Conflating the two destroys the candor of the post-mortem because participants self-censor to protect colleagues.
- 03
Publish post-mortems internally, even painful ones. The compounding value of post-mortems comes from cross-team learning. A post-mortem read only by the team that lived the incident extracts maybe 20% of its value; publishing widely extracts 80%+.
Myth vs Reality
Myth
"Blameless post-mortems mean no one is held accountable"
Reality
Blameless investigations and accountability are separate processes. The post-mortem identifies what systemic conditions allowed an error to cause harm. Performance management (separate, private, with HR) addresses individual accountability. Conflating them breaks both: investigation candor collapses and accountability becomes capricious.
Myth
"Action items from post-mortems should be assigned to the team that caused the incident"
Reality
Often the right action sits with another team (e.g., a platform fix that prevents the class of error entirely). Constraining action items to the team of origin guarantees you only treat symptoms. Cross-team action items require leadership to enforce, but they're where the highest-leverage fixes live.
Myth
"Quick incidents (resolved in <1 hour) don't merit a post-mortem"
Reality
Quick incidents that resolved by luck or heroic effort are exactly the ones that will recur. Severity-of-impact-this-time is a poor filter. Better filters: was the cause novel? Did we have to use heroics to recover? Did we get lucky on impact? If yes to any, run the post-mortem.
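The three filters above amount to an any-of check, which can be sketched as a small triage helper. This is a minimal sketch under stated assumptions: the `IncidentSignals` type and its field names are hypothetical, and the answers to the three questions still have to come from human judgment.

```python
from dataclasses import dataclass


@dataclass
class IncidentSignals:
    # Hypothetical field names mirroring the three filter questions.
    novel_cause: bool      # Was the cause novel?
    needed_heroics: bool   # Did recovery depend on heroics?
    lucky_on_impact: bool  # Did we get lucky on impact?


def needs_post_mortem(signals: IncidentSignals) -> bool:
    # "If yes to any, run the post-mortem": duration and severity are
    # deliberately not inputs to this decision.
    return (
        signals.novel_cause
        or signals.needed_heroics
        or signals.lucky_on_impact
    )


# A 20-minute incident that was only resolved through heroics still qualifies.
quick_but_heroic = IncidentSignals(
    novel_cause=False, needed_heroics=True, lucky_on_impact=False
)
print(needs_post_mortem(quick_but_heroic))
```

Leaving time-to-resolve out of the inputs is the point of the sketch: a fast recovery cannot talk the team out of a review.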
Knowledge Check
Your engineering team runs post-mortems after every Sev1 incident. The meetings are well-attended, the documents are detailed, and action items get logged in Jira. But 6 months in, three of the original incidents have recurred. What's the most likely root cause?
Industry benchmarks
Is your number good?
Calibrate against real-world tiers. Use these ranges as targets, not absolutes.
Action Item Ship Rate (Within Committed Deadline)
Engineering and operations post-mortem programs
Elite (Google SRE, Etsy)
75-85%
Healthy
60-75%
At-risk
40-60%
Theater
<40%
Source: Google SRE Book (2016); Etsy Code as Craft engineering blog
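To place your own number against these tiers, a small lookup along the following lines may help. It is a sketch with two labeled assumptions: the published ranges share their endpoints (75% and 60% each appear in two tiers), so the function treats each lower bound as inclusive, and it counts anything above 85% as Elite even though the table stops there. The function name `ship_rate_tier` is hypothetical.

```python
def ship_rate_tier(rate_percent: float) -> str:
    """Map an on-time action item ship rate (0-100) to a benchmark tier.

    Assumptions: lower bounds are inclusive, and rates above 85% are
    treated as Elite because the table gives no higher tier.
    """
    if not 0 <= rate_percent <= 100:
        raise ValueError("rate_percent must be between 0 and 100")
    if rate_percent >= 75:
        return "Elite"
    if rate_percent >= 60:
        return "Healthy"
    if rate_percent >= 40:
        return "At-risk"
    return "Theater"


# 3 of 10 action items shipped within the deadline.
print(ship_rate_tier(100 * 3 / 10))
```

Three shipped items out of ten is a 30% rate, which lands in the Theater tier under these assumptions.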
Real-world cases
Companies that lived this.
Verified narratives with the numbers that prove (or break) the concept.
Google SRE (Site Reliability Engineering)
2003-present
Google's SRE organization codified the modern blameless post-mortem and made it public via the SRE Book in 2016. The practice: every Sev1 and Sev2 incident generates a written post-mortem within 5 business days, structured as timeline + root causes + contributing factors + action items + lessons. The defining feature is blamelessness: investigators discuss decisions in the context of the information available at the time, never identifying individuals as causes. Post-mortems are published widely across the SRE org, making cross-team learning the default rather than the exception. Action items are tracked with deadlines and reviewed in standing leadership forums. Google credits the discipline with materially reducing recurrence of similar incident classes; the compounding value comes from organization-wide pattern recognition, not per-incident fixes.
Post-mortem deadline
5 business days from incident close
Blamelessness
Mandatory, no individual naming
Publishing scope
Org-wide by default
Action item review cadence
Standing leadership forum
The Google SRE post-mortem is the modern reference implementation. The three non-negotiable elements: blameless investigation, action items with deadlines and named owners, and broad internal publishing. Skip any one and the practice degrades into documentation theater.
Etsy (Debrief Culture)
2010-present
Etsy's engineering organization, under former CTO John Allspaw, developed and publicly documented one of the most influential blameless post-mortem cultures outside Google. Allspaw's 2012 essay 'Blameless PostMortems and a Just Culture' framed engineers as the second victim of incidents, not the cause, a deliberate framing to preserve the candor needed for systemic investigation. Etsy's debriefs explicitly separated 'understanding the system' from 'evaluating individuals.' The practice became foundational to Etsy's continuous deployment culture (60+ deploys per day at peak): high deploy velocity is only safe with high-quality learning from incidents.
Deploys per day (peak era)
60+
Debrief framing
Engineer as second victim
Industry influence
Foundation for DevOps/SRE post-mortem practice
Allspaw essay (2012)
Widely cited reference
Etsy's contribution is the framing: when engineers are the second victim of incidents (not the cause), candor becomes possible and systemic causes become visible. The framing precedes the process: get the framing wrong and no amount of post-mortem template polish saves the practice.