Disaster Recovery Planning
Disaster Recovery Planning is the IT-specific discipline of getting systems back online after a major incident: datacenter loss, region outage, ransomware, catastrophic data corruption, or a destructive human error. It's defined by two metrics: RTO (Recovery Time Objective, how long you can be down) and RPO (Recovery Point Objective, how much data loss is acceptable). The four common architectures, from cheapest to most expensive: backup & restore (RTO hours-days, RPO hours), pilot light (RTO hours, RPO minutes), warm standby (RTO minutes, RPO seconds), and active-active multi-region (RTO seconds, RPO ~0). The KnowMBA POV: most enterprises have DR plans they've never tested. A DR plan that has never been exercised end-to-end isn't a plan; it's a document.
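A minimal sketch of how those targets translate into an architecture choice, assuming the RTO/RPO ranges listed above; the function and its thresholds are illustrative, not a standard library or vendor API:

```python
# Illustrative only: map each DR architecture to rough worst-case RTO/RPO targets
# (in seconds), ordered cheapest-first, then pick the cheapest one that meets a need.
DR_ARCHITECTURES = [
    # (name, worst-case RTO seconds, worst-case RPO seconds)
    ("backup & restore",             24 * 3600, 24 * 3600),  # RTO hours-days, RPO hours
    ("pilot light",                   4 * 3600,    15 * 60),  # RTO hours, RPO minutes
    ("warm standby",                    30 * 60,         60),  # RTO minutes, RPO seconds
    ("active-active multi-region",           60,          1),  # RTO seconds, RPO ~0
]

def cheapest_architecture(required_rto_s: float, required_rpo_s: float) -> str:
    """Return the cheapest listed architecture whose targets meet the requirement."""
    for name, rto_s, rpo_s in DR_ARCHITECTURES:
        if rto_s <= required_rto_s and rpo_s <= required_rpo_s:
            return name
    raise ValueError("No listed architecture meets this RTO/RPO requirement")

# Example: a 4-hour RTO with at most 15 minutes of data loss maps to pilot light.
print(cheapest_architecture(4 * 3600, 15 * 60))
```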
The Trap
The trap is treating DR as a documentation exercise. Companies buy backup tools, write a 90-page DR runbook, file it with compliance, and check the box. When a real incident hits, the runbook references credentials that have rotated, scripts that haven't run in 18 months, vendors no longer under contract, and people who left the company two reorgs ago. The cruel statistic: roughly 40% of organizations that declare a disaster never return to pre-incident operations, primarily because the recovery plan didn't survive contact with reality. The other failure mode is RTO/RPO theater: leadership commits to a '4-hour RTO' to look strong, but the underlying architecture would take 36 hours. Nobody discovers the gap until the day it matters.
What to Do
Three operational disciplines. (1) Tier your applications: Tier 1 (revenue-bearing, customer-facing) gets active-active or warm standby; Tier 2 (internal critical) gets pilot light; Tier 3 (everything else) gets backup-and-restore. Most enterprises have roughly 10% Tier 1, 30% Tier 2, and 60% Tier 3, but spend as if everything were Tier 1. (2) Test the runbook: run a tabletop drill quarterly and a real failover annually for Tier 1 systems. For the most mature teams, Netflix-style chaos engineering: deliberately break things in production to verify recovery. (3) Measure 'recovery debt': the gap between committed RTO/RPO and demonstrated RTO/RPO from the last test (the sketch under Formula below makes this concrete). Close the gap or revise the commitment.
Formula
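The 'recovery debt' idea reduces to a simple difference: demonstrated recovery performance minus committed recovery performance, floored at zero. A minimal sketch with hypothetical field names, using the 4-hour commitment and 18-hour actual recovery from the knowledge check below as the example:

```python
from dataclasses import dataclass

@dataclass
class RecoveryDebt:
    """Gap between what leadership committed to and what the last drill demonstrated."""
    committed_rto_h: float      # committed Recovery Time Objective, in hours
    demonstrated_rto_h: float   # RTO actually measured in the last full failover test
    committed_rpo_min: float    # committed Recovery Point Objective, in minutes
    demonstrated_rpo_min: float # data loss actually observed/measured in the last test

    @property
    def rto_debt_h(self) -> float:
        # Positive debt means the last test took longer than the commitment.
        return max(0.0, self.demonstrated_rto_h - self.committed_rto_h)

    @property
    def rpo_debt_min(self) -> float:
        return max(0.0, self.demonstrated_rpo_min - self.committed_rpo_min)

# Example: a 4-hour commitment vs. an 18-hour demonstrated recovery = 14 hours of RTO debt.
debt = RecoveryDebt(committed_rto_h=4.0, demonstrated_rto_h=18.0,
                    committed_rpo_min=15.0, demonstrated_rpo_min=240.0)
print(debt.rto_debt_h, debt.rpo_debt_min)  # 14.0 225.0
```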
In Practice
Netflix institutionalized DR through Chaos Engineering: tools like Chaos Monkey (kills production instances), Chaos Kong (simulates loss of an entire AWS region), and the broader Simian Army philosophy. The bet: rather than write DR plans and hope, deliberately break production daily so the system is provably resilient. By 2016, Netflix could lose an entire AWS region (which Chaos Kong simulated regularly) with minimal customer impact, because customer traffic automatically failed over to other regions. The lesson the broader industry adopted: DR plans you don't test are aspirational; production injection is the only way to prove resilience. Microsoft Azure publishes equivalent guidance for its customers in the Azure Site Recovery and Well-Architected Reliability Pillar documentation.
Pro Tips
- 01
Test failback, not just failover: most teams can fail OVER but have never practiced returning to the primary region after recovery. The full cycle is what an actual incident requires.
- 02
RPO is usually harder to hit than RTO. Getting systems UP is mostly automation; getting data CONSISTENT to a recent point requires architectural decisions (synchronous replication, change data capture, event sourcing) that have to be made years before the disaster. As a rough rule, nightly backups cap you at a worst-case RPO of ~24 hours, while asynchronous replication bounds RPO by replication lag.
- 03
Ransomware has rewritten DR planning. Backup-and-restore strategies fail if the ransomware encrypts the backups too. Modern DR requires immutable backups (write-once, time-locked), a separate identity domain for backup systems, and verified restore drills that assume the primary identity is compromised.
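A minimal sketch of the 'write-once, time-locked' backup idea, assuming AWS S3 Object Lock via boto3; the bucket name and 30-day retention below are hypothetical, and in practice the credentials should come from the separate backup identity domain the tip describes:

```python
import boto3

s3 = boto3.client("s3")  # ideally authenticated via a backup-only identity domain
BUCKET = "example-backup-vault"  # hypothetical bucket name

# Object Lock must be enabled at bucket creation.
# (Assumes the default us-east-1 region; other regions also need CreateBucketConfiguration.)
s3.create_bucket(Bucket=BUCKET, ObjectLockEnabledForBucket=True)

# Default retention in COMPLIANCE mode: no identity, including admins, can shorten or
# remove the lock before it expires. This is what defeats ransomware that compromises
# the primary account and then tries to delete or re-encrypt the backups.
s3.put_object_lock_configuration(
    Bucket=BUCKET,
    ObjectLockConfiguration={
        "ObjectLockEnabled": "Enabled",
        "Rule": {"DefaultRetention": {"Mode": "COMPLIANCE", "Days": 30}},
    },
)
```

Immutability only covers the "backups survive" half; the verified-restore drills still have to prove the data can actually be brought back within the committed RPO.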
Myth vs Reality
Myth
"Cloud workloads don't need DR planning; the cloud handles it"
Reality
Cloud providers handle infrastructure resilience but NOT your application's recovery. AWS S3 offers eleven nines of durability, but if you delete a bucket or your IAM credentials get compromised and your data gets wiped, you're on your own. Cloud workloads need DR plans for: region failures, account compromise, accidental deletion, ransomware-equivalent data destruction, and provider-specific service outages.
Myth
"Active-active multi-region is always the best DR architecture"
Reality
Active-active is the most expensive option (typically 1.8-2.2x single-region cost) and adds significant operational complexity (data consistency, conflict resolution, deployment coordination). For most non-critical workloads, warm standby or even backup-and-restore is the right cost/risk trade-off. Don't gold-plate Tier 3 systems with Tier 1 architecture.
Try it
Run the numbers.
Pressure-test the concept against your own knowledge: answer the challenge or try the live scenario.
Knowledge Check
An enterprise's DR runbook commits to 4-hour RTO for the customer-facing platform. They've never done a full failover test. A regional outage hits. 18 hours later, the platform is back online. What is the most likely root cause of the gap?
Industry benchmarks
Is your number good?
Calibrate against real-world tiers. Use these ranges as targets, not absolutes.
Common DR Architecture RTO/RPO Targets
AWS Well-Architected Reliability Pillar, Disaster Recovery patterns
Active-Active Multi-Region
RTO < 1 min, RPO ~0
Warm Standby
RTO < 30 min, RPO < 1 min
Pilot Light
RTO < 4 hrs, RPO < 15 min
Backup & Restore
RTO 8-24 hrs, RPO 1-24 hrs
No DR / Untested Plan
RTO unknown / unbounded
Source: https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/plan-for-disaster-recovery-dr.html
Real-world cases
Companies that lived this.
Verified narratives with the numbers that prove (or break) the concept.
Netflix (Chaos Engineering / DR Through Production Injection)
2010-present
Netflix invented Chaos Engineering as the answer to a fundamental DR problem: plans you don't test don't work. Starting with Chaos Monkey (which randomly kills production instances), Netflix built a Simian Army of failure-injection tools, including Chaos Kong (simulates loss of an entire AWS region). The philosophy: rather than maintain DR plans and hope, deliberately break production daily so the system is provably resilient. By 2016, Netflix could lose an entire AWS region with minimal customer impact because traffic automatically failed over and the system had been continuously hardened against this exact scenario through repeated production injection. The discipline transformed industry expectations: 'we have a DR plan' is no longer a credible answer if you haven't tested it under real conditions.
First Chaos Tool
Chaos Monkey, 2010
Region-Loss Capability
Demonstrated via Chaos Kong
Production Injection Cadence
Continuous
Industry Influence
Created Chaos Engineering as a discipline
DR plans that haven't been exercised end-to-end are documents, not capabilities. Netflix proved that the only credible way to validate resilience is to deliberately inject failure in production. Most enterprises will never adopt full chaos engineering, but the principle stands: untested DR is theater.
Microsoft Azure Site Recovery (productized DR pattern)
2014-present
Microsoft Azure built and continuously documents Azure Site Recovery (ASR) and the Reliability Pillar of the Azure Well-Architected Framework as the productized expression of enterprise DR best practice. ASR provides automated replication and failover for VMs, applications, and entire workloads across Azure regions or from on-prem to Azure. The Reliability Pillar guidance covers the same RTO/RPO tiering KnowMBA recommends: tier applications by criticality, match architecture to tier, test failover regularly. Microsoft publishes its own internal DR practice as case studies for customers, making the point that even Microsoft, with deep platform expertise, treats DR as a continuous testing discipline rather than a one-time architecture decision.
Service
Azure Site Recovery (ASR)
Replication Frequency
Configurable, near-real-time
Documented Failover Cadence (recommended)
Quarterly drills, annual full test
Architecture Patterns Supported
Backup, pilot light, warm standby, active-active
Cloud providers offer the building blocks for DR but don't deliver DR maturity for you. ASR is a tool; Azure's reliability guidance is a playbook. The maturity gap between organizations that use both vs those that only buy the tool is the gap between 'DR capability' and 'DR shelfware.'
Decision scenario
The Untested DR Plan Discovery
You're a new CIO at a $1.5B B2B SaaS firm. Reviewing the IT portfolio in week 2, you find the DR plan commits to 2-hour RTO for the platform. The last full failover test was 28 months ago. Engineering tells you informally that 'realistically it would probably take 12-18 hours now' due to architecture changes. The platform generates $1.4M/hour in revenue. The board's risk committee is reviewing IT resilience next quarter.
Committed RTO
2 hours
Estimated Real RTO
12-18 hours
Last Full Failover Test
28 months ago
Revenue per Outage Hour
$1.4M
Recovery Debt
10-16 hours (massive credibility gap)
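A quick worked calculation on the scenario numbers above (a sketch that ignores SLA penalties, customer churn, and recovery labor) shows why the recovery debt is board-level exposure rather than an IT detail:

```python
# Revenue exposure of the gap between the committed and realistic recovery times.
revenue_per_hour = 1.4e6          # $1.4M per outage hour (from the scenario)
committed_rto_h = 2               # what the DR plan commits to
realistic_rto_h = (12, 18)        # engineering's informal estimate

committed_exposure = committed_rto_h * revenue_per_hour
realistic_exposure = tuple(h * revenue_per_hour for h in realistic_rto_h)

print(f"Exposure the board thinks it carries: ${committed_exposure / 1e6:.1f}M")
print(f"Exposure it actually carries: ${realistic_exposure[0] / 1e6:.1f}M-"
      f"${realistic_exposure[1] / 1e6:.1f}M per regional incident")
# -> $2.8M committed vs. $16.8M-$25.2M realistic: a $14M-$22.4M unbudgeted gap.
```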
Decision 1
You can fix this discreetly, fix it loudly, or hope the board doesn't ask. Each path has different consequences.
Tell the board what they want to hear (2-hour RTO confirmed) and quietly run a remediation project to close the gap over 6 months.
Brief the board honestly: 'Discovered a 10-16 hour recovery debt. Resetting committed RTO to 12 hours immediately while investing $4M over 12 months to architect down to a real 4-hour RTO with quarterly tested drills.' Tie executive comp to demonstrated (not committed) RTO from drills. (Optimal)
Related concepts
Keep connecting.
The concepts that orbit this one; each one sharpens the others.
Beyond the concept
Turn Disaster Recovery Planning into a live operating decision.
Use this concept as the framing layer, then move into a diagnostic if it maps directly to a current bottleneck.
Typical response time: 24h · No retainer required