Model Risk Management
Model Risk Management (MRM) is the discipline that prevents AI systems from causing financial loss, regulatory action, brand damage, or harm to users. It originated in banking under the Federal Reserve's SR 11-7 guidance and has become the operating standard for any organization deploying decision-making models at scale. The MRM framework rests on five pillars: (1) Model Inventory: every production model documented, owned, and tiered by risk. (2) Independent Validation: a team that did NOT build the model reviews it. (3) Performance Monitoring: drift, accuracy, and fairness metrics tracked continuously. (4) Lifecycle Governance: formal approval, change management, and retirement. (5) Issue Management: a register of known limitations with remediation plans. Without MRM, your AI portfolio is a slow-motion lawsuit waiting to happen.
The Trap
The trap is treating MRM as a checkbox exercise that runs AFTER the model is built. By then it's too late: the model is already deployed, the team is committed, and 'validation' becomes a rubber stamp. The deeper trap is assuming GenAI doesn't need MRM because 'it's just a vendor API.' Wrong: any model influencing a decision (hiring, lending, content moderation, customer-tier assignment, fraud) is subject to anti-discrimination law, regulatory scrutiny, and brand risk regardless of whether you trained it. The third trap is over-governance: applying SR 11-7-grade scrutiny to a low-risk recommendation engine and crushing iteration speed. Tier your models or you'll lose the team.
What to Do
Build a tiered MRM operating model: (1) Tier 1 - material decisions such as lending, insurance, hiring, and healthcare: full validation, ongoing monitoring, regulator-ready documentation. (2) Tier 2 - operational decisions such as routing, prioritization, and recommendation: lighter validation, automated drift alerts, quarterly review. (3) Tier 3 - productivity tools such as drafting and summarization with a human in the loop: minimal governance, usage policy and incident reporting only. Stand up a Model Risk Committee that meets monthly, owns the inventory, and signs off on Tier 1 deployments. Require every model, including third-party AI, to have a named owner, a documented purpose, and a kill switch.
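The inventory requirement can live in code as well as in a register. A minimal sketch, assuming Python 3.10+; the fields and tier scheme are illustrative, not a standard schema:

```python
from dataclasses import dataclass
from enum import IntEnum

class RiskTier(IntEnum):
    TIER_1 = 1  # material decisions: full validation, continuous monitoring
    TIER_2 = 2  # operational decisions: lighter validation, drift alerts
    TIER_3 = 3  # productivity tools: usage policy and incident reporting

@dataclass
class ModelRecord:
    """One row in the model inventory; every production model gets one."""
    name: str
    owner: str                      # a named person, not a team alias
    purpose: str                    # documented intended use
    tier: RiskTier
    vendor: str | None = None       # third-party models are inventoried too
    kill_switch_runbook: str = ""   # who can disable it, how, and the fallback

inventory = [
    ModelRecord(
        name="underwriting-decisioning",
        owner="jane.doe",
        purpose="Price and approve personal-lines policies",
        tier=RiskTier.TIER_1,
        kill_switch_runbook="runbooks/underwriting-kill-switch.md",
    ),
]
print(sorted(inventory, key=lambda m: m.tier)[0].name)  # highest-risk first
```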
Formula
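The one formula this framework leans on is the disparate-impact ratio behind the four-fifths rule (see the benchmarks below). In one common operationalization, with $SR_g$ the selection rate for group $g$:

```latex
\[
\text{DI ratio} = \frac{\min_{g} SR_{g}}{\max_{g} SR_{g}},
\qquad
\text{presumptive disparate impact when DI ratio} < 0.80
\]
```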
In Practice
Apple Card's 2019 launch is the canonical case study in failed MRM. Goldman Sachs' credit-decision model offered women significantly lower credit limits than men with similar profiles, including spouses with shared finances. The New York Department of Financial Services investigated. Goldman's defense was that gender was not an input, but a model can still produce discriminatory outputs through correlated features. The lesson MRM enforces: 'we didn't include the protected attribute' is not a defense. You must test outputs for disparate impact across protected classes BEFORE deployment, regardless of inputs. Goldman ultimately changed its credit-decision process and absorbed years of regulatory and reputational cost.
Pro Tips
- 01
Adopt the 'three lines of defense' model from banking: (1) The team that builds the model owns first-line risk. (2) An independent risk function validates and challenges. (3) Internal audit periodically tests both. Without independence between build and validation, your MRM is theater.
- 02
Test for disparate impact on protected classes EVEN WHEN protected attributes are not inputs. Use a held-out validation set with demographic labels and measure approval/score distributions across groups. The four-fifths rule (a selection rate for any group below 80% of the highest group's rate is presumptively discriminatory) is the regulator's first cut; a worked sketch appears under the benchmarks below.
- 03
Every Tier 1 model must have a documented 'kill switch': a runbook for who has authority to disable the model, under what conditions, and the manual fallback process. If you cannot turn it off in 60 minutes, you do not control it.
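In code, the kill switch is usually just a centrally controlled flag checked on every call, with an audited fallback. A minimal sketch; the flag store and model hook here are hypothetical stand-ins for whatever your stack provides:

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("mrm.killswitch")

class FlagStore:
    """Stand-in for a feature-flag client; real systems poll a central store."""
    def __init__(self) -> None:
        self._flags = {"underwriting-model": True}

    def is_enabled(self, name: str) -> bool:
        return self._flags.get(name, False)

    def disable(self, name: str) -> None:  # what the runbook owner invokes
        self._flags[name] = False

def score(features: dict, flags: FlagStore, model_predict) -> dict:
    """Gate every model call behind the flag: check, fall back, audit-log."""
    if not flags.is_enabled("underwriting-model"):
        logger.warning("kill switch active; routing to manual review: %s", features)
        return {"decision": "manual_review"}  # documented manual fallback
    return {"decision": model_predict(features)}

flags = FlagStore()
print(score({"income": 80_000}, flags, lambda f: "approve"))
flags.disable("underwriting-model")  # the 60-minute drill
print(score({"income": 80_000}, flags, lambda f: "approve"))
```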
Myth vs Reality
Myth
"Buying a vendor model transfers the risk to the vendor"
Reality
Regulators and courts hold the deploying organization accountable for outcomes, not the model vendor. Goldman, not Apple's underlying scoring vendor, faced the regulator. Your vendor contract may include indemnification, but it does not protect your brand or your license. Vendor-AI requires the same MRM rigor as in-house โ possibly more, because you have less visibility.
Myth
"GenAI doesn't need governance because there's a human in the loop"
Reality
Human-in-the-loop is a control, not an exemption. Studies show humans rubber-stamp AI recommendations 70-90% of the time once the system has been live for 60+ days (automation bias). The MRM frame still applies: usage policies, output sampling, audit trails, and incident reporting are non-negotiable for any GenAI workflow that touches customers, employees, or regulated decisions.
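One way to make that control measurable is to track the reviewer override rate from decision logs and alarm when it decays toward zero. A minimal sketch, assuming a simple log schema (the field names are illustrative):

```python
def override_rate(decision_log: list[dict]) -> float:
    """Share of cases where the human reviewer changed the AI recommendation.

    A rate drifting toward zero after go-live is the automation-bias
    signal described above, and a cue to step up output sampling.
    """
    if not decision_log:
        return 0.0
    overrides = sum(
        1 for rec in decision_log
        if rec["ai_recommendation"] != rec["human_decision"]
    )
    return overrides / len(decision_log)

log = [
    {"ai_recommendation": "deny", "human_decision": "deny"},
    {"ai_recommendation": "deny", "human_decision": "approve"},  # an override
    {"ai_recommendation": "approve", "human_decision": "approve"},
]
print(f"override rate: {override_rate(log):.0%}")  # 33%
```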
Knowledge Check
Your fraud team's new model flags 14% of transactions for review (vs. 9% for the prior model). Approval rates by zip-code segment show a 22% lower approval rate in the city's two majority-minority zip codes vs. the citywide baseline. The model does NOT use race as a feature. The team wants to ship next week. What do you do?
Industry benchmarks
Calibrate against real-world thresholds. Use these ranges as targets, not absolutes.
Disparate Impact Ratio (Four-Fifths Rule)
EEOC adverse-impact analysis; widely used in credit, hiring, and insurance underwriting models.
- ≥ 0.90: Strong (no disparate-impact concern)
- 0.80-0.90: Acceptable (monitor)
- < 0.80: Presumptive disparate impact
Source: https://www.eeoc.gov/laws/guidance/section-4-statistical-evidence
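A minimal sketch of this check against the thresholds above, assuming decisions have been joined to demographic labels (the column and group names are illustrative):

```python
import pandas as pd

def four_fifths_check(df: pd.DataFrame, group_col: str, selected_col: str) -> dict:
    """Compute per-group selection rates and the disparate-impact ratio.

    selected_col is a 0/1 flag (approved / hired / not flagged); a ratio
    below 0.80 is presumptive disparate impact under the EEOC rule.
    """
    rates = df.groupby(group_col)[selected_col].mean()
    ratio = float(rates.min() / rates.max())
    return {
        "selection_rates": rates.to_dict(),
        "di_ratio": round(ratio, 3),
        "presumptive_disparate_impact": ratio < 0.80,
    }

# Toy example: group B is approved at 45% vs. group A at 60%.
data = pd.DataFrame({
    "group": ["A"] * 100 + ["B"] * 100,
    "approved": [1] * 60 + [0] * 40 + [1] * 45 + [0] * 55,
})
print(four_fifths_check(data, "group", "approved"))
# 0.45 / 0.60 = 0.75 -> below 0.80, fails the four-fifths rule
```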
Real-world cases
Case narratives, one verified and one hypothetical, with the numbers behind the concept.
Apple Card / Goldman Sachs
2019-2020
At launch, the Apple Card credit-decision model, operated by Goldman Sachs, produced systematically lower credit limits for women than for similarly situated men, including spouses sharing finances. After viral public complaints, the New York Department of Financial Services investigated. Goldman's defense, that gender was not a model input, did not end the scrutiny: regulators examine outcomes, not just inputs. The case became the landmark example of why MRM requires output testing for disparate impact regardless of model inputs.
- Trigger: Viral social-media complaints
- Regulator response: NYDFS investigation
- Goldman defense: Gender not a model input (insufficient as a defense)
- Outcome: Process changes, ongoing reputational cost
Disparate-impact testing is required EVEN when protected attributes are not in the model. Correlated features carry the discrimination forward. MRM frameworks make this testing mandatory before launch.
Hypothetical: Mid-sized Health Insurer
2024
A regional health insurer deployed a GenAI prior-authorization assistant to draft denial letters for review by clinical staff. The MRM team classified it Tier 1 from day one despite vendor protests. They required legal review of every prompt template, monthly disparate-impact testing across patient demographics, output sampling, and a documented kill switch. Six months in, monitoring caught a prompt change that increased denial rates for a specific procedure code by 28% in two patient subgroups. The team rolled back within 8 hours. Had the model been classified Tier 3, the disparity would have surfaced months later via patient complaints and regulator inquiries.
- Risk tier: Tier 1
- Drift detection time: 8 hours
- Patient subgroups affected: 2 (caught pre-harm)
- Counterfactual if Tier 3: Months of undetected disparity
The 'human in the loop' does not downgrade risk on regulated decisions. Tiering is the single highest-leverage MRM choice; over-tiering is recoverable, under-tiering is not.
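A sketch of the subgroup monitoring that catches this kind of shift, assuming per-subgroup denial rates are computed on a schedule (the subgroup names and alert threshold are illustrative):

```python
def denial_rate_alerts(baseline: dict[str, float], current: dict[str, float],
                       threshold: float = 0.15) -> list[str]:
    """Flag subgroups whose denial rate rose more than `threshold`
    (relative) versus the validated baseline."""
    alerts = []
    for subgroup, base in baseline.items():
        rise = (current[subgroup] - base) / base
        if rise > threshold:
            alerts.append(f"{subgroup}: denial rate up {rise:.0%} vs baseline")
    return alerts

baseline = {"subgroup_a": 0.10, "subgroup_b": 0.11, "subgroup_c": 0.10}
current = {"subgroup_a": 0.128, "subgroup_b": 0.141, "subgroup_c": 0.10}
print(denial_rate_alerts(baseline, current))
# ['subgroup_a: denial rate up 28% vs baseline',
#  'subgroup_b: denial rate up 28% vs baseline']
```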
Decision scenario
The First MRM Framework
You're the new VP of AI Governance at a $2B insurer with 17 production AI models โ none under formal MRM. The CRO wants a framework live in 90 days. Your team is 4 people. Models range from a customer-churn predictor to underwriting decisioning to a GenAI claims summarizer.
- Production models: 17
- Currently governed: 0
- MRM team size: 4
- Timeline: 90 days
Decision 1
You can't validate all 17 models in 90 days with a 4-person team. How do you prioritize?
Option A: Validate all 17 in parallel with a 'sprint' approach, hiring contractors to scale up and aiming for full coverage in 90 days.
Option B (optimal): Tier all 17 models in week 1. Fully validate the 4 Tier-1 models within 90 days. Apply lighter Tier-2/3 reviews to the rest in parallel. Publish the tiering rationale to the whole organization.
Beyond the concept
Turn Model Risk Management into a live operating decision.
Use this concept as the framing layer, then move into a diagnostic if it maps directly to a current bottleneck.