Model Risk Management
Model Risk Management (MRM) is the discipline that prevents AI systems from causing financial loss, regulatory action, brand damage, or harm to users. It originated in banking under the Federal Reserve's SR 11-7 guidance and has become the operating standard for any organization deploying decision-making models at scale. The MRM framework rests on five pillars: (1) Model Inventory: every production model documented, owned, and tiered by risk. (2) Independent Validation: a team that did NOT build the model reviews it. (3) Performance Monitoring: drift, accuracy, and fairness metrics tracked continuously. (4) Lifecycle Governance: formal approval, change management, and retirement. (5) Issue Management: a register of known limitations with remediation plans. Without MRM, your AI portfolio is a slow-motion lawsuit waiting to happen.
The Trap
The trap is treating MRM as a checkbox exercise that runs AFTER the model is built. By then it's too late: the model is already deployed, the team is committed, and 'validation' becomes a rubber stamp. The deeper trap is assuming GenAI doesn't need MRM because 'it's just a vendor API.' Wrong: any model influencing a decision (hiring, lending, content moderation, customer-tier assignment, fraud) is subject to anti-discrimination law, regulatory scrutiny, and brand risk regardless of whether you trained it. The third trap is over-governance: applying SR 11-7-grade scrutiny to a low-risk recommendation engine and crushing iteration speed. Tier your models or you'll lose the team.
What to Do
Build a tiered MRM operating model: (1) Tier 1 - material decisions such as lending, insurance, hiring, and healthcare: full validation, ongoing monitoring, regulator-ready documentation. (2) Tier 2 - operational decisions such as routing, prioritization, and recommendation: lighter validation, automated drift alerts, quarterly review. (3) Tier 3 - productivity tools such as drafting and summarization with a human in the loop: minimal governance, usage policy and incident reporting only. Stand up a Model Risk Committee that meets monthly, owns the inventory, and signs off on Tier 1 deployments. Require every model, including third-party AI, to have a named owner, a documented purpose, and a kill switch.
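The inventory requirement can live in code as well as in a register. A minimal sketch, assuming Python 3.10+; the fields and tier scheme are illustrative, not a standard schema:

```python
from dataclasses import dataclass
from enum import IntEnum

class RiskTier(IntEnum):
    TIER_1 = 1  # material decisions: full validation, continuous monitoring
    TIER_2 = 2  # operational decisions: lighter validation, drift alerts
    TIER_3 = 3  # productivity tools: usage policy and incident reporting

@dataclass
class ModelRecord:
    """One row in the model inventory; every production model gets one."""
    name: str
    owner: str                      # a named person, not a team alias
    purpose: str                    # documented intended use
    tier: RiskTier
    vendor: str | None = None       # third-party models are inventoried too
    kill_switch_runbook: str = ""   # who can disable it, how, and the fallback

inventory = [
    ModelRecord(
        name="underwriting-decisioning",
        owner="jane.doe",
        purpose="Price and approve personal-lines policies",
        tier=RiskTier.TIER_1,
        kill_switch_runbook="runbooks/underwriting-kill-switch.md",
    ),
]
print(sorted(inventory, key=lambda m: m.tier)[0].name)  # highest-risk first
```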
Formula
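The one formula this framework leans on is the disparate-impact ratio behind the four-fifths rule (see the benchmarks below). In one common operationalization, with $SR_g$ the selection rate for group $g$:

```latex
\[
\text{DI ratio} = \frac{\min_{g} SR_{g}}{\max_{g} SR_{g}},
\qquad
\text{presumptive disparate impact when DI ratio} < 0.80
\]
```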
In Practice
Apple Card's 2019 launch is the canonical case study in failed MRM. Goldman Sachs' credit-decision model offered women significantly lower credit limits than men with similar profiles, including spouses with shared finances. The New York Department of Financial Services investigated. Goldman's defense was that gender was not an input, but a model can still produce discriminatory outputs through correlated features. The lesson MRM enforces: 'we didn't include the protected attribute' is not a defense. You must test outputs for disparate impact across protected classes BEFORE deployment, regardless of inputs. Goldman ultimately changed its credit-decision process and absorbed years of regulatory and reputational cost.
Pro Tips
- 01
Adopt the 'three lines of defense' model from banking: (1) The team that builds the model owns first-line risk. (2) An independent risk function validates and challenges. (3) Internal audit periodically tests both. Without independence between build and validation, your MRM is theater.
- 02
Test for disparate impact on protected classes EVEN WHEN protected attributes are not inputs. Use a held-out validation set with demographic labels and measure approval/score distributions across groups. The four-fifths rule (a selection rate for any group below 80% of the highest group's rate is presumptively discriminatory) is the regulator's first cut; a worked sketch appears under the benchmarks below.
- 03
Every Tier 1 model must have a documented 'kill switch': a runbook for who has authority to disable the model, under what conditions, and the manual fallback process. If you cannot turn it off in 60 minutes, you do not control it.
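In code, the kill switch is usually just a centrally controlled flag checked on every call, with an audited fallback. A minimal sketch; the flag store and model hook here are hypothetical stand-ins for whatever your stack provides:

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("mrm.killswitch")

class FlagStore:
    """Stand-in for a feature-flag client; real systems poll a central store."""
    def __init__(self) -> None:
        self._flags = {"underwriting-model": True}

    def is_enabled(self, name: str) -> bool:
        return self._flags.get(name, False)

    def disable(self, name: str) -> None:  # what the runbook owner invokes
        self._flags[name] = False

def score(features: dict, flags: FlagStore, model_predict) -> dict:
    """Gate every model call behind the flag: check, fall back, audit-log."""
    if not flags.is_enabled("underwriting-model"):
        logger.warning("kill switch active; routing to manual review: %s", features)
        return {"decision": "manual_review"}  # documented manual fallback
    return {"decision": model_predict(features)}

flags = FlagStore()
print(score({"income": 80_000}, flags, lambda f: "approve"))
flags.disable("underwriting-model")  # the 60-minute drill
print(score({"income": 80_000}, flags, lambda f: "approve"))
```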
Myth vs Reality
Myth
"Buying a vendor model transfers the risk to the vendor"
Reality
Regulators and courts hold the deploying organization accountable for outcomes, not the model vendor. Goldman, not Apple's underlying scoring vendor, faced the regulator. Your vendor contract may include indemnification, but it does not protect your brand or your license. Vendor-AI requires the same MRM rigor as in-house โ possibly more, because you have less visibility.
Myth
"GenAI doesn't need governance because there's a human in the loop"
Reality
Human-in-the-loop is a control, not an exemption. Studies show humans rubber-stamp AI recommendations 70-90% of the time once the system has been live for 60+ days (automation bias). The MRM frame still applies: usage policies, output sampling, audit trails, and incident reporting are non-negotiable for any GenAI workflow that touches customers, employees, or regulated decisions.
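One way to make that control measurable is to track the reviewer override rate from decision logs and alarm when it decays toward zero. A minimal sketch, assuming a simple log schema (the field names are illustrative):

```python
def override_rate(decision_log: list[dict]) -> float:
    """Share of cases where the human reviewer changed the AI recommendation.

    A rate drifting toward zero after go-live is the automation-bias
    signal described above, and a cue to step up output sampling.
    """
    if not decision_log:
        return 0.0
    overrides = sum(
        1 for rec in decision_log
        if rec["ai_recommendation"] != rec["human_decision"]
    )
    return overrides / len(decision_log)

log = [
    {"ai_recommendation": "deny", "human_decision": "deny"},
    {"ai_recommendation": "deny", "human_decision": "approve"},  # an override
    {"ai_recommendation": "approve", "human_decision": "approve"},
]
print(f"override rate: {override_rate(log):.0%}")  # 33%
```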
Knowledge Check
Your fraud team's new model flags 14% of transactions for review (vs. 9% for the prior model). Approval rates by zip-code segment show a 22% lower approval rate in the city's two majority-minority zip codes vs. the citywide baseline. The model does NOT use race as a feature. The team wants to ship next week. What do you do?
Industry benchmarks
Calibrate against real-world thresholds. Use these ranges as targets, not absolutes.
Disparate Impact Ratio (Four-Fifths Rule)
EEOC adverse-impact analysis; widely used in credit, hiring, and insurance underwriting models.
- ≥ 0.90: Strong (no disparate-impact concern)
- 0.80-0.90: Acceptable (monitor)
- < 0.80: Presumptive disparate impact
Source: https://www.eeoc.gov/laws/guidance/section-4-statistical-evidence
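A minimal sketch of this check against the thresholds above, assuming decisions have been joined to demographic labels (the column and group names are illustrative):

```python
import pandas as pd

def four_fifths_check(df: pd.DataFrame, group_col: str, selected_col: str) -> dict:
    """Compute per-group selection rates and the disparate-impact ratio.

    selected_col is a 0/1 flag (approved / hired / not flagged); a ratio
    below 0.80 is presumptive disparate impact under the EEOC rule.
    """
    rates = df.groupby(group_col)[selected_col].mean()
    ratio = float(rates.min() / rates.max())
    return {
        "selection_rates": rates.to_dict(),
        "di_ratio": round(ratio, 3),
        "presumptive_disparate_impact": ratio < 0.80,
    }

# Toy example: group B is approved at 45% vs. group A at 60%.
data = pd.DataFrame({
    "group": ["A"] * 100 + ["B"] * 100,
    "approved": [1] * 60 + [0] * 40 + [1] * 45 + [0] * 55,
})
print(four_fifths_check(data, "group", "approved"))
# 0.45 / 0.60 = 0.75 -> below 0.80, fails the four-fifths rule
```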
Real-world cases
Case narratives, one verified and one hypothetical, with the numbers behind the concept.
Apple Card / Goldman Sachs
2019-2020
At launch, the Apple Card credit-decision model, operated by Goldman Sachs, produced systematically lower credit limits for women than for similarly situated men, including spouses sharing finances. After viral public complaints, the New York Department of Financial Services investigated. Goldman's defense, that gender was not a model input, did not end the scrutiny: regulators examine outcomes, not just inputs. The case became the landmark example of why MRM requires output testing for disparate impact regardless of model inputs.
- Trigger: Viral social-media complaints
- Regulator response: NYDFS investigation
- Goldman defense: Gender not a model input (insufficient as a defense)
- Outcome: Process changes, ongoing reputational cost
Disparate-impact testing is required EVEN when protected attributes are not in the model. Correlated features carry the discrimination forward. MRM frameworks make this testing mandatory before launch.
Hypothetical: Mid-sized Health Insurer
2024
A regional health insurer deployed a GenAI prior-authorization assistant to draft denial letters for review by clinical staff. The MRM team classified it Tier 1 from day one despite vendor protests. They required legal review of every prompt template, monthly disparate-impact testing across patient demographics, output sampling, and a documented kill switch. Six months in, monitoring caught a prompt change that increased denial rates for a specific procedure code by 28% in two patient subgroups. The team rolled back within 8 hours. Had the model been classified Tier 3, the disparity would have surfaced months later via patient complaints and regulator inquiries.
- Risk tier: Tier 1
- Drift detection time: 8 hours
- Patient subgroups affected: 2 (caught pre-harm)
- Counterfactual if Tier 3: Months of undetected disparity
The 'human in the loop' does not downgrade risk on regulated decisions. Tiering is the single highest-leverage MRM choice; over-tiering is recoverable, under-tiering is not.
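A sketch of the subgroup monitoring that catches this kind of shift, assuming per-subgroup denial rates are computed on a schedule (the subgroup names and alert threshold are illustrative):

```python
def denial_rate_alerts(baseline: dict[str, float], current: dict[str, float],
                       threshold: float = 0.15) -> list[str]:
    """Flag subgroups whose denial rate rose more than `threshold`
    (relative) versus the validated baseline."""
    alerts = []
    for subgroup, base in baseline.items():
        rise = (current[subgroup] - base) / base
        if rise > threshold:
            alerts.append(f"{subgroup}: denial rate up {rise:.0%} vs baseline")
    return alerts

baseline = {"subgroup_a": 0.10, "subgroup_b": 0.11, "subgroup_c": 0.10}
current = {"subgroup_a": 0.128, "subgroup_b": 0.141, "subgroup_c": 0.10}
print(denial_rate_alerts(baseline, current))
# ['subgroup_a: denial rate up 28% vs baseline',
#  'subgroup_b: denial rate up 28% vs baseline']
```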
Decision scenario
The First MRM Framework
You're the new VP of AI Governance at a $2B insurer with 17 production AI models โ none under formal MRM. The CRO wants a framework live in 90 days. Your team is 4 people. Models range from a customer-churn predictor to underwriting decisioning to a GenAI claims summarizer.
- Production models: 17
- Currently governed: 0
- MRM team size: 4
- Timeline: 90 days
Decision 1
You can't validate all 17 models in 90 days with a 4-person team. How do you prioritize?
Option A: Validate all 17 in parallel with a 'sprint' approach, hiring contractors to scale up and aiming for full coverage in 90 days.
Option B (optimal): Tier all 17 models in week 1. Fully validate the 4 Tier-1 models within 90 days. Apply lighter Tier-2/3 reviews to the rest in parallel. Publish the tiering rationale to the whole organization.
Beyond the concept
Turn Model Risk Management into a live operating decision.
Use this concept as the framing layer, then move into a diagnostic if it maps directly to a current bottleneck.