KnowMBA Advisory
AI Strategy · Advanced · 8 min read

AI Architecture Review

An AI architecture review is a structured, repeatable inspection of an AI system across seven layers: (1) data and retrieval, (2) model selection and routing, (3) prompt and context management, (4) orchestration (agents, chains, workflows), (5) evaluation and observability, (6) safety, security, and guardrails, and (7) cost, latency, and scaling. The review answers three questions every AI system must satisfy before production: does it produce correct outputs at acceptable latency and cost, does it fail safely when components break, and can it be debugged in production by someone who didn't write it. Most AI features ship without a review and discover their architectural weaknesses during incidents.

Also known as: AI System Design Review, GenAI Architecture Audit, LLM Architecture Review, AI Reference Architecture, AI Stack Review

The Trap

The trap is treating an AI feature as 'just another microservice' and reusing a standard service review. AI systems fail differently: silent quality regressions from a vendor model update, prompt injection from untrusted input, retrieval drift as the corpus changes, runaway costs from a chatty agent loop, and cascading failures when one tool call hangs and the orchestrator retries forever. A traditional review checks scalability and 99th percentile latency. An AI review must additionally check eval coverage, fallback paths when the model is degraded, output validation, retry and timeout budgets per tool, and cost guardrails. Skipping AI-specific review items is how teams ship demos and operate disasters.
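To make "retry and timeout budgets per tool" concrete, here is a minimal Python sketch of the idea; the tool names, budget numbers, and the tool_fn interface are hypothetical, and a real orchestrator would add logging, backoff, and circuit breaking.

```python
from dataclasses import dataclass

@dataclass
class ToolBudget:
    timeout_s: float     # hard cap on a single call
    max_retries: int     # bounded retries, never "retry forever"
    max_cost_usd: float  # cost guardrail across all attempts

# Hypothetical per-tool budgets agreed during the review.
BUDGETS = {
    "search_docs": ToolBudget(timeout_s=5.0, max_retries=2, max_cost_usd=0.05),
    "call_llm": ToolBudget(timeout_s=30.0, max_retries=1, max_cost_usd=0.25),
}

def call_with_budget(tool_name, tool_fn, *args):
    """Run one tool call under its reviewed timeout/retry/cost budget."""
    budget = BUDGETS[tool_name]
    spent = 0.0
    for _attempt in range(budget.max_retries + 1):
        try:
            result, cost = tool_fn(*args, timeout=budget.timeout_s)
        except TimeoutError:
            continue  # retry, but only within the bounded budget
        spent += cost
        if spent > budget.max_cost_usd:
            raise RuntimeError(f"{tool_name} exceeded its cost guardrail")
        return result
    raise RuntimeError(f"{tool_name} failed within its retry budget")
```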

What to Do

Run a structured architecture review BEFORE every AI feature ships and at least quarterly thereafter. Use a 7-layer checklist: (1) Data layer: sources, freshness, PII handling. (2) Model layer: selected model, fallback model, version pinning. (3) Prompt and context: token budget, RAG context limits, jailbreak hardening. (4) Orchestration: max iterations, timeouts per tool, idempotency. (5) Eval and observability: offline eval set, online quality monitoring, alerting thresholds. (6) Safety: input filters, output filters, PII redaction, audit log. (7) Cost and scale: per-tenant quota, blast-radius limits, kill switch. Score each layer red/yellow/green and require green on all seven before launch.
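One way to make the checklist operational is to record the review's outcome as data and gate the launch on it. A minimal sketch, assuming a simple red/yellow/green rating per layer (the example ratings are hypothetical):

```python
from enum import IntEnum

class Rating(IntEnum):
    RED = 0
    YELLOW = 1
    GREEN = 2

# The seven layers from the checklist above; ratings come out of the review.
review = {
    "data": Rating.GREEN,
    "model": Rating.GREEN,
    "prompt_context": Rating.YELLOW,   # e.g. jailbreak hardening not yet tested
    "orchestration": Rating.GREEN,
    "eval_observability": Rating.RED,  # e.g. no offline eval set
    "safety": Rating.GREEN,
    "cost_scale": Rating.GREEN,
}

def ready_to_launch(review: dict) -> bool:
    """Launch gate: require GREEN on all seven layers."""
    return all(rating == Rating.GREEN for rating in review.values())
```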

Formula

AI Architecture Score = min(DataLayer, ModelLayer, PromptLayer, Orchestration, EvalObservability, Safety, CostScale). The system is only as strong as its weakest layer.
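Expressed against the sketch above, the formula is just a min() over the layer ratings; a single weak layer caps the score no matter how strong the rest are:

```python
# Continuing the review dict from the sketch above (hypothetical ratings):
architecture_score = min(review.values())
print(architecture_score.name)   # "RED" -- the missing eval set caps the score
print(ready_to_launch(review))   # False: not green on all seven layers
```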

In Practice

AWS publishes the 'Generative AI Lens' for the Well-Architected Framework, defining six pillars (operational excellence, security, reliability, performance, cost, and sustainability) specifically for GenAI workloads, with checklist items covering RAG architecture, model selection, evaluation, and guardrails. Microsoft's Azure Well-Architected guidance for AI workloads and the Azure AI Foundry reference architectures play the same role. NVIDIA's reference architectures for inference and Anthropic's published patterns for agent design (e.g., 'Building Effective Agents') give teams concrete review checklists. Companies that adopt these as the basis for an internal review document, and require sign-off before launch, ship dramatically more reliable AI systems.

Pro Tips

  • 01

Make the review template a living document. Add a row each time an incident reveals a missing check. After 6-12 months, the template becomes the single most valuable asset on your AI team: institutional memory of every way your AI systems can break.

  • 02

Require an explicit 'kill switch' design in every review. How do you turn this AI feature off in 60 seconds without a deploy? If the answer is 'we'd push a config change and wait for a deploy,' you don't have a kill switch; you have a hope. (A minimal sketch of a runtime kill switch follows this list.)

  • 03

    Separate the reviewer from the builder. The team that built the system is least likely to see its architectural weaknesses. Rotate reviews across teams or use an AI platform team as the standing reviewer. The friction is the value.
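For tip 02, here is a minimal sketch of what a real kill switch can look like. The flag location is hypothetical; in practice it would be a feature-flag service or a key in a fast config store, but the essential properties are the same: the flag is read at request time and flipping it requires no deploy.

```python
from pathlib import Path

# Hypothetical flag location; a feature-flag service or config-store key works the same way.
KILL_SWITCH_FLAG = Path("/etc/myapp/ai_assistant_enabled")

def ai_feature_enabled() -> bool:
    """Read the kill switch on every request, never once at process start,
    so flipping it takes effect in seconds without a deploy."""
    try:
        return KILL_SWITCH_FLAG.read_text().strip().lower() == "true"
    except FileNotFoundError:
        return False  # fail closed: a missing flag means the feature is off

def run_ai_assistant(user_message: str) -> str:
    raise NotImplementedError  # the normal model-backed path, omitted here

def handle_request(user_message: str) -> str:
    if not ai_feature_enabled():
        # Degraded-but-safe path: canned response or hand-off to a human queue.
        return "The assistant is temporarily unavailable. A human agent will follow up."
    return run_ai_assistant(user_message)
```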

Myth vs Reality

Myth

“Architecture reviews slow teams down and AI moves too fast for them”

Reality

Teams that skip reviews go faster to first launch and dramatically slower thereafter: they spend the next 6 months patching incidents the review would have caught. A 2-hour review front-loads decisions that would otherwise be made under outage pressure. The teams shipping AI fastest in production almost universally have a lightweight review gate.

Myth

“We use a managed platform (Bedrock, Azure AI Foundry, Vertex) so we don't need an architecture review”

Reality

Managed platforms solve infrastructure, not application architecture. Your prompt design, retrieval strategy, agent orchestration, eval coverage, and guardrail config are still entirely your responsibility, and entirely where most AI failures originate. The platform handles the easy part.
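Because prompt design stays your responsibility on any platform, one small piece of that hygiene is worth showing. A minimal sketch, assuming a typical chat-message request shape: untrusted user input goes in its own message rather than being spliced into the instructions. This is basic hygiene, not a complete prompt-injection defense; input/output filters and output validation are still needed.

```python
def build_messages(system_instructions: str, user_input: str) -> list[dict]:
    """Keep untrusted input in its own message instead of interpolating it
    into the instruction text (never f-string user input into the system prompt)."""
    return [
        {"role": "system", "content": system_instructions},
        {"role": "user", "content": user_input},
    ]
```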

Try it

Run the numbers.

Pressure-test the concept against your own knowledge: answer the challenge or try the live scenario.


Knowledge Check

You're reviewing a new GenAI feature one week before launch. The system has 99.9% uptime in staging, p95 latency of 1.2s, and a $0.008 per-call cost. Which question is MOST important to answer before signing off?

Industry benchmarks

Is your number good?

Calibrate against real-world tiers. Use these ranges as targets, not absolutes.

AI Architecture Review Coverage (Mature Teams)

Enterprises with production GenAI workloads

  • Elite (reviews on every change, signed off by a second team): 100% coverage
  • Good (reviews at launch and quarterly): 80-99%
  • Average (reviews at launch only): 50-79%
  • Weak (ad hoc reviews): 20-49%
  • None (no structured review): < 20%

Source: AWS Generative AI Lens / Microsoft Azure Well-Architected for AI

Real-world cases

Companies that lived this.

Narratives, verified or clearly labeled as hypothetical, with the numbers that prove (or break) the concept.

โ˜๏ธ

AWS Well-Architected Framework: Generative AI Lens
2024 · Outcome: success

AWS published the Generative AI Lens for its Well-Architected Framework, codifying review questions across operational excellence, security, reliability, performance efficiency, cost optimization, and sustainability, all specifically for GenAI workloads. Customer teams use this as the basis for internal architecture reviews. Adoption correlates with materially fewer production incidents because the lens forces teams to answer questions like 'how do you protect against prompt injection' and 'how do you monitor for output quality drift' BEFORE production, not after a postmortem.

Pillars Covered: 6 (operational, security, reliability, performance, cost, sustainability)
GenAI-Specific Questions: 60+ across the lens
Format: Open published checklist + workshop guides

A structured, published reference architecture and review checklist beats every team inventing their own. Adopt one (AWS, Azure, NVIDIA reference) rather than build from scratch.


Hypothetical: Mid-Market FinTech
2025 · Outcome: success

Hypothetical: A mid-market fintech launched an AI assistant for customer support without a structured architecture review. Within 4 months they had three P1 incidents: (1) a vendor model update silently changed refusal behavior and the bot started declining legitimate balance inquiries; (2) a prompt-injected user input caused the agent to leak another customer's account ID; (3) a runaway agent loop drove a single weekend's bill from $400 to $38,000. Post-incident, they adopted the AWS Generative AI Lens as their internal review template, added a kill switch and per-tenant quota, and instituted an offline eval that runs hourly. Over the following 6 months they had zero P1 incidents on AI features.

P1 Incidents Pre-Review: 3 in 4 months
P1 Incidents Post-Review: 0 in 6 months
Cost Spike Prevented (per quota): ~$30K+/month
Time to Implement Review: ~2 weeks

The cost of a 2-hour architecture review is trivial compared to the cost of one P1 incident. Adopt a structured review BEFORE you need one.
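The per-tenant quota in this case can start as something very small: a running spend counter checked before each call. A minimal sketch; the cap value is illustrative, and the in-memory dict stands in for whatever shared store (Redis, a billing service) you actually use:

```python
from collections import defaultdict
from datetime import date

DAILY_CAP_USD = 200.0        # illustrative per-tenant, per-day cap
_spend = defaultdict(float)  # (tenant_id, day) -> spend; use a shared store in production

def allow_call(tenant_id: str, estimated_cost_usd: float) -> bool:
    """Return True if the call may proceed; False if the tenant is over quota."""
    key = (tenant_id, date.today())
    if _spend[key] + estimated_cost_usd > DAILY_CAP_USD:
        return False  # blast radius is capped per tenant per day
    _spend[key] += estimated_cost_usd
    return True
```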

Decision scenario

The Pre-Launch Architecture Review

You're the AI platform lead. A product team wants to launch a GenAI customer-facing assistant in 6 days. You run a 90-minute architecture review and find 4 issues: (1) no offline eval set, (2) no per-tenant cost cap, (3) no fallback model, (4) prompt directly interpolates user input without sanitization. The product team says shipping on time is critical for a board commitment.

Days to Launch: 6
Open Critical Issues: 4
Eval Coverage: 0%
Kill Switch: Yes (only green item)
Board Commitment: On the line

Decision 1

The product VP asks you to sign off and 'fix the issues in a fast-follow.' You know that the prompt-injection issue alone could cause a data leak in week 1 of production.

Option A: Sign off and trust the fast-follow plan; board commitment matters more than theoretical risks.
Outcome: You ship on day 6. On day 11, a prompt-injection attack causes the assistant to reveal another customer's order history in a public review. The incident triggers a press article and a regulatory inquiry. The missing cost cap is discovered when the same week's bill hits $22K. The 'fast-follow' is now an emergency rollback plus an executive escalation. You ship faster and lose the next 6 weeks rebuilding trust.
P1 Incidents: 0 → 2 in week 2 · Trust: damaged with regulators and customers
Option B: Block the launch. Offer to ship in 8-10 days with all 4 issues mitigated to yellow or better. Bring the trade-off to the VP and product owner together, with concrete risk numbers.
Outcome: You spend a tense day defending the position. Engineering pulls in 2 extra people for 4 days. You ship on day 10 with a 50-example offline eval, a $200/tenant/day cost cap, a Claude fallback configured, and a sanitization layer with output validation. The launch is uneventful. The board commitment is honored 4 days late, which the board doesn't notice. You earn credibility for catching issues that would have been existential.
Launch Delay: 6 days → 10 days · Critical Issues at Launch: 4 → 0 · Long-term credibility: Strengthened
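One of the mitigations in the second path is a configured fallback model. A minimal sketch of the routing; the model identifiers and the call_model client function are hypothetical, and the point is simply that the fallback path is designed and exercised before launch rather than improvised during an outage:

```python
PRIMARY_MODEL = "primary-model-v1"    # hypothetical identifiers
FALLBACK_MODEL = "fallback-model-v1"

def generate(prompt: str, call_model) -> str:
    """call_model(model_name, prompt, timeout_s) is an assumed provider client."""
    try:
        return call_model(PRIMARY_MODEL, prompt, timeout_s=20)
    except Exception:
        # Provider degradation, timeout, or hard error: fall back instead of failing the user.
        return call_model(FALLBACK_MODEL, prompt, timeout_s=20)
```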


Beyond the concept

Turn AI Architecture Review into a live operating decision.

Use this concept as the framing layer, then move into a diagnostic if it maps directly to a current bottleneck.

Typical response time: 24h · No retainer required
