KnowMBAAdvisory · AI Strategy · Advanced · 7 min read

AI Context Window Strategy

Context window strategy is how you decide what goes into the model's input window — and, equally important, what does NOT. Modern frontier models offer 200K-1M token windows (Claude, Gemini), but that does not mean you should fill them. Cost scales linearly with input tokens, latency grows with them, and accuracy follows a U-shape: models attend best to the start and end of the context and drop recall in the middle (the 'lost in the middle' effect, Liu et al. 2023). The right strategy is rarely 'stuff everything in.' It's: retrieve the smallest sufficient context, structure it predictably, and use prompt caching to amortize the static portion. A 200K-token prompt that costs you $0.60 per call to ship a 90% answer is worse than a 15K-token RAG prompt that costs $0.05 to ship a 92% answer.

Also known as: Context Window Management · Long Context Strategy · Token Budget Management · Prompt Context Strategy

The Trap

The trap is treating long context as a substitute for retrieval. Teams discover Claude or Gemini can hold an entire 500-page contract and start dumping the whole document into every request, paying full input tokens each time and watching latency balloon to 30+ seconds. Worse, accuracy degrades: the model misses key clauses buried in the middle. The opposite trap is over-pruning context to save tokens, where retrieval misses critical information and the model hallucinates because it's missing the answer entirely. Token budgets should be set per request type, not globally.

What to Do

Define a tiered context strategy: (1) Static system context (cached aggressively — see Anthropic prompt caching, OpenAI prompt caching) — instructions, examples, schemas. (2) Retrieved context (sized to top-K relevance, usually 5-15 chunks of 500-1000 tokens each). (3) Conversation history (summarize after N turns). Set a per-request-type token budget and alert when prompts exceed it. Use prompt caching for the 60-80% of context that is identical across calls — Anthropic's prompt caching offers up to 90% savings on cached input tokens. Re-evaluate quarterly: caching support, model recall curves, and pricing all shift.
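A minimal sketch of that tiering in code, assuming hypothetical retrieve_chunks, summarize, and count_tokens helpers and illustrative budget numbers; the point is the structure (cacheable static tier first, retrieved tier second, summarized history last, budget enforced per request type), not the specific values.

```python
# Illustrative per-request-type token budgets (hypothetical numbers).
TOKEN_BUDGETS = {"support_reply": 16_000, "contract_review": 40_000}

STATIC_SYSTEM_PROMPT = "You are the support copilot. <instructions, examples, schemas>"

def build_context(request_type, query, history, retrieve_chunks, summarize, count_tokens):
    """Assemble a tiered prompt: static (cacheable) + retrieved + history + query."""
    # Tier 1: static system context -- identical across calls, so cacheable.
    static = STATIC_SYSTEM_PROMPT

    # Tier 2: retrieved context -- the smallest sufficient set, not the whole corpus.
    retrieved = "\n\n".join(retrieve_chunks(query, top_k=10))   # ~500-1000 tokens each

    # Tier 3: conversation history -- summarize once it grows past N turns.
    if len(history) > 6:
        history = [summarize(history[:-4])] + history[-4:]

    prompt = {"static": static, "retrieved": retrieved,
              "history": "\n".join(history), "query": query}

    # Enforce the per-request-type budget; alert instead of silently growing.
    total = sum(count_tokens(part) for part in prompt.values())
    if total > TOKEN_BUDGETS[request_type]:
        print(f"WARNING: {request_type} prompt is {total} tokens "
              f"(budget {TOKEN_BUDGETS[request_type]})")
    return prompt
```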

Formula

Cost per Request = (Cached Input Tokens × Cached Price) + (Uncached Input Tokens × Standard Price) + (Output Tokens × Output Price). Cached Price ≈ 0.10 × Standard Price for Anthropic cache reads; see the provider benchmarks below for other discount rates.
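The formula as a small calculator. The prices below are illustrative and the 0.10 multiplier is the Anthropic-style cache-read assumption; swap in your provider's current rates.

```python
def cost_per_request(cached_in, uncached_in, output_tokens,
                     input_price_per_m=3.00,    # $ per million input tokens (illustrative)
                     output_price_per_m=15.00,  # $ per million output tokens (illustrative)
                     cache_read_multiplier=0.10):
    """(Cached × Cached Price) + (Uncached × Standard Price) + (Output × Output Price)."""
    cached_price = input_price_per_m * cache_read_multiplier
    return (cached_in * cached_price
            + uncached_in * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000

# 12K-token cached system prompt + 3K of retrieved/user context + 500 output tokens:
print(f"with caching:    ${cost_per_request(12_000, 3_000, 500):.4f}")   # ~$0.0201
print(f"without caching: ${cost_per_request(0, 15_000, 500):.4f}")       # ~$0.0525
```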

In Practice

Anthropic's Claude prompt caching launched with up to 90% cost reduction and ~85% latency reduction on cached input tokens, with cache lifetimes of 5 minutes (with refresh on use). Production teams using Claude with long static system prompts and frequent calls (coding agents, document analysis pipelines, customer support copilots) routinely report 70-85% input-token spend reductions just from caching the static portion of their context. The savings are biggest exactly where prompts are longest — which is also where the cost problem is worst without caching.
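A sketch of what marking that cache breakpoint looks like with Anthropic's Python SDK; the model name is a placeholder and field names may change, so treat this as illustrative and verify against the current prompt-caching docs.

```python
import anthropic

client = anthropic.Anthropic()   # expects ANTHROPIC_API_KEY in the environment

LONG_STATIC_PROMPT = "<instructions + examples + schemas, identical across calls>"

response = client.messages.create(
    model="claude-sonnet-4-5",   # placeholder; substitute a current model name
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_STATIC_PROMPT,
            # Cache breakpoint: everything up to and including this block is cached.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Summarize the termination clause."}],
)

# Usage reports how much input was written to vs read from the cache.
print(response.usage.cache_creation_input_tokens,
      response.usage.cache_read_input_tokens)
```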

Pro Tips

01. If you're calling a frontier model with a system prompt longer than 1,000 tokens and you're not using prompt caching, you have homework. Cached input runs at roughly 10-50% of the standard input price depending on provider (about 10% for Anthropic cache reads) — up to a 90% line-item reduction on the static portion.

02. The 'lost in the middle' effect is real and persists in 2026 frontier models. Critical instructions should go at the start AND be repeated at the end, not buried in the middle (see the sketch after these tips).

03. Long context is not free even if it 'fits.' A 100K-token prompt at $3/M input tokens costs $0.30 just for input — multiply by request volume and the long-context tax shows up in your invoice fast.
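One way to act on tip 02, sketched below with an illustrative helper (not a library API): repeat the critical instructions verbatim at the top and the bottom of the assembled prompt, keeping the bulky retrieved material in the middle.

```python
CRITICAL_INSTRUCTIONS = (
    "Answer only from the provided excerpts. "
    "If the excerpts do not contain the answer, say so explicitly."
)

def place_instructions(retrieved_chunks, question):
    """Critical instructions go at the start AND the end; the bulk sits in the middle."""
    middle = "\n\n".join(retrieved_chunks)   # the region most prone to being skimmed
    return (
        f"{CRITICAL_INSTRUCTIONS}\n\n"
        f"<excerpts>\n{middle}\n</excerpts>\n\n"
        f"Reminder: {CRITICAL_INSTRUCTIONS}\n\n"
        f"Question: {question}"
    )
```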

Myth vs Reality

Myth: Bigger context windows make RAG obsolete.

Reality: Long context complements RAG; it doesn't replace it. Stuffing 1M tokens of corpus into every request is 100x more expensive than retrieving 10 relevant chunks. Long context is for the rare case where you genuinely need cross-document reasoning across a large set; RAG handles the 95% case where you only need a few relevant passages.

Myth: Models attend equally to all parts of the context window.

Reality: Recall studies (needle-in-a-haystack tests, the RULER benchmark) consistently show degraded recall in the middle 50-70% of long contexts, even on frontier models advertising '1M token' windows. Position your most important information first or last, never buried.

Try it

Run the numbers.

Pressure-test the concept against your own knowledge — answer the challenge below or work through the decision scenario further down.


Knowledge Check

Your customer-support copilot has a 12K-token system prompt (instructions + examples + product knowledge), and handles 50K conversations/month. Each conversation averages 4 LLM turns. You're not using prompt caching. What's the highest-leverage optimization?

Industry benchmarks

Is your number good?

Calibrate against real-world tiers. Use these ranges as targets — not absolutes.

Prompt Caching Discount on Cached Input Tokens (frontier models, 2026)

Approximate discounts on the cached portion of input tokens; pricing changes — verify with provider docs

Anthropic Claude (cache read): ~90% off standard input
OpenAI (cached input): ~50% off standard input
Google Gemini (context caching): ~75% off standard input, plus a storage fee

Source: Anthropic prompt caching docs, OpenAI prompt caching docs, Google Vertex AI context caching docs

Real-world cases

Companies that lived this.

One verified case and one illustrative scenario, with the numbers that prove (or break) the concept.


Anthropic Claude Prompt Caching (2024-2026) · success

Anthropic launched prompt caching offering up to 90% cost reduction and ~85% latency reduction on cached input tokens. Production teams running coding agents, document analysis, and customer support copilots — workloads with long, repeated system prompts — publicly reported 70-85% reductions in their input-token line item. The change required no model swap and no prompt rewrite, only marking the cached portion of the prompt with a cache control breakpoint. The wider lesson: large input prompts went from a structural cost problem to a structural cost advantage.

Cached Token Discount: Up to 90%
Latency Reduction on Cached Reads: ~85%
Cache TTL: 5 min (refreshes on use)
Typical Realized Savings (long-prompt apps): 70-85% of input spend

When the static portion of your prompt is large (5K+ tokens) and your call volume is high, prompt caching is the single highest-leverage configuration change available. The savings come from architecture, not from worse outputs.


Hypothetical: Internal Coding Assistant (2025) · success

Hypothetical: A 400-engineer org rolled out an internal coding copilot built on a frontier API. The system prompt was 18K tokens (style guide, repo conventions, common patterns, 10 examples). At 80 calls/engineer/day, monthly input spend hit $94K. Engineering enabled prompt caching on the static 18K-token system prompt with a one-day change. Spend on input tokens dropped to ~$15K/month within a week — output costs unchanged.

Engineers: 400
Calls/Engineer/Day: 80
Static System Prompt: 18K tokens
Input Spend (before): $94K/month
Input Spend (after caching): ~$15K/month

Hypothetical: Internal AI tools with long fixed system prompts have outsized caching upside because the static portion dwarfs the dynamic portion of every call.
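A back-of-the-envelope recomputation of the case above. The $5/M input price, ~1K dynamic tokens per call, and 30 active days per month are assumptions chosen to roughly reproduce the stated figures; treat the output as an order-of-magnitude check, not a quote.

```python
def monthly_input_spend(calls, static_tokens, dynamic_tokens,
                        input_price_per_m, cache_read_multiplier=None):
    """Monthly input-token spend; cache_read_multiplier=None means no caching."""
    mult = 1.0 if cache_read_multiplier is None else cache_read_multiplier
    static_cost = calls * static_tokens * input_price_per_m * mult
    dynamic_cost = calls * dynamic_tokens * input_price_per_m      # always full price
    return (static_cost + dynamic_cost) / 1_000_000

calls_per_month = 400 * 80 * 30            # 400 engineers x 80 calls/day x ~30 days
before = monthly_input_spend(calls_per_month, 18_000, 1_000, 5.00)        # no caching
after = monthly_input_spend(calls_per_month, 18_000, 1_000, 5.00, 0.10)   # cached static tier
print(f"before: ~${before:,.0f}/month, after: ~${after:,.0f}/month")
# -> roughly $91K before and $13K after, in the same ballpark as the case's $94K -> ~$15K
```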

Decision scenario

Long Context vs RAG Decision

You're building a 'chat with our 2,000-page product manual' feature. Your engineer proposes loading the entire manual (~600K tokens) into Claude's 1M context window for every query, citing simplicity. The CFO has separately asked for a unit-cost forecast. You're choosing the architecture this week.

Manual Size: ~600K tokens
Expected Queries/Day: 10,000
Input Token Price: $3/M tokens
Avg Output: 500 tokens

Decision 1

If you stuff the full manual into every request, that's 600K input tokens × 10K queries/day. Even with caching, you're paying real money — and accuracy on buried passages may suffer.

Option A: Stuff the full 600K-token manual into every request — simplest implementation, no retrieval system to build.
Outcome: Daily input cost: 600K × 10K × $3/M = $18,000/day = $540K/month. Even with prompt caching cutting the static portion to ~10%, you're still spending $54K-$80K/month, and the 'lost in the middle' problem means accuracy on questions about the manual's middle sections is noticeably worse than on intro/conclusion content. Engineers ship faster, but the unit economics force a rewrite within 6 months.
Monthly Input Spend: $0 → $54K-$540K depending on caching
Accuracy on Mid-Manual Queries: Degraded ~5-10pp vs RAG baseline

Option B: Build RAG: chunk the manual, embed once (batch), retrieve top-10 chunks per query (~8K tokens), use a cached system prompt.
Outcome: Daily input cost: ~8K retrieved tokens × 10K queries + a 2K cached system prompt at 10% ≈ $240/day ≈ $7.2K/month. Accuracy on targeted questions is higher than full-context stuffing because the model sees only relevant content. Engineering effort is one extra week to build the retrieval layer; payback is two days of operating cost.
Monthly Input Spend: $0 → ~$7K
Accuracy on Targeted Queries: Improved (no lost-in-middle effect)
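The same trade-off as a quick script, using only the scenario's stated inputs; the ~10% cache-read price is the same Anthropic-style assumption used earlier, and the text's ~$240/day figure is this calculation rounded down.

```python
PRICE_PER_TOKEN = 3.00 / 1_000_000     # $3 per million input tokens (scenario input)
QUERIES_PER_DAY = 10_000
CACHE_READ = 0.10                      # assumed cache-read multiplier

# Option A: stuff the whole ~600K-token manual into every request.
full_daily = 600_000 * QUERIES_PER_DAY * PRICE_PER_TOKEN     # $18,000/day
full_daily_cached = full_daily * CACHE_READ                  # ~$1,800/day if fully cache-read

# Option B: RAG -- ~8K retrieved tokens + a 2K cached system prompt per request.
rag_daily = (8_000 + 2_000 * CACHE_READ) * PRICE_PER_TOKEN * QUERIES_PER_DAY   # ~$246/day

for label, daily in [("full context", full_daily),
                     ("full context, cached", full_daily_cached),
                     ("RAG + cached prompt", rag_daily)]:
    print(f"{label:>22}: ${daily:>9,.0f}/day   ~${daily * 30:>11,.0f}/month")
```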


Beyond the concept

Turn AI Context Window Strategy into a live operating decision.

Use this concept as the framing layer, then move into a diagnostic if it maps directly to a current bottleneck.

Typical response time: 24h · No retainer required
