AI Cost Modeling
AI cost modeling is the practice of forecasting and tracking the true unit economics of an AI system, where pricing is non-linear, usage scales unpredictably, and hidden costs (data prep, evaluation, governance, observability) routinely double the headline vendor invoice. The modern AI cost stack has six components: (1) Inference cost (per-token API or per-call infra). (2) Embedding and vector storage. (3) Data prep and curation. (4) Evaluation infrastructure (test suites, LLM-as-judge spend). (5) Observability and logging. (6) People: prompt engineers, ML engineers, governance staff. Most teams budget only #1 and discover the others at scale. Honest cost models compute cost-per-OUTCOME (per-resolved-ticket, per-correct-answer, per-converted-lead), not cost-per-call.
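The six-component stack can be made concrete with a minimal sketch. Every dollar figure below is an illustrative assumption, not a benchmark; the point is that the headline vendor invoice is only one line of the total.

```python
# Hypothetical monthly cost stack for a production AI system.
# All figures are illustrative assumptions.
monthly_costs = {
    "inference": 18_000,         # (1) per-token API / per-call infra
    "embedding_storage": 2_500,  # (2) embeddings + vector DB
    "data_prep": 4_000,          # (3) ongoing curation
    "evaluation": 7_000,         # (4) test suites, LLM-as-judge spend
    "observability": 2_000,      # (5) logging, tracing
    "people": 9_000,             # (6) allocated engineer/governance time
}

tco = sum(monthly_costs.values())
outcomes = 40_000  # e.g. resolved tickets this month (assumed)

print(f"Headline vendor bill: ${monthly_costs['inference']:,}")
print(f"True TCO:             ${tco:,}")
print(f"Cost per outcome:     ${tco / outcomes:.4f}")
```

Note how the non-inference lines more than double the headline bill, which is exactly the pattern the paragraph describes.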
The Trap
The trap is the 'demo budget': a $200/month OpenAI bill from prototyping that becomes a $40,000/month bill three months after launch when usage scales. The second trap is forgetting that LLM costs are per-token and prompts grow: a typical RAG application's prompt grows from 500 tokens at launch to 4,000+ tokens within 6 months as developers add few-shot examples, system prompts, and tool definitions. Cost-per-call rises 8x without anyone noticing. The third trap is ignoring eval cost: running an LLM-as-judge on every output for quality assurance can cost MORE than the production inference itself. The fourth: budgeting for the average and being killed by the long tail, where 5% of users may drive 60% of inference cost.
What to Do
Build a complete AI cost model BEFORE production: (1) Estimate inference volume at production scale, not pilot scale. (2) Model token usage including system prompt, RAG context, examples, output. (3) Add eval cost (typically 30-100% of production inference). (4) Add observability/logging. (5) Add data prep (ongoing, not just initial). (6) Add people overhead (10-25% of run cost). (7) Compute cost-per-outcome: divide TCO by the number of business outcomes produced. (8) Set up daily cost alerts and tier limits. (9) Re-model quarterly as model prices fall and prompt complexity rises. Treat AI infra cost like AWS: without active FinOps, it grows unboundedly.
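The steps above can be sketched as a pre-production projection. All rates, volumes, and overhead percentages here are illustrative assumptions; plug in your own telemetry.

```python
# Pre-production cost model following the numbered steps above.
# Every rate and volume is an illustrative assumption.
PRICE_IN = 0.00015 / 1000   # $/input token (assumed API rate)
PRICE_OUT = 0.0006 / 1000   # $/output token (assumed)

calls_per_day = 50_000                # (1) production scale, not pilot scale
tokens_in = 300 + 2_500 + 800         # (2) system prompt + RAG context + examples
tokens_out = 400                      # (2) typical output length

per_call = tokens_in * PRICE_IN + tokens_out * PRICE_OUT
inference = calls_per_day * 30 * per_call
evaluation = inference * 0.50         # (3) eval at 50% of inference
observability = inference * 0.10      # (4) logging/tracing
data_prep = 3_000                     # (5) ongoing monthly curation
run_cost = inference + evaluation + observability + data_prep
people = run_cost * 0.20              # (6) people overhead at 20%
tco = run_cost + people

outcomes = 35_000                     # (7) business outcomes per month (assumed)
print(f"Monthly TCO: ${tco:,.0f}; cost per outcome: ${tco / outcomes:.3f}")
```

Re-running this model quarterly (step 9) is cheap; the expensive mistake is freezing the 2023 version of `PRICE_IN` into an annual budget.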
Formula
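The cost-per-outcome formula, as defined in step (7) of the list above:

```
Cost per outcome = TCO / Outcomes
                 = (Inference + Evaluation + Observability + Data prep + People)
                   / (business outcomes produced in the same period)
```

TCO and outcomes must cover the same time window (typically one month), or the ratio is meaningless.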
In Practice
OpenAI's input-token pricing collapsed from $0.03 per 1K tokens at GPT-4's launch in March 2023 to $0.00015 per 1K tokens for GPT-4o-mini by late 2024, a roughly 200x reduction in raw token cost in under 18 months. Companies that built cost models assuming 2023 pricing dramatically over-budgeted; companies that signed long-term contracts at 2023 pricing dramatically over-paid. Klarna's publicly disclosed AI economics show one pattern that worked: they architected for portability, kept contract terms short, and re-modeled cost quarterly. Their customer service AI, reported to do the work of 700 FTEs, delivers a $40M annual profit lift partly because they captured the price collapse rather than being locked into early-2023 economics.
Pro Tips
- 01
Always compute and report cost-per-OUTCOME, not cost-per-call. A $0.005 per-call inference cost means nothing on its own. The right metric is cost-per-resolved-ticket, cost-per-correct-classification, cost-per-document-processed. Vendor dashboards rarely show this: build it yourself.
- 02
Build a 'prompt size' regression. Track average input + output tokens per call WEEKLY. Engineering teams add few-shot examples and tool definitions over time, doubling prompt size every 2-3 months. Without monitoring, your cost-per-call quietly doubles or triples post-launch.
- 03
Tier user spend caps. Set per-user, per-team, and per-tenant token quotas with hard cutoffs and grace tiers. Without quotas, the 5% power-user tail will routinely drive 50-70% of inference cost; sometimes a single user generating a script that loops over your API can spike a daily bill by 100x.
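A hedged sketch of the quota mechanism: a soft limit that triggers a grace tier (throttle, alert) and a hard limit that blocks calls outright. The thresholds and class shape are illustrative assumptions, not a prescribed design.

```python
# Minimal per-user token quota with a grace tier and a hard cutoff.
# Limits are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class TokenQuota:
    soft_limit: int = 200_000   # daily tokens before throttling (grace tier)
    hard_limit: int = 500_000   # daily tokens before hard cutoff
    used: int = 0

    def check(self, requested: int) -> str:
        projected = self.used + requested
        if projected > self.hard_limit:
            return "reject"          # hard cutoff: block the call
        self.used = projected
        if projected > self.soft_limit:
            return "throttle"        # grace tier: degrade, alert, queue
        return "allow"

q = TokenQuota()
assert q.check(150_000) == "allow"
assert q.check(100_000) == "throttle"  # 250k exceeds the soft limit
assert q.check(300_000) == "reject"    # would exceed the 500k hard limit
```

In production the counter would live in a shared store (Redis, a rate-limit service) and reset daily, but the tiered allow/throttle/reject decision is the core of the control.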
Myth vs Reality
Myth
"Inference cost is the dominant AI cost"
Reality
For mature production systems, inference is typically 30-50% of TCO. The rest is people (15-30%), evaluation infrastructure (10-20%), data prep (10-20%), and observability (5-10%). Teams that budget only inference dramatically under-fund the operation and discover the gap mid-quarter.
Myth
"Self-hosting open-source models is always cheaper than API providers"
Reality
Self-hosted Llama 3.1 70B on commodity GPUs typically costs $0.0008-$0.002 per 1K tokens depending on utilization: competitive with frontier APIs, but only at high utilization (>40%). Most enterprises run at 5-15% utilization, where API pricing is dramatically cheaper. Self-hosting wins at scale and for compliance, not for the median use case.
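The utilization effect is simple arithmetic: the GPU bill is fixed, so idle capacity inflates the effective per-token cost. A back-of-envelope sketch, where the node price, throughput, and blended API rate are all assumptions chosen to land inside the ranges quoted above:

```python
# Break-even sketch: self-hosting vs API at varying utilization.
# GPU node cost, throughput, and API rate are illustrative assumptions.
gpu_hourly = 8.0                 # $/hr for a node serving a 70B model (assumed)
peak_tokens_per_hr = 10_000_000  # node throughput at full load (assumed)
api_price_per_1k = 0.0009        # blended API $/1K tokens (assumed)

def self_host_cost_per_1k(utilization: float) -> float:
    # The node costs the same whether busy or idle, so effective
    # $/token scales inversely with utilization.
    return gpu_hourly / (peak_tokens_per_hr * utilization) * 1000

for u in (0.10, 0.40, 0.80):
    cost = self_host_cost_per_1k(u)
    print(f"{u:.0%} utilization: ${cost:.4f}/1K  (API: ${api_price_per_1k}/1K)")
```

Under these assumptions, self-hosting costs $0.002/1K at 40% utilization but $0.008/1K at 10%, which is why the median low-utilization enterprise is better off on the API.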
Try it
Run the numbers.
Pressure-test the concept against your own knowledge: answer the challenge or try the live scenario.
Knowledge Check
Your engineering team prototyped a GenAI feature using 800 input + 200 output tokens per call. Pilot showed $1,200/month for 200 daily calls. They project $6,000/month at production volume of 1,000 daily calls. What's MOST likely wrong with the projection?
Industry benchmarks
Is your number good?
Calibrate against real-world tiers. Use these ranges as targets, not absolutes.
AI TCO Composition (Production Workloads)
Mature production AI systems in mid-to-large enterprises
Inference (model API or self-hosted)
30-50%
People (engineering + governance)
15-30%
Evaluation infrastructure
10-20%
Data prep + ongoing curation
10-20%
Observability + logging
5-10%
Source: Synthesis of a16z and Menlo Ventures enterprise AI cost surveys 2024
Real-world cases
Companies that lived this.
Verified narratives with the numbers that prove (or break) the concept.
Frontier Model Pricing Collapse (OpenAI/Anthropic)
2023-2024
OpenAI's GPT-4 pricing dropped from $0.03/1K input tokens at launch (March 2023) to GPT-4o-mini at $0.00015/1K input tokens by late 2024, a 200x reduction. Anthropic similarly reduced Claude pricing as model efficiency improved. Companies that signed multi-year contracts at 2023 prices over-paid by 5-10x; companies with shorter contracts and active cost monitoring captured the savings.
GPT-4 Input Token Price (Mar 2023)
$0.03 / 1K
GPT-4o-mini Input Price (Late 2024)
$0.00015 / 1K
Effective Reduction
~200x in 18 months
Implication
Re-model AI cost quarterly
Model pricing is collapsing faster than enterprise budgeting cycles. Re-model AI costs quarterly and avoid long pricing commitments.
Hypothetical: Series D B2B SaaS
2024
Hypothetical: A Series D B2B SaaS launched a GenAI feature with a $4,000/month projected inference budget. Within 90 days, monthly spend hit $58,000, a 14x overage. Root cause analysis found: (1) average prompt size had grown from 1,200 to 4,800 tokens due to engineering adding RAG context and few-shot examples. (2) An LLM-as-judge eval pipeline had been deployed at 100% sampling rate (planned 10%). (3) Two power users had built integrations that hammered the API in retry loops. After implementing per-user quotas, prompt-size monitoring, and dialing back eval to 10% sampling, monthly spend dropped to $11,000, still above the original budget but within tolerable range.
Original Budget
$4,000/month
Peak Spend
$58,000/month
Post-Mitigation
$11,000/month
Cost Drivers (in order)
Prompt growth, eval misconfig, power users
Without active AI FinOps (quotas, prompt-size monitoring, eval sampling controls), production AI costs grow non-linearly with usage. Set up monitoring and quotas BEFORE launch.
Related concepts
Keep connecting.
The concepts that orbit this one: each one sharpens the others.
Beyond the concept
Turn AI Cost Modeling into a live operating decision.
Use this concept as the framing layer, then move into a diagnostic if it maps directly to a current bottleneck.
Typical response time: 24h · No retainer required