AI Cost Modeling
AI cost modeling is the practice of forecasting and tracking the true unit economics of an AI system, where pricing is non-linear, usage scales unpredictably, and hidden costs (data prep, evaluation, governance, observability) routinely double the headline vendor invoice. The modern AI cost stack has six components: (1) Inference cost (per-token API or per-call infra). (2) Embedding and vector storage. (3) Data prep and curation. (4) Evaluation infrastructure (test suites, LLM-as-judge spend). (5) Observability and logging. (6) People: prompt engineers, ML engineers, governance staff. Most teams budget only #1 and discover the others at scale. Honest cost models compute cost-per-OUTCOME (per-resolved-ticket, per-correct-answer, per-converted-lead), not cost-per-call.
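The six-component stack can be made concrete with a minimal sketch. Every dollar figure below is an illustrative assumption, not a benchmark; the point is that the headline vendor invoice is only one line of the total.

```python
# Hypothetical monthly cost stack for a production AI system.
# All figures are illustrative assumptions.
monthly_costs = {
    "inference": 18_000,         # (1) per-token API / per-call infra
    "embedding_storage": 2_500,  # (2) embeddings + vector DB
    "data_prep": 4_000,          # (3) ongoing curation
    "evaluation": 7_000,         # (4) test suites, LLM-as-judge spend
    "observability": 2_000,      # (5) logging, tracing
    "people": 9_000,             # (6) allocated engineer/governance time
}

tco = sum(monthly_costs.values())
outcomes = 40_000  # e.g. resolved tickets this month (assumed)

print(f"Headline vendor bill: ${monthly_costs['inference']:,}")
print(f"True TCO:             ${tco:,}")
print(f"Cost per outcome:     ${tco / outcomes:.4f}")
```

Note how the non-inference lines more than double the headline bill, which is exactly the pattern the paragraph describes.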
The Trap
The trap is the 'demo budget': a $200/month OpenAI bill from prototyping that becomes a $40,000/month bill three months after launch when usage scales. The second trap is forgetting that LLM costs are per-token and prompts grow: a typical RAG application's prompt grows from 500 tokens at launch to 4,000+ tokens within 6 months as developers add few-shot examples, system prompts, and tool definitions. Cost-per-call rises 8x without anyone noticing. The third trap is ignoring eval cost: running an LLM-as-judge on every output for quality assurance can cost MORE than the production inference itself. The fourth: budgeting for the average and being killed by the long tail, where 5% of users may drive 60% of inference cost.
What to Do
Build a complete AI cost model BEFORE production: (1) Estimate inference volume at production scale, not pilot scale. (2) Model token usage including system prompt, RAG context, examples, output. (3) Add eval cost (typically 30-100% of production inference). (4) Add observability/logging. (5) Add data prep (ongoing, not just initial). (6) Add people overhead (10-25% of run cost). (7) Compute cost-per-outcome: divide TCO by the number of business outcomes produced. (8) Set up daily cost alerts and tier limits. (9) Re-model quarterly as model prices fall and prompt complexity rises. Treat AI infra cost like AWS: without active FinOps, it grows unboundedly.
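The steps above can be sketched as a pre-production projection. All rates, volumes, and overhead percentages here are illustrative assumptions; plug in your own telemetry.

```python
# Pre-production cost model following the numbered steps above.
# Every rate and volume is an illustrative assumption.
PRICE_IN = 0.00015 / 1000   # $/input token (assumed API rate)
PRICE_OUT = 0.0006 / 1000   # $/output token (assumed)

calls_per_day = 50_000                # (1) production scale, not pilot scale
tokens_in = 300 + 2_500 + 800         # (2) system prompt + RAG context + examples
tokens_out = 400                      # (2) typical output length

per_call = tokens_in * PRICE_IN + tokens_out * PRICE_OUT
inference = calls_per_day * 30 * per_call
evaluation = inference * 0.50         # (3) eval at 50% of inference
observability = inference * 0.10      # (4) logging/tracing
data_prep = 3_000                     # (5) ongoing monthly curation
run_cost = inference + evaluation + observability + data_prep
people = run_cost * 0.20              # (6) people overhead at 20%
tco = run_cost + people

outcomes = 35_000                     # (7) business outcomes per month (assumed)
print(f"Monthly TCO: ${tco:,.0f}; cost per outcome: ${tco / outcomes:.3f}")
```

Re-running this model quarterly (step 9) is cheap; the expensive mistake is freezing the 2023 version of `PRICE_IN` into an annual budget.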
Formula
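The cost-per-outcome formula, as defined in step (7) of the list above:

```
Cost per outcome = TCO / Outcomes
                 = (Inference + Evaluation + Observability + Data prep + People)
                   / (business outcomes produced in the same period)
```

TCO and outcomes must cover the same time window (typically one month), or the ratio is meaningless.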
In Practice
OpenAI's input-token pricing collapsed from $0.03 per 1K tokens at GPT-4's launch in March 2023 to $0.00015 per 1K tokens for GPT-4o-mini by late 2024, a roughly 200x reduction in raw token cost in under 18 months. Companies that built cost models assuming 2023 pricing dramatically over-budgeted; companies that signed long-term contracts at 2023 pricing dramatically over-paid. Klarna's publicly disclosed AI economics show one pattern that worked: they architected for portability, kept contract terms short, and re-modeled cost quarterly. Their customer service AI, reported to do the work of 700 FTEs, delivers a $40M annual profit lift partly because they captured the price collapse rather than being locked into early-2023 economics.
Pro Tips
- 01
Always compute and report cost-per-OUTCOME, not cost-per-call. A $0.005 per-call inference cost means nothing on its own. The right metric is cost-per-resolved-ticket, cost-per-correct-classification, cost-per-document-processed. Vendor dashboards rarely show this: build it yourself.
- 02
Build a 'prompt size' regression. Track average input + output tokens per call WEEKLY. Engineering teams add few-shot examples and tool definitions over time, doubling prompt size every 2-3 months. Without monitoring, your cost-per-call quietly doubles or triples post-launch.
- 03
Tier user spend caps. Set per-user, per-team, and per-tenant token quotas with hard cutoffs and grace tiers. Without quotas, the 5% power-user tail will routinely drive 50-70% of inference cost; sometimes a single user generating a script that loops over your API can spike a daily bill by 100x.
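A hedged sketch of the quota mechanism: a soft limit that triggers a grace tier (throttle, alert) and a hard limit that blocks calls outright. The thresholds and class shape are illustrative assumptions, not a prescribed design.

```python
# Minimal per-user token quota with a grace tier and a hard cutoff.
# Limits are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class TokenQuota:
    soft_limit: int = 200_000   # daily tokens before throttling (grace tier)
    hard_limit: int = 500_000   # daily tokens before hard cutoff
    used: int = 0

    def check(self, requested: int) -> str:
        projected = self.used + requested
        if projected > self.hard_limit:
            return "reject"          # hard cutoff: block the call
        self.used = projected
        if projected > self.soft_limit:
            return "throttle"        # grace tier: degrade, alert, queue
        return "allow"

q = TokenQuota()
assert q.check(150_000) == "allow"
assert q.check(100_000) == "throttle"  # 250k exceeds the soft limit
assert q.check(300_000) == "reject"    # would exceed the 500k hard limit
```

In production the counter would live in a shared store (Redis, a rate-limit service) and reset daily, but the tiered allow/throttle/reject decision is the core of the control.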
Myth vs Reality
Myth
"Inference cost is the dominant AI cost"
Reality
For mature production systems, inference is typically 30-50% of TCO. The rest is people (15-30%), evaluation infrastructure (10-20%), data prep (10-20%), and observability (5-10%). Teams that budget only inference dramatically under-fund the operation and discover the gap mid-quarter.
Myth
"Self-hosting open-source models is always cheaper than API providers"
Reality
Self-hosted Llama 3.1 70B on commodity GPUs typically costs $0.0008-$0.002 per 1K tokens depending on utilization: competitive with frontier APIs, but only at high utilization (>40%). Most enterprises run at 5-15% utilization, where API pricing is dramatically cheaper. Self-hosting wins at scale and for compliance, not for the median use case.
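The utilization effect is simple arithmetic: the GPU bill is fixed, so idle capacity inflates the effective per-token cost. A back-of-envelope sketch, where the node price, throughput, and blended API rate are all assumptions chosen to land inside the ranges quoted above:

```python
# Break-even sketch: self-hosting vs API at varying utilization.
# GPU node cost, throughput, and API rate are illustrative assumptions.
gpu_hourly = 8.0                 # $/hr for a node serving a 70B model (assumed)
peak_tokens_per_hr = 10_000_000  # node throughput at full load (assumed)
api_price_per_1k = 0.0009        # blended API $/1K tokens (assumed)

def self_host_cost_per_1k(utilization: float) -> float:
    # The node costs the same whether busy or idle, so effective
    # $/token scales inversely with utilization.
    return gpu_hourly / (peak_tokens_per_hr * utilization) * 1000

for u in (0.10, 0.40, 0.80):
    cost = self_host_cost_per_1k(u)
    print(f"{u:.0%} utilization: ${cost:.4f}/1K  (API: ${api_price_per_1k}/1K)")
```

Under these assumptions, self-hosting costs $0.002/1K at 40% utilization but $0.008/1K at 10%, which is why the median low-utilization enterprise is better off on the API.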
Try it
Run the numbers.
Pressure-test the concept against your own knowledge: answer the challenge or try the live scenario.
Knowledge Check
Your engineering team prototyped a GenAI feature using 800 input + 200 output tokens per call. Pilot showed $1,200/month for 200 daily calls. They project $6,000/month at production volume of 1,000 daily calls. What's MOST likely wrong with the projection?
Industry benchmarks
Is your number good?
Calibrate against real-world tiers. Use these ranges as targets, not absolutes.
AI TCO Composition (Production Workloads)
Mature production AI systems in mid-to-large enterprises
Inference (model API or self-hosted)
30-50%
People (engineering + governance)
15-30%
Evaluation infrastructure
10-20%
Data prep + ongoing curation
10-20%
Observability + logging
5-10%
Source: Synthesis of a16z and Menlo Ventures enterprise AI cost surveys 2024
Real-world cases
Companies that lived this.
Verified narratives with the numbers that prove (or break) the concept.
Frontier Model Pricing Collapse (OpenAI/Anthropic)
2023-2024
OpenAI's GPT-4 pricing dropped from $0.03/1K input tokens at launch (March 2023) to GPT-4o-mini at $0.00015/1K input tokens by late 2024, a 200x reduction. Anthropic similarly reduced Claude pricing as model efficiency improved. Companies that signed multi-year contracts at 2023 prices over-paid by 5-10x; companies with shorter contracts and active cost monitoring captured the savings.
GPT-4 Input Token Price (Mar 2023)
$0.03 / 1K
GPT-4o-mini Input Price (Late 2024)
$0.00015 / 1K
Effective Reduction
~200x in 18 months
Implication
Re-model AI cost quarterly
Model pricing is collapsing faster than enterprise budgeting cycles. Re-model AI costs quarterly and avoid long pricing commitments.
Hypothetical: Series D B2B SaaS
2024
Hypothetical: A Series D B2B SaaS launched a GenAI feature with a $4,000/month projected inference budget. Within 90 days, monthly spend hit $58,000, a 14x overage. Root cause analysis found: (1) average prompt size had grown from 1,200 to 4,800 tokens due to engineering adding RAG context and few-shot examples. (2) An LLM-as-judge eval pipeline had been deployed at 100% sampling rate (planned 10%). (3) Two power users had built integrations that hammered the API in retry loops. After implementing per-user quotas, prompt-size monitoring, and dialing back eval to 10% sampling, monthly spend dropped to $11,000, still above the original budget but within tolerable range.
Original Budget
$4,000/month
Peak Spend
$58,000/month
Post-Mitigation
$11,000/month
Cost Drivers (in order)
Prompt growth, eval misconfig, power users
Without active AI FinOps (quotas, prompt-size monitoring, eval sampling controls), production AI costs grow non-linearly with usage. Set up monitoring and quotas BEFORE launch.
Related concepts
Keep connecting.
The concepts that orbit this one: each one sharpens the others.
Beyond the concept
Turn AI Cost Modeling into a live operating decision.
Use this concept as the framing layer, then move into a diagnostic if it maps directly to a current bottleneck.
Typical response time: 24h · No retainer required