Foundation Model Selection
Foundation model selection is the disciplined process of choosing which base LLM (or LLMs) to power your AI features given task requirements, latency targets, cost ceiling, deployment constraints (cloud / on-prem / regional), data sensitivity, and vendor risk tolerance. The right answer is rarely 'the best model'; it's a portfolio: a frontier model for hard or open-ended tasks, a mid-tier model for the majority of throughput, and a small/cheap model (or open-weight model) for high-volume bounded tasks. Single-model strategies are brittle to vendor pricing changes, capability shifts, and outages.
The Trap
Two common traps. First: defaulting to whichever vendor your CTO has a relationship with, regardless of task fit. The result is using a $15/M-input frontier model for tasks a $0.50/M-input model could handle, and burning $400K/year unnecessarily. Second: chasing every benchmark leaderboard. The model that wins on MMLU may underperform on your specific use case (long-context summarization, structured extraction, function calling). Benchmarks are a starting point; your evaluation set is the ground truth.
What to Do
Build a 50-200 example evaluation set covering your real use cases. Run candidate models against it, scoring on: accuracy, latency (p50/p95), cost per task, and qualitative fit (tone, structure, refusal behavior). Then add non-functional criteria: data residency, on-prem/private deployment options, content filter behavior, vendor financial stability, and contract terms (data usage, indemnification, deprecation policy). Make a ranked recommendation per use case tier. Re-evaluate every 6 months; the model landscape moves fast, and yesterday's best can become uneconomic vs. a newer entrant.
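The evaluation loop above can be sketched in a few lines. This is a minimal harness under assumptions: the `call_model` interface (returning an answer plus latency) and the stubbed model are hypothetical, standing in for a real vendor API call.

```python
def score_model(call_model, eval_set, price_per_task):
    """Summarize one candidate model over a labeled eval set.

    call_model: callable(prompt) -> (answer, latency_seconds)  # hypothetical interface
    eval_set:   list of (prompt, expected_answer) pairs
    """
    correct, latencies = 0, []
    for prompt, expected in eval_set:
        answer, latency = call_model(prompt)
        correct += answer == expected
        latencies.append(latency)
    latencies.sort()
    return {
        "accuracy": correct / len(eval_set),
        "p50_latency_s": latencies[len(latencies) // 2],
        "p95_latency_s": latencies[min(len(latencies) - 1, int(len(latencies) * 0.95))],
        "cost_per_task_usd": price_per_task,
    }

# Toy usage with a stubbed model; a real harness would call each vendor's API.
stub = lambda prompt: (("positive" if "great" in prompt else "negative"), 0.2)
evals = [("great product", "positive"), ("terrible product", "negative")]
report = score_model(stub, evals, 0.0005)
print(report["accuracy"])  # 1.0
```

Run the same function over each candidate model and the per-tier ranking falls out of the resulting table; the qualitative-fit and contract criteria still need human review.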
Formula
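The section heading leaves the formula implicit; the core arithmetic, stated here as an assumption about the intended content, is monthly spend from per-task token counts and per-million-token prices:

```python
def monthly_cost(tasks_per_month, in_tokens, out_tokens, in_price_per_m, out_price_per_m):
    """Monthly spend; token prices are quoted per million tokens (the industry norm)."""
    per_task = (in_tokens * in_price_per_m + out_tokens * out_price_per_m) / 1_000_000
    return tasks_per_month * per_task

# Illustrative (hypothetical) frontier pricing: $10/M input, $30/M output,
# at 2M tasks per month averaging 5K input / 1K output tokens each.
print(monthly_cost(2_000_000, 5_000, 1_000, 10.0, 30.0))  # 160000.0
```

The same two-line calculation, run for each candidate model, is what exposes the 5-10x over-spec described below.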
In Practice
Frontier model providers (OpenAI's GPT-5.x family, Anthropic's Claude family, Google's Gemini family) compete on capability, cost, and latency, with rankings shuffling per release. Open-weight models (Meta Llama, Mistral, Alibaba Qwen, DeepSeek) provide lower cost and on-prem options for many tasks. Hugging Face's Open LLM Leaderboard, LMSys Chatbot Arena, and HELM are commonly cited evaluation references. Most production-mature enterprises run a mix: Anthropic for high-stakes drafting, an open-weight model for high-volume classification, and a frontier model for agentic workflows.
Pro Tips
- 01
The cheapest model that hits your quality threshold is the right model, not the most capable one. Most teams over-spec by 5-10x because no one runs the cost/quality tradeoff explicitly. Run it.
- 02
Always have a 'fallback' model in your gateway: when the primary fails or rate-limits, traffic seamlessly routes to a secondary. This single architectural decision converts vendor outages from incidents to non-events. Most production AI gateways now ship with this built-in.
- 03
Negotiate volume contracts for your top one or two models, but keep at least one alternative integrated even if you don't route traffic to it. Vendor-lock-in risk in AI is high โ models deprecate, prices change, capabilities shift. Optionality is cheap to maintain and expensive to retrofit under pressure.
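The fallback pattern from tip 02 fits in a dozen lines. This is a hypothetical sketch, not a specific gateway product; real gateways add retries, timeouts, and health checks on top of the same priority-ordered loop.

```python
class ModelGateway:
    """Minimal fallback router: try providers in priority order (sketch)."""

    def __init__(self, providers):
        self.providers = providers  # callables, primary first

    def complete(self, prompt):
        errors = []
        for call in self.providers:
            try:
                return call(prompt)
            except Exception as exc:  # outage, rate limit, timeout, ...
                errors.append(exc)
        raise RuntimeError(f"all providers failed: {errors!r}")

# Usage: the primary simulates an outage, the secondary answers.
def primary(prompt):
    raise TimeoutError("rate limited")

def secondary(prompt):
    return f"ok: {prompt}"

gateway = ModelGateway([primary, secondary])
print(gateway.complete("classify this ticket"))  # ok: classify this ticket
```

The "keep at least one alternative integrated" advice in tip 03 is exactly this list having length two or more, even if the secondary normally receives zero traffic.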
Myth vs Reality
Myth
"The model with the highest benchmark score is the best choice"
Reality
Benchmarks measure general capability. Your use case is specific. A model that scores 92% on MMLU may underperform an 87%-MMLU model on your customer service classification task. Always evaluate on your data; benchmarks are a screen, not a verdict.
Myth
"We need to fine-tune a base model to get good results"
Reality
For 80% of use cases, a strong base model with good prompting and retrieval beats a fine-tuned smaller model. Fine-tuning is justified when you've exhausted prompting/RAG and the cost/latency of a smaller fine-tuned model is materially better than a prompted base model. Start with prompting, escalate as needed.
Try it
Run the numbers.
Pressure-test the concept against your own knowledge: answer the challenge or try the live scenario.
Knowledge Check
A company processes 50M customer support classifications per month. Each classification requires the model to assign one of 12 categories. Which foundation model strategy is most appropriate?
Industry benchmarks
Is your number good?
Calibrate against real-world tiers. Use these ranges as targets, not absolutes.
Foundation Model Strategy Maturity
Enterprises with AI features in production
- Mature: Multi-model portfolio, eval-driven, gateway with fallback, 6-month review cadence
- Functional: 2 models in production, periodic re-eval
- Single-Vendor: One model for all use cases
- Default-Choice: Whatever the engineer picked first, no formal evaluation
Source: Vendor evaluation patterns from Andreessen Horowitz Enterprise AI surveys + practitioner reports
Real-world cases
Companies that lived this.
Verified narratives with the numbers that prove (or break) the concept.
Hugging Face Open LLM Leaderboard
2023-present
Hugging Face hosts an evolving leaderboard of open-weight LLMs evaluated on standardized benchmarks (MMLU, ARC, HellaSwag, TruthfulQA, GSM8K, Winogrande). The leaderboard demonstrates how rapidly the open-weight model landscape evolves; top models change every few months, and capability gaps with closed frontier models have narrowed significantly. Enterprises that built selection processes around the leaderboard plus their own evaluation sets have been able to switch models as capability and cost shift, capturing material savings.
Models Evaluated
Thousands
Eval Suite
MMLU, ARC, HellaSwag, TruthfulQA, GSM8K, Winogrande
Update Frequency
Continuous
Public leaderboards plus your own evaluation set is the right combination. Public leaderboards screen for general capability; your eval set verifies fit. Either alone is insufficient.
Hypothetical: Single-Vendor Lock-In Bill Shock
Composite scenario
A B2B SaaS standardized on a single frontier model in 2023 for all AI features. By Q4 2024, monthly AI costs had grown to $480K. A late-2024 evaluation found that 70% of their workload (categorization, summarization of short content, structured extraction) ran equally well on a model 20x cheaper. Migration to a multi-model gateway took 14 weeks of engineering. New monthly AI cost: $145K, a 70% reduction. The lesson: single-vendor commitments at the start of an AI program become structural cost problems within 18 months as the model landscape evolves.
Pre-Migration Monthly Cost
$480K
Post-Migration Monthly Cost
$145K
Annual Savings
~$4M
Migration Investment
~14 weeks engineering
Multi-model architecture is cheap insurance and strategic optionality. Single-vendor strategies that made sense in early 2023 are usually wrong by late 2024.
Decision scenario
Picking Models for a New AI Product
You are CTO of a B2B legal-tech startup launching an AI-powered contract review feature. Three model families are on the table: a frontier closed model, a mid-tier closed model, and an open-weight model you'd self-host. Workload projection: 2M document analyses per month, average 5K input tokens and 1K output tokens. The product targets mid-market law firms; data sensitivity is high (privileged client material).
Projected Monthly Volume
2M analyses
Avg Tokens per Analysis
5K in / 1K out
Customer Type
Mid-market law firms
Data Sensitivity
High (privileged)
Time to Launch
12 weeks
Decision 1
First decision: lead model choice. The frontier model has the best quality on a 100-document evaluation (94% accuracy on key extraction). The mid-tier model scores 88%. The open-weight self-hosted model scores 82%. The frontier model would cost ~$160K/month at projected volume; mid-tier ~$24K/month; self-hosted ~$8K/month plus infrastructure.
Lead with the frontier model: quality is paramount in legal-tech and the cost is justified
Lead with the mid-tier model for the default workload, route to the frontier only for low-confidence cases (~5% of volume), and keep self-hosted as a future option for volume-dominant tasks (Optimal)
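The tiered option can be made concrete with a confidence-threshold router. The interface and threshold are hypothetical; the cost figures reuse the scenario's own estimates ($24K/month mid-tier, $160K/month frontier, each at full volume).

```python
def route(doc, classify_mid, classify_frontier, threshold=0.8):
    """Tiered routing sketch: mid-tier by default, escalate low-confidence cases."""
    label, confidence = classify_mid(doc)
    if confidence >= threshold:
        return label
    return classify_frontier(doc)

# Usage with stub classifiers (a real system would call each model's API).
print(route("short NDA", lambda d: ("nda", 0.95), lambda d: "msa"))  # nda
print(route("odd rider", lambda d: ("nda", 0.40), lambda d: "msa"))  # msa

# Blended monthly cost if ~95% of traffic stays on the mid-tier and ~5%
# escalates to the frontier model:
blended = 0.95 * 24_000 + 0.05 * 160_000
print(blended)  # 30800.0 -- vs $160K/month frontier-only
```

The blended figure is why the tiered option wins: roughly a 5x cost reduction versus frontier-only, while hard cases still get frontier quality.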
Decision 2
Second decision: data sensitivity. The frontier vendor offers a no-train-on-data zero-retention enterprise tier with BAA-equivalent privacy controls; the mid-tier vendor's standard enterprise contract excludes training but does not guarantee zero retention; the open-weight self-hosted option keeps everything in your VPC.
Use the standard mid-tier contract since training is excluded; retention is unlikely to be a real issue
Negotiate zero-retention terms with both frontier and mid-tier vendors before launch, and build the architecture so workloads can be routed to self-hosted later if a customer requires it (Optimal)
Related concepts
Keep connecting.
The concepts that orbit this one; each one sharpens the others.
Beyond the concept
Turn Foundation Model Selection into a live operating decision.
Use this concept as the framing layer, then move into a diagnostic if it maps directly to a current bottleneck.
Typical response time: 24h · No retainer required