Feature Store Design
A Feature Store is the dedicated infrastructure layer that produces, stores, serves, and governs the features (engineered inputs) used by machine learning models, both during offline training and online inference. The defining problem it solves: feature parity. Without a feature store, the SQL that computes 'user_avg_order_value_last_30_days' for training is rewritten in Java/Python for online serving, and the two drift, producing online/offline skew that silently degrades model accuracy. Feature stores enforce a single feature definition that produces both an offline batch feature (in your warehouse, for training) and an online low-latency feature (in Redis/DynamoDB/Cassandra, for inference). The dominant implementations: Tecton (commercial, founded by ex-Uber Michelangelo team), Feast (open source, originally from Gojek), Databricks Feature Store, Vertex AI Feature Store (Google), SageMaker Feature Store (AWS), and many in-house systems at Uber (Michelangelo), Airbnb (Zipline), Lyft (Dryft), and Netflix.
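The "single feature definition, two stores" idea can be sketched in a few lines. This is a minimal illustration with hypothetical names (FeatureDefinition, avg_order_value), not the API of Tecton, Feast, or any real platform:

```python
# Minimal sketch of "one definition drives both stores" (hypothetical API).
from dataclasses import dataclass

@dataclass
class FeatureDefinition:
    name: str
    entity_key: str   # e.g. "user_id" -- the lookup key for online serving
    sql: str          # single source of truth for the transformation
    ttl_seconds: int  # freshness bound for the online copy

avg_order_value = FeatureDefinition(
    name="user_avg_order_value_last_30_days",
    entity_key="user_id",
    sql="""
        SELECT user_id, AVG(order_value) AS value
        FROM orders
        WHERE order_ts >= CURRENT_DATE - INTERVAL '30 days'
        GROUP BY user_id
    """,
    ttl_seconds=3600,
)

# The platform runs the SAME sql for both paths:
#   offline: materialize to a warehouse table for training joins
#   online:  batch-load the results into Redis/DynamoDB keyed by entity_key
```

Because both paths derive from one `sql` string, a change to the feature logic propagates to training and serving together, which is the parity guarantee in miniature.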
The Trap
The trap is building a feature store before you have enough ML models in production to justify the complexity. Feature stores are an infrastructure tax: they pay back when you have 5+ production models sharing features, hundreds of features in production, and online inference latency requirements. With 1-2 models and a small feature set, the right answer is just careful engineering: write feature SQL in dbt, materialize to a warehouse table for training, replicate to a cache for serving, and live with the small online/offline drift. KnowMBA POV: most companies that 'need a feature store' actually need three things: better dbt discipline, a cache for online serving, and observability before deploying ML to production. Buying Tecton when you have 2 models in production is the same anti-pattern as buying a $250K experimentation platform when you run 10 experiments per year. The other trap: building a feature store in-house without dedicated platform team capacity. Feature stores have a long tail of edge cases (point-in-time correctness, feature backfills, schema evolution, dependency tracking) that consume team capacity for years.
What to Do
Adopt a feature store only when you cross a clear threshold: 5+ ML models in production, 50+ features in production, online inference latency SLAs (sub-100ms), AND a dedicated ML platform team of 3+ engineers. Below that, use lightweight alternatives: dbt + warehouse for offline, Redis/DynamoDB cache for online, and careful documentation. When you cross the threshold, decide buy vs build vs open-source: (1) Tecton: commercial, mature, expensive ($300K-$2M/year), best for serious production ML at scale. (2) Databricks Feature Store: bundled with the lakehouse, low marginal cost if you're already on Databricks. (3) Feast: open source, lower cost but requires significant operational investment. (4) In-house: only if you have Uber/Airbnb-scale ML usage AND a 10+ engineer ML platform team. Sequence the rollout: the first 3 features in the new feature store should be high-traffic, latency-sensitive features that demonstrate the parity benefit clearly. Then expand.
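The threshold above can be written down as a rule of thumb. This is a sketch encoding this article's heuristics, not a vendor formula; the function name and return strings are illustrative:

```python
def feature_store_recommendation(models, features, has_latency_sla, platform_engineers):
    """Rule-of-thumb adoption check based on the thresholds above (illustrative)."""
    # All four conditions must hold before a feature store pays back.
    if (models >= 5 and features >= 50
            and has_latency_sla and platform_engineers >= 3):
        return "adopt a feature store (decide buy vs build vs open source)"
    return "lightweight stack: dbt + warehouse offline, Redis/DynamoDB cache online"

# The Knowledge Check scenario below: 2 models, 25 features, no platform team.
print(feature_store_recommendation(2, 25, True, 0))
# A team past the threshold: 8 models, 120 features, 4 platform engineers.
print(feature_store_recommendation(8, 120, True, 4))
```

Note the AND: a team with 10 models but no platform engineers still fails the check, which is the in-house-without-capacity trap described above.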
In Practice
Uber's Michelangelo (introduced 2017) is the original public feature store at scale, supporting hundreds of ML use cases (surge pricing, ETA, fraud, search ranking) across the company with both offline and online feature serving. Tecton (founded 2019 by Michelangelo's leads) commercialized the pattern; their published case studies include Atlassian, Plaid, Cash App, Coinbase, and many other production ML shops. Feast (open-sourced by Gojek and Google in 2019, now under the LF AI Foundation) is the dominant open-source feature store, used by Shopify, Robinhood, Twitter (pre-Musk), and many others. Airbnb's Zipline, Lyft's Dryft, and Netflix's feature platforms are all in-house implementations. The recurring pattern: every public feature store case is backed by a serious ML platform team and many production models. Feature stores at smaller scale are infrastructure overhead pretending to be enablement.
Pro Tips
- 01
Point-in-time correctness is the hardest feature store problem and the most underestimated. When you compute 'user_30day_purchases' for training, the value at training time must match what would have been served at inference time on that historical date; otherwise you train on data leaks that don't exist in production. Modern feature stores enforce point-in-time joins; in-house implementations frequently get this wrong.
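A point-in-time join can be demonstrated with pandas: for each training label, take the latest feature value computed at or before the label's timestamp, never a future one. A minimal sketch with made-up data:

```python
import pandas as pd

# Training labels: each row is a (user, event timestamp) we need features for.
labels = pd.DataFrame({
    "user_id": [1, 1, 2],
    "event_ts": pd.to_datetime(["2024-03-01", "2024-03-15", "2024-03-10"]),
})

# Feature snapshots: the value of user_30day_purchases as of each compute date.
features = pd.DataFrame({
    "user_id": [1, 1, 2],
    "feature_ts": pd.to_datetime(["2024-02-25", "2024-03-10", "2024-03-01"]),
    "user_30day_purchases": [3, 5, 7],
})

# Point-in-time join: for each label, match the latest snapshot AT OR BEFORE
# event_ts. direction="backward" is what prevents future-data leakage.
train = pd.merge_asof(
    labels.sort_values("event_ts"),
    features.sort_values("feature_ts"),
    left_on="event_ts",
    right_on="feature_ts",
    by="user_id",
    direction="backward",
)
```

User 1's label on 2024-03-01 gets the 2024-02-25 snapshot (value 3), not the later 2024-03-10 one; a naive latest-value join would have leaked the future value 5 into training.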
- 02
Online/offline parity testing should be a CI check, not a hope. Every feature deploy should run a test that computes the feature both ways and asserts they match within tolerance. Without this, drift creeps in slowly and only surfaces when model accuracy mysteriously degrades.
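Such a CI check can be small. A sketch with stubbed lookups (the two feature functions are placeholders for a warehouse query and a cache read; names and tolerance are illustrative):

```python
import math

def offline_feature(user_id):
    # Placeholder: in practice, run the warehouse SQL for this entity.
    return {"u1": 142.50, "u2": 88.00}[user_id]

def online_feature(user_id):
    # Placeholder: in practice, read the serving cache (Redis/DynamoDB).
    return {"u1": 142.49, "u2": 88.00}[user_id]

def assert_parity(user_ids, rel_tol=1e-3):
    """Fail the deploy if online and offline values diverge beyond tolerance."""
    for uid in user_ids:
        off, on = offline_feature(uid), online_feature(uid)
        assert math.isclose(off, on, rel_tol=rel_tol), (
            f"parity violation for {uid}: offline={off} online={on}"
        )

# Run against a sample of entities on every feature deploy.
assert_parity(["u1", "u2"])
```

Running this on every deploy turns silent drift into a red build; the tolerance absorbs benign floating-point and freshness differences while catching real logic divergence.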
- 03
Treat features as products with owners and SLAs. The teams that succeed (Uber Michelangelo, Airbnb Zipline) treat each feature like a data product: documented, tested, monitored, owned, and deprecated through a process. The teams that fail treat features as ephemeral SQL snippets and drown in shadow features.
Myth vs Reality
Myth
"Every ML team needs a feature store"
Reality
Most ML teams need careful engineering: feature SQL in dbt, a warehouse for offline training, a cache for online serving, and documentation. Feature stores are justified at scale (5+ production models, 50+ features, dedicated platform team), not at every ML deployment. Buying a feature store with 2 models in production is the same as buying enterprise software for a 10-person team: expensive and over-engineered.
Myth
"Feature stores eliminate online/offline skew automatically"
Reality
Feature stores REDUCE skew by enforcing a single source-of-truth feature definition, but skew can still emerge from infrastructure differences (cache TTL vs warehouse refresh), schema evolution, and edge cases in point-in-time correctness. Continuous monitoring of feature distributions across online and offline environments is required regardless of platform. The feature store is necessary but not sufficient.
Knowledge Check
Your ML team has 2 production models, ~25 features, and no dedicated ML platform engineers. The lead data scientist wants to buy Tecton at $400K/year. What's the right call?
Industry benchmarks
Is your number good?
Calibrate against real-world tiers. Use these ranges as targets, not absolutes.
Feature Store Adoption Threshold (Production ML)
Feature store adoption sweet spots by production ML scale:
- 1-2 models, <30 features: use dbt + warehouse + cache
- 3-5 models, 30-100 features: open source (Feast) or bundled (Databricks)
- 5-20 models, 100-500 features: commercial (Tecton, Vertex, SageMaker)
- 20+ models, 500+ features: in-house or Tecton + customization
Source: https://www.tecton.ai/blog/what-is-a-feature-store/
Real-world cases
Companies that lived this.
Verified narratives with the numbers that prove (or break) the concept.
Uber Michelangelo
2017-present
Uber's Michelangelo platform, introduced publicly in 2017, is the original feature store at scale. It supports hundreds of production ML use cases across surge pricing, ETA, fraud detection, search ranking, and recommendations, with both offline and online feature serving, point-in-time correctness, and dependency tracking. Michelangelo is operated by a dedicated platform team of dozens of engineers and processes billions of feature lookups per day. Many of Tecton's founders came from Michelangelo and commercialized the pattern.
Production ML Use Cases
Hundreds
Feature Lookups per Day
Billions
Platform Team Size
Dozens of engineers
Era
2017+, ongoing
Feature stores at hyperscale require dedicated platform teams. The pattern is battle-tested but the operational investment is substantial.
Tecton
2019-present
Tecton, founded in 2019 by ex-Uber Michelangelo team members, commercialized the production feature store pattern. Customer base includes Atlassian, Plaid, Cash App, Coinbase, and many other serious production ML shops. Tecton's published case studies emphasize the operational burden of in-house feature stores and the time-savings of the commercial platform: customers report cutting time-to-deploy new ML features from weeks to days. Pricing is in the $300K-$2M+/year range depending on scale.
Founded
2019
Notable Customers
Atlassian, Plaid, Cash App, Coinbase
Pricing Range
$300K-$2M+/year
Reported Time-to-Deploy Reduction
Weeks to days
Commercial feature stores are mature and pay for themselves at production scale, but require ML programs serious enough to justify the price.
Feast (Linux Foundation AI)
2019-present
Feast was open-sourced in 2019 by Gojek and Google, now hosted under the Linux Foundation AI. It's the dominant open-source feature store, used by Shopify, Robinhood, Twitter (pre-acquisition), and many others. Feast is feature-rich but requires significant operational investment to deploy and maintain โ companies that succeed with Feast typically have a dedicated ML platform team. Companies that adopt Feast hoping for 'free Tecton' often underestimate the ops burden and either commit to building the platform team or eventually switch to commercial alternatives.
Open-Sourced By
Gojek + Google (2019)
Notable Users
Shopify, Robinhood, Twitter (pre-2022)
Cost Profile
Free license, significant ops
Typical Ops Investment
2-5 engineer-years to operate
Open-source feature stores trade license cost for operational cost. The total cost of ownership is similar to commercial; the choice depends on whether you want to staff the platform team or rent it.
Beyond the concept
Turn Feature Store Design into a live operating decision.
Use this concept as the framing layer, then move into a diagnostic if it maps directly to a current bottleneck.