Feature Store Design
A Feature Store is the dedicated infrastructure layer that produces, stores, serves, and governs the features (engineered inputs) used by machine learning models, both during offline training and online inference. The defining problem it solves: feature parity. Without a feature store, the SQL that computes 'user_avg_order_value_last_30_days' for training is rewritten in Java/Python for online serving, and the two drift, producing online/offline skew that silently degrades model accuracy. Feature stores enforce a single feature definition that produces both an offline batch feature (in your warehouse, for training) and an online low-latency feature (in Redis/DynamoDB/Cassandra, for inference). The dominant implementations: Tecton (commercial, founded by ex-Uber Michelangelo team), Feast (open source, originally from Gojek), Databricks Feature Store, Vertex AI Feature Store (Google), SageMaker Feature Store (AWS), and many in-house systems at Uber (Michelangelo), Airbnb (Zipline), Lyft (Dryft), and Netflix.
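The "single feature definition, two stores" idea can be sketched in a few lines. This is a minimal illustration with hypothetical names (FeatureDefinition, avg_order_value), not the API of Tecton, Feast, or any real platform:

```python
# Minimal sketch of "one definition drives both stores" (hypothetical API).
from dataclasses import dataclass

@dataclass
class FeatureDefinition:
    name: str
    entity_key: str   # e.g. "user_id" -- the lookup key for online serving
    sql: str          # single source of truth for the transformation
    ttl_seconds: int  # freshness bound for the online copy

avg_order_value = FeatureDefinition(
    name="user_avg_order_value_last_30_days",
    entity_key="user_id",
    sql="""
        SELECT user_id, AVG(order_value) AS value
        FROM orders
        WHERE order_ts >= CURRENT_DATE - INTERVAL '30 days'
        GROUP BY user_id
    """,
    ttl_seconds=3600,
)

# The platform runs the SAME sql for both paths:
#   offline: materialize to a warehouse table for training joins
#   online:  batch-load the results into Redis/DynamoDB keyed by entity_key
```

Because both paths derive from one `sql` string, a change to the feature logic propagates to training and serving together, which is the parity guarantee in miniature.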
The Trap
The trap is building a feature store before you have enough ML models in production to justify the complexity. Feature stores are an infrastructure tax: they pay back when you have 5+ production models sharing features, hundreds of features in production, and online inference latency requirements. With 1-2 models and a small feature set, the right answer is just careful engineering: write feature SQL in dbt, materialize to a warehouse table for training, replicate to a cache for serving, and live with the small online/offline drift. KnowMBA POV: most companies that 'need a feature store' actually need three things: better dbt discipline, a cache for online serving, and observability before deploying ML to production. Buying Tecton when you have 2 models in production is the same anti-pattern as buying a $250K experimentation platform when you run 10 experiments per year. The other trap: building a feature store in-house without dedicated platform team capacity. Feature stores have a long tail of edge cases (point-in-time correctness, feature backfills, schema evolution, dependency tracking) that consume team capacity for years.
What to Do
Adopt a feature store only when you cross a clear threshold: 5+ ML models in production, 50+ features in production, online inference latency SLAs (sub-100ms), AND a dedicated ML platform team of 3+ engineers. Below that, use lightweight alternatives: dbt + warehouse for offline, Redis/DynamoDB cache for online, and careful documentation. When you cross the threshold, decide buy vs build vs open-source: (1) Tecton: commercial, mature, expensive ($300K-$2M/year), best for serious production ML at scale. (2) Databricks Feature Store: bundled with the lakehouse, low marginal cost if you're already on Databricks. (3) Feast: open source, lower cost but requires significant operational investment. (4) In-house: only if you have Uber/Airbnb-scale ML usage AND a 10+ engineer ML platform team. Sequence the rollout: the first 3 features in the new feature store should be high-traffic, latency-sensitive features that demonstrate the parity benefit clearly. Then expand.
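The threshold above can be written down as a rule of thumb. This is a sketch encoding this article's heuristics, not a vendor formula; the function name and return strings are illustrative:

```python
def feature_store_recommendation(models, features, has_latency_sla, platform_engineers):
    """Rule-of-thumb adoption check based on the thresholds above (illustrative)."""
    # All four conditions must hold before a feature store pays back.
    if (models >= 5 and features >= 50
            and has_latency_sla and platform_engineers >= 3):
        return "adopt a feature store (decide buy vs build vs open source)"
    return "lightweight stack: dbt + warehouse offline, Redis/DynamoDB cache online"

# The Knowledge Check scenario below: 2 models, 25 features, no platform team.
print(feature_store_recommendation(2, 25, True, 0))
# A team past the threshold: 8 models, 120 features, 4 platform engineers.
print(feature_store_recommendation(8, 120, True, 4))
```

Note the AND: a team with 10 models but no platform engineers still fails the check, which is the in-house-without-capacity trap described above.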
In Practice
Uber's Michelangelo (introduced 2017) is the original public feature store at scale, supporting hundreds of ML use cases (surge pricing, ETA, fraud, search ranking) across the company with both offline and online feature serving. Tecton (founded 2019 by Michelangelo's leads) commercialized the pattern; their published case studies include Atlassian, Plaid, Cash App, Coinbase, and many other production ML shops. Feast (open-sourced by Gojek and Google in 2019, now under the LF AI Foundation) is the dominant open-source feature store, used by Shopify, Robinhood, Twitter (pre-Musk), and many others. Airbnb's Zipline, Lyft's Dryft, and Netflix's feature platforms are all in-house implementations. The recurring pattern: every public feature store case is backed by a serious ML platform team and many production models. Feature stores at smaller scale are infrastructure overhead pretending to be enablement.
Pro Tips
- 01
Point-in-time correctness is the hardest feature store problem and the most underestimated. When you compute 'user_30day_purchases' for training, the value at training time must match what would have been served at inference time on that historical date; otherwise you train on data leaks that don't exist in production. Modern feature stores enforce point-in-time joins; in-house implementations frequently get this wrong.
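A point-in-time join can be demonstrated with pandas: for each training label, take the latest feature value computed at or before the label's timestamp, never a future one. A minimal sketch with made-up data:

```python
import pandas as pd

# Training labels: each row is a (user, event timestamp) we need features for.
labels = pd.DataFrame({
    "user_id": [1, 1, 2],
    "event_ts": pd.to_datetime(["2024-03-01", "2024-03-15", "2024-03-10"]),
})

# Feature snapshots: the value of user_30day_purchases as of each compute date.
features = pd.DataFrame({
    "user_id": [1, 1, 2],
    "feature_ts": pd.to_datetime(["2024-02-25", "2024-03-10", "2024-03-01"]),
    "user_30day_purchases": [3, 5, 7],
})

# Point-in-time join: for each label, match the latest snapshot AT OR BEFORE
# event_ts. direction="backward" is what prevents future-data leakage.
train = pd.merge_asof(
    labels.sort_values("event_ts"),
    features.sort_values("feature_ts"),
    left_on="event_ts",
    right_on="feature_ts",
    by="user_id",
    direction="backward",
)
```

User 1's label on 2024-03-01 gets the 2024-02-25 snapshot (value 3), not the later 2024-03-10 one; a naive latest-value join would have leaked the future value 5 into training.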
- 02
Online/offline parity testing should be a CI check, not a hope. Every feature deploy should run a test that computes the feature both ways and asserts they match within tolerance. Without this, drift creeps in slowly and only surfaces when model accuracy mysteriously degrades.
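Such a CI check can be small. A sketch with stubbed lookups (the two feature functions are placeholders for a warehouse query and a cache read; names and tolerance are illustrative):

```python
import math

def offline_feature(user_id):
    # Placeholder: in practice, run the warehouse SQL for this entity.
    return {"u1": 142.50, "u2": 88.00}[user_id]

def online_feature(user_id):
    # Placeholder: in practice, read the serving cache (Redis/DynamoDB).
    return {"u1": 142.49, "u2": 88.00}[user_id]

def assert_parity(user_ids, rel_tol=1e-3):
    """Fail the deploy if online and offline values diverge beyond tolerance."""
    for uid in user_ids:
        off, on = offline_feature(uid), online_feature(uid)
        assert math.isclose(off, on, rel_tol=rel_tol), (
            f"parity violation for {uid}: offline={off} online={on}"
        )

# Run against a sample of entities on every feature deploy.
assert_parity(["u1", "u2"])
```

Running this on every deploy turns silent drift into a red build; the tolerance absorbs benign floating-point and freshness differences while catching real logic divergence.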
- 03
Treat features as products with owners and SLAs. The teams that succeed (Uber Michelangelo, Airbnb Zipline) treat each feature like a data product: documented, tested, monitored, owned, and deprecated through a process. The teams that fail treat features as ephemeral SQL snippets and drown in shadow features.
Myth vs Reality
Myth
"Every ML team needs a feature store"
Reality
Most ML teams need careful engineering: feature SQL in dbt, a warehouse for offline training, a cache for online serving, and documentation. Feature stores are justified at scale (5+ production models, 50+ features, dedicated platform team), not at every ML deployment. Buying a feature store with 2 models in production is the same as buying enterprise software for a 10-person team: expensive and over-engineered.
Myth
"Feature stores eliminate online/offline skew automatically"
Reality
Feature stores REDUCE skew by enforcing a single source-of-truth feature definition, but skew can still emerge from infrastructure differences (cache TTL vs warehouse refresh), schema evolution, and edge cases in point-in-time correctness. Continuous monitoring of feature distributions across online and offline environments is required regardless of platform. The feature store is necessary but not sufficient.
Knowledge Check
Your ML team has 2 production models, ~25 features, and no dedicated ML platform engineers. The lead data scientist wants to buy Tecton at $400K/year. What's the right call?
Industry benchmarks
Is your number good?
Calibrate against real-world tiers. Use these ranges as targets, not absolutes.
Feature Store Adoption Threshold (Production ML)
Feature store adoption sweet spots by production ML scale:
- 1-2 models, <30 features: use dbt + warehouse + cache
- 3-5 models, 30-100 features: open source (Feast) or bundled (Databricks)
- 5-20 models, 100-500 features: commercial (Tecton, Vertex, SageMaker)
- 20+ models, 500+ features: in-house or Tecton + customization
Source: https://www.tecton.ai/blog/what-is-a-feature-store/
Real-world cases
Companies that lived this.
Verified narratives with the numbers that prove (or break) the concept.
Uber Michelangelo
2017-present
Uber's Michelangelo platform, introduced publicly in 2017, is the original feature store at scale. It supports hundreds of production ML use cases across surge pricing, ETA, fraud detection, search ranking, and recommendations, with both offline and online feature serving, point-in-time correctness, and dependency tracking. Michelangelo is operated by a dedicated platform team of dozens of engineers and processes billions of feature lookups per day. Many of Tecton's founders came from Michelangelo and commercialized the pattern.
Production ML Use Cases
Hundreds
Feature Lookups per Day
Billions
Platform Team Size
Dozens of engineers
Era
2017+, ongoing
Feature stores at hyperscale require dedicated platform teams. The pattern is battle-tested but the operational investment is substantial.
Tecton
2019-present
Tecton, founded in 2019 by ex-Uber Michelangelo team members, commercialized the production feature store pattern. Customer base includes Atlassian, Plaid, Cash App, Coinbase, and many other serious production ML shops. Tecton's published case studies emphasize the operational burden of in-house feature stores and the time-savings of the commercial platform: customers report cutting time-to-deploy new ML features from weeks to days. Pricing is in the $300K-$2M+/year range depending on scale.
Founded
2019
Notable Customers
Atlassian, Plaid, Cash App, Coinbase
Pricing Range
$300K-$2M+/year
Reported Time-to-Deploy Reduction
Weeks to days
Commercial feature stores are mature and pay for themselves at production scale, but require ML programs serious enough to justify the price.
Feast (Linux Foundation AI)
2019-present
Feast was open-sourced in 2019 by Gojek and Google, now hosted under the Linux Foundation AI. It's the dominant open-source feature store, used by Shopify, Robinhood, Twitter (pre-acquisition), and many others. Feast is feature-rich but requires significant operational investment to deploy and maintain โ companies that succeed with Feast typically have a dedicated ML platform team. Companies that adopt Feast hoping for 'free Tecton' often underestimate the ops burden and either commit to building the platform team or eventually switch to commercial alternatives.
Open-Sourced By
Gojek + Google (2019)
Notable Users
Shopify, Robinhood, Twitter (pre-2022)
Cost Profile
Free license, significant ops
Typical Ops Investment
2-5 engineer-years to operate
Open-source feature stores trade license cost for operational cost. The total cost of ownership is similar to commercial; the choice depends on whether you want to staff the platform team or rent it.
Beyond the concept
Turn Feature Store Design into a live operating decision.
Use this concept as the framing layer, then move into a diagnostic if it maps directly to a current bottleneck.