
AB Testing Platform

An AB Testing Platform is the technical and statistical infrastructure that lets product, growth, and marketing teams ship controlled experiments: randomly assigning users to variants, measuring outcomes, and deciding whether a change won or lost. The defining components: (1) a randomization service (assigns users deterministically), (2) feature flag delivery (toggles variants in client and server code), (3) event ingestion plus experiment computation (measures the outcome metrics), (4) a statistical engine (frequentist or Bayesian inference, sequential tests, CUPED variance reduction), (5) an experimentation portal (UX for designing, launching, monitoring, and deciding). The dominant commercial platforms (Optimizely, AB Tasty, Statsig, GrowthBook, Eppo, LaunchDarkly + Experiments) differ in how they split effort between feature flagging and statistical sophistication. Big tech (Google, Microsoft, Meta, Netflix, Booking) built their own; mid-market and growth-stage companies overwhelmingly buy.
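To make component (1) concrete, here is a minimal sketch of deterministic hash-based assignment, the common approach that gives the same user the same variant on every request without storing per-user state. The function name and salt format are illustrative assumptions, not any vendor's API.

```python
import hashlib

def assign_variant(user_id: str, experiment_key: str,
                   variants: tuple = ("control", "treatment")) -> str:
    """Deterministic bucketing: hashing (experiment, user) always yields
    the same variant for the same user, with an approximately uniform
    split and independent assignments across experiments."""
    digest = hashlib.sha256(f"{experiment_key}:{user_id}".encode()).hexdigest()
    return variants[int(digest[:8], 16) % len(variants)]

# Same inputs, same assignment, every time.
assert assign_variant("user_42", "new_checkout") == assign_variant("user_42", "new_checkout")
```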

Also known as: Experimentation Platform, Split Testing Infrastructure, Feature Flag + Experimentation, Statistical Testing Platform, the Optimizely / Statsig / Eppo category.

The Trap

The trap is buying an experimentation platform expecting it to fix a culture that doesn't actually want to learn from experiments. Most companies launch the platform, run 12 experiments in year 1, declare 8 of them 'winners' through eyeball analysis (ignoring the platform's statistics), and quietly stop using it. The honest precondition for an experimentation platform is a culture that accepts losing experiments and ships only the actual winners, which is a higher bar than most leadership teams clear, whatever they claim. KnowMBA POV: experimentation velocity > experimentation rigor for early-stage products. A startup running 50 quick-and-dirty experiments per quarter learns more than a startup running 5 statistically pristine experiments. Once product-market fit hardens and marginal growth gains shrink (typically post-Series C), velocity yields to rigor and the full statistical platform earns its keep. Buying a $250K/year platform pre-PMF is a status purchase.

What to Do

Pick the right platform by stage. (1) Pre-PMF startup (<$5M ARR, <50 experiments/year): use a free or low-cost tool such as GrowthBook (open source), Statsig (free tier), or PostHog Experiments. Optimize for velocity and ease of launching. (2) Growth-stage SaaS ($5M-$100M ARR, 50-500 experiments/year): graduate to Statsig, Eppo, or Optimizely. Invest in CUPED variance reduction and shared metric definitions. (3) Hyperscale ($100M+ ARR, 500+ experiments/year): consider building in-house (or augmenting Statsig/Eppo with custom analysis) to support sequential tests, switchback experiments, and metric trees. Sequence the rollout: shared metric definitions FIRST (so every experiment uses canonical 'activation' and 'retention' definitions), platform SECOND, training and review process THIRD. Skipping shared metrics is the dominant failure mode: every experiment ships a different definition of success and the platform becomes a dashboard cemetery.
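To make 'shared metric definitions FIRST' concrete, here is a minimal sketch of a canonical metric registry. In practice this usually lives in dbt or the platform's metric layer; the metric names, events, and windows below are illustrative assumptions, not any vendor's schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MetricDefinition:
    """One canonical metric, owned centrally and referenced by name in
    every experiment, so 'activation' means the same thing org-wide."""
    name: str
    source_event: str   # warehouse event the metric is computed from
    window_days: int    # attribution window after experiment assignment
    aggregation: str    # e.g. "conversion", "mean", "sum"

# Hypothetical registry: experiments reference metrics by key instead of
# redefining them per analysis.
METRIC_REGISTRY = {
    "activation": MetricDefinition("activation", "completed_onboarding", 7, "conversion"),
    "retention_d30": MetricDefinition("retention_d30", "session_start", 30, "conversion"),
}
```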

Formula

Experimentation Platform ROI = (Experiments per Quarter × % That Ship × Avg Lift × Annual Revenue) − Platform Cost. Below ~50 experiments/year, free tools usually win on ROI. Above ~200/year, a paid platform with CUPED and sequential tests pays for itself.
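A minimal sanity-check calculator for the formula, under one reading of its mixed units: annualize over four quarters and treat average lift as a fraction of annual revenue per shipped winner. All inputs below are hypothetical.

```python
def experimentation_roi(experiments_per_quarter: float, ship_rate: float,
                        avg_lift: float, annual_revenue: float,
                        platform_cost: float) -> float:
    """Annual ROI per the formula above: shipped winners times average
    lift (as a fraction of annual revenue), minus annual platform cost."""
    shipped_winners = experiments_per_quarter * 4 * ship_rate
    return shipped_winners * avg_lift * annual_revenue - platform_cost

# Hypothetical growth-stage inputs: 20 experiments/quarter, 15% ship rate,
# 1% average lift per winner, $20M ARR, $60K/year platform.
print(experimentation_roi(20, 0.15, 0.01, 20_000_000, 60_000))  # 2340000.0
```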

In Practice

Optimizely (founded 2010, acquired by Episerver in 2020, which later rebranded under the Optimizely name) built the modern category of commercial experimentation platforms. AB Tasty competes in the mid-market. Statsig (founded 2021 by ex-Facebook engineers, $100M+ ARR by 2024) and Eppo (founded 2020) are the high-velocity modern entrants, emphasizing warehouse-native architecture and CUPED variance reduction. GrowthBook (open source) targets cost-conscious teams. Microsoft's Experimentation Platform team has published extensively on sequential testing and metric trees from running tens of thousands of experiments per year on Bing and Microsoft 365. Booking.com famously runs 1,000+ experiments concurrently and has published case studies on shipping wrong winners due to peeking, multiple comparisons, and poor metric definitions. The shared lesson across platforms: the platform is necessary, but experimentation culture and shared metrics are 80% of the outcome.

Pro Tips

01. CUPED (Controlled-experiment Using Pre-Experiment Data) variance reduction can cut required sample sizes by 30-50%, meaning faster experiments and the ability to detect smaller lifts; see the sketch after this list. Modern platforms (Statsig, Eppo, Optimizely) implement it; check that yours does before paying a premium.

02. Sequential testing (mSPRT, group sequential) lets you peek at results without inflating false positive rates. Without it, peeking destroys experiment validity, and humans WILL peek. Pick a platform that supports sequential tests if you have non-statistical end users.

03. Shared metric definitions are the single most undervalued investment. Without them, every experiment ships a different version of 'activation' and analysis becomes incomparable across the org. Build canonical metric definitions in the platform (or in dbt feeding the platform) FIRST. Then add experiments.
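Tip 01 in code: a minimal sketch of the standard CUPED adjustment, using the same metric measured pre-experiment as the covariate. Real platforms additionally handle missing pre-period data, ratio metrics, and clustered units; the variable names and simulated numbers here are illustrative.

```python
import numpy as np

def cuped_adjust(post: np.ndarray, pre: np.ndarray) -> np.ndarray:
    """CUPED: subtract the part of the post-experiment metric that the
    pre-experiment metric predicts. The mean is preserved; variance
    drops by the squared pre/post correlation."""
    theta = np.cov(pre, post, ddof=1)[0, 1] / np.var(pre, ddof=1)
    return post - theta * (pre - pre.mean())

rng = np.random.default_rng(0)
pre = rng.normal(10, 2, 10_000)                 # pre-experiment metric per user
post = 0.8 * pre + rng.normal(0, 1, 10_000)     # correlated post-experiment metric
adjusted = cuped_adjust(post, pre)
print(post.var() / adjusted.var())               # variance reduction factor, roughly 3.5x here
```

Lower variance means smaller required samples for the same detectable lift, which is where the 30-50% figure comes from when pre/post correlation is moderate.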

Myth vs Reality

Myth

"More experiments always lead to better products"

Reality

Volume alone doesn't deliver outcomes. A team running 200 experiments per quarter with poor metric definitions, peeking violations, and confirmation bias ships more wrong winners than right ones. Microsoft's published research suggests roughly two-thirds of well-designed experiments at hyperscale companies fail to produce the predicted lift, meaning shipping based on intuition alone, without experimentation, would be wrong about two-thirds of the time. Both volume AND rigor matter; either alone underperforms.

Myth

"AB testing platforms are commodities, so pick the cheapest"

Reality

Platforms differ materially in CUPED implementation, sequential test support, warehouse-native architecture, metric layer integration, and statistical sophistication. The wrong choice at hyperscale costs millions in shipping wrong winners. The wrong choice at startup scale costs nothing because you're not running enough experiments to need rigor. Match platform sophistication to experimentation velocity, not to brand recognition.

Try it

Run the numbers.

Pressure-test the concept against your own knowledge: answer the challenge or try the live scenario.


Knowledge Check

Your $20M ARR Series B SaaS company runs ~80 experiments per year. You're choosing between a free tool (GrowthBook), a mid-tier platform (Statsig at ~$60K/year), and an enterprise platform (Optimizely at ~$200K/year). What's the right call?

Industry benchmarks

Is your number good?

Calibrate against real-world tiers. Use these ranges as targets, not absolutes.

Experimentation Platform Tier by Volume (sweet spots by experiment volume):

<50 exp/year (Free Tools): GrowthBook, PostHog, Statsig free
50-500 exp/year (Mid-Tier): Statsig, Eppo, AB Tasty; $50K-$200K
500-5,000 exp/year (Enterprise): Optimizely, Statsig Enterprise; $200K-$1M+
5,000+ exp/year (Build In-House): Booking, Microsoft, Meta, Netflix custom

Source: https://exp-platform.com/Documents/2017-08KDD-CUPED.pdf

Real-world cases

Companies that lived this.

Verified narratives with the numbers that prove (or break) the concept.


Statsig

2021-present

success

Statsig was founded in 2021 by ex-Facebook engineers who built Facebook's internal experimentation platform. The product combines feature flagging, experimentation, and product analytics in a warehouse-native architecture with CUPED, sequential testing, and integrated metric definitions. By 2024, Statsig reportedly crossed $100M ARR with customers including OpenAI, Notion, Atlassian, and Brex. The growth pattern reflects market demand for modern experimentation infrastructure: rigor of big-tech platforms with the UX of a startup tool.

Founded: 2021
Reported ARR (2024): $100M+
Notable Customers: OpenAI, Notion, Atlassian, Brex
Differentiator: Warehouse-native + CUPED + sequential testing

Modern experimentation platforms have raised the floor on statistical sophistication. Mid-market companies can now access big-tech-quality experimentation infrastructure for $60K-$200K/year.


Booking.com

2010-present

success

Booking.com runs one of the largest experimentation programs in the world: 1,000+ concurrent experiments at peak, on a custom-built platform supporting their entire product surface. Booking has published extensively about both wins (>$1B in cumulative annual revenue attributed to experimentation lifts over the years) and failure modes (multiple comparisons, peeking, novelty effects, weekly seasonality). Their public talks emphasize that the platform is necessary but the experimentation CULTURE is what drives outcomes.

Concurrent Experiments (peak): 1,000+
Cumulative Revenue Impact: >$1B over years
Platform: Custom-built in-house
Published Failure Modes: Peeking, multiple comparisons, novelty

At hyperscale, experimentation platform investment pays for itself many times over, but only when paired with the cultural discipline of accepting losing experiments and refusing to ship them.


Optimizely

2010-present

success

Optimizely defined the modern category of commercial AB testing platforms; it was acquired by Episerver in 2020, which rebranded under the Optimizely name in 2021. The platform powers experimentation for tens of thousands of customers across e-commerce, SaaS, and media. Optimizely's published case studies span early conversion rate optimization (CRO) wins (10-30% lifts on landing pages) to mature programs running hundreds of experiments per quarter. The platform's evolution reflects market maturation: early growth came from CRO simplicity; current growth comes from integrated feature management plus experimentation.

Founded: 2010
Customer Base: Tens of thousands across industries
Ownership: Acquired by Episerver (2020), rebranded as Optimizely (2021)
Use Cases: CRO, product experimentation, feature management

The commercial experimentation market is mature, with clear vendor tiers. Match the vendor to your stage and volume; don't pay enterprise prices for mid-market needs.


Decision scenario

The Experimentation Platform Purchase Decision

You're VP Growth at a Series B SaaS company at $15M ARR. Your team currently runs ~30 experiments per year using a basic feature flag tool with manual analysis in SQL. The CMO wants to buy Optimizely ($180K/year). The CEO is skeptical. The data team is overloaded.

Current Experiments per Year: ~30
Current Tooling Cost: $0 (manual)
Optimizely Quote: $180K/year
Data Team Capacity: Already overloaded
Win Rate (estimated): ~10% (low confidence)

Decision 1

The CMO wants to sign the Optimizely contract this quarter to 'modernize the experimentation function'. The data team can't credibly support a 6x increase in experiments with current capacity. Win rate is low because metric definitions are inconsistent across experiments.

Option A: Sign the Optimizely contract; the platform will force discipline, and the CMO is right that experimentation should be a strategic capability.
Result: Year 1: Optimizely deployed, 50 experiments launched (up from 30), win rate stays at 10% because the underlying metric definitions are still inconsistent, and the data team can't support analysis quality. Year 2: 35 experiments (energy fading), $180K spent with no measurable revenue lift. The CFO cancels in Year 3. The platform was never the bottleneck.
Year 1 Spend: $0 → $180K. Win Rate: 10% → 10% (no change). Outcome: Cancelled in Year 3.

Option B: Reject the enterprise platform. Spend Q1 fixing shared metric definitions in dbt, then Q2 deploying the Statsig free tier or GrowthBook ($0-$30K/year). Re-evaluate the need for an enterprise platform when experiment volume reaches 100+/year and the metric foundation is solid.
Result: Q1-Q2: shared metric definitions deployed, Statsig free tier integrated. Q3-Q4: experiments increase to 70/year, and the win rate climbs to 18% because metric definitions are now consistent and CUPED is available. Year 2: ~150 experiments, 22% win rate, ~33 winners × $180K average revenue per winner ≈ $6M revenue lift; graduate to Statsig Pro at $80K/year. Year 3: the enterprise discussion makes sense at much higher volume and with proven discipline.
Year 1 Spend: $30K (vs $180K). Year 2 Win Rate: 10% → 22%. Year 2 Revenue Lift: $0 → ~$6M.
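A quick check of Option B's Year 2 arithmetic, in the same spirit as the ROI formula above; the $180K average revenue per shipped winner is the scenario's own assumption, not a benchmark.

```python
experiments, win_rate = 150, 0.22
avg_revenue_per_winner = 180_000               # scenario assumption, not a benchmark
winners = experiments * win_rate               # 33.0 shipped winners
revenue_lift = winners * avg_revenue_per_winner
print(f"{winners:.0f} winners, ${revenue_lift:,.0f} lift")  # 33 winners, $5,940,000 lift
```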
