Product Experimentation Practice
Product experimentation practice is the organizational capability to run controlled experiments (A/B tests, multivariate tests, holdout tests) at scale: not as one-off optimization exercises, but as the default way product decisions get made. A mature practice has four components: (1) infrastructure (an experimentation platform like Statsig, GrowthBook, Eppo, Optimizely, or LaunchDarkly Experiments), (2) statistical rigor (proper sample sizes, defined primary metrics, guardrail metrics, sequential or fixed-horizon testing, p-value thresholds), (3) experimental velocity (the number of experiments shipped per quarter), and (4) decision discipline (results actually drive product decisions, including kills). Booking.com famously runs ~1,000 concurrent experiments at peak; Spotify, Netflix, and Airbnb run hundreds. Most companies run 0-3 per quarter and call it an 'experimentation culture.' The gap between aspiration and practice is enormous, and that gap is where most product roadmaps go to die.
The Trap
The trap is calling yourself 'data-driven' while running 1-2 experiments per quarter. At that velocity, you're not testing; you're making one bet at a time and calling it science. The second trap: launching experiments without enough traffic to detect the effect size you care about. A test that needs 200,000 users per variant to detect a 2% lift will, if run on 8,000 users, produce noise, and product teams will interpret that noise as signal. The third trap: testing only what's easy (button colors, copy) instead of what matters (pricing, onboarding flow, feature gating). The fourth: not pre-registering the primary metric, then cherry-picking whichever of six secondary metrics 'won.' This is p-hacking, and it produces ship decisions that fail in the wild.
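To see why traffic matters this much, run the power arithmetic before launch. The sketch below uses a standard two-proportion z-test approximation; the function name, baseline conversion rate, significance level, and power are illustrative assumptions, not figures from the examples above.

```python
# Minimal sample-size sketch for a two-proportion z-test (illustrative, not any
# platform's calculator; baseline rate, alpha, and power are assumptions).
from statistics import NormalDist
from math import ceil

def users_per_variant(baseline_rate: float, relative_lift: float,
                      alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate users needed per variant to detect a relative lift."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided test
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2
    return ceil(n)

# e.g. a 20% baseline conversion rate and a 2% relative lift:
print(users_per_variant(0.20, 0.02))   # ~158,000 per variant -- far more than 8,000
```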
What to Do
Build the practice in this order:
1. Pick a platform. Statsig and GrowthBook are strong starting points (Statsig for velocity, GrowthBook for self-hosted).
2. Define a primary metric per experiment, plus 1-2 guardrail metrics (e.g., revenue, retention, latency).
3. Compute sample size BEFORE launching. Use a power calculator; don't run underpowered tests.
4. Pre-register hypotheses in a shared doc.
5. Define ship criteria upfront: the primary metric lifts by X% with p < 0.05 and guardrails don't degrade (a minimal version of this rule is sketched below).
6. Run a weekly 'experiment readout' meeting where every concluded test is reviewed and decisions are documented.
7. Track velocity: target at least 1 experiment per PM per month; mature practices hit 2-4.
8. Maintain an 'experiment graveyard': published results of every test, including the boring ones, so the org learns.
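Step 5 is easiest to enforce when the ship criteria are written down as an explicit rule before launch. The sketch below is one minimal way to encode such a rule; the metric names, thresholds, and function are assumptions for illustration, not a specific platform's API.

```python
# Illustrative pre-registered ship rule for step 5. Metric names and thresholds
# are assumptions for the sketch, not any platform's API.
from dataclasses import dataclass

@dataclass
class MetricResult:
    name: str
    relative_lift: float   # e.g. 0.03 means +3% vs. control
    p_value: float

def should_ship(primary: MetricResult, guardrails: list[MetricResult],
                min_lift: float = 0.02, alpha: float = 0.05,
                max_guardrail_drop: float = -0.01) -> bool:
    """Ship only if the primary metric clears the pre-registered bar
    and no guardrail degrades beyond the allowed tolerance."""
    primary_ok = primary.relative_lift >= min_lift and primary.p_value < alpha
    guardrails_ok = all(g.relative_lift >= max_guardrail_drop for g in guardrails)
    return primary_ok and guardrails_ok

# Example readout, with made-up numbers:
primary = MetricResult("signup_conversion", relative_lift=0.031, p_value=0.012)
guardrails = [MetricResult("7d_retention", relative_lift=-0.002, p_value=0.61),
              MetricResult("revenue_per_user", relative_lift=0.004, p_value=0.40)]
print(should_ship(primary, guardrails))   # True under these assumed thresholds
```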
Formula
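The sample-size guidance above reduces to a standard approximation. Stated as the textbook two-proportion z-test formula (assumed here; it is the same calculation most power calculators implement), the required users per variant is roughly:

n \approx \frac{\left(z_{1-\alpha/2} + z_{1-\beta}\right)^2 \,\bigl[\,p_1(1-p_1) + p_2(1-p_2)\,\bigr]}{(p_2 - p_1)^2}

where p_1 is the baseline conversion rate, p_2 = p_1(1 + MDE) is the rate under the minimum detectable effect, alpha is the significance threshold, and 1 - beta is the desired power. Halving the detectable effect roughly quadruples the required sample, which is why underpowered tests are the norm rather than the exception.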
In Practice
Booking.com's experimentation practice is the most-cited case in tech. At peak, Booking runs roughly 1,000 concurrent experiments: every page, every email, every recommendation algorithm is being tested. Their published research notes that the majority of experiments either fail or produce no statistically significant effect, and that this is the point. A high-velocity experimentation program is a search algorithm, not an optimization engine: most ideas don't work, and the value is in finding the few that do, fast. Booking attributes a meaningful share of its conversion advantage to this discipline. On the platform vendor side, Statsig (founded by ex-Facebook engineers), GrowthBook (open-source), and Eppo (statistical-rigor-first) have built businesses on lowering the cost of running experiments; Statsig's published case studies show customers like Notion and Atlassian reaching dozens of concurrent experiments within 12 months of adoption, up from near-zero before the platform.
Pro Tips
- 01
Pre-register your primary metric before the test starts. Pick ONE. If you can't decide between conversion and retention, define a composite, but commit to it. Selecting the metric after seeing results is p-hacking; it produces ship decisions that fail in production.
- 02
Most experiments are underpowered. Before launching, compute the minimum detectable effect (MDE) for your sample size (a sketch of this check follows these tips). If the MDE is 8% and you expect a 2% lift, the test is effectively guaranteed to be inconclusive; don't run it. Either get more traffic or test a bigger change.
- 03
Track experiment KILL rate alongside ship rate. A team that ships 80% of experiments has insufficient experimental rigor (they're shipping noise) OR insufficient ambition (they're only testing safe bets). Healthy ship rates are 20-35%; the value is in killing fast.
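The MDE check in tip 02, as a runnable sketch: fix the traffic you actually have and solve for the smallest effect the test can reliably detect (the inverse of the sample-size calculation earlier). The function name, baseline rate, alpha, and power are illustrative assumptions.

```python
# Rough MDE check for Pro Tip 02: given your real traffic, what is the smallest
# relative lift the test can reliably detect? Assumptions: two-proportion z-test,
# illustrative baseline rate, alpha = 0.05, power = 0.80.
from statistics import NormalDist

def minimum_detectable_effect(baseline_rate: float, users_per_variant: int,
                              alpha: float = 0.05, power: float = 0.80) -> float:
    """Approximate relative MDE for a two-proportion z-test."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    # Approximate both variants' variance with the baseline variance.
    se = (2 * baseline_rate * (1 - baseline_rate) / users_per_variant) ** 0.5
    absolute_mde = z * se
    return absolute_mde / baseline_rate

# 8,000 users per variant at a 20% baseline conversion rate:
print(f"{minimum_detectable_effect(0.20, 8_000):.1%}")   # ~8.9% relative lift
```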
Myth vs Reality
Myth
"A/B tests give you the truth: if the test wins, ship it"
Reality
A/B tests give you a STATISTICAL signal under the conditions you tested. Generalization to other traffic, other seasons, other segments is uncertain. A test that wins for 6 weeks may lose at scale because the early-adopter cohort behaved differently from the broader user base. Always validate ship decisions with post-launch monitoring.
Myth
"You need 80%+ of experiments to ship for the program to be valuable"
Reality
Booking.com's program ships well under half. The value is in fast learning, not in a high win rate. A program where 80% ship is either testing trivial changes or has weak statistical rigor; either way, the value is lower than it looks.
Try it
Run the numbers.
Pressure-test the concept against your own knowledge: answer the challenge or try the live scenario.
Knowledge Check
Your team ran an A/B test on a new pricing page. Variant B 'won' on conversion (+4%) but the test only had 3,200 users per variant. Power analysis suggests you needed 18,000 per variant to reliably detect a 4% effect. The team wants to ship. What's the right call?
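If you want to run the numbers on this scenario yourself, the sketch below estimates the power the test actually had. The 35% baseline conversion rate is an assumption (the scenario doesn't state one), so treat the output as illustrative.

```python
# Approximate achieved power of a two-proportion z-test at the traffic you had.
# The 35% baseline conversion rate is an assumption; the scenario doesn't give one.
from statistics import NormalDist

def achieved_power(baseline_rate: float, relative_lift: float,
                   users_per_variant: int, alpha: float = 0.05) -> float:
    p1, p2 = baseline_rate, baseline_rate * (1 + relative_lift)
    se = ((p1 * (1 - p1) + p2 * (1 - p2)) / users_per_variant) ** 0.5
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    return 1 - NormalDist().cdf(z_alpha - abs(p2 - p1) / se)

print(f"{achieved_power(0.35, 0.04, 3_200):.0%}")   # ~21% power under these assumptions
```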
Industry benchmarks
Is your number good?
Calibrate against real-world tiers. Use these ranges as targets, not absolutes.
Experiments Concluded per Quarter (B2B SaaS, growth-stage)
B2B SaaS: concluded controlled experiments per quarter
Booking.com / Netflix tier: > 100/qtr
Mature Practice: 30-100/qtr
Growing Practice: 10-30/qtr
Aspirational: 3-10/qtr
Not a Real Practice: < 3/qtr
Source: Statsig customer benchmarks; Booking.com published research
Real-world cases
Companies that lived this.
Verified narratives with the numbers that prove (or break) the concept.
Booking.com
2010-present
Booking.com's experimentation program is the most-cited in tech. At peak the company runs roughly 1,000 concurrent experiments: every page, every email, every recommendation. Booking's published research repeatedly emphasizes that the MAJORITY of experiments fail or produce no statistically significant effect, and that this is the point. The program is a search algorithm: most ideas don't work; the value is in finding the few that do, fast. Booking attributes much of its long-term conversion advantage in online travel to this discipline. The platform is largely homegrown, built over a decade, with deep statistical rigor (sequential testing, CUPED for variance reduction, segment-level analysis as a default; a minimal CUPED sketch follows this case).
Concurrent Experiments (peak)
~1,000
Experiment Win Rate
Minority (most fail)
Investment Period
10+ years compounding
Strategic Outcome
Sustained conversion lead in OTA category
A high-velocity experimentation program is a search algorithm. Most attempts fail; the program's value is in failing fast and shipping the rare wins. A high win rate signals insufficient ambition or insufficient rigor.
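For reference, CUPED (mentioned above as part of Booking's statistical toolkit) is simple to sketch: adjust each user's in-experiment metric with a pre-experiment covariate to shrink variance without biasing the treatment comparison. The version below is the standard covariate-adjustment formulation, assumed here for illustration; it is not Booking.com's code, and the data is synthetic.

```python
# Minimal CUPED sketch (standard formulation, assumed for illustration):
# adjust the in-experiment metric Y with a pre-experiment covariate X
# to reduce variance without changing the expected treatment effect.
import numpy as np

def cuped_adjust(y: np.ndarray, x_pre: np.ndarray) -> np.ndarray:
    """Return Y - theta * (X - mean(X)), where theta = cov(X, Y) / var(X)."""
    theta = np.cov(x_pre, y, ddof=1)[0, 1] / np.var(x_pre, ddof=1)
    return y - theta * (x_pre - x_pre.mean())

# Synthetic example: pre-period spend predicts in-experiment spend, so the
# adjusted metric has noticeably lower variance.
rng = np.random.default_rng(0)
x = rng.gamma(2.0, 10.0, size=10_000)            # pre-experiment covariate
y = 0.8 * x + rng.normal(0, 5, size=10_000)      # in-experiment metric
print(np.var(y), np.var(cuped_adjust(y, x)))     # adjusted variance is smaller
```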
Statsig + GrowthBook + Eppo + Optimizely (Vendor Pattern)
2020-2024
The four major modern experimentation platforms have each documented similar adoption patterns through customer case studies. Statsig (founded by ex-Facebook engineers) emphasizes velocity: customers like Notion and Atlassian have published case studies showing experiment counts climbing from near-zero to dozens of concurrent tests within a year. GrowthBook (open-source) targets self-hosted statistical rigor. Eppo emphasizes 'CUPED-by-default' for higher statistical power. Optimizely (the original A/B testing platform) remains strong in enterprise web. All four converge on the same conclusion: organizations that adopt a platform see a 5-15x increase in experiment velocity within 12 months, not because the platform is magic, but because it removes the engineering bottleneck that kept experimentation rare.
Typical Velocity Lift After Platform Adoption
5-15x within 12 months
Common Bottleneck Removed
Engineering setup time per test
Top Practice (high-functioning teams)
PM-configured tests, not engineer-built
Failure Mode (low-functioning teams)
Bought platform, didn't change process
The platform is necessary but not sufficient. Velocity comes from process change: PMs and designers configuring tests directly, weekly readouts, written ship criteria. The tool is the enabler, not the practice.
Decision scenario
The Experimentation Platform Decision
You're VP Product at a $50M ARR SaaS with 200K MAU. You currently run 3 experiments per quarter using a homegrown feature-flag system. CEO wants 'compounding product wins.' Two paths: (a) hire a senior data scientist ($220K loaded) to design and analyze experiments on the existing system, or (b) adopt Statsig ($120K/year for your scale) and train existing PMs to configure tests in the UI.
ARR
$50M
MAU
200,000
Current Experiments / Qtr
3
Current Ship Rate
~50% (suspect noise)
Decision 1
Both options consume similar annual cash. The data scientist adds analytical depth; the platform adds velocity. Engineering won't get bandwidth either way to refactor the homegrown system.
Hire the data scientist. Statistical rigor will rise; experiments will be analyzed properly. The homegrown system can keep running.
Adopt Statsig. Train 6 PMs to configure experiments; engineering is only needed for instrumentation. Hire a part-time data analyst (~$80K) to support readouts. (Optimal)
Related concepts
Keep connecting.
The concepts that orbit this one; each one sharpens the others.
Beyond the concept
Turn Product Experimentation Practice into a live operating decision.
Use this concept as the framing layer, then move into a diagnostic if it maps directly to a current bottleneck.
Typical response time: 24h · No retainer required