The ultimate guide to A/B testing | Ronny Kohavi (Airbnb, Microsoft, Amazon)

Most product ideas fail. Ronny Kohavi (Airbnb, Microsoft, Amazon) explains why intuition is overrated and how to build a robust A/B testing culture. Learn why you should ditch "big bang" redesigns and prioritize data to drive real growth.

In the world of product development, intuition is often celebrated. We laud the visionary product manager or the designer with the "golden gut." However, the data tells a different, far more humbling story. According to Ronny Kohavi, widely regarded as the world’s leading expert on A/B testing and experimentation, the vast majority of ideas—even those from smart, experienced teams—fail to produce the intended result.

Kohavi has led experimentation platforms at some of the most data-driven companies on the planet: Airbnb, Microsoft, and Amazon. His experience overseeing tens of thousands of experiments offers a stark lesson: if you aren't testing, you are likely shipping code that is at best neutral ("flat") and at worst actively harmful to your user experience.

This guide explores the tactical realities of creating an experimentation culture, why you should probably abandon your "big bang" redesign, and why trust in your data platform is the most critical asset you have.

Key Takeaways

  • Most ideas fail: Across Amazon, Microsoft, and Airbnb, the failure rate for experiments ranges from 66% to over 90%. Success requires testing high volumes of ideas to find the few winners.
  • Beware the "Big Bang" redesign: Large-scale redesigns almost always fail to improve metrics because they bundle good changes with bad ones. Iterative testing is the only safe path.
  • Define your OEC carefully: Optimizing solely for short-term revenue often leads to long-term churn. You must establish an Overall Evaluation Criterion (OEC) that predicts lifetime value.
  • Twyman’s Law is absolute: If a result looks too good to be true, it is almost certainly a bug, not a breakthrough.
  • Scale matters: To detect meaningful changes (e.g., 5-10% improvements), you typically need at least 200,000 users. Below that, focus on qualitative feedback and building the culture.

The Humbling Reality of Experiment Failure Rates

One of the hardest pills for product teams to swallow is the sheer frequency of failure. When an organization begins A/B testing, there is often an assumption that their "better" product managers or engineers will yield higher success rates. The data suggests otherwise.

Kohavi shares specific failure rates from his tenure at major tech giants:

  • Microsoft: Approximately 66% of ideas failed to improve the metric they were intended to move.
  • Bing: As the product matured and became more optimized, the failure rate climbed to roughly 85%.
  • Airbnb Search: In this highly specific domain, 92% of experiments failed to produce a positive outcome.

This does not mean the teams were incompetent; it means that predicting human behavior and system complexity is incredibly difficult. If you do not test, you simply don't see these failures—you launch them. A "flat" result (one that doesn't improve metrics) should generally not be shipped. Every line of new code introduces maintenance debt; if it doesn't add value, it shouldn't exist.

It's amazing how many times I've seen people come up with new designs or a radical new idea and they believe in it... I'm just cautioning them all the time to say hey, if you go for something big, try it out but be ready to fail eighty percent of the time.

The $100 Million Lesson in "Trivial" Changes

While most experiments fail, the ones that succeed can be company-defining. Often, these wins come from the most unexpected places. Kohavi advocates for a "test everything" approach because the magnitude of impact is rarely correlated with the complexity of the engineering effort.

The most famous example from Kohavi’s career at Microsoft involved a trivial change to the display of ads on Bing. An engineer proposed moving the text from the second line of an ad to the first line, effectively making the title longer. The idea sat in the backlog for months, prioritized low because it seemed insignificant.

When an engineer finally decided to implement it—a task that took only days—the results were shocking. Revenue increased by 12%. At the time, that 12% jump was worth over $100 million annually. It was the single largest revenue-generating idea in Bing’s history, accomplished with minimal code.

This reinforces the portfolio approach to experimentation: you must run hundreds of tests to find the "black swans" that pay for all the failures.

Beyond Revenue: The Overall Evaluation Criterion (OEC)

A common pitfall in A/B testing is optimizing for the wrong metric. If you tell a team to "increase revenue," they can easily do so by plastering the site with ads or sending aggressive emails. This raises short-term numbers but destroys the user experience and long-term retention.

To combat this, you must define an Overall Evaluation Criterion (OEC). This is a quantitative measure of the experiment's objective that is causally predictive of long-term goals, such as customer lifetime value.

The Amazon Email Case Study

At Amazon, the team responsible for recommendation emails initially measured success by how much revenue those emails generated. This created a perverse incentive: the team sent more and more emails because the immediate revenue went up.

However, this inevitably led to "spammy" behavior. To fix this, Kohavi’s team had to model the negative cost of an unsubscribe. They determined a dollar value for losing a subscriber and subtracted that from the revenue generated by the campaign. Once this "countervailing metric" was introduced, half of the email campaigns immediately turned negative and were shut down. The OEC forced the team to balance short-term gains with long-term health.
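To make the mechanics concrete, here is a minimal sketch of that kind of countervailing OEC. The campaign names, revenue figures, and the per-unsubscribe dollar cost are illustrative assumptions, not Amazon's actual numbers.

```python
# Minimal sketch of an email OEC with a countervailing unsubscribe cost.
# All numbers are made up for illustration; the unsubscribe cost is a modeled
# lifetime-value estimate, not Amazon's real figure.
from dataclasses import dataclass

UNSUBSCRIBE_COST = 2.50  # assumed dollar value of losing a subscriber


@dataclass
class EmailCampaign:
    name: str
    revenue: float      # incremental revenue attributed to the campaign ($)
    unsubscribes: int   # users who unsubscribed after receiving it


def oec(campaign: EmailCampaign) -> float:
    """Short-term revenue minus the long-term cost of lost subscribers."""
    return campaign.revenue - UNSUBSCRIBE_COST * campaign.unsubscribes


campaigns = [
    EmailCampaign("recommendations_v1", revenue=12_000, unsubscribes=1_500),
    EmailCampaign("recommendations_v2", revenue=9_000, unsubscribes=6_000),
]
for c in campaigns:
    verdict = "keep" if oec(c) > 0 else "shut down"
    print(f"{c.name}: OEC = ${oec(c):,.0f} -> {verdict}")
```

Once the unsubscribe cost enters the objective, a campaign that looks profitable on raw revenue alone can flip negative, which is exactly what happened to half of the email program.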

The Danger of Redesigns and When to Start Testing

A prevalent myth in product development is that if a product has hit a local maximum, the only way to break through is a complete, from-scratch redesign. Kohavi strongly advises against this strategy.

Large redesigns suffer from a lack of isolation. You might introduce three features that customers love and seven that they hate. When you launch them all at once, the net result is negative, but you have no way of knowing which parts are responsible. This leads to the "sunk cost fallacy," where teams force a launch because they spent six months building it, even if the data shows it hurts the business.

Instead, apply the one-factor-at-a-time principle. Deconstruct the redesign into smaller chunks and validate them sequentially.

When is a startup ready for A/B testing?

Startups often try to experiment too early, but statistical detection requires volume. Kohavi suggests the following heuristics (a rough power calculation is sketched after the list):

  • Tens of thousands of users: You can start running very basic tests, but you will only be able to detect massive effects.
  • 200,000+ users: This is the "magic number" where you can begin to detect nuanced changes (5-10% improvements) with statistical significance.
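To see roughly where these thresholds come from, the sketch below runs a standard two-proportion power calculation. The 5% baseline conversion rate, 80% power, and 0.05 significance level are assumptions chosen for the example, not figures from Kohavi.

```python
# Back-of-the-envelope sample size for a two-proportion z-test.
# Assumes a 5% baseline conversion rate, alpha = 0.05 (two-sided), power = 0.8.
from scipy.stats import norm


def users_needed(baseline: float, relative_lift: float,
                 alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate users needed per variant to detect a relative lift."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return int(round((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2))


for lift in (0.05, 0.10):
    n = users_needed(baseline=0.05, relative_lift=lift)
    print(f"{lift:.0%} relative lift: ~{n:,} users per variant ({2 * n:,} total)")
# A 5% relative lift needs roughly 240k users in total and a 10% lift roughly
# 60k, which is in the same ballpark as the "200,000+ users" heuristic above.
```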

If you are below these thresholds, focus on qualitative feedback and product-market fit. However, you should still invest in the culture of experimentation so you are ready when you scale.

Trust, Validity, and Twyman’s Law

The most dangerous result in A/B testing is not a failure; it is a false positive. If your platform tells you a feature is a winner when it is actually neutral or negative, you will make bad strategic decisions. Trust in the platform is paramount.

Twyman’s Law

Kohavi frequently cites Twyman’s Law: "Any figure that looks interesting or different is usually wrong."

If you run a test and see a 50% increase in conversion, do not celebrate. Investigate. In 9 out of 10 cases, the cause is a data pipeline error, a logging bug, or a "Sample Ratio Mismatch" (SRM). An SRM occurs when your intended 50/50 traffic split comes back as, say, 50.2/49.8. That sounds close, but at the scale of hundreds of thousands or millions of users such a deviation is vanishingly unlikely to happen by chance; it signals a severe defect (often caused by bots or browser incompatibilities) that invalidates the entire test.
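A simple way to catch an SRM is a chi-squared test against the intended split. The traffic counts below are hypothetical; the point is that a 50.2/49.8 split over a million users is wildly improbable by chance.

```python
# Minimal SRM check against an intended 50/50 split (hypothetical counts).
from scipy.stats import chisquare

control, treatment = 502_000, 498_000   # observed users per arm
total = control + treatment
stat, p_value = chisquare([control, treatment], f_exp=[total / 2, total / 2])

# With 1M users, a 50.2/49.8 split gives p on the order of 1e-5: far too
# unlikely to be chance, so the assignment pipeline itself is suspect.
print(f"SRM p-value: {p_value:.2g}")
if p_value < 0.001:
    print("Sample Ratio Mismatch detected: investigate before reading any metrics.")
```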

The Misunderstood P-Value

Finally, most product managers misunderstand statistical significance. A p-value of 0.05 does not mean there is a 95% chance your idea is good. Given the low success rate of ideas generally (the "prior probability"), a p-value of 0.05 actually carries a much higher false positive risk—potentially as high as 26%.
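The arithmetic behind that risk is a straightforward application of Bayes' rule. The 15% prior success rate and 80% power below are illustrative assumptions that happen to reproduce a roughly 26% false positive risk; the exact number depends on your organization's actual win rate.

```python
# Bayes-rule sketch of the false positive risk behind a "significant" result.
# Prior success rate and power are illustrative assumptions, not measured values.
def false_positive_risk(prior_success: float = 0.15,
                        alpha: float = 0.05,
                        power: float = 0.8) -> float:
    """P(the idea is actually useless | the test came back significant)."""
    true_positives = power * prior_success
    false_positives = alpha * (1 - prior_success)
    return false_positives / (true_positives + false_positives)


print(f"{false_positive_risk():.0%}")              # ~26% with these assumptions
print(f"{false_positive_risk(alpha=0.01):.0%}")    # drops to ~7% at p < 0.01
```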

To me, the experimentation platform is the safety net... Trust builds up, it's easy to lose.

To mitigate this, mature organizations like Airbnb require a lower p-value (e.g., 0.01) or mandatory replication of results before declaring a victory. It is better to miss a marginal win than to confidently launch a mistake.

Conclusion

Building an experimentation culture is not just about installing software; it is about building institutional memory. Successful teams document not just their wins, but their surprising failures. They hold quarterly reviews to discuss hypotheses that were wrong, preventing future teams from repeating the same mistakes.

Whether you are a startup or a tech giant, the principles remain the same: remain humble, test everything, distrust results that look too good to be true, and always optimize for the long-term value of the user.
