26 typical A/B testing mistakes that can lead to up to 42% annual revenue loss
Booking.com’s loss from unsuccessful experiments is 2% of annual revenue. Let’s consider this as a benchmark. What revenue loss might a less experienced team have?
When you think about very costly A/B testing mistakes, the first thing that comes to mind is something like a “Buy” button bug. But that is not the most dangerous kind of mistake: it is evident, easily recognizable, and short-lived.
The worst story I have heard was a 42% drop in annual revenue after a feature was deployed based on false-positive experiment data.
We at Conversionrate.store have developed ~7,200 A/B tests for 231 clients, including Microsoft. Notably, 72% of our first 100 experiments contained mistakes that we only realized 8 months after starting A/B testing.
4 common problems related to experimentation
Here are four common problems related to experimentation that can dramatically decrease revenue or slow down the growth:
- Implementation of false-positive results
- No A/B testing at all for critical changes
- Direct revenue loss from underperforming variations
- Not maximizing the volume and velocity of experiments

26 typical A/B testing mistakes in 2026
All those issues are interconnected, so let’s go through 26 typical A/B testing mistakes that we see time and time again:
- Hypothesis is not focused on the main bottleneck
- Guessing reasons behind the main bottleneck
- Guessing how to fix the cause of the drop-off
- Using the wrong goal metric, such as conversion-to-purchase
- Data tracking not at least 90-97% accurate
- No event mapping for all elements on A and B
- Testing more than one hypothesis per experiment
- Stopping the experiment based only on statistical significance
- No MDE and pre-test sample size planning
- No QA of alternative versions after experiment is launched and no monitoring of experiment session recordings
- No regression QA of the control version during an experiment
- No QA of experiment data tracking
- Not eliminating the “novelty effect”
- Implementation of false positive results
- No anomaly detection
- Outliers not cleaned up
- No preliminary A/A or A/A/B tests
- No analytics or tracking of long-term impact of implemented winning versions
- No in-depth post-test research and documentation of results
- Targeting irrelevant traffic segments together in one experiment
- Not checking for sample-ratio mismatch (SRM) for 100% of experiment traffic or all meaningful segments you want to compare
- Experiment data set not visualized
- Deploying winning versions to a different audience than in the experiment
- Low experimentation velocity due to lack of in-house resources or absence of 100% dedicated experimentation teams
- Not leveraging parallel experiments when there is enough traffic
- Not shortening experiment time with CUPED or similar techniques that use historical data to increase metric sensitivity
A/B/n testing framework
Obviously, the most important thing in implementing a CRO program is to A/B/n test the hypotheses in the most efficient way.
Let’s go through the process as if we were launching our very first experiment.
- Define a macro conversion metric that best describes the impact on your revenue growth. We typically define it based on frequency of usage or purchases. For transactional companies like Airbnb or e-commerce stores, where users typically transact less often than once every couple of months, the best metric is average revenue per user (ARPU). For subscriptions or products with long-term usage, we define a leading indicator that forecasts LTV, like a 2nd month subscription payment. If you already have a North Star metric, then just choose that.
- Define secondary metrics that must not deteriorate, such as bounce rate, refunds, additional operational costs, or a specific retention or usage metric. Such metrics may not be reflected in short-term revenue but may cause long-term risks.
- Estimate the required sample size and the minimum detectable effect (MDE) for a winning experiment. Decide whether you have enough traffic for A/B/n testing or whether it’s better to go with A/B tests.
- Launch an initial A/A test to check, validate and calibrate the A/B testing tool or in-house traffic split solution and data tracking setups. You can also run a bunch of A/A/B tests if you have sufficient traffic and want to have additional confidence in statistical significance (for example if you want to establish trust with a CRO agency).
- Estimate opportunities for parallel testing, where users take part in several experiments at the same time. Most popular CRO blogs will tell you that testing this way is forbidden, but companies like Microsoft, Booking, Google, Netflix and LinkedIn do that to run 10,000-50,000 experiments simultaneously.
- Look for ways to cut the time needed to reach statistical significance, such as the CUPED method or targeting the test only at users who actually see a different UX (for example, if the change is on the 3rd screen of the landing page, run the test only on users who scrolled to the 3rd screen).
- Create an A/B/n testing calendar with approximate estimated times to stop experiments and develop new ones. Avoid pauses without any live experiments. If growth is proportional to the number of experiments, then one week out of four without tests means roughly 25% slower monthly growth (and the gap widens as each month’s shortfall compounds).
- We assume that the 57-step UX research plan has been completed and the hypotheses are maniacally prioritized, right?
- Choose a statistical formula that works best for your specific metric, type of dataset and its distribution. Lots of teams just blindly use statistical calculators after reading a bunch of blog posts on A/B test statistics. Take time to understand the nature of the statistical concepts. We recommend the book “Statistical Methods in Online A/B Testing” by Georgi Z. Georgiev as a good foundation for that.
- Prepare an automated dashboard that monitors all needed statistical metrics, sends notifications on significant drops, tracking and splitting issues, and recommends when to stop the test.
- Allocate a dedicated A/B/n test development, QA and analytics team that works on nothing but the experiments. If you don’t feel like doing that or don’t have the resources, read step 64 again: if the whole team is not 100% focused only on growth, growth will inevitably be slower. If resources are still tight, or it’s hard to hire and build more growth teams, you can outsource A/B test development to a CRO agency. This is safe and secure when done through client-side A/B testing tools like Optimizely, because the agency neither changes nor needs access to your actual source code.
- Develop the test and conduct manual QA.
- Set up additional data tracking if any new elements are planned on alternative versions.
- Launch the experiment on a small portion of traffic, large enough to verify tracking and experiment targeting and to surface bugs and technical problems.
- Ask the QA team to watch visitor session recordings of the alternative versions to detect bugs that were not found during manual QA or by quantitative metrics. That will also help to uncover the use cases and flows that should be tweaked to polish the hypotheses before the final launch.
- Steps 61-70 should be done every time… and within 7-14 days, to avoid days with no testing.
- It’s time to launch!
- Check the experiment metrics in the dashboard and sit tight until the needed sample size is collected or it’s evident that there is a significant issue or drop, or the experiment is likely to never be significant.
- When it looks like it’s time to stop the test, check for outliers and, if there are any, define a method to clean them up. Plot the transactions to understand the nature of the data set. This will help you choose the best way of dealing with outliers, such as filtering at 3 standard deviations, defining a threshold, or replacing extreme transaction values with the average.
- Time to stop the test!
- Conduct post-test analysis to understand specifically why the test won, lost or made no impact by looking at the micro-conversions and segments that were affected by the alternative version. This step is critical for CRO research and for learning things that feed the next hypotheses or a tweaked version of the current one. It also helps confirm that the experiment had no mistakes, since you’ll have more data than in the initial pre-launch check.
- Check personalization opportunities by looking at separate segments that have statistically valid growth.
- Choose a way to estimate the actual long-term impact after implementation. You can compare cohorts of A and B 1, 2, and 3 months after stopping the experiment. Or implement the changes on 90% of traffic instead of 100%, keeping the rest as a holdout. Define the amount of traffic and the frequency of rolling out new versions based on the sample size needed for significance. Another option is to repeat the winning experiment before implementation, or to run a B/A test some time after implementation. Repeatability of experimental results is a main feature of true scientific knowledge!
- It’s time to implement the winning version and repeat the process time and time again!
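To make the sample-size and MDE step above more concrete, here is a minimal Python sketch. It assumes a conversion-rate metric and a two-sided two-proportion z-test, which is only one of several valid approaches; a revenue metric like ARPU would need a different variance estimate. The function name and the input numbers are hypothetical, not part of any specific tool.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variant(baseline_rate, relative_mde, alpha=0.05, power=0.80):
    """Approximate users needed per variant for a two-sided two-proportion z-test."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_mde)  # the rate we want to be able to detect
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for significance
    z_power = NormalDist().inv_cdf(power)          # critical value for power
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance_sum / (p2 - p1) ** 2)


if __name__ == "__main__":
    # Hypothetical inputs: 5% baseline conversion, 10% relative lift as the MDE.
    n = sample_size_per_variant(baseline_rate=0.05, relative_mde=0.10)
    print(f"Users needed per variant: {n}")
```

Dividing the result by your daily traffic per variant gives a rough test duration. If that duration is unrealistic for A/B/n, that is a signal to test fewer variations or accept a larger MDE.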
Conclusion
A/B testing is one of the most reliable ways to grow conversion, yet that's exactly where the danger lies: the tool built to drive data-based decisions most often fails because of flaws in the process itself.
As our experience shows, even seasoned teams can go months without noticing mistakes that quietly drain revenue.
And the real threat isn't an obvious broken "Buy" button, but the subtler issues: shipping false positives, weak data QA, stopping experiments too early, and low testing velocity. None of the 26 mistakes exists in isolation — they form a single chain where one weak link can invalidate all the rest.
The takeaway is simple: the goal isn't to "run more tests," but to build a system where the volume, velocity, and validity of experiments grow together. That means sharp hypotheses tied to your funnel's real bottleneck, rigorous QA of both control and variant, clean handling of outliers, anomaly detection, and documented learning for the team.
If even a few points on this list hit home, it's worth reviewing your experimentation process now — before these mistakes cost you a share of your annual revenue.
FAQ
What is the single most expensive A/B testing mistake?
Deploying a feature based on a false positive. Unlike an obvious "Buy" button bug, a false positive looks like a win, gets shipped, and quietly drags revenue down for months. In the worst case we've documented, it contributed to a 42% annual revenue drop.
A structured CRO audit is the fastest way to catch these patterns before they reach production.
What are the main causes of false positives in A/B tests?
The most common causes of false positives in A/B tests are stopping an experiment the moment it hits statistical significance (peeking), running without a pre-defined sample size or minimum detectable effect (MDE), failing to remove outliers, ignoring the novelty effect, and not running A/A or A/A/B validation tests beforehand.
Each of these inflates the chance that random noise looks like a real result.
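The effect of peeking can be illustrated with a small simulation. The sketch below is an assumption-laden toy, not a production analysis: it runs repeated A/A tests (both arms share the same true conversion rate), applies a pooled two-proportion z-test at several interim looks, and compares how often a "winner" appears when you stop at any significant look versus only at the planned sample size. All names and parameters are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist


def p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value of a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulate(runs=500, users_per_arm=2000, looks=10, rate=0.10, alpha=0.05, seed=7):
    """Return (false positive rate with peeking, false positive rate at fixed horizon)."""
    rng = random.Random(seed)
    step = users_per_arm // looks
    peeking_hits = fixed_hits = 0
    for _ in range(runs):
        conv_a = conv_b = 0
        significant_at_any_look = False
        for look in range(1, looks + 1):
            # Both arms share the same true rate, so any "winner" is pure noise.
            conv_a += sum(rng.random() < rate for _ in range(step))
            conv_b += sum(rng.random() < rate for _ in range(step))
            p = p_value(conv_a, look * step, conv_b, look * step)
            if p < alpha:
                significant_at_any_look = True
        peeking_hits += significant_at_any_look
        fixed_hits += p < alpha  # p here comes from the final look only
    return peeking_hits / runs, fixed_hits / runs


if __name__ == "__main__":
    peeking, fixed = simulate()
    print(f"False positives when stopping at any significant look: {peeking:.1%}")
    print(f"False positives when waiting for the planned sample:   {fixed:.1%}")
```

With ten looks, the first number typically comes out well above the nominal 5%, while the second stays close to it. That gap is the reason to fix the sample size in advance or to use a sequential testing method designed for interim looks.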
What is a sample-ratio mismatch (SRM) and why does it matter?
SRM happens when traffic doesn't split the way you intended (e.g., 48/52 instead of 50/50). It signals a broken setup that can invalidate the entire experiment. Check for SRM across 100% of traffic, or at least every meaningful segment you plan to compare.
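A common way to check for SRM is a chi-square goodness-of-fit test on the observed user counts. The sketch below assumes a two-arm test and uses only the Python standard library; the function name and the 0.001 threshold are illustrative assumptions, so adjust them to your own policy.

```python
from math import erfc, sqrt


def srm_p_value(users_a, users_b, expected_share_a=0.5):
    """Chi-square goodness-of-fit p-value (1 degree of freedom) for a two-arm split."""
    total = users_a + users_b
    expected_a = total * expected_share_a
    expected_b = total - expected_a
    chi2 = ((users_a - expected_a) ** 2 / expected_a
            + (users_b - expected_b) ** 2 / expected_b)
    # For 1 degree of freedom the chi-square survival function reduces to erfc.
    return erfc(sqrt(chi2 / 2))


SRM_THRESHOLD = 0.001  # assumption: a strict cutoff, tune it to your own policy

for a, b in [(480, 520), (48_000, 52_000)]:
    p = srm_p_value(a, b)
    verdict = "possible SRM" if p < SRM_THRESHOLD else "no SRM flagged"
    print(f"{a}/{b}: p={p:.4g} -> {verdict}")
```

Note that the same 48/52 ratio is within normal noise on 1,000 users but a strong SRM signal on 100,000 users, so judge the split by a statistical test rather than by the ratio alone.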
How accurate does my data tracking need to be?
Aim for at least 90–97% tracking accuracy, with full event mapping for every element on both the control and the variant. If your tracking is unreliable, no amount of statistical rigor downstream will save the result.
What's the difference between a hypothesis and a guess?
A real hypothesis targets your funnel's main bottleneck and is grounded in research into why users drop off. Guessing the bottleneck, guessing the cause, and guessing the fix are three separate mistakes that compound.
For SaaS funnels specifically, our SaaS CRO services focus on isolating the true bottleneck before any test goes live.
Should I test more than one hypothesis in a single experiment?
No. Testing multiple changes at once makes it impossible to attribute the result to a specific cause. Isolate one hypothesis per experiment, or use a properly designed multivariate framework if you have the traffic.
Why do I need A/A tests before A/B tests?
An A/A test (two identical versions) validates that your tooling, randomization, and tracking produce no difference where none should exist. If your A/A test shows a "winner," your setup is flawed and any A/B result is suspect.
How important is QA during a live experiment?
Critical, and it's multi-layered: QA the variant after launch, run regression QA on the control, verify data tracking is firing correctly, and watch session recordings.
Skipping any layer is how broken variants quietly corrupt your results.
What tools do I need to run experiments properly?
You'll want testing, analytics, session recording, and anomaly-detection tools that work together. We break down the options in our guide to the best CRO software tools so you can match the stack to your traffic and team.
What are the common pitfalls companies face when scaling A/B testing programs?
Common pitfalls when scaling an A/B testing program include low experimentation velocity caused by a lack of dedicated in-house resources, failing to run parallel experiments when traffic allows, not using variance-reduction techniques like CUPED to speed tests up, and deploying winners to audiences different from the test population.
Volume, velocity, and validity all have to scale together, not just one of them.
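For readers unfamiliar with CUPED, the core idea fits in a few lines: subtract the part of each user's in-experiment metric that is predictable from the same user's pre-experiment metric. The sketch below is a simplified illustration on synthetic data, with hypothetical names; in a real analysis the coefficient is usually estimated on the pooled data of all variants and then applied to each variant.

```python
import random
from statistics import mean, variance


def cuped_adjust(metric, pre_metric):
    """Return the CUPED-adjusted metric: y - theta * (x - mean(x))."""
    x_bar, y_bar = mean(pre_metric), mean(metric)
    cov_xy = sum((x - x_bar) * (y - y_bar)
                 for x, y in zip(pre_metric, metric)) / (len(metric) - 1)
    theta = cov_xy / variance(pre_metric)  # regression slope of y on x
    return [y - theta * (x - x_bar) for x, y in zip(pre_metric, metric)]


if __name__ == "__main__":
    rng = random.Random(42)
    # Synthetic data: in-experiment spend correlates with pre-experiment spend.
    pre = [rng.gauss(100, 30) for _ in range(5000)]
    during = [0.8 * x + rng.gauss(20, 15) for x in pre]
    adjusted = cuped_adjust(during, pre)
    print(f"variance before: {variance(during):.1f}, after: {variance(adjusted):.1f}")
    print(f"mean before: {mean(during):.2f}, after: {mean(adjusted):.2f}")
```

The adjustment keeps the mean unchanged and lowers the variance, and the stronger the correlation between the pre-experiment and in-experiment metric, the larger the reduction. Lower variance means the same effect can be detected with fewer users.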
Why is it a mistake to skip post-test research and documentation?
Without in-depth post-test analysis and a documented record, you lose the compounding learning that makes an experimentation program valuable over time.
Each test should feed the next, not disappear after a single decision.
How do outliers and anomalies affect my results?
Unfiltered outliers and undetected anomalies (bot traffic, a one-off bulk order, a tracking spike) can single-handedly create or erase a "winner." Build anomaly detection and outlier cleanup into your standard analysis, not as an afterthought.
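As a rough illustration of how much a single order can move the average, the sketch below applies two of the cleanup methods mentioned earlier, filtering at 3 standard deviations and capping at a threshold, to a made-up list of order values. The data, the function names, and the cap of 200 are all hypothetical.

```python
from statistics import mean, stdev


def filter_three_sigma(values):
    """Drop values more than 3 sample standard deviations from the mean."""
    mu, sigma = mean(values), stdev(values)
    return [v for v in values if abs(v - mu) <= 3 * sigma]


def cap_at_threshold(values, threshold):
    """Keep every order but cap extreme values at a fixed threshold."""
    return [min(v, threshold) for v in values]


# Hypothetical order values: 20 typical purchases and one bulk order.
orders = [40, 55, 48, 52, 61, 45, 50, 58, 47, 53,
          49, 56, 44, 51, 60, 46, 54, 57, 43, 59, 5000]

print(f"raw mean:     {mean(orders):.2f}")
print(f"3-sigma mean: {mean(filter_three_sigma(orders)):.2f}")
print(f"capped mean:  {mean(cap_at_threshold(orders, 200)):.2f}")
```

Keep in mind that an extreme value inflates the standard deviation it is measured against, so in very small samples the 3-sigma rule can miss it. That is one more reason to plot the data before choosing a method, and to apply the same rule to both control and variant.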
Is it safe to deploy a winning variant to a broader audience?
Only if that audience matches the experiment's population. A result validated on one segment or traffic source won't necessarily hold on a different one. Deploying winners to a mismatched audience is a frequent and avoidable mistake.
Do I need a dedicated team, or can I outsource experimentation?
Both work, but velocity usually suffers without people 100% dedicated to experimentation. Many teams weigh the cost of building in-house against partnering with a specialist; our overview of top CRO companies and a separate guide on hiring a CRO consultant can help you decide which model fits.
How much should a proper CRO and experimentation program cost?
It varies with traffic, test volume, and engagement model. We lay out the ranges and pricing models transparently in our breakdown of CRO pricing so you can budget realistically and avoid paying for activity that doesn't move revenue.
Which mistakes are most common in ecommerce and Shopify stores specifically?
For online stores, the recurring issues are weak tracking accuracy, testing on irrelevant traffic segments, and chasing proxy metrics instead of revenue. Our ecommerce CRO services and dedicated Shopify CRO services are built around catching exactly these mistakes before they cost you sales.