You spent a lot of money on A/B testing, but you didn’t see any significant effects. Is there really nothing that moves the needle?
It’s possible that there is no difference. Or, it could be that there is a true effect that could have led to a big win; however, you missed it because the signal was buried in the noise! So, how can you stop letting the wins slip through your fingers in 2020?
To understand the solution, let me first introduce you to the ‘A/B Testing God’ who decides whether you see an effect: statistical power. Statistical power is a probability. It is the likelihood that an experiment will detect an effect when a real effect exists. In other words, if you don’t have enough statistical power, even if your new version has an impact on the customers, you won’t be able to see it.
So how can we increase statistical power in A/B tests? Based on its definition, statistical power depends on three factors: 1) the size of the change, 2) the sample size, and 3) the variance of the sample.
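To see these factors in action, here is a quick sketch in Python of how they feed into a power calculation for a two-proportion A/B test. The baseline rate, lift, sample size, and the use of statsmodels are made-up illustrative assumptions, not numbers from any real test:

```python
# A minimal sketch of a power calculation for a two-proportion A/B test.
# Baseline rate, lift, sample size, and alpha are made-up numbers chosen only
# to illustrate the three factors.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.10        # control conversion rate
lift = 0.02            # true improvement of the new version (factor 1: size of the change)
n_per_variant = 2000   # visitors in each group (factor 2: sample size)
alpha = 0.05           # significance level

# Cohen's h: a standardized effect size; the variance of a proportion is baked
# into this standardization (factor 3).
effect_size = proportion_effectsize(baseline + lift, baseline)

power = NormalIndPower().power(effect_size=effect_size,
                               nobs1=n_per_variant,
                               alpha=alpha,
                               ratio=1.0,
                               alternative='two-sided')
print(f"Probability of detecting the {lift:.0%} lift: {power:.2f}")
```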
Let’s put the three factors above into best practices.
Oftentimes, you prefer small, incremental changes because you are afraid of scaring your customers with a noticeably new feature. However, if the change is too small, you might never see an effect in the test. Testing a big change works because statistical power increases as the magnitude of the change increases.
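As a rough illustration, here is a sketch (using the same made-up baseline and sample size as before) of how power climbs as the lift you are testing gets bigger:

```python
# Sketch: with the same sample size, a bigger lift yields much higher power.
# Baseline, sample size, and lifts are illustrative assumptions.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline, n_per_variant, alpha = 0.10, 2000, 0.05

for lift in (0.005, 0.01, 0.02, 0.04):
    h = proportion_effectsize(baseline + lift, baseline)
    power = NormalIndPower().power(effect_size=h, nobs1=n_per_variant,
                                   alpha=alpha, alternative='two-sided')
    print(f"lift={lift:.1%}  power={power:.2f}")
```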
How do you test a big effect safely without risking the relationship with customers?
Another reason to test a large effect is that people spend so much time online that they become less sensitive to changes; you need to create something noticeably different to get their attention.
One caveat is the “novelty effect”: sometimes people click the new version just because it looks so different, not because they actually like it. To mitigate this, run your test long enough to give your customers time to adapt to the surprise and respond to the feature itself rather than to the novelty.
Am I wasting my time by continuing the experiment after seeing an effect?
Some people turn off the test and celebrate the win as soon as they see a change. However, if you stop your experiment too early, the results you see can be false positives caused by the random noise of a small sample.
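A quick simulation makes the point. The sketch below runs an A/A test (no real effect at all), “peeks” at the p-value after every batch of users, and stops at the first significant result; the batch size and number of peeks are made-up choices for illustration:

```python
# Sketch: peeking at an A/A test (no true difference) and stopping at the first
# "significant" p-value inflates the false-positive rate well above 5%.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims, n_peeks, batch = 2000, 20, 100
false_positives = 0

for _ in range(n_sims):
    a = rng.normal(size=n_peeks * batch)   # control: no real effect
    b = rng.normal(size=n_peeks * batch)   # treatment: no real effect
    for k in range(1, n_peeks + 1):        # peek after every batch of users
        _, p = stats.ttest_ind(a[:k * batch], b[:k * batch])
        if p < 0.05:                       # stop early and "celebrate the win"
            false_positives += 1
            break

# Far above the nominal 0.05 you thought you were working with.
print(f"False-positive rate with peeking: {false_positives / n_sims:.2f}")
```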
Statistical power increases as the sample size increases, so you need to collect enough samples by running your experiment long enough.
Look at this example below: in the real world, Golden Retrievers are larger than Beagles on average; but when the sample size is too small, you might draw a few outliers such as Golden Retriever puppies, and you end up missing the truth about which breed is larger.
When your test runs longer, your sample is larger, and the data is less affected by outliers, which makes it easier to see a real effect.
So how long is long enough for the run time of A/B tests?
1. Especially when you cannot follow the first rule and have to test a small change, you can run the experiment longer to collect more samples and increase the statistical power of your test (the sketch after this list shows how to turn the required sample size into a run time).
2. You should run your test for at least 7 days because there is a ‘day of the week’ effect — your customers’ behavior over the weekend might be different from during the week.
3. No matter how long you plan to run it, set a schedule before you start the experiment and stick to it.
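To turn “long enough” into an actual schedule, here is a sketch that solves for the required sample size per variant and converts it into a run time. The baseline rate, minimum detectable effect, and daily traffic are placeholder numbers you would replace with your own:

```python
# Sketch: how long to run the test for a desired power, under assumed numbers.
import math
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.10                   # current conversion rate (assumption)
mde = 0.01                        # minimum detectable effect you care about (assumption)
alpha, power = 0.05, 0.80
daily_visitors_per_variant = 500  # assumption about your traffic split

effect_size = proportion_effectsize(baseline + mde, baseline)
n_needed = NormalIndPower().solve_power(effect_size=effect_size, alpha=alpha,
                                        power=power, ratio=1.0,
                                        alternative='two-sided')

days = math.ceil(n_needed / daily_visitors_per_variant)
weeks = math.ceil(max(days, 7) / 7)   # at least 7 days, rounded to whole weeks
print(f"Need ~{n_needed:,.0f} users per variant -> run for about {weeks} week(s)")
```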
Is longer always better?
Intuitively, you want to see how different customer segments react to a new feature, but that might reduce your statistical power. When you combine different types of customers in an A/B test, the variance becomes higher, and statistical power is higher when the sample variance is low.
For example, if you are testing a feature on the sign-up page for new customers, you don’t need to include pages that only existing customers can access. The graph below shows that it’s easier to detect a difference when the variance is low (i.e., when you include only one segment of customers).
You might ask, shouldn’t I get a bigger sample by including everyone? You should gather more samples by running the experiment longer, not by over-targeting: irrelevant samples only dilute the effect and add noise. Additionally, if your target audience has very low variance, you might be able to reach enough statistical power with a relatively small sample.
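Here is a rough sketch of why mixing segments hurts: with the same absolute lift, a noisier (mixed) audience means a smaller standardized effect and lower power. The dollar figures and standard deviations are made up for illustration:

```python
# Sketch: the same absolute lift is harder to detect when the audience is a mix
# of segments with very different behavior (higher variance).
from statsmodels.stats.power import TTestIndPower

lift = 2.0                 # true lift in revenue per user, in dollars (assumption)
n_per_variant = 1000
alpha = 0.05

std_new_customers = 10.0   # a single, homogeneous segment (assumption)
std_mixed_audience = 25.0  # new + existing customers mixed together (assumption)

for label, std in [("new customers only", std_new_customers),
                   ("everyone mixed", std_mixed_audience)]:
    d = lift / std         # Cohen's d shrinks as the variance grows
    power = TTestIndPower().power(effect_size=d, nobs1=n_per_variant,
                                  alpha=alpha, alternative='two-sided')
    print(f"{label:20s}  power={power:.2f}")
```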
What if you are testing a feature that indeed affects everyone? If you worry that you won’t see any effect due to high variance, you can start by testing a small segment to see whether there is any signal. But eventually, you should include everyone in the test, because you can’t generalize the results from a sub-group to the entire population.
In the industry, we usually set the statistical power to 0.8, which means there is still a 20% chance that you won’t detect a real effect. So why don’t we increase it to 95%? You could. Technically, there is a fourth way to increase power: raise a parameter called ‘alpha’ (the significance level) when calculating the statistical power. However, increasing statistical power this way also increases the likelihood of a false positive; you are more likely to find a ‘fake signal’ when there isn’t one.
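For completeness, here is a sketch of that alpha knob (with a made-up effect size and sample size): raising alpha does buy power, but the chance of a fake signal rises with it.

```python
# Sketch: raising alpha buys power, but the price is a higher false-positive rate.
# Effect size and sample size are illustrative assumptions.
from statsmodels.stats.power import TTestIndPower

effect_size, n_per_variant = 0.1, 1000

for alpha in (0.01, 0.05, 0.10, 0.20):
    power = TTestIndPower().power(effect_size=effect_size, nobs1=n_per_variant,
                                  alpha=alpha, alternative='two-sided')
    # By definition, alpha is the probability of declaring a "win" when there
    # is no real effect at all.
    print(f"alpha={alpha:.2f}  power={power:.2f}  false-positive rate={alpha:.0%}")
```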
There is a trade-off between statistical power and false discoveries, and it’s up to you to decide which matters more: missing a discovery when there actually is an effect, or making a false discovery when there isn’t one.
For example, if you are screening for a fatal disease, you’d rather have a false discovery than miss the diagnosis, so you want more statistical power. If you are testing a feature that is very expensive to launch in production, you want to be sure there is a real effect, so you accept moderate statistical power to reduce the chance of wasting money on a false-positive result.
Unless you are very familiar with statistical testing, I don’t suggest achieving higher statistical power this way.