Minimum Sample Size for an A/B Test (Power Analysis)

Isabella
4 min read · Jan 11, 2021

The purpose of A/B testing is to find out whether the treatment group performs significantly better than the control group on a chosen success metric (e.g. conversion rate). How can we tell if it is “significantly” better? The null hypothesis of our statistical test is that there is no difference between the treatment and control groups, and we would like to show that they are statistically different from each other (i.e. reject the null hypothesis). For the test to be statistically robust, it is crucial to do an a priori power analysis.
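
For concreteness, one common way to run this comparison for conversion rates is a two-proportion z-test; the sketch below uses statsmodels’ proportions_ztest with made-up visitor and conversion counts purely for illustration.

# Two-proportion z-test on made-up conversion counts (illustrative only)
from statsmodels.stats.proportion import proportions_ztest

conversions = [200, 230]  # conversions in control and treatment (hypothetical)
visitors = [2000, 2000]   # visitors in control and treatment (hypothetical)

z_stat, p_value = proportions_ztest(conversions, visitors)
print('z = %.3f, p-value = %.3f' % (z_stat, p_value))
# If the p-value is below alpha (e.g. 0.05), we reject the null hypothesis of equal conversion rates.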

The aim of this article is to introduce power analysis to calculate the minimum sample size required in A/B tests and briefly explain the math behind it. In power analysis, there are 4 key concepts/parameters:

  1. Type I error (false positive rate) refers to the case where we reject the null hypothesis when it should not be rejected. This is also known as the significance level, alpha. A common value for alpha is 0.05.
  2. Type II Error and Statistical Power
  • Type II error (false negative rate) refers to the case where we fail to reject the null hypothesis when it should be rejected. Its probability is known as beta. A common value for beta is 0.2.
  • Statistical power is the probability that the null hypothesis is rejected when it should be rejected. Power is the complement of the Type II error rate (power = 1 − beta), hence a common value for power is 1 − 0.2 = 0.8.

  3. Effect size (commonly measured with Cohen’s h or d) is a standardised way of measuring the magnitude of the effect in an experiment. The simplest definition of effect size is the difference between the two group means divided by the pooled standard deviation. Cohen’s d is used for comparing 2 means and Cohen’s h is used for comparing 2 proportions (a sketch of computing Cohen’s h from two conversion rates follows this list). In cases where the effect size is unknown, an accepted benchmark set by Cohen as a rule of thumb is: Small = 0.2, Medium = 0.5, Large = 0.8. Effect size is also known as practical significance: it is set based on the experiment’s unique context, i.e. how large an effect from the treatment would be considered meaningful for the company.

  4. Sample size

These 4 concepts are inter-related and each of them can be expressed as a function of the remaining 3 parameters. Hence, the most common use of power analysis is to compute the minimum sample size required given the alpha, power and effect size. It is worth noting that if we can increase the effect size, we require a smaller sample size for the same power and alpha. In order to confidently conclude the impact of a treatment, it is important to have sufficient statistical power. To increase statistical power, we can increase the effect size, the sample size and/or alpha.
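
As a sketch of how the effect size can be derived for conversion rates, statsmodels’ proportion_effectsize computes Cohen’s h from two proportions; the baseline and target rates below are hypothetical values chosen purely for illustration.

# Cohen's h for two conversion rates (hypothetical baseline and target rates)
from statsmodels.stats.proportion import proportion_effectsize

p_control = 0.10    # assumed baseline conversion rate
p_treatment = 0.12  # smallest uplift considered practically significant (assumed)

effect_size = proportion_effectsize(p_treatment, p_control)  # Cohen's h
print("Cohen's h: %.3f" % effect_size)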

The general workflow of a well-thought-out A/B experiment is to calculate the minimum sample size based on fixed values of alpha, power and effect size. In the situation where the A/B experiment has already concluded, we can also calculate the statistical power of the results given the alpha, effect size and number of samples collected, to check whether the experiment has enough power to support its conclusions (a sketch of this post-hoc calculation follows the sample-size code below).

In general, a larger sample size, effect size and alpha lead to higher power. Keeping alpha and power constant, a larger effect size requires a smaller sample size. An analogy here would be that a larger fish is easier to catch. (I really like this post and the visual animation helps to provide better understanding.)
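
To make this relationship concrete, the short sketch below solves for the per-group sample size at Cohen’s three benchmark effect sizes, holding power at 0.8 and alpha at 0.05; the numbers are purely illustrative.

# Required sample size per group shrinks as the effect size grows
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for effect in [0.2, 0.5, 0.8]:  # Cohen's benchmarks: small, medium, large
    n = analysis.solve_power(effect_size=effect, power=0.8, alpha=0.05, nobs1=None, ratio=1.0)
    print('effect = %.1f -> %.1f samples per group' % (effect, n))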

Using Python to calculate the sample size required:

# Calculating sample size
from statsmodels.stats.power import TTestIndPower
# Specify parameters for power analysis
effect = 0.8  # effect size (Cohen's d)
alpha = 0.05  # significance level (Type I error rate)
power = 0.8   # desired statistical power (1 - beta)
ratio = 1.0   # nobs2/nobs1; 1.0 means a 50/50 split between the two groups
              # (e.g. a 25/75 treatment vs control split, with treatment as group 1, would be ratio = 3.0)
# Perform power analysis: solve for the group 1 sample size (nobs1)
analysis = TTestIndPower()
result = analysis.solve_power(effect, power=power, nobs1=None, ratio=ratio, alpha=alpha)
print('Sample size per group: %.3f' % result)
Credit: https://vwo.com/blog/how-to-calculate-ab-test-sample-size/
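
As mentioned earlier, the same solve_power call can also be turned around after an experiment has concluded to compute the achieved power; the effect size and sample count below are hypothetical.

# Post-hoc power: solve for power given effect size, alpha and the samples collected
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
achieved_power = analysis.solve_power(effect_size=0.2, nobs1=200, ratio=1.0, alpha=0.05, power=None)
print('Achieved power: %.3f' % achieved_power)
# If the achieved power is well below 0.8, the experiment is likely under-powered.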

In the example above, we have discussed a simple comparison between a control and a treatment group. Sometimes, we would like to find out if there are differences in the treatment effect at a more granular level. For instance, we may want to find out if a treatment appeals more to females than to males. In this situation, we may want to consider blocking in our experiment design and analysis.

Block what you can; randomize what you cannot.

Blocking on important variables controls for those variables and hence reduces unexplained variability. As much as possible, we should block on the variables that we can (based on available information) and randomise over the remaining noise variables that cannot be blocked. However, the cost of blocking is that each block contains only a fraction of the samples, so within-block estimates are noisier. To keep the analysis adequately powered, the minimum sample size required becomes larger, so that each block on its own still has a sufficient sample size.
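
One simple way to account for this, sketched below, is to require the minimum sample size within every block rather than only overall; the two gender blocks and all parameter values are hypothetical.

# Require the minimum sample size within each block (hypothetical gender blocks)
from math import ceil
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=0.2, power=0.8, alpha=0.05, nobs1=None, ratio=1.0)

blocks = ['female', 'male']                  # hypothetical blocking variable
total = ceil(n_per_group) * 2 * len(blocks)  # 2 groups (treatment and control) in each block
print('Per group, per block: %d' % ceil(n_per_group))
print('Total across %d blocks: %d' % (len(blocks), total))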

In general, a larger sample size will always improve the precision of and confidence in the results. However, a large sample size comes at the expense of resources: more time spent on data collection, higher cost of experimentation, and a reduced ability for the business to gain insights and act quickly.

Lastly, it is important to note that statistical significance (p-values) without an understanding of the effect size is meaningless. With a sufficiently large sample size, a statistical test will almost always reveal a statistically significant difference unless the effect size is exactly 0.
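
To see this concretely, the sketch below uses an arbitrarily small effect size and shows that power still approaches 1 once the sample is large enough.

# A negligible effect (Cohen's d = 0.01) still yields high power with enough samples
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for n in [1000, 100000, 1000000]:
    p = analysis.solve_power(effect_size=0.01, nobs1=n, ratio=1.0, alpha=0.05, power=None)
    print('nobs1 = %d -> power = %.3f' % (n, p))
# Power approaches 1 as n grows, so "statistically significant" does not imply "practically important".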
