What is A/B testing? A/B testing (sometimes called split testing) is statistical hypothesis testing applied to comparing two versions of a web page. You show the two variants (call them A and B) at random to two equally sized groups of visitors at the same time; the variant with the better conversion rate wins.
Why do we need A/B testing? The goal is to:
- establish a causal relationship between actions and results
- measure the impact that comes solely from the change
Where is A/B testing used? It is widely used in the tech industry. Major use cases:
- product iteration:
  - front end: changes to UI design and user flow, new features
  - algorithm enhancement: recommendation systems, search ranking, ads display
  - operations: setting coupon values, promotion programs
- marketing optimization:
  - search engine optimization (SEO)
  - campaign performance measurement
Describe the process of an A/B test
Design
understand problem & objective
come up with hypothesis
design of experiment
key assumptions:
- the factor being tested is the only cause of any difference
- all other factors are comparable between the groups
- the assignment of a unit to A or B is random
- experiment units are independent of each other
assignment unit: what is the unit to split on for A/B? user_id? cookie_id? device_id? session_id? IP address?
how to split units into test/control? the most common split is 50/50, but sometimes not, e.g. when the test is time sensitive (a holiday marketing campaign).
A/A TEST: use the A/B test framework to test two identical versions against each other. There should be no difference between the two groups. The goal: make sure the framework being used is correct.
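A minimal sketch of this check, using an assumed baseline rate and group size: simulate many A/A splits and verify that roughly alpha (5%) of them come out "significant" purely by chance.

```r
# A/A simulation sketch (baseline rate and sample sizes are assumptions,
# chosen only for illustration).
set.seed(42)
p_baseline  <- 0.10    # assumed conversion rate, identical in both groups
n_per_group <- 5000    # assumed sample size per group

aa_pvalue <- function() {
  a <- rbinom(1, n_per_group, p_baseline)   # conversions in group A
  b <- rbinom(1, n_per_group, p_baseline)   # conversions in the second, identical group
  prop.test(c(a, b), c(n_per_group, n_per_group))$p.value
}

p_values <- replicate(2000, aa_pvalue())
mean(p_values < 0.05)   # should be close to 0.05 if the framework is correct
```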
data exploration & parameter estimation(sample variance)
metrics
exposure & duration
should you expose all users to the test version?
No. It may cause a bad user experience if the test version turns out to be bad.
Start with a small proportion, like 5%, and gradually roll out to more users.
How long are you going to run the experiment?
In practice, we want to minimize the exposure and duration of an A/B test, because we want to:
- optimize business performance as much as possible
- limit potential negative user experience
- limit inconsistent user experience
- avoid the expense of maintaining multiple versions
how to decide exposure %?
- size of the eligible population
- potential impact
- ease of testing & debugging
how to decide duration?
- minimum sample size
- daily volume & exposure
- seasonality (cover at least one seasonal period); a rough calculation sketch is below
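A back-of-the-envelope duration estimate, with assumed traffic and exposure numbers used purely for illustration:

```r
# Rough duration sketch:
# days needed ~= total required sample size / (daily eligible traffic * exposure)
n_per_group   <- 31000      # e.g. output of a sample size calculation
daily_traffic <- 100000     # assumed eligible visitors per day
exposure      <- 0.10       # assumed 10% of traffic enters the experiment

days <- ceiling(2 * n_per_group / (daily_traffic * exposure))
days   # then extend to cover at least one full seasonal period (e.g. a whole week)
```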
sample size calculation
data assumption:
what distribution assumption are you making about your data?
normal distribution, central limit theorem
what is the null-hypothesis of your test?
diff = Ua - Ub = 0
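Under the normal/CLT assumption above, the usual two-sample statistic for this null hypothesis (written with standard notation: sample means, sample variances, group sizes) is:

$$
z = \frac{\bar{X}_A - \bar{X}_B}{\sqrt{\dfrac{s_A^2}{n_A} + \dfrac{s_B^2}{n_B}}}
$$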
why calculate sample size? can we just let the experiment run until the result is statistically significant?
No. Peeking and stopping as soon as the result looks significant greatly inflates the false positive rate (Type I error).
When the null hypothesis is true, each individual look still has a 0.05 chance of rejecting H0, so repeated checks compound into a much higher overall false positive rate.
what if it takes too long to get a desired sample size?
Increase exposure
reduce variance to reduce required sample size
- blocking: run the experiment within sub-groups
- propensity score matching. Procedure:
  1. run a model to predict Y with appropriate covariates and obtain the propensity score (the predicted y_hat)
  2. check that the propensity score is balanced across the test and control groups
  3. match each test unit to one or more control units on the propensity score (nearest-neighbor matching, or matching within a given width)
  4. run the experiment on the matched samples
  5. conduct the post-experiment analysis on the matched samples

what if your data is highly skewed or the statistic is hard to approximate with the CLT? transformation / winsorization / capping / bootstrap

The bootstrap is a resampling method; it can be used to estimate the sampling distribution of any statistic, and is commonly used for estimating CIs, p-values, and statistics with complex or no closed-form estimators. Procedure (see the sketch below):
1. randomly generate a sample of size n with replacement from the original data
2. repeat step 1 many times
3. estimate the statistic with the sampling statistics of the generated samples
pros: no assumptions on the distribution of the original data; simple to implement
cons: computationally expensive
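A minimal bootstrap sketch for a 95% CI of the mean, assuming a simulated skewed per-user metric (the data and numbers are illustrative only):

```r
set.seed(7)
x <- rexp(1000, rate = 1 / 20)   # assumed skewed per-user metric, e.g. revenue

# resample with replacement many times and recompute the statistic each time
boot_means <- replicate(5000, mean(sample(x, replace = TRUE)))

quantile(boot_means, c(0.025, 0.975))   # percentile bootstrap confidence interval
```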
Implement
- code change & testing
- run experiment & monitor
measurement
result measurement
- data exploration
imbalanced assignment:
check the % of test/control units. Does the split match the design of experiment (DOE)?
mixed assignment (units that ended up in both groups):
if the # of mixed samples is small, it is OK to remove them; if it is large, you need to figure out why
what is the problem with throwing away mixed samples?
sanity check
are test/control similar in factors other than the treatment?
hypothesis test
conduct test/ multiple testing:
- most commonly a t-test is used (a worked sketch follows this list)
- when the variance is known or the sample size is large, a z-test can be used
- when the sample size is small, non-parametric methods can be used
- for complicated statistics, the bootstrap can be used to calculate the p-value
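A minimal sketch of the common cases, using simulated data and made-up counts purely for illustration:

```r
# Welch's two-sample t-test on a continuous per-user metric, control (a) vs test (b).
set.seed(1)
a <- rnorm(5000, mean = 10.0, sd = 4)   # assumed control metric
b <- rnorm(5000, mean = 10.2, sd = 4)   # assumed test metric
t.test(b, a)                            # p-value and CI for the difference in means

# For a conversion-rate metric, a two-sample proportion test plays the z-test role:
prop.test(c(620, 550), c(5000, 5000))   # assumed conversions out of 5000 users per group
```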
result analysis
pre-bias adjustment / analysis unit different from the assignment unit
cohort analysis
data analysis
decision making
if all metrics move positively:
if they meet expectations, yes, ready to launch
be cautious if the result is too good; it may need investigation (outliers)
if some metrics move negatively:
are they as expected? are these metrics important?
deep dive to find causes
if the results are neutral:
slice/dice on sub-groups
Multiple testing
what if you have multiple test groups?
the false positive rate is much higher when doing multiple testing; you need to control the family-wise false positive rate (a correction sketch follows the figures below)
<img src="image-20200225004208305.png" alt="image-20200225004208305" style="zoom:67%;" />
<img src="image-20200225004248014.png" alt="image-20200225004248014" style="zoom:67%;" />
![image-20200225004324771](image-20200225004324771.png)
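One standard way to control the family-wise error rate is to adjust the p-values before declaring a winner; a small sketch with made-up p-values:

```r
p_raw <- c(0.012, 0.030, 0.041, 0.200)   # assumed raw p-values, one per test variant
p.adjust(p_raw, method = "bonferroni")   # conservative FWER control
p.adjust(p_raw, method = "holm")         # FWER control, uniformly more powerful than Bonferroni
```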
Pre-bias adjustment
when the A/B groups already differ before the experiment (a simple adjustment sketch follows the figure below).
<img src="image-20200225004450709.png" alt="image-20200225004450709" style="zoom:67%;" />
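As one simple and commonly used style of pre-bias adjustment (an assumption here, not necessarily the method shown in the figure above), the observed difference can be corrected against the pre-experiment period, difference-in-differences style; a sketch on simulated data:

```r
# Difference-in-differences style pre-bias adjustment on simulated data
# (all numbers are illustrative assumptions).
set.seed(3)
pre_a  <- rnorm(1000, mean = 10.0); post_a <- rnorm(1000, mean = 10.1)
pre_b  <- rnorm(1000, mean = 10.3); post_b <- rnorm(1000, mean = 10.6)  # B already higher pre-experiment

naive_effect    <- mean(post_b) - mean(post_a)
adjusted_effect <- (mean(post_b) - mean(pre_b)) - (mean(post_a) - mean(pre_a))
c(naive = naive_effect, adjusted = adjusted_effect)
```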
A/B test can be summarized into the 5 steps below:
(1). choose and characterize metrics to evaluate your experiment, i.e. what do you care about, how do you want to measure the effect.
Brainstorm potential metrics. Use the customer conversion funnel to summarize the process. Invariant metrics do not relate to the change; evaluation metrics are related to the change.
(2). Choose the significance level (alpha), statistical power (1-beta), and the practical significance level at which you would actually want to launch the change if the test is statistically significant
(3). Calculate required sample size
(4). Take sample for control/ treatment groups and run the test
(5). Analyze the results and draw valid conclusions
Sanity check: invariant metrics do not change between experiment and control
Analyze evaluation metrics
Compute the pooled mean/conversion probability, then the pooled standard error, then the margin of error (z * SE). Then take the difference between experiment and control and compute the upper and lower bounds of the difference (p_diff +/- margin of error). Compare the interval with 0 (statistically significant) and with the required difference (practically significant); a worked sketch is below.
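A worked sketch of that calculation, with made-up conversion counts:

```r
x_ctrl <- 550; n_ctrl <- 5000    # assumed control conversions / users
x_exp  <- 620; n_exp  <- 5000    # assumed experiment conversions / users

p_pool  <- (x_ctrl + x_exp) / (n_ctrl + n_exp)                       # pooled conversion probability
se_pool <- sqrt(p_pool * (1 - p_pool) * (1 / n_ctrl + 1 / n_exp))    # pooled standard error

d_hat <- x_exp / n_exp - x_ctrl / n_ctrl   # observed difference in conversion rate
moe   <- qnorm(0.975) * se_pool            # margin of error at 95% confidence

c(lower = d_hat - moe, upper = d_hat + moe)
# statistically significant if the interval excludes 0; practically significant
# if the whole interval clears the required minimum difference
```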
Sign test: confirm the result with a sign test, i.e. check whether the number of successes (e.g. days on which the experiment beats the control) out of the total number of trials is statistically significant.
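A minimal sign-test sketch (the counts are assumptions for illustration):

```r
# The experiment beat the control on 11 out of 14 days; under H0 each day is a
# fair coin flip (p = 0.5).
binom.test(11, 14, p = 0.5, alternative = "two.sided")
```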
Situations we can't analyze through an A/B test: an A/B test can't evaluate a brand-new experience, because (1) it is unclear what the baseline of comparison is, and (2) it takes time for users to adapt to the new experience. Long-term effects are also hard to test with an A/B test.
How many variants should we have in an A/B test? The goal of the A/B test should be clear; a number of factors from different designs can muddy the test results. We suggest running two versions against each other, and then running a second test afterwards to compare the winners.
What do I do if I do not trust the results? If you really don’t trust the results and have ruled out any errors or challenges to the test’s validity, the best thing to do is to run the same test again. Treat it as an entirely separate test and see if you can replicate the results. If you can replicate again and again, you probably have a solid set of results.
What if I do not have a control? A control is the existing version of a landing page or webpage that you are testing against. Sometimes you may want to test two versions of a page that never existed before… and that's okay. Just choose one of the variations and call that one the control. Try to pick the one that's the most similar to how you currently design pages and use the other as the treatment.
When an A/B test is not useful, what can you do?
- Analyze user activity logs
- Conduct retrospective analysis
- Conduct user experience research
- Run focus groups and surveys
- Use human evaluation
Metrics The metrics we choose for sanity check are called invariant metrics. They are not supposed to be affected by the experiment. They should not change across control and experiment groups. Evaluation metrics are used to measure which variation is better. For example daily active users (DAU) to measure user engagement; click through rate (CTR) to measure a button design on a webpage.
There are four categories of metrics:
Sums and counts
Distribution (mean, median, percentiles)
Probability and rates (click through probability and click through rate)
Ratios: any two numbers divided by each other
Sensitivity and robustness: You want to choose a metric that has high sensitivity, so the metric can pick up the changes you care about. You also want the metric to be robust against changes you don't care about. There is a trade-off between sensitivity and robustness; you need to look into the data to find out which metric to use.
How to measure the sensitivity and robustness?
- Run experiments
- Use an A/A test to see if the metric picks up a difference (if yes, then the metric is not robust)
- Retrospective analysis
Significance level, statistical power and practical significance level: usually the significance level is 0.05 and the power is 0.8. The practical significance level varies depending on the individual test, and is higher than the statistical significance threshold. You may not want to launch a change even if the test is statistically significant, because you need to consider
- The business impact of the change
- Whether it is worth launching considering the engineering cost, customer support, sales issues and opportunity cost
How to calculate the sample size? The sample size required for a valid hypothesis test depends on the following five parameters:
- The conversion rate value of control variation (baseline value)
- The minimum difference between control and experiment that is to be detected. The smaller the difference to be detected, the larger the required sample size.
- Chosen confidence/significance level
- Chosen statistical power
- Type of test: one- or two-tailed. The sample size for a two-tailed test is relatively larger.
There are different kinds of online testing tools: G*Power, Evan Miller's calculator, Google Analytics, etc. If using R, first calculate the z value based on alpha using qnorm(). Then, over a grid of sample size values, calculate beta (the probability of failing to reject the null when the alternative is true) using pnorm(); the smallest sample size with beta <= the required beta is the required sample size for a valid test. This makes use of the fact that as the sample size grows, the standard error of the estimate shrinks, so the power of the test increases. Formula:
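A standard per-group sample size formula for comparing two proportions (one common form; other variants exist) is:

$$
n_{\text{per group}} \approx \frac{\left(z_{1-\alpha/2} + z_{1-\beta}\right)^{2}\left(p_1(1-p_1) + p_2(1-p_2)\right)}{(p_1 - p_2)^{2}}
$$

A hedged sketch in R, with an assumed baseline rate and minimum detectable effect:

```r
p1    <- 0.10    # assumed baseline conversion rate
p2    <- 0.12    # assumed smallest rate worth detecting
alpha <- 0.05
power <- 0.80

# built-in calculator
power.prop.test(p1 = p1, p2 = p2, sig.level = alpha, power = power)

# manual version of the formula above (it uses a slightly different variance
# approximation than power.prop.test, so the two results can differ a little)
z_a <- qnorm(1 - alpha / 2)
z_b <- qnorm(power)
ceiling((z_a + z_b)^2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p1 - p2)^2)
```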
How to split sample? The sample size in control and experiment should be statistically equal.
Correlational VS causal
Advantages of A/B testing:
- A scientific way to establish causality, i.e. to show that the changes in metrics are caused by the changes introduced in the treatment.
- Sensitivity: you can detect tiny changes to metrics.
- It can detect unexpected consequences.