AB_test_prep

  • What is A/B testing? A/B testing (sometimes called split testing) is statistical hypothesis testing applied to comparing versions of a web page or product. You show the two variants (call them A and B) at random to two equally sized groups of visitors at the same time; the variant that yields the better conversion rate wins.

  • Why do we need A/B tests? The goals are to:

    • establish a causal relationship between an action and a result
    • measure the impact of that change alone, isolated from everything else
  • Where are A/B tests used? They are widely used in the tech industry. Major use cases:

    • product iteration
      • front end: UI design changes, user flows, new features
      • algorithm enhancement: recommendation systems, search ranking, ads display
    • operations: setting coupon values, promotion programs
    • marketing optimization
      • search engine optimization (SEO)
      • campaign performance measurement
  • Describe the process of an A/B test

    • Design

      • understand problem & objective

      • come up with hypothesis

      • design of experiment

        • key assumptions:

          1. the factor being tested is the only cause of any difference
          2. all other factors are comparable across the groups
          3. the assignment of a unit to A or B is random
          4. each experimental unit is independent
        • assignment unit: what is the unit used to split A/B? user_id? cookie_id? device_id? session_id? IP address? How do you split users into test/control? A 50/50 split is most common, but not always; time-sensitive tests (e.g. a holiday marketing campaign) may call for a different split.

        • A/A test: use the A/B testing framework to test two identical versions against each other. There should be no difference between the two groups. The goal is to make sure the framework being used is correct; it also doubles as data exploration & parameter estimation (e.g. the sample variance). A minimal simulation is sketched below.
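          A minimal sketch of such a check, assuming simulated traffic (the 10% conversion rate, group size, and number of runs are made-up illustration values): run many simulated A/A splits through the same test and confirm that roughly alpha (~5%) of them come out "significant".

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
alpha = 0.05
n_runs, n_per_group = 1000, 2000

false_positives = 0
for _ in range(n_runs):
    # both "variants" are drawn from the same distribution -> an A/A test
    a = rng.binomial(1, 0.10, n_per_group)  # hypothetical 10% conversion rate
    b = rng.binomial(1, 0.10, n_per_group)
    _, p_value = stats.ttest_ind(a, b)
    false_positives += p_value < alpha

# a correctly wired framework should reject roughly alpha of the time
print(f"observed false positive rate: {false_positives / n_runs:.3f}")
```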
        • metrics

        • exposure & duration

          Should you show the A/B versions to all users?

          No. It may cause a bad user experience if the test version is bad.

          Start with a small proportion, like 5%, and gradually roll out to more users.

          How long are you going to run the experiment?

          In practice, we want to minimize the exposure and duration of an A/B test, because:

          • we want to optimize business performance as much as possible

          • there is potential for a negative user experience

            • inconsistent user experience

            • it is expensive to maintain multiple versions

              How to decide the exposure %?

              • size of the eligible population

              • potential impact

              • ease of testing & debugging

                How to decide the duration?

                • minimum required sample size
                • daily volume & exposure
                • seasonality (cover at least one full seasonal period)
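                A toy calculation combining these three inputs; the sample size, daily volume, exposure, and seasonal period below are all assumed numbers for illustration.

```python
import math

min_sample_size = 100_000    # total units required by the power analysis (assumed)
daily_volume = 50_000        # eligible visitors per day (assumed)
exposure = 0.10              # 10% of traffic enters the experiment (assumed)
seasonal_period_days = 7     # cover at least one full week

days_for_sample = math.ceil(min_sample_size / (daily_volume * exposure))
duration_days = max(days_for_sample, seasonal_period_days)
print(f"run the test for at least {duration_days} days")  # 20 in this example
```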
        • sample size calculation

          data assumptions:

          What distribution assumption are you making about your data?

          Approximately normal, justified by the central limit theorem.

          What is the null hypothesis of your test?

          H0: diff = μ_A − μ_B = 0


          Why calculate the sample size up front? Can't we just let the experiment run until the result is statistically significant?

          No. That greatly inflates the false positive rate (Type I error): even when the null hypothesis is true, each look at the data has a 0.05 chance of rejecting H0, so repeatedly checking until significance will eventually "find" an effect that isn't there. A quick power-based calculation is sketched below.

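          A minimal sketch of the up-front calculation for a conversion-rate test, using statsmodels; the baseline rate, minimum detectable effect, alpha, and power below are assumptions for illustration.

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

p_baseline = 0.10    # control conversion rate (assumed)
p_target = 0.11      # smallest treatment rate worth detecting (assumed)
alpha, power = 0.05, 0.80

# Cohen's h effect size for two proportions
effect_size = proportion_effectsize(p_target, p_baseline)

# n per group for a two-sided test of H0: diff = mu_A - mu_B = 0
n_per_group = NormalIndPower().solve_power(
    effect_size=effect_size, alpha=alpha, power=power, alternative="two-sided"
)
print(f"need about {n_per_group:,.0f} users per group")
```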
          What if it takes too long to reach the desired sample size?

          Increase the exposure.

          Reduce the variance to reduce the required sample size, e.g.:

            - blocking - run experiment within sub-groups
          
            -  propensity score matching
          
              procedure:

              1. fit a model (e.g. logistic regression) that predicts treatment assignment from appropriate covariates

                 obtain the propensity score: the predicted probability of being in the test group

              2. check that the propensity score is balanced across the test and control groups

              3. match each test unit to one or more control units on propensity score

                 nearest-neighbor matching / matching within a caliper of a certain width

              4. run the experiment on the matched samples

              5. conduct the post-experiment analysis on the matched samples (a matching sketch follows below)
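              A minimal matching sketch, assuming scikit-learn is available and using made-up covariates; the balance check in step 2 is omitted here for brevity.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

def propensity_match(X, treated):
    """Match each treated unit to its nearest control on propensity score."""
    # 1. model treatment assignment from covariates -> propensity score
    score = LogisticRegression(max_iter=1000).fit(X, treated).predict_proba(X)[:, 1]

    treated_idx = np.where(treated)[0]
    control_idx = np.where(~treated)[0]

    # 3. nearest-neighbor matching on the 1-D propensity score
    nn = NearestNeighbors(n_neighbors=1).fit(score[control_idx].reshape(-1, 1))
    _, matches = nn.kneighbors(score[treated_idx].reshape(-1, 1))
    matched_controls = control_idx[matches.ravel()]
    return treated_idx, matched_controls

# toy data: 2 covariates, treatment probability driven by the first one (made up)
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
treated = rng.random(500) < 1 / (1 + np.exp(-X[:, 0]))
t_idx, c_idx = propensity_match(X, treated)
print(len(t_idx), "treated units matched to", len(np.unique(c_idx)), "distinct controls")
```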
          
                    What if your data is highly skewed, or the statistic is hard to approximate with the CLT?

                       Use a transformation / winsorization / capping / the bootstrap.

                    The bootstrap is a resampling method. It can be used to estimate the sampling distribution of any statistic, and is commonly used for estimating confidence intervals, p-values, and statistics with complex or no closed-form estimators.
          
                    Procedure:

                    1. randomly draw a sample of size n with replacement from the original data

                    2. repeat step 1 many times

                    3. estimate the statistic from its distribution over the generated samples

                       pros:

                       no assumptions about the distribution of the original data

                       simple to implement

                       cons:

                       computationally expensive
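                    A minimal percentile-bootstrap sketch; the skewed revenue data and the statistic of interest are made up for illustration.

```python
import numpy as np

def bootstrap_ci(data, stat_fn, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for an arbitrary statistic."""
    rng = np.random.default_rng(seed)
    n = len(data)
    # steps 1-2: resample with replacement many times, recording the statistic
    boot_stats = np.array(
        [stat_fn(rng.choice(data, size=n, replace=True)) for _ in range(n_boot)]
    )
    # step 3: use the empirical distribution of the statistic
    lo, hi = np.percentile(boot_stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi

# example: a skewed metric (log-normal revenue per user, numbers made up)
rng = np.random.default_rng(1)
revenue = rng.lognormal(mean=3.0, sigma=1.0, size=2000)
print("95% CI for the 90th percentile:",
      bootstrap_ci(revenue, lambda x: np.percentile(x, 90)))
```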
    • Implement

      • code change & testing
      • run experiment & monitor
    • measurement

      • result measurement

        • data exploration

        imbalanced assignment:

        check the % of test/control units. Does the split match the design of experiment (DOE)? A quick sample-ratio check is sketched below.

        mixed assignment (units exposed to both variants):

        if the # of mixed samples is small, it is OK to remove them; if it is large, you need to figure out why

        What is the problem with throwing away mixed samples? It can introduce selection bias, because the units that end up mixed (e.g. heavy, multi-device users) may not be a random subset.

        sanity check:

        are test/control similar in factors other than the treatment?
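        One common way to check for an imbalanced assignment (a sample ratio mismatch) is a chi-square goodness-of-fit test against the planned split; the counts below are hypothetical.

```python
from scipy import stats

# observed unit counts and the split planned in the DOE (both hypothetical)
n_test, n_control = 50_421, 49_562
planned_split = (0.5, 0.5)

total = n_test + n_control
expected = [total * planned_split[0], total * planned_split[1]]
chi2, p_value = stats.chisquare([n_test, n_control], f_exp=expected)

# a very small p-value means the observed split is unlikely under the design:
# investigate the assignment pipeline before trusting any metric comparison
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}")
```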

        • hypothesis test

          conduct the test / handle multiple testing:

          most often, use a t-test

          when the variance is known or the sample size is large, a z-test can be used

          when the sample size is small, non-parametric methods can be used

          for complicated statistics, the bootstrap can be used to calculate the p-value
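          A sketch of these choices side by side, assuming scipy and statsmodels are available; the conversion data is simulated.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.proportion import proportions_ztest

rng = np.random.default_rng(7)
a = rng.binomial(1, 0.100, 20_000)   # control conversions (simulated)
b = rng.binomial(1, 0.104, 20_000)   # treatment conversions (simulated)

# t-test on the per-user metric (the usual default)
print("t-test:", stats.ttest_ind(b, a, equal_var=False))

# z-test on proportions, fine here because the samples are large
counts, nobs = np.array([b.sum(), a.sum()]), np.array([len(b), len(a)])
print("z-test:", proportions_ztest(counts, nobs))

# non-parametric alternative for small or non-normal samples
print("Mann-Whitney U:", stats.mannwhitneyu(b, a, alternative="two-sided"))
```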

          • result analysis

            pre-bias adjustment / analysis unit differing from the assignment unit

            cohort analysis

      • data analysis

      • decision making

        if all metrics move positively:

        and meet expectations, yes, it is ready to launch

        be cautious if the result looks too good; it may need investigation (e.g. outliers)

        if some metrics move negatively:

        are the moves expected? are those metrics important?

        deep dive to find the causes

        if the results are neutral:

        slice/dice on sub-groups

  Multiple testing

  What if you have multiple test groups?

  The false positive rate is much higher when doing multiple testing; you need to control the family-wise false positive rate (or the false discovery rate).

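  A minimal sketch of two standard corrections, Bonferroni (family-wise error rate) and Benjamini-Hochberg (false discovery rate), using statsmodels; the p-values below are placeholders.

```python
from statsmodels.stats.multitest import multipletests

# p-values from several treatment arms tested against the same control (placeholders)
p_values = [0.012, 0.048, 0.260, 0.003, 0.071]

# Bonferroni controls the family-wise error rate (conservative)
reject_bonf, p_bonf, _, _ = multipletests(p_values, alpha=0.05, method="bonferroni")

# Benjamini-Hochberg controls the false discovery rate (less conservative)
reject_bh, p_bh, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")

print("Bonferroni reject:", reject_bonf)
print("BH reject:        ", reject_bh)
```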

  Pre-bias adjustment

  when the A/B groups already differ before the experiment starts.

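  One common way to correct for such a pre-existing gap is a difference-in-differences style adjustment: compare each group's change from its own pre-period instead of the raw post-period values (CUPED-style regression adjustment is another option). The sketch below uses simulated data and is only illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n = 5_000

# simulated per-user metric; the test group happens to start 0.5 higher (pre-bias)
pre_control = rng.normal(10.0, 2.0, n)
pre_test = rng.normal(10.5, 2.0, n)
post_control = pre_control + rng.normal(0.0, 1.0, n)   # no treatment effect
post_test = pre_test + rng.normal(0.2, 1.0, n)         # +0.2 true effect

# naive comparison is confounded by the pre-existing 0.5 gap
print("naive:", stats.ttest_ind(post_test, post_control))

# difference-in-differences: compare each unit's change from its own pre-period
print("adjusted:", stats.ttest_ind(post_test - pre_test, post_control - pre_control))
```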



  A/B test can be summarized into the 5 steps below:
  (1). choose and characterize metrics to evaluate your experiment, i.e. what do you care about, how do you want to measure the effect.
  Brainstorm potential metrics. Use the customer conversion funnel to summarize the process. Invariant metrics are not related to the change; evaluation metrics are related to the change.
  (2). Choose the significance level (alpha), the statistical power (1 − beta), and the practical significance level at which you would actually want to launch the change if the test is statistically significant
  (3). Calculate required sample size
  (4). Take sample for control/ treatment groups and run the test
  (5). Analyze the results and draw valid conclusions
  Sanity check: invariant metrics should not differ between the experiment and control groups
  Analyze evaluation metrics
  Compute the pooled mean/conversion probability, then the pooled standard error, then the margin of error (z * SE). Then take the difference between experiment and control and its upper and lower bounds (p_diff +/- margin of error). Compare against 0 (statistical significance) and against the required difference (practical significance). A worked sketch follows below.
  Sign test: confirm the result with a sign test, checking whether the number of successes (e.g. days on which the experiment beat control) out of the total number of trials is statistically significant.
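  A worked sketch of this calculation; all counts below are made up, and scipy.stats.binomtest requires SciPy >= 1.7.

```python
import numpy as np
from scipy import stats

# conversions / visitors (made-up numbers)
x_ctrl, n_ctrl = 974, 10_072
x_exp, n_exp = 1_105, 9_886

p_ctrl, p_exp = x_ctrl / n_ctrl, x_exp / n_exp
p_pool = (x_ctrl + x_exp) / (n_ctrl + n_exp)
se_pool = np.sqrt(p_pool * (1 - p_pool) * (1 / n_ctrl + 1 / n_exp))

z = stats.norm.ppf(0.975)                 # 95% confidence
diff = p_exp - p_ctrl
margin = z * se_pool
print(f"diff = {diff:.4f}, CI = [{diff - margin:.4f}, {diff + margin:.4f}]")
# statistically significant if the CI excludes 0; practically significant
# if it also clears the minimum difference you care about

# sign test: on how many of, say, 14 days did the experiment beat control?
days_experiment_won, total_days = 11, 14
print("sign test p-value:", stats.binomtest(days_experiment_won, total_days, 0.5).pvalue)
```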
  • Situations we can't analyze through an A/B test. A/B tests can't evaluate a completely new experience, because (1) there is no clear baseline to compare against and (2) it takes time for users to adapt to the new experience. Long-term effects are also hard to test with an A/B test.

  • How many variants should we have in an A/B test? The goal of an A/B test should be clear. A number of factors from each different design can muddy the results. We suggest running two versions against each other, and then running a second test afterwards to compare the winners.

  • What do I do if I do not trust the results? If you really don’t trust the results and have ruled out any errors or challenges to the test’s validity, the best thing to do is to run the same test again. Treat it as an entirely separate test and see if you can replicate the results. If you can replicate again and again, you probably have a solid set of results.

  • What if I do not have a control? A control is the existing version of a landing page or webpage that you are testing against. Sometimes you may want to test two versions of a page that never existed before, and that's okay. Just choose one of the variations and call that one the control. Try to pick the one that's most similar to how you currently design pages and use the other as the treatment.

  • When an A/B test is not useful, what can you do? Analyze user activity logs; conduct retrospective analysis; conduct user experience research; run focus groups and surveys; use human evaluation.

  • Metrics The metrics we choose for sanity check are called invariant metrics. They are not supposed to be affected by the experiment. They should not change across control and experiment groups. Evaluation metrics are used to measure which variation is better. For example daily active users (DAU) to measure user engagement; click through rate (CTR) to measure a button design on a webpage.

There are four categories of metrics:

  • Sums and counts

  • Distribution (mean, median, percentiles)

  • Probability and rates (click through probability and click through rate)

  • Ratios: any two numbers divide by each other

    Sensitivity and robustness: you want to choose a metric that has high sensitivity, so the metric can pick up the change you care about. You also want the metric to be robust against changes you don't care about. There is a balance between sensitivity and robustness; you need to look into the data to find out which metric to use.

How to measure the sensitivity and robustness?

  • Run experiments
  • Use A/A test to see if metrics pick up difference (if yes, then the metric is not robust)
  • Retrospective analysis
  • Significance level, statistical power and practical significance level. Usually the significance level is 0.05 and the power is 0.8. The practical significance level varies depending on the individual test and is higher than the statistical significance threshold. You may not want to launch a change even if the test is statistically significant, because you need to consider

    • The business impact of the change
    • Whether it is worth to launch considering the engineering cost, customer support, sales issue and opportunity cost
  • How to calculate the sample size? The sample size required for a valid hypothesis test depends on the following 5 parameters:

    • The conversion rate of the control variation (the baseline value)
    • The minimum difference between control and experiment to be detected. The smaller the difference to be detected, the bigger the required sample size.
    • The chosen confidence/significance level
    • The chosen statistical power
    • The type of test: one- or two-tailed. The sample size for a two-tailed test is somewhat larger.

    There are various online testing tools: G*Power, Evan Miller's calculator, Google Analytics, etc. If using R, first calculate the z value from alpha using qnorm(). Then, over a grid of sample sizes, calculate beta (the probability of failing to reject the null when the alternative is true) using pnorm(); the smallest sample size whose beta <= the required beta is the required sample size. This works because as the sample size grows the standard error shrinks, so the power of the test increases. Formula (per group, two-sided, equal variances): n ≈ 2 (z_{1−α/2} + z_{1−β})² σ² / δ², where δ is the minimum detectable difference and σ² ≈ p(1−p) for a conversion rate. A Python version of this grid search is sketched below.
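    A Python analogue of the described procedure, with norm.ppf/norm.cdf playing the role of qnorm()/pnorm(); the baseline rate, detectable difference, alpha, and power are assumed values.

```python
import numpy as np
from scipy.stats import norm

p1, p2 = 0.10, 0.11          # baseline and target conversion rates (assumed)
alpha, target_power = 0.05, 0.80

z_alpha = norm.ppf(1 - alpha / 2)        # qnorm() in the R description
delta = abs(p2 - p1)

for n in range(100, 200_000, 100):       # grid of per-group sample sizes
    se = np.sqrt(p1 * (1 - p1) / n + p2 * (1 - p2) / n)
    # power = P(reject H0 | true difference = delta), pnorm() in the R description
    power = 1 - norm.cdf(z_alpha - delta / se)
    if power >= target_power:
        print(f"~{n} users per group gives power {power:.3f}")
        break
```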

  • How to split sample? The sample size in control and experiment should be statistically equal.

  • Correlational VS causal

  • Advantages of A/B tests: a scientific way to establish causality, i.e. that the changes in metrics are caused by the change introduced in the treatment; sensitivity, since you can detect tiny changes in metrics; and the ability to detect unexpected consequences.

