AB_test_prep

  • What is A/B testing? A/B testing (sometimes called split testing) is statistical hypothesis testing applied to comparing versions of a web page or product. You show the two variants (call them A and B) at random to two equally sized groups of visitors at the same time; the variant that yields the better conversion rate wins.

  • Why do we need A/B tests? The goals are to:

    • establish a causal relationship between an action and a result
    • measure the impact of that change alone, isolated from everything else
  • Where are A/B tests used? They are widely used in the tech industry. Major use cases:

    • product iteration
      • front end: UI design changes, user flows, new features
      • algorithm enhancement: recommendation systems, search ranking, ads display
    • operations: setting coupon values, promotion programs
    • marketing optimization
      • search engine optimization (SEO)
      • campaign performance measurement
  • Describe the process of an A/B test

    • Design

      • understand problem & objective

      • come up with hypothesis

      • design of experiment

        • key assumptions:

          1. the factor being tested is the only cause of any difference
          2. all other factors are comparable across the groups
          3. the assignment of a unit to A or B is random
          4. each experimental unit is independent
        • assignment unit: what is the unit used to split A/B? user_id? cookie_id? device_id? session_id? IP address? How do you split users into test/control? A 50/50 split is most common, but not always; time-sensitive tests (e.g. a holiday marketing campaign) may call for a different split.

        • A/A test: use the A/B testing framework to test two identical versions against each other. There should be no difference between the two groups. The goal is to make sure the framework being used is correct; it also doubles as data exploration & parameter estimation (e.g. the sample variance). A minimal simulation is sketched below.
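          A minimal sketch of such a check, assuming simulated traffic (the 10% conversion rate, group size, and number of runs are made-up illustration values): run many simulated A/A splits through the same test and confirm that roughly alpha (~5%) of them come out "significant".

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
alpha = 0.05
n_runs, n_per_group = 1000, 2000

false_positives = 0
for _ in range(n_runs):
    # both "variants" are drawn from the same distribution -> an A/A test
    a = rng.binomial(1, 0.10, n_per_group)  # hypothetical 10% conversion rate
    b = rng.binomial(1, 0.10, n_per_group)
    _, p_value = stats.ttest_ind(a, b)
    false_positives += p_value < alpha

# a correctly wired framework should reject roughly alpha of the time
print(f"observed false positive rate: {false_positives / n_runs:.3f}")
```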
        • metrics

        • exposure & duration

          Should you show the A/B versions to all users?

          No. It may cause a bad user experience if the test version is bad.

          Start with a small proportion, like 5%, and gradually roll out to more users.

          How long are you going to run the experiment?

          In practice, we want to minimize the exposure and duration of an A/B test, because:

          • we want to optimize business performance as much as possible

          • there is potential for a negative user experience

            • inconsistent user experience

            • it is expensive to maintain multiple versions

              How to decide the exposure %?

              • size of the eligible population

              • potential impact

              • ease of testing & debugging

                How to decide the duration?

                • minimum required sample size
                • daily volume & exposure
                • seasonality (cover at least one full seasonal period)
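                A toy calculation combining these three inputs; the sample size, daily volume, exposure, and seasonal period below are all assumed numbers for illustration.

```python
import math

min_sample_size = 100_000    # total units required by the power analysis (assumed)
daily_volume = 50_000        # eligible visitors per day (assumed)
exposure = 0.10              # 10% of traffic enters the experiment (assumed)
seasonal_period_days = 7     # cover at least one full week

days_for_sample = math.ceil(min_sample_size / (daily_volume * exposure))
duration_days = max(days_for_sample, seasonal_period_days)
print(f"run the test for at least {duration_days} days")  # 20 in this example
```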
        • sample size calculation

          data assumptions:

          What distribution assumption are you making about your data?

          Approximately normal, justified by the central limit theorem.

          What is the null hypothesis of your test?

          H0: diff = μ_A − μ_B = 0


          Why calculate the sample size up front? Can't we just let the experiment run until the result is statistically significant?

          No. That greatly inflates the false positive rate (Type I error): even when the null hypothesis is true, each look at the data has a 0.05 chance of rejecting H0, so repeatedly checking until significance will eventually "find" an effect that isn't there. A quick power-based calculation is sketched below.

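          A minimal sketch of the up-front calculation for a conversion-rate test, using statsmodels; the baseline rate, minimum detectable effect, alpha, and power below are assumptions for illustration.

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

p_baseline = 0.10    # control conversion rate (assumed)
p_target = 0.11      # smallest treatment rate worth detecting (assumed)
alpha, power = 0.05, 0.80

# Cohen's h effect size for two proportions
effect_size = proportion_effectsize(p_target, p_baseline)

# n per group for a two-sided test of H0: diff = mu_A - mu_B = 0
n_per_group = NormalIndPower().solve_power(
    effect_size=effect_size, alpha=alpha, power=power, alternative="two-sided"
)
print(f"need about {n_per_group:,.0f} users per group")
```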
          What if it takes too long to reach the desired sample size?

          Increase the exposure.

          Reduce the variance to reduce the required sample size, e.g.:

            - blocking - run experiment within sub-groups
          
            -  propensity score matching
          
              procedure:

              1. fit a model (e.g. logistic regression) that predicts treatment assignment from appropriate covariates

                 obtain the propensity score: the predicted probability of being in the test group

              2. check that the propensity score is balanced across the test and control groups

              3. match each test unit to one or more control units on propensity score

                 nearest-neighbor matching / matching within a caliper of a certain width

              4. run the experiment on the matched samples

              5. conduct the post-experiment analysis on the matched samples (a matching sketch follows below)
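              A minimal matching sketch, assuming scikit-learn is available and using made-up covariates; the balance check in step 2 is omitted here for brevity.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

def propensity_match(X, treated):
    """Match each treated unit to its nearest control on propensity score."""
    # 1. model treatment assignment from covariates -> propensity score
    score = LogisticRegression(max_iter=1000).fit(X, treated).predict_proba(X)[:, 1]

    treated_idx = np.where(treated)[0]
    control_idx = np.where(~treated)[0]

    # 3. nearest-neighbor matching on the 1-D propensity score
    nn = NearestNeighbors(n_neighbors=1).fit(score[control_idx].reshape(-1, 1))
    _, matches = nn.kneighbors(score[treated_idx].reshape(-1, 1))
    matched_controls = control_idx[matches.ravel()]
    return treated_idx, matched_controls

# toy data: 2 covariates, treatment probability driven by the first one (made up)
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
treated = rng.random(500) < 1 / (1 + np.exp(-X[:, 0]))
t_idx, c_idx = propensity_match(X, treated)
print(len(t_idx), "treated units matched to", len(np.unique(c_idx)), "distinct controls")
```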
          
                    What if your data is highly skewed, or the statistic is hard to approximate with the CLT?

                       Use a transformation / winsorization / capping / the bootstrap.

                    The bootstrap is a resampling method. It can be used to estimate the sampling distribution of any statistic, and is commonly used for estimating confidence intervals, p-values, and statistics with complex or no closed-form estimators.
          
                    Procedure:

                    1. randomly draw a sample of size n with replacement from the original data

                    2. repeat step 1 many times

                    3. estimate the statistic from its distribution over the generated samples

                       pros:

                       no assumptions about the distribution of the original data

                       simple to implement

                       cons:

                       computationally expensive
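                    A minimal percentile-bootstrap sketch; the skewed revenue data and the statistic of interest are made up for illustration.

```python
import numpy as np

def bootstrap_ci(data, stat_fn, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for an arbitrary statistic."""
    rng = np.random.default_rng(seed)
    n = len(data)
    # steps 1-2: resample with replacement many times, recording the statistic
    boot_stats = np.array(
        [stat_fn(rng.choice(data, size=n, replace=True)) for _ in range(n_boot)]
    )
    # step 3: use the empirical distribution of the statistic
    lo, hi = np.percentile(boot_stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi

# example: a skewed metric (log-normal revenue per user, numbers made up)
rng = np.random.default_rng(1)
revenue = rng.lognormal(mean=3.0, sigma=1.0, size=2000)
print("95% CI for the 90th percentile:",
      bootstrap_ci(revenue, lambda x: np.percentile(x, 90)))
```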
    • Implement

      • code change & testing
      • run experiment & monitor
    • measurement

      • result measurement

        • data exploration

        imbalanced assignment:

        check the % of test/control units. Does the split match the design of experiment (DOE)? A quick sample-ratio check is sketched below.

        mixed assignment (units exposed to both variants):

        if the # of mixed samples is small, it is OK to remove them; if it is large, you need to figure out why

        What is the problem with throwing away mixed samples? It can introduce selection bias, because the units that end up mixed (e.g. heavy, multi-device users) may not be a random subset.

        sanity check:

        are test/control similar in factors other than the treatment?
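        One common way to check for an imbalanced assignment (a sample ratio mismatch) is a chi-square goodness-of-fit test against the planned split; the counts below are hypothetical.

```python
from scipy import stats

# observed unit counts and the split planned in the DOE (both hypothetical)
n_test, n_control = 50_421, 49_562
planned_split = (0.5, 0.5)

total = n_test + n_control
expected = [total * planned_split[0], total * planned_split[1]]
chi2, p_value = stats.chisquare([n_test, n_control], f_exp=expected)

# a very small p-value means the observed split is unlikely under the design:
# investigate the assignment pipeline before trusting any metric comparison
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}")
```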

        • hypothesis test

          conduct the test / handle multiple testing:

          most often, use a t-test

          when the variance is known or the sample size is large, a z-test can be used

          when the sample size is small, non-parametric methods can be used

          for complicated statistics, the bootstrap can be used to calculate the p-value
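          A sketch of these choices side by side, assuming scipy and statsmodels are available; the conversion data is simulated.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.proportion import proportions_ztest

rng = np.random.default_rng(7)
a = rng.binomial(1, 0.100, 20_000)   # control conversions (simulated)
b = rng.binomial(1, 0.104, 20_000)   # treatment conversions (simulated)

# t-test on the per-user metric (the usual default)
print("t-test:", stats.ttest_ind(b, a, equal_var=False))

# z-test on proportions, fine here because the samples are large
counts, nobs = np.array([b.sum(), a.sum()]), np.array([len(b), len(a)])
print("z-test:", proportions_ztest(counts, nobs))

# non-parametric alternative for small or non-normal samples
print("Mann-Whitney U:", stats.mannwhitneyu(b, a, alternative="two-sided"))
```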

          • result analysis

            pre-bias adjustment / analysis unit differing from the assignment unit

            cohort analysis

      • data analysis

      • decision making

        if all metrics move positively:

        and meet expectations, yes, it is ready to launch

        be cautious if the result looks too good; it may need investigation (e.g. outliers)

        if some metrics move negatively:

        are the moves expected? are those metrics important?

        deep dive to find the causes

        if the results are neutral:

        slice/dice on sub-groups

  Multiple testing

  What if you have multiple test groups?

  The false positive rate is much higher when doing multiple testing; you need to control the family-wise false positive rate (or the false discovery rate).

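  A minimal sketch of two standard corrections, Bonferroni (family-wise error rate) and Benjamini-Hochberg (false discovery rate), using statsmodels; the p-values below are placeholders.

```python
from statsmodels.stats.multitest import multipletests

# p-values from several treatment arms tested against the same control (placeholders)
p_values = [0.012, 0.048, 0.260, 0.003, 0.071]

# Bonferroni controls the family-wise error rate (conservative)
reject_bonf, p_bonf, _, _ = multipletests(p_values, alpha=0.05, method="bonferroni")

# Benjamini-Hochberg controls the false discovery rate (less conservative)
reject_bh, p_bh, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")

print("Bonferroni reject:", reject_bonf)
print("BH reject:        ", reject_bh)
```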

  Pre-bias adjustment

  when the A/B groups already differ before the experiment starts.

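  One common way to correct for such a pre-existing gap is a difference-in-differences style adjustment: compare each group's change from its own pre-period instead of the raw post-period values (CUPED-style regression adjustment is another option). The sketch below uses simulated data and is only illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n = 5_000

# simulated per-user metric; the test group happens to start 0.5 higher (pre-bias)
pre_control = rng.normal(10.0, 2.0, n)
pre_test = rng.normal(10.5, 2.0, n)
post_control = pre_control + rng.normal(0.0, 1.0, n)   # no treatment effect
post_test = pre_test + rng.normal(0.2, 1.0, n)         # +0.2 true effect

# naive comparison is confounded by the pre-existing 0.5 gap
print("naive:", stats.ttest_ind(post_test, post_control))

# difference-in-differences: compare each unit's change from its own pre-period
print("adjusted:", stats.ttest_ind(post_test - pre_test, post_control - pre_control))
```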



  A/B test can be summarized into the 5 steps below:
  (1). choose and characterize metrics to evaluate your experiment, i.e. what do you care about, how do you want to measure the effect.
  Brainstorm potential metrics. Use the customer conversion funnel to summarize the process. Invariant metrics are not related to the change; evaluation metrics are related to the change.
  (2). Choose the significance level (alpha), the statistical power (1 − beta), and the practical significance level at which you would actually want to launch the change if the test is statistically significant
  (3). Calculate required sample size
  (4). Take sample for control/ treatment groups and run the test
  (5). Analyze the results and draw valid conclusions
  Sanity check: invariant metrics should not differ between the experiment and control groups
  Analyze evaluation metrics
  Compute the pooled mean/conversion probability, then the pooled standard error, then the margin of error (z * SE). Then take the difference between experiment and control and its upper and lower bounds (p_diff +/- margin of error). Compare against 0 (statistical significance) and against the required difference (practical significance). A worked sketch follows below.
  Sign test: confirm the result with a sign test, checking whether the number of successes (e.g. days on which the experiment beat control) out of the total number of trials is statistically significant.
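  A worked sketch of this calculation; all counts below are made up, and scipy.stats.binomtest requires SciPy >= 1.7.

```python
import numpy as np
from scipy import stats

# conversions / visitors (made-up numbers)
x_ctrl, n_ctrl = 974, 10_072
x_exp, n_exp = 1_105, 9_886

p_ctrl, p_exp = x_ctrl / n_ctrl, x_exp / n_exp
p_pool = (x_ctrl + x_exp) / (n_ctrl + n_exp)
se_pool = np.sqrt(p_pool * (1 - p_pool) * (1 / n_ctrl + 1 / n_exp))

z = stats.norm.ppf(0.975)                 # 95% confidence
diff = p_exp - p_ctrl
margin = z * se_pool
print(f"diff = {diff:.4f}, CI = [{diff - margin:.4f}, {diff + margin:.4f}]")
# statistically significant if the CI excludes 0; practically significant
# if it also clears the minimum difference you care about

# sign test: on how many of, say, 14 days did the experiment beat control?
days_experiment_won, total_days = 11, 14
print("sign test p-value:", stats.binomtest(days_experiment_won, total_days, 0.5).pvalue)
```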
  • Situations we can't analyze through an A/B test. A/B tests can't evaluate a completely new experience, because (1) there is no clear baseline to compare against and (2) it takes time for users to adapt to the new experience. Long-term effects are also hard to test with an A/B test.

  • How many variants should we have in an A/B test? The goal of an A/B test should be clear. A number of factors from each different design can muddy the results. We suggest running two versions against each other, and then running a second test afterwards to compare the winners.

  • What do I do if I do not trust the results? If you really don’t trust the results and have ruled out any errors or challenges to the test’s validity, the best thing to do is to run the same test again. Treat it as an entirely separate test and see if you can replicate the results. If you can replicate again and again, you probably have a solid set of results.

  • What if I do not have a control? A control is the existing version of a landing page or webpage that you are testing against. Sometimes you may want to test two versions of a page that never existed before, and that's okay. Just choose one of the variations and call that one the control. Try to pick the one that's most similar to how you currently design pages and use the other as the treatment.

  • When an A/B test is not useful, what can you do? Analyze user activity logs; conduct retrospective analysis; conduct user experience research; run focus groups and surveys; use human evaluation.

  • Metrics The metrics we choose for sanity check are called invariant metrics. They are not supposed to be affected by the experiment. They should not change across control and experiment groups. Evaluation metrics are used to measure which variation is better. For example daily active users (DAU) to measure user engagement; click through rate (CTR) to measure a button design on a webpage.

There are four categories of metrics:

  • Sums and counts

  • Distribution (mean, median, percentiles)

  • Probability and rates (click through probability and click through rate)

  • Ratios: any two numbers divide by each other

    Sensitivity and robustness: you want to choose a metric that has high sensitivity, so the metric can pick up the change you care about. You also want the metric to be robust against changes you don't care about. There is a balance between sensitivity and robustness; you need to look into the data to find out which metric to use.

How to measure the sensitivity and robustness?

  • Run experiments
  • Use A/A test to see if metrics pick up difference (if yes, then the metric is not robust)
  • Retrospective analysis
  • Significance level, statistical power and practical significance level. Usually the significance level is 0.05 and the power is 0.8. The practical significance level varies depending on the individual test and is higher than the statistical significance threshold. You may not want to launch a change even if the test is statistically significant, because you need to consider

    • The business impact of the change
    • Whether it is worth to launch considering the engineering cost, customer support, sales issue and opportunity cost
  • How to calculate the sample size? The sample size required for a valid hypothesis test depends on the following 5 parameters:

    • The conversion rate of the control variation (the baseline value)
    • The minimum difference between control and experiment to be detected. The smaller the difference to be detected, the bigger the required sample size.
    • The chosen confidence/significance level
    • The chosen statistical power
    • The type of test: one- or two-tailed. The sample size for a two-tailed test is somewhat larger.

    There are various online testing tools: G*Power, Evan Miller's calculator, Google Analytics, etc. If using R, first calculate the z value from alpha using qnorm(). Then, over a grid of sample sizes, calculate beta (the probability of failing to reject the null when the alternative is true) using pnorm(); the smallest sample size whose beta <= the required beta is the required sample size. This works because as the sample size grows the standard error shrinks, so the power of the test increases. Formula (per group, two-sided, equal variances): n ≈ 2 (z_{1−α/2} + z_{1−β})² σ² / δ², where δ is the minimum detectable difference and σ² ≈ p(1−p) for a conversion rate. A Python version of this grid search is sketched below.
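    A Python analogue of the described procedure, with norm.ppf/norm.cdf playing the role of qnorm()/pnorm(); the baseline rate, detectable difference, alpha, and power are assumed values.

```python
import numpy as np
from scipy.stats import norm

p1, p2 = 0.10, 0.11          # baseline and target conversion rates (assumed)
alpha, target_power = 0.05, 0.80

z_alpha = norm.ppf(1 - alpha / 2)        # qnorm() in the R description
delta = abs(p2 - p1)

for n in range(100, 200_000, 100):       # grid of per-group sample sizes
    se = np.sqrt(p1 * (1 - p1) / n + p2 * (1 - p2) / n)
    # power = P(reject H0 | true difference = delta), pnorm() in the R description
    power = 1 - norm.cdf(z_alpha - delta / se)
    if power >= target_power:
        print(f"~{n} users per group gives power {power:.3f}")
        break
```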

  • How to split sample? The sample size in control and experiment should be statistically equal.

  • Correlational VS causal

  • Advantages of A/B tests: a scientific way to establish causality, i.e. that the changes in metrics are caused by the change introduced in the treatment; sensitivity, since you can detect tiny changes in metrics; and the ability to detect unexpected consequences.

