Simulating A/B testing and experiment data

Simulating data is super useful for testing methods and for data science interview prep.

Simulation is a great way to study statistics. If you’re picking up a method for the first time (e.g., from the Udacity A/B testing course), it’s too easy to convince yourself you know what’s going on without actually doing the math or programming. For example, when I see binary treatments and outcomes, I wonder: “would I get the same answer from a binomial proportion test as from logistic regression?” You don’t have to guess (or derive the answer analytically)!

Just simulate it! If you want to practice a method, simulate a dataset with the desired treatment effect and other characteristics, then see if your method gets the right answer back. The tough part with real data and real experiments (and data science interview problems) is that you don’t know the real answer: you just implement a method that makes sense and hope you got it right.

There are a million different experiments you can run and a million datasets that represent those experiments, but I’ll try to outline some easy ones to get started.

Basic: binary treatment (D), continuous outcome (y) with error (e)

library(tibble)

N <- 1000    # number of individuals
sigma_e <- 1 # s.d. of error
beta <- 0.1  # effect of treatment

df <- tibble(
  id = seq_len(N), 
  D = rbinom(N, 1, 0.5), # 50% probability of treatment
  e = rnorm(N, 0, sigma_e),
  y = beta * D + e
)
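
A quick way to check yourself: fit a method and see whether the estimate lands near beta. Both of these should return something close to 0.1, give or take sampling noise.

lm(y ~ D, data = df)     # the coefficient on D estimates beta
t.test(y ~ D, data = df) # equivalent comparison of group means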

Basic with covariate: binary treatment (D), continuous covariate (X), continuous outcome (y) with error (e)

library(tibble)

N <- 1000
sigma_e <- 1
sigma_x <- 1  # s.d. of covariate
beta <- 0.1   # effect of treatment
beta_x <- 1   # effect of covariate on outcome

df <- tibble(
  id = seq_len(N), 
  D = rbinom(N, 1, 0.5),  # 50% probability of treatment
  X = rnorm(N, 0, sigma_x),
  e = rnorm(N, 0, sigma_e),
  y = beta * D + beta_x * X + e
)
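
Both of the regressions below are unbiased for beta here (X is independent of D), but adjusting for X soaks up residual variance and tightens the standard error on D:

summary(lm(y ~ D, data = df))     # unbiased, but noisier
summary(lm(y ~ D + X, data = df)) # same estimand, smaller standard error on D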

Basic with binary outcome: binary treatment (D), binary outcome (y) with error (e)

library(tibble)

N <- 1000
sigma_e <- 1 # s.d. of error
beta <- 0.1  # effect of treatment

df <- tibble(
  id = seq_len(N), 
  D = rbinom(N, 1, 0.5),  # 50% probability of treatment
  e = rnorm(N, 0, sigma_e),
  y = as.integer(beta * D + e >= 0)
  # ^^ you can change the probability of 0 or 1 by changing 
  # the cutoff '>=0' to something else here
)
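
This is the setup for the question from the intro: do a binomial proportion test and logistic regression agree? One way to run both:

xt <- table(df$D, df$y)      # rows: D; columns: y = 0, y = 1
prop.test(xt[, c("1", "0")]) # prop.test() wants successes in the first column

summary(glm(y ~ D, data = df, family = binomial))
# ^^ the glm coefficient is on the log-odds scale, so compare the implied
# probabilities (or the p-values), not the raw coefficients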

Time series: two periods (T), binary treatment (D), continuous outcome (y) with autoregressive error (e)

library(dplyr)
library(tibble)
library(tidyr)

N <- 100
T <- 2      # number of periods (careful: this masks the T shortcut for TRUE)
ar <- 0.5   # autoregressive parameter
beta <- 0.1 # effect of treatment

df <- crossing(
    id = seq_len(N),
    t = 1:T
  ) %>%
  group_by(id) %>% 
  mutate(
    e = as.numeric(arima.sim(model = list(ar = ar), n = T)), # drop the ts class
    # 50% probability of treatment, no treatment if t == 1
    D = ifelse(t == 2, rbinom(1, 1, 0.5), 0),  
    y = beta * D + e
  )
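
A couple of ways to estimate beta from this panel: compare treated and untreated in period 2, or first-difference within each id to exploit the positive autocorrelation in the errors.

# cross-sectional comparison in the treated period
lm(y ~ D, data = filter(df, t == 2))

# first differences within each id
diff_df <- df %>% 
  group_by(id) %>% 
  summarise(dy = y[t == 2] - y[t == 1], D = D[t == 2])
lm(dy ~ D, data = diff_df)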

Website log data: click events (click), multiple timestamped sessions per user (session), and treatments (treatment) randomly assigned to users (id)

library(dplyr)
library(tibble)
library(tidyr)
library(purrr)
library(digest)
library(lubridate)

N <- 1000
start <- ymd_hms("2018-01-01 0:00:00", tz = "America/Toronto")
lambda <- 5 # determines number of events per user
beta <- 0.1
p <- 0.5    # base probability of outcome.

df <- tibble(
    id = seq_len(N), 
    n_events = rpois(N, lambda)
  ) %>% 
  rowwise() %>% 
  mutate(id = digest(id)) # hash the integer ids so they look like real log ids

# simulate events to add to user data.
events_df <- df %>% 
    mutate(
            events = list(start + dseconds(runif(n_events, 0, (now() - start) / dseconds()))),
            # ^^ randomly sample sessions as the number of seconds from start date, then add to start date
            # change this from runif to get more realistic timestamps, use different timezones to practice
            # cleaning / validating data, etc. etc.
            treatment = rbinom(1, 1, 0.5)
            # ^^ randomly assign treatment to *users*
          ) %>% # end mutate
    ungroup() %>% 
    # as_tibble() on a bare vector is deprecated in recent tibble releases,
    # so wrap each timestamp vector in a one-column tibble instead
    mutate(events = map(events, ~ tibble(value = .x))) %>% 
    unnest(events) %>% 
    # base probability of click outcome is p, increased by beta if treated
    mutate(click = rbinom(length(treatment), 1, p + beta * treatment)) %>% 
    select(id, session = value, treatment, click) # re-organize

events_df %>% print(n = 5)
## # A tibble: 4,971 x 4
##   id                               session             treatment click
##   <chr>                            <dttm>                  <int> <int>
## 1 42171b71106f1957827f0b2452a74dc9 2018-01-07 05:36:13         1     0
## 2 42171b71106f1957827f0b2452a74dc9 2018-01-04 17:33:45         1     1
## 3 42171b71106f1957827f0b2452a74dc9 2018-01-06 13:26:51         1     1
## 4 2834e53a55f01c4b18ef398f0d0ef4fe 2018-01-01 06:36:04         0     1
## 5 2834e53a55f01c4b18ef398f0d0ef4fe 2018-02-17 17:12:25         0     1
## # ... with 4,966 more rows
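
As a quick sanity check, the difference in click rates between the two arms should land near beta = 0.1:

events_df %>% 
  group_by(treatment) %>% 
  summarise(click_rate = mean(click))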

What if you assigned treatment to sessions instead of to users (see this Skyscanner post for an overview of the problems you’re in for)?

events_df <- events_df %>% 
  mutate(treatment = rbinom(length(treatment), 1, 0.5),
         click = rbinom(length(treatment), 1, p + beta * treatment))

What’s the difference between these two datasets?

# with user ids
events_df %>% print(n = 3)
## # A tibble: 4,971 x 4
##   id                               session             treatment click
##   <chr>                            <dttm>                  <int> <int>
## 1 42171b71106f1957827f0b2452a74dc9 2018-01-07 05:36:13         1     1
## 2 42171b71106f1957827f0b2452a74dc9 2018-01-04 17:33:45         1     0
## 3 42171b71106f1957827f0b2452a74dc9 2018-01-06 13:26:51         1     1
## # ... with 4,968 more rows

# session id only
events_df %>% select(-id) %>% print(n = 3)
## # A tibble: 4,971 x 3
##   session             treatment click
##   <dttm>                  <int> <int>
## 1 2018-01-07 05:36:13         1     1
## 2 2018-01-04 17:33:45         1     0
## 3 2018-01-06 13:26:51         1     1
## # ... with 4,968 more rows

Even though you randomized over sessions instead of users, the first dataset at least keeps the user ids, so you might be able to correct for within-user session dependence. If you don’t have the user ids, you might be in trouble.
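
One way to use those user ids (assuming the sandwich and lmtest packages are installed) is to cluster the standard errors by user:

library(sandwich)
library(lmtest)

fit <- lm(click ~ treatment, data = events_df)
coeftest(fit, vcov = vcovCL, cluster = ~ id) # standard errors clustered by user

In this particular simulation, clicks are independent within users by construction, so clustering won’t change much. Add a user-level effect to the click probability if you want to see it matter.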

So much more!

A huge part of data science for consumer products is understanding the user experience. If you think you understand the user experience, can you model it? If you think of a pattern in or problem with the experience, can you model it? If not, what makes you think you can design an experiment to test it, or design an experiment to account for it?

Think about your experience when you use twitter:

  1. you’re in a meeting and you can’t check twitter for two hours
  2. you turn on twitter, you see a tweet from a friend, respond and close twitter
  3. you get reply/like notifications
  4. you turn on twitter again right away

Can you imagine the time dependence / autocorrelation in the pattern of your actions? You have no activity for a long time, but once you open the app, you’re on and off and on again until the conversation is over. Thinking through the user experience, then simulating data and a possible experiment, is a great way to practice solving A/B testing problems.
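
Here’s one minimal way to sketch that bursty pattern (all parameters invented for illustration): a two-state idle/active Markov chain over one-minute intervals.

library(tibble)

n_min   <- 240   # four hours of one-minute intervals
p_start <- 0.02  # chance an idle user opens the app in a given minute
p_stay  <- 0.7   # chance an active user is still active the next minute

active <- integer(n_min)
for (t in 2:n_min) {
  p_t <- if (active[t - 1] == 1) p_stay else p_start
  active[t] <- rbinom(1, 1, p_t)
}

activity <- tibble(minute = seq_len(n_min), active = active)
acf(activity$active) # long idle stretches punctuated by bursts => autocorrelation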