Why bother with covariates in A/B testing?


Motivation

I’ll skip the part where I tell you why A/B testing is important. Just look at any data science team in tech: Microsoft, Airbnb, Twitter, Facebook, and so on.

And I’m starting to see this statement often:

The variance that pre-experiment data can explain in a metric is unrelated to any effects of the experiment and can, therefore, be removed. (Source: CUPED.)

Economists spend their lives trying to get observational data to behave as if treatments had been randomly assigned. Usually that involves adding covariates until you get an answer you like, or using increasingly complicated methods (OLS, difference-in-differences, fixed effects, IVs, regression discontinuities, dynamic panels, synthetic controls, a million different standard error estimators, and so on). But A/B testing starts with random assignment! The hard work is already done! Still, data science has slightly different issues, so let’s take a look.

The first thing economists learn is that you can drop any covariate that is uncorrelated with either the treatment or the outcome and still get a consistent estimate of the treatment effect. But if you like, you can include a covariate that is uncorrelated with the treatment but correlated with the outcome: it soaks up residual variance, which shrinks the estimated residual standard error and, with it, the coefficient’s standard error (the other part of the coefficient’s standard error shouldn’t be affected in large samples, because the treatment and the covariate are uncorrelated). So the CUPED strategy seems pretty obvious to an economist, but I need the practice, so I thought I’d look at it anyway. Does it really work that well? Why?
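
To make that intuition concrete, here’s a throwaway OLS sketch in base R (my own numbers, nothing to do with the CUPED setup): a randomly assigned treatment, a covariate that is independent of it but drives the outcome, and the same regression run with and without the covariate.

# Throwaway OLS sketch: adding an outcome-correlated, treatment-independent
# covariate leaves the estimate alone but shrinks its standard error
set.seed(1)
n <- 10000
treatment <- rbinom(n, 1, 0.5)
covariate <- rnorm(n)                      # independent of the treatment
outcome   <- 0.1 * treatment + covariate + rnorm(n)

summary(lm(outcome ~ treatment))$coefficients["treatment", ]
summary(lm(outcome ~ treatment + covariate))$coefficients["treatment", ]
# The estimate barely moves, but the standard error shrinks by roughly a factor
# of sqrt(2), because the covariate absorbs half of the outcome variance.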

[Whoops! The answer really is that it reduces variance in the outcomes; the other stuff I find below is just me confusing marginal and conditional effects. At first I thought that was the point (the econometrician doesn’t know the true model, so doesn’t know to compare conditional and marginal effects), but I was wrong. Anyway, wait for another post on that!]

R setup and data

Let’s get R started and make up a dataset.

library(tidyverse)
library(broom)
library(viridis)

# A function that creates a tibble of fake data, 
# depending on the parameters you put in
create_data <- function(N, beta, p = 0.5, sigma_x = 1, sigma_e = 1) {
  # N: number of observations
  # beta: effect of treatment
  # p: probability of treatment
  # sigma_x, sigma_e: standard deviations of the covariate and the error
  #                   (both independent of the treatment vector)
    
  tibble(treatment = rbinom(N, 1, p),
         covariate = rnorm(N, 0, sigma_x),
         error = rnorm(N, 0, sigma_e),
         # although the errors in logit are logistic, not normal
         # so I should be using rlogis instead of rnorm ^^^^
         outcome = as.integer(beta * treatment + covariate + error > 0))
}

# Create a dataset of parameters,
dfs <- tibble(N = rep(10000, 1000), 
              beta = rep(0.1, 1000)) %>% 
  mutate(true_coefficient = log(pnorm(beta * 1) / (1 - pnorm(beta * 1)))) 

# Then call create_data on each set of parameters to create 1000 independent 
# copies of a dataset (in list-col form)
dfs <- dfs %>% 
  rowwise() %>% 
  mutate(data = list(create_data(N = N, beta = beta))) %>% 
  ungroup()
dfs
## # A tibble: 1,000 x 4
##         N  beta true_coefficient data                 
##     <dbl> <dbl>            <dbl> <list>               
##  1 10000. 0.100            0.160 <tibble [10,000 × 4]>
##  2 10000. 0.100            0.160 <tibble [10,000 × 4]>
##  3 10000. 0.100            0.160 <tibble [10,000 × 4]>
##  4 10000. 0.100            0.160 <tibble [10,000 × 4]>
##  5 10000. 0.100            0.160 <tibble [10,000 × 4]>
##  6 10000. 0.100            0.160 <tibble [10,000 × 4]>
##  7 10000. 0.100            0.160 <tibble [10,000 × 4]>
##  8 10000. 0.100            0.160 <tibble [10,000 × 4]>
##  9 10000. 0.100            0.160 <tibble [10,000 × 4]>
## 10 10000. 0.100            0.160 <tibble [10,000 × 4]>
## # ... with 990 more rows

Use logit to estimate treatment effect

Next, we want to estimate the treatment effect for each dataset. I use glm to estimate a logit model (since the outcome is binary), with two specifications: in one, I only include the treatment, and in the other, I include the treatment and the covariate.

# A function to estimate the two models, save the estimates and return them in a tibble
# to go in another list-col in our original tibble
get_estimates <- function(data) {
  # estimate base model, treatment only, no covariates
  base <- glm(outcome ~ treatment, family = binomial(link = "logit"), data = data)
  # estimate covariate model, including treatment and covariate
  covariate <- glm(outcome ~ treatment + covariate, family = binomial(link = "logit"), data = data)
  # tidy each model, bind the results into one tibble, and keep only the treatment row
  list(base = base, 
       covariate = covariate) %>% 
    map(tidy) %>% 
    bind_rows(.id = "model") %>%
    filter(term == "treatment")
}

Then we estimate the models for all the datasets:

dfs_estimates <- dfs %>% 
  rownames_to_column("id") %>% 
  mutate(estimates = map(data, get_estimates)) %>% 
  unnest(estimates)
dfs_estimates
## # A tibble: 2,000 x 10
##    id         N  beta true_coefficient model    term    estimate std.error
##    <chr>  <dbl> <dbl>            <dbl> <chr>    <chr>      <dbl>     <dbl>
##  1 1     10000. 0.100            0.160 base     treatm…   0.0420    0.0400
##  2 1     10000. 0.100            0.160 covaria… treatm…   0.105     0.0493
##  3 2     10000. 0.100            0.160 base     treatm…   0.0663    0.0400
##  4 2     10000. 0.100            0.160 covaria… treatm…   0.149     0.0496
##  5 3     10000. 0.100            0.160 base     treatm…   0.115     0.0400
##  6 3     10000. 0.100            0.160 covaria… treatm…   0.204     0.0495
##  7 4     10000. 0.100            0.160 base     treatm…   0.178     0.0400
##  8 4     10000. 0.100            0.160 covaria… treatm…   0.207     0.0491
##  9 5     10000. 0.100            0.160 base     treatm…   0.159     0.0400
## 10 5     10000. 0.100            0.160 covaria… treatm…   0.190     0.0493
## # ... with 1,990 more rows, and 2 more variables: statistic <dbl>,
## #   p.value <dbl>

Done and done.

Plot 1000 estimates

dfs_estimates %>% 
  ggplot(aes(x = estimate, fill = model)) +
  geom_density(alpha = 0.65) + 
  geom_vline(xintercept = unique(dfs$true_coefficient), 
             colour = viridis_pal(option = "E")(1)) +
  scale_fill_viridis(option = "E", discrete = TRUE) +
  scale_x_continuous(breaks = c(0, 0.1, 0.16, 0.2, 0.3)) +
  theme_minimal() +
  labs(x = "Estimate", y = "Density", 
       title = "Including uncorrelated covariate reduces attenuation bias in logistic model",
       subtitle = "True coefficient shown as vertical line at 0.16") 

Whoops! Logit models don’t work the same way as OLS! Logit suffers from attenuation bias if you omit covariates that matter for the outcome, even when the treatment is randomly assigned and the covariate is uncorrelated with it. See some explanations here on stats.stackexchange.com and a paper and its slides. The funny thing is, these are sociology papers, and for the life of me I can’t remember anyone mentioning this in my econometrics courses. Maybe that’s because we focused so much on getting OLS and its variants (IV, etc.) right that I never noticed.
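
To convince myself this is bias rather than noise, here’s a quick sketch that reuses create_data from above on a single huge sample. The beta = 0.5 is my own choice, just to make the gap easy to see; the pattern is the same at beta = 0.1.

# One big sample: the base estimate converges below the covariate-adjusted one
set.seed(123)
big <- create_data(N = 1e6, beta = 0.5)

list(base      = glm(outcome ~ treatment, family = binomial(link = "logit"), data = big),
     covariate = glm(outcome ~ treatment + covariate, family = binomial(link = "logit"), data = big)) %>% 
  map(tidy) %>% 
  bind_rows(.id = "model") %>% 
  filter(term == "treatment") %>% 
  select(model, estimate, std.error)
# Even with a million observations the base estimate settles well below the
# covariate-adjusted one: leaving out the covariate attenuates the coefficient,
# it doesn't just add noise.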

Well dang. Now there are two moving parts when we add an uncorrelated covariate: (1) including the covariate should shrink the estimated residual variance, which should shrink the coefficient’s standard error; and (2) including the covariate corrects the attenuation bias, which moves the coefficient estimate itself. Let’s take a closer look at the estimates, standard errors and p-values.

Conclusion

dfs_estimates %>%
  select(id, model, estimate, std.error, p.value) %>% 
  gather(key = type, value = value, -(id:model)) %>% 
  ggplot(aes(x = model, y = value, group = model)) + 
  geom_boxplot() +
  expand_limits(y = 0) + 
  facet_wrap(~ type, scales = "free") + 
  labs(x = "Model", y = "", 
       title = "p-values are lower for covariate model because it corrects for\nattenuation bias, not because covariates reduce standard errors.")

This boxplot shows that p-values are lower in the covariate model because the estimates are corrected for attenuation bias; in other words, they’re farther from zero (here they’re higher, because the true effect is positive). The standard errors are actually higher in the covariate model, which I didn’t expect. If this were OLS, the standard errors would typically (but not always) be smaller, especially as the sample size grows.
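
As a rough check on that last claim (my own aside, not something the CUPED post does), here’s the same comparison with a linear probability model, i.e. plain lm on the binary outcome, using the first simulated dataset:

# Linear probability model on one simulated dataset, for contrast with logit
first <- dfs$data[[1]]
list(base      = lm(outcome ~ treatment, data = first),
     covariate = lm(outcome ~ treatment + covariate, data = first)) %>% 
  map(tidy) %>% 
  bind_rows(.id = "model") %>% 
  filter(term == "treatment") %>% 
  select(model, estimate, std.error)
# Here the covariate soaks up outcome variance, so the treatment standard error
# comes out somewhat smaller in the covariate model, the opposite of the logit
# pattern above.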

Next up, I should replicate the CUPED post to look at power and sample size.