Post-selection inference on Friends titles in R


This post applies a post-selection inference procedure to estimate the effect of character names in Friends episode titles on their ratings. Ross is good, apparently.

Goal

I want to be a Friends scriptwriter. Can I pick a title that makes an episode an automatic classic? If I just include a character’s name in the title, does that automatically make it popular? I assume I should just write “The One Where Rachel is Rachel”. But let’s investigate.

The question: what is the effect of a character’s name in the episode title on the episode rating?

Data

We’ll use the Friends data we rvest-ed from IMDb in a previous post, or just load it via devtools::install_github('tweed1e/werfriends').

library(tidyverse); library(glmnet); library(broom)  # packages used throughout
titles <- werfriends::friends_episodes %>% as_tibble()
titles
## # A tibble: 236 x 7
##    season episode title                 rating n_ratings director  writers
##     <dbl>   <dbl> <chr>                  <dbl>     <dbl> <list>    <list> 
##  1     1.      1. The One Where Monica…   8.50     4317. <chr [1]> <chr […
##  2     1.      2. The One with the Son…   8.20     3107. <chr [1]> <chr […
##  3     1.      3. The One with the Thu…   8.30     2900. <chr [1]> <chr […
##  4     1.      4. The One with George …   8.30     2810. <chr [1]> <chr […
##  5     1.      5. The One with the Eas…   8.60     2768. <chr [1]> <chr […
##  6     1.      6. The One with the But…   8.30     2695. <chr [1]> <chr […
##  7     1.      7. The One with the Bla…   9.00     3516. <chr [1]> <chr […
##  8     1.      8. The One Where Nana D…   8.20     2594. <chr [1]> <chr […
##  9     1.      9. The One Where Underd…   8.30     2516. <chr [1]> <chr […
## 10     1.     10. The One with the Mon…   8.20     2544. <chr [1]> <chr […
## # ... with 226 more rows

The problem: all we really have is titles, writers, and directors. How do we use natural language information to predict episode ratings? My first idea is to use dummy variables (aka factors with levels c("0","1")) to encode the title words and the director and writer names. In other words, the S4E19 episode titled “The One with All the Haste” would have a 1 in the haste variable, but a 0 for every character dummy variable (ross, rachel, etc.). (S4E19 is the ep where Monica and Rachel steal their apartment back from Chandler and Joey. So good.)

Method

So, once we create language dummy variables to try to control for episode characteristics that may affect ratings, we find ourselves with too many predictors. There are only 236 episodes but we create way more language dummies 🤔.

This is a high-dimensional problem, with more explanatory variables than observations (p >> n). A few econometricians have proposed a post-selection estimator for exactly this kind of problem, which has a slightly different flavour than the usual machine learning / prediction setting: we want inference on a handful of coefficients, not just good predictions.

Consider the goal: I want to know the effect of including a character’s name in an episode title. LASSO selects important variables that predict the outcome (rating, in my case). If we leave the character names unpenalized but there is another variable correlated with both the outcome and one of the character names, we can end up with a biased estimate of the character effect. E.g., say most “Ross” episodes are directed by Kevin Bright, who is a “good” director (according to the mean rating of all the episodes he directed). Then LASSO keeps “Ross” in the model but drops “Kevin Bright” because it’s so correlated with “Ross” (which means it doesn’t improve the rating prediction much), and LASSO will tell us that “Ross” has a high positive coefficient. But we know that Ross is trash and it’s only because Kevin Bright directed his episodes!

The solution is called post-selection inference, explained in “High-dimensional methods and inference on structural and treatment effects” (Belloni et al., 2013). They actually call it a ‘double selection procedure’ or post-double-selection, but post-selection is fine with me.

The idea is: run a LASSO of the outcome (rating) on everything except the character names; then run a LASSO of each character name on the rest of the RHS variables. Take the union of the variables with non-zero coefficients from all of these LASSO runs, then regress rating on the character dummies plus all the selected variables via a linear model (aka ordinary least squares).
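Concretely, here’s a minimal sketch of those three steps. The names y, d, and x are placeholders I’m making up (the outcome, the character dummies, and the other controls as a numeric vector and matrices); the real objects used below are episodes and ep_mm.

library(glmnet)

# schematic of double selection: y = numeric outcome, d = matrix of character
# dummies, x = matrix of all other controls (all hypothetical placeholders)
double_selection_sketch <- function(y, d, x) {
  nonzero <- function(fit) {
    b <- coef(fit, s = "lambda.min")[-1, ]  # drop the intercept
    names(b)[b != 0]
  }
  s1 <- nonzero(cv.glmnet(x, y))            # step 1: controls that predict the outcome
  s2 <- unlist(lapply(seq_len(ncol(d)),     # step 2: controls that predict each character
                      function(j) nonzero(cv.glmnet(x, d[, j]))))
  # step 3: plain OLS of the outcome on the characters plus the union of selected controls
  lm(y ~ ., data = data.frame(y = y, d, x[, union(s1, s2), drop = FALSE]))
}

The implementation below follows the same steps, just with the ep_mm model matrix and a bit of name wrangling to map the selected columns back onto the episodes data.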

Code

There are a few annoying steps.

First, I use tidytext to convert titles to words, and then spread those words into dummy factors for the dataset (the dataset is very ‘wide’ after this). I create dummy factors for directors and writers as well.
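The word-dummy step isn’t shown in full (it’s in the post’s source), but it looks roughly like the sketch below. title_dummies is a name I’m making up, and filtering with tidytext’s stop_words is my assumption about how words like “the”, “one”, “with”, and “where” got dropped.

library(tidytext)

# rough sketch: one row per episode, one 0/1 column per (non-stop-word) title word
title_dummies <- titles %>%
  select(season, episode, rating, title) %>%
  unnest_tokens(word, title) %>%            # split titles into lowercase words
  anti_join(stop_words, by = "word") %>%    # drop "the", "one", "with", "where", ...
  distinct() %>%                            # a word only needs to appear once per title
  mutate(value = 1) %>%
  spread(key = word, value = value, fill = 0)

Director and writer dummies can be built the same way from their list-columns.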

Next, run the naive linear model, and save the estimates. Then run naive LASSO and save the estimates. Then run the double-selection procedure.

Process

Do ugly stuff (check the source for the full code) and end up with this:

# the dataset
episodes
## # A tibble: 236 x 352
##    rating chandler joey  monica phoebe rachel ross  season episode
##     <dbl> <fct>    <fct> <fct>  <fct>  <fct>  <fct> <fct>  <fct>  
##  1   8.50 0        0     1      0      0      0     1      1      
##  2   8.20 0        0     0      0      0      0     1      2      
##  3   8.30 0        0     0      0      0      0     1      3      
##  4   8.30 0        0     0      0      0      0     1      4      
##  5   8.60 0        0     0      0      0      0     1      5      
##  6   8.30 0        0     0      0      0      0     1      6      
##  7   9.00 0        0     0      0      0      0     1      7      
##  8   8.20 0        0     0      0      0      0     1      8      
##  9   8.30 0        0     0      0      0      0     1      9      
## 10   8.20 0        0     0      0      0      0     1      10     
## # ... with 226 more rows, and 343 more variables: n_ratings <fct>,
## #   after <fct>, apothecary <fct>, armadillo <fct>, assistant <fct>,
## #   award <fct>, baby <fct>, babysits <fct>, bag <fct>, ball <fct>,
## #   ballroom <fct>, barbados <fct>, barry <fct>, bath <fct>, beach <fct>,
## #   bed <fct>, bing <fct>, birth <fct>, birthday <fct>, birthing <fct>,
## #   blackout <fct>, blind <fct>, boob <fct>, boobies <fct>, book <fct>,
## #   boots <fct>, box <fct>, brain <fct>, breast <fct>, bullies <fct>,
## #   bus <fct>, butt <fct>, c.h.e.e.s.e <fct>, cake <fct>, candy <fct>,
## #   car <fct>, cat <fct>, caught <fct>, champion <fct>, cheap <fct>,
## #   cheesecakes <fct>, chick <fct>, chicken <fct>, christmas <fct>,
## #   class <fct>, closet <fct>, consuela <fct>, cookies <fct>,
## #   cooking <fct>, cop <fct>, cousin <fct>, cries <fct>, crosses <fct>,
## #   crush <fct>, cry <fct>, cuffs <fct>, dad <fct>, dancing <fct>,
## #   date <fct>, dates <fct>, day <fct>, denial <fct>, detergent <fct>,
## #   device <fct>, dies <fct>, dinner <fct>, dirty <fct>, dogs <fct>,
## #   dollhouse <fct>, donor <fct>, dozen <fct>, dr <fct>, dream <fct>,
## #   dress <fct>, dresses <fct>, duck <fct>, east <fct>, eddie <fct>,
## #   eggplant <fct>, elizabeth <fct>, embryos <fct>, emma <fct>,
## #   engagement <fct>, estelle <fct>, everybody <fct>, evil <fct>,
## #   factor <fct>, fake <fct>, fantasy <fct>, fertility <fct>,
## #   fighting <fct>, fine <fct>, flashback <fct>, flirt <fct>,
## #   football <fct>, forward <fct>, frank <fct>, free <fct>, french <fct>,
## #   fridge <fct>, …
# also created `ep_mm`, a sparse model matrix based on `episodes`

This is a dataset with a million factors for different words that show up in titles, plus factors for all the directors and writers. There are only 236 episodes/observations but 351 potential explanatory variables, leaving us in a high-dimensional situation, p >> n (although a pretty low-dimensional high-dimensional situation).
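The sparse model matrix ep_mm mentioned in the code comment above also isn’t built in the visible code. A plausible construction (my assumption, not necessarily what the source does) uses Matrix::sparse.model.matrix:

library(Matrix)

# expand all factors into dummy columns and drop the intercept; because the character
# factors come right after `rating` in `episodes`, their dummies (chandler1, joey1, ...)
# end up as the first six columns, which the penalty factor in the next section relies on
ep_mm <- sparse.model.matrix(rating ~ ., data = episodes)[, -1]

glmnet accepts this sparse format directly, which is handy when almost every entry is 0.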

Do naive linear model / OLS

The naive linear model resolves the high dimensional problem by selecting variables manually: I want to control for season and episode effects, and estimate the character effects.

naive <- lm(rating ~ season + episode + 
              chandler + joey + monica + 
              phoebe + rachel + ross, 
            data = episodes)

Do naive LASSO

Or we could use LASSO to select our control variables for us; we remove the penalty on the characters so their coefficients aren’t shrunk to zero and they stay in the model.

# penalty factor: 0 for the first 6 columns of the model matrix (the characters), 
# then 1 for the rest, so the character dummies are never penalized.
penalty.factor <- c(rep(0, 6), rep(1, ncol(ep_mm) - 6))

# use `glmnet` to run cross-validated lasso with the new penalty.factor
c <- cv.glmnet(ep_mm, episodes$rating, penalty.factor = penalty.factor) %>%
  coef(s = 'lambda.min') %>% .[-1,] # coefficients at the CV-optimal lambda, intercept dropped

# then create dataset of coefficient estimates
lasso <- tibble(term = names(c), estimate = c)
lasso
## # A tibble: 236 x 2
##    term      estimate
##    <chr>        <dbl>
##  1 chandler1   0.115 
##  2 joey1      -0.0525
##  3 monica1    -0.0387
##  4 phoebe1    -0.152 
##  5 rachel1    -0.0470
##  6 ross1       0.192 
##  7 season2     0.    
##  8 season3     0.    
##  9 season4     0.    
## 10 season5     0.0565
## # ... with 226 more rows

Do post-selection estimation


get_lasso_coefs <- function(name, x) { 
  # return the names of the non-zero LASSO coefficients
  # when predicting one RHS variable (a character dummy) with the rest of x
  c <- cv.glmnet(x[, colnames(x) != name], x[,name]) %>%
    coef(s = 'lambda.min') %>% .[-1,]
  names(c)[c != 0]
}

get_lasso_coefs_ <- function(y, names, x) { 
  # return the names of the non-zero LASSO coefficients
  # when predicting the outcome y with x (after first removing the character variables)
  c <- cv.glmnet(x[, !colnames(x) %in% names], y) %>%
    coef(s = 'lambda.min') %>% .[-1,]
  names(c)[c != 0] # only return names that are non-zero
}

With those functions, we can now loop over the character variables and collect the non-zero LASSO coefficients.

# list of characters; append "1" to match the model-matrix column names 
# (e.g. "chandler1")
friends <- c("chandler", 
             "joey", 
             "monica", 
             "phoebe", 
             "rachel", 
             "ross") %>% 
  map_chr(paste0, "1")

# map list of characters into lasso coefficient function
lasso_coefs <- friends %>%
  map(get_lasso_coefs, x = ep_mm) %>% 
  unlist() 

# put all the selected variables together
post_vars <- c(friends, 
               lasso_coefs,
               get_lasso_coefs_(episodes$rating, friends, ep_mm)) %>% unique()

# clean up the names: keep season/episode dummies as-is ('season5', 'episode1' will
# match the spread columns below), but strip the trailing digit from the 
# character/word dummies ('chandler1' -> 'chandler') to match `episodes`
post_vars <- post_vars %>% 
  as_tibble() %>% 
  mutate(value = ifelse(grepl("season|episode", value), 
                        value, 
                        gsub("\\d$", "", value))) %>%
  pull(value) %>% c(gsub("1", "", friends), .) %>% unique()

# build the post-selection dataset: spread season and episode into dummy columns,
# convert everything but rating to factors, and keep only the selected variables
post_dat <- episodes %>% mutate(x = 1) %>% 
  spread(key = season, value = x, fill = 0, sep = "") %>% mutate(x = 1) %>% 
  spread(key = episode, value = x, fill = 0, sep = "") %>% 
  mutate_at(vars(-rating), factor) %>% 
  select(one_of(c("rating", post_vars)))

# the post-selection lm formula
f <- paste("rating ~ ", paste(post_vars, collapse = " + "))

Results

coefs <- bind_rows(
  # post-selection model coefficients (rows 2:7 are the six characters, after the intercept)
  lm(f, data = post_dat) %>% summary() %>% tidy() %>% 
    .[2:7,] %>% as_tibble() %>% mutate(model = "post lm"),
  # naive lm model coefficients (the last six terms are the characters)
  naive %>% tidy() %>% tail() %>% as_tibble() %>% mutate(model = "naive lm"),
  # lasso selection (the first six rows are the characters)
  lasso[1:6, ] %>%  mutate(model = "lasso")
)

# bar chart for coefficients + errorbars by character and model
coefs %>% 
  ggplot(aes(x = factor(term), y = estimate, group = factor(model), fill = factor(model))) + 
  geom_col(position = 'dodge') + 
  geom_errorbar(
    aes(ymin = estimate - 1.96 * std.error, ymax = estimate + 1.96 * std.error),
    width = 0.4,
    position = position_dodge(width = 0.9)
  ) + labs(title = 'Post-selection model coefficients are the best', x = 'Character', y = 'Estimate')
## Warning: Removed 6 rows containing missing values (geom_errorbar).

The naive linear model (significantly) says Joey sucks! If Joey’s name is in the title, the episode is typically rated 0.19 stars lower! Phoebe also sucks, but not with such certainty. The naive linear model is also pretty sure none of the other characters matter.

The naive LASSO model reverses that: Joey doesn’t matter as much; but Phoebe still sucks and Ross is now great! (NB: you know that’s not true, Ross is trash.)

Ok, I was wrong, the post-selection procedure thinks Ross is great. Most of the other characters’ estimates are closer to zero, and quite different from the naive linear model. Phoebe looks especially vindicated by the post-selection model. (The error bars really drive home the uncertainty, thanks to the tips in the ggplot2 docs I found while trying to figure out position = 'dodge' in geom_col.) Of course we’re not dealing with a sample of Friends eps, so it’s hard to say what the standard errors are supposed to represent, really. But that’s a subject for another time.

What does that mean for us? I thought character names would matter (e.g., ‘Rachel’ episodes would be rated higher); mostly they don’t, and post-selection inference seems like a robust way to check. So when I write my Friends episodes, I don’t have to care too much about putting characters in the titles, other than Ross, and I’m defo not writing a Ross episode. Or maybe I’ll write “The One without Ross”.