rvest + imdb -> explore Friends episode titles


This post includes R code to download Friends episode data from IMDB using the package rvest. It analyzes and visualizes episode data.

I always wanted to be a scriptwriter. But my approach to doing creative things is “find the secret, program it, retire”. So what’s the secret to a successful Friends episode? [Really, I want to write/experience a gentle introduction to rvest, and later tidytext and language data science.]

Get data

I’ll get data on Friends episodes from IMDB via rvest (see here for tutorials and documentation).

Start your engines:

library(tidyverse)
library(rvest)
library(skimr)
library(ggplot2)
library(ggrepel)
library(tidytext)

First, find the link, and download some html. This takes some back and forth—I’ll do that annoying thing where I present the finished product as if I thought of it straight away. Suppose we have two functions:

get_ep_data <- function(ep_url, url) {
  # get season no., episode no., episode title, episode rating, no. of ratings
  # director, and writers
  # return the tibble of episode data
}

get_season_data <- function(url) {
  # ...
  # return a tibble of data from all the episodes in a given season
}

Given those functions (we’ll talk about them later), we only need to go through the list of episodes and download everything.

# the base Friends url; it only needs a season number suffix
url <- "http://www.imdb.com/title/tt0108778/episodes?season="

# make a list of all the season urls (there are 10 seasons)
season_urls <- map_chr(1:10, function(n) {paste0(url, n)})

# for each season in the list, download all the data and put it together 
titles <- season_urls %>% map(get_season_data) %>% bind_rows()

Inspect the data

The data includes season, episode, title, rating, number of ratings, director and writers (in list form, because there’s usually more than one writer, and sometimes multiple directors).

titles
## # A tibble: 236 x 7
##    season episode title                 rating n_ratings director  writers
##     <dbl>   <dbl> <chr>                  <dbl>     <dbl> <list>    <list> 
##  1     1.      1. The One Where Monica…   8.50     4317. <chr [1]> <chr […
##  2     1.      2. The One with the Son…   8.20     3107. <chr [1]> <chr […
##  3     1.      3. The One with the Thu…   8.30     2900. <chr [1]> <chr […
##  4     1.      4. The One with George …   8.30     2810. <chr [1]> <chr […
##  5     1.      5. The One with the Eas…   8.60     2768. <chr [1]> <chr […
##  6     1.      6. The One with the But…   8.30     2695. <chr [1]> <chr […
##  7     1.      7. The One with the Bla…   9.00     3516. <chr [1]> <chr […
##  8     1.      8. The One Where Nana D…   8.20     2594. <chr [1]> <chr […
##  9     1.      9. The One Where Underd…   8.30     2516. <chr [1]> <chr […
## 10     1.     10. The One with the Mon…   8.20     2544. <chr [1]> <chr […
## # ... with 226 more rows

The rating distribution is skewed pretty high—the mean rating is 8.54, with a maximum of 9.7 and minimum of 7.4.
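
Those summary numbers come from a one-line summarize. Here is a minimal sketch of the calculation, assuming a toy stand-in for titles (toy_titles is a made-up name) that holds only the ten ratings printed above, so its results will not match the full-series figures. On the real data, skimr::skim(titles) would give a fuller overview.

```r
library(dplyr)

# Toy stand-in for `titles`: the ten ratings shown in the printout above
toy_titles <- tibble(
  episode = 1:10,
  rating  = c(8.5, 8.2, 8.3, 8.3, 8.6, 8.3, 9.0, 8.2, 8.3, 8.2)
)

# Mean, minimum and maximum rating in one row
rating_summary <- toy_titles %>%
  summarize(
    mean_rating = round(mean(rating), 2),
    min_rating  = min(rating),
    max_rating  = max(rating)
  )

rating_summary
```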

Ratings by season

Did Friends get better over time?

ggplot(titles, aes(x = factor(season), y = rating, group = season)) +
  geom_boxplot() + ylim(NA, 10) +
  labs(title = "Friends Ratings By Season", x = "Season #", y = "Rating")

Well, a bit—it peaks in Season 5 (✅), drops a bit in 9 (✅) and comes back for the final season. That checks out. It started good, hit its stride in the middle, started to lose steam around 8 and 9 (see the list of worst episodes ⬇️).

So what are the best and worst episodes?

# The best episodes
titles %>% arrange(-rating) %>% head(5)
## # A tibble: 5 x 7
##   season episode title                  rating n_ratings director writers 
##    <dbl>   <dbl> <chr>                   <dbl>     <dbl> <list>   <list>  
## 1     5.     14. The One Where Everybo…   9.70     5386. <chr [1… <chr [1…
## 2    10.     18. The Last One: Part 2     9.70     7064. <chr [1… <chr [2…
## 3     4.     12. The One with the Embr…   9.50     3979. <chr [1… <chr [2…
## 4    10.     17. The Last One: Part 1     9.50     4311. <chr [1… <chr [2…
## 5     2.     14. The One with the Prom…   9.40     3750. <chr [1… <chr [1…

‘Embryos’ is the ep where the winner of a quiz wins Monica and Rachel’s apartment. So good. ‘Everybody Finds Out’: it’s the one where Phoebe finds out about Monica/Chandler’s relationship; great ep (other than the Phoebe line I always hated “my eyes! my eyes!”). It’s also a great example of common knowledge and rational agents in game theory.

One problem with titles: “The One with the Embryos” gives no information other than a noun that is related to babies. 😟 No character names and no indication of why I like it.

# The worst episodes
titles %>% arrange(rating) %>% head(5)
## # A tibble: 5 x 7
##   season episode title                  rating n_ratings director writers 
##    <dbl>   <dbl> <chr>                   <dbl>     <dbl> <list>   <list>  
## 1     4.     21. The One with the Invi…   7.40     2154. <chr [1… <chr [1…
## 2     6.     20. The One with Mac and …   7.60     1982. <chr [1… <chr [1…
## 3     7.     21. The One with the Vows    7.60     1837. <chr [1… <chr [1…
## 4     8.     19. The One with Joey's I…   7.60     1791. <chr [1… <chr [1…
## 5     9.     10. The One with Christma…   7.70     1814. <chr [1… <chr [1…

They’re all clip shows lmao. Ratings are accurate then.

Are the directors any good?

I’m trying to write a good Friends script, so who should I get to direct it? Here, we just have to unnest the director list-col (one episode has two directors, for some reason), and then group and summarize. I’ve already picked out a few directors to highlight in the plot.
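
To see what unnest does to a list-col, here is a small sketch on made-up data (the episode and director names are placeholders, not real credits). An episode with two directors becomes two rows, so it counts once for each of them.

```r
library(dplyr)
library(tidyr)

# Made-up episodes; "Episode B" has two directors in its list-col
toy <- tibble(
  title    = c("Episode A", "Episode B", "Episode C"),
  rating   = c(8.0, 9.0, 8.5),
  director = list("Director X", c("Director X", "Director Y"), "Director Y")
)

# One row per (episode, director) pair
toy_long <- toy %>% unnest(director)

# Episodes directed and average rating per director
toy_dirs <- toy_long %>%
  group_by(director) %>%
  summarize(n = n(), rating = mean(rating))

toy_dirs
```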

# Get a dataset of directors
dirs <- titles %>% 
  unnest(director) %>%
  group_by(director) %>%
  summarize(n = n(), rating = mean(rating))

# A list of interesting people
notable_dirs <- c("Joe Regalbuto", 
                  "Peter Bonerz",
                  "David Schwimmer",
                  "Kevin Bright", 
                  "Gary Halvorson")

# scatterplot of rating vs. number of episodes directed
ggplot(dirs, aes(x = n, y = rating, colour = director %in% notable_dirs)) + 
  geom_point() + 
  geom_text_repel(aes(label = director)) +
  labs(title = "Directors: Avg. rating vs. No. of episodes",
       x = "No. of eps", y = 'Avg. rating') + 
  theme(legend.position = "none") + 
  scale_colour_manual(values=c("black", "#ce0411"))

There aren’t many that stick out—we’d want Kevin Bright (of Bright/Kaufman/Crane), who has one of the highest average ratings and directed over 50 eps. Gary Halvorson’s done a lot, but he’s only an average Friends director.

Also Peter Bonerz lolol. On the other hand: David Schwimmer directed like 10 episodes! I assume he directed all the ones about Ross:

titles %>% 
  unnest(director) %>% 
  filter(director == "David Schwimmer")
## # A tibble: 10 x 6
##    season episode title                      rating n_ratings director    
##     <dbl>   <dbl> <chr>                       <dbl>     <dbl> <chr>       
##  1     6.      6. The One on the Last Night    8.60     1893. David Schwi…
##  2     7.      4. The One with Rachel's Ass…   8.20     1843. David Schwi…
##  3     7.      7. The One with Ross's Libra…   8.60     1877. David Schwi…
##  4     7.      9. The One with All the Cand…   8.30     1794. David Schwi…
##  5     7.     16. The One with the Truth Ab…   8.70     1908. David Schwi…
##  6     8.      2. The One with the Red Swea…   9.10     2252. David Schwi…
##  7     8.      8. The One with the Stripper    8.90     2009. David Schwi…
##  8     8.     12. The One Where Joey Dates …   8.60     1860. David Schwi…
##  9     9.      5. The One with Phoebe's Bir…   8.50     1792. David Schwi…
## 10    10.      9. The One with the Birth Mo…   8.60     1814. David Schwi…

Nope. I guess he did ok, mainly focusing on Seasons 7 and 8 and coming out with an average rating of 8.61, slightly higher than the series average of 8.54. Well. Fine. Ross still sucks.
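
A quick check of those numbers, using the season and rating columns copied from the ten rows printed above (schwimmer is just a local name for this sketch):

```r
# Season and rating for the ten Schwimmer-directed episodes listed above
schwimmer <- data.frame(
  season = c(6, 7, 7, 7, 7, 8, 8, 8, 9, 10),
  rating = c(8.6, 8.2, 8.6, 8.3, 8.7, 9.1, 8.9, 8.6, 8.5, 8.6)
)

# Average rating across his episodes
avg_rating <- round(mean(schwimmer$rating), 2)
avg_rating  # 8.61

# Number of episodes he directed in each season
eps_per_season <- table(schwimmer$season)
eps_per_season
```

Seven of the ten episodes fall in Seasons 7 and 8.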

Title character breakdown

I bet if I write an episode titled “The One Where Rachel [does X]” it’ll be an automatic classic. Let’s check. First, use tidytext::unnest_tokens to split the titles into words, and then take out common filler words (‘The’, ‘One’, ‘a’, etc.) with the quick and helpful anti_join(stop_words).

title_words <- titles %>% 
  select(-director, -writers) %>% 
  unnest_tokens(word, title) %>% 
  anti_join(stop_words)
## Joining, by = "word"
title_words
## # A tibble: 431 x 5
##    season episode rating n_ratings word          
##     <dbl>   <dbl>  <dbl>     <dbl> <chr>         
##  1     1.      1.   8.50     4317. monica        
##  2     1.      1.   8.50     4317. roommate      
##  3     1.      2.   8.20     3107. sonogram      
##  4     1.      3.   8.30     2900. thumb         
##  5     1.      4.   8.30     2810. george        
##  6     1.      4.   8.30     2810. stephanopoulos
##  7     1.      5.   8.60     2768. east          
##  8     1.      5.   8.60     2768. german        
##  9     1.      5.   8.60     2768. laundry       
## 10     1.      5.   8.60     2768. detergent     
## # ... with 421 more rows
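
Here is the same tokenizing step as a minimal sketch on three titles. Two are from the tables above, and the third is made up to show how a possessive comes through. The anchored pattern "'s$" is my variation for this sketch, not the exact pattern used in the post.

```r
library(dplyr)
library(tidytext)

toy_titles <- tibble(
  rating = c(7.6, 8.9, 8.0),
  title  = c("The One with the Vows",
             "The One with the Stripper",
             "The One Where Ross's Code Compiles")  # made-up title
)

toy_words <- toy_titles %>%
  unnest_tokens(word, title) %>%          # one lowercased word per row
  anti_join(stop_words, by = "word") %>%  # drop filler such as "the"
  mutate(word = sub("'s$", "", word))     # strip a trailing possessive

toy_words
```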

Now we’ll just count the number of times each word occurs in a title, and calculate the average rating for each word:

title_words %>% 
  filter(! word %in% c('1', '2')) %>% 
  mutate(word = gsub("'s", "", word)) %>% # to make "Ross" = "Ross's"
  group_by(word) %>% 
  summarize(n = n(), rating = rating %>% mean() %>% round(2)) %>%
  arrange(-n) %>% head(6)
## # A tibble: 6 x 3
##   word         n rating
##   <chr>    <int>  <dbl>
## 1 rachel      28   8.50
## 2 ross        24   8.68
## 3 joey        16   8.43
## 4 chandler    11   8.66
## 5 phoebe      10   8.35
## 6 monica       9   8.52

The characters are the most frequent title words, and ‘Rachel’ (surprise, surprise) is No. 1. But episodes with ‘Ross’ in the title are actually rated slightly better!

That gives us some insight into what explains ratings: some directors are good, with a lot of noise. Some words are popular, but with a lot of noise. And don’t write a f*****g clip show. Next up, let’s use the titles and statistics to explain ratings!

Appendix: rvest

The toughest part of learning rvest is the lack of transparency in the returned objects. I find it difficult to navigate a page with rvest through trial-and-error because it’s hard to see inside the xml_document or xml_nodeset objects—but maybe there’s some xml stuff I don’t understand yet. Anyway, here’s what I found.
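
A few calls make these objects less opaque. The sketch below runs on a tiny made-up page rather than IMDB, and only shows ways to peek inside. xml_structure comes from xml2, the package rvest is built on.

```r
library(rvest)
library(xml2)

# A tiny made-up page, parsed from a literal string
page <- read_html('<html><body>
  <div><h1 itemprop="name">Toy Episode</h1></div>
  <span itemprop="ratingValue">8.5</span>
</body></html>')

nodes <- html_nodes(page, "span")

class(nodes)               # what kind of object came back
length(nodes)              # how many nodes matched the selector
markup <- as.character(nodes)
markup                     # the raw markup of each matched node
html_attrs(nodes)          # the attributes of each matched node
xml_structure(page)        # prints an indented outline of the document
```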

First, explore the IMDB html / source. The Friends url is easy to find, and then we need a way to go from each season’s page (that lists the episodes) to each individual episode page (to get the data). Suppose we have the season url and one episode url; let’s write a function to get that episode’s data.

We need html_session to start a session and jump between links, then read_html (good name) to read the page. SelectorGadget is a useful Chrome extension for finding CSS paths that select the objects you like, but I found it easier to just right-click + inspect element and find attributes that identify the things I want (usually class and id, though itemprop turns out to be super useful on these IMDB pages). Note that rvest 1.0.0 renamed the session functions: html_session is now session, jump_to is session_jump_to, and follow_link is session_follow_link.

Here, I’m looking for the episode title: inspect element suggests div h1[itemprop="name"] would select it (which I need to verify for this one, and then hope it works for the rest 🙏). I also want the rating (span[itemprop="ratingValue"]) and the rating count (span[itemprop="ratingCount"]).
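
Here is how those three selectors behave on a small made-up page. The HTML is a stand-in that copies the itemprop attributes described above, and the real IMDB markup may differ or change.

```r
library(rvest)

# Made-up page using the same itemprop attributes
page <- read_html('<html><body>
  <div><h1 itemprop="name">  Toy Episode  </h1></div>
  <span itemprop="ratingValue">8.5</span>
  <span itemprop="ratingCount">4,317</span>
</body></html>')

title <- page %>%
  html_nodes('div h1[itemprop="name"]') %>%
  html_text() %>% trimws()

rating <- page %>%
  html_nodes('span[itemprop="ratingValue"]') %>%
  html_text() %>% as.numeric()

# Strip the thousands separator before converting to a number
n_ratings <- page %>%
  html_nodes('span[itemprop="ratingCount"]') %>%
  html_text() %>% gsub(pattern = ",", replacement = "") %>% as.numeric()
```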

After that, we get lists of directors and writers from the cast table, and filter accordingly. Sometimes the writers are listed with credits “written by”, sometimes “writer”, or “teleplay by” or “story by” or whatever else they like to say.
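
The credit filter is just a grepl over the credit column. Here is a sketch on a made-up credits table. The column names X1 to X3 are an assumption about what html_table returns for a table with no header row, and the people are placeholders.

```r
# Made-up credits table: name in X1, credit text in X3
credits <- data.frame(
  X1 = c("Writer A", "Writer B", "Writer C", "Person D"),
  X2 = c("...", "...", "...", "..."),
  X3 = c("(written by)", "(teleplay by)", "(story by)", "(created by)"),
  stringsAsFactors = FALSE
)

# Keep anyone whose credit mentions one of the writing roles
is_writer <- grepl("writer|written|teleplay|story", credits$X3)
writers <- credits$X1[is_writer]
writers
```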

Finally, we stick them all in a tibble and return it. The writers and directors are returned in list-cols, so every episode gets one row.

get_ep_data <- function(ep_url, url) {
  # given a season url (used to start the session), 
  # and an episode url (an href taken from that season page), 
  # follow the link to the episode and download the episode data
  print(paste0("S:", stringr::str_extract(url, "(\\d*)$"),
               " E:", stringr::str_extract(ep_url, "(\\d*)$")))
  print(ep_url)
  
  # start a session and jump to the episode url
  session <- html_session(url) %>% jump_to(ep_url)
  
  # then read the episode page.
  page <- session %>% read_html()
  
  title <- page %>% 
    html_nodes('div h1[itemprop="name"]') %>% 
    html_text() %>% trimws()
  
  rating <- page %>% 
    html_nodes('span[itemprop="ratingValue"]') %>% 
    html_text() %>% as.numeric()
  
  n_ratings <- page %>% 
    html_nodes('span[itemprop="ratingCount"]') %>% 
    html_text() %>% gsub(pattern = ",", replacement = "") %>% as.numeric()
  
  # get writer / director info.
  cast_table <- session %>% follow_link(i = "See full cast & crew") %>%
    html_nodes(".simpleCreditsTable") %>% 
    html_table()
  
  # the director is first in the list of tables.
  director <- cast_table %>% .[[1]] %>% .[,1] 
  
  # nope! sometimes it says "teleplay by" and "story by". use those too.
  # I think the writers are in the second table in the list of tables?
  writers <- cast_table %>% .[[2]] %>% 
    filter(grepl("writer|written|teleplay|story", X3)) %>%  .[,1]
  
  tibble(
    season = stringr::str_extract(url, "(\\d*)$") %>% as.numeric(),
    episode = stringr::str_extract(ep_url, "(\\d*)$") %>% as.numeric(), 
    title = title, 
    rating = rating, 
    n_ratings = n_ratings, 
    director = lst(director), 
    writers = lst(writers)
  )
}
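
The season and episode numbers both come from the digits at the end of a url. Here is a small sketch of that step. The episode url is a hypothetical shape rather than a real IMDB link, and I use "\\d+$" here (one or more trailing digits) instead of the "(\\d*)$" in the function above.

```r
library(stringr)

season_url <- "http://www.imdb.com/title/tt0108778/episodes?season=3"
ep_url     <- "/title/tt0000000/?ref_=ttep_ep12"  # hypothetical episode link

# Pull the run of digits at the very end of each url
season  <- as.numeric(str_extract(season_url, "\\d+$"))
episode <- as.numeric(str_extract(ep_url, "\\d+$"))

season   # 3
episode  # 12
```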

Once we have that episode data function, we can write a function to return season data (although you know I wrote this first just to download the season url and look for the episode titles).

First, read_html the url. The episode links can be found (via a quick inspect element) with the CSS selector strong a[itemprop="name"]; then ask for the href/link attribute via html_attr.
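
Here is that link extraction on a made-up season page. The hrefs are placeholders. Newer rvest versions also offer html_elements as the preferred name for html_nodes.

```r
library(rvest)

# Made-up season page: two episode links and one unrelated link
season_page <- read_html('<html><body>
  <strong><a itemprop="name" href="/title/tt0000001/">Toy Episode 1</a></strong>
  <strong><a itemprop="name" href="/title/tt0000002/">Toy Episode 2</a></strong>
  <a href="/somewhere/else">Not an episode</a>
</body></html>')

# Select only the episode links, then take their href attribute
ep_list <- season_page %>%
  html_nodes(css = 'strong a[itemprop="name"]') %>%
  html_attr("href")

ep_list
```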

This is my first experiment with purrr::possibly—since html sessions can do weird stuff (with this very detailed bug report: bad stuff happens sometimes), if getting the episode data fails for some reason, just forget about that episode for now.
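
Here is a minimal sketch of how possibly behaves, using a made-up function (flaky_get) that fails on one input. The failed call returns NULL instead of stopping the whole map, and I drop the NULLs explicitly with compact before binding the rows.

```r
library(purrr)
library(dplyr)

# Made-up downloader that fails for episode 2
flaky_get <- function(ep) {
  if (ep == 2) stop("bad stuff happened")
  tibble(episode = ep, rating = 8 + ep / 10)
}

# Wrap it so that an error returns NULL instead of halting
safe_get <- possibly(flaky_get, otherwise = NULL)

results <- map(1:3, safe_get)

# Drop the NULL from the failed call and stack the rest
eps <- results %>% compact() %>% bind_rows()
eps
```

The trade-off is that failures are silent, so it is worth comparing the number of rows returned against the number of links requested.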

get_season_data <- function(url) {
  # given a season url, download episode data for every episode in that season
  
  html <- read_html(url)
  
  # get links to all episodes.
  ep_list <- html %>% 
    html_nodes(css = 'strong a[itemprop="name"]') %>%
    html_attr('href')
  
  # for each episode in the list, safely get the data
  safe_get_ep_data <- possibly(get_ep_data, otherwise = NULL)
  
  # return a dataframe of all episode data
  ep_list %>% map(safe_get_ep_data, url = url) %>% bind_rows() 
}