Friends Title Generator, Part 2: parts of speech

This post includes an R script to generate Friends episode titles, this time focusing on parts of speech.

We’re back on the Friends script grind.

library(tidyverse) # dplyr, tidyr, stringr, readr, tibble
library(tidytext)  # unnest_tokens()

titles <- werfriends::friends_episodes %>% select(-director, -writers)
titles
## # A tibble: 236 x 5
##    season episode title                                   rating n_ratings
##     <dbl>   <dbl> <chr>                                    <dbl>     <dbl>
##  1     1.      1. The One Where Monica Gets a Roommate      8.50     4317.
##  2     1.      2. The One with the Sonogram at the End      8.20     3107.
##  3     1.      3. The One with the Thumb                    8.30     2900.
##  4     1.      4. The One with George Stephanopoulos        8.30     2810.
##  5     1.      5. The One with the East German Laundry D…   8.60     2768.
##  6     1.      6. The One with the Butt                     8.30     2695.
##  7     1.      7. The One with the Blackout                 9.00     3516.
##  8     1.      8. The One Where Nana Dies Twice             8.20     2594.
##  9     1.      9. The One Where Underdog Gets Away          8.30     2516.
## 10     1.     10. The One with the Monkey                   8.20     2544.
## # ... with 226 more rows

Approach: instead of calculating word-to-word transition probabilities directly from the titles, we're going to use the same transition-probability approach on parts of speech, generating sentence structures that look like "The One Where [Noun] [Verb]". Then, I'll randomly select a noun from all the nouns in the set of titles, and randomly select a verb from all the verbs. That'll give me "The One Where Ross Dies". Perfect.

title_words <- titles %>% 
  unnest_tokens(word, title) %>% # convert titles to words
  group_by(season, episode) %>% 
  mutate(lineno = row_number()) %>% 
  group_by(season, episode, lineno) %>%
  # e.g., split "ross's" into list("ross", "'s"):
  mutate(x = ifelse(grepl("'", word), 
                    str_match(word, "([a-z]*)('s)") %>% .[,-1] %>% lst(),
                    word %>% lst())) %>% 
  unnest() %>% # unnest those lists
  ungroup() %>% 
  # if "ross's", save new rows as "ross" and "'s":
  mutate(word = coalesce(x, word)) %>% 
  select(-x)
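
As a quick sanity check (output omitted here), we can spot-check that the possessive split worked, e.g. that "ross's" now shows up as adjacent "ross" and "'s" rows:

# spot-check: episodes with a possessive should now have "'s" on its own row
title_words %>% 
  group_by(season, episode) %>% 
  filter(any(word == "'s")) %>% 
  head(10)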

Get parts of speech data

tidytext comes with the Moby parts of speech dataset. For some reason, it says "a" and "an" are definite articles, not indefinite articles. It also lists several possible parts of speech for each word (e.g., "like" shows up as 6 different parts of speech).
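
For instance, here are the built-in dataset's entries for "like" (namespaced, to avoid clashing with the parts_of_speech table we load next):

# "like" maps to several different parts of speech in the Moby dataset
tidytext::parts_of_speech %>% filter(word == "like")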

Instead, I found a different parts of speech dataset from Ghent University, which I think is included in the English Lexicon Project. The SUBTLEX-US dataset is based on parts of speech tagging (via the CLAWS4 tagger) of subtitles from US films and TV shows. I'm not sure about the licensing, but the ELP specifically says it's for non-commercial use only. So don't make any money off this post plz.

# readr unzips stuff! tyvm tidytext documentation.
parts_of_speech <- read_tsv("SUBTLEX-US_frequency_list_with_PoS_information_final_text_version.zip")
## Parsed with column specification:
## cols(
##   Word = col_character(),
##   FREQcount = col_integer(),
##   CDcount = col_integer(),
##   FREQlow = col_integer(),
##   Cdlow = col_integer(),
##   SUBTLWF = col_double(),
##   Lg10WF = col_double(),
##   SUBTLCD = col_double(),
##   Lg10CD = col_double(),
##   Dom_PoS_SUBTLEX = col_character(),
##   Freq_dom_PoS_SUBTLEX = col_character(),
##   Percentage_dom_PoS = col_character(),
##   All_PoS_SUBTLEX = col_character(),
##   All_freqs_SUBTLEX = col_character()
## )
parts_of_speech
## # A tibble: 74,286 x 14
##    Word    FREQcount CDcount FREQlow Cdlow   SUBTLWF Lg10WF SUBTLCD Lg10CD
##    <chr>       <int>   <int>   <int> <int>     <dbl>  <dbl>   <dbl>  <dbl>
##  1 a         1041179    8382  976941  8380   2.04e⁺⁴  6.02  99.9     3.92 
##  2 aa             87      70       6     5   1.71e⁺⁰  1.94   0.830   1.85 
##  3 aaa            25      23       5     3   4.90e⁻¹  1.42   0.270   1.38 
##  4 aah          2688     634      52    37   5.27e⁺¹  3.43   7.56    2.80 
##  5 aahed           1       1       1     1   2.00e⁻²  0.301  0.0100  0.301
##  6 aahing          2       2       2     2   4.00e⁻²  0.477  0.0200  0.477
##  7 aahs            5       4       5     4   1.00e⁻¹  0.778  0.0500  0.699
##  8 aal             1       1       1     1   2.00e⁻²  0.301  0.0100  0.301
##  9 aardva…        21      12      14     8   4.10e⁻¹  1.34   0.140   1.11 
## 10 aargh          33      26       2     1   6.50e⁻¹  1.53   0.310   1.43 
## # ... with 74,276 more rows, and 5 more variables: Dom_PoS_SUBTLEX <chr>,
## #   Freq_dom_PoS_SUBTLEX <chr>, Percentage_dom_PoS <chr>,
## #   All_PoS_SUBTLEX <chr>, All_freqs_SUBTLEX <chr>

Apply parts of speech data, and find pos->pos transition probabilities

# join parts_of_speech data to our title_words
titles_pos <- title_words %>% 
  left_join(parts_of_speech %>% select(word = Word, pos = Dom_PoS_SUBTLEX)) %>%
  mutate(pos = case_when(
    grepl("'", word) ~ word, # if the word contains an apostrophe ("'s"), use it as its own pos
    is.na(pos) ~ "Noun",     # if it didn't match anything, call it a "Noun"
    TRUE ~ pos               # otherwise use what parts_of_speech says
  ))
## Joining, by = "word"
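
Since the case_when() above falls back to "Noun" whenever the join misses, no word should be left without a pos. A quick check (this should print 0):

# count title words that still lack a part of speech
titles_pos %>% filter(is.na(pos)) %>% nrow()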

What’s the most frequent sentence structure?

# what's the most frequent title structure?
titles_pos %>% 
  group_by(season, episode) %>%
  # keep the first three words ("the one with", "the one where", ...) literal:
  mutate(pos = ifelse(lineno <= 3, word, pos)) %>% 
  summarize(title = paste0(word, collapse = " "),
            pos = paste0(pos, collapse = " ")) %>% 
  group_by(pos) %>% count(sort = TRUE)
## # A tibble: 101 x 2
## # Groups:   pos [101]
##    pos                                      n
##    <chr>                                <int>
##  1 the one with Article Noun               36
##  2 the one with Article Noun Noun          19
##  3 the one with Name 's Noun               18
##  4 the one with Article Adjective Noun     11
##  5 the one with Determiner Article Noun     8
##  6 the one with Noun 's Noun                6
##  7 the one where Noun Verb                  5
##  8 the one with Article Noun Noun Noun      5
##  9 the one with Name 's Adjective Noun      5
## 10 the one in Name Noun Noun                4
## # ... with 91 more rows

Now that we have an idea of typical sentence/title structures, all we have to do is calculate the probability that one part of speech follows another, randomly generate a new title structure, and then randomly sample a word for each part of speech in that structure.

# find the transition probabilities from parts_of_speech to other
# parts_of_speech based on the set of titles we have
pos_transitions <- titles_pos %>% 
  group_by(season, episode) %>%
  filter(lineno >= 2) %>% # drop "the", so every chain starts at "one"
  # and keep "one", "where", "with", "after", etc. literal:
  mutate(pos = ifelse(lineno <= 3, word, pos), 
         nxt = lead(pos), 
         nxt = ifelse(is.na(nxt), "EOL", nxt)) %>% # and add the "EOL" character
  group_by(pos, nxt) %>% 
  count() %>% 
  group_by(pos) %>% 
   # calculate the frequency of transitions from `pos` to `nxt` for each `pos`
  mutate(weight = n / sum(n))
pos_transitions
## # A tibble: 119 x 4
## # Groups:   pos [31]
##    pos       nxt           n weight
##    <chr>     <chr>     <int>  <dbl>
##  1 's        Adjective    11 0.216 
##  2 's        Article       1 0.0196
##  3 's        Noun         36 0.706 
##  4 's        Verb          3 0.0588
##  5 Adjective Adjective     1 0.0286
##  6 Adjective EOL           5 0.143 
##  7 Adjective Letter        1 0.0286
##  8 Adjective Name          1 0.0286
##  9 Adjective Noun         22 0.629 
## 10 Adjective Verb          5 0.143 
## # ... with 109 more rows

This gives a tidy dataset of transition probabilities from one part of speech to another.
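
And since weight is n / sum(n) within each pos, the outgoing weights for any given pos should sum to 1. A quick check (this should return zero rows):

# any pos whose outgoing transition weights don't sum to 1?
pos_transitions %>% 
  summarize(total = sum(weight)) %>% 
  filter(abs(total - 1) > 1e-8)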

Final step:

Calculate the frequency of each word within each part of speech. E.g., how often is "rachel" used, relative to all the nouns? We'll use these frequencies to fill in the parts of speech in a generated sentence structure.

# What's the frequency of each word within its part of speech?
word_pos_freq <- titles_pos %>% 
  count(word, pos) %>% 
  group_by(pos) %>% 
  mutate(weight = n / sum(n))
word_pos_freq %>% arrange(-n)
## # A tibble: 340 x 4
## # Groups:   pos [19]
##    word   pos             n weight
##    <chr>  <chr>       <int>  <dbl>
##  1 the    Article       350 0.978 
##  2 one    Number        239 0.960 
##  3 with   Preposition   162 0.871 
##  4 where  Adverb         54 0.783 
##  5 's     's             51 1.00  
##  6 rachel Noun           28 0.0940
##  7 ross   Name           24 0.270 
##  8 part   Noun           20 0.0671
##  9 joey   Name           16 0.180 
## 10 all    Determiner     12 0.800 
## # ... with 330 more rows
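
To peek at a single part of speech, say which Names we might draw and how likely each one is (output not shown):

# candidate Names, most likely first
word_pos_freq %>% 
  filter(pos == "Name") %>% 
  arrange(-weight)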

With that, all we have to do is write two functions: one that generates a title structure by walking the pos-to-pos transition probabilities,

# generate sentence structure
generate_structure <- function() {
  # initial title; everything starts with "the one" (except "the last one"s).
  new_title <- c("the", "one")
  
  # while the last word we put in wasn't the end of the title.
  while(new_title[length(new_title)] != "EOL") {
    # add a random word to the current title
    new_title <- c(new_title, 
                   pos_transitions %>% 
                     filter(pos == new_title[length(new_title)]) %>% 
                     sample_n(size = 1, weight = weight) %>% 
                     pull(nxt))
  }
  new_title # return the list of the new title's parts of speech.
}


and one that fills that structure in with words, sampled by their within-pos frequencies:

generate_title <- function() {
  # generate new sentence structure.
  new_title <- generate_structure()

  # use the sentence structure, merge on word-pos frequencies to 
  # sample from.
  new_title %>% enframe() %>% 
    left_join(word_pos_freq, by = c("value" = "pos")) %>% 
    mutate(weight = ifelse(is.na(weight), 1, weight)) %>% 
    group_by(name) %>% 
    # sample a word for each pos
    sample_n(size = 1, weight = weight) %>% 
    mutate(word = ifelse(is.na(word), value, word)) %>% 
    # drop the end
    filter(word != "EOL") %>% 
    ungroup() %>% 
    # then collapse the title back together
    summarize(title = paste(word, collapse = " ")) %>% 
    pull(title) %>% 
    gsub(" 's ", "'s ", .) %>%  # stick the possesive back on its noun
    str_to_title() %>%
    gsub("Pbs", "PBS", .) %>% 
    gsub("C.h.e.e.s.e", "C.H.E.E.S.E.", .)# in case we picked PBS.
}

Now make a pos title, R!

generate_title() 
## [1] "The One In Ross And The Factor"

Dang, I hope these are good. But they’re generated when Netlify builds the site so 🤷‍♀️.

And five more, for fun:

replicate(5, generate_title())
## [1] "The One With The New"         "The One With The Stripper"   
## [3] "The One Where Ross And Class" "The One With The Class"      
## [5] "The One With The Night"
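
These change on every run (and every site build); if you want reproducible titles, seed the RNG first. A minimal example (the seed value is arbitrary):

set.seed(1994) # arbitrary, though 1994 is when Friends premiered
replicate(5, generate_title())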

And one randomly generated header title:

The One With That Champion