I don’t design to answer questions, I design to ask them
There is an incredible range of data visualizations. They’re great, really. Sometimes they are very pretty pictures. Which is great, unless the goal wasn’t to draw a pretty picture.
But I find the most memorable (beyond “wow cool, next”) data visualizations take the Scher approach: the readers don’t find answers, they find questions. The design makes them go to the data themselves, or start looking up other articles or explanations or Wikipedia to figure out what’s going on (sometimes on a completely different topic).
How to: give an answer, and then give a question
My question: how do Friends ratings vary over time? I want to show that data, but I also want the reader to be able to explore other, related things they might be interested in, even in a static plot. There’s no telling what a reader will come away with. Trust them.
The Friends IMDB data were harvested from IMDB (see this post). A quick look showed there were some serious outliers; the positive ones were the series finale (a two-parter) and other random ones I liked (e.g., The One With The Embryos!). But the eps with extraordinarily low ratings weren’t ones I particularly remembered, and they seemed to show up around the 20th episode in the season. Why? It turns out they were f**king clip shows. I want to call those out in the plot. (The annotations also make it clear that each point is an episode.)
The seasons are coloured (default ggplot scale, although you may want to try others to handle accessibility issues) because it was difficult to distinguish between the season changes. With this, you can tell (in most seasons) that the season finale is one of the most popular. The `geom_smooth` with standard errors gives you an idea of where the outliers are in each season.
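Here is a rough sketch of how a plot like that can be put together. Everything in the data frame is made up for illustration: the column names (`season`, `episode`, `ep_index`, `rating`, `clip_show`) are assumptions and the values are not the scraped IMDB ratings. The loess smoother is also an assumption, not necessarily the method behind the original figure.

```r
library(ggplot2)

# Made-up stand-in data: NOT the real IMDB ratings. Column names are assumptions.
ratings <- data.frame(
  season  = factor(rep(1:3, each = 24)),
  episode = rep(1:24, times = 3)
)
ratings$ep_index  <- seq_len(nrow(ratings))      # running episode number
ratings$rating    <- round(8.5 + 0.3 * sin(ratings$ep_index / 3), 2)
ratings$clip_show <- ratings$episode == 20       # hypothetical hand-made flag
ratings$rating[ratings$clip_show] <- 7.2         # fake a clip-show dip

clips <- subset(ratings, clip_show)

p <- ggplot(ratings, aes(x = ep_index, y = rating, colour = season)) +
  geom_point() +
  # colour = season splits the data, so each season gets its own smoother;
  # se = TRUE draws the standard-error ribbon around it
  geom_smooth(method = "loess", formula = y ~ x, se = TRUE) +
  # label only the flagged episodes so they read as call-outs
  geom_text(data = clips, aes(label = paste0("S", season, "E", episode)),
            vjust = 1.8, size = 3, show.legend = FALSE) +
  labs(x = "Episode (series order)", y = "Rating", colour = "Season")

print(p)
```

Points that sit well outside a season's ribbon are the ones worth annotating.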
Questions I asked myself after doing this
Re: clip shows:
- Is there a reason clip shows are put at that stage in the season?
- Are they always at a date when no one’s watching TV, so they don’t want to waste actor/writer/director time on them?
- What are other shows doing at this time?
- What are the characteristics of clip shows like in other series?
- Why even bother, why not just have one less episode in that season (I mean, that Q answers itself: 🤑 🤑 🤑).
Re: IMDB ratings
- Is there seasonality in the episode ratings?
- How do they compare with Nielsen ratings or viewership?
- Would Nielsen ratings (available from Wikipedia!) answer the clip show questions above?
E.g., The One With Mac and C.H.E.E.S.E. had a 10% drop in viewership from the eps before and after. What about other shows that week? It was April 13, 2000, what else happened? Did they pick that week to run it bc they knew no one would watch anything anyway?
I have so many more questions about the machinery around TV shows, scripts and schedules. I hope this plot leaves you with questions too. Even if they’re about things totally unrelated to Friends IMDB ratings.
A few other options from the r-statistics.co gg gallery could be useful here. Note that this isn’t really time series data (it’s chronological, but there aren’t dates or days of the week), but we’ll use the “Change” section anyway.
First up was the time series line plot above. But we could also do a calendar heatmap (swapping the calendar for seasons/episodes) or a season boxplot. Each has its advantages, but neither accomplishes what I’d like for this application.
Kinda cool. But you might need to watch out for rainbow heatmaps: see a million complaints from NASA and academia (a YouTube py talk from [Kristen Thyng](https://ocean.tamu.edu/people/faculty/thyngkristenm)) and many blogs (Agile and Eager Eyes, to name a few).
So I changed the scale to go from grey (low) to red (high).
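A minimal sketch of that season-by-episode heatmap with a grey-to-red fill. The data are made up (not the real ratings), and the exact grey and red are my guesses rather than the colours in the original figure.

```r
library(ggplot2)

# Made-up stand-in data: NOT the real IMDB ratings.
ratings <- expand.grid(episode = 1:24, season = factor(1:3))
ratings$rating <- round(8.5 + 0.3 * sin(seq_len(nrow(ratings)) / 3), 2)

heat <- ggplot(ratings, aes(x = episode, y = season, fill = rating)) +
  geom_tile(colour = "white") +
  # two-colour sequential scale instead of a rainbow: low is grey, high is red
  scale_fill_gradient(low = "grey85", high = "red") +
  labs(x = "Episode", y = "Season", fill = "Rating")

print(heat)
```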
A great way to compare summary statistics across seasons is the box and whiskers plot via `geom_boxplot`. With a little help from Stack Overflow answers and `ggrepel`, it’s pretty easy to label the outliers the same way; here, I used titles instead of episode numbers.
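One way the outlier labelling could be done, sketched under assumptions: the data and titles are placeholders (not real episodes or ratings), and `is_outlier` is a hypothetical helper that mirrors the 1.5 × IQR rule `geom_boxplot` uses by default.

```r
library(ggplot2)
library(ggrepel)

# Made-up stand-in data: NOT the real IMDB ratings; titles are placeholders.
ratings <- data.frame(
  season = factor(rep(1:2, each = 6)),
  title  = paste("Placeholder title", 1:12),
  rating = rep(c(8.3, 8.4, 8.5, 8.6, 8.7, 7.0), times = 2)
)

# Hypothetical helper: the same 1.5 * IQR rule geom_boxplot() applies by default
is_outlier <- function(x) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE)
  fence <- 1.5 * (q[2] - q[1])
  x < q[1] - fence | x > q[2] + fence
}
# ave() applies the rule within each season; it returns 0/1, so convert back
ratings$outlier <- as.logical(ave(ratings$rating, ratings$season, FUN = is_outlier))

box <- ggplot(ratings, aes(x = season, y = rating)) +
  geom_boxplot() +
  # label only the outliers, nudged apart so the titles don't overlap
  geom_text_repel(data = subset(ratings, outlier), aes(label = title), size = 3) +
  labs(x = "Season", y = "Rating")

print(box)
```

Flagging outliers in the data first keeps the labels in sync with the points the boxplot draws.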
Great: a simple way to describe the median episode rating over the seasons that also gives more information for the interested reader. Here I should remember that it’s sometimes about what’s not there. I might leave a number or statistic out (by only plotting medians over time, instead of boxplots) because I know it’s not out of the ordinary, but a reader might have external information or priors that lead her to expect an outlier. If I don’t show the fact there isn’t one, she might not believe the data processing, the subsequent analysis, or the story I’m giving.
I have some other options, but they weren’t great for this data. For instance, you can use `facet_grid` for the time series plots instead:
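A minimal sketch of the faceted version, again on made-up data (not the real ratings) with assumed column names:

```r
library(ggplot2)

# Made-up stand-in data: NOT the real IMDB ratings.
ratings <- expand.grid(episode = 1:24, season = factor(1:3))
ratings$rating <- round(8.5 + 0.3 * sin(seq_len(nrow(ratings)) / 3), 2)

faceted <- ggplot(ratings, aes(x = episode, y = rating)) +
  geom_point() +
  geom_line() +
  # one panel per season, side by side in a single row with a shared y axis
  facet_grid(. ~ season) +
  labs(x = "Episode within season", y = "Rating")

print(faceted)
```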
Or we could plot all the seasons on one plot, on top of each other:
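Roughly like this, assuming the same kind of made-up data (not the real ratings); the line width and transparency are guesses:

```r
library(ggplot2)

# Made-up stand-in data: NOT the real IMDB ratings.
ratings <- expand.grid(episode = 1:24, season = factor(1:3))
ratings$rating <- round(8.5 + 0.3 * sin(seq_len(nrow(ratings)) / 3), 2)

overlay <- ggplot(ratings, aes(x = episode, y = rating,
                               colour = season, group = season)) +
  # x is the episode number within the season, so every season starts at 1
  # and the paths stack on top of each other;
  # linewidth needs ggplot2 >= 3.4 (older versions use size for lines)
  geom_line(linewidth = 1.5, alpha = 0.7) +
  labs(x = "Episode within season", y = "Rating", colour = "Season")

print(overlay)
```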
I styled them thick to get the overall impression of Season 1 relative to the other seasonal paths (it’s clearly near the bottom for any given episode), but it doesn’t have the lowest lows like the other seasons (aka no Season 1 clip show, thank god). You can also see Season 10 ending high and early at episode 18.
I’m still learning. I can’t tell what people are going to like or learn from. Across the spectrum of readers (more or less topic knowledge, more or less dataviz experience, more or less statistics knowledge), I can’t tell what appeals to each group.
Sometimes I do just want to give an answer, other times I just want to make a pretty picture, but the most satisfying plots are ones that keep me asking questions. I hope if I do more of these, I’ll get better 📈 and better 📈.