Parliament's gender problem


A look at digital copies of Canadian parliamentary debates 1994--2017, showing gender imbalance in both number of speakers and time pattern of speakers.

(Preface: I’m 🇨🇦 but idk anything about parliament and dk what a hansard was before I started this. So if you see something wrong, get the code from github and diy!)


  1. Women make up 20-25% of Parliament
  2. Share of words spoken by women increases from 15% to 28% between 1994 and 2017
  3. Increase mainly comes from doubling involvement in routine government business
  4. Nice to have Trudeau’s gender-balanced cabinet, but female MPs are also becoming more involved in day-to-day operations



When it comes to equality, Canada likes to talk a big game. What really happens in Parliament? Of course Trudeau likes to play up his female cabinet (extra props go to Chrystia Freeland for not capitulating to that weird Belgian area that was holding up the Canada-EU trade agreement, and I hope she can handle you-know-who and NAFTA too, although she switched from trade to foreign affairs).

But you don’t need to take anyone’s word for it; we have data on gender and parliament. What does it say? On one hand, we just set a record for female MPs in the 2015 election! On the other hand, that record is still only 26%. Uh. Not great. But that’s only part of the story. A gender-balanced cabinet helps (Trudeau appointed 15 female ministers out of a 30-minister cabinet). But what actually happens when parliament gets down to business?

The data:

Michael Mulley (not a government employee, that would be too obvious) gathered all the parliament data and made a website! You can go to the site or the project’s github page to see what’s up. There’s an API to access the data, but I had no idea how parliament works or what I was looking for, so I downloaded a PostgreSQL copy of the database to go through on my own. It’s about 4GB.

Set up postgres and the database

To get the data into R, you need to

  1. set up a local postgres database,
  2. copy the tables to the database,
  3. connect to the database from R,
  4. and read the table you’d like.

First, I didn’t have postgres. So google like mad and get super frustrated, then finally settle on this strat:

$ brew install postgresql
$ createdb -T template0 openparl
$ psql openparl < openparliament.public.sql

The first one installs postgresql; the next creates the database openparl that I’ll put the data in, and the last one loads the data I downloaded from the website into the database! Done and done. Now open psql and try to look at what’s inside:

$ psql postgres
postgres-# \list
   Name    |    Owner     | Encoding |   Collate   |    Ctype    |       Access privileges
 openparl  | jessetweedle | UTF8     | en_US.UTF-8 | en_US.UTF-8 |
 postgres  | jessetweedle | UTF8     | en_US.UTF-8 | en_US.UTF-8 |
 template0 | jessetweedle | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/jessetweedle              +
           |              |          |             |             | jessetweedle=CTc/jessetweedle
 template1 | jessetweedle | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/jessetweedle              +
           |              |          |             |             | jessetweedle=CTc/jessetweedle
(4 rows)

It’s there! It worked ffs! Next, connect to the database and list contents (I’ll show only a few results)

postgres-# \connect openparl
openparl-# \dt
                            List of relations
 Schema |                   Name                   | Type  |    Owner
 public | accounts_logintoken                      | table | jessetweedle
 public | accounts_user                            | table | jessetweedle
 public | activity_activity                        | table | jessetweedle
 public | alerts_politicianalert                   | table | jessetweedle
...(lots more)...
(64 rows)

It turns out the two tables we’re looking for are hansards_statement (the parliament transcripts) and core_politician (for politician gender data). Now all we have to do is leave that running and connect to the database from R! (Aside: the R tutorials can be super frustrating, because they assume you’ve got the database part worked out already. No! That’s the only part I need!)


A hansard is a non-verbatim transcript of parliamentary proceedings. Aside: Canada is bilingual, so MPs can speak in either official language, and the Hansard will record which language is spoken, and later translate from one to the other for the official record. Which led to this gem:

…during a Liberal filibuster in the Canadian Senate, Senator Philippe Gigantès was accused of reading one of his books only so that he could get the translation for free through the Hansard. (Wikipedia).

So, we have a record of who speaks at what time during every session of parliament. Let’s take a look at the hansards_statement table in the open parliament database.

Import data into R and explore

Now that we’re back home in R, let’s do our regular warm-up exercises:

library(RPostgreSQL)  # to access database
library(tidyverse)    # to tidy things
library(viridis)      # bc I like the colours
library(lubridate)    # to deal with times; also needs the `hms` package

Now, let’s get the data:

con <- dbConnect(dbDriver("PostgreSQL"),
                 dbname = "openparl",
                 host = "localhost",
                 user = "jessetweedle", password = "")

This is mainly ⌘+C/⌘+V from an r-bloggers post by someone named David Zimmerman. Thanks David.

I had to write "openparl" as the dbname because that’s the name of the new local database I created to store the tables; host is local, and user is the name automatically given to me when I set up psql, and the default password is empty.

Now let’s check out the tables; you can modify this code to get all the tables in this database, but I’m focusing on ones with "hansard" in them (because I’ve already written all the code and I know what I need!):

tbl_query <- "SELECT *
              FROM pg_tables
              WHERE schemaname='public';"

dbGetQuery(con, tbl_query) %>%
  as_tibble() %>%
  select(schemaname, tablename, tableowner) %>%
  filter(grepl("hansard", x = tablename))
# A tibble: 5 x 3
#   schemaname tablename                                tableowner
#   <chr>      <chr>                                    <chr>
# 1 public     hansards_document                        jessetweedle
# 2 public     hansards_statement                       jessetweedle
# 3 public     hansards_statement_bills                 jessetweedle
# 4 public     hansards_statement_mentioned_politicians jessetweedle
# 5 public     hansards_oldsequencemapping              jessetweedle

Now just check out the ones we want, and save them to a tibble. First, let’s have a look at hansards_statement:

test_query <-  "SELECT *
                FROM hansards_statement
                LIMIT 20;"

df <- dbGetQuery(con, test_query) %>% as_tibble()
df %>% print(n = 5)
# A tibble: 20 x 27
#        id document_id time                h1_en  h2_en member_id who_en content_en   sequence
#  *  <int>       <int> <dttm>              <chr>  <chr>     <int> <chr>  <chr>           <int>
#  1 645328         388 2008-02-14 13:15:00 Routi… Ques…        NA ""     "<p class=\…       86
#  2 232373        1878 2001-05-03 13:50:00 Gover… Fede…      2611 Ms. M… <p>Mr. Spea…       89
#  3 170961         944 1999-05-10 16:05:00 Gover… Inco…      3066 Mr. G… "<p>Mr. Spe…      210
#  4  27684        1138 1994-11-18 10:00:00 ""     Priv…        NA The S… "<p>My coll…        0
#  5 645329         388 2008-02-14 13:15:00 Routi… Ques…      1534 Mr. T… "<p data-Ho…       87
# # ... with 15 more rows, and 18 more variables: wordcount <int>, politician_id <int>,
# #   procedural <lgl>, h3_en <chr>, who_hocid <int>, content_fr <chr>, statement_type <chr>,
# #   written_question <chr>, source_id <chr>, who_context_en <chr>, slug <chr>,
# #   urlcache <chr>, h1_fr <chr>, h2_fr <chr>, h3_fr <chr>, who_fr <chr>,
# #   who_context_fr <chr>, wordcount_en <int>

Data so cool. Thank you again Michael.

# Assumes core_politician.id is the primary key that politician_id refers to.
data_query <- "SELECT time, h1_en, h2_en, h3_en, who_en, politician_id,
                  wordcount, who_hocid, who_context_en, name, gender
          FROM hansards_statement
          LEFT JOIN core_politician
          ON hansards_statement.politician_id = core_politician.id
          ORDER BY time, sequence;"

han_df <- dbGetQuery(con, data_query) %>% as_tibble()
han_df %>% print(n = 5)
# A tibble: 2,297,140 x 11
#   time                h1_en h2_en     h3_en who_en        politician_id wordcount who_hocid
#   <dttm>              <chr> <chr>     <chr> <chr>                 <int>     <int>     <int>
# 1 1994-01-17 11:00:00 ""    ""        ""    ""                       NA       301        NA
# 2 1994-01-17 11:25:00 ""    ""        ""    The Clerk of…            NA        27        NA
# 3 1994-01-17 11:25:00 ""    Election… ""    The Presidin…            NA       134        NA
# 4 1994-01-17 11:25:00 ""    Election… ""    Mr. Nunziata           4892        40        NA
# 5 1994-01-17 11:25:00 ""    Election… ""    The Presidin…            NA       289        NA
# ... with 2.297e+06 more rows, and 3 more variables: who_context_en <chr>, name <chr>, gender <chr>

Cool, the two things I want (for now) are time and who_en.


First, an overview of the data we’re going to work with.

han_df <- read_csv("")
han_df <- han_df %>% mutate(time = with_tz(time, "America/Toronto"))
han_df %>% sample_n(5) # an easy way to look at a random sample of observations instead of just the first 10
## # A tibble: 5 x 11
##   time                h1_en h2_en h3_en who_en politician_id wordcount
##   <dttm>              <chr> <chr> <lgl> <chr>          <dbl>     <dbl>
## 1 2010-10-27 16:40:00 <NA>  <NA>  NA    Ms. J…            NA         1
## 2 2007-11-15 09:20:00 <NA>  <NA>  NA    Mr. P…           308         6
## 3 2011-11-01 12:50:00 <NA>  <NA>  NA    Mr. G…            NA        68
## 4 2013-01-30 14:50:00 Oral… Agri… NA    Ms. R…          8479        88
## 5 2014-05-01 09:30:00 <NA>  <NA>  NA    Mr. A…            NA       274
## # … with 4 more variables: who_hocid <lgl>, who_context_en <lgl>,
## #   name <chr>, gender <chr>

Ok, now we’re getting somewhere. I want to check two things: (1) what does the time distribution look like? and (2) can we get the gender of the MPs from names? We got genders from the politician database, but there are some sanity checks we’ll need. Leave that for later.

Cycles of activity

So, the time distribution:

There are cycles of activity; it picks up when session begins, drops off from 1 to 4, then jumps up from 4-5. There are obvious patterns to the activity in the Hansard that definitely correspond to the daily schedule of parliament. Keep that in mind for later. At this stage, we just want to know that the data make sense. There are some outliers left off the graph (e.g., filibustering that lasts through the night).
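For anyone following along without the plot, here is a minimal sketch of how a time-of-day distribution like this could be tabulated. It uses base R and a handful of made-up timestamps (`toy_times` is a hypothetical stand-in for the `time` column, not real Hansard data), so treat it as an illustration rather than the code behind the figure.

```r
# Made-up timestamps standing in for the `time` column (not real Hansard data).
toy_times <- as.POSIXct(
  c("2010-03-01 10:05:00", "2010-03-01 10:40:00", "2010-03-01 14:00:00",
    "2010-03-01 14:10:00", "2010-03-01 14:55:00", "2010-03-01 16:20:00"),
  tz = "America/Toronto"
)

# Pull out the hour of day (in the timestamps' own time zone)
# and count how many statements fall in each hour.
hour_of_day <- as.integer(format(toy_times, "%H"))
statements_per_hour <- table(hour_of_day)
print(statements_per_hour)
```

On the real data, the same idea applied to every statement gives the daily cycle described above.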

Gender (im)balance in Parliament: overall stats

From our exploratory analysis before, we know that there is often a gendered title associated with the name, along with the gender given by the politician database. As a first stab to sanity check and validate the data, just call names with “Mr.” (and French equivalents) Male, and names with “Mrs.” (and other English and French equivalents) Female. (This has its own problems—in a few cases, the hansard specifies that a woman is speaking on behalf of a man, and vice versa.) Putting these two things together gives us a more accurate measure of gender.
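Before running this on millions of rows, it can help to see how the title matching behaves on a few made-up speaker labels. The sketch below is base R only; `title_gender` and `toy_who` are hypothetical names for illustration, and it uses nested `ifelse` where the real code uses `case_when`. The patterns are the same ones as in the code that follows.

```r
# Hypothetical helper: guess gender from the title in a speaker label.
# The male pattern is checked first, as in the case_when version.
title_gender <- function(who) {
  ifelse(grepl("(Mr\\.|M\\.)", who), "M",
         ifelse(grepl("(Mrs\\.|Ms\\.|Miss|Mlle\\.|Mme\\.)", who), "F", ""))
}

# Made-up labels, not real MPs.
toy_who <- c("Mr. John Doe", "Mrs. Jane Doe", "Mme. Une Personne",
             "The Speaker", "Ms. Ann M. Example")

data.frame(who_en = toy_who, gender_x = title_gender(toy_who))
```

The last label is a caveat worth knowing about: a middle initial "M." matches the French "M." title, so that row is tagged "M" even though the title is "Ms.". I have not checked how often this happens in the real data.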

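words_gender <- words %>%
  filter(who_en != "") %>%
  # gender_x: gender guessed from the title in the speaker label
  mutate(gender_x = case_when(
    grepl("(Mr\\.|M\\.)", x = who_en) ~ "M",
    grepl("(Mrs\\.|Ms\\.|Miss|Mlle\\.|Mme\\.)", x = who_en) ~ "F",
    TRUE ~ "")) %>%
  # Assumption: use the title-based guess when the database gender is missing,
  # or when the two disagree and a title was found.
  mutate(gender = ifelse(is.na(gender) | (gender_x != gender & gender_x != ""), gender_x, gender)) %>%
  filter(gender != "")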
words_gender %>% select(who_en, gender, gender_x) %>% sample_n(5)
## # A tibble: 5 x 3
##   who_en                                 gender gender_x
##   <chr>                                  <chr>  <chr>   
## 1 Mr. Fin Donnelly                       M      M       
## 2 Mr. Adam Fanaki                        M      M       
## 3 Mr. Ahmadshah Malgarai                 M      M       
## 4 Mr. Scott Simms                        M      M       
## 5 The Acting Speaker (Mr. Andrew Scheer) M      M

Ok, that’s not bad; now what are the overall statistics?

 Gender |         N | Freq
 F      |  59715433 | 22.7%
 M      | 203399873 | 77.3%
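The Freq column is just each group's share of the total word count. A quick base-R check using the two counts from the table (the variable names are mine, for illustration):

```r
# Word counts by gender, copied from the table above.
words_by_gender <- c(F = 59715433, M = 203399873)

# Each group's share of all words, as a percentage rounded to one decimal.
share <- words_by_gender / sum(words_by_gender)
round(100 * share, 1)
```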

Not great. Does it change over time? We know 2015 was a record breaking election for female candidates; on the other hand, the previous record was 1993. So in between: 🤷.

Whoops. There’s an obvious break in the series for Count—so something happened to the data collection or the structure of the hansard (it’s less likely that parliament itself changed significantly at that time). On the plus side, women get more words in Parliament over time!
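The yearly share behind a plot like this can be sketched in a few lines of base R. The data frame below is entirely made up (`toy` and its numbers are hypothetical, chosen only to make the arithmetic easy to follow); the real version would sum `wordcount` by year and gender over the full table.

```r
# Made-up word counts, one row per statement (not real Hansard data).
toy <- data.frame(
  year      = c(1994, 1994, 1994, 2017, 2017, 2017),
  gender    = c("F", "M", "M", "F", "F", "M"),
  wordcount = c(100, 300, 100, 150, 150, 700)
)

# Total words per year (rows) and gender (columns).
totals <- tapply(toy$wordcount, list(toy$year, toy$gender), sum)

# Share of each year's words spoken by women, as a percentage.
female_share <- totals[, "F"] / rowSums(totals)
round(100 * female_share, 1)
```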

Gender (im)balance in Parliament: time and day patterns

Wow. That’s a pattern I didn’t expect—there’s a spike at 2:00PM from Monday-Thursday, but not Friday. Weird. That cyclical activity pattern we noted before probably explains some of this. There’s some information on this in the dataset (under the original heading h1_en), but it’s not very consistent (different capitalization, spelling, naming conventions).

So let’s do some googling to find Parliament’s daily order of business. Mon-Thurs at 2:00PM (and Friday at 11:00AM) is time for “Statements by Members”! That’s where all the spikes are! Which means women are speaking at the time each member can get exactly one minute to speak on any topic. It’s also relatively high at 2:30 (same days), which is Question Period! They’re talking and asking questions at the times they are allowed to do so. But they’re not speaking during regular government business.

Let’s save that data as a csv file, read it in and join it to the data.
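Here is a rough sketch of what that join could look like, in base R. Both data frames are hypothetical: `toy_schedule` only encodes the Statements by Members slots mentioned above (2:00PM on Monday, 11:00AM on Friday), and it is built inline instead of being read from a csv. The column names are my own, not the ones in the real file.

```r
# Made-up statements, keyed by weekday and hour of day.
toy_statements <- data.frame(
  weekday = c("Mon", "Mon", "Fri", "Fri"),
  hour    = c(14, 10, 11, 14),
  gender  = c("F", "M", "F", "M"),
  stringsAsFactors = FALSE
)

# Hypothetical schedule: which order of business runs in which slot.
toy_schedule <- data.frame(
  weekday  = c("Mon", "Fri"),
  hour     = c(14, 11),
  business = c("Statements by Members", "Statements by Members"),
  stringsAsFactors = FALSE
)

# Left join: keep every statement, attach the business type where the slot matches.
joined <- merge(toy_statements, toy_schedule,
                by = c("weekday", "hour"), all.x = TRUE)

# Slots that are not in the schedule get a catch-all label.
joined$business[is.na(joined$business)] <- "Other"
joined
```

Keeping unmatched rows (`all.x = TRUE`) matters here, because most statements happen outside the scheduled slots and would otherwise be dropped.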

Wow, Statements by Members is constant (matching the relative number of female MPs), but the % of words spoken by women during routine business has doubled over time! Let’s take a look at the deeper categories of business:

% female words

 Session Type                    |   1994 |   2017
 Adjournment Proceedings         | 16.72% | 25.88%
 Government Orders               | 17.07% | 26.85%
 Oral Questions                  | 14.59% | 29.60%
 Private Members’ Business       | 20.02% | 24.67%
 Review of Delegated Legislation |     0% | 21.09%
 Statements by Members           | 23.57% | 25.16%
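To make the changes easier to compare, here is a small base-R sketch that computes the difference in percentage points between 1994 and 2017 for each row of the table. The numbers are copied from the table; the data frame and column names are mine.

```r
# Percent of words spoken by women, copied from the table above.
female_share <- data.frame(
  session_type = c("Adjournment Proceedings", "Government Orders", "Oral Questions",
                   "Private Members' Business", "Review of Delegated Legislation",
                   "Statements by Members"),
  pct_1994 = c(16.72, 17.07, 14.59, 20.02, 0, 23.57),
  pct_2017 = c(25.88, 26.85, 29.60, 24.67, 21.09, 25.16)
)

# Change in percentage points, largest first.
female_share$change_pp <- female_share$pct_2017 - female_share$pct_1994
female_share[order(-female_share$change_pp), c("session_type", "change_pp")]
```

I use a difference and not a ratio because one category starts at 0%, which would make a ratio meaningless.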

Dang. Alright. The share of words spoken by women doubles at times outside of regular government business, suggesting they are not involved in the actual running of the government (from this data, we can’t tell whether that’s by choice of the individuals or by design of those in power; you be the judge). Then their share jumps up when they get the chance (‘statements by members’, ‘oral questions’ and ‘adjournment proceedings’ are all made up of short one-minute statements and question period).

That’s it. Parliament’s gender power imbalance is still poor, but has improved over time:

  1. Women make up 20-25% of Parliament
  2. Share of words spoken by women increases from 15% to 28% between 1994 and 2017
  3. Increase mainly comes from doubling involvement in routine government business

And that’s before we get to the words! Who knows what these people are actually saying? tidytext does! And after writing this, I found out about the Linked Parliamentary Data Project (LiPaD) at the University of Toronto, an even larger historical digital source of Canadian Hansards. Who knows what we can come up with!