Measuring URL health in R

A way to measure the health of a list of URLs in R using httr and purrr.

Motivation

I’m working on auditing Canada’s open data portal. One issue that comes up: how do you verify that a link to a dataset is actually useful? It may redirect, it may return a 404, it may trigger an R error or an R warning, and the otherwise great URL packages don’t have a simple way of getting all the possible errors and warnings in one command.

Data

Here are some possible URLs you’ll run into; some work, some don’t.

library(httr)
library(purrr)
library(dplyr)
library(tidyr)
library(tibble)

urls <- list(
  times_out = "http://www.acdi-cida.gc.ca/INET/IMAGES.NSF/vLUImages/Open%20Data/$file/Country-Data-Stacked-2002-2003.csv",
  returns_404 = "http://www.cic.gc.ca/english/visit/visas-tool.daklsd",
  returns_warnings = "ftp://ftp.nrcan.gc.ca/ess/sgb_pub/sgb_datasets/mb/FOX_LAKE_WEST_3/FOX_LAKE_WEST_3_SHP.zip",
  success = "https://www.google.ca/",
  redirects = "https://www.google.ca",
  returns_error = "https://boycott.google.ca"
) %>%
  enframe(value = "url") %>%
  mutate(url = as.character(url))

URL success + R errors

Ok, first up, success or not:

http_error("https://www.google.ca/")
## [1] FALSE
safely(http_error)("https://boycott.google.ca")$error
## <simpleError in curl::curl_fetch_memory(url, handle = handle): Could not resolve host: boycott.google.ca>

So you’d think http_error would catch an error like “Could not resolve host” and return TRUE, but instead it throws an R error and stops your script. Hmm. OK, that’s fine, we’ll just swap http_error out (since I guess it handles some HTTP errors but not others) for a basic HEAD (which we’ll also need in order to handle timeouts later), and use safely for all calls. But we also want to check warnings, and safely doesn’t do that, so let’s use quietly instead:

safely(HEAD)("ftp://ftp.nrcan.gc.ca/ess/sgb_pub/sgb_datasets/mb/FOX_LAKE_WEST_3/FOX_LAKE_WEST_3_SHP.zip")
## $result
## Response [ftp://ftp.nrcan.gc.ca/ess/sgb_pub/sgb_datasets/mb/FOX_LAKE_WEST_3/FOX_LAKE_WEST_3_SHP.zip]
##   Date: 2018-02-17 16:45
##   Status: 350
##   Content-Type: <unknown>
##   Size: 91 B
## <BINARY BODY>
## 
## $error
## NULL
quietly(HEAD)("ftp://ftp.nrcan.gc.ca/ess/sgb_pub/sgb_datasets/mb/FOX_LAKE_WEST_3/FOX_LAKE_WEST_3_SHP.zip")
## $result
## Response [ftp://ftp.nrcan.gc.ca/ess/sgb_pub/sgb_datasets/mb/FOX_LAKE_WEST_3/FOX_LAKE_WEST_3_SHP.zip]
##   Date: 2018-02-17 16:45
##   Status: 350
##   Content-Type: <unknown>
##   Size: 91 B
## <BINARY BODY>
## 
## $output
## [1] ""
## 
## $warnings
## [1] "NAs introduced by coercion to integer range"                              
## [2] "Failed to parse headers:\n213 97668\n350 Restart position accepted (0).\n"
## 
## $messages
## character(0)

But if the URL really does throw an error, quietly doesn’t catch it and the call still fails…ugh. So we have to use safely first, then quietly, and save the errors from safely and the warnings from quietly.

quietly(safely(HEAD))("https://boycott.google.ca/")
## $result
## $result$result
## NULL
## 
## $result$error
## <simpleError in curl::curl_fetch_memory(url, handle = handle): Could not resolve host: boycott.google.ca>
## 
## 
## $output
## [1] ""
## 
## $warnings
## character(0)
## 
## $messages
## character(0)

Timeouts

None of these calls will time out yet, so if you have a link in your list that never responds, your script will hang. You can add a timeout to HEAD:

# hangs:
# HEAD("http://www.acdi-cida.gc.ca")

head_timeout <- function(url) { HEAD(url, timeout(1)) }
quietly_safely_head <- quietly(safely(head_timeout))

# times out after 1 second
quietly_safely_head("http://www.acdi-cida.gc.ca")
## $result
## $result$result
## NULL
## 
## $result$error
## <simpleError in curl::curl_fetch_memory(url, handle = handle): Timeout was reached: Operation timed out after 1005 milliseconds with 0 bytes received>
## 
## 
## $output
## [1] ""
## 
## $warnings
## character(0)
## 
## $messages
## character(0)
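
One second is pretty aggressive for a real audit, so if you’d rather be more forgiving, you can make the timeout an argument. A minimal sketch (head_timeout_n and its 10-second default are my own choices here, not part of the pipeline below):

# an alternative with an adjustable timeout; the 10-second default is arbitrary,
# and the rest of this post sticks with the one-second version above
head_timeout_n <- function(url, seconds = 10) {
  HEAD(url, timeout(seconds))
}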

Redirects

Redirects are annoying: checking the status_code from HEAD won’t show you a redirect code (3xx), because HEAD follows the redirect to the new page and returns the status of that page. Instead, the HEAD result includes the URL it actually accessed, so just check whether that is the same as the URL you asked for:

HEAD("https://www.google.ca")
## Response [https://www.google.ca/]
##   Date: 2018-02-17 21:45
##   Status: 200
##   Content-Type: text/html; charset=ISO-8859-1
## <EMPTY BODY>
HEAD("https://www.google.ca")$url
## [1] "https://www.google.ca/"
HEAD("https://www.google.ca")$url == "https://www.google.ca"
## [1] FALSE

That’s an admittedly weak redirect (it just forces the browser to add a “/” to the end of the URL). But those are the kinds of details I’m looking for, as well as more egregious redirects.
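
If you ever wanted to ignore those trailing-slash-only differences and flag only more substantial redirects, a small helper could strip the trailing slash before comparing. Just an aside (is_real_redirect is a name I’m making up here); the pipeline below keeps the strict comparison:

# hypothetical helper: treat "https://example.com" and "https://example.com/"
# as the same, so only more substantial redirects get flagged
is_real_redirect <- function(requested, accessed) {
  strip_slash <- function(x) sub("/$", "", x)
  strip_slash(requested) != strip_slash(accessed)
}

is_real_redirect("https://www.google.ca", "https://www.google.ca/")
## [1] FALSE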

Map the URL access function over the list of URLs

tests <- urls %>% mutate(test = map(url, quietly_safely_head))
tests %>% select(-name)
## # A tibble: 6 x 2
##   url                                                              test   
##   <chr>                                                            <list> 
## 1 http://www.acdi-cida.gc.ca/INET/IMAGES.NSF/vLUImages/Open%20Dat… <list …
## 2 http://www.cic.gc.ca/english/visit/visas-tool.daklsd             <list …
## 3 ftp://ftp.nrcan.gc.ca/ess/sgb_pub/sgb_datasets/mb/FOX_LAKE_WEST… <list …
## 4 https://www.google.ca/                                           <list …
## 5 https://www.google.ca                                            <list …
## 6 https://boycott.google.ca                                        <list …

Now process the returned URL objects

There’s too much information returned, and the objects are too complicated to study directly, so let’s decide what we really need to know. I want the error message (if there is one), the warnings (if they exist), the accessed URL (to see if it redirected), and the status code. It’s ugly because those things come from different calls and live at different levels of the list.

process_test <- function(test) {
  result <- test$result$result        # the httr response (NULL if the call errored)
  error <- test$result$error$message  # error message captured by safely()
  warnings <- test$warnings           # warnings captured by quietly()
  head_url <- result$url              # the URL that HEAD actually ended up at
  status_code <- result$status_code
  # status codes are different for FTP, see:
  # https://en.wikipedia.org/wiki/List_of_FTP_server_return_codes

  list(error = error,
       warnings = warnings,
       head_url = head_url,
       status_code = status_code) %>%
    t() %>%
    as_tibble()
}

Now all we have to do is map process_test over the test results, then convert the list entries to strings where appropriate (warnings is itself a list of zero or more messages, so I want to leave it like that).

tests %>%
  mutate(res = map(test, process_test)) %>%
  unnest(res) %>%
  mutate_at(c("error", "head_url", "status_code"), as.character) %>% 
  mutate_at(c("error", "head_url", "status_code"), ~ ifelse(. == "NULL", NA, .)) %>%
  mutate(redirect = head_url != url)
## # A tibble: 6 x 8
##   name   url       test   error   warnings head_url   status_code redirect
##   <chr>  <chr>     <list> <chr>   <list>   <chr>      <chr>       <lgl>   
## 1 times… http://w… <list… Timeou… <chr [0… <NA>       <NA>        NA      
## 2 retur… http://w… <list… <NA>    <chr [0… http://ww… 404         FALSE   
## 3 retur… ftp://ft… <list… <NA>    <chr [2… ftp://ftp… 350         FALSE   
## 4 succe… https://… <list… <NA>    <chr [0… https://w… 200         FALSE   
## 5 redir… https://… <list… <NA>    <chr [0… https://w… 200         TRUE    
## 6 retur… https://… <list… Could … <chr [0… <NA>       <NA>        NA

Done!

We have a list of the original URLs, plus the errors + warnings + accessed URLs + status codes! Now just do this to the 150,000 URLs on https://open.canada.ca/data/en/dataset!
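
For what it’s worth, scaling up is just the same pipeline mapped over a longer table. Here’s a rough sketch, assuming you’ve exported the portal’s resource URLs to a CSV with a url column (the file name and column name are made up):

library(readr)

all_urls <- read_csv("open_canada_resource_urls.csv") %>%  # hypothetical export
  mutate(name = as.character(row_number()),
         url = as.character(url))

results <- all_urls %>%
  mutate(test = map(url, quietly_safely_head)) %>%
  mutate(res = map(test, process_test)) %>%
  unnest(res) %>%
  mutate_at(c("error", "head_url", "status_code"), as.character) %>%
  mutate_at(c("error", "head_url", "status_code"), ~ ifelse(. == "NULL", NA, .)) %>%
  mutate(redirect = head_url != url)

With 150,000 links you’d also want to keep an eye on run time, since every unresponsive URL still costs you the full timeout.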