This article supplement provides all of the code necessary to reproduce the case study illustration in Jumpstarting the Justice Disciplines: A Computational-Qualitative Approach to Collecting and Analyzing Text and Image Data in Criminology and Criminal Justice Studies (conditional acceptance). To boost the pedagogical value of this resource, we have provided detailed explanations and commentaries throughout each step.
To reproduce the code in this supplement, readers will need at least some background in the R programming language. There are many excellent resources available to learn the basics of R (and the RStudio integrated development environment or IDE, which we recommend). While certainly not an exhaustive list, below are some of our favourite free resources for learning the R/RStudio essentials you’ll need to follow along with this supplement.
Although not always free, there are also many courses available online through websites like Codecademy, Coursera, edX, udemy, and DataCamp.
The rest of this supplement follows the six stages of the framework for using computational methods in qualitative research developed in the article. These stages are: (1) defining the problem; (2) collecting; (3) parsing, exploring, and cleaning; (4) sampling and outputting; (5) analyzing; and (6) findings and discussion. The bulk of material in this supplement is focused on steps 2, 3, and 4, as these are the steps that involved R programming. More in-depth discussions of the remaining steps 1, 5, and 6 can be found in the article.
The first step in designing a project that incorporates computational methods – as with any research project – is to determine a research question. For the sake of brevity, here we only restate the two overarching research questions that guided our collection and analysis of RCMP news releases. We asked:
How do the Royal Canadian Mounted Police (RCMP) visually represent their policing work in Canada? More specifically, what ‘work’ do the images included in RCMP press releases do with respect to conveying a message about policing and social control?
All data for this study were collected via web scraping. The first major step in conducting a web scrape is page exploration/inspection. At this stage, the researcher explores the content and structure of the pages they are interested in. The first goal is to find the various page elements one wishes to collect. In this case, as we explain in the article, we are interested in collecting specific data points from thousands of RCMP news releases, including the title of each news release, its date, the location of the RCMP detachment, the main text of the release, and links to any images it contains.
The second goal is to come up with an algorithmic solution or strategy for collecting this information. A key part of this second step is to carefully examine the source code of the website to determine what tools or libraries will be necessary to execute the scrape. While simpler websites can be scraped using an R package like library(rvest), more sophisticated websites may require that the researcher use an additional set of tools such as library(RSelenium).
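For example, one quick way to make this determination is to fetch a page and check whether the elements of interest appear in the raw HTML returned by the server (the URL and the 'list-group' element referenced below are the same ones used in the index scrape later in this supplement):
# fetch the raw html of the first news listing page (the pages are zero-indexed)
raw_html <- readLines('https://www.rcmp-grc.gc.ca/en/news?page=0', warn = FALSE)
# if the news list markup is present in the html the server returns, a parser
# like library(rvest) should suffice; if it only appears after JavaScript runs
# in the browser, a browser-automation tool like library(RSelenium) may be needed
any(grepl('list-group', raw_html))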
It is also important during exploration/inspection to consider the legal and ethical implications of the scrape, which includes reviewing the website’s terms and conditions of use and ‘robots exclusion protocol’ or robots.txt. A website’s robots.txt directives can be viewed by appending /robots.txt to the site’s root URL; the robots.txt for the RCMP’s website can be retrieved the same way. The legal and ethical dimensions of web scraping are discussed more fully in the article.
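As a quick illustration, the RCMP’s robots.txt can be retrieved and printed from within R using base functions (the URL below assumes the site root used in the scraping code later in this supplement):
# retrieve and print the RCMP website's robots.txt directives
robots_txt <- readLines('https://www.rcmp-grc.gc.ca/robots.txt', warn = FALSE)
cat(robots_txt, sep = '\n')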
Another key part of constructing an algorithmic solution is to determine whether the information can (or should) be collected in one or multiple stages, where each stage represents a different script. We typically conduct our web scrapes in two stages: the index scrape and the contents scrape.
The index scrape works by automatically “clicking” through the page elements containing links to each of the sources one wants to obtain (in this case, RCMP news releases), extracting the link for each individual source, as well as any other metadata that may be available. In this first scrape, the primary goal is to obtain each of these links, building an index. Next, we write and deploy the script for the contents scrape, which visits each of the links in the index and obtains the desired data points.
The index and contents scrape can be thought of like a Google search. When searching the word “crime” on Google, one arrives first at a page containing various links to other websites. This first page (and subsequent pages) can be thought of as comprising the “index”: it contains the links to the pages we may be interested in visiting and consuming information from. Clicking on any given link in a Google result brings us to the website itself. The material on this website can be thought of as the contents, which would be obtained in the second (aka, contents) scrape.
The first step we took in our index scrape was to write a file (specifically, a comma-separated values or CSV file) to our local drive that we could use to store the results of our scrape. Another means of achieving the same result would be to store the results in RStudio’s global environment, writing them to your local drive after the scrape completes. Two downsides to this second approach are that you cannot view the results until the scrape is complete, and that if your scrape fails at some point (which it very likely will, especially on more time-intensive tasks), you’ll lose the results you had obtained up to that point.
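For comparison, a minimal sketch of that second, in-memory approach might look something like the following (the object names here are purely illustrative):
library(dplyr)
library(readr)

# illustrative only: accumulate each page's results in a list held in the
# global environment rather than writing them to disk as you go
all_pages <- list()

# inside the scraping loop, you would append each page's tibble of results:
# all_pages[[length(all_pages) + 1]] <- page_data

# once the scrape completes, combine everything and write it out in one step:
# results <- bind_rows(all_pages)
# write_csv(results, "rcmp-news-index-scrape.csv")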
So, using this first approach, we’ll begin by creating a CSV spreadsheet that contains named columns for the data we’ll be collecting in our index scrape (headline_url, headline_text, etc.). To do this, we’ll use three tidyverse libraries: library(tibble), library(readr), and library(tidyr). (Remember that for this and subsequent steps, you’ll need to install the libraries before loading them, unless you have them installed already. In RStudio, libraries only need to be installed once, but will need to be loaded each time you launch a new session.)
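For example, to install (once) and load these three libraries:
# install the packages once (skip this line if they are already installed)
install.packages(c("tibble", "readr", "tidyr"))

# load the libraries at the start of each new session
library(tibble)
library(readr)
library(tidyr)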
# give the file you'll be creating a name
filename <- "rcmp-news-index-scrape.csv"

# using the tibble function, create a dataframe with the column headers
# we'll be populating during the index scrape
create_data <- function(
  headline_url = NA,
  headline_text = NA,
  date_published = NA,
  metadata_text = NA,
  page_url = NA
) {
  tibble(
    headline_url = headline_url,
    headline_text = headline_text,
    date_published = date_published,
    metadata_text = metadata_text,
    page_url = page_url
  )
}

# write the tibble to csv; drop_na() removes the all-NA placeholder row,
# so the file starts out containing only the column headers
write_csv(create_data() %>% drop_na(), filename, append = TRUE, col_names = TRUE)
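A quick way to confirm that this step worked is to check that the file now exists and contains only the header row:
# optional check: the file should exist and contain a single line of column names
file.exists(filename)
readLines(filename)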
Next, we’ll write the script for our index scraping algorithm, which will gather the data from the RCMP’s website and populate the CSV file we created in the last chunk of code. (Assuming the last chunk of code ran successfully, you should have a CSV file titled "rcmp-news-index-scrape.csv" in your working directory.) To conduct our index scrape, we’ll need to install/load an additional library – library(rvest) – that will be used to get and parse the information we want from the news release section of the RCMP’s website. From library(rvest), we will be using six functions: read_html(), html_node(), html_nodes(), html_attr(), html_text(), and url_absolute().
To locate the information we want, which is embedded in the RCMP website’s HyperText Markup Language or HTML code, we’ll be specifying the HTML element that contains each data point (headline_url, headline_text, date_published, metadata_text, and page_url).
As we’ve written about elsewhere, obtaining these elements is more an art than a science. Developer tools built into every modern browser can help identify them. Another popular tool is Andrew Cantino and Kyle Maxwell’s incredibly efficient and user-friendly Chrome browser extension, SelectorGadget.
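For instance, once you have a candidate CSS selector in hand (whether from your browser’s developer tools or from SelectorGadget), you can test it against a single page before writing the full scrape. Here we test the headline selectors used in the index scrape below:
library(rvest)

# pull a few headlines from the first listing page to confirm the selectors work
read_html('https://www.rcmp-grc.gc.ca/en/news?page=0') %>%
  html_node('.list-group') %>%
  html_nodes('div > div > a') %>%
  html_text(trim = TRUE) %>%
  head()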
It is vitally important when web scraping to always insert a pause into the code, typically a minimum of 3 seconds, which can be achieved using the base R function Sys.sleep(). Pausing the loop after each execution (since there are 13,637 URLs, it will be executed 13,637 times) prevents the web scrape from placing undue stress on the website’s server.
# load rvest, which provides the scraping functions described above
library(rvest)

base_url <- 'https://www.rcmp-grc.gc.ca/en/news?page='
max_page_num <- NA # note that these pages are zero-indexed

scrape_page <- function(page_num = 0) {
  # grab html only once
  page_url <- paste(base_url, page_num, sep = '')
  curr_page <- read_html(page_url)
  # zero in on news list
  news_list <- curr_page %>%
    html_node('.list-group')
  # grab headline nodes
  headline_nodes <- news_list %>%
    html_nodes('div > div > a')
  # use headline nodes to get urls
  headline_url <- headline_nodes %>%
    html_attr('href') %>%
    url_absolute('https://www.rcmp-grc.gc.ca/en/news')
  # use headline nodes to get text
  headline_text <- headline_nodes %>%
    html_text(trim = TRUE)
  # grab metadata field
  metadata <- news_list %>%
    html_nodes('div > div > span.text-muted')
  # use metadata field to grab pubdate
  date_published <- metadata %>%
    html_nodes('meta[itemprop=datePublished]') %>%
    html_attr('content')
  # use metadata field to grab metadata text
  metadata_text <- metadata %>%
    html_text(trim = TRUE)
  # build a tibble
  page_data <- create_data(
    headline_url = headline_url,
    headline_text = headline_text,
    date_published = date_published,
    metadata_text = metadata_text,
    page_url = page_url
  )
  # write this page's results to csv
  write_csv(page_data, filename, append = TRUE)
  # read the last page number from the pagination widget and subtract 1,
  # since the pages are zero-indexed
  max_page <- curr_page %>%
    html_node('div.contextual-links-region ul.pagination li:nth-last-child(2)') %>%
    html_text(trim = TRUE) %>%
    as.numeric() %>%
    {. - 1}
  Sys.sleep(20) # 20-second delay as per the RCMP's robots.txt directives
  max_page_num <- max_page
  # recur until the final page has been scraped
  if ((page_num + 1) <= max_page_num) {
    scrape_page(page_num = page_num + 1)
  }
}
# run it once
scrape_page()
Once our index scrape is complete, we can (must!) inspect the results before proceeding any further. To do this, we’ll read our CSV file into R using the read_csv() function from library(readr). To print and inspect the results, we’ll use the paged_table() function from library(rmarkdown).
# load rmarkdown for the paged_table() function
library(rmarkdown)

# read the index back into R and preview it in a paged table
index <- read_csv("rcmp-news-index-scrape.csv")
paged_table(index)