crime severity -> crime severities

2009 WAS A BIG YEAR for folks at Statistics Canada. It marked the rollout of a metric called the Police-Reported Crime Severity Index (PRCSI) that has had a lasting impact on how we quantify crime. Before the PRCSI, crime was counted more simply as an unweighted frequency or proportion. With the PRCSI, different crimes get different weights that reflect their relative severity. Where do the weights come from? As the authors who came up with the PRCSI explain in a 2009 working paper:

Read More
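
To make the weighting idea in the crime severity teaser a bit more concrete, here is a minimal sketch in R. The offence names, counts, weights, and population are all invented for illustration; they are not Statistics Canada's figures, and the real index involves more than this. It only contrasts an unweighted rate with a severity-weighted one.

```r
# Hypothetical data: offences, counts, and severity weights are made up
# for illustration and are NOT Statistics Canada's actual values.
crimes <- data.frame(
  offence = c("theft", "assault", "robbery"),
  count   = c(500, 100, 20),
  weight  = c(1, 5, 20)
)
population <- 100000  # hypothetical population

# Unweighted: every incident counts the same (incidents per 100,000 people).
crime_rate <- sum(crimes$count) / population * 1e5

# Weighted: each incident contributes in proportion to its severity weight.
weighted_rate <- sum(crimes$count * crimes$weight) / population * 1e5

print(c(unweighted = crime_rate, weighted = weighted_rate))
```

In this toy example robbery makes up a small share of incidents but a large share of the weighted total, which is the kind of shift a severity-weighted measure is meant to capture.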

predicting fraud using financial statements

Discovered a dataset on Kaggle with financial filings from 170 companies, half fraudulent and half not. I built an NLP model in R that vectorizes the text in the “Fillings” column to capture key textual features, then applies logistic regression to classify each company as fraudulent or not fraudulent. Accuracy: 79% (on a 50-50 balanced dataset). Not terrible for a baseline model.

Read More

eight new tps datasets

As part of its broader Race and Identity-Based Data Collection (RBDC) Strategy, the Toronto Police Service (TPS) has published eight open data sets that it plans to update periodically. To make accessing these eight data sets as convenient as possible, I put together a little R package that grabs the data directly from the TPS’s client-side API, cleans up the column names, and imports the data into R in tidy (tibble) format. You can install library(tps.rbdc) from my GitHub here, where I’ve also provided some details on how to use it.

Read More

annotating training data in r

Some collaborators and I recently started a project analyzing a large number of tweets we obtained via the Twitter API. To analyze these data, we are planning to train a machine learning model, which means we need training data, which means we need annotations (‘ground truth’, as it’s commonly referred to in computer science).

Read More

parsing your pdfs in r

While it’s fairly straightforward to read a .pdf file into R, we may not want all of the text from our .pdf files to be read in all at once, or into the same row or column of our dataframe. There are parts of our .pdfs we may not want to be included in our analysis, or that we may wish to include as metadata, separated from the main text component of our data set.

Read More