2  Data

2.1 Technical Description

The data we are interested in originates from The Stanford Open Policing Project found here. This project aims to compile diverse information concerning police stops and crimes across multiple cities in the USA. The project is led by a group of researchers and journalists at Stanford University, and the data is requested from law enforcement agencies. Each city has its own CSV/R data file with distinct data columns, as policies may vary due to differing state or city laws. For a comprehensive description of the various column types within the datasets, this README provides an exhaustive overview of the data. Each row in the dataset represents a stop that occurred in a particular city.

We chose the dataset for Nashville (Tennessee) because it has the most comprehensive set of features among all of the cities while also containing a significant number of stops. This gives us more flexibility for a deeper analysis and for providing well-thought-out answers to our questions.

The initial dataset has:

  • Rows: 3092351
  • Columns: 42
  • Data timeframe: 2010 to 2019
  • File size: 100 MB
  • File type: rds file
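The row count, column count, and timeframe listed above can be checked with a few base R calls. The sketch below is only an illustration: `stops_toy` is a hypothetical, made-up data frame standing in for the Nashville data so that the snippet runs on its own, and it assumes the `date` column is stored as a Date.

```r
# Toy stand-in for the Nashville data (values are made up for illustration)
stops_toy <- data.frame(
  raw_row_number = 1:4,
  date = as.Date(c("2010-01-05", "2014-06-30", "2016-09-09", "2019-03-12")),
  subject_age = c(18, 42, NA, 35)
)

n_rows <- nrow(stops_toy)   # number of stops (one row per stop)
n_cols <- ncol(stops_toy)   # number of columns

# First and last calendar year covered by the data
year_range <- range(as.integer(format(stops_toy$date, "%Y")))

cat("Rows:", n_rows, "\n")
cat("Columns:", n_cols, "\n")
cat("Data timeframe:", year_range[1], "to", year_range[2], "\n")
```

Running the same three calls on the full data frame should give the figures in the list above.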

The table below provides a description of all columns, including their type and meaning, accompanied by an example:

Table 2.1: Columns description for Nashville (TN)
| Column | Type | Description | Example |
| --- | --- | --- | --- |
| raw_row_number | numeric | Row number used to join clean data back to raw data | 3092350 |
| date | string | Date in “YYYY-MM-DD” format | 2016-09-09 |
| time | string | Time in “HH:MM:SS” format | 21:57:00 |
| location | string | Address in “Street, City, State, Zipcode” format | I 40 W & CHARLOTTE AVE, NASHVILLE, TN, 37203 |
| lat | numeric | Latitude | 36.063321 |
| lng | numeric | Longitude | -86.637409 |
| precinct | numeric | Police precinct | 2.0 |
| reporting_area | numeric | Police reporting area | 1863.0 |
| zone | numeric | Police zone | 223.0 |
| subject_age | numeric | Age of the stopped subject | 18.0 |
| subject_race | string | Race of the stopped subject | hispanic |
| subject_sex | string | Sex of the stopped subject | female |
| officer_id_hash | string | Unique officer ID hash | e7099bc91c |
| type | string | Type of stop: vehicular or pedestrian | vehicular |
| violation | string | Specific violation of stop | vehicle equipment violation |
| arrest_made | bool | Indicates whether an arrest was made | False |
| citation_issued | bool | Indicates whether a citation was issued | False |
| warning_issued | bool | Indicates whether a warning was issued | False |
| outcome | string | Indicates whether one of these actions has been taken (arrest, citation, warning, summons) | citation |
| contraband_found | bool | Indicates whether contraband was found | False |
| contraband_drugs | bool | Indicates whether drugs were found | False |
| contraband_weapons | bool | Indicates whether weapons were found | False |
| frisk_performed | bool | Indicates whether a frisk was performed | False |
| search_conducted | bool | Indicates whether a search was conducted | False |
| search_person | bool | Indicates whether a person was searched | False |
| search_vehicle | bool | Indicates whether a vehicle was searched | False |
| search_basis | string | Provides the reason for the search (k9, consent, plain view, probable cause, other) | consent |
| reason_for_stop | string | Provides the reason for the stop | moving traffic violation |
| vehicle_registration_state | string | Vehicle state origin (ex: TN for Tennessee) | TN |
| notes | string | Contains officer notes | CITED FOR NOT HAVING INSURANCE |
| raw_verbal_warning_issued | bool | Indicates whether a verbal warning was issued | False |
| raw_written_warning_issued | bool | Indicates whether a written warning was issued | False |
| raw_traffic_citation_issued | bool | Indicates whether a traffic citation was issued | False |
| raw_misd_state_citation_issued | bool | Indicates whether a misdemeanor state citation was issued | False |
| raw_suspect_ethnicity | string | Ethnicity of the suspect | H |
| raw_driver_searched | bool | Indicates whether a driver was searched | False |
| raw_passenger_searched | bool | Indicates whether a passenger was searched | False |
| raw_search_consent | bool | Indicates whether a search was based on consent | False |
| raw_search_arrest | bool | Indicates whether a search was due to an arrest | False |
| raw_search_warrant | bool | Indicates whether a search warrant was issued | False |
| raw_search_inventory | bool | Indicates whether an inventory search was conducted | False |
| raw_search_plain_view | bool | Indicates whether evidence was seized without a warrant | False |

See Table 2.1.

The data file has been downloaded locally and is available inside the repository as it is not too large. However, we noticed some patterns with missing values that we will touch on later. Combined with the fact that there are over three million rows spanning almost a decade, we decided to subset the data for ease of analysis and computation speed. Our subsetted data includes only data from 2012-2016.

2.2 Research Plan

Studying trends in Nashville traffic stops provides insights into the bigger picture of law enforcement and social dynamics. One can consider any of the many factors that contribute to why someone is stopped in order to better understand policing strategy. The type of car someone drives may draw more scrutiny from an officer, and this may differ according to time of day or day of week. Analyzing the intersection of race, gender, and age among those stopped can shed light on potential underlying biases in policing. Through our deep dive into the policing patterns in Nashville, we hope to tell a more transparent story of the law enforcement system.

We plan to touch on several research areas using our data:

  • To what extent do spatiotemporal factors play a part in traffic stops? This includes features like geographic location and time of day.
  • What are the demographic patterns of those stopped? Are those of a certain gender, age, or race overrepresented in their number of traffic stops?
  • What do the outcomes of traffic stops look like? To answer this we will look at whether the suspect was searched, whether a ticket was issued, etc.
  • Are there patterns in the types of cars that are stopped?
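As an illustration of how the temporal features mentioned in the first question could be derived, the sketch below extracts the hour of day and the day of week from the `date` and `time` columns. It is not the final analysis code: `stops_toy` is a hypothetical toy data frame with made-up values, and it assumes `time` is stored as an “HH:MM:SS” string as described in Table 2.1.

```r
library(dplyr)
library(lubridate)

# Toy stand-in for the Nashville data (values are made up for illustration)
stops_toy <- data.frame(
  date = as.Date(c("2016-09-09", "2014-03-02")),
  time = c("21:57:00", "08:15:00")
)

stops_features <- stops_toy |>
  mutate(
    hour = hour(hms(time)),             # hour of day, 0 to 23
    weekday = wday(date, label = TRUE)  # day of week as an ordered factor
  )

stops_features
```

Grouping by `hour` or `weekday` would then give stop counts by time of day or day of week.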

2.3 Missing value analysis

library(tidyverse)
library(lubridate)

# Load the Nashville stops data
nashville <- readRDS('Datasets/yg821jf8611_tn_nashville_2020_04_01.rds')

# Proportion of missing values per column, sorted in ascending order
na_rate <- data.frame("Variable" = colnames(nashville),
                "na_prop" = sapply(nashville, function(y) mean(is.na(y)))) |>
  arrange(na_prop)

# Fix the factor levels so the bars follow the sorted order
na_rate |>
  mutate(Variable = factor(Variable, levels = na_rate$Variable)) |>
  ggplot(aes(x = Variable, y = na_prop)) +
  geom_col() +
  coord_flip() +
  labs(y = "Proportion of NA's") +
  ggtitle("Proportion of NA Values by Variable") +
  theme_bw(13)

The majority of features have very few or zero missing values. However, four columns (search_basis, contraband_weapons, contraband_found, notes) have a very high proportion of missing values (>75%). This is likely due to officers not recording values in these columns, so we will not consider these four features in our analysis moving forward.
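Excluding the four columns can be done in a single `select()` call. The following is a minimal sketch on a hypothetical toy data frame (`stops_toy`, with made-up values) rather than the full dataset; the name `high_na_cols` is also introduced here only for illustration.

```r
library(dplyr)

# Toy stand-in for the Nashville data (values are made up for illustration)
stops_toy <- data.frame(
  subject_age = c(18, 42, 35),
  search_basis = c(NA, "consent", NA),
  contraband_weapons = c(NA, FALSE, NA),
  contraband_found = c(NA, FALSE, NA),
  notes = c(NA, NA, "CITED FOR NOT HAVING INSURANCE")
)

# The four columns identified above as mostly missing
high_na_cols <- c("search_basis", "contraband_weapons",
                  "contraband_found", "notes")

# any_of() ignores names that are absent, so the call is safe to re-run
stops_reduced <- stops_toy |>
  select(-any_of(high_na_cols))

names(stops_reduced)
```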

# Yearly proportion of missing values for a selection of columns
nashville |>
  mutate(year = year(date)) |>
  select(year, raw_misd_state_citation_issued, raw_written_warning_issued, zone,
         precinct, reporting_area, lat, lng) |>
  group_by(year) |>
  # across() replaces the deprecated summarise_each() and keeps the column names
  summarise(across(everything(), ~ mean(is.na(.x)))) |>
  pivot_longer(!year) |>
  ggplot(aes(x = name, y = value, fill = name)) +
  geom_col() +
  coord_flip() +
  labs(y = "Proportion of NA's", x = "Variable") +
  ggtitle("Proportion of NA's Faceted by Year") +
  facet_wrap(~year) +
  theme_bw(13) +
  # Variable names are shown in the legend instead of on the axis
  theme(axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        legend.position = "bottom") +
  scale_y_continuous(breaks = c(0, .5, 1),
    labels = c('0', '0.5', '1'))

To further analyze trends in missing values, we faceted the proportion of missing values by year for a few of the variables with the most missing values after excluding the four variables discussed above. We see that in 2017-2019 there is no data for raw_written_warning_issued, and that in 2010-2011 the column raw_misd_state_citation_issued is mostly empty. This may be due to changes in notation over time.

Keeping in mind that there are over three million rows, we decided to subset our data to 2014 only for a more manageable size. Our subsetted data has 413,114 rows.
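A subset of this kind could be taken with a single filter on the year of the `date` column. The sketch below shows the idea on a hypothetical toy data frame (`stops_toy`, made-up values) and assumes `date` is stored as a Date.

```r
library(dplyr)
library(lubridate)

# Toy stand-in for the Nashville data (values are made up for illustration)
stops_toy <- data.frame(
  raw_row_number = 1:5,
  date = as.Date(c("2013-12-31", "2014-01-01", "2014-07-15",
                   "2014-12-31", "2015-01-01"))
)

# Keep only the stops that occurred in 2014
stops_2014 <- stops_toy |>
  filter(year(date) == 2014)

nrow(stops_2014)
```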

As we move forward with our analysis, we want to keep as much data as possible. Furthermore, most of the columns have zero or only a handful of missing values. For these reasons we did not remove incomplete rows with NA's; instead, we will drop missing observations only when needed, and only for the columns analyzed in the current plot.
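One way to implement this per-plot handling is `tidyr::drop_na()` restricted to the columns a given plot uses. The sketch below assumes a plot that only needs `subject_age`; `stops_toy` and `age_plot_data` are hypothetical names, and the values are made up.

```r
library(tidyr)

# Toy stand-in for the Nashville data (values are made up for illustration)
stops_toy <- data.frame(
  subject_age = c(18, NA, 35, 52),
  subject_sex = c("female", "male", NA, "male")
)

# Drop rows only where the plotted column is missing; the row with a
# missing subject_sex is kept because that column is not used in this plot
age_plot_data <- stops_toy |>
  drop_na(subject_age)

nrow(age_plot_data)
```

Because the filtering happens inside each plot's own pipeline, the underlying data frame keeps all of its rows for the other plots.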