2 Data

2.1 Technical Description

The data we are interested in originate from the Stanford Open Policing Project. This project aims to compile diverse information concerning police stops and crimes across multiple cities in the USA. The project is led by a group of researchers and journalists at Stanford University, and the data are requested from law enforcement agencies. Each city has its own data file (CSV or RDS) with its own set of columns, as recording policies may vary due to differing state or city laws. The project's README provides an exhaustive description of the column types within the datasets. Each row in a dataset represents a stop that occurred in a particular city.

We chose the dataset for Nashville (Tennessee) because it has the most comprehensive set of features among all of the cities while also containing a large number of stops. This gives us more flexibility for a deeper analysis and lets us provide well-supported answers to our questions.

The initial dataset has:

Rows: 3092351
Columns: 42
Data timeframe: 2010 to 2019
File size: 100 MB
File type: rds file
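
Summary figures like the ones above can be checked with a few lines of R. The following is only a minimal sketch: `describe_stops` is a hypothetical helper, `toy` is a made-up stand-in for the Nashville table, and we assume the `date` column is either a Date or an ISO-formatted string.

```r
library(dplyr)
library(lubridate)

# Hypothetical helper (not part of the original analysis):
# report the number of rows, columns, and the first and last year covered.
describe_stops <- function(df) {
  dates <- as.Date(df$date)  # assumption: Date or "YYYY-MM-DD" strings
  tibble(
    rows = nrow(df),
    columns = ncol(df),
    first_year = year(min(dates, na.rm = TRUE)),
    last_year = year(max(dates, na.rm = TRUE))
  )
}

# Toy stand-in for the real data (values are made up for illustration)
toy <- tibble(
  date = as.Date(c("2010-01-05", "2014-06-30", "2019-03-12")),
  subject_sex = c("female", "male", NA)
)

toy_summary <- describe_stops(toy)
print(toy_summary)
```

The same helper could be applied to the full table once it has been loaded with `readRDS()`.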

The table below provides a description of all columns, including their type and meaning, accompanied by an example:

Table 2.1: Columns description for Nashville (TN)
Column	Type	Description	Example
raw_row_number	numeric	Row number used to join clean data back to raw data	3092350
date	string	Date in “YYYY-MM-DD” format	2016-09-09
time	string	Time in “HH:MM:SS” format	21:57:00
location	string	Address in “Street, City, State, Zipcode”	I 40 W & CHARLOTTE AVE, NASHVILLE, TN, 37203
lat	numeric	Latitude	36.063321
lng	numeric	Longitude	-86.637409
precinct	numeric	Police precinct	2.0
reporting_area	numeric	Police reporting area	1863.0
zone	numeric	Police zone	223.0
subject_age	numeric	Age of the stopped subject	18.0
subject_race	string	Race of the stopped subject	hispanic
subject_sex	string	Sex of the stopped subject	female
officer_id_hash	string	Unique officer ID hash	e7099bc91c
type	string	Type of stop: vehicular or pedestrian	vehicular
violation	string	Specific violation of stop	vehicle equipment violation
arrest_made	bool	Indicates whether an arrest was made	False
citation_issued	bool	Indicates whether a citation was issued	False
warning_issued	bool	Indicates whether a warning was issued	False
outcome	string	Indicates which of these actions has been taken (arrest, citation, warning, summons)	citation
contraband_found	bool	Indicates whether contraband was found	False
contraband_drugs	bool	Indicates whether drugs were found	False
contraband_weapons	bool	Indicates whether weapons were found	False
frisk_performed	bool	Indicates whether a frisk was performed	False
search_conducted	bool	Indicates whether a search was conducted	False
search_person	bool	Indicates whether a person was searched	False
search_vehicle	bool	Indicates whether a vehicle was searched	False
search_basis	string	Provides the reason for the search (k9, consent, plain view, probable cause, other)	consent
reason_for_stop	string	Provides the reason for the stop	moving traffic violation
vehicle_registration_state	string	Vehicle state origin (ex: TN for Tennessee)	TN
notes	string	Contains officer notes	CITED FOR NOT HAVING INSURANCE
raw_verbal_warning_issued	bool	Indicates whether a verbal warning was issued	False
raw_written_warning_issued	bool	Indicates whether a written warning was issued	False
raw_traffic_citation_issued	bool	Indicates whether a traffic citation was issued	False
raw_misd_state_citation_issued	bool	Indicates whether a misdemeanor state citation was issued	False
raw_suspect_ethnicity	string	Ethnicity of the suspect	H
raw_driver_searched	bool	Indicates whether a driver was searched	False
raw_passenger_searched	bool	Indicates whether a passenger was searched	False
raw_search_consent	bool	Indicates whether the search was conducted with consent	False
raw_search_arrest	bool	Indicates whether a search was due to an arrest	False
raw_search_warrant	bool	Indicates whether a search warrant was issued	False
raw_search_inventory	bool	Indicates whether an inventory search was conducted	False
raw_search_plain_view	bool	Indicates whether evidence was seized without a warrant	False

See Table 2.1.

The data file has been downloaded locally and is available inside the repository, as it is not too large. However, we noticed some patterns in the missing values that we will touch on later. Given these patterns and the fact that there are over three million rows spanning almost a decade, we decided to subset the data for ease of analysis and computation speed. Our subsetted data includes only stops from 2012 to 2016.

2.2 Research Plan

Studying trends in Nashville traffic stops provides insight into the bigger picture of law enforcement and social dynamics. To better understand policing strategy, one can consider any of the many factors that contribute to why someone is stopped. The type of car someone drives may attract more scrutiny from an officer, and this may differ according to time of day or day of week. Analyzing the intersection of race, gender, and age among those stopped can shed light on potential underlying biases in policing. Through our dive into the policing patterns in Nashville, we hope to tell a more transparent story of the law enforcement system.

We plan to touch on several research areas using our data:

To what extent do spatiotemporal factors play a part in traffic stops? This includes features like geographic location and time of day.
What are the demographic patterns of those stopped? Are those of a certain gender, age, or race overrepresented in their number of traffic stops?
What do the outcomes of traffic stops look like? To answer this we will look at whether the subject was searched, whether a citation was issued, etc.
Are there patterns in the types of cars that are stopped?
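
For the spatiotemporal question, features such as hour of day and day of week have to be derived from the `date` and `time` columns. The sketch below shows one possible way to do this with lubridate; `toy` is a made-up stand-in for the stops table, and we assume `time` can be converted to an "HH:MM:SS" string.

```r
library(dplyr)
library(lubridate)

# Toy stand-in for the stops table (values are made up for illustration)
toy <- tibble(
  date = as.Date(c("2016-09-09", "2016-09-11")),
  time = c("21:57:00", "08:05:00")
)

# Derive hour of day and day of week (1 = Monday, ..., 7 = Sunday)
toy_features <- toy |>
  mutate(
    hour = hour(hms(as.character(time))),
    weekday = wday(date, week_start = 1)
  )

print(toy_features)
```

Using a numeric weekday with an explicit `week_start` avoids locale-dependent day labels.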

2.3 Missing Value Analysis

Code

library(tidyverse)
library(lubridate)

nashville <- readRDS('Datasets/yg821jf8611_tn_nashville_2020_04_01.rds')
# Proportion of missing values in each column, sorted from lowest to highest
na_rate <- data.frame("Variable" = colnames(nashville), 
                "na_prop" = sapply(nashville, function(y) mean(is.na(y)))) |> 
  arrange(na_prop)

na_rate |> 
  mutate(Variable = factor(Variable, levels = na_rate$Variable)) |> 
  ggplot(aes(x = Variable, y = na_prop)) +
  geom_col() +
  coord_flip() + 
  labs(y = "Proportion of NA's") + 
  ggtitle("Proportion of NA Values by Variable") +
  theme_bw(13)

The majority of features have very few or zero missing values. Four columns (search_basis, contraband_weapons, contraband_found, notes) have a very high proportion of missing values (more than 75%). This is likely due to officers not recording values in these columns, so we will not consider these four features in our analysis moving forward.
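
Excluding columns above a missingness threshold can be sketched as follows. This is an illustration on made-up data rather than the exact step used in our analysis; the 0.75 cutoff mirrors the threshold mentioned above, and `toy`, `high_na`, and `toy_reduced` are hypothetical names.

```r
library(dplyr)

# Toy stand-in for the stops table (values are made up for illustration)
toy <- tibble(
  subject_age = c(18, 25, 40, 33, 51),
  notes = c(NA, NA, NA, NA, "CITED FOR NOT HAVING INSURANCE"),
  search_basis = rep(NA_character_, 5),
  violation = c("speeding", "equipment", NA, "speeding", "registration")
)

# Proportion of missing values per column
na_prop <- sapply(toy, function(y) mean(is.na(y)))

# Names of columns with more than 75% missing values
high_na <- names(na_prop)[na_prop > 0.75]

# Drop those columns and keep everything else
toy_reduced <- toy |> select(-all_of(high_na))

print(high_na)
print(names(toy_reduced))
```

`all_of()` makes the selection fail loudly if a listed column name does not exist, which guards against typos.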

Code

nashville |> 
  mutate(year = year(date)) |> 
  select(year, raw_misd_state_citation_issued, raw_written_warning_issued, zone,
         precinct, reporting_area, lat, lng) |> 
  group_by(year) |> 
  summarise(across(everything(), list(na_count = ~mean(is.na(.x))))) |> 
  pivot_longer(!year) |> 
  mutate(name = sub("_na_count$", "", name)) |> 
  ggplot(aes(x = name, y = value, fill = name)) + 
  geom_col() +
  coord_flip() +
  labs(y = "Proportion of NA's", x = "Variable") + 
  ggtitle("Proportion of NA's Faceted by Year")+
  facet_wrap(~year) +
  theme_bw(13) + 
  theme(axis.text.y=element_blank(),
        axis.ticks.y=element_blank(),
        legend.position="bottom") +
  scale_y_continuous(breaks = c(0, .5, 1),
    labels = c('0', '0.5', '1'))

To further analyze trends in missing values, we faceted the proportion of missing values by year for the variables with the most missing values, after excluding the four variables discussed above. We see that in 2017-2019 there is no data for raw_written_warning_issued, and in 2010-2011 the column raw_misd_state_citation_issued is mostly empty. This may be due to changes in recording practices over time.

Given that the full dataset has over three million rows, we decided to subset our data to 2014 only for a more manageable size. Our subsetted data has 413,114 rows.
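
A year-based subset of this kind can be sketched as below. `subset_years` is a hypothetical helper and `toy` is made-up data; we assume `date` is a Date column. Rows with a missing date are dropped by `filter()`.

```r
library(dplyr)
library(lubridate)

# Hypothetical helper: keep stops whose year falls in [from, to]
subset_years <- function(df, from, to = from) {
  df |> filter(between(year(date), from, to))
}

# Toy stand-in for the stops table (values are made up for illustration)
toy <- tibble(
  date = as.Date(c("2013-12-31", "2014-01-01", "2014-12-31", "2015-01-01", NA)),
  zone = c(223, 511, 223, 317, 511)
)

# Keep only stops made in 2014
toy_2014 <- subset_years(toy, 2014)
print(toy_2014)
```

Passing two years, for example `subset_years(toy, 2013, 2014)`, selects a multi-year window instead.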

As we move forward with our analysis, we want to keep as much data as possible. Furthermore, most of the columns have zero or only a handful of missing values. For these reasons we did not remove incomplete rows with NA's; instead, we will drop missing observations only when needed, and only for the columns analyzed in the current plot.
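
Dropping missing observations per plot, rather than globally, can be sketched with `tidyr::drop_na()`, which removes rows only when the listed columns are missing. The data below is made up for illustration, and `age_sex` is a hypothetical name for the table feeding one plot.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the stops table (values are made up for illustration)
toy <- tibble(
  subject_age = c(18, NA, 40, 33),
  subject_sex = c("female", "male", NA, "male"),
  zone = c(NA, 511, NA, 511)
)

# For a plot of age by sex, drop rows missing either of those two columns.
# Missing values in other columns (here, zone) do not cause a row to be dropped.
age_sex <- toy |> drop_na(subject_age, subject_sex)

print(age_sex)
```

The original table is left untouched, so a later plot on `zone` can apply its own `drop_na(zone)` without losing the rows removed here.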