Tutorial: Spatial analysis of census and traffic stop data using R

Print More
Screen Shot 2016-08-06 at 8.37.21 PM

We use R at Trend CT not just for data analysis and creating visualizations, but also for spatial analysis and creating geographic graphics.

If you have a data set with latitude and longitude information, it’s easy to just throw it on a map with a dot for every instance. But what would that tell you? You see the intensity of the cluster of dots over the area but that’s it. If there’s no context or explanation it’s a flashy visualization and that’s it.

This tutorial will show you how to dig deeper and tell better stories with location data.

For this tutorial, we’ll be working with traffic stop data, which Trend CT has written about extensively.

Goal: We’ll figure out which town and census tract each stop occurred in and then pull in demographic data from the Census to determine what types of neighborhoods police tend to pull people over more often.

You could conduct this analysis using software like ArcGIS or QGIS, but we’re going to be doing it all in R.

It is better to stay in a single environment from data importing, to analysis, to exporting visualizations because the produced scripts make it easier for others (including your future self) to replicate and verify your work in the future.

Difficulty Level


We’ll be using many packages, including dplyr (for pipes), as well as working with shape files and census data. You should have a basic understanding of these concepts before continuing.

Getting started

Go through the beginner tutorial to get R and RStudio installed before you begin.

You must download this zip of all the files. It includes the original csv and shape files needed for this project.

And then you can follow along with the chunks of code below or with the final R script.

Importing and preparing the data

Start with the data. It’s raw traffic stops between 2013 and 2014. It includes race, reasons for the stop, and many other factors. The state of Connecticut collects this information from all police departments but only a handful of them included location-specific information. Researchers at Central Connecticut State University’s Center for Municipal and Regional Policy geolocated as many as possible, focusing on eight departments that showed signs of racial profiling.

About 34,000 stops were geolocated.

We’re looking specifically at Hamden for this tutorial— so about 5,500 stops. If you want to see the other data available, check our Github repo or scripts.

Bring in the Hamden data.

Importing the shapefiles

I’ve already downloaded and renamed the Connecticut census tracts shapefiles.

You can download the tracts for other states at the Census website.

We’re using the libraries maptools and ggplot2.

If you don’t have those libraries installed yet, type in install.packages(“maptools”) and install.packages(“ggplot2”). Do this for all future mentions of packages in this tutorial.

With this code, you’ve created towntracts and towntracts_only.

What’s the difference? towntracts is a dataframe and can be used for analyzing data and performing joins and calculations while towntracts_only is a Large SpatialPolygonsDataFrame, which is for rendering spatial data in R graphically.

Here’s what the dataframe towntracts looks like.

long lat order hole piece id group
-73.66336 41.0417 1 FALSE 1 9001010101 9001010101
-73.66348 41.04147 2 FALSE 1 9001010101 9001010101
-73.66362 41.04115 3 FALSE 1 9001010101 9001010101
-73.6641 41.04023 4 FALSE 1 9001010101 9001010101
-73.66428 41.03983 5 FALSE 1 9001010101 9001010101
-73.66451 41.03934 6 FALSE 1 9001010101 9001010101

Mapping the data

Let’s visualize the census tracts borders and the traffic stop locations on top of it.

Points in a polygon

Now it gets more complicated.

We want to count how many dots fell into each census tract border, or polygon.

The code above turned out the dataframe by_tract. Here’s a sample of it.

9 9 166001 9009166001 1660.01
9 9 165500 9009165500 1655
9 9 165600 9009165600 1656
9 9 141900 9009141900 1419
9 9 165900 9009165900 1659

Why did we we bring in and join the additional dataframe tracts_to_towns.csv?

Because Hamden sometimes made traffic stops outside of its jurisdiction. Now we can tell for sure which towns police overextended themselves.

Making a choropleth

A choropleth map is a thematic map in which areas are shaded or patterned in proportion to the measurement of the statistical variable being displayed on the map. In this instance, it’s total traffic stops.

Excellent. This is a great start but we have a lot of gray tracts.

Let’s try again but without the gray.

Much better. But we’re still unclear which part is Hamden and which are parts of other towns.

We need to bring in an additional shape file— town borders.

That additional line of geom_polygon() code layered on the Hamden town borders. So now we can see which tracts fell outside.

Analyzing the data

Before we move on, we need to also calculate the number of minority stops in Hamden census tracts.

To summarize the code above, we ran the over() function once more, but to a subset of the stops data that was specifically of minority drivers.

Then we figured out the percent of minority drivers stopped per census tract.

This is the result.

id total town_name minority minority_p
9003400200 1 Berlin NA NA
9009141300 7 New Haven 6 85.71
9009141400 3 New Haven 2 66.67
9009141500 96 New Haven 79 82.29
9009141600 1 New Haven 1 100
9009141800 6 New Haven 4 66.67

Importing Census data

We’ll be using the censusapi package from Hanna Recht. It has excellent documentation.

This is what the getCensus() function imported. We now know total and white population per census tract.

NAME state county tract B02001_001E B02001_002E
Census Tract 101.01, Fairfield County, Connecticut 9 1 10101 4392 4050
Census Tract 101.02, Fairfield County, Connecticut 9 1 10102 4134 3800
Census Tract 102.01, Fairfield County, Connecticut 9 1 10201 3387 2928
Census Tract 102.02, Fairfield County, Connecticut 9 1 10202 5066 4447
Census Tract 103, Fairfield County, Connecticut 9 1 10300 4209 3846
Census Tract 104, Fairfield County, Connecticut 9 1 10400 5701 4968

Calculating disparity

Next, we bring the two dataframes (traffic tickets by census tract and population by census tract) together so we can calculate disparity.

Here’s a summary of the results.

tract_code min_disp
400200 NA
141300 35.85
141400 -12.17
141500 -11.99
141600 22.19
141800 11.23

Visualizing geographic disparity

Finally, we can visualize this disparity.

This time, we’ll use a diverging color palette, PuOr.

Nice. Let’s put some annotations in there, though.


Congratulations, you’ve made it.

We can see the census tracts bordering the town of New Haven has the largest disparities. This means there are large gaps between the percent of minorities pulled over in those areas compared to the percent of minorities who actually live there. New Haven has a much higher percent minority population than Hamden and police officers tend to focus their traffic enforcement by there.

Next steps? Well, police argue that they patrol areas with high levels of crime, so if we get the latitude and longitude of every crime sorted by type, we can also compare that to the traffic stops and see if that squares up. There are so many more possible stories. And now you know how to get started.

What do you think?