We use R at Trend CT not just for data analysis and creating visualizations, but also for spatial analysis and creating geographic graphics.
If you have a data set with latitude and longitude information, it’s easy to just throw it on a map with a dot for every instance. But what would that tell you? You’d see how densely the dots cluster over the area, and nothing more. Without context or explanation, it’s just a flashy visualization.
This tutorial will show you how to dig deeper and tell better stories with location data.
For this tutorial, we’ll be working with traffic stop data, which Trend CT has written about extensively.
Goal: We’ll figure out which town and census tract each stop occurred in and then pull in demographic data from the Census to determine in which types of neighborhoods police tend to pull people over more often.
You could conduct this analysis using software like ArcGIS or QGIS, but we’re going to be doing it all in R.
It’s better to stay in a single environment, from importing data to analysis to exporting visualizations, because the resulting scripts make it easier for others (including your future self) to replicate and verify your work.
You can follow along with the chunks of code below or with the final R script.
Importing and preparing the data
Start with the data. It’s raw data on traffic stops made between 2013 and 2014. It includes race, reasons for the stop, and many other factors. The state of Connecticut collects this information from all police departments, but only a handful of them included location-specific information. Researchers at Central Connecticut State University’s Center for Municipal and Regional Policy geolocated as many stops as possible, focusing on eight departments that showed signs of racial profiling.
About 34,000 stops were geolocated.
Bring in the Hamden data.
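As a rough sketch of this import step, the chunk below writes a tiny made-up CSV to a temporary file and reads it back. The column names (stop_id, lat, lon) and the rows are hypothetical stand-ins, not the layout of the real Hamden file, and dropping rows without coordinates is our assumption about how to handle stops that could not be geolocated.

```r
# Toy stand-in for the Hamden file; columns and values are made up
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "stop_id,lat,lon",
  "1,41.3839,-72.9026",
  "2,41.3960,-72.8968",
  "3,,",
  "4,41.3712,-72.9310"
), csv_path)

stops <- read.csv(csv_path, stringsAsFactors = FALSE)

# Blank numeric fields are read as NA, so this keeps only geolocated stops
stops <- stops[!is.na(stops$lat) & !is.na(stops$lon), ]
nrow(stops)
```

With the real data, you would swap the temporary file for the path to your downloaded CSV.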
Importing the shapefiles
I’ve already downloaded and renamed the Connecticut census tracts shapefiles.
You can download the tracts for other states at the Census website.
If you don’t have those libraries installed yet, type in install.packages("maptools") and install.packages("ggplot2"). Do the same for every package mentioned later in this tutorial.
With this code, you’ve created towntracts and towntracts_only.
What’s the difference? towntracts is a dataframe and can be used for analyzing data and performing joins and calculations while towntracts_only is a Large SpatialPolygonsDataFrame, which is for rendering spatial data in R graphically.
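To make the distinction concrete, here is a minimal sketch using the sp package and two made-up square "tracts". The names tracts_only and tracts_df are stand-ins for towntracts_only and towntracts, and pulling the vertices out by hand is just one way to flatten a spatial object, not necessarily how the tutorial's dataframe was built.

```r
library(sp)

# Build a unit square starting at x0 (vertices listed clockwise and closed)
square <- function(x0, id) {
  coords <- cbind(c(x0, x0, x0 + 1, x0 + 1, x0),
                  c(0, 1, 1, 0, 0))
  Polygons(list(Polygon(coords, hole = FALSE)), ID = id)
}
shapes <- SpatialPolygons(list(square(0, "A"), square(1, "B")))
attrs  <- data.frame(tract = c("A", "B"), row.names = c("A", "B"))

# Spatial object: geometry plus one attribute row per polygon
tracts_only <- SpatialPolygonsDataFrame(shapes, attrs)

# Plain data frame: one row per vertex, tagged with its polygon ID
tracts_df <- do.call(rbind, lapply(tracts_only@polygons, function(p) {
  xy <- p@Polygons[[1]]@coords
  data.frame(long = xy[, 1], lat = xy[, 2], id = p@ID)
}))

class(tracts_only)
class(tracts_df)
```

The spatial object is what functions like over() expect, while the flat data frame is what you join other tables to and hand to ggplot2.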
Here’s what the dataframe towntracts looks like.
Mapping the data
Let’s visualize the census tract borders and the traffic stop locations on top of them.
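Here is a small sketch of that kind of layered map in ggplot2, assuming the tract borders are already in a flat data frame with one row per vertex. The two squares and the four points are toy data, and the object names are ours.

```r
library(ggplot2)

# Toy tract outlines: 'id' ties each vertex to its polygon
tract_lines <- data.frame(
  id   = rep(c("A", "B"), each = 4),
  long = c(0, 0, 1, 1,  1, 1, 2, 2),
  lat  = c(0, 1, 1, 0,  0, 1, 1, 0)
)
# Toy stop locations
stop_points <- data.frame(lon = c(0.2, 0.5, 0.8, 1.5),
                          lat = c(0.5, 0.4, 0.6, 0.5))

dot_map <- ggplot() +
  # Borders only: no fill, so the dots stay visible
  geom_polygon(data = tract_lines,
               aes(x = long, y = lat, group = id),
               fill = NA, color = "black") +
  # One semi-transparent dot per stop
  geom_point(data = stop_points,
             aes(x = lon, y = lat),
             color = "red", alpha = 0.5) +
  coord_equal()
```

Printing dot_map draws the borders first and the dots on top, because ggplot2 renders layers in the order they are added.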
Points in a polygon
Now it gets more complicated.
We want to count how many dots fell into each census tract border, or polygon.
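The counting step can be sketched with over() from the sp package, which reports which polygon each point falls in. Everything below is toy data: two unit squares stand in for tracts and five points stand in for stops, one of them deliberately outside both squares.

```r
library(sp)

# Unit square starting at x0 (vertices clockwise, ring closed)
square <- function(x0, id) {
  coords <- cbind(c(x0, x0, x0 + 1, x0 + 1, x0),
                  c(0, 1, 1, 0, 0))
  Polygons(list(Polygon(coords, hole = FALSE)), ID = id)
}
tracts <- SpatialPolygons(list(square(0, "A"), square(1, "B")))
ids    <- sapply(tracts@polygons, function(p) p@ID)

# Five toy stops; the last one lies outside both squares
pts <- SpatialPoints(cbind(lon = c(0.2, 0.5, 0.8, 1.5, 3.0),
                           lat = c(0.5, 0.5, 0.5, 0.5, 0.5)))

# For each point, the index of the polygon containing it (NA if none)
hit <- over(pts, tracts)

# Tally points per tract; table() drops the NA automatically
by_tract <- as.data.frame(table(tract = ids[hit]), responseName = "total")
by_tract
```

With real data, both layers need to share the same coordinate reference system before over() will compare them.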
The code above produced the dataframe by_tract. Here’s a sample of it.
Why did we bring in and join the additional file tracts_to_towns.csv?
Because Hamden police sometimes made traffic stops outside their jurisdiction. Now we can tell for sure in which towns police overextended themselves.
Making a choropleth
A choropleth map is a thematic map in which areas are shaded or patterned in proportion to the measurement of the statistical variable being displayed on the map. In this instance, it’s total traffic stops.
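As a minimal sketch of a choropleth in ggplot2, the chunk below joins toy counts onto toy polygon vertices and maps the count to the fill color. The data, the column names, and the explicit re-sorting after the join are our assumptions for illustration.

```r
library(ggplot2)

# Toy vertices for two square "tracts"; 'order' records drawing order
shapes <- data.frame(
  id    = rep(c("A", "B"), each = 4),
  order = rep(1:4, times = 2),
  long  = c(0, 0, 1, 1,  1, 1, 2, 2),
  lat   = c(0, 1, 1, 0,  0, 1, 1, 0)
)
counts <- data.frame(id = c("A", "B"), total = c(3, 1))

# Attach each tract's count to its vertices
plot_df <- merge(shapes, counts, by = "id")
# Restore vertex order explicitly so the polygons draw correctly
plot_df <- plot_df[order(plot_df$id, plot_df$order), ]

choropleth <- ggplot(plot_df,
                     aes(x = long, y = lat, group = id, fill = total)) +
  geom_polygon(color = "white") +
  coord_equal() +
  labs(title = "Toy choropleth", fill = "Stops")
```

Sorting by the vertex order after a join is a cheap safeguard: if the rows get shuffled, the polygon outlines come out scrambled.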
Excellent. This is a great start but we have a lot of gray tracts.
Let’s try again but without the gray.
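ggplot2 draws missing fill values in gray, so one way to lose the gray is to drop the tracts that have no count before plotting. This is a sketch on toy data, assuming the gray tracts are the ones with no stops after the join.

```r
tracts <- data.frame(id = c("A", "B", "C"))
counts <- data.frame(id = c("A", "B"), total = c(3, 1))

# all.x = TRUE keeps every tract; tract C has no stops, so total is NA
joined <- merge(tracts, counts, by = "id", all.x = TRUE)

# Keep only tracts that have a count, so nothing is drawn in gray
with_stops <- joined[!is.na(joined$total), ]
with_stops
```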
Much better. But it’s still unclear which tracts are in Hamden and which are in other towns.
We need to bring in an additional shapefile: town borders.
That additional line of geom_polygon() code layered on the Hamden town borders. So now we can see which tracts fell outside.
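The layering idea looks roughly like this: a second geom_polygon() with its own data and no fill, drawn on top of the shaded tracts. The shapes are toy squares, and the "town" outline here is a hypothetical border that covers only tract A.

```r
library(ggplot2)

# Toy shaded tracts with a count already attached
tracts <- data.frame(
  id    = rep(c("A", "B"), each = 4),
  long  = c(0, 0, 1, 1,  1, 1, 2, 2),
  lat   = c(0, 1, 1, 0,  0, 1, 1, 0),
  total = rep(c(3, 1), each = 4)
)
# Hypothetical town border that covers only tract A
town <- data.frame(long = c(0, 0, 1, 1),
                   lat  = c(0, 1, 1, 0))

layered <- ggplot() +
  geom_polygon(data = tracts,
               aes(x = long, y = lat, group = id, fill = total),
               color = "white") +
  # fill = NA leaves the choropleth visible under the outline
  geom_polygon(data = town,
               aes(x = long, y = lat),
               fill = NA, color = "black") +
  coord_equal()
```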
Analyzing the data
Before we move on, we also need to calculate the number of minority stops in Hamden census tracts.
To summarize the code above, we ran the over() function once more, but on a subset of the stops data limited to minority drivers.
Then we figured out the percent of minority drivers stopped per census tract.
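The percentage step is simple arithmetic once the two counts sit side by side. Here is a sketch with made-up numbers; the column names are ours, and treating a tract with no minority stops as zero is an assumption.

```r
# Toy counts per tract: all stops, and stops of minority drivers
total    <- data.frame(id = c("A", "B", "C"), total = c(200, 50, 10))
minority <- data.frame(id = c("A", "B"), minority = c(120, 10))

# Keep every tract that had stops, even if none were minority stops
by_tract <- merge(total, minority, by = "id", all.x = TRUE)
by_tract$minority[is.na(by_tract$minority)] <- 0

# Percent of each tract's stops that involved minority drivers
by_tract$minority_p <- round(by_tract$minority / by_tract$total * 100, 2)
by_tract
```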
This is the result.
Importing Census data
This is what the getCensus() function imported. We now know total and white population per census tract.
| Name | State | County | Tract | Total population | White population |
|---|---|---|---|---|---|
| Census Tract 101.01, Fairfield County, Connecticut | 9 | 1 | 10101 | 4392 | 4050 |
| Census Tract 101.02, Fairfield County, Connecticut | 9 | 1 | 10102 | 4134 | 3800 |
| Census Tract 102.01, Fairfield County, Connecticut | 9 | 1 | 10201 | 3387 | 2928 |
| Census Tract 102.02, Fairfield County, Connecticut | 9 | 1 | 10202 | 5066 | 4447 |
| Census Tract 103, Fairfield County, Connecticut | 9 | 1 | 10300 | 4209 | 3846 |
| Census Tract 104, Fairfield County, Connecticut | 9 | 1 | 10400 | 5701 | 4968 |
Next, we bring the two dataframes (traffic stops by census tract and population by census tract) together so we can calculate disparity.
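A sketch of that join and calculation follows. The population figures are the first two rows of the Census sample above; the stop percentages are made up, the column names are ours, and we assume "minority" means total population minus white population. Disparity here is the percent of stops involving minority drivers minus the percent of residents who are minorities.

```r
# Made-up stop percentages for two tracts (tract codes kept as text)
stops_by_tract <- data.frame(tract = c("10101", "10102"),
                             minority_p = c(30, 12),
                             stringsAsFactors = FALSE)
# Population figures from the Census sample above
pop_by_tract <- data.frame(tract = c("10101", "10102"),
                           total_pop = c(4392, 4134),
                           white_pop = c(4050, 3800),
                           stringsAsFactors = FALSE)

joined <- merge(stops_by_tract, pop_by_tract, by = "tract")

# Percent of residents who are not white
joined$minority_pop_p <-
  (joined$total_pop - joined$white_pop) / joined$total_pop * 100

# Positive values: minorities are stopped more than their population share
joined$disparity <- joined$minority_p - joined$minority_pop_p
joined
```

Keeping the tract code as text on both sides avoids a silent mismatch if one file stores it as a number and the other as a string.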
Here’s a summary of the results.
Visualizing geographic disparity
Finally, we can visualize this disparity.
This time, we’ll use a diverging color palette, PuOr.
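A diverging palette makes sense because disparity can be positive or negative. As a sketch, ggplot2's scale_fill_distiller() can apply the ColorBrewer PuOr palette to a continuous fill. The toy data and the symmetric limits (so that zero sits at the neutral midpoint) are our choices.

```r
library(ggplot2)

# Toy tracts: one with a positive disparity, one with a negative one
tracts <- data.frame(
  id        = rep(c("A", "B"), each = 4),
  long      = c(0, 0, 1, 1,  1, 1, 2, 2),
  lat       = c(0, 1, 1, 0,  0, 1, 1, 0),
  disparity = rep(c(22, -5), each = 4)
)

disparity_map <- ggplot(tracts,
                        aes(x = long, y = lat, group = id, fill = disparity)) +
  geom_polygon(color = "white") +
  # Symmetric limits keep zero at the palette's neutral midpoint
  scale_fill_distiller(type = "div", palette = "PuOr",
                       limits = c(-25, 25)) +
  coord_equal() +
  labs(fill = "Disparity (points)")
```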
Nice. Let’s put some annotations in there, though.
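One way to add annotations in ggplot2 is annotate(), which places text or line segments at fixed coordinates without needing a data frame. The label text and positions below are placeholders on toy data.

```r
library(ggplot2)

tracts <- data.frame(
  id   = rep(c("A", "B"), each = 4),
  long = c(0, 0, 1, 1,  1, 1, 2, 2),
  lat  = c(0, 1, 1, 0,  0, 1, 1, 0)
)

annotated <- ggplot(tracts, aes(x = long, y = lat, group = id)) +
  geom_polygon(fill = "grey85", color = "white") +
  # A pointer line from the label down into tract B
  annotate("segment", x = 1.5, xend = 1.5, y = 1.3, yend = 0.9,
           colour = "black") +
  # The label itself, placed just above the map
  annotate("text", x = 1.5, y = 1.4, label = "Placeholder label", size = 4) +
  coord_equal()
```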
Congratulations, you’ve made it.
We can see that the census tracts bordering the town of New Haven have the largest disparities. This means there are large gaps between the percent of minorities pulled over in those areas and the percent of minorities who actually live there. New Haven has a much higher percent minority population than Hamden, and police officers tend to focus their traffic enforcement near there.
Next steps? Well, police argue that they patrol areas with high levels of crime, so if we get the latitude and longitude of every crime sorted by type, we can also compare that to the traffic stops and see if that squares up. There are so many more possible stories. And now you know how to get started.