A Trend CT tool to help work with Connecticut data

Print More

Work with Connecticut-related data long enough and you’ll find yourself repeating the same processes over and over. It’s rare when data comes in a clean-enough format to immediately conduct an exploratory analysis.

A lot of time is spent cleaning and transforming the data so it can be joined with other data for further analysis.

For example, locals might fill in their hometown as North Greenwich, but data from the Census would not recognize that as a standalone town because that hamlet is part of Greenwich. That name would have to be replaced before I could successfully combine it with relevant data, like town population count, to calculate per capita statistics.

Cleaning it by hand can be a time-consuming process.

So I wrote the R package CTNamecleaner to save everyone time.

Essentially, it searches for place names and finds the corresponding town in a Google sheet.

Screen Shot 2016-03-31 at 4.32.31 PM

It’s an ever-growing list that several data folks across Connecticut contribute to as we come across different versions of town names.

This tutorial will go over the quick and easy way to:

  • append a column with the matching official town names
  • append a column with the matching population count

We will go over two different methods to do so

Difficulty level

Beginner. Especially if just using the app at shiny.trendct.org/ctnamecleaner.

Intermediate if using R, but only because it requires background knowledge of dataframes and how functions work in R.

How to use shiny.trendct.org/ctnamecleaner

Go to shiny.trendct.org/ctnamecleaner

Screen Shot 2016-03-31 at 11.50.35 AM
  • Upload a CSV (comma separated value) of your data
    • Excel spreadsheets (xlxs files) will not work
    • The file size limit is 5 mb. Anything larger, use the R package
  • Select the column that has the town names
  • Select whether you want to add a column of clean town names or population
  • Click the Churn button (loading may take a few moments)
Screen Shot 2016-03-31 at 11.57.46 AM

The app will tell you if there were any town names that could not be matched. It will leave that cell in the real.town.name column blank.

Here’s how it looks if Population was selected.

Screen Shot 2016-03-31 at 12.26.41 PM

Download the file and it will add “_appended.csv” to the end of your original file name.

Getting started with CTNamecleaner (R)

The app has a 5 mb file size limit. If you have a bigger dataframe, use the R package.

If you don’t yet have R installed:

Run the following lines of code to install the CTNamecleaner package.

install.packages("devtools")
library(devtools)

install_github("trendct/ctnamecleaner")
library(ctnamecleaner)

Cleaning up town names in R

Let’s assume you have a dataframe in R called towncoffeeshops that looks like

Town Coffeeshops
Andover 2
Centerbrook 5
Yalesville 1

Run this command in R

ctnamecleaner(Town, towncoffeeshops, filename="towncoffeecleaned", case="Upper")

You’ll get a new file called towncoffeecleaned.csv that looks like

Town Coffeeshops real.town.name
Andover 2 ANDOVER
Centerbrook 5 ESSEX
Yalesville 1 WALLINGFORD

Alternatively, you can create a new dataframe without exporting with this function.

new_dataframe <- ctnamecleaner(Town, towncoffeeshops)

So, to summarize, these are the arguments for the ctnamecleaner function:

  • name - Column with town names
  • data - Name of data frame
  • filename - Name of CSV to save. If skipped, CSV will not export
  • case - Output of town string. Options are Upper, Lower, and Title
Appending town population in R

This is exactly like the ctnamecleaner function but instead, the function is ctpopulator.

Let's assume you have a dataframe in R called towncoffeeshops that looks like

Town Coffeeshops
Andover 2
Centerbrook 5
Yalesville 1

Run this command in R

ctpopulator(Town, towncoffeeshops, filename="towncoffeecleaned", case="Upper")

You'll get a new file called towncoffeecleaned.csv that looks like

Town Coffeeshops pop2013
Andover 2 3016
Centerbrook 5 9009
Yalesville 1 1600

Alternatively, you can create a new dataframe without exporting with this function.

new_dataframe <- ctpopulator(Town, towncoffeeshops)

So, to summarize, these are the arguments for the ctpopulator function:

  • name - Column with town names
  • data - Name of data frame
  • filename - Name of CSV to save. If skipped, CSV will not export
  • case - Output of town string. Options are Upper, Lower, and Title

Let us know if you have any questions or any suggestions on how to improve this package. It's all open source in our github repo, so pull requests are welcome!

What do you think?

  • comprocess

    Amazingly useful, Andrew; it saves a lot of work/time. I’ll share with my networks and collaborators. Thanks! Emil