When we analyze Connecticut data sets, we often have to clean them up first, translating the names of villages into the towns they’re part of.
Andrew Ba Tran published CTNameCleaner, a web tool to do just that, as well as a package for users of the R programming language back in April.
I use Python, not R, so I put together a similar package for Python and Pandas users, called CTNameCleaner-py.
If you don’t want to do any programming and just want a web tool to clean up Connecticut place names in spreadsheet files, use this web tool.
If you’re not a Python programmer, but you use the command line to get work done, the first half of this tutorial will still apply to you.
The second half of this document is for Python programmers, especially those who do data analysis with Pandas.
I use pip, but any Python package manager that talks to the Python Package Index PyPi should work.
Installation with pip from the command line on Mac OS/Linux.
$ [sudo] pip install ctnamecleaner
You can also download and unzip the source from the PyPi package page,
cd to the directory and run:
$ python setup.py install
To see the installed package, if you’re using pip, run:
$ pip list | grep ctnamecleaner ctnamecleaner (0.10)
This will install a command line script called
ctclean as well as the main package,
ctlookup that you can use in your Python code.
Command line: Basics
To get the proper town name of a village from the command line, run:
$ ctclean New\ Preston WASHINGTON
Note that the space had to be escaped or it would be treated as two separate arguments. It could also be escaped like this:
$ ctclean "New Preston" WASHINGTON
In cases where
ctclean can’t resolve the town name, it will return an error string. By default it will display
$ ctclean New\ York None
But you can specify a custom error value with the
$ ctclean New\ York --error Ruh\ Roh Ruh Roh
Command line: Processing spreadsheets
If you want to process a spreadsheet with a column of place names, you’ll need to specify an input file (–icsv) and optionally an output file (–ofile) an input column (–icol) and an output column (–ocol) like this:
$ ctclean --icsv Messy_Names.csv --icol town --ocol clean_town --ocsv Clean_Names.csv
(You can find the sample spreadsheets in the demo folder in the GitHub repository for CTNameCleaner-Py)
A few more uses: If you set the output column to the input column, it will be overwritten with the cleaned-up values. If you don’t specify and output file, the csv will be written to stdout (printed on the screen). Putting those together with a custom error value looks like this:
$ ctclean --icsv demo/command_line_demo/Messy_Names.csv --icol town --ocol town --error Ruh\ Roh town,population WASHINGTON,100 BRIDGEWATER,110 RIDGEFIELD,300 Ruh Roh,10000
Finally, if you want to force ctclean to use the latest version of the name cleaner spreadsheet, add the
--latest flag. Note that this will be significantly slower than normal usage, since it has to retrieve the clean town name spreadsheet from the web server.
(To view usage instructions, run
Command line: Shell scripting
You can combine ctclean with any of your other favorite command line utilities, like
Let’s say you have a text file called
Town_Names that looks like this:
"New Preston" Marbledale "New MIlford" "New York" Bridgeport
To convert it to proper names, you might run:
$ cat Town_Names | xargs -L1 ctclean --error TOWN\ NOT\ FOUND WASHINGTON WASHINGTON NEW MILFORD TOWN NOT FOUND BRIDGEPORT
To save the output to a file called
Cleaned_Up, you can redirect the output to a file:
$ cat Town_Names | xargs -L1 ctclean --error TOWN\ NOT\ FOUND > Cleaned_Up
Python: Cleaning Pandas DataFrames
In this section we’ll look at cleaning up a spreadsheet within Pandas. If you’re not familiar with Pandas for data analysis, the rest of this tutorial won’t be very useful for you. If you use R, this isn’t the guide for you, but this is. If you want to learn Pandas for data analysis, this tutorial has plenty of beginner resources.
The demo code for this section is available in a Jupyter notebook here.
Start by importing pandas:
import pandas as pd
The library that is installed with ctnamecleaner is called ctlookup. It can be imported with:
from ctlookup import lookup
To create a Lookup object:
l = lookup.Lookup()
Let’s create a sample DataFrame to clean up. If you have one you want to work with, use that instead:
tmpdf = pd.DataFrame([["New Preston",1,2], ["Marbledale",110,23], ["Bridgeport",100,97], ["B_water",12,44] ], columns=["TOWN","POP","GDP"])
It looks like this:
Now to clean up the values in the “TOWN” column and assign them to a new column called “CLEAN_TOWN”, we use the
.clean_dataframe() method. It takes two arguments, a dataframe and a column to clean up. We use the assignment operator to add it to the dataframe:
tmpdf["TOWN_REAL"] = l.clean_dataframe(tmpdf,"TOWN")
Now the dataframe looks like this:
If we wanted to overwrite the values in the original “TOWN” column, instead of creating a new one, we could just use “TOWN” instead of “TOWN_REAL” in the above code:
tmpdf["TOWN"] = l.clean_dataframe(tmpdf,"TOWN")
By default, town names that aren’t found return None/NaN, but you can override that value by passing the keyword argument
tmpdf["TOWN_REAL"] = l.clean_dataframe(tmpdf,"TOWN",error="NOT FOUND")
Reporting town names and bugs
This tool works by looking up village names in a spreadsheet, which we maintain. If you come across village names that aren’t translating and should be, let us know.
The code for this tool is all open source. Please feel free to contribute code to the GitHub repository.
Advanced: Extending with custom data
The ctlookup Python library can also be extended with other data sets, say for town names in New York or Kentucky, or for populations. It’s too advanced for the scope of this tutorial, but know that it can be done. Extensibility is somewhat documented in the GitHub repository Readme, as well as in the source code itself, but that is too advanced for the scope of this document.