Cleaning up CT place names in Python

Print More

When we analyze Connecticut data sets, we often have to clean them up first, translating the names of villages into the towns they’re part of.

Andrew Ba Tran published CTNameCleaner, a web tool to do just that, as well as a package for users of the R programming language back in April.

I use Python, not R, so I put together a similar package for Python and Pandas users, called CTNameCleaner-py.


If you don’t want to do any programming and just want a web tool to clean up Connecticut place names in spreadsheet files, use this web tool.

If you’re not a Python programmer, but you use the command line to get work done, the first half of this tutorial will still apply to you.

The second half of this document is for Python programmers, especially those who do data analysis with Pandas.


I use pip, but any Python package manager that talks to the Python Package Index PyPi should work.

Installation with pip from the command line on Mac OS/Linux.

$ [sudo] pip install ctnamecleaner

You can also download and unzip the source from the PyPi package page, cd to the directory and run:

$ python install

To see the installed package, if you’re using pip, run:

$ pip list | grep ctnamecleaner
ctnamecleaner (0.10)

This will install a command line script called ctclean as well as the main package, ctlookup that you can use in your Python code.

Command line: Basics

To get the proper town name of a village from the command line, run:

$ ctclean New\ Preston

Note that the space had to be escaped or it would be treated as two separate arguments. It could also be escaped like this:

$ ctclean "New Preston"

In cases where ctclean can’t resolve the town name, it will return an error string. By default it will display None.

$ ctclean New\ York 

But you can specify a custom error value with the -e or --error flag:

$ ctclean New\ York --error Ruh\ Roh
Ruh Roh
Command line: Processing spreadsheets

If you want to process a spreadsheet with a column of place names, you’ll need to specify an input file (–icsv) and optionally an output file (–ofile) an input column (–icol) and an output column (–ocol) like this:

$ ctclean --icsv Messy_Names.csv --icol town --ocol clean_town --ocsv Clean_Names.csv

(You can find the sample spreadsheets in the demo folder in the GitHub repository for CTNameCleaner-Py)

A few more uses: If you set the output column to the input column, it will be overwritten with the cleaned-up values. If you don’t specify and output file, the csv will be written to stdout (printed on the screen). Putting those together with a custom error value looks like this:

$ ctclean --icsv demo/command_line_demo/Messy_Names.csv --icol town --ocol town --error Ruh\ Roh
Ruh Roh,10000

Finally, if you want to force ctclean to use the latest version of the name cleaner spreadsheet, add the --latest flag. Note that this will be significantly slower than normal usage, since it has to retrieve the clean town name spreadsheet from the web server.

(To view usage instructions, run ctclean --help).

Command line: Shell scripting

You can combine ctclean with any of your other favorite command line utilities, like xargs.

Let’s say you have a text file called Town_Names that looks like this:

"New Preston"
"New MIlford"
"New York"

To convert it to proper names, you might run:

$ cat Town_Names | xargs -L1 ctclean --error TOWN\ NOT\ FOUND

To save the output to a file called Cleaned_Up, you can redirect the output to a file:

$ cat Town_Names | xargs -L1 ctclean --error TOWN\ NOT\ FOUND > Cleaned_Up
Python: Cleaning Pandas DataFrames

In this section we’ll look at cleaning up a spreadsheet within Pandas. If you’re not familiar with Pandas for data analysis, the rest of this tutorial won’t be very useful for you. If you use R, this isn’t the guide for you, but this is. If you want to learn Pandas for data analysis, this tutorial has plenty of beginner resources.

The demo code for this section is available in a Jupyter notebook here.

Start by importing pandas:

import pandas as pd

The library that is installed with ctnamecleaner is called ctlookup. It can be imported with:

from ctlookup import lookup

To create a Lookup object:

l = lookup.Lookup()

Let’s create a sample DataFrame to clean up. If you have one you want to work with, use that instead:

tmpdf = pd.DataFrame([["New Preston",1,2],

It looks like this:

Screen Shot 2016-08-29 at 11.33.36 AM

Now to clean up the values in the “TOWN” column and assign them to a new column called “CLEAN_TOWN”, we use the .clean_dataframe() method. It takes two arguments, a dataframe and a column to clean up. We use the assignment operator to add it to the dataframe:

tmpdf["TOWN_REAL"] = l.clean_dataframe(tmpdf,"TOWN")

Now the dataframe looks like this:

Screen Shot 2016-08-29 at 11.37.59 AM

If we wanted to overwrite the values in the original “TOWN” column, instead of creating a new one, we could just use “TOWN” instead of “TOWN_REAL” in the above code:

tmpdf["TOWN"] = l.clean_dataframe(tmpdf,"TOWN")

By default, town names that aren’t found return None/NaN, but you can override that value by passing the keyword argument error:

tmpdf["TOWN_REAL"] = l.clean_dataframe(tmpdf,"TOWN",error="NOT FOUND")
Reporting town names and bugs

This tool works by looking up village names in a spreadsheet, which we maintain. If you come across village names that aren’t translating and should be, let us know.

The code for this tool is all open source. Please feel free to contribute code to the GitHub repository.

Advanced: Extending with custom data

The ctlookup Python library can also be extended with other data sets, say for town names in New York or Kentucky, or for populations. It’s too advanced for the scope of this tutorial, but know that it can be done. Extensibility is somewhat documented in the GitHub repository Readme, as well as in the source code itself, but that is too advanced for the scope of this document.

What do you think?

  • comprocess

    Amazingly useful tool, Jake. I’ll share with my networks and collaborators. Thanks! Emil

  • Joseph Brzezinski

    The clean up process for location names is extremely important as many public and commercial data sources have incompatibilitied. Name cleansing is also important when changing from, say, zip code to town or county or municipal area.
    The app or language isn’t necessarily a major considerstion. I usually use Alteryx but some others may use SAS, SPSS, a database processor or even Excel spreadsheets. Users need to understand what needs to be done more than the tool to do it!