This is one of the most common problems in data:
John Doe. Jonathan Doe. Johnny Q. Doe.
Humans input items into a database, and they spell things differently — or wrong. Humans can figure out these are probably the same people. But a computer? Not so much.
This is often a problem with campaign finance data. When someone donates money to a candidate, they’ll spell their name one way; the next time, they’ll spell it another way.
So what is a computer to do?
The solution: Clustering
Some people build elegant algorithms to sort out names. But if you work at a nonprofit and you just want to sort through the 1,000 donors in your database, that isn’t the best solution.
Instead, we like to use Open Refine. It’s a program, first developed by Google, that cleans easy data. It has a vast array of talents, but in this post, we’ll be discussing a specific tool that solves this problem with multiple names. It’s called “clustering,” and it allows you to rein in big datasets so you can make them meaningful.
What you’ll need: Open Refine and a dataset
Go to the Open Refine website and download the program. There are installation instructions for Windows, Mac and Linux on the page.
When you open the program, you’ll notice that it opens in your browser and the URL is a whole bunch of numbers. This is normal.
If you have a dataset you want to work with, ignore this section. If you don’t, we’ll work with a dataset from the Hartford Open Data portal: All the cars towed in the last 30 days. You can download the spreadsheet here.
Getting started: Opening data
In Open Refine, click on “Create Project” on the left side. The default mode will let you open data from your computer. Click “Choose Files” and navigate to where you saved your tow data (or whatever data you’re working with.)
It will take a few seconds to read the dataset, but eventually you’ll get a screen that looks like this. I’ve annotated this so you can see what each section does:
If you don’t know what any of this means, it’s okay! Refine gives you a preview of what it thinks your data looks like. It usually does a good job of figuring out what kind of data you have. But sometimes, you have to mess with the different options to help it understand.
Click around until the data preview looks good. This sounds like ridiculous advice, but it works if you don’t know what these options mean.
When your data looks good, click on “Create Project.”
How Refine works
This isn’t Microsoft Excel. It’s not your usual spreadsheet program.
Everything you can do in Refine happens using the little “down” arrows next to the header names. The easiest thing you can do is click the down arrow, go to:
Facet > Text Facet
For example, if you did this on the “Make” column of the tow dataset, you’ll see there are tons of Hondas, Nissans and Toyotas that have been towed.
Some things you want to do with Refine require a little knowledge of programming. (Here a guide.) But for clustering, you don’t need any technical skills. It is perhaps the most impressive technical feat you can accomplish with no technical skills.
When you looked at the different ‘Makes’ of cars in the tow database, you probably noticed that there are mispelled duplicates. For exampe, you might see “NISSAN” and “NISS,” even though they both mean the same thing. This probably happens dozens, if not hundreds, of times in this dataset — and going through, by hand, and cleaning this can get absurd.
So, instead, go to the ‘Make’ column and click the down arrow. Go to:
Edit cells > Cluster and edit…
An additional window will pop up. Pay close atttention to the pull-down menus at the top called “Method” and “Key Functioning.” These are different algorithms Refine will use to try to find clusters.
Try different types of “Key Functions” and see what results you get. For example, here’s what I get when I use “metaphone3,” which is an algorithm that sorts by English pronunciation:
Notice how, in the above image, it correctly guessed matches for Toyota, Acura, Mitsubishi and Infiniti. If all the clusters are correct, click the box in the “merge” column and click “Merge Selected and Re-Cluster” at the bottom of the window. In the “New Cell Value” column, you can see what the entire cluster of names will be changed to — and edit that, if necessary.
Continue to merge and recluster until the only results you get are bad clusters. But inevitably, you’ll miss some column.
Click the “down” arrow on the “Make” column and create a “Text Facet” again, using — Facet > Text Facet.
You’ll see how many different values are in that column. If you did a lot of cleaning, it should be fewer than you started with:
Sort this list by “name” at the top of the facet window on the left of the window. When you see a duplicate, like “Chevy” and “Chevrolet,” hover over the incorrect value and click “EDIT” — which will allow you to edit that value. So, here, we’re changing “Chevy” to “Chevrolet.” Then click “Apply.”
Go through your list and do this further. This is how you ensure your data is 100 percent clean. I have yet to find a better way to do this; it’s a little painstaking, but it’s faster than any other method I’ve tried.
After cleaning it up alphabetically, I like to sort by “count” (using the button at the top of the facet window). From there, I look through values with 1 or 2 entries — because, if there’s a super weird misepelling, like spelling “Nissan” as “His5an” you’ll find it down here.
Clustering people’s names can be a little tough because there are often more than one “Jennifer Smith” in a database. So you can use other columns, like addresses, to help you. During the clustering phase, you can see just the clustered entries by hovering on the cluster and clicking on “Browse this cluster.” If addresses match, you probably have a match. But people who name their kids after themselves often make this harder.
Download the cleaned data
After your data is clean, go take a walk. You have time. You just saved yourself a lot of time.
Then click “Export” on the top-right of your window and click the type of file you want to download. If you want to open it in Microsoft Excel, click on “Excel” or “Comma-separated value.”
You can close the program and all your work will be saved. You can even undo the transformations you made on the top-left of the window, under the “Undo/Redo” tab.
What this is useful for
Anytime a human enters text into a database by hand — like the tow data — you’ll have messy entries, which you can clean up with this method. I use clustering anytime a database is created by transcribing paper files into a database, like campaign finance data.
Sometimes, you can find interesting patterns in the mistakes. For example, police officers in Hartford don’t like to spell out “Mitsubishi” when they file their reports.
Have a better way to do this? Have a question? Put it in the comments, and we’ll response prompty!