Google Sheets is a great free tool for working with data online. Previously known as Google Spreadsheets, users can import CSV or Excel files and access it on any computer with an internet connection.
The sharing options are also very powerful as multiple users can work in the same spreadsheet at once. Plus, Google’s search is great for tracking down that obscure data file created years ago.
However, it lacked the advanced functionality of Excel, such as pivot-tables.
Last year, Google introduced add-ons which allowed programmers to expand on its functionality. At the moment, there are some pretty useful add-ons, like random number generators, styling options, or string splitters.
We’ll be working with a data set of nail salon listings from Yelp that we collected for a recent story. The list was compiled by gathering up all salons within a 10-mile radius of every city in Connecticut. Since most towns are less than 10 miles wide, many stores were added to the data frame more than once.
There are less than 900 nail salons across the state but this scraper picked up more than 42,000 listings. That’s a lot of duplicate stores we’ll have to get rid of.
This tutorial will quickly go over how to remove duplicate rows from a spreadsheet.
- Download the nails_yelp_list.csv
- Sign into drive.google.com and create a new Google Sheet.
Importing the data
File > Import… > Upload
Installing the add-on
Add-ons > Get add-ons
Then search for Remove Duplicates.
Click free to install.
Removing duplicate rows
Select the column to search for duplicates in— url for this example.
Add-ons > Remove Duplicates > Find duplicates or uniques
Range has already been selected or reselect it here.
Many options to choose from here. For this example, we just want to focus on the duplicate instance and ignore the first occurrence.
Or you could isolate the non-duplicates if you wanted to find the nail salons that did not overlap within 10-miles of another town. Then select the Uniques option.
Some more advanced options on the target column.
What do you want to do with the identified duplicate rows? You could just change the background color or copy it into another spreadsheet entirely.
For this exercise, we want to delete it. Click Finish
Now, we wait. For this example, it will take a very long time for the browser to churn through 42,000 rows. Step outside for a break.
It took me about a couple of hours to churn through this data set, but now we’ve gone from 42,000 rows to more than 900. Not bad.
Oh, also you can run pivot tables in Google Sheets now. We’ll try that in the next walkthrough.