Websites are full of structured data that would be useful to journalists and data enthusiasts if only they could get their hands on it: listings on online directories, real estate sites, social networks, online shopping sites and government agency landing pages.
However, most sites don’t offer the functionality to copy the data into a nice, machine-readable format like a spreadsheet. If you want that data, you often have to manually copy and paste from the website into Excel or Google Sheets.
This takes forever, and most people can’t afford to make an intern do it or use a service like Amazon Mechanical Turk.
What is web scraping?
Web scraping (also known as screen scraping, web data extraction or web harvesting) is a programming technique to automate the copying and pasting process in a fraction of the time.
Some methods include using libraries from programming languages like Python and R. There are also programs that have been built specifically for web scraping.
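To get a feel for what those scraping libraries do under the hood, here’s a minimal sketch in Python using only the standard library. The page, tag names, class names and listings are all made up for illustration; real sites are messier, and purpose-built libraries handle the messiness for you.

```python
from html.parser import HTMLParser

# A tiny hardcoded page standing in for a real listings site.
# The tags, class names and listings are invented for this example.
PAGE = """
<ul>
  <li class="listing"><span class="title">Cozy loft</span><span class="price">$90</span></li>
  <li class="listing"><span class="title">Sunny room</span><span class="price">$65</span></li>
</ul>
"""

class ListingParser(HTMLParser):
    """Collects (title, price) pairs from span.title / span.price elements."""

    def __init__(self):
        super().__init__()
        self.rows = []        # finished (title, price) tuples
        self.field = None     # which field we're currently inside, if any
        self.current = {}

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class", "")
        if tag == "span" and css_class in ("title", "price"):
            self.field = css_class

    def handle_data(self, data):
        if self.field:
            self.current[self.field] = data.strip()
            self.field = None
            if "title" in self.current and "price" in self.current:
                self.rows.append((self.current["title"], self.current["price"]))
                self.current = {}

parser = ListingParser()
parser.feed(PAGE)
print(parser.rows)  # each tuple would become one spreadsheet row
```

Each listing becomes one row, which is exactly the copy-and-paste-into-a-spreadsheet chore the scraper automates.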
In this tutorial, we’ll go over how to use one of them: Import.io.
Is web scraping ethical?
I believe it is, because there’s no fundamental difference between a person copying and pasting the information and a computer doing it.
However, there are ways in which it can be unethical.
- Taking and displaying personal information.
- Scraping too much, too often, overloading the site’s server with your requests in something akin to a “Denial of Service” attack.
- Copying and republishing a site’s content wholesale.
Don’t do any of the above. It’s not nice.
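If you do write your own scraper, one easy way to stay polite is to pause between requests. Here’s a small sketch; the delay value and the function names are my own choices for illustration, not anything a particular site requires.

```python
import time

def fetch_politely(urls, fetch, delay_seconds=2.0):
    """Call fetch(url) for each URL, pausing between requests
    so the site's server isn't flooded."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay_seconds)  # space out requests to avoid DoS-like load
        results.append(fetch(url))
    return results

# Demo with a stand-in fetch function (no network needed).
def fake_fetch(url):
    return "fetched " + url

print(fetch_politely(["https://example.com/1", "https://example.com/2"],
                     fake_fetch, delay_seconds=0.1))
```

In real use you would pass an actual download function instead of `fake_fetch`, and a couple of seconds between requests is a reasonable starting point.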
Difficulty level: Beginner.
Download the Import.io app and install it. There are versions for Mac and PC.
Navigating the program
Start it up and click on New at the top right.
You’ve got a lot of options.
- Magic – Tries to detect the data and structure based on the URL you feed it
- Extractor – Gives you more control by letting you choose what specific data you want
- Classic Extractor – Same as Extractor but in a different format
- Authenticated Extractor – Same as Extractor but for websites you have to log into first
- Crawler – A more advanced version that gives you even more control
- Connector – Sort of a mashup between an extractor and a crawler
Go ahead and click on Magic under the New pulldown menu.
This is a list of “most popular” sites that people like to scrape.
Intro to Zillow data
Click on the Zillow button.
This is the data that Import.io detected on Zillow.com, along with the structure it guessed.
Do you see the similarities to the data from the original website?
Click on Download CSV at the bottom right (Or Extract Data at the top right).
You can select how many pages of data to download.
This is what the exported data looks like.
It’s rough and needs cleaning up, but now you can do a bunch of data analyses on it.
What do you think you can do with this type of data?
Scraping from Airbnb
Click New > Extractor
Let’s take a look at Airbnb data.
Type airbnb.com into the URL bar at the top of the app.
Search for Hartford, CT.
Click the ON button at the top left.
When it asks “Are you sure?”, click Yes.
We need to train the Import.io app to recognize the data and sort it into a data frame.
Hover over the data you want to train it on.
Start with the $90 price and hover until it turns orange.
We want to turn each of these listings into a row of a spreadsheet.
Select Many rows.
When it asks whether you’d like to add the link too, say yes.
We’re going to need it later.
We can continue to click on +New Column and hover and click for every facet that we want…
Or we can just click Suggest Data and it will guess what we want.
Go ahead and click it.
OK, that’s a lot of interesting data.
It’s pretty much what we want.
Click Done at the top right.
You’ve created an API.
Name it and click Publish.
You’ll be brought to your Import.io library.
The problem here is that you’ve only got one page of data.
That means 18 results from the page. There might be hundreds more.
Fortunately, there’s an option for Bulk Extract.
Give it a list of URLs with a structure similar to the original one and it will pull out the data.
We need to get the links for the other pages.
Right click on the 2 link at the bottom and copy the link.
The link looks like this:
And it appears there are 13 pages total.
You could paste this link into the Import.io box over and over, changing the number from 2 to 3 to 4, and so on.
But here’s a quicker way.
Open up Google Spreadsheets.
The only thing that changes from page to page in the URL is the number at the very end.
So let’s build a formula that lets us change it easily.
Delete the 2 from the end of the URL and type it into the column to the right. You’re separating the page number from the rest of the URL.
Drag the URL down so that it copies down for 13 rows.
Fill in the numbers so they count up incrementally.
Now, let’s use a formula in the next column to join the two back together. In Google Sheets, =A2&B2 concatenates two cells (assuming the URL is in column A and the page number in column B).
Drag the formula down so it fills out.
Now you have a list of URLs.
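If you’d rather skip the spreadsheet, the same URL list can be generated in a few lines of Python. The base URL below is a placeholder: substitute the page-2 link you copied, minus its trailing page number.

```python
# Placeholder base URL: use the link you copied, with the trailing
# page number deleted. The real URL will look different.
base = "https://www.airbnb.com/s/Hartford--CT?page="

# Pages 1 through 13, one URL per page.
urls = [base + str(page) for page in range(1, 14)]

print(len(urls), "URLs, from", urls[0], "to", urls[-1])
```

Paste the printed list straight into the Bulk Extract box.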
Copy that column over to the Import.io app and paste it into the Bulk Extract box.
Click Run Query.
It will tell you that 14/14 URLs have been processed. (One URL failed because it looked for page 14 when the results only go up to page 13.)
Click the Export button at the top right when you’re done.
Now you have a new data set to analyze.
The data set also includes links to each Airbnb listing’s own page, which has more specific information, such as amenities and neighborhood.
It’s a little rough and will require thorough cleaning, but it’s a start. You’ll definitely need to eliminate duplicates, but fortunately there’s another Trend CT tutorial on that.
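If you’re comfortable with a little Python, duplicates can also be removed outside the spreadsheet. This sketch uses made-up rows to show the idea: keep the first copy of each identical row and drop the rest.

```python
import csv
import io

# A made-up export with one duplicated row; duplicates crop up when the
# same listing shows up on more than one results page.
raw = """title,price,url
Cozy loft,$90,https://example.com/a
Sunny room,$65,https://example.com/b
Cozy loft,$90,https://example.com/a
"""

rows = list(csv.reader(io.StringIO(raw)))
header, body = rows[0], rows[1:]

seen = set()
deduped = []
for row in body:
    key = tuple(row)          # the whole row is the duplicate key
    if key not in seen:
        seen.add(key)
        deduped.append(row)   # keep only the first copy

print(len(body), "rows before,", len(deduped), "after")
```

With a real export, you would read the CSV file Import.io produced instead of the hardcoded string.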