How to scrape website data without programming


There’s so much structured data on websites that would be useful for journalists or data enthusiasts if only they could get their hands on it, like listings on online directories, real estate sites, social networks, online shopping sites and government agency landing pages.

However, most sites don’t offer the functionality to copy the data into a nice, machine-readable format like a spreadsheet. If you want that data, you often have to manually copy and paste from the website into Excel or Google Sheets.

This takes forever, and most people can’t afford to make an intern do it or use a service like Amazon Mechanical Turk.

What is web scraping?

Web scraping (also known as screen scraping, web data extraction or web harvesting) is a technique for automating that copy-and-paste process, completing it in a fraction of the time.

Some methods include using libraries from programming languages like Python and R. There are also programs that have been built specifically for web scraping.
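For the curious, here’s roughly what that looks like in Python, using the requests and BeautifulSoup libraries. The URL and the CSS selector are hypothetical placeholders, not a real site:

    # A minimal web-scraping sketch in Python. The URL and the
    # CSS selector are hypothetical placeholders.
    import csv

    import requests
    from bs4 import BeautifulSoup

    response = requests.get("https://example.com/listings")
    soup = BeautifulSoup(response.text, "html.parser")

    # Grab the text of every element matching the (made-up) selector
    # and write it to a spreadsheet-friendly CSV file.
    rows = [[el.get_text(strip=True)] for el in soup.select(".listing-price")]
    with open("listings.csv", "w", newline="") as f:
        csv.writer(f).writerows(rows)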

In this tutorial, we’ll go over how to use one of those point-and-click programs.


Is web scraping ethical?

I believe it is, because there’s no technical difference between a person copying and pasting the information and a computer doing it.

However, there are ways in which it can be unethical.

  • Taking and displaying personal information.
  • Scraping too much, too often, overloading the site’s server with your requests in something akin to a denial-of-service attack (see the rate-limiting sketch after this list).
  • Copying and displaying their content wholesale.

Don’t do any of the above. It’s not nice.
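If you do end up writing your own scraper, the simplest courtesy is to pause between requests so the site’s server never sees a flood of traffic. Here’s a minimal rate-limiting sketch in Python; the URLs are hypothetical placeholders:

    # Polite scraping: pause between requests so the target server
    # isn't overloaded. The URLs are hypothetical placeholders.
    import time

    import requests

    urls = ["https://example.com/listings?page=" + str(p) for p in range(1, 14)]
    for url in urls:
        response = requests.get(url)
        # ... parse response.text here ...
        time.sleep(2)  # wait two seconds before the next request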

Getting started

Difficulty level: Beginner.

Download the app and install it. There’s a version for Mac and PC.

Navigating the program

Start it up and click on New at the top right.

You’ve got a lot of options.

  • Magic – Tries to detect the data and structure based on the URL you feed it
  • Extractor – Gives you more control by letting you choose what specific data you want
  • Classic Extractor – Same as Extractor but in a different format
  • Authenticated Extractor – Same as Extractor but for websites you have to log into first
  • Crawler – A more advanced version of the extractor that gives you more control
  • Connector – Sort of a mashup between an extractor and a crawler

Go ahead and click on Magic under the New pulldown menu.

You’ll see a list of the “most popular” sites that people like to scrape.

Intro to Zillow data

Click on the Zillow button.

This is the data the app has detected on Zillow, along with the structure it has guessed.

Do you see the similarities to the data from the original website?

Click on Download CSV at the bottom right (Or Extract Data at the top right).

You can select how many pages of data to download.

This is what the exported data looks like.

It’s rough and needs cleaning up, but now you can do a bunch of data analyses on it.
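If you want to go beyond a spreadsheet, the exported CSV loads straight into Python’s pandas library. The filename below is an assumption; use whatever your download is actually called:

    # Load the exported Zillow CSV into pandas for a quick look.
    # The filename is an assumption about your export.
    import pandas as pd

    df = pd.read_csv("zillow_export.csv")
    print(df.head())      # peek at the first few rows
    print(df.describe())  # summary statistics for numeric columns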

What do you think you can do with this type of data?

Scraping from Airbnb

Click New > Extractor

Let’s take a look at Airbnb data.

Type airbnb.com into the URL bar at the top of the app.

Search for Hartford, CT.

Click the ON button at the top left.

The page will reload and alert you that JavaScript needs to be enabled.

Click the JS button at the top right to turn on JavaScript.

When it asks if you’re sure, click Yes.

We need to train the app to recognize the data and sort it into a data frame.

Hover over the data you want to train the app on.

Hover over the $90 until it turns orange.

Click it.

We want to turn each of these listings into a row of a spreadsheet.

Select Many rows.

The app will ask: “This is also a link, would you like to add this too?” Click Yes.

We’re going to need that link later.

We can continue to click on +New Column and hover and click for every facet that we want…

Or we can just click Suggest Data and it will guess what we want.

Go ahead and click it.

OK, that’s a lot of interesting data.

It’s pretty much what we want.

Click Done at the top right.

You’ve created an API.

Name it and click Publish.

You’ll be brought to your library.

The problem here is that you’ve only got one page of data.

That means 18 results from the page. There might be hundreds more.

Fortunately, there’s an option for Bulk Extract.

Give it a list of URLs with a structure similar to the original one and it will pull out the data from each.

We need to get the links for the other pages.

Right-click the “2” link at the bottom of the page and copy the link address.

The link looks like the original search URL with a page number tacked onto the end, and it appears there are 13 pages total.

You could paste this link into the box 13 times and just change the number from 2 to 3 to 4, etc.

But here’s a quicker way.

Open up Google Sheets.

The only thing that changes from page to page in the URL is the number at the very end.

So let’s make a formula that lets us change it easily.

Paste the URL into a cell and delete the 2 at the end.

Type the 2 into the column to the right.

You’re separating the base URL from the page number.

Drag down the URL so that it copies down for 13 rows.

Fill in the page numbers in the next column so they count up incrementally (2, 3, 4 and so on).

Now, let’s use a formula in the next column to bring those two together.
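In Google Sheets, the & operator joins text. Assuming the URL stub sits in column A and the page number in column B (that layout is an assumption about your sheet), put this in the first cell of column C:

    =A1&B1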

Drag the formula down so it fills out.

Now you have a list of URLs.
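If you’d rather skip the spreadsheet entirely, a few lines of Python build the same list. The base URL below is a placeholder; swap in the link you copied from the site:

    # Build the paginated URL list in code instead of a spreadsheet.
    # The base URL is a placeholder for the link you copied earlier.
    base = "https://www.airbnb.com/s/Hartford--CT?page="
    urls = [base + str(page) for page in range(1, 14)]  # pages 1 through 13
    print("\n".join(urls))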

Copy that column over to the app and place it in the Bulk Extract box.

Click Run Query

It will tell you that 14 of 14 URLs have been processed. (One URL failed because it looked for page 14, and the results only go up to page 13.)

Click the Export button at the top right when you’re done.

Now you have a new data set to analyze.

The data set also includes links to the specific page for each Airbnb listing, which contains more detailed information such as amenities and neighborhood.

It’s a little rough and will require thorough cleaning, but it’s a start. You’ll definitely need to eliminate duplicates, but fortunately there’s another Trend CT tutorial on that.
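Dropping those duplicates is a one-liner in pandas. The filename and the “url” column are assumptions about what your export looks like:

    # Remove duplicate listings from the combined export.
    # The filename and the "url" column name are assumptions.
    import pandas as pd

    df = pd.read_csv("airbnb_export.csv")
    df.drop_duplicates(subset="url").to_csv("airbnb_clean.csv", index=False)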

What do you think?

  • Bill

    “Is web scraping ethical?

    I believe it is, because there’s no technical difference between a person copying and pasting the information and a computer doing it.”

    So, committing copyright violations by computer scraping is ethical because it can be done by a person copying and pasting. This is your standard for ethics? Journalists do not have a free pass to violate copyright laws, which implicitly is what this scraping tutorial advocates. Doesn’t really matter how much, how often. Association membership lists are intellectual property of the association. Doesn’t matter that technology allows someone to scrape online lists. Try this on our online membership directories and I’d advise you to get a good lawyer. Unbelievable…

  • Hi Andrew,
    Nice tutorial.
    I have created a similar tutorial but mine goes a little bit further.
    You can share the link with your readers as well.

  • Roberto Peñacastro Ortega

    Andrew this is great! Thanks a lot for sharing the knowledge, now I’m left with a new question. How can I work out a Custom API with this new data base? And how could I automate the updating of new listings for this data base, or it has to be manual? Thank you very much!