Getting information on craft breweries can be a difficult process, with data dispersed over multiple websites or in formats unusable for analysis.
This walkthrough shows one way to extract metadata on craft breweries, along with overall beer ratings, from the popular site Beer Advocate, an online community that supports beer education and events and hosts a forum for rating beers.
Difficulty level: Intermediate.
This tutorial will make sense if you already know how to use R or have gone through these previous walkthroughs:
- R for beginners: How to transition from Excel to R
- How to put dots on a Leaflet map with R
- How to scrape website data without programming using Import.io
Getting the data from the website
2. Because Beer Advocate paginates its results, we need to determine the URL structure so we can be sure we're scraping more than one page of the results.
Luckily, this is fairly simple to do: click each of the results links (i.e., 1-20, 21-40, 41-60) and note how the URL changes.
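Once the pattern is known, the page URLs can also be generated in R rather than copied by hand. This is a minimal sketch, assuming each results page is addressed by a zero-based start offset in steps of 20, which is what the 1-20, 21-40, 41-60 labels suggest. The base URL and the `start` parameter name are placeholders, not Beer Advocate's actual URL structure.

```r
# Hypothetical base URL and query parameter; substitute the pattern you
# observe in the browser's address bar when clicking the results links.
base_url <- "https://example.com/breweries/?start="

# Results are listed 20 per page: 1-20, 21-40, 41-60, ...
page_size <- 20
n_pages <- 3

# Offsets 0, 20, 40 correspond to the pages starting at results 1, 21, 41
offsets <- seq(from = 0, by = page_size, length.out = n_pages)
page_urls <- paste0(base_url, offsets)

print(page_urls)
```

Changing `n_pages` extends the list to however many result pages the site shows.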
Here are the necessary links:
3. Set up an account on import.io, which you can do by linking your GitHub account. Then navigate to your My Data page, paste the previous links into the bulk extractor (found in the “How would you like to use this API?” dropdown for your Magic API), and press the button to run the queries.
4. The new output page will be a tabular view of all of the extracted link data ready for export in multiple formats such as Spreadsheet (for CSV), HTML, and JSON. We will download the Spreadsheet format for this tutorial.
Now that we have the data, we can bring it into R to process and visualize it.
5. Read the data into R:
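A minimal sketch of this step, assuming the Spreadsheet export was saved as a CSV file. The file name `breweries.csv`, the column names, and the sample row are all placeholders; the tiny stand-in file is written first only so the snippet runs on its own.

```r
# Stand-in for the import.io export so this snippet runs on its own.
# The column names and the row are invented placeholders, not scraped data.
csv_path <- file.path(tempdir(), "breweries.csv")  # hypothetical file name
writeLines(c(
  "name,address,rating",
  "\"Example Brewing Co.\",\"1 Placeholder St, Sampletown, CT\",-"
), csv_path)

# Read the export: keep text as character vectors and treat blanks and "-"
# as missing values
breweries <- read.csv(csv_path,
                      stringsAsFactors = FALSE,
                      na.strings = c("", "NA", "-"))

str(breweries)
```

With a real export, point `read.csv()` at the downloaded file instead of the temporary one.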
6. The import leaves us with some link columns and some columns holding more than one piece of information. That's easy to fix by removing duplicate columns, creating coherent column headers, specifying and unifying missing data, and extracting the necessary information from other columns.
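The four cleaning tasks can be sketched in base R as follows. The data frame, its column names, the "-" missing-value marker, and the "street, city, ST 00000" address layout are all assumptions made for illustration; the real export will need its own names and patterns.

```r
# Invented stand-in for the raw export: a duplicated column, unhelpful
# headers, a "-" missing-value marker, and a multi-part address column.
raw <- data.frame(
  link_text = c("Example Brewing Co.", "Placeholder Ales"),
  link_text_dup = c("Example Brewing Co.", "Placeholder Ales"),
  address_value = c("1 Sample St, Sampletown, CT 06000",
                    "2 Demo Ave, Demoville, CT 06001"),
  rating_value = c("4.1", "-"),
  stringsAsFactors = FALSE
)

# 1. Remove columns whose contents duplicate an earlier column
clean <- raw[!duplicated(as.list(raw))]

# 2. Give the remaining columns coherent headers
names(clean) <- c("name", "address", "rating")

# 3. Unify the missing-value markers as NA, then convert to numeric
clean$rating[clean$rating %in% c("-", "", "N/A")] <- NA
clean$rating <- as.numeric(clean$rating)

# 4. Extract the state and ZIP code from the combined address column
clean$state <- sub("^.*, ([A-Z]{2}) [0-9]{5}$", "\\1", clean$address)
clean$zip <- sub("^.* ([0-9]{5})$", "\\1", clean$address)

print(clean)
```

Note that `sub()` returns the input unchanged when the pattern does not match, so addresses in a different layout should be checked after extraction.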
Geolocation of addresses for plotting on a map
7. Grab some latitudes and longitudes for those craft brewery addresses.
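One way to structure this step is a small wrapper that accepts any geocoding function, sketched below. The helper names `geocode_addresses` and `no_op_geocoder` and the sample addresses are hypothetical. The wrapper assumes the geocoder returns a data frame with `lon` and `lat` columns, one row per address, which is the shape `ggmap::geocode()` returns; using that function requires the ggmap package and a registered Google API key.

```r
# Invented placeholder addresses; in practice use the cleaned address column
addresses <- c("1 Sample St, Sampletown, CT 06000",
               "2 Demo Ave, Demoville, CT 06001")

# Geocode a character vector with any function that returns a data frame
# with `lon` and `lat` columns, one row per address.
geocode_addresses <- function(addresses, geocoder) {
  coords <- geocoder(addresses)
  stopifnot(nrow(coords) == length(addresses))
  data.frame(address = addresses,
             lon = coords$lon,
             lat = coords$lat,
             stringsAsFactors = FALSE)
}

# With ggmap installed and a key registered via ggmap::register_google(),
# the real lookup could be:
# located <- geocode_addresses(addresses, ggmap::geocode)

# Offline stand-in so the sketch runs without network access; it returns NA
# coordinates rather than made-up locations.
no_op_geocoder <- function(x) {
  data.frame(lon = rep(NA_real_, length(x)),
             lat = rep(NA_real_, length(x)))
}

located <- geocode_addresses(addresses, no_op_geocoder)
print(located)
```

Keeping the geocoder as an argument makes it easy to swap services or to test the rest of the pipeline offline.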
Data visualization using Leaflet
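A minimal sketch of plotting geocoded points with the leaflet package, assuming a data frame with `name`, `lon`, and `lat` columns. The rows below are invented placeholders, not real brewery locations, and the map is only built if leaflet is installed.

```r
# Invented placeholder points; in practice use the geocoded brewery table
breweries <- data.frame(
  name = c("Example Brewing Co.", "Placeholder Ales"),
  lon = c(-72.70, -72.90),
  lat = c(41.75, 41.30),
  stringsAsFactors = FALSE
)

if (requireNamespace("leaflet", quietly = TRUE)) {
  # Start a map widget bound to the data, add the default base tiles,
  # then draw one circle marker per brewery with its name as a popup
  brewery_map <- leaflet::leaflet(breweries)
  brewery_map <- leaflet::addTiles(brewery_map)
  brewery_map <- leaflet::addCircleMarkers(brewery_map,
                                           lng = ~lon, lat = ~lat,
                                           popup = ~name, radius = 5)
  # In an interactive session, print(brewery_map) displays the map
}
```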