A couple of months ago, I was taken to a the eponymous website on which the deaths of race horses are catalogued, (http://www.horsedeathwatch.com/), and taught a little about how to scrape information from a regularly updated table (of the sort found on that page of the site), using only a google spreadsheet.
Google spreadsheets are invaluable. Fact.
1. Create a new google spreadsheet. Type =importhtml into a cell on your spreadsheet. This start to your formula needs to be followed by some information telling the spreadsheet exactly what information to extract – using the following template: (“url”,”query”,”index”)
- “url” (of the site from which you want to scrape)
- “query” – this can be either “table” or “list”. In this case, horsedeathwatch have thoughtfully collated the information in a table rather than a list, so we’re plumping for “table”.
- “index” – this is simply the number of the table or list, and may require a little trial and error. So if your table is the second on the page, enter “2”. In this case, we will enter “1”.
2. The finished formula (or function) should look a little like this: =importhtml(“http://www.horsedeathwatch.com/”,”table”,”1″)
3. Press enter, and watch the data stream into your spreadsheet, as in the image below (click to enlarge)
4. Note, your spreadsheet will last only as long as the website is up, so create regular back-ups or screenshots.
Importing data from an RSS feed:
Let’s take the Oxford Migration Observatory’s Press Release RSS feed url: http://www.migrationobservatory.ox.ac.uk/press-releases/rss
1. Type =importfeed into your first cell
2. Follow this with (“url”) – this should be the url of the page that you want the information from.
3. So my finished formula in this case is: =importfeed(“http://www.migrationobservatory.ox.ac.uk/press-releases/rss”)