DIY Web Scraping: Fetching and Extracting Data

Given the sheer volume of data accessible on the world wide web, the "web scraping" phenomenon has caught on like wildfire. Web scraping is a method for extracting data from websites. It can be done manually, but it is far more practical to do it programmatically.

Many free programs are out there to assist you with your forays into web scraping. For a recent project we used iMacros to automate the fetching and extraction of data from the Residential Construction Branch of the U.S. Census Bureau. This website provides data on the number of new housing units authorized by building permits. Data are available monthly, year-to-date, and annually at the national and state levels, and for most counties and county subdivisions. Prior to January 9, 2017, all building permit data at the county level or below were available only as individual text files. This meant downloading 3,142 individual text files to obtain the data for all the counties in the U.S., a tedious task to say the least.

Such a manual process would have been too labor-intensive to take on, so we automated it with iMacros, which turned out to be straightforward. Here's an outline of the steps:

  • Install the iMacros extension for the Firefox web browser.
  • Test the iMacros recording function by selecting and downloading the first file.
  • View the recorded code and add a loop that increments by one on each pass, so the macro replays and downloads each text file in turn (the first sketch after this list shows the same idea in Python).
  • Save all the files to the same folder to make merging them into a single data file much easier.
  • Extract data for every county, with the ability to roll it up by state, by region, or nationally (see the merge sketch below).
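
The looping step is easier to picture with a concrete example. The recorded macro itself is specific to iMacros and Firefox, but the same idea carries over to a few lines of Python. The sketch below is illustrative only: the directory URL and the file-naming pattern are hypothetical stand-ins, since the real Census files follow their own naming scheme.

    import os

    import requests

    # Placeholder URL and names: the real Census directory and its
    # file-naming scheme will differ, so adjust both before running.
    BASE_URL = "https://www2.census.gov/econ/bps/County/"  # assumed location
    FILE_NAMES = [f"county_{i:04d}.txt" for i in range(1, 3143)]  # hypothetical

    OUT_DIR = "bps_files"
    os.makedirs(OUT_DIR, exist_ok=True)

    for name in FILE_NAMES:
        resp = requests.get(BASE_URL + name, timeout=30)
        resp.raise_for_status()  # fail fast if a file is missing or renamed
        with open(os.path.join(OUT_DIR, name), "wb") as f:
            f.write(resp.content)

Saving everything to one folder, as the next step suggests, is what makes the merge painless.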
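
With all the files in one place, merging them and rolling the numbers up is short work. Here is a sketch using pandas, assuming each file is comma-delimited with a header row naming state, county, and permits columns; the real files have their own layout, so the parsing would need to match it.

    import glob

    import pandas as pd

    # Assumes each file has a header row with "state", "county", and
    # "permits" columns; these names are placeholders for illustration.
    frames = [pd.read_csv(path) for path in glob.glob("bps_files/*.txt")]
    counties = pd.concat(frames, ignore_index=True)

    # Roll the county figures up to the state and national levels.
    by_state = counties.groupby("state")["permits"].sum()
    national = counties["permits"].sum()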

Like many data sites, the Building Permits website now provides access to an FTP directory where you can navigate to and download all 3,142 text files without entering specific parameters for each one. If you come across a website that doesn't, we recommend getting familiar with the site first to determine what format the data are in: tables, individual pages, and so on. If you need to scrape numerous websites, take the time to study each one, because formatting differences from site to site can wreak havoc, leaving you with misaligned or incorrect data before you notice. Never forget the rule: garbage in, garbage out. Test before you scrape!
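
One cheap way to follow that rule is to compare each file's first line against what you expect before merging anything, so a format change surfaces immediately instead of as silently misaligned data. A sketch, again with an assumed header line:

    import glob

    # Assumed header; replace it with the first line of a known-good file.
    EXPECTED_HEADER = "state,county,permits"

    bad_files = []
    for path in glob.glob("bps_files/*.txt"):
        with open(path, encoding="utf-8") as f:
            first_line = f.readline().strip()
        if first_line != EXPECTED_HEADER:
            bad_files.append(path)  # flag for review rather than merging blindly

    for path in bad_files:
        print("unexpected format:", path)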