Chapter 1. Mapping Foreclosures

Messy Address Parsing

To illustrate how to combine data from disparate sources for statistical analysis and visualization, let’s focus on one of the messiest sources of data around: web pages.

The Philadelphia sheriff’s office posts foreclosure auctions on its website each month. How do we collect this data, massage it into a reasonable form, and work with it? First, create a new folder (for example, ~/Rmashup) to contain the project files. Then set R’s working directory to that folder, so that relative file paths in the examples resolve there.

  # In Unix/macOS
  > setwd("~/Rmashup/")
  # In Windows
  > setwd("C:/Rmashup/")
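
The folder can also be created from within R before switching to it. The following is a minimal sketch, assuming the example folder name ~/Rmashup; the helper name use_project_dir is hypothetical and not part of base R.

```r
# Hypothetical helper: create the project folder if it does not exist,
# then make it the working directory.
# setwd() invisibly returns the previous working directory.
use_project_dir <- function(path) {
  path <- path.expand(path)  # expand a leading "~" to the home directory
  if (!dir.exists(path)) {
    dir.create(path, recursive = TRUE)
  }
  setwd(path)
}

# Example call (assumed folder name):
# use_project_dir("~/Rmashup")
```

Because R expands a leading "~" itself, the same call should work on Unix, macOS, and Windows, although the folder it resolves to differs by platform.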

We can download this foreclosure listings web page from within R (or you may instead choose to save the raw HTML from your web browser):

  > download.file(url="http://www.phillysheriff.com/properties.html",
    destfile="properties.html") 
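
To avoid fetching the page again on every run, the download can be wrapped so that it happens only when the local copy is missing. This is a sketch under that assumption; fetch_listings is a hypothetical helper name, and it returns the HTML as a character vector with one element per line.

```r
# Hypothetical helper: download the listings page only if there is no
# local copy yet, then read the file into a character vector of lines.
fetch_listings <- function(url, destfile = "properties.html") {
  if (!file.exists(destfile)) {
    download.file(url = url, destfile = destfile)
  }
  readLines(destfile, warn = FALSE)
}

# Example call, assuming the page is still available at the URL above:
# html_lines <- fetch_listings("http://www.phillysheriff.com/properties.html")
```

Keeping a local copy also means the later parsing steps can be rerun against the same snapshot of the listings.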

Here is some of this web page’s source HTML. The street addresses are the lines reading 6321 Farnsworth St. and 1402 E. Mt. Pleasant Ave.:

    6321 Farnsworth St.
           62nd                                            Ward
    1,379.88 sq. ft.  BRT# 621533500  Improvements: Residential Property
    <br><b>
    HOMER SIMPSON
    &nbsp;   &nbsp;
    </b>   C.P. January Term, 2006  No. 002619 &nbsp; &nbsp; $27,537.87
    &nbsp;  &nbsp;  Phelan Hallinan & Schmieg, L.L.P.
     <hr />
    <center><b>      243-467           </b></center>
    1402 E. Mt. Pleasant Ave.
    &nbsp;   &nbsp;  50th                                            Ward
    approximately 1,416 sq. ft. more or less  BRT# 502440300 ...

The sheriff’s raw HTML listings are inconsistently formatted, but with the right regular expression we can identify street addresses: notice ...
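
As an illustration of the idea, here is a minimal sketch that picks the address lines out of a few lines copied from the excerpt above. The pattern is an assumption chosen to fit this excerpt (a house number, a space, a capitalized word, then only letters, periods, and spaces to the end of the line); it is not necessarily the expression used for the full listings, which will likely need a looser pattern.

```r
# A few lines copied from the HTML excerpt above.
html_lines <- c(
  "    6321 Farnsworth St.",
  "           62nd                                            Ward",
  "    1,379.88 sq. ft.  BRT# 621533500  Improvements: Residential Property",
  "    <center><b>      243-467           </b></center>",
  "    1402 E. Mt. Pleasant Ave.",
  "    &nbsp;   &nbsp;  50th                                            Ward"
)

# Assumed pattern: optional leading whitespace, a house number, one space,
# a capital letter, then only letters, periods, and spaces up to the end.
address_pattern <- "^[[:space:]]*[0-9]+ [A-Z][A-Za-z. ]*[[:space:]]*$"

# Keep the matching lines and strip the surrounding whitespace.
addresses <- trimws(html_lines[grepl(address_pattern, html_lines)])
print(addresses)
# [1] "6321 Farnsworth St."       "1402 E. Mt. Pleasant Ave."
```

Lines such as "62nd ... Ward" and "1,379.88 sq. ft." are rejected because the leading digits are not followed directly by a space and a capital letter.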
