To illustrate how to combine data from disparate sources for statistical analysis and visualization, let’s focus on one of the messiest sources of data around: web pages.
The Philadelphia sheriff’s office posts foreclosure auctions on its website each month. How do we collect this data, massage it into a reasonable form, and work with it? First, create a new folder (for example, ~/Documents/Rmashup) to contain our project files. It then helps to set the R working directory to that folder:
> setwd("~/Documents/Rmashup/")  # in Windows
We can download this foreclosure listings web page from within R (or you may instead choose to save the raw HTML from your web browser):
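A minimal sketch of the download step, using only the base R functions `download.file` and `readLines`. The URL below is a placeholder, and the names `listings_url`, `fetch_listings`, and `properties.html` are hypothetical choices made for illustration; substitute the actual address of the listings page.

```r
# Hypothetical placeholder: replace with the real address of the listings page.
listings_url <- "http://www.example.com/foreclosures.html"

# Download a page to a local file (skipping the download if a copy
# already exists), then return its contents as a character vector
# with one element per line of HTML.
fetch_listings <- function(url, destfile = "properties.html") {
  if (!file.exists(destfile)) {
    download.file(url = url, destfile = destfile, quiet = TRUE)
  }
  readLines(destfile, warn = FALSE)
}

# Example call (commented out because the URL above is only a placeholder):
# html <- fetch_listings(listings_url)
```

Keeping a local copy means later steps can be rerun without hitting the website again, which is the same effect as saving the raw HTML from a browser.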
Here is an excerpt of this web page’s source HTML; note that each listing begins with a street address:
6321 Farnsworth St. 62nd Ward 1,379.88 sq. ft. BRT# 621533500 Improvements: Residential Property <br><b> HOMER SIMPSON </b> C.P. January Term, 2006 No. 002619 $27,537.87 Phelan Hallinan & Schmieg, L.L.P. <hr /> <center><b> 243-467 </b></center> 1402 E. Mt. Pleasant Ave. 50th Ward approximately 1,416 sq. ft. more or less BRT# 502440300 ...
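One way to pull street addresses out of an excerpt like this is a regular expression. The sketch below assumes that every address starts with a house number and is followed directly by a ward designation such as "62nd Ward". The pattern and the variable names are illustrative only and may not match the expression used for the full data set.

```r
# A shortened copy of the excerpt above, stored as a single string.
html <- paste(
  "6321 Farnsworth St. 62nd Ward 1,379.88 sq. ft. BRT# 621533500",
  "Improvements: Residential Property <br><b> HOMER SIMPSON </b>",
  "<hr /> <center><b> 243-467 </b></center>",
  "1402 E. Mt. Pleasant Ave. 50th Ward approximately 1,416 sq. ft."
)

# Assumed address shape: a digit, then as few letters, digits, spaces,
# periods, apostrophes, or hyphens as possible, stopping just before
# text of the form " <number>st|nd|rd|th Ward" (a lookahead, so the
# ward itself is not part of the match).
pattern <- "[0-9][0-9A-Za-z .'-]*?(?= [0-9]+(?:st|nd|rd|th) Ward)"

# gregexpr finds every match; regmatches extracts the matched text.
addresses <- regmatches(html, gregexpr(pattern, html, perl = TRUE))[[1]]
print(addresses)
# Expected: "6321 Farnsworth St." "1402 E. Mt. Pleasant Ave."
```

The lookahead requires `perl = TRUE`, since R's default regular expression engine does not support it.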
The sheriff’s raw HTML listings are inconsistently formatted, but with the right regular expression we can identify street addresses: notice how they ...