Sometimes the only way to get the data you want is to pull it directly from the source.
While I was writing this book, I came across the following request on the Retrosheet mailing list:
    I’m going to be doing the Fans’ Scouting Report for a third
    year, but this time, I want to do it during the year.
    I’m looking to get the following information for 2005
    for all players, as of the all-star break:
    Anyone who can help, please send me a note offlist.
    (playerid being whatever your data source is).
Basically, Tom needed to pull just a subset of data from the MLB.com site. Grabbing data from web pages so that you can reuse it for other purposes is a common task—so much so that it has its own name: spidering. Spidering allows you to write programs that read a web page and pull out just the parts you want, while throwing out the rest.
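The idea behind spidering can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the approach used later in this book; the names `fetch` and `extract_links` are my own, and the `LinkExtractor` class simply keeps every link URL it sees and throws the rest of the page away.

```python
# A minimal spidering sketch: download a page, keep only the parts we want.
# fetch() and extract_links() are illustrative names, not a published API.
from html.parser import HTMLParser
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Keep the href of every <a> tag; discard everything else on the page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(html_text):
    """Return every link URL found in a chunk of HTML."""
    parser = LinkExtractor()
    parser.feed(html_text)
    return parser.links


def fetch(url):
    """Download a page and return its text (requires network access)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")
```

For example, `extract_links(fetch("http://www.retrosheet.org"))` would return the URL of every page that Retrosheet's front page links to.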
Web pages are written in a language called HyperText Markup Language (HTML). An HTML file contains tags that tell your web browser how to format and display the page. Here is a short sample file that shows how this works:
    <html>
      <head>
        <title>Baseball Sites</title>
      </head>
      <body>
        <h1>Baseball Web Sites</h1>
        This book describes many different baseball web sites.
        Here are a few of my favorites:<br>
        <a href="http://www.baseball1.com">The Baseball Archive</a><br>
        <a href="http://www.retrosheet.org">Retrosheet</a><br>
        <a href="http://www.mlb.com">MLB.com</a><br>
      </body>
    </html>
The <html> tags mark where the HTML document begins and ends; the browser uses the tags between them to decide how to display the page.
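To make the connection between tags and spidering concrete, here is a sketch that parses the sample page above with Python's standard-library `html.parser` and pulls out just the title and the three links, throwing out the rest. The `SiteParser` class name and the structure of the code are my own illustration.

```python
# Parse the sample page and extract only the title and the links.
from html.parser import HTMLParser

SAMPLE = """
<html>
  <head><title>Baseball Sites</title></head>
  <body>
    <h1>Baseball Web Sites</h1>
    This book describes many different baseball web sites.
    Here are a few of my favorites:<br>
    <a href="http://www.baseball1.com">The Baseball Archive</a><br>
    <a href="http://www.retrosheet.org">Retrosheet</a><br>
    <a href="http://www.mlb.com">MLB.com</a><br>
  </body>
</html>
"""


class SiteParser(HTMLParser):
    """Collect the page title and every (URL, link text) pair."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._tag = None   # tag whose text we are currently inside
        self._href = None  # href of the most recent <a> tag

    def handle_starttag(self, tag, attrs):
        self._tag = tag
        if tag == "a":
            self._href = dict(attrs).get("href")

    def handle_endtag(self, tag):
        self._tag = None

    def handle_data(self, data):
        if self._tag == "title":
            self.title += data
        elif self._tag == "a" and self._href:
            self.links.append((self._href, data.strip()))


parser = SiteParser()
parser.feed(SAMPLE)
print(parser.title)  # Baseball Sites
for url, text in parser.links:
    print(text, "->", url)
```

Running this prints the title "Baseball Sites" followed by the three site names and their URLs; everything else in the file (the heading, the descriptive sentence, the formatting tags) is simply ignored, which is exactly what spidering a stats page amounts to on a larger scale.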