Make a Historical Play-by-Play Database

Turn the event files on the Retrosheet web site into an easy-to-query, play-by-play database.

The Retrosheet project includes play-by-play information on almost every game from 1960 through 1992, and every game from 2000 through the present. You can use these files in many ways: you can inspect them with Chadwick, turn them into box scores, or write your own scripts to read them. One of the easiest ways to use this data is to dump it into a MySQL database. This hack shows you a quick way to make a database of play-by-play information. I use this database in many hacks in this book.

The Code

There are a number of required steps:

  1. Fetch the datafile from Retrosheet (http://www.retrosheet.org).

  2. Transform the event file format into tabular flat files.

  3. Import the CSV files into MySQL.

This hack uses a little Perl to fetch and move files. See “Get Perl” [Hack #12] and “Learn Perl” [Hack #13] for more information.

Fetching the data.

The first step in building this database is to get the data. I show how to fetch the data in “Get Historical Play-by-Play Data” [Hack #14] .

Transforming the data.

The second step in making this data accessible is to process it with the BEVENT tool [Hack #15] . The purpose of this step is to create a column in the database for each thing that can happen on a play: base runner movement, player IDs, scoring, etc. Since storage is cheap, the easiest thing to do is to use the tool to dump every possible field. There are a lot ...

Get Baseball Hacks now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.