Chapter 3. A Quick Look into Baseball

In this chapter, we introduce the dataset used throughout the book: baseball performance statistics. We also explain the various metrics used in baseball (and in this book), so that you can follow along even if you aren’t a baseball fan.

Nate Silver calls baseball the “perfect dataset.” Few human-centered systems are recorded in such comprehensive detail, and no set of tables is richer for demonstrating the full range of analytic patterns.

For readers who are not avid baseball fans, we provide a simple—some might say “oversimplified”—description of the sport and its key statistics. For more details, refer to Joseph Adler’s Baseball Hacks (O’Reilly) or Max Marchi and Jim Albert’s Analyzing Baseball Data with R (Chapman & Hall).

The Data

Our baseball statistics come in tables at multiple levels of detail.

Putting people first, as we like to do, the people table lists each player’s name and personal stats (height and weight, birth year, etc.). Its primary key is the player_id, formed from the first five letters of the player’s last name, the first two letters of their first name, and a two-digit disambiguation slug. There are also primary tables for ballparks (parks, which lists information on every stadium that has ever hosted a game) and for teams (teams, which lists every Major League team back to the birth of the game).
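To make the player_id scheme concrete, here is a minimal Python sketch of the rule as described above. The function name make_player_id and its seq parameter are hypothetical and used only for illustration. Lowercasing the name and dropping non-letter characters are assumptions; the actual dataset may treat punctuation, short names, and other edge cases differently.

```python
import re


def make_player_id(first_name, last_name, seq=1):
    """Build an ID from 5 letters of the last name, 2 of the first, and a 2-digit slug.

    Hypothetical helper: the cleaning step (lowercase, letters only) is an
    assumption, not a documented rule of the dataset.
    """
    def clean(name):
        # Keep only the letters a-z after lowercasing.
        return re.sub(r"[^a-z]", "", name.lower())

    # seq is the disambiguation number, zero-padded to two digits.
    return f"{clean(last_name)[:5]}{clean(first_name)[:2]}{seq:02d}"


print(make_player_id("Hank", "Aaron"))      # aaronha01
print(make_player_id("Ken", "Griffey", 2))  # griffke02
```

The disambiguation slug is passed in by the caller here; in a real table it would be assigned by checking which IDs with the same seven-letter prefix already exist.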

The core statistics table is bat_seasons, which gives ...
