Chapter 3. A Quick Look into Baseball
In this chapter, we introduce the dataset we use throughout the book: baseball performance statistics. We explain the various metrics used in baseball (and in this book) so that you can follow along even if you aren’t a baseball fan.
Nate Silver calls baseball the “perfect dataset.” There are not many human-centered systems for which this comprehensive degree of detail is available, and no richer set of tables for truly demonstrating the full range of analytic patterns.
For readers who are not avid baseball fans, we provide a simple (some might say “oversimplified”) description of the sport and its key statistics. For more details, refer to Joseph Adler’s Baseball Hacks (O’Reilly) or Max Marchi and Jim Albert’s Analyzing Baseball Data with R (Chapman & Hall).
The Data
Our baseball statistics come in tables at multiple levels of detail.
Putting people first as we like to do, the people table lists each player’s name and personal stats (height and weight, birth year, etc.). It has a primary key, the player_id, formed from the first five letters of the player’s last name, the first two letters of their first name, and a two-digit disambiguation slug. There are also primary tables for ballparks (parks, which lists information on every stadium that has ever hosted a game) and for teams (teams, which lists every Major League team back to the birth of the game).
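To make the player_id scheme concrete, here is a minimal Python sketch of how such a key could be assembled from a name. The function name make_player_id is hypothetical, and the lowercasing, the stripping of non-letter characters, and the slug starting at 01 are assumptions for illustration; the text above only specifies the five-letter, two-letter, two-digit layout.

```python
import re


def make_player_id(first_name: str, last_name: str, slug: int = 1) -> str:
    """Build a key shaped like <last5><first2><NN>.

    Assumptions (not stated in the text): names are lowercased,
    non-letter characters are dropped, and the slug counts from 1.
    """
    if not 0 <= slug <= 99:
        raise ValueError("slug must fit in two digits")
    # Keep only letters so apostrophes, spaces, and hyphens do not end up in the key.
    last = re.sub(r"[^a-z]", "", last_name.lower())[:5]
    first = re.sub(r"[^a-z]", "", first_name.lower())[:2]
    return f"{last}{first}{slug:02d}"


print(make_player_id("Hank", "Aaron"))          # aaronha01
print(make_player_id("Babe", "Ruth"))           # ruthba01 (last name shorter than five letters)
print(make_player_id("Hank", "Aaron", slug=2))  # aaronha02
```

The slug exists because two players can share the same name fragments; a second player whose name also reduces to aaron and ha would get the next number rather than a colliding key.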
The core statistics table is bat_seasons, which gives ...