This chapter introduces the dataset we will work with in the rest of the book. It also covers the kinds of tools we’ll be using and our reasons for choosing them. Finally, it outlines several perspectives on analyzing data for you to keep in mind moving forward.
Air travel is an essential part of modern life. It is a fundamental part of globalized culture, linking major cities across the planet into a global urban economy. Thanks to regulation, a great deal of aviation data is freely available. In the course of the book, we’ll use many aviation datasets. The core, or atomic, logs we’ll be using are on-time records for each flight. We will supplement these with data on airlines, weather, routes, and more.
Flight on-time records aren’t quite “Big Data,” but they are several gigabytes per year, uncompressed. We will immediately face a “big” or, more accurately, a “medium” data problem: processing the data on your local machine is just barely feasible. Working with data too large to fit in RAM requires that we use scalable tools, which is helpful as a learning device. Air travel is also a familiar experience to all of us, which gives you a sense for how to analyze and query flight data and helps you see which techniques are effective. Cultivating this kind of data intuition is a major theme in Agile Data Science.
In this book, we use the same tools that you would use at petabyte scale, but in local mode on your own machine. This is more than ...