In this chapter, we will introduce the dataset we will work on in the rest of the book: your own email inbox. We will also introduce the kinds of tools we’ll be using and explain our reasons for choosing them. Finally, we’ll outline several perspectives on analyzing data for you to keep in mind moving forward.
The book starts with data because in Agile Big Data our process starts with the data.
Email is a fundamental part of the internet. More than that, it is foundational, forming the basis for authentication on the web and social networks. In addition to being abundant and well understood, email is complex, rich in signal, and yields interesting information when mined.
We will use your own email inbox as the dataset for the application we’ll develop, in order to make the examples relevant. By downloading your Gmail inbox and using it in the examples, we immediately pose a ‘big’ or, actually, a ‘medium’ data problem: processing the data on your local machine is just barely feasible. Working with data too large to fit in RAM this way requires that we use scalable tools, which is helpful as a learning device. By using your own email inbox, we’ll enable insights into your own little world, helping you see which techniques are effective! This cultivates ‘data intuition,’ a major theme in Agile Big Data.
In this book, we use the same tools that you would use at petabyte scale, but in ‘local mode’ on your own machine. This ...