Preface

Data science is an exciting field to work in. It’s also still very young. Unfortunately, many people, and especially companies, believe that you need new technology in order to tackle the problems posed by data science. However, as this book demonstrates, many things can be accomplished by using the command line instead, and sometimes in a much more efficient way.

Around five years ago, during my PhD program, I gradually switched from using Microsoft Windows to GNU/Linux. Because it was a bit scary at first, I started by having both operating systems installed next to each other (known as dual-boot). The urge to switch back and forth between the two faded, and at some point I was even tinkering around with Arch Linux, which allows you to build up your own custom operating system from scratch. All you’re given is the command line, and it’s up to you what you want to make of it. Out of necessity I quickly became comfortable using the command line. Eventually, as spare time got more precious, I settled down with a GNU/Linux distribution known as Ubuntu because of its ease of use and large community. Nevertheless, the command line is still where I get most of my work done.

It wasn’t long ago that I realized that the command line is not just for installing software, configuring systems, and searching files. I started learning about command-line tools such as cut, sort, and sed. These are examples of command-line tools that take data as input, do something to it, and print the result. Ubuntu comes with quite a few of them. Once I understood the potential of combining these small tools, I was hooked.

After my PhD, when I became a data scientist, I wanted to use this approach to do data science as much as possible. Thanks to a couple of new, open source command-line tools including scrape, jq, and json2csv, I was even able to use the command line for tasks such as scraping websites and processing lots of JSON data. In September 2013, I decided to write a blog post titled “Seven Command-Line Tools for Data Science.” To my surprise, the blog post got quite a bit of attention, and I received a lot of suggestions for other command-line tools. I started wondering whether I could turn this blog post into a book. I’m pleased that, some 10 months later, with the help of many talented people (see “Acknowledgments” below), I was able to do just that.

I’m sharing this personal story not so much because I think you should know how this book came about, but more because I want you to know that I had to learn about the command line as well. Because the command line is so different from using a graphical user interface, it can be intimidating at first. But if I can learn it, then you can as well. No matter what your current operating system is and no matter how you currently do data science, by the end of this book you will be able to also leverage the power of the command line. If you’re already familiar with the command line, or even if you’re already dreaming in shell scripts, chances are that you’ll still discover a few interesting tricks or command-line tools to use for your next data science project.

What to Expect from This Book

In this book, we’re going to obtain, scrub, explore, and model data—a lot of it. This book is not so much about how to become better at those data science tasks. There are already great resources available that discuss, for example, when to apply which statistical test or how data can be best visualized. Instead, this practical book aims to make you more efficient and more productive by teaching you how to perform those data science tasks at the command line.

While this book discusses over 80 command-line tools, it’s not the tools themselves that matter most. Some command-line tools have been around for a very long time, while others are fairly new and might eventually be replaced by better ones. There are even command-line tools that are being created as you’re reading this. In the past 10 months, I have discovered many amazing command-line tools. Unfortunately, some of them were discovered too late to be included in the book. In short, command-line tools come and go, and that’s OK.

What matters most are the underlying ideas of working with tools, pipes, and data. Most of the command-line tools do one thing and do it well. This is part of the Unix philosophy, which makes several appearances throughout the book. Once you become familiar with the command line, and learn how to combine command-line tools, you will have developed an invaluable skill—and if you can create new tools, you’ll be a cut above.
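To make the idea of tools, pipes, and data a little more concrete, here is a minimal sketch of a pipeline. It is not taken from a later chapter; the sample data is made up for illustration, and uniq is an extra standard tool brought in alongside cut, sort, and sed. Each tool does one small job, and the pipe character passes the output of one tool to the next.

```shell
# Hypothetical sample data: name,city pairs (made up for illustration).
data='alice,Amsterdam
bob,Berlin
carol,Amsterdam
dave,Berlin
erin,Amsterdam'

# cut keeps the second comma-separated field, sort groups identical lines,
# uniq -c counts each group, sort -rn orders by count (largest first),
# and sed strips the leading padding that uniq -c adds.
result=$(printf '%s\n' "$data" |
  cut -d, -f2 |
  sort |
  uniq -c |
  sort -rn |
  sed 's/^ *//')

printf '%s\n' "$result"
```

None of these tools knows anything about cities or counts; the result comes entirely from how they are combined. Swapping in a different field number or a different input is a one-character or one-line change.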

How to Read This Book

In general, you’re advised to read this book in a linear fashion. Once a concept or command-line tool has been introduced, chances are that we employ it in a later chapter. For example, in Chapter 9, we make use of parallel, which is discussed extensively in Chapter 8.

Data science is a broad field that intersects with many other fields, such as programming, data visualization, and machine learning. As a result, this book touches on many interesting topics that unfortunately cannot be discussed at full length. Throughout the book, there are suggestions for additional reading. It’s not required to read this material in order to follow along with the book, but if you are interested, you can turn to these suggested readings as jumping-off points.

Who This Book Is For

This book makes just one assumption about you: that you work with data. It doesn’t matter which programming language or statistical computing environment you’re currently using. The book explains all the necessary concepts from the beginning.

It also doesn’t matter whether your operating system is Microsoft Windows, Mac OS X, or some other form of Unix. The book comes with the Data Science Toolbox, which is an easy-to-install virtual environment. It allows you to run the command-line tools and follow along with the code examples in the same environment in which this book was written. You don’t have to waste time figuring out how to install all the command-line tools and their dependencies.

The book contains some code in Bash, Python, and R, so it’s helpful if you have some programming experience, but it’s by no means required to follow along.

Conventions Used in This Book

The following typographical conventions are used in this book:

Italic

Indicates new terms, URLs, email addresses, filenames, and file extensions.

Constant width

Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.

Constant width bold

Shows commands or other text that should be typed literally by the user.

Constant width italic

Shows text that should be replaced with user-supplied values or by values determined by context.

Tip

This element signifies a tip or suggestion.

Note

This element signifies a general note.

Warning

This element signifies a warning or caution.

Using Code Examples

Supplemental material (virtual machine, data, scripts, custom command-line tools, etc.) is available for download at https://github.com/jeroenjanssens/data-science-at-the-command-line.

This book is here to help you get your job done. In general, if example code is offered with this book, you may use it in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing a CD-ROM of examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission.

We appreciate, but do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “Data Science at the Command Line by Jeroen H.M. Janssens (O’Reilly). Copyright 2015 Jeroen H.M. Janssens, 978-1-491-94785-2.”

If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at .

Safari® Books Online

Safari Books Online is an on-demand digital library that delivers expert content in both book and video form from the world’s leading authors in technology and business.

Technology professionals, software developers, web designers, and business and creative professionals use Safari Books Online as their primary resource for research, problem solving, learning, and certification training.

Safari Books Online offers a range of plans and pricing for enterprise, government, education, and individuals.

Members have access to thousands of books, training videos, and prepublication manuscripts in one fully searchable database from publishers like O’Reilly Media, Prentice Hall Professional, Addison-Wesley Professional, Microsoft Press, Sams, Que, Peachpit Press, Focal Press, Cisco Press, John Wiley & Sons, Syngress, Morgan Kaufmann, IBM Redbooks, Packt, Adobe Press, FT Press, Apress, Manning, New Riders, McGraw-Hill, Jones & Bartlett, Course Technology, and hundreds more. For more information about Safari Books Online, please visit us online.

How to Contact Us

We have a web page for this book, where we list non-code-related errata and additional information. You can access this page at:

Any errata related to the code, command-line tools, and virtual machine should be submitted as a ticket through GitHub’s issue tracker at:

Please address comments and questions concerning this book to the publisher:

  • O’Reilly Media, Inc.
  • 1005 Gravenstein Highway North
  • Sebastopol, CA 95472
  • 800-998-9938 (in the United States or Canada)
  • 707-829-0515 (international or local)
  • 707-829-0104 (fax)

To comment or ask technical questions about this book, send email to .

For more information about our books, courses, conferences, and news, see our website at http://www.oreilly.com.

Find us on Facebook: http://facebook.com/oreilly

Follow us on Twitter: http://twitter.com/oreillymedia

Watch us on YouTube: http://www.youtube.com/oreillymedia

Follow Jeroen on Twitter: @jeroenhjanssens

Acknowledgments

First of all, I’d like to thank Mike Dewar and Mike Loukides for believing that my blog post, “Seven Command-Line Tools for Data Science,” which I wrote in September 2013, could be expanded into a book. I thank Jared Lander for inviting me to speak at the New York Open Statistical Programming Meetup, because the preparations gave me the idea for writing the blog post in the first place.

Special thanks to my technical reviewers Mike Dewar, Brian Eoff, and Shane Reustle for reading various drafts, meticulously testing all the commands, and providing invaluable feedback. Your efforts have improved the book greatly. The remaining errors are entirely my own responsibility.

I had the privilege of working together with four amazing editors, namely: Ann Spencer, Julie Steele, Marie Beaugureau, and Matt Hacker. Thank you for your guidance and for being such great liaisons with the many talented people at O’Reilly. Those people include: Huguette Barriere, Sophia DeMartini, Dan Fauxsmith, Yasmina Greco, Rachel James, Jasmine Kwityn, Ben Lorica, Mike Loukides, Andrew Odewahn, and Christopher Pappas. There are many others whom I haven’t met yet because they are operating behind the scenes. Together they ensured that working with O’Reilly has truly been a pleasure.

This book discusses over 80 command-line tools. Needless to say, without these tools, this book wouldn’t have existed in the first place. I’m therefore extremely grateful to all the authors who created and contributed to these tools. The complete list of authors is unfortunately too long to include here; they are mentioned in Appendix A. Thanks especially to Aaron Crow, Jehiah Czebotar, Christopher Groskopf, Dima Kogan, Sergey Lisitsyn, Francisco J. Martin, and Ole Tange for providing help with their amazing command-line tools.

This book makes heavy use of the Data Science Toolbox, a virtual environment that contains all the command-line tools used in this book. It stands on the shoulders of many giants, and as such, I thank the people behind GNU, Linux, Ubuntu, Amazon Web Services, GitHub, Packer, Ansible, Vagrant, and VirtualBox for making the Data Science Toolbox possible. I thank Matthew Russell for the inspiration and feedback for developing the Data Science Toolbox in the first place; his book Mining the Social Web (O’Reilly) also offers a virtual machine.

Eric Postma and Jaap van den Herik, who supervised me during my PhD program, deserve a special thank you. Over the course of five years they have taught me many lessons. Although writing a technical book is quite different from writing a PhD thesis, many of those lessons proved to be very helpful in the past 10 months as well.

Finally, I’d like to thank my colleagues at YPlan, my friends, my family, and especially my wife, Esther, for supporting me and for disconnecting me from the command line at just the right times.
