Chapter 6. Managing Your Data Workflow

We hope that by now you have come to appreciate that the command line is a very convenient environment for doing data science. You may have noticed that, as a consequence of working at the command line, we:

  • Invoke many different commands

  • Create custom and ad-hoc command-line tools

  • Obtain and generate many (intermediate) files

As this process is exploratory by nature, our workflow tends to be rather chaotic, which makes it difficult to keep track of what we’ve done. It’s very important that our steps can be reproduced, whether by ourselves or by others. If we pick up a project again after a few weeks, chances are that we have forgotten which commands we ran, on which files, in which order, and with which parameters. Imagine the difficulty of passing your analysis on to a collaborator.

You may be able to recover some lost commands by digging through your Bash history, but this is fragile: the history holds only a limited number of commands and mixes your analysis steps with everything else you typed. A better approach is to save your commands to a Bash script, such as run.sh (sketched after the list below). This at least allows you and your collaborators to reproduce the analysis. A shell script is, however, a suboptimal approach because:

  • It’s difficult to read and to maintain.

  • Dependencies between steps are unclear.

  • Every step gets executed every time, which is inefficient and sometimes undesirable.
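
To make these drawbacks concrete, here is a minimal sketch of what such a run.sh might look like. The URL, filenames, and pipeline steps are made up for illustration:

    #!/usr/bin/env bash
    # Hypothetical three-step analysis; every step runs on every invocation.
    curl -s http://example.com/data.csv > data.csv       # step 1: download the raw data
    cut -d, -f1,3 data.csv > subset.csv                  # step 2: keep two columns
    sort subset.csv | uniq -c | sort -rn > counts.txt    # step 3: count unique values

Even if data.csv has not changed since the last run, invoking the script downloads it again, and nothing in the file tells a collaborator that step 3 depends on the output of step 2.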

This is where Drake comes in handy (Factual, 2014). Drake is a command-line tool created by Factual that allows you to:

  • Formalize your data workflow steps in terms of input and output dependencies

  • Run specific steps of your workflow

  • Use inline code

  • Store and retrieve data from external sources (e.g., HDFS and Amazon S3)
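
As a first taste, here is the same hypothetical pipeline from the earlier run.sh sketch expressed as a Drake workflow, which by default lives in a file called Drakefile. Each step declares its output to the left of the arrow and its input to the right, followed by indented Bash commands; $INPUT and $OUTPUT refer to the step’s declared input and output files:

    ; Hypothetical Drakefile; lines starting with a semicolon are comments.
    data.csv <-
        curl -s http://example.com/data.csv > $OUTPUT

    subset.csv <- data.csv
        cut -d, -f1,3 $INPUT > $OUTPUT

    counts.txt <- subset.csv
        sort $INPUT | uniq -c | sort -rn > $OUTPUT

Because the dependencies between steps are now explicit, Drake can determine which steps are affected by a change and skip the ones whose outputs are already up to date.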
