In the previous chapter we discussed in general how models are used in data science. In this chapter, we’re going to be diving into algorithms.
An algorithm is a procedure or set of steps or rules to accomplish a task. Algorithms are one of the fundamental concepts in, or building blocks of, computer science: the basis of the design of elegant and efficient code, data preparation and processing, and software engineering.
Some of the basic types of tasks that algorithms can solve are sorting, searching, and graph-based computational problems. Although a given task such as sorting a list of objects could be handled by multiple possible algorithms, there is some notion of “best” as measured by efficiency and computational time, which matters especially when you’re dealing with massive amounts of data and building consumer-facing products.
Efficient algorithms that work sequentially or in parallel are the basis of pipelines to process and prepare data. With respect to data science, there are at least three classes of algorithms one should be aware of:
Data munging, preparation, and processing algorithms, such as sorting, MapReduce, or Pregel.
We would characterize these types of algorithms as data engineering, and while we devote a chapter to this, it’s not the emphasis of this book. This is not to say that you won’t be doing data wrangling and munging—just that we don’t emphasize the algorithmic aspect of it.