Chapter 10. Data Manipulation

This chapter summarizes popular Python libraries for manipulating numeric, text, image, and audio data. Almost all of the libraries described here serve a unique purpose, so this chapter’s goal is to describe them, not to compare them. Unless noted, all of them can be installed directly from PyPI using pip:

$ pip install library
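
To see which libraries are already present before installing anything, a small check along these lines may help. It is only a sketch: the helper name installed_status is hypothetical, and the import names in MODULES are assumptions based on each project's usual top-level module (scikit-learn, for instance, is imported as sklearn).

```python
from importlib.util import find_spec

# Assumed top-level import names for the libraries covered in this chapter.
# The PyPI name and the import name can differ (scikit-learn -> sklearn).
MODULES = ["IPython", "numpy", "scipy", "matplotlib", "pandas", "sklearn", "rpy2"]


def installed_status(modules):
    """Map each top-level module name to True if Python can locate it."""
    return {name: find_spec(name) is not None for name in modules}


if __name__ == "__main__":
    for name, found in installed_status(MODULES).items():
        print(f"{name:12} {'installed' if found else 'missing'}")
```

Anything reported as missing can then be installed with pip as shown above.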

Table 10-1 briefly describes these libraries.

Table 10-1. Data tools
Python library | License | Reason to use
IPython | BSD 3-clause license | Provides an enhanced Python interpreter with input history, an integrated debugger, and in-terminal graphics and plots (with the Qt-enabled version).
NumPy | BSD 3-clause license | Provides multidimensional arrays and linear algebra tools, optimized for speed.
SciPy | BSD license | Provides functions and utilities related to engineering and science, from linear algebra to signal processing, integration, root finding, statistical distributions, and other topics.
Matplotlib | BSD license | Provides scientific plotting.
Pandas | BSD license | Provides Series and DataFrame objects that can be sorted, merged, grouped, aggregated, indexed, windowed, and subset, much like an R data frame or the result of a SQL query.
scikit-learn | BSD 3-clause license | Provides machine learning algorithms, including dimensionality reduction, classification, regression, clustering, model selection, imputation of missing data, and preprocessing.
Rpy2 | GPLv2 license |
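
As a rough illustration of the NumPy entry in Table 10-1 (multidimensional arrays plus linear algebra), the sketch below solves a small linear system. The matrix A and vector b are made-up values chosen only for this example, and the block assumes NumPy is installed.

```python
import numpy as np

# Hypothetical 2x2 system A @ x = b, used only for illustration:
#   3x + 1y = 9
#   1x + 2y = 8
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([9.0, 8.0])

# Solve the linear system with NumPy's linear algebra routines.
x = np.linalg.solve(A, b)

# Array arithmetic applies to whole arrays at once: check the residual.
residual = A @ x - b
solution_ok = np.allclose(residual, 0.0)

print(x)
print(solution_ok)
```

The same array objects are what SciPy, Pandas, and scikit-learn build on, which is one reason NumPy is usually installed first.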
