Performing out-of-core computations on large arrays with Dask
Dask is a parallel computing library that offers not only a general framework for distributing complex computations on many nodes, but also a set of convenient high-level APIs to deal with out-of-core computations on large arrays. Dask provides data structures resembling NumPy arrays (dask.array) and Pandas DataFrames (dask.dataframe) that efficiently scale to huge datasets. The core idea of Dask is to split a large array into smaller arrays (chunks).
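As a rough illustration of the chunking idea, the following minimal sketch builds a small chunked array and reduces it. The array shape, chunk size, and variable names are illustrative assumptions, not values taken from this recipe; a real out-of-core workload would use an array too large to fit in memory.

```python
import dask.array as da

# Illustrative sizes (assumed for this sketch): a 1000 x 1000 array of
# ones, split into chunks of 500 x 500, which gives a 2 x 2 grid of blocks.
x = da.ones((1000, 1000), chunks=(500, 500))

# Operations on a Dask array are lazy: this line only builds a task
# graph describing the work to do on each chunk.
total = (x + 1).sum()

# compute() runs the graph chunk by chunk and returns the final value.
result = total.compute()

print(x.numblocks)  # number of chunks along each axis
print(result)
```

Because each chunk is processed independently before the partial sums are combined, only a few chunks need to be held in memory at any one time.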
In this recipe, we illustrate the basic principles of dask.array.
Getting ready
Dask should already be installed in Anaconda, but you can always install it manually with conda install dask. You also need memory_profiler, which you can install ...