There's more...

In the next recipe, we will discuss how to speed up processing by running it in parallel. Parallel processing of HDF5 is a complex topic, especially parallel writing, which is generally not possible with a standard (non-MPI) HDF5 build. Check out https://stackoverflow.com/questions/41367568/dask-and-parallel-hdf5-writing for more information.

To create HDF5 files from VCF files, check Alistair Miles' blog post (https://alimanfoo.github.io/2017/06/14/read-vcf.html) and his library, scikit-allel (https://scikit-allel.readthedocs.io/en/latest/). His other library, Zarr, can be a great choice for dealing with persistent data (https://zarr.readthedocs.io/en/stable/).
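
As a rough illustration of the VCF-to-HDF5 conversion mentioned above, here is a minimal sketch that assumes scikit-allel is installed. The file names, the two-sample VCF content, and the helper names `write_tiny_vcf` and `vcf_to_hdf5` are hypothetical and exist only for this example; the conversion itself is delegated to scikit-allel's `allel.vcf_to_hdf5` function.

```python
from pathlib import Path

# Hypothetical two-sample, two-variant VCF, used only for illustration.
TINY_VCF = (
    "##fileformat=VCFv4.2\n"
    '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">\n'
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsampleA\tsampleB\n"
    "chr1\t100\t.\tA\tT\t.\tPASS\t.\tGT\t0/1\t1/1\n"
    "chr1\t200\t.\tG\tC\t.\tPASS\t.\tGT\t0/0\t0/1\n"
)


def write_tiny_vcf(path):
    """Write the illustrative VCF to `path` and return it as a Path."""
    path = Path(path)
    path.write_text(TINY_VCF)
    return path


def vcf_to_hdf5(vcf_path, h5_path):
    """Convert a VCF file to HDF5 using scikit-allel (assumed installed)."""
    # Imported here so that the rest of the module loads without scikit-allel.
    import allel

    # fields="*" asks scikit-allel to extract every field it finds;
    # overwrite=True replaces datasets that already exist in the output file.
    allel.vcf_to_hdf5(str(vcf_path), str(h5_path), fields="*", overwrite=True)
```

Calling `vcf_to_hdf5(write_tiny_vcf("tiny.vcf"), "tiny.h5")` should leave an HDF5 file that you can then open with h5py, as in this recipe. For real datasets, you would probably select only the fields you need instead of `"*"` to keep the output small.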
