Take a look at the following steps:
- We start with the necessary imports. Dask will perform the conversion:
    from math import ceil
    import numpy as np
    import h5py
    import dask.array as da
    import dask.dataframe as dd
- We then read all the HDF5 datasets that we want to convert. For this example, we will use the variant positions, a flag indicating whether each position is an SNP, and the QUAL and MQ0 annotations:
    h5_3L = h5py.File('ag1000g.phase1.ar3.pass.3L.h5', 'r')
    positions = h5_3L['/3L/variants/POS']
    is_snp = h5_3L['/3L/variants/is_snp']
    qual = h5_3L['/3L/variants/QUAL']
    mq0 = h5_3L['/3L/variants/MQ0']
- We will now create a Dask DataFrame:
    all_ddf = dd.from_array(positions, columns=['POS'])
    is_snp_dseries = dd.from_array(is_snp)
    ...