Data pre-processing and feature engineering

As I already stated, the 24 VCF files together contribute 820 GB of data. Therefore, I decided to use only the genetic variants of chromosome Y to make the demonstration clearer. That file is around 160 MB, which should not pose a huge computational challenge. You can download all the VCF files, as well as the panel file, from ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/.

Let's get started by creating a SparkSession, the gateway to the Spark application:

import org.apache.spark.sql.SparkSession

val spark: SparkSession = SparkSession
  .builder()
  .appName("PopStrat")
  .master("local[*]") // run locally, using all available cores
  .config("spark.sql.warehouse.dir", "C:/Exp/")
  .getOrCreate()

Then let's show Spark the paths of both the VCF file and the panel file:

val genotypeFile ...
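
Before handing file paths to Spark, it can help to confirm that the downloaded files are actually where you expect them. The following is a minimal sketch of such a check and is not part of the original listing: the object name InputCheck, the helper missingInputs, and the two example paths are hypothetical placeholders, so substitute the locations where you saved the chromosome Y VCF file and the panel file.

```scala
import java.nio.file.{Files, Paths}

object InputCheck {
  // Hypothetical helper: returns the subset of paths that are not
  // readable regular files on the local filesystem.
  def missingInputs(paths: Seq[String]): Seq[String] =
    paths.filterNot { p =>
      val path = Paths.get(p)
      Files.isRegularFile(path) && Files.isReadable(path)
    }

  def main(args: Array[String]): Unit = {
    // Placeholder paths (assumptions); replace them with your own.
    val inputs = Seq("data/chrY.vcf", "data/panel.txt")
    val missing = missingInputs(inputs)
    if (missing.isEmpty) println("All input files found.")
    else println(s"Missing input files: ${missing.mkString(", ")}")
  }
}
```

The check only covers local paths; files stored on HDFS or another remote filesystem would need a different approach.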
