Loading and transforming the data

Even though MLlib is designed with RDDs and DStreams in focus, for ease of transforming the data we will read it in and convert it to a DataFrame.

Note

DStreams are the basic data abstraction for Spark Streaming (see http://bit.ly/2jIDT2A).

Just like in the previous chapter, we first specify the schema of our dataset.

Note

Note that, for brevity, we only present a handful of features here. You should always check this book's GitHub repository for the latest version of the code: https://github.com/drabastomek/learningPySpark.

Here's the code:

import pyspark.sql.types as typ

labels = [
    ('INFANT_ALIVE_AT_REPORT', typ.StringType()),
    ('BIRTH_YEAR', typ.IntegerType()),
    ('BIRTH_MONTH', typ.IntegerType()),
    ('BIRTH_PLACE', ...
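With the list of (column name, type) pairs in place, the next step is to turn it into a schema and read the data into a DataFrame. The following is a minimal sketch of that step; it assumes the labels list above is complete, that a SparkSession named spark is available, and it uses a placeholder file name (births_train.csv.gz) rather than a path confirmed in this excerpt:

import pyspark.sql.types as typ

# Build a StructType schema from the (name, type) pairs in `labels`;
# marking every column as nullable keeps the read permissive.
schema = typ.StructType([
    typ.StructField(name, dtype, True)
    for name, dtype in labels
])

# Read the CSV file into a DataFrame using the explicit schema.
# 'births_train.csv.gz' is a placeholder file name for illustration.
births = spark.read.csv(
    'births_train.csv.gz',
    header=True,
    schema=schema
)

Supplying an explicit schema avoids a second pass over the data for type inference and guarantees the column types MLlib expects.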
