How it works...

We create an RDD, a DataFrame, and a Dataset from the same text file using similar method calls, and confirm the type of each object with the getClass method:

Dataset: spark.read.textFile
RDD: spark.sparkContext.textFile
DataFrame: spark.read.text
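The three calls can be compared side by side in a minimal sketch such as the following. It is not the recipe's own listing: the object name, the temporary sample file, and the local[*] master are assumptions made only so that the example is self-contained.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Hypothetical object name; any Spark 2.x application entry point would do.
object CreateThreeFlavors {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session is enough for illustration.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("rdd-dataframe-dataset-sketch")
      .getOrCreate()

    // Assumption: a throwaway two-line file stands in for the recipe's text file.
    val tmp = Files.createTempFile("sample", ".txt")
    Files.write(tmp, "first line\nsecond line\n".getBytes(StandardCharsets.UTF_8))
    val path = tmp.toString

    // Same file, three entry points, three static types.
    val rdd: RDD[String]    = spark.sparkContext.textFile(path)
    val df: DataFrame       = spark.read.text(path)      // single string column named "value"
    val ds: Dataset[String] = spark.read.textFile(path)

    // getClass reports the runtime class. The RDD shows a concrete RDD subclass,
    // while the DataFrame and the Dataset both report org.apache.spark.sql.Dataset.
    println(rdd.getClass)
    println(df.getClass)
    println(ds.getClass)

    // All three see the same two lines.
    assert(rdd.count() == 2 && df.count() == 2 && ds.count() == 2)

    spark.stop()
  }
}
```

In spark-shell, where a session named spark already exists, only the three creation lines and the getClass calls are needed.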

Note that the three calls look very similar, which can be confusing. Since Spark 2.0, DataFrame is an alias for Dataset[Row], so a DataFrame is itself a Dataset. We showed the preceding methods side by side so that you can pick whichever one produces the data type you need.
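The alias relationship can be checked directly, as in the sketch below. The object name and the in-memory sample data are assumptions for illustration; the point is that a DataFrame can be assigned to a Dataset[Row] without any conversion, and that moving between the three flavors takes one call each.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

// Hypothetical object name, used only for this illustration.
object DataFrameIsDatasetOfRow {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session is enough for illustration.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("dataframe-alias-sketch")
      .getOrCreate()
    import spark.implicits._ // provides toDS() and the String encoder

    // Assumption: two in-memory strings stand in for the lines of a text file.
    val ds: Dataset[String] = Seq("first line", "second line").toDS()

    // Dataset -> DataFrame: the single column is named "value".
    val df: DataFrame = ds.toDF()

    // No conversion needed: DataFrame is a type alias for Dataset[Row].
    val rows: Dataset[Row] = df

    // DataFrame -> typed Dataset, and Dataset -> RDD.
    val typedAgain: Dataset[String] = df.as[String]
    val rdd: RDD[String] = ds.rdd

    // Both flavors are the same runtime class.
    assert(df.getClass == ds.getClass)
    assert(rows.columns.sameElements(Array("value")))
    assert(typedAgain.count() == 2 && rdd.count() == 2)

    spark.stop()
  }
}
```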
