Spark-based K-means for population-scale clustering

In a previous section, we saw how K-means works, so we can dive directly into the implementation. Since the training will be unsupervised, we need to drop the label column (that is, Region):

val sqlContext = sparkSession.sqlContext
val schemaDF = sqlContext.createDataFrame(rowRDD, header).drop("Region")
schemaDF.printSchema()
schemaDF.show(10)
>>>
Figure 16: A snapshot of the training dataset for K-means without the label (that is, Region)
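K-means can only consume numeric features, so if any of the remaining columns in schemaDF were parsed as strings, they need to be cast first. The following is a minimal sketch, assuming all non-label columns hold numeric values; the cast is illustrative and can be skipped if printSchema() already shows numeric types:

import org.apache.spark.sql.functions.col

// Cast every remaining column to Double so K-means can consume it.
// Adapt or skip this step if the schema is already numeric.
val numericDF = schemaDF.columns.foldLeft(schemaDF) { (df, c) =>
  df.withColumn(c, col(c).cast("double"))
}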

Now, we have seen in Chapter 1, Analyzing Insurance Severity Claims, and Chapter 2, Analyzing and Predicting Telecommunication Churn, that Spark expects ...
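The sentence above is cut off in the source; presumably it refers to Spark ML's requirement that all training features be assembled into a single vector column. The following is a minimal sketch of that step followed by K-means training, continuing from the numericDF built above; the choice of K = 5, the seed, and the output column names are illustrative assumptions, not values from the book:

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.clustering.KMeans

// Assemble all remaining columns into a single "features" vector column,
// the input format Spark ML estimators expect.
val assembler = new VectorAssembler()
  .setInputCols(numericDF.columns)
  .setOutputCol("features")
val featureDF = assembler.transform(numericDF)

// Train K-means; K = 5 and the seed are placeholders to tune.
val kmeans = new KMeans()
  .setK(5)
  .setSeed(12345L)
  .setFeaturesCol("features")
  .setPredictionCol("prediction")
val model = kmeans.fit(featureDF)

// Inspect the learned cluster centers and assign each row to a cluster.
model.clusterCenters.foreach(println)
val clusteredDF = model.transform(featureDF)
clusteredDF.select("prediction").show(10)

The fitted KMeansModel also exposes a training summary (for example, model.summary.clusterSizes), which can help judge whether the chosen K is reasonable.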
