LabeledPoint is a data structure that has been around since the early days of Spark's machine learning library. It packages a feature vector together with a label so that the pair can be used in supervised learning algorithms. We demonstrate a short recipe that uses LabeledPoint, the Seq data structure, and DataFrame to run a logistic regression for binary classification of the data. The emphasis here is on LabeledPoint; the regression algorithms are covered in more depth in Chapter 5, Practical Machine Learning with Regression and Classification in Spark 2.0 - Part I, and Chapter 6, Practical Machine Learning with Regression and Classification in Spark 2.0 - Part II.
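The flow described above can be sketched roughly as follows. This is a minimal illustration and not the book's own recipe code: the object name `LabeledPointSketch`, the toy data values, and the `maxIter` and `regParam` settings are all assumptions chosen for the example. It assumes Spark 2.x with the `spark-mllib` module on the classpath and uses the DataFrame-based `org.apache.spark.ml` package.

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.LabeledPoint
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

// Hypothetical object name, used only for this sketch.
object LabeledPointSketch {

  // Made-up toy data: each LabeledPoint pairs a Double label (0.0 or 1.0)
  // with a dense feature vector.
  val trainingData: Seq[LabeledPoint] = Seq(
    LabeledPoint(0.0, Vectors.dense(0.2, 1.9)),
    LabeledPoint(1.0, Vectors.dense(1.8, 0.3)),
    LabeledPoint(0.0, Vectors.dense(0.4, 1.5)),
    LabeledPoint(1.0, Vectors.dense(1.6, 0.1))
  )

  def main(args: Array[String]): Unit = {
    // Local session for experimentation; a cluster deployment would differ.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("LabeledPointSketch")
      .getOrCreate()

    // LabeledPoint is a case class, so the Seq converts to a DataFrame
    // with the columns "label" and "features".
    val df = spark.createDataFrame(trainingData)

    // Assumed settings: a few iterations and light regularization,
    // since the toy data is tiny.
    val lr = new LogisticRegression()
      .setMaxIter(10)
      .setRegParam(0.01)

    // By default the estimator reads the "label" and "features" columns.
    val model = lr.fit(df)

    // Score the training rows to see the predicted class per point.
    model.transform(df).select("label", "features", "prediction").show()

    spark.stop()
  }
}
```

Because `LabeledPoint` already uses the field names `label` and `features`, the DataFrame built from the `Seq` matches the default input columns of `LogisticRegression`, so no column renaming is needed in this sketch.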
LabeledPoint data structure for Spark ML