How it works...

We declared two Scala arrays and parallelized them into two RDDs, one holding the x values and one holding the y values. We then used the zip() method from the RDD API to produce a paired (that is, zipped) RDD in which each element is an (x, y) pair. From there, we calculated the mean, sum, and so on, and applied the closed-form formula as described to find the intercept and slope of the regression line.
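The steps above can be sketched roughly as follows. This is a minimal illustration rather than the recipe's exact code: the object name `ClosedFormSketch`, the helper `fitLine`, and the sample arrays are all hypothetical, and the sketch assumes a local Spark 2.x installation. It uses the standard least-squares formulas slope = (n·Σxy − Σx·Σy) / (n·Σx² − (Σx)²) and intercept = (Σy − slope·Σx) / n.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object ClosedFormSketch {

  // Hypothetical helper: returns (slope, intercept) for an RDD of (x, y) pairs.
  def fitLine(xy: RDD[(Double, Double)]): (Double, Double) = {
    val n     = xy.count().toDouble
    val sumX  = xy.map(_._1).sum()
    val sumY  = xy.map(_._2).sum()
    val sumXY = xy.map { case (x, y) => x * y }.sum()
    val sumXX = xy.map { case (x, _) => x * x }.sum()

    // Closed-form least-squares solution for a single predictor
    val slope     = (n * sumXY - sumX * sumY) / (n * sumXX - sumX * sumX)
    val intercept = (sumY - slope * sumX) / n
    (slope, intercept)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("ClosedFormSketch")
      .getOrCreate()
    val sc = spark.sparkContext

    // Illustrative data only (not from the recipe): the points lie on y = 2x + 1
    val x = Array(1.0, 2.0, 3.0, 4.0, 5.0)
    val y = Array(3.0, 5.0, 7.0, 9.0, 11.0)

    // zip() needs both RDDs to have the same number of partitions and the
    // same number of elements per partition, so both use the same slice count
    val xRDD = sc.parallelize(x, 2)
    val yRDD = sc.parallelize(y, 2)
    val xy   = xRDD.zip(yRDD).cache()

    val (slope, intercept) = fitLine(xy)
    println(s"slope = $slope, intercept = $intercept")

    spark.stop()
  }
}
```

Because the sample points were chosen to lie exactly on y = 2x + 1, the fitted slope and intercept should come out as 2 and 1. Caching the zipped RDD is a design choice here, since the sketch makes several passes over it.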

In Spark 2.0, the alternative would have been to use the GLM API out of the box. Note that the closed-form (normal equation) scheme supported by GLM is limited to a maximum of 4,096 parameters.
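For comparison, a minimal sketch of the GLM route might look like the following. It assumes the `GeneralizedLinearRegression` estimator from `org.apache.spark.ml.regression` with a Gaussian family and identity link, which corresponds to ordinary linear regression. The object name `GlmSketch`, the helper `fitGlm`, and the data are hypothetical and chosen only for illustration.

```scala
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.regression.GeneralizedLinearRegression
import org.apache.spark.sql.SparkSession

object GlmSketch {

  // Hypothetical helper: fits a Gaussian/identity GLM and returns (slope, intercept).
  def fitGlm(spark: SparkSession): (Double, Double) = {
    // Illustrative (label, features) rows, roughly following y = 2x + 1
    val training = spark.createDataFrame(Seq(
      (3.1,  Vectors.dense(1.0)),
      (4.9,  Vectors.dense(2.0)),
      (7.2,  Vectors.dense(3.0)),
      (8.8,  Vectors.dense(4.0)),
      (11.0, Vectors.dense(5.0))
    )).toDF("label", "features")

    val glr = new GeneralizedLinearRegression()
      .setFamily("gaussian") // Gaussian errors
      .setLink("identity")   // identity link, as in ordinary linear regression
      .setMaxIter(10)
      .setRegParam(0.0)      // no regularization

    val model = glr.fit(training)
    (model.coefficients(0), model.intercept)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("GlmSketch")
      .getOrCreate()

    val (slope, intercept) = fitGlm(spark)
    println(s"slope = $slope, intercept = $intercept")

    spark.stop()
  }
}
```

With a single feature, the fitted coefficient plays the role of the slope in the closed-form calculation, and `model.intercept` plays the role of the intercept.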

We used a closed form formula to demonstrate that a regression line associated with a set of numbers (Y1, ...
