There's more...

The statistical API is remarkably efficient on large datasets, and it gives you the basic building blocks for implementing any statistical learning algorithm from scratch. Based on our research and experience with half versus full matrix factorization, we encourage you to read the Spark source code first and confirm that equivalent functionality is not already implemented before writing your own.

While we only demonstrate a basic statistics summary here, Spark comes equipped out of the box with:

  • Correlation: Statistics.corr(seriesX, seriesY, "type of correlation"):
    • Pearson (default)
    • Spearman
  • Stratified sampling (RDD API):
    • With replacement
    • Without replacement - requires ...
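The two facilities listed above can be sketched as follows. This is a minimal illustration rather than the book's own recipe: the object name `StatisticsSketch`, the helper names, the sample values, the sampling fractions, and the seed are all assumptions chosen for the example. It relies on `Statistics.corr` from `org.apache.spark.mllib.stat` and on `sampleByKey` from the pair RDD API, run against a local Spark session.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// Hypothetical object name, used only for this sketch.
object StatisticsSketch {

  // Returns (Pearson, Spearman) correlation for two made-up series.
  def correlations(sc: SparkContext): (Double, Double) = {
    // Statistics.corr expects both RDDs to have the same number of
    // partitions and the same number of elements per partition.
    val seriesX: RDD[Double] = sc.parallelize(Seq(1.0, 2.0, 3.0, 4.0, 5.0), 2)
    // y = x^3: monotonic but not linear, so the two measures differ.
    val seriesY: RDD[Double] = sc.parallelize(Seq(1.0, 8.0, 27.0, 64.0, 125.0), 2)

    val pearson = Statistics.corr(seriesX, seriesY, "pearson")
    val spearman = Statistics.corr(seriesX, seriesY, "spearman")
    (pearson, spearman)
  }

  // Stratified sampling on a keyed RDD; the key is the stratum.
  def stratifiedSample(sc: SparkContext): Array[(String, Int)] = {
    val data = sc.parallelize(Seq(
      ("a", 1), ("a", 2), ("a", 3), ("a", 4),
      ("b", 5), ("b", 6), ("b", 7), ("b", 8)))

    // Assumed per-stratum sampling fractions, for illustration only.
    val fractions = Map("a" -> 0.5, "b" -> 0.5)

    // withReplacement = true would allow an element to be drawn more than once.
    data.sampleByKey(withReplacement = false, fractions = fractions, seed = 42L)
      .collect()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("StatisticsSketch")
      .getOrCreate()
    val sc = spark.sparkContext

    val (pearson, spearman) = correlations(sc)
    println(s"Pearson: $pearson, Spearman: $spearman")
    stratifiedSample(sc).foreach(println)

    spark.stop()
  }
}
```

Because the second series is a monotonic but non-linear function of the first, the Spearman coefficient should come out at (approximately) 1 while the Pearson coefficient stays below 1, which is a quick way to see the difference between the two correlation types. `sampleByKey` makes an independent keep-or-drop decision per element, so the per-stratum sample sizes only approximate the requested fractions.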
