Building a classification system with Random Forest Trees in Spark 2.0

In this recipe, we explore the Random Forest implementation in Spark and use it to solve a discrete classification problem. We found the implementation very fast because Spark exploits parallelism by growing many trees at once. Random Forest also needs relatively little hyper-parameter tuning: in practice, we can often get away with setting only the number of trees.
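To make the point about hyper-parameters concrete, the following is a minimal sketch rather than the recipe's actual code. It assumes the DataFrame-based `spark.ml` API in Spark 2.x and a local Spark session. The object name `RandomForestSketch`, the toy data set, and the chosen values (20 trees, seed 42) are hypothetical and serve only as illustration. The only hyper-parameter set explicitly is the number of trees; everything else is left at the library defaults.

```scala
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

// Hypothetical example object; not taken from the recipe itself.
object RandomForestSketch {
  def main(args: Array[String]): Unit = {
    // Local session for experimentation (assumption: Spark 2.x on the classpath).
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("RandomForestSketch")
      .getOrCreate()
    import spark.implicits._

    // Made-up toy data: a binary label and two numeric features per row.
    val data = Seq(
      (0.0, Vectors.dense(1.0, 0.5)),
      (0.0, Vectors.dense(1.2, 0.3)),
      (0.0, Vectors.dense(0.8, 0.7)),
      (1.0, Vectors.dense(3.1, 2.9)),
      (1.0, Vectors.dense(2.8, 3.3)),
      (1.0, Vectors.dense(3.4, 3.0))
    ).toDF("label", "features")

    // Only the number of trees is set; the seed just makes runs repeatable.
    val rf = new RandomForestClassifier()
      .setNumTrees(20)
      .setSeed(42L)

    // Spark grows the trees of the forest in parallel during fit().
    val model = rf.fit(data)

    // Score the same toy rows to inspect the predictions.
    model.transform(data).select("label", "prediction").show()
    println(s"Trees in the forest: ${model.trees.length}")

    spark.stop()
  }
}
```

On a real problem, the data would be split into training and test sets, and the predictions would be scored on the held-out portion instead of on the training rows.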
