The following code splits the ratings DataFrame into a training DataFrame (75%) and a test DataFrame (25%). The seed is optional, but setting it makes the split reproducible:
// Split the ratings DataFrame into training (75%) and test (25%) DataFrames
val splits = ratingsDF.randomSplit(Array(0.75, 0.25), seed = 12345L)
val (trainingData, testData) = (splits(0), splits(1))
val numTraining = trainingData.count()
val numTest = testData.count()
println("Training: " + numTraining + " test: " + numTest)
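You should see 78,792 ratings in the training DataFrame and 26,547 ratings in the test DataFrame. That is roughly 74.8% and 25.2% of the 105,339 ratings rather than an exact 75/25 division, because randomSplit treats the weights as approximate proportions.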
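To see why the counts are only approximately 75/25 and why a fixed seed reproduces them, the following is a minimal sketch in plain Scala, with no Spark dependency. It is not Spark's implementation: RandomSplitSketch and its randomSplit helper are hypothetical names introduced here for illustration, and the range of integers stands in for the rating rows. The idea shown is that each row is assigned to a split independently by a random draw, so the split sizes fluctuate around the requested weights:

```scala
import scala.util.Random

object RandomSplitSketch {
  // Hypothetical helper (not a Spark API): assigns each element to one split by
  // drawing a uniform number and locating it in the cumulative weight ranges.
  def randomSplit[A](data: Seq[A], weights: Array[Double], seed: Long): Seq[Seq[A]] = {
    require(weights.nonEmpty && weights.forall(_ >= 0) && weights.sum > 0,
      "weights must be non-negative with a positive sum")
    val total = weights.sum
    // Upper bound of each split's range, e.g. (0.75, 1.0) for weights (0.75, 0.25)
    val bounds = weights.scanLeft(0.0)(_ + _ / total).tail
    val rng = new Random(seed)
    val assigned = data.map { x =>
      val u = rng.nextDouble()
      val idx = bounds.indexWhere(u < _)
      // Guard against rounding leaving the last bound slightly below 1.0
      (if (idx < 0) weights.length - 1 else idx, x)
    }
    weights.indices.map(i => assigned.filter(_._1 == i).map(_._2))
  }

  def main(args: Array[String]): Unit = {
    val ratings = 1 to 10000 // stand-in for the rating rows
    val splits = randomSplit(ratings, Array(0.75, 0.25), seed = 12345L)
    val (training, test) = (splits(0), splits(1))
    // Sizes are close to, but usually not exactly, 7500 and 2500
    println("Training: " + training.size + " test: " + test.size)
    // The same seed yields the same assignment, so this prints true
    println(randomSplit(ratings, Array(0.75, 0.25), seed = 12345L) == splits)
  }
}
```

Changing the seed in this sketch gives a different assignment with slightly different sizes, which is the reason to fix the seed when results need to be compared across runs.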