Calculating summary statistics

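Summary statistics condense a set of observations into a few numbers that give a collective sense of the data, typically measures of central tendency (such as the mean), spread (such as the variance), and range (the minimum and maximum).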

This recipe covers how to produce summary statistics.
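
To make these quantities concrete, here is a minimal plain-Scala sketch that computes them by hand for a single column. The name `column` and its two values are illustrative assumptions, and the variance shown is the unbiased sample variance (dividing by n - 1); this is not how Spark computes the result internally, only what the numbers mean.

```scala
// Illustrative data only: one column with two observations.
val column = Seq(150.0, 300.0)

val n = column.size
// Central tendency: arithmetic mean.
val mean = column.sum / n
// Spread: unbiased sample variance (sum of squared deviations over n - 1).
val variance = column.map(x => math.pow(x - mean, 2)).sum / (n - 1)
val stdDev = math.sqrt(variance)
// Range: smallest and largest observation.
val minimum = column.min
val maximum = column.max

println(s"count=$n mean=$mean variance=$variance stdDev=$stdDev min=$minimum max=$maximum")
```

Doing this column by column on a single machine does not scale to large datasets, which is why the recipe delegates the computation to Spark.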

How to do it…

Import the vector and statistics classes:

scala> import org.apache.spark.mllib.linalg.{Vectors,Vector}
scala> import org.apache.spark.mllib.stat.Statistics

Create personRDD as an RDD of vectors:

scala> val personRDD = sc.parallelize(List(Vectors.dense(150,60,25), Vectors.dense(300,80,40)))

Compute the column summary statistics:

scala> val summary = Statistics.colStats(personRDD)
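
The value returned by `colStats` can then be inspected one statistic at a time. The following is a sketch under assumptions rather than the recipe's own continuation: it repeats the earlier steps so it can be pasted into a fresh spark-shell session (where `sc` is predefined), and it relies on the accessors of MLlib's `MultivariateStatisticalSummary`. Each accessor except `count` returns a vector with one entry per column.

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics}

// Assumes spark-shell, where `sc` (the SparkContext) already exists.
val personRDD = sc.parallelize(List(Vectors.dense(150, 60, 25), Vectors.dense(300, 80, 40)))
val summary: MultivariateStatisticalSummary = Statistics.colStats(personRDD)

println(summary.count)        // number of rows (vectors) in the RDD
println(summary.mean)         // per-column mean
println(summary.variance)     // per-column sample variance
println(summary.min)          // per-column minimum
println(summary.max)          // per-column maximum
println(summary.numNonzeros)  // per-column count of non-zero entries
```

Because the statistics are computed per column, each vector in the RDD should have the same length and the same meaning for each position.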
