Calculating summary statistics

Summary statistics condense a set of observations into a few numbers that give a collective sense of the data. They typically include the following:

  • Central tendency of data—mean, mode, median
  • Spread of data—variance, standard deviation
  • Boundary conditions—min, max

This recipe covers how to produce summary statistics.
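
To make the quantities listed above concrete before involving Spark, here is a minimal plain-Scala sketch that computes the per-column mean, sample variance, standard deviation, min, and max for two observations. The values mirror the vectors used later in the recipe. The names rows, columns, mean, and sampleVariance are illustrative helpers and not part of any Spark API, and the variance shown is assumed to be the sample variance (dividing by n - 1).

```scala
// Plain Scala, no Spark required: column-wise statistics for a small table.
// Each inner Seq is one observation (row) with three features.
val rows: Seq[Seq[Double]] = Seq(
  Seq(150.0, 60.0, 25.0),
  Seq(300.0, 80.0, 40.0)
)

// Turn rows into columns so each statistic is computed per feature.
val columns: Seq[Seq[Double]] = rows.transpose

// Arithmetic mean of one column.
def mean(xs: Seq[Double]): Double = xs.sum / xs.size

// Sample variance (divides by n - 1); assumes at least two observations.
def sampleVariance(xs: Seq[Double]): Double = {
  val m = mean(xs)
  xs.map(x => (x - m) * (x - m)).sum / (xs.size - 1)
}

val means     = columns.map(mean)
val variances = columns.map(sampleVariance)
val stdDevs   = variances.map(math.sqrt)
val mins      = columns.map(_.min)
val maxs      = columns.map(_.max)

println(s"mean:     $means")
println(s"variance: $variances")
println(s"std dev:  $stdDevs")
println(s"min:      $mins")
println(s"max:      $maxs")
```

For this data the column means work out to 225, 70, and 32.5, and the sample variances to 11250, 200, and 112.5.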

How to do it…

  1. Start the Spark shell:
    $ spark-shell
    
  2. Import the vector and statistics classes:
    scala> import org.apache.spark.mllib.linalg.{Vectors,Vector}
    scala> import org.apache.spark.mllib.stat.Statistics
    
  3. Create personRDD as an RDD of vectors:
    scala> val personRDD = sc.parallelize(List(Vectors.dense(150,60,25), Vectors.dense(300,80,40)))
    
  4. Compute the column summary statistics:
    scala> val summary = Statistics.colStats(personRDD)
    
  5. Print ...
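
The remainder of the steps is cut off in this excerpt. As a hedged sketch only, and not the book's own listing: Statistics.colStats returns a MultivariateStatisticalSummary, which exposes per-column accessors such as mean, variance, min, max, count, and numNonzeros. Assuming a running Spark shell where sc is the SparkContext the shell provides, they could be inspected as follows. The data from step 3 is repeated so the block can be pasted on its own.

```scala
// Sketch for the Spark shell; assumes `sc` is the SparkContext the shell provides.
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics}

// Same two observations as in step 3.
val personRDD = sc.parallelize(List(
  Vectors.dense(150, 60, 25),
  Vectors.dense(300, 80, 40)))

// One pass over the RDD computes all column statistics.
val summary: MultivariateStatisticalSummary = Statistics.colStats(personRDD)

println(s"count:    ${summary.count}")       // number of rows
println(s"mean:     ${summary.mean}")        // per-column mean
println(s"variance: ${summary.variance}")    // per-column sample variance
println(s"min:      ${summary.min}")         // per-column minimum
println(s"max:      ${summary.max}")         // per-column maximum
println(s"nonzeros: ${summary.numNonzeros}") // per-column count of nonzero entries
```

Each accessor other than count returns a vector with one entry per column, so individual values can be read by index, for example summary.mean(0) for the first column.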
