In this recipe, we will see how to analyze the distribution of the variables in a dataset. Generally, we would plot a histogram or box plot of each variable to understand its distribution and to identify outliers. However, Spark currently has no built-in support for plotting data. Let's see how we can perform the same kind of analysis by generating frequency tables instead.
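As a rough illustration of what a frequency table contains, here is a minimal plain-Python sketch; it is not the recipe's Spark code. The helper `frequency_table` and the values in `sample_column` are hypothetical and chosen only for demonstration. On a Spark DataFrame, comparable counts can be obtained with `groupBy(...).count()`.

```python
from collections import Counter


def frequency_table(values):
    """Return (value, count, share) rows, most frequent value first."""
    counts = Counter(values)
    total = sum(counts.values())
    # most_common() orders rows by descending count
    return [(value, count, count / total)
            for value, count in counts.most_common()]


# Hypothetical categorical column; these values are illustrative only.
sample_column = ["A", "B", "A", "C", "A", "B"]

for value, count, share in frequency_table(sample_column):
    print(f"{value}\t{count}\t{share:.2f}")
```

A value that dominates the table, or one that appears only a handful of times, points to the same skew or unusual values that a histogram would reveal visually.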
To step through this recipe, you need a machine running Ubuntu 14.04 (a Linux flavor), with Apache Hadoop 2.6 and Apache Spark 1.6.0 installed.
Let's take an example of load prediction data. Here is what the sample data looks like: