Sampling data

In this recipe, we will see how to draw a sample from an entire population of data.

Getting ready

To step through this recipe, you need a machine running Ubuntu 14.04 (a Linux distribution) with Apache Hadoop 2.6 and Apache Spark 1.6.0 installed. Readers are expected to be familiar with sampling techniques.

How to do it…

Let's take loan prediction data as an example. Here is what the sample data looks like:

[Figure: sample rows of the data set]

Note

Download the data from the following location: https://github.com/ChitturiPadma/datasets/blob/master/Loan_Prediction_Data.csv

  1. Here is the code for sampling data from a DataFrame:
     import org.apache.spark._
     import org.apache.spark.sql.SQLContext
     ...
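The listing above is cut off in this excerpt, so the following is only a minimal sketch of how sampling from a DataFrame could look with the Spark 1.6 API. It is not the recipe's own code. The object name `SamplingSketch`, the helper `simpleSample`, the in-memory stand-in rows, and the column names `Loan_ID` and `Loan_Status` are all assumptions made for illustration. It uses `DataFrame.sample` for a simple random sample and `stat.sampleBy` for a stratified one.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext}

object SamplingSketch {

  // Simple random sample without replacement. `fraction` is the expected
  // share of rows to keep, so the exact row count is not guaranteed.
  def simpleSample(df: DataFrame, fraction: Double, seed: Long): DataFrame =
    df.sample(false, fraction, seed)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SamplingSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Hypothetical stand-in rows; the real recipe works on the downloaded CSV.
    // Column names here are assumed, not taken from the recipe.
    val loans = sc.parallelize(1 to 1000)
      .map(i => (s"LP$i", if (i % 4 == 0) "N" else "Y"))
      .toDF("Loan_ID", "Loan_Status")

    // Roughly 10% of the rows, reproducible because the seed is fixed.
    val sampled = simpleSample(loans, 0.1, 42L)
    println(s"Sampled ${sampled.count()} of ${loans.count()} rows")

    // Stratified sample: a different fraction for each value of Loan_Status.
    val stratified = loans.stat.sampleBy("Loan_Status", Map("Y" -> 0.1, "N" -> 0.3), 42L)
    stratified.groupBy("Loan_Status").count().show()

    sc.stop()
  }
}
```

Passing `true` as the first argument to `sample` would draw with replacement instead, in which case the same row can appear more than once in the result.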
