Chapter 2. Tricky Statistics with Spark

In this chapter, you will learn the following recipes:

  • Working with Pandas
  • Variable identification
  • Sampling data
  • Summary and descriptive statistics
  • Generating frequency tables
  • Installing Pandas on Linux
  • Installing Pandas from source
  • Using IPython with PySpark
  • Creating Pandas DataFrames over Spark
  • Splitting, slicing, sorting, filtering and grouping DataFrames over Spark
  • Implementing covariance and correlation using DataFrames over Spark
  • Concatenating and merging operations over DataFrames
  • Complex operations over DataFrames
  • Sparkling Pandas

Introduction

Statistics refers to the mathematics and techniques with which we understand data. It is a vast field which plays a key role in the areas of data mining and artificial ...

Get Apache Spark for Data Science Cookbook now with the O’Reilly learning platform.