Chapter 2. Tricky Statistics with Spark

In this chapter, you will learn the following recipes:

Working with Pandas
Variable identification
Sampling data
Summary and descriptive statistics
Generating frequency tables
Installing Pandas on Linux
Installing Pandas from source
Using IPython with PySpark
Creating Pandas DataFrames over Spark
Splitting, slicing, sorting, filtering and grouping DataFrames over Spark
Implementing covariance and correlation using DataFrames over Spark
Concatenating and merging operations over DataFrames
Complex operations over DataFrames
Sparkling Pandas

Introduction

Statistics refers to the mathematics and techniques with which we understand data. It is a vast field that plays a key role in the areas of data mining and artificial ...


Apache Spark for Data Science Cookbook by Padma Priya Chitturi
