Live Online Training

Clean, effective data analysis with Python

The Pandas library from .head() to .tail()

Tom Augspurger

You’ll learn how to solve common problems in data analysis by writing clean, readable, efficient code. Pandas will be the primary tool, though integrations with other libraries like scikit-learn, statsmodels, and matplotlib will be demonstrated. The emphasis will be on gradually learning methods for massaging data into the correct form through real applications, rather than on an exhaustive walk-through of pandas' API.

This course is aimed at beginner and intermediate PyData users. It covers practical topics such as:

  • The basics of NumPy and its relationship to pandas
  • Selecting and indexing
  • Reshaping and tidy data
  • Grouped operations and summarization
  • Merging and joining
  • Interaction with other PyData libraries (statistics and visualization)
  • Some of the more specialized areas of pandas, including Categoricals, time-series analysis, hierarchical indexes, chunked/out-of-core processing, and data pipelines

What you'll learn and how you can apply it

By the end of this live, hands-on, online course, you’ll understand:

  • The subset of the pandas API that covers the most common problems in data wrangling
  • Where pandas fits in the broader scientific Python ecosystem
  • How to get data in and out of a pandas DataFrame
  • How pandas interacts with other PyData libraries, like matplotlib, scikit-learn, and statsmodels
  • Fundamental data wrangling techniques like reshaping, groupby, and filtering

And you'll be able to:

  • Compute simple and sophisticated group-wise summary statistics
  • Write efficient and idiomatic pandas code
  • Clean, reshape, and join datasets in preparation for statistical learning or visualization

This training course is for you because...

  • You are a data analyst who needs to preprocess messy data before feeding it to a machine learning algorithm or visualization library

Prerequisites

  • Some experience with Python and its built-in data structures.
  • Experience with NumPy and vectorized computation will be helpful, but not required, to get the most out of the training.

Recommended Preparation:

Introduction to Python

Learning Pandas

About your instructor

Schedule

The timeframes are only estimates and may vary according to how the class is progressing.

Each section is associated with a Jupyter notebook. The instructor will guide the discussion by introducing the large themes of each topic. Each notebook contains many small exercises (and solutions) for checking your understanding as we progress through the notebooks.

DAY 1

Segment 1: Introduction (5 min)

Introduction to your instructor and pandas

Segment 2: Setup and Jupyter Introduction (5 min)

Clone the repository, follow the setup instructions

Segment 3: Indexing (40 min)

How to select subsets of your data
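
As a taste of this topic, here is a minimal sketch of label-based, position-based, and boolean selection. The DataFrame and its names are made up for illustration and are not taken from the course notebooks.

```python
import pandas as pd

# Hypothetical example data, indexed by a name label.
df = pd.DataFrame(
    {"age": [25, 32, 47], "city": ["Oslo", "Lima", "Pune"]},
    index=["ann", "bob", "cat"],
)

# .loc selects by label: row "bob", column "city".
by_label = df.loc["bob", "city"]

# .iloc selects by integer position: first row, first column.
by_position = df.iloc[0, 0]

# A boolean mask keeps only the rows where the condition is True.
over_30 = df[df["age"] > 30]
```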

Break (5 min)

Segment 4: Alignment and Operations (25 min)

How pandas uses row-labels to do alignment
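
A minimal sketch of what label alignment looks like in practice, using two made-up Series whose row labels only partly overlap:

```python
import pandas as pd

# Two Series with partially overlapping row labels (illustrative data).
a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20, 30], index=["y", "z", "w"])

# Addition matches values by label, not by position.
# Labels present in only one operand produce NaN in the result.
total = a + b

# The .add method with fill_value treats a missing label as 0 instead.
total_filled = a.add(b, fill_value=0)
```

Here `total` holds real sums only for the shared labels "y" and "z", while `total_filled` has a value for every label in either Series.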

Segment 5: Tidy Data (30 min)

Tidy data in pandas
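
One common reshaping step toward tidy data is turning a "wide" table, with one column per year, into a "long" one with one row per observation. The sketch below uses `DataFrame.melt` on invented data; the column names are assumptions chosen for illustration.

```python
import pandas as pd

# Wide layout: one row per city, one column per year (made-up values).
wide = pd.DataFrame(
    {
        "city": ["A", "B"],
        "2019": [1.0, 2.0],
        "2020": [3.0, 4.0],
    }
)

# Long (tidy) layout: one row per (city, year) observation.
tidy = wide.melt(id_vars="city", var_name="year", value_name="value")
```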

Segment 6: Day 1 Wrap (5 min)

Review of the topics covered, preview of the next day

DAY 2

Segment 7: Groupby (35 min)

Grouped operations
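
A minimal sketch of two typical grouped operations, a per-group summary and a per-group transformation, on invented data:

```python
import pandas as pd

# Hypothetical scores for two teams.
df = pd.DataFrame(
    {
        "team": ["a", "a", "b", "b", "b"],
        "points": [1, 3, 2, 4, 6],
    }
)

# Aggregation: one row per group, one column per summary statistic.
summary = df.groupby("team")["points"].agg(["mean", "max"])

# Transformation: the result keeps the original shape, so the group
# mean can be subtracted from each row to center values within a team.
df["centered"] = df["points"] - df.groupby("team")["points"].transform("mean")
```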

Segment 8: Visualization (35 min)

Plotting with matplotlib, pandas, seaborn, and Altair

Break (5 min)

Segment 9: Performance (35 min)

How to avoid writing slow pandas code
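
One frequent source of slow pandas code is row-by-row Python logic where a whole-column operation would do. The sketch below contrasts the two on a tiny made-up table; the difference in speed only becomes noticeable on much larger data.

```python
import pandas as pd

# Hypothetical order lines.
df = pd.DataFrame({"price": [2.0, 3.5, 1.25], "qty": [3, 2, 8]})

# Slower pattern: a Python-level function call for every row.
slow = df.apply(lambda row: row["price"] * row["qty"], axis=1)

# Faster pattern: one vectorized operation over whole columns.
fast = df["price"] * df["qty"]
```

Both produce the same values, so the vectorized form is usually the better default.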

Segment 10: Timeseries (15 min)

Brief introduction to manipulating timeseries data
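
A minimal sketch of two basic time-series manipulations, date-based slicing and resampling, on an invented daily series:

```python
import pandas as pd

# Six days of made-up daily observations.
idx = pd.date_range("2024-01-01", periods=6, freq="D")
ts = pd.Series([1, 2, 3, 4, 5, 6], index=idx)

# Slicing a DatetimeIndex with date strings includes both endpoints.
first_three = ts["2024-01-01":"2024-01-03"]

# Resampling groups the observations into two-day bins and sums each bin.
two_day = ts.resample("2D").sum()
```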

Segment 11: Integrations (15 min)

Brief examples of how pandas plugs into statsmodels and scikit-learn

Segment 12: Day 2 Wrap (5 min)

Review of topics covered, further resources.