Live Online Training

Programming with Data: Python and Pandas

Daniel Gerlanc

For many researchers, modern data analysis requires some kind of programming, whether in R, MATLAB, Stata, or Python. The proliferation of tools and specialized languages for data analysis suggests that general-purpose programming languages like C and Java do not readily address the needs of data scientists; something more is needed.

In this training, you will learn how to accelerate your data analyses using the Python language and Pandas, a library specifically designed for interactive data analysis. Pandas is a massive library, so we will focus on its core functionality: loading, filtering, grouping, and transforming data. Having completed this workshop, you will understand the fundamentals of Pandas, be aware of common pitfalls, and be ready to perform your own analyses.

What you'll learn and how you can apply it

  • Use the Split-Apply-Combine technique to calculate grouped summary statistics like mean, median, and standard deviation on your data
  • Load data from flat files, NumPy arrays, and native Python data structures and compute on them using Pandas
  • Avoid common pitfalls and “gotchas” in Pandas by understanding the conceptual underpinnings common to most data manipulation libraries and environments
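
As a rough illustration of the loading step named above, the following sketch builds a small data frame from each of the three kinds of source. The column names and values are invented for this example, and `io.StringIO` stands in for a delimited file on disk; none of this is taken from the course notebooks.

```python
import io

import numpy as np
import pandas as pd

# From native Python structures: a dict of equal-length lists.
# (The cities and temperatures are made-up sample values.)
from_dict = pd.DataFrame({"city": ["Boston", "Denver"], "temp_f": [61.0, 55.5]})

# From a NumPy array: the array carries no labels, so column names
# are supplied separately.
from_numpy = pd.DataFrame(np.array([[1.0, 2.0], [3.0, 4.0]]), columns=["x", "y"])

# From a flat file: io.StringIO stands in for a CSV file on disk.
csv_text = "city,temp_f\nBoston,61.0\nDenver,55.5\n"
from_csv = pd.read_csv(io.StringIO(csv_text))

print(from_dict)
print(from_numpy)
print(from_csv)
```

With a real file, the same `pd.read_csv` call would take a path instead of the in-memory buffer.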

This training course is for you because...

  • You have a solid understanding of Python programming
  • You want to learn how to load and transform tabular data in Python using Pandas
  • You want to accelerate your understanding of Pandas by learning general principles and requirements common to tabular data manipulation frameworks

Prerequisites

  • Intermediate-level programming ability in Python. Attendees should know the difference between a dict, list, and tuple. Familiarity with control flow (if/else/for/while) and error handling (try/except) is required.
  • No statistics background is required.

Course Set-up:

  • Step-by-step instructions for setting up a working Python environment using Anaconda are available here. You will need a working environment to complete the exercises in the Jupyter notebooks. Alternatively, you may view the notebooks here.

Recommended Preparation:

Recommended Follow-up:

About your instructor

  • Daniel Gerlanc is the Founder and President of EnPlus Advisors, a consultancy specializing in data science and custom software development. He started EnPlus in 2011 after working as a hedge fund quant for 5 years. At EnPlus, he focuses on projects that require expertise in both data analysis and software engineering. He has coauthored several open source R packages, published in peer-reviewed journals, and been an invited speaker at conferences including ODSC and PGConf. He is a graduate of Williams College.

Schedule

The timeframes are only estimates and may vary according to how the class is progressing.

Segment 1: Building Blocks of Tabular Data (35 min)

  • Training Overview (5 min)
  • Introduction to Series (5 min)
  • Creating a Series (5 min)
  • Selecting from and Filtering a Series (10 min)
  • Missing data and Series (5 min)
  • Alignment as a central concept of Pandas (5 min)
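
For orientation, the Series topics in this segment might look something like the sketch below. The labels and values are invented for illustration and are not drawn from the course notebooks.

```python
import numpy as np
import pandas as pd

# Create a Series with an explicit index (labels and values are made up).
prices = pd.Series([10.0, 12.5, np.nan, 9.0], index=["a", "b", "c", "d"])

# Select by label and by integer position.
by_label = prices.loc["b"]
by_position = prices.iloc[0]

# Filter with a boolean mask; a comparison against NaN is False,
# so the missing entry is excluded.
above_threshold = prices[prices > 9.5]

# Missing data: count the missing entries, then drop them.
n_missing = prices.isna().sum()
clean = prices.dropna()

# Alignment: arithmetic matches values on index labels, not on position.
# A label present in only one operand yields NaN in the result.
other = pd.Series([1.0, 1.0], index=["b", "e"])
total = prices + other

print(total)
```

Only the label `b` has a non-missing value in both Series, so it is the only entry of `total` that is not NaN.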

Segment 1 Exercises (15 min)

  • https://github.com/dgerlanc/programming-with-data/blob/master/007-intro-to-pandas-part-1-exercises.ipynb
  • Instructor demonstrates solving exercise (10 min)
  • Break (length: 5 min)

Segment 2: Data Frames - 2D Tabular Data (50 min)

  • Introduction to Data Frames (5 min)
  • Creating Data Frames (10 min)
  • Single Axis Selection (10 min)
  • Multi-Axis Selection (10 min)
  • Selecting and Assigning with Data Frames (10 min)
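
The selection topics in this segment can be sketched roughly as follows. The data frame, its column names, and its row labels are hypothetical and exist only for this example.

```python
import pandas as pd

# A small data frame with made-up contents and explicit row labels.
df = pd.DataFrame(
    {"name": ["Ann", "Bob", "Cy"], "age": [34, 27, 45], "dept": ["ops", "eng", "eng"]},
    index=["r1", "r2", "r3"],
)

# Single-axis selection: one column (returned as a Series), or rows by label.
names = df["name"]
first_two = df.loc[["r1", "r2"]]

# Multi-axis selection: a row mask and a column label in one .loc call.
eng_names = df.loc[df["dept"] == "eng", "name"]

# Selecting and assigning: a single .loc call updates df itself.
# Chained indexing such as df[df["dept"] == "eng"]["age"] = ... may
# operate on a temporary copy and leave df unchanged.
df.loc[df["dept"] == "eng", "age"] += 1

print(df)
```

Doing the selection and the assignment in one `.loc` call is one way to avoid the chained-assignment pitfall noted in the comment.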

Segment 2 Exercises (15 min)

  • https://github.com/dgerlanc/programming-with-data/blob/master/008-intro-to-pandas-part-2-exercises.ipynb
  • Instructor demonstrates solving exercise (15 min)
  • Break (length: 10 min)

Segment 3: Split-Apply-Combine (30 min)

  • Review Data Frames: Data Frame Gotchas (5 min)
  • Read external data in from a delimited text file (5 min)
  • Grouping and summary statistics (10 min)
  • Flexible calculation with apply (10 min)
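
A minimal sketch of Split-Apply-Combine, assuming a toy table with invented `team` and `score` columns: the rows are split by group, a calculation is applied to each group, and the results are combined into one labeled object.

```python
import pandas as pd

# Toy data; the column names and values are made up for this example.
df = pd.DataFrame(
    {
        "team": ["a", "a", "b", "b", "b"],
        "score": [1.0, 3.0, 2.0, 4.0, 6.0],
    }
)

# Split by team, apply several summary statistics, and combine the
# results into one table with a row per team.
summary = df.groupby("team")["score"].agg(["mean", "median", "std"])

# Flexible calculation with apply: run an arbitrary function on each
# group's values. Here, the range (max minus min) of each team's scores.
score_range = df.groupby("team")["score"].apply(lambda s: s.max() - s.min())

print(summary)
print(score_range)
```

Built-in aggregations passed to `agg` are usually the simpler choice; `apply` is useful when the per-group calculation has no built-in equivalent.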

Segment 3 Exercises (15 min)

  • https://github.com/dgerlanc/programming-with-data/blob/master/009-group-pivot-exercises.ipynb
  • Instructor demonstrates solving exercise (10 min)
  • Lecture + Exercises + Break (55 minutes)