Live Online Training

Introduction to Pandas

Data Munging in Python

Daniel Chen

Python’s popularity has skyrocketed since the creation of Pandas, which has become the de facto Python library for working with heterogeneous tabular data and has since been integrated with various other Python libraries. While many tasks can be performed in spreadsheet programs such as Excel, Pandas lets you script these tasks in Python, so you have a complete audit trail of how your data was manipulated. Additionally, more and more datasets exceed the size that spreadsheet programs can open, so having an alternative way to work with such data is essential.

This Pandas introduction will guide you from “opening” Python to loading a dataset and beginning the process of cleaning and analyzing data.

What you'll learn and how you can apply it

  • By the end of this course, you'll be able to go from “opening” Python to loading a dataset and beginning to clean and analyze it.

This training course is for you because...

  • You are new to data analytics and/or performing data analytics using Python
  • You want a more reproducible workflow for cleaning and processing data
  • You want to learn Python in an applied way by working with data
  • You have used Python before, but want to see how it can be used to clean and process datasets

Prerequisites

  • It will help if you know some basic bash/shell commands (on macOS/Linux: ls, cd; on Windows: dir, cd)

Participants enrolled in this course need to have the following installed on their computers:

Recommended Preparation:

Pandas Data Analysis with Python Fundamentals LiveLessons (video)

Pandas for Everyone: Python Data Analytics (book)

Tidy data paper: http://vita.had.co.nz/papers/tidy-data.html

Organizing computational (biology) projects: http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1000424

Course setup: Please be sure you have Python (and pandas) installed on your computer before class begins: https://www.anaconda.com/download/

About your instructor

  • Daniel Chen, trainer and data scientist, is a graduate student in the interdisciplinary Ph.D. program in genetics, bioinformatics & computational biology (GBCB) at Virginia Polytechnic Institute and State University (Virginia Tech). He is involved with Software Carpentry and Data Carpentry as an instructor and lesson maintainer. He completed his master’s degree in public health at Columbia University Mailman School of Public Health in epidemiology with a certificate in advanced epidemiology and is currently extending his master’s thesis work on attitude diffusion in social networks in the Social and Decision Analytics Laboratory under the Biocomplexity Institute of Virginia Tech.

Schedule

The time frames are only estimates and may vary according to how the class is progressing.

Day 1

Segment 1 Different ways you can interface with Python (25 min)

  • Anaconda, Python, IPython, and Jupyter notebooks
  • Spyder, nteract, Rodeo IDEs
  • Installing packages

Break + Q&A (5 minutes)

Segment 2 Pandas basics and tour (55 min)

  • Load, subset, slice, filter data
  • The Pandas Series and DataFrame object
  • Quick whirlwind of grouped operations and plots
  • Manual creation of data objects
  • Series and DataFrame object methods
  • Conditional and ‘fancy’ subsetting
  • Saving data: CSV, Excel, Feather, odo library
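
To give a flavor of this segment, here is a minimal sketch of subsetting, filtering, a grouped operation, and saving to CSV. The toy data and column names are made up for illustration and are not the course's dataset.

```python
import pandas as pd

# Hypothetical toy data; the course uses its own datasets.
df = pd.DataFrame(
    {
        "country": ["A", "A", "B", "B"],
        "year": [2000, 2010, 2000, 2010],
        "life_exp": [60.0, 65.0, 70.0, 72.0],
    }
)

# A single column is a Series; a list of columns gives a DataFrame.
years = df["year"]
subset = df[["country", "life_exp"]]

# Row subsetting by label (.loc) and by position (.iloc).
first_row = df.loc[0]
last_two = df.iloc[-2:]

# Conditional (boolean) filtering.
recent = df[df["year"] == 2010]

# A grouped operation: mean life expectancy per country.
mean_by_country = df.groupby("country")["life_exp"].mean()

# Saving: with no path given, to_csv returns the CSV text as a string.
csv_text = df.to_csv(index=False)
```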

Break + Q&A (5 minutes)

Segment 3 Assembling data and missing values (55 min)

  • Concatenation
  • Merging
  • Python for loops and list comprehensions
  • How missing data are represented
  • How missing data are created
  • Ways to count and find missing data
  • Calculations with missing data
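
As a rough illustration of these topics, the sketch below concatenates and merges two small frames, then counts and computes with the missing values that result. The frames and their columns are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical frames sharing an "id" key.
left = pd.DataFrame({"id": [1, 2, 3], "height": [150.0, 160.0, np.nan]})
right = pd.DataFrame({"id": [1, 2, 4], "city": ["X", "Y", "Z"]})

# Concatenation stacks rows; columns absent from one frame are filled with NaN.
stacked = pd.concat([left, right], ignore_index=True)

# Merging joins on a key; how="left" keeps every row of `left`,
# so id 3 has no matching city and gets a missing value.
merged = left.merge(right, on="id", how="left")

# Count missing values per column.
n_missing = merged.isna().sum()

# Most reductions skip missing values by default; skipna=False propagates them.
mean_height = merged["height"].mean()
mean_strict = merged["height"].mean(skipna=False)
```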

Break + Q&A (5 minutes)

Segment 4 Data Reshaping (55 min)

  • Hadley Wickham’s Tidy Data paper
  • Column headers are values, not variable names.
  • Multiple variables are stored in one column.
  • Variables are stored in both rows and columns.
  • Multiple types of observational units are stored in the same table.
  • A single observational unit is stored in multiple tables.
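
The first problem in the list above (column headers are values, not variable names) can be sketched with melt. The table below is a made-up example, not data from the paper or the course.

```python
import pandas as pd

# Hypothetical "wide" table: the income brackets are stored as column headers.
wide = pd.DataFrame(
    {
        "religion": ["A", "B"],
        "<$10k": [27, 12],
        "$10-20k": [34, 27],
    }
)

# melt moves the header values into an "income" column,
# leaving one row per (religion, income) observation.
tidy = wide.melt(id_vars="religion", var_name="income", value_name="count")
```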

Break + Q&A (5 minutes)

Segment 5 Data Types (25 min)

  • Different data types
  • Converting data types
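
A minimal sketch of type conversion, assuming a hypothetical frame in which a numeric column was read in as text:

```python
import pandas as pd

# Hypothetical data: "score" arrived as strings, one of them unparseable.
df = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "score": ["10", "20", "missing"],
        "group": ["a", "b", "a"],
    }
)

# Convert integers to strings.
df["id_str"] = df["id"].astype(str)

# Convert text to numbers; errors="coerce" turns unparseable values into NaN.
df["score_num"] = pd.to_numeric(df["score"], errors="coerce")

# Convert repeated labels to the categorical type.
df["group"] = df["group"].astype("category")
```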

Break + Q&A (5 minutes)

Day 2

Segment 6 Functions (25 min)

  • Writing functions in Python
  • Testing functions and Python unit tests
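
For illustration, a small function with a test written as plain assertions. The function name and the pytest-style test layout are choices made for this sketch, not necessarily what the course uses.

```python
def celsius_to_fahrenheit(temp_c):
    """Convert a temperature from Celsius to Fahrenheit."""
    return temp_c * 9 / 5 + 32


def test_celsius_to_fahrenheit():
    # Known reference points: freezing and boiling of water.
    assert celsius_to_fahrenheit(0) == 32
    assert celsius_to_fahrenheit(100) == 212


# A test runner such as pytest would discover and call this automatically;
# here it is called directly.
test_celsius_to_fahrenheit()
```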

Break + Q&A (5 minutes)

Segment 7 Strings and text data (55 min)

  • Python strings and pandas ‘object’ types
  • Subsetting and slicing strings
  • String methods
  • String formatting
  • Regular Expressions
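
A short sketch of these string tools, using made-up names and a made-up phone-number pattern:

```python
import re

import pandas as pd

# Hypothetical text column.
names = pd.Series(["Alice Smith", "bob jones"])

# String methods are available on a Series through the .str accessor.
upper = names.str.upper()
first_names = names.str.split(" ").str[0]

# Slicing works through .str as well: the first character of each value.
initials = names.str[0]

# String formatting with an f-string.
greeting = f"Hello, {names.iloc[0]}!"

# A regular expression matching three digits, a dash, and four digits.
phone_pattern = re.compile(r"\d{3}-\d{4}")
match = phone_pattern.search("call 555-1234 now")
```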

Break + Q&A (5 minutes)

Segment 8 Applying Functions (25 min)

  • Custom functions
  • Vectorized Functions
  • Lambda Functions
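
As a rough sketch, the same kind of calculation can be written with a custom function, a vectorized expression, or a lambda. The frame and the `square` helper are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})


def square(x):
    """Custom function applied to one value at a time."""
    return x ** 2


# Apply a custom function to each element of a column.
squared = df["a"].apply(square)

# The vectorized equivalent operates on the whole column at once.
squared_vec = df["a"] ** 2

# A lambda applied row by row (axis=1).
row_sum = df.apply(lambda row: row["a"] + row["b"], axis=1)
```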

Break + Q&A (5 minutes)

Segment 9 Grouped Operations (55 min)

  • Aggregation methods
  • Transformation methods
  • Filter methods
  • Iterating over groups
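
The four ideas above might look like this on a toy frame (the data are invented for this sketch):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "group": ["x", "x", "y", "y", "y"],
        "value": [1.0, 3.0, 2.0, 4.0, 6.0],
    }
)
grouped = df.groupby("group")

# Aggregation: one result per group.
means = grouped["value"].mean()

# Transformation: one result per row, here each value minus its group mean.
centered = df["value"] - grouped["value"].transform("mean")

# Filter: keep only the rows of groups with at least three observations.
big_groups = grouped.filter(lambda g: len(g) >= 3)

# Iteration: each step yields the group key and that group's sub-frame.
sizes = {name: len(sub_df) for name, sub_df in grouped}
```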

Break + Q&A (5 minutes)

Segment 10 Dates and Times (25 min)

  • Python datetime
  • Converting to datetime types
  • Loading data with dates
  • Extracting date components
  • Date calculations and time deltas
  • datetime methods
  • Date ranges
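
A minimal sketch of these date operations, assuming a hypothetical column of date strings:

```python
import pandas as pd

# Hypothetical data: dates stored as text.
df = pd.DataFrame({"date": ["2020-01-15", "2020-03-01"]})

# Convert the text column to a datetime type.
df["date"] = pd.to_datetime(df["date"])

# Extract date components through the .dt accessor.
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month

# Subtracting two dates gives a time delta.
delta = df["date"].iloc[1] - df["date"].iloc[0]

# A range of three consecutive days.
days = pd.date_range("2020-01-01", periods=3, freq="D")
```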

Break + Q&A (5 minutes)

Segment 11 Modelling (25 min)

  • Linear models with sklearn and statsmodels
  • Dummy variables
  • patsy
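
To illustrate dummy variables in a linear model, the sketch below encodes a categorical column with pandas and fits ordinary least squares. To keep it dependency-light it uses NumPy's least-squares solver instead of sklearn, statsmodels, or patsy, and the data are invented so that y = 2x + 1 exactly.

```python
import numpy as np
import pandas as pd

# Hypothetical data: y depends only on x; "group" has no effect.
df = pd.DataFrame(
    {
        "x": [0.0, 1.0, 2.0, 3.0],
        "group": ["a", "b", "a", "b"],
        "y": [1.0, 3.0, 5.0, 7.0],
    }
)

# Dummy-encode the category; drop_first avoids a column that is
# redundant with the intercept.
dummies = pd.get_dummies(df["group"], prefix="group", drop_first=True, dtype=float)

# Design matrix: intercept, x, and the dummy for group "b".
X = np.column_stack(
    [np.ones(len(df)), df["x"].to_numpy(), dummies["group_b"].to_numpy()]
)

# Ordinary least squares fit.
coef, *_ = np.linalg.lstsq(X, df["y"].to_numpy(), rcond=None)
intercept, slope, group_b_effect = coef
```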

Break + Q&A (5 minutes)

Segment 12 Wrapping up (25 min)

  • Additional resources
  • Timing your code
  • Thinking about performance
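
One way to time code is the standard library's timeit module. The two functions below are hypothetical examples that compute the same sum, so their timings can be compared; actual timings depend on the machine.

```python
import timeit


def loop_sum(n):
    """Sum the integers below n with an explicit loop."""
    total = 0
    for i in range(n):
        total += i
    return total


def builtin_sum(n):
    """Sum the integers below n with the built-in sum."""
    return sum(range(n))


# Total seconds taken by 100 calls of each function.
loop_time = timeit.timeit(lambda: loop_sum(10_000), number=100)
builtin_time = timeit.timeit(lambda: builtin_sum(10_000), number=100)
```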

Break + Q&A (5 minutes)