Live Online Training

Python Data Handling - A Deeper Dive

David Beazley

Manipulating data is a core part of writing almost any Python program. To represent data, Python provides a small collection of built-in types such as lists, sets, dictionaries, and classes. Additionally, there are useful objects in the standard collections module that are commonly used to solve a variety of data-related problems. Finally, there are third party libraries such as numpy and Pandas that provide additional data handling resources.

In this live training, we’re going to take a deeper look at data representation in Python. Topics will include performance tradeoffs, common programming idioms, and details about Python’s underlying object model.

What you'll learn and how you can apply it

  • Learn how and when to use different built-in types, depending on the problem being addressed.
  • Gain a much deeper awareness of how different types are implemented and their associated costs.
  • Write much more efficient and elegant code for manipulating data.

This training course is for you because...

  • You want to improve the way you write Python data handling scripts.
  • You’re a data scientist and you want to expand your Python knowledge beyond standard tools such as numpy and Pandas.
  • You’ve written programs for handling data, but have run into various performance-related problems.

Prerequisites

  • This course assumes a prior introduction to Python programming. Attendees should know the basics of editing, running, and debugging simple programs.
  • Some prior exposure to numpy or Pandas will be useful, but is not required.

Materials and downloads needed in advance of class:

  • Python 3.6, numpy, and Pandas are recommended.
  • Installing the “Anaconda Python” distribution for Python 3.6 will satisfy all requirements.

Recommended Preparation:

The Python Programming Language (video)


Schedule

The timeframes are only estimates and may vary according to how the class is progressing.

Segment 1 Data Structure Shootout (30 min)

  • Instructor will describe different techniques for representing records.
  • Participants will run an experiment to find the most efficient data representation for a large CSV file.
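
As a rough illustration of the kind of experiment this segment describes, the sketch below loads the same rows as tuples, dictionaries, named tuples, and instances of a class with __slots__, then measures the memory each representation retains. The column names, the sample rows, and the helper names (read_records, retained_bytes) are assumptions made for this example and are not taken from the course materials.

```python
import csv
import io
import tracemalloc
from collections import namedtuple

# Hypothetical sample data standing in for the large CSV file.
HEADER = "name,shares,price\n"
ROWS = "AA,100,32.20\nIBM,50,91.10\nCAT,150,83.44\n"
SAMPLE = HEADER + ROWS

# Record as a named tuple: tuple storage plus attribute access.
HoldingTuple = namedtuple("HoldingTuple", ["name", "shares", "price"])

# Record as a class with __slots__: no per-instance __dict__.
class Holding:
    __slots__ = ("name", "shares", "price")

    def __init__(self, name, shares, price):
        self.name = name
        self.shares = shares
        self.price = price

def read_records(text, make_record):
    """Parse CSV text and build one record per row using make_record."""
    rows = csv.reader(io.StringIO(text))
    next(rows)  # skip the header line
    return [make_record(name, int(shares), float(price))
            for name, shares, price in rows]

def retained_bytes(make_record, copies=1000):
    """Bytes still allocated after loading the sample rows `copies` times."""
    text = HEADER + ROWS * copies
    tracemalloc.start()
    records = read_records(text, make_record)
    current, _peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return current

as_tuples = read_records(SAMPLE, lambda n, s, p: (n, s, p))
as_dicts = read_records(SAMPLE, lambda n, s, p: {"name": n, "shares": s, "price": p})
as_named = read_records(SAMPLE, HoldingTuple)
as_slots = read_records(SAMPLE, Holding)

if __name__ == "__main__":
    candidates = {
        "tuple": lambda n, s, p: (n, s, p),
        "dict": lambda n, s, p: {"name": n, "shares": s, "price": p},
        "namedtuple": HoldingTuple,
        "slots class": Holding,
    }
    for label, factory in candidates.items():
        print(label, retained_bytes(factory))
```

The exact numbers depend on the Python version and platform, so the sketch prints measurements rather than asserting a winner.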

Segment 2 The collections module (25 min)

  • Instructor will describe a few common data handling problems and show solutions with the collections module.
  • Participants will use collections to answer a few questions about the data in Segment 1.
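
A minimal sketch of two problems the collections module is often used for, tabulating and grouping. The records and field names here are hypothetical and only stand in for the Segment 1 data.

```python
from collections import Counter, defaultdict

# Hypothetical records; the real course data may look different.
records = [
    {"name": "AA", "shares": 100, "price": 32.20},
    {"name": "IBM", "shares": 50, "price": 91.10},
    {"name": "AA", "shares": 25, "price": 33.00},
    {"name": "CAT", "shares": 150, "price": 83.44},
    {"name": "IBM", "shares": 75, "price": 90.50},
]

# Tabulating: Counter treats missing keys as zero, so no setup is needed.
total_shares = Counter()
for r in records:
    total_shares[r["name"]] += r["shares"]

# Grouping: defaultdict(list) creates the empty list on first access.
by_name = defaultdict(list)
for r in records:
    by_name[r["name"]].append(r)

# Ranking: most_common(n) returns the n largest (key, count) pairs.
largest = total_shares.most_common(1)

if __name__ == "__main__":
    print(total_shares)
    print(largest)
    print(len(by_name["IBM"]))
```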

Segment 3 Python Object Model (30 min)

  • Instructor will describe the manner in which Python handles objects and some implementation details about Python containers.
  • Participants will see if they can use this newfound knowledge to more efficiently handle the data read in Segment 1.
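
One aspect of the object model that often matters for data handling is that names and containers hold references to objects rather than copies. The sketch below is an illustration under that assumption about what the segment covers; the variable names are made up for the example.

```python
import copy
import sys

row = ["AA", 100, 32.20]

alias = row          # a second name for the same list object
shallow = list(row)  # a new list that refers to the same items

alias[1] = 200       # visible through `row`, because both names share one object

# A container stores references, so the same row can appear twice
# without being duplicated in memory.
nested = [row, row]

# deepcopy builds new objects but preserves sharing inside the copy.
deep = copy.deepcopy(nested)

# sys.getsizeof reports only the list's own storage (its references),
# not the size of the objects it refers to.
container_only = sys.getsizeof(nested)

if __name__ == "__main__":
    print(row, shallow)
    print(nested[0] is nested[1], deep[0] is row, deep[0] is deep[1])
    print(container_only)
```

Because containers hold references, reusing a single object for repeated values is one way to reduce memory when many records share the same field contents.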

Break (15 min)

Segment 4 Thinking in Functions (25 min)

  • Instructor will describe common programming idioms related to a functional programming style. These include list comprehensions, map, reduce, etc.
  • Participants will try some simple data handling experiments to reinforce concepts.
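
The idioms named above can be sketched on a small, hypothetical set of records. The field names are assumptions for illustration only.

```python
from functools import reduce

# Hypothetical records used only for this example.
records = [
    {"name": "AA", "shares": 100, "price": 32.20},
    {"name": "IBM", "shares": 50, "price": 91.10},
    {"name": "CAT", "shares": 150, "price": 83.44},
]

# List comprehension: transform every record into a cost.
costs = [r["shares"] * r["price"] for r in records]

# map: apply a function to each item (wrapped in list() to materialize it).
names = list(map(lambda r: r["name"], records))

# Comprehension with a condition: keep only the records that match.
large = [r["name"] for r in records if r["shares"] > 60]

# reduce: fold a sequence into a single value with a two-argument function.
total = reduce(lambda acc, cost: acc + cost, costs, 0.0)

if __name__ == "__main__":
    print(costs)
    print(names)
    print(large)
    print(total)
```

For a plain sum, the built-in sum(costs) is usually clearer than reduce; reduce is shown here only because the segment mentions it.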

Segment 5 Thinking in Columns (30 min)

  • Instructor will describe an alternative view of data based on arrays and columns, introducing numpy arrays, Pandas dataframes, and common array-oriented programming idioms.
  • Participants will rework some earlier examples using arrays and column-oriented thinking.
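
To show the column-oriented idea without requiring any third party install, the sketch below uses only the standard library: rows are transposed into one sequence per field, with numeric fields stored in typed arrays from the array module. This is a stand-in chosen for this example, not the numpy or Pandas code used in the course, and the data is hypothetical.

```python
from array import array

# Hypothetical row-oriented data: one tuple per record.
rows = [
    ("AA", 100, 32.20),
    ("IBM", 50, 91.10),
    ("CAT", 150, 83.44),
]

# Transpose rows into columns: one sequence per field.
names, shares, prices = zip(*rows)
columns = {
    "name": list(names),
    "shares": array("q", shares),  # compact storage of signed 64-bit integers
    "price": array("d", prices),   # compact storage of C doubles
}

# Whole-column operations replace per-record loops over objects.
total_shares = sum(columns["shares"])
costs = array("d", (s * p for s, p in zip(columns["shares"], columns["price"])))

# A boolean mask selects matching positions across columns.
mask = [s > 60 for s in columns["shares"]]
selected = [n for n, keep in zip(columns["name"], mask) if keep]

if __name__ == "__main__":
    print(total_shares)
    print(list(costs))
    print(selected)
```

Array libraries extend this layout with element-wise arithmetic and mask-based selection applied to whole columns at once.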

Segment 6 Thinking in Streams (25 min)

  • Instructor will introduce stream processing as a useful idiom for solving a variety of data handling problems. Topics will include Python iteration, generator functions, and generator expressions.
  • Participants will reformulate earlier examples to utilize a stream-processing approach.
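
A minimal sketch of a stream-processing pipeline built from generator functions and generator expressions. Each stage pulls one item at a time from the previous stage, so the full dataset is never held in memory. The stage names (read_rows, convert, at_least) and the sample data are hypothetical.

```python
import csv
import io

# Hypothetical sample data; in practice `lines` could be an open file.
SAMPLE = "name,shares,price\nAA,100,32.20\nIBM,50,91.10\nCAT,150,83.44\n"

def read_rows(lines):
    """Generator function: yield one dict per CSV row."""
    for row in csv.DictReader(lines):
        yield row

def convert(rows):
    """Generator function: convert string fields to numbers, one row at a time."""
    for row in rows:
        yield {
            "name": row["name"],
            "shares": int(row["shares"]),
            "price": float(row["price"]),
        }

def at_least(records, min_shares):
    """Generator expression: pass through only the matching records."""
    return (r for r in records if r["shares"] >= min_shares)

# Nothing is read until the final consumer (sum) starts iterating.
lines = io.StringIO(SAMPLE)
pipeline = at_least(convert(read_rows(lines)), 100)
total = sum(r["shares"] for r in pipeline)

if __name__ == "__main__":
    print(total)
```

A generator can be consumed only once, so the pipeline has to be rebuilt to make a second pass over the data.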