
Python for Data Analysis

By Wes McKinney. Published by O'Reilly Media, Inc.

Chapter 5. Getting Started with pandas

pandas will be the primary library of interest throughout much of the rest of the book. It contains high-level data structures and manipulation tools designed to make data analysis fast and easy in Python. pandas is built on top of NumPy, which makes it easy to use in NumPy-centric applications.
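As a rough illustration of that NumPy relationship, the sketch below applies a NumPy function to a pandas object and pulls out the underlying array. The data and variable names are made up for illustration; only the general behavior is the point.

```python
import numpy as np
import pandas as pd

# A small labeled Series (illustrative values, not from the book).
s = pd.Series([1.0, 4.0, 9.0], index=["a", "b", "c"])

# NumPy universal functions accept a Series and return a Series,
# so the axis labels survive the computation.
roots = np.sqrt(s)

# The underlying data is available as a plain NumPy array
# for code that expects ndarrays.
arr = s.values

print(roots)
print(type(arr))
```

Code written against NumPy arrays can therefore usually consume pandas data with little or no conversion work.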

As a bit of background, I started building pandas in early 2008 during my tenure at AQR, a quantitative investment management firm. At the time, I had a distinct set of requirements that were not well-addressed by any single tool at my disposal:

  • Data structures with labeled axes supporting automatic or explicit data alignment. This prevents common errors that result from misaligned data and from working with differently indexed data coming from different sources.

  • Integrated time series functionality.

  • The same data structures handle both time series data and non-time series data.

  • Arithmetic operations and reductions (like summing across an axis) that pass on the metadata (axis labels).

  • Flexible handling of missing data.

  • Merge and other relational operations found in popular databases (SQL-based, for example).
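The requirements above can be sketched in a few lines of present-day pandas. This is a minimal illustration with invented data and names, not code from the book, and it assumes a reasonably recent pandas release.

```python
import pandas as pd

# Two Series with partially overlapping labels (illustrative data).
a = pd.Series([1.0, 2.0, 3.0], index=["x", "y", "z"])
b = pd.Series([10.0, 20.0], index=["y", "z"])

# Automatic alignment: values are matched by label, not position.
# "x" has no counterpart in b, so the result is missing (NaN) there.
total = a + b

# Flexible missing-data handling: fill the gap explicitly.
filled = total.fillna(0.0)

# Reductions pass on the metadata: column sums are labeled by column name.
df = pd.DataFrame({"one": a, "two": b})
col_sums = df.sum()

# The same Series type also holds time series data.
ts = pd.Series([1.0, 2.0, 3.0], index=pd.date_range("2012-01-01", periods=3))
second_day = ts.loc[pd.Timestamp("2012-01-02")]

# Relational operations: a SQL-style inner join on a key column.
left = pd.DataFrame({"key": ["x", "y", "z"], "lval": [1, 2, 3]})
right = pd.DataFrame({"key": ["y", "z", "w"], "rval": [4, 5, 6]})
merged = pd.merge(left, right, on="key", how="inner")

print(total)
print(col_sums)
print(merged)
```

Note that the sum skips the missing value in column "two" by default, and the inner join keeps only the keys present in both tables.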

I wanted to be able to do all of these things in one place, preferably in a language well suited to general-purpose software development. Python was a good candidate language for this, but at that time there was no integrated set of data structures and tools providing this functionality.

Over the last four years, pandas has matured into a quite large ...
