Python for Data Analysis

By Wes McKinney. Published by O'Reilly Media, Inc.
  1. Python for Data Analysis
  2. SPECIAL OFFER: Upgrade this ebook with O’Reilly
  3. Preface
    1. Conventions Used in This Book
    2. Using Code Examples
    3. Safari® Books Online
    4. How to Contact Us
  4. 1. Preliminaries
    1. What Is This Book About?
    2. Why Python for Data Analysis?
      1. Python as Glue
      2. Solving the “Two-Language” Problem
      3. Why Not Python?
    3. Essential Python Libraries
      1. NumPy
      2. pandas
      3. matplotlib
      4. IPython
      5. SciPy
    4. Installation and Setup
      1. Windows
      2. Apple OS X
      3. GNU/Linux
      4. Python 2 and Python 3
      5. Integrated Development Environments (IDEs)
    5. Community and Conferences
    6. Navigating This Book
      1. Code Examples
      2. Data for Examples
      3. Import Conventions
      4. Jargon
    7. Acknowledgements
  5. 2. Introductory Examples
    1. 1.usa.gov data from bit.ly
      1. Counting Time Zones in Pure Python
      2. Counting Time Zones with pandas
    2. MovieLens 1M Data Set
      1. Measuring rating disagreement
    3. US Baby Names 1880-2010
      1. Analyzing Naming Trends
    4. Conclusions and The Path Ahead
  6. 3. IPython: An Interactive Computing and Development Environment
    1. IPython Basics
      1. Tab Completion
      2. Introspection
      3. The %run Command
      4. Executing Code from the Clipboard
      5. Keyboard Shortcuts
      6. Exceptions and Tracebacks
      7. Magic Commands
      8. Qt-based Rich GUI Console
      9. Matplotlib Integration and Pylab Mode
    2. Using the Command History
      1. Searching and Reusing the Command History
      2. Input and Output Variables
      3. Logging the Input and Output
    3. Interacting with the Operating System
      1. Shell Commands and Aliases
      2. Directory Bookmark System
    4. Software Development Tools
      1. Interactive Debugger
      2. Timing Code: %time and %timeit
      3. Basic Profiling: %prun and %run -p
      4. Profiling a Function Line-by-Line
    5. IPython HTML Notebook
    6. Tips for Productive Code Development Using IPython
      1. Reloading Module Dependencies
      2. Code Design Tips
    7. Advanced IPython Features
      1. Making Your Own Classes IPython-friendly
      2. Profiles and Configuration
    8. Credits
  7. 4. NumPy Basics: Arrays and Vectorized Computation
    1. The NumPy ndarray: A Multidimensional Array Object
      1. Creating ndarrays
      2. Data Types for ndarrays
      3. Operations between Arrays and Scalars
      4. Basic Indexing and Slicing
      5. Boolean Indexing
      6. Fancy Indexing
      7. Transposing Arrays and Swapping Axes
    2. Universal Functions: Fast Element-wise Array Functions
    3. Data Processing Using Arrays
      1. Expressing Conditional Logic as Array Operations
      2. Mathematical and Statistical Methods
      3. Methods for Boolean Arrays
      4. Sorting
      5. Unique and Other Set Logic
    4. File Input and Output with Arrays
      1. Storing Arrays on Disk in Binary Format
      2. Saving and Loading Text Files
    5. Linear Algebra
    6. Random Number Generation
    7. Example: Random Walks
      1. Simulating Many Random Walks at Once
  8. 5. Getting Started with pandas
    1. Introduction to pandas Data Structures
      1. Series
      2. DataFrame
      3. Index Objects
    2. Essential Functionality
      1. Reindexing
      2. Dropping entries from an axis
      3. Indexing, selection, and filtering
      4. Arithmetic and data alignment
      5. Function application and mapping
      6. Sorting and ranking
      7. Axis indexes with duplicate values
    3. Summarizing and Computing Descriptive Statistics
      1. Correlation and Covariance
      2. Unique Values, Value Counts, and Membership
    4. Handling Missing Data
      1. Filtering Out Missing Data
      2. Filling in Missing Data
    5. Hierarchical Indexing
      1. Reordering and Sorting Levels
      2. Summary Statistics by Level
      3. Using a DataFrame’s Columns
    6. Other pandas Topics
      1. Integer Indexing
      2. Panel Data
  9. 6. Data Loading, Storage, and File Formats
    1. Reading and Writing Data in Text Format
      1. Reading Text Files in Pieces
      2. Writing Data Out to Text Format
      3. Manually Working with Delimited Formats
      4. JSON Data
      5. XML and HTML: Web Scraping
    2. Binary Data Formats
      1. Using HDF5 Format
      2. Reading Microsoft Excel Files
    3. Interacting with HTML and Web APIs
    4. Interacting with Databases
      1. Storing and Loading Data in MongoDB
  10. 7. Data Wrangling: Clean, Transform, Merge, Reshape
    1. Combining and Merging Data Sets
      1. Database-style DataFrame Merges
      2. Merging on Index
      3. Concatenating Along an Axis
      4. Combining Data with Overlap
    2. Reshaping and Pivoting
      1. Reshaping with Hierarchical Indexing
      2. Pivoting “long” to “wide” Format
    3. Data Transformation
      1. Removing Duplicates
      2. Transforming Data Using a Function or Mapping
      3. Replacing Values
      4. Renaming Axis Indexes
      5. Discretization and Binning
      6. Detecting and Filtering Outliers
      7. Permutation and Random Sampling
      8. Computing Indicator/Dummy Variables
    4. String Manipulation
      1. String Object Methods
      2. Regular expressions
      3. Vectorized string functions in pandas
    5. Example: USDA Food Database
  11. 8. Plotting and Visualization
    1. A Brief matplotlib API Primer
      1. Figures and Subplots
      2. Colors, Markers, and Line Styles
      3. Ticks, Labels, and Legends
      4. Annotations and Drawing on a Subplot
      5. Saving Plots to File
      6. matplotlib Configuration
    2. Plotting Functions in pandas
      1. Line Plots
      2. Bar Plots
      3. Histograms and Density Plots
      4. Scatter Plots
    3. Plotting Maps: Visualizing Haiti Earthquake Crisis Data
    4. Python Visualization Tool Ecosystem
      1. Chaco
      2. mayavi
      3. Other Packages
      4. The Future of Visualization Tools?
  12. 9. Data Aggregation and Group Operations
    1. GroupBy Mechanics
      1. Iterating Over Groups
      2. Selecting a Column or Subset of Columns
      3. Grouping with Dicts and Series
      4. Grouping with Functions
      5. Grouping by Index Levels
    2. Data Aggregation
      1. Column-wise and Multiple Function Application
      2. Returning Aggregated Data in “unindexed” Form
    3. Group-wise Operations and Transformations
      1. Apply: General split-apply-combine
      2. Quantile and Bucket Analysis
      3. Example: Filling Missing Values with Group-specific Values
      4. Example: Random Sampling and Permutation
      5. Example: Group Weighted Average and Correlation
      6. Example: Group-wise Linear Regression
    4. Pivot Tables and Cross-Tabulation
      1. Cross-Tabulations: Crosstab
    5. Example: 2012 Federal Election Commission Database
      1. Donation Statistics by Occupation and Employer
      2. Bucketing Donation Amounts
      3. Donation Statistics by State
  13. 10. Time Series
    1. Date and Time Data Types and Tools
      1. Converting between string and datetime
    2. Time Series Basics
      1. Indexing, Selection, Subsetting
      2. Time Series with Duplicate Indices
    3. Date Ranges, Frequencies, and Shifting
      1. Generating Date Ranges
      2. Frequencies and Date Offsets
      3. Shifting (Leading and Lagging) Data
    4. Time Zone Handling
      1. Localization and Conversion
      2. Operations with Time Zone−aware Timestamp Objects
      3. Operations between Different Time Zones
    5. Periods and Period Arithmetic
      1. Period Frequency Conversion
      2. Quarterly Period Frequencies
      3. Converting Timestamps to Periods (and Back)
      4. Creating a PeriodIndex from Arrays
    6. Resampling and Frequency Conversion
      1. Downsampling
      2. Upsampling and Interpolation
      3. Resampling with Periods
    7. Time Series Plotting
    8. Moving Window Functions
      1. Exponentially-weighted functions
      2. Binary Moving Window Functions
      3. User-Defined Moving Window Functions
    9. Performance and Memory Usage Notes
  14. 11. Financial and Economic Data Applications
    1. Data Munging Topics
      1. Time Series and Cross-Section Alignment
      2. Operations with Time Series of Different Frequencies
      3. Time of Day and “as of” Data Selection
      4. Splicing Together Data Sources
      5. Return Indexes and Cumulative Returns
    2. Group Transforms and Analysis
      1. Group Factor Exposures
      2. Decile and Quartile Analysis
    3. More Example Applications
      1. Signal Frontier Analysis
      2. Future Contract Rolling
      3. Rolling Correlation and Linear Regression
  15. 12. Advanced NumPy
    1. ndarray Object Internals
      1. NumPy dtype Hierarchy
    2. Advanced Array Manipulation
      1. Reshaping Arrays
      2. C versus Fortran Order
      3. Concatenating and Splitting Arrays
      4. Repeating Elements: Tile and Repeat
      5. Fancy Indexing Equivalents: Take and Put
    3. Broadcasting
      1. Broadcasting Over Other Axes
      2. Setting Array Values by Broadcasting
    4. Advanced ufunc Usage
      1. ufunc Instance Methods
      2. Custom ufuncs
    5. Structured and Record Arrays
      1. Nested dtypes and Multidimensional Fields
      2. Why Use Structured Arrays?
      3. Structured Array Manipulations: numpy.lib.recfunctions
    6. More About Sorting
      1. Indirect Sorts: argsort and lexsort
      2. Alternate Sort Algorithms
      3. numpy.searchsorted: Finding elements in a Sorted Array
    7. NumPy Matrix Class
    8. Advanced Array Input and Output
      1. Memory-mapped Files
      2. HDF5 and Other Array Storage Options
    9. Performance Tips
      1. The Importance of Contiguous Memory
      2. Other Speed Options: Cython, f2py, C
  16. A. Python Language Essentials
    1. The Python Interpreter
    2. The Basics
      1. Language Semantics
      2. Scalar Types
      3. Control Flow
    3. Data Structures and Sequences
      1. Tuple
      2. List
      3. Built-in Sequence Functions
      4. Dict
      5. Set
      6. List, Set, and Dict Comprehensions
    4. Functions
      1. Namespaces, Scope, and Local Functions
      2. Returning Multiple Values
      3. Functions Are Objects
      4. Anonymous (lambda) Functions
      5. Closures: Functions that Return Functions
      6. Extended Call Syntax with *args, **kwargs
      7. Currying: Partial Argument Application
      8. Generators
    5. Files and the operating system
  17. Index
  18. About the Author
  19. Colophon
  20. Copyright
Chapter 7. Data Wrangling: Clean, Transform, Merge, Reshape

Much of the programming work in data analysis and modeling is spent on data preparation: loading, cleaning, transforming, and rearranging. Sometimes the way that data is stored in files or databases is not the way you need it for a data processing application. Many people choose to do ad hoc processing of data from one form to another using a general-purpose programming language, like Python, Perl, R, or Java, or UNIX text-processing tools like sed or awk. Fortunately, pandas, along with the Python standard library, provides a high-level, flexible, and high-performance set of core manipulations and algorithms that let you wrangle data into the right form without much trouble.

If you identify a type of data manipulation that isn’t anywhere in this book or elsewhere in the pandas library, feel free to suggest it on the mailing list or GitHub site. Indeed, much of the design and implementation of pandas has been driven by the needs of real-world applications.

Combining and Merging Data Sets

Data contained in pandas objects can be combined in a number of built-in ways:

  • pandas.merge connects rows in DataFrames based on one or more keys. This will be familiar to users of SQL or other relational databases, as it implements database join operations.

  • pandas.concat glues or stacks together objects along an axis.

  • The combine_first instance method enables splicing together overlapping data to fill in missing values in one object with ...
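As a rough illustration of the three operations listed above, here is a minimal sketch. The data and the variable names (left, right, s1, s2, primary, backup) are made up for this example and do not come from the text; the calls themselves (pandas.merge, pandas.concat, and the combine_first method) are standard pandas functions, used here with mostly default arguments.

```python
import numpy as np
import pandas as pd

# Hypothetical example data; none of these names come from the text.
left = pd.DataFrame({"key": ["a", "b", "c"], "lval": [1, 2, 3]})
right = pd.DataFrame({"key": ["a", "b", "d"], "rval": [10, 20, 40]})

# pandas.merge: database-style join on the shared "key" column.
# The default is an inner join, so only keys found in both frames survive.
merged = pd.merge(left, right, on="key")

# pandas.concat: stack objects along an axis (rows by default).
s1 = pd.Series([0, 1], index=["a", "b"])
s2 = pd.Series([2, 3], index=["c", "d"])
stacked = pd.concat([s1, s2])

# combine_first: fill missing values in one object with values from
# another, aligning the two on their index labels.
primary = pd.Series([1.0, np.nan, 3.0], index=["x", "y", "z"])
backup = pd.Series([10.0, 20.0, 30.0, 40.0], index=["x", "y", "z", "w"])
patched = primary.combine_first(backup)

print(merged)
print(stacked)
print(patched)
```

In this sketch, merged keeps only the rows for keys "a" and "b", stacked holds the four values of s1 and s2 end to end, and patched takes its "y" value from backup because primary is missing there. It also gains the extra label "w", which exists only in backup.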