Python for Data Analysis

By Wes McKinney. Published by O'Reilly Media, Inc.

Chapter 12. Advanced NumPy

ndarray Object Internals

The NumPy ndarray provides a means to interpret a block of homogeneous data (either contiguous or strided, more on this later) as a multidimensional array object. As you’ve seen, the data type, or dtype, determines whether the data is interpreted as floating point, integer, boolean, or any of the other types we’ve been looking at.

Part of what makes ndarray powerful is that every array object is a strided view on a block of data. You might wonder, for example, how the array view arr[::2, ::-1] avoids copying any data. Simply put, the ndarray is more than just a chunk of memory and a dtype; it also has striding information that enables the array to move through memory with varying step sizes. More precisely, the ndarray internally consists of the following:

  • A pointer to data, that is, a block of system memory

  • The data type or dtype

  • A tuple indicating the array’s shape; for example, a 10 by 5 array would have shape (10, 5)

    In [8]: np.ones((10, 5)).shape
    Out[8]: (10, 5)

  • A tuple of strides, integers indicating the number of bytes to “step” in order to advance one element along a dimension; for example, a typical (C order, more on this later) 3 x 4 x 5 array of float64 (8-byte) values has strides (160, 40, 8)

    In [9]: np.ones((3, 4, 5), dtype=np.float64).strides
    Out[9]: (160, 40, 8)

    While it is rare that a typical NumPy user would be interested in the array strides, they are the critical ingredient in constructing copyless array views. Strides ...
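As a rough illustration of the points above, the following sketch recomputes the C-order strides from a shape and an item size, then checks that the view arr[::2, ::-1] shares memory with its parent instead of copying. The helper c_order_strides is a hypothetical name introduced here for illustration and is not part of NumPy; the small 3 x 4 array is likewise just an assumed example.

```python
import numpy as np


def c_order_strides(shape, itemsize):
    """Byte strides of a C-order (row-major) array with the given shape.

    Hypothetical helper for illustration; not a NumPy function.
    """
    strides = []
    step = itemsize
    # Walk the axes from last to first: the last axis advances by one item,
    # and each earlier axis steps over everything that follows it.
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))


# Compare the hand-computed strides with what NumPy reports.
expected = c_order_strides((3, 4, 5), 8)
actual = np.ones((3, 4, 5), dtype=np.float64).strides
print(expected, actual)

# A strided view: every other row, columns in reverse order.
arr = np.arange(12, dtype=np.float64).reshape(3, 4)
view = arr[::2, ::-1]
print(arr.strides)   # strides of the parent array
print(view.strides)  # doubled row stride, negative column stride

# The view reuses the parent's memory, so writing through it changes arr.
print(np.shares_memory(arr, view))
view[0, 0] = -1.0
print(arr[0, 3])
```

The negative column stride is what lets the reversed view walk backward through the same buffer, and the doubled row stride is what skips every other row.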
