You are previewing Python for Data Analysis.

Python for Data Analysis

Cover of Python for Data Analysis by Wes McKinney Published by O'Reilly Media, Inc.
  1. Python for Data Analysis
  2. SPECIAL OFFER: Upgrade this ebook with O’Reilly
  3. Preface
    1. Conventions Used in This Book
    2. Using Code Examples
    3. Safari® Books Online
    4. How to Contact Us
  4. 1. Preliminaries
    1. What Is This Book About?
    2. Why Python for Data Analysis?
      1. Python as Glue
      2. Solving the “Two-Language” Problem
      3. Why Not Python?
    3. Essential Python Libraries
      1. NumPy
      2. pandas
      3. matplotlib
      4. IPython
      5. SciPy
    4. Installation and Setup
      1. Windows
      2. Apple OS X
      3. GNU/Linux
      4. Python 2 and Python 3
      5. Integrated Development Environments (IDEs)
    5. Community and Conferences
    6. Navigating This Book
      1. Code Examples
      2. Data for Examples
      3. Import Conventions
      4. Jargon
    7. Acknowledgements
  5. 2. Introductory Examples
    1. data from
      1. Counting Time Zones in Pure Python
      2. Counting Time Zones with pandas
    2. MovieLens 1M Data Set
      1. Measuring rating disagreement
    3. US Baby Names 1880-2010
      1. Analyzing Naming Trends
    4. Conclusions and The Path Ahead
  6. 3. IPython: An Interactive Computing and Development Environment
    1. IPython Basics
      1. Tab Completion
      2. Introspection
      3. The %run Command
      4. Executing Code from the Clipboard
      5. Keyboard Shortcuts
      6. Exceptions and Tracebacks
      7. Magic Commands
      8. Qt-based Rich GUI Console
      9. Matplotlib Integration and Pylab Mode
    2. Using the Command History
      1. Searching and Reusing the Command History
      2. Input and Output Variables
      3. Logging the Input and Output
    3. Interacting with the Operating System
      1. Shell Commands and Aliases
      2. Directory Bookmark System
    4. Software Development Tools
      1. Interactive Debugger
      2. Timing Code: %time and %timeit
      3. Basic Profiling: %prun and %run -p
      4. Profiling a Function Line-by-Line
    5. IPython HTML Notebook
    6. Tips for Productive Code Development Using IPython
      1. Reloading Module Dependencies
      2. Code Design Tips
    7. Advanced IPython Features
      1. Making Your Own Classes IPython-friendly
      2. Profiles and Configuration
    8. Credits
  7. 4. NumPy Basics: Arrays and Vectorized Computation
    1. The NumPy ndarray: A Multidimensional Array Object
      1. Creating ndarrays
      2. Data Types for ndarrays
      3. Operations between Arrays and Scalars
      4. Basic Indexing and Slicing
      5. Boolean Indexing
      6. Fancy Indexing
      7. Transposing Arrays and Swapping Axes
    2. Universal Functions: Fast Element-wise Array Functions
    3. Data Processing Using Arrays
      1. Expressing Conditional Logic as Array Operations
      2. Mathematical and Statistical Methods
      3. Methods for Boolean Arrays
      4. Sorting
      5. Unique and Other Set Logic
    4. File Input and Output with Arrays
      1. Storing Arrays on Disk in Binary Format
      2. Saving and Loading Text Files
    5. Linear Algebra
    6. Random Number Generation
    7. Example: Random Walks
      1. Simulating Many Random Walks at Once
  8. 5. Getting Started with pandas
    1. Introduction to pandas Data Structures
      1. Series
      2. DataFrame
      3. Index Objects
    2. Essential Functionality
      1. Reindexing
      2. Dropping entries from an axis
      3. Indexing, selection, and filtering
      4. Arithmetic and data alignment
      5. Function application and mapping
      6. Sorting and ranking
      7. Axis indexes with duplicate values
    3. Summarizing and Computing Descriptive Statistics
      1. Correlation and Covariance
      2. Unique Values, Value Counts, and Membership
    4. Handling Missing Data
      1. Filtering Out Missing Data
      2. Filling in Missing Data
    5. Hierarchical Indexing
      1. Reordering and Sorting Levels
      2. Summary Statistics by Level
      3. Using a DataFrame’s Columns
    6. Other pandas Topics
      1. Integer Indexing
      2. Panel Data
  9. 6. Data Loading, Storage, and File Formats
    1. Reading and Writing Data in Text Format
      1. Reading Text Files in Pieces
      2. Writing Data Out to Text Format
      3. Manually Working with Delimited Formats
      4. JSON Data
      5. XML and HTML: Web Scraping
    2. Binary Data Formats
      1. Using HDF5 Format
      2. Reading Microsoft Excel Files
    3. Interacting with HTML and Web APIs
    4. Interacting with Databases
      1. Storing and Loading Data in MongoDB
  10. 7. Data Wrangling: Clean, Transform, Merge, Reshape
    1. Combining and Merging Data Sets
      1. Database-style DataFrame Merges
      2. Merging on Index
      3. Concatenating Along an Axis
      4. Combining Data with Overlap
    2. Reshaping and Pivoting
      1. Reshaping with Hierarchical Indexing
      2. Pivoting “long” to “wide” Format
    3. Data Transformation
      1. Removing Duplicates
      2. Transforming Data Using a Function or Mapping
      3. Replacing Values
      4. Renaming Axis Indexes
      5. Discretization and Binning
      6. Detecting and Filtering Outliers
      7. Permutation and Random Sampling
      8. Computing Indicator/Dummy Variables
    4. String Manipulation
      1. String Object Methods
      2. Regular expressions
      3. Vectorized string functions in pandas
    5. Example: USDA Food Database
  11. 8. Plotting and Visualization
    1. A Brief matplotlib API Primer
      1. Figures and Subplots
      2. Colors, Markers, and Line Styles
      3. Ticks, Labels, and Legends
      4. Annotations and Drawing on a Subplot
      5. Saving Plots to File
      6. matplotlib Configuration
    2. Plotting Functions in pandas
      1. Line Plots
      2. Bar Plots
      3. Histograms and Density Plots
      4. Scatter Plots
    3. Plotting Maps: Visualizing Haiti Earthquake Crisis Data
    4. Python Visualization Tool Ecosystem
      1. Chaco
      2. mayavi
      3. Other Packages
      4. The Future of Visualization Tools?
  12. 9. Data Aggregation and Group Operations
    1. GroupBy Mechanics
      1. Iterating Over Groups
      2. Selecting a Column or Subset of Columns
      3. Grouping with Dicts and Series
      4. Grouping with Functions
      5. Grouping by Index Levels
    2. Data Aggregation
      1. Column-wise and Multiple Function Application
      2. Returning Aggregated Data in “unindexed” Form
    3. Group-wise Operations and Transformations
      1. Apply: General split-apply-combine
      2. Quantile and Bucket Analysis
      3. Example: Filling Missing Values with Group-specific Values
      4. Example: Random Sampling and Permutation
      5. Example: Group Weighted Average and Correlation
      6. Example: Group-wise Linear Regression
    4. Pivot Tables and Cross-Tabulation
      1. Cross-Tabulations: Crosstab
    5. Example: 2012 Federal Election Commission Database
      1. Donation Statistics by Occupation and Employer
      2. Bucketing Donation Amounts
      3. Donation Statistics by State
  13. 10. Time Series
    1. Date and Time Data Types and Tools
      1. Converting between string and datetime
    2. Time Series Basics
      1. Indexing, Selection, Subsetting
      2. Time Series with Duplicate Indices
    3. Date Ranges, Frequencies, and Shifting
      1. Generating Date Ranges
      2. Frequencies and Date Offsets
      3. Shifting (Leading and Lagging) Data
    4. Time Zone Handling
      1. Localization and Conversion
      2. Operations with Time Zone−aware Timestamp Objects
      3. Operations between Different Time Zones
    5. Periods and Period Arithmetic
      1. Period Frequency Conversion
      2. Quarterly Period Frequencies
      3. Converting Timestamps to Periods (and Back)
      4. Creating a PeriodIndex from Arrays
    6. Resampling and Frequency Conversion
      1. Downsampling
      2. Upsampling and Interpolation
      3. Resampling with Periods
    7. Time Series Plotting
    8. Moving Window Functions
      1. Exponentially-weighted functions
      2. Binary Moving Window Functions
      3. User-Defined Moving Window Functions
    9. Performance and Memory Usage Notes
  14. 11. Financial and Economic Data Applications
    1. Data Munging Topics
      1. Time Series and Cross-Section Alignment
      2. Operations with Time Series of Different Frequencies
      3. Time of Day and “as of” Data Selection
      4. Splicing Together Data Sources
      5. Return Indexes and Cumulative Returns
    2. Group Transforms and Analysis
      1. Group Factor Exposures
      2. Decile and Quartile Analysis
    3. More Example Applications
      1. Signal Frontier Analysis
      2. Future Contract Rolling
      3. Rolling Correlation and Linear Regression
  15. 12. Advanced NumPy
    1. ndarray Object Internals
      1. NumPy dtype Hierarchy
    2. Advanced Array Manipulation
      1. Reshaping Arrays
      2. C versus Fortran Order
      3. Concatenating and Splitting Arrays
      4. Repeating Elements: Tile and Repeat
      5. Fancy Indexing Equivalents: Take and Put
    3. Broadcasting
      1. Broadcasting Over Other Axes
      2. Setting Array Values by Broadcasting
    4. Advanced ufunc Usage
      1. ufunc Instance Methods
      2. Custom ufuncs
    5. Structured and Record Arrays
      1. Nested dtypes and Multidimensional Fields
      2. Why Use Structured Arrays?
      3. Structured Array Manipulations: numpy.lib.recfunctions
    6. More About Sorting
      1. Indirect Sorts: argsort and lexsort
      2. Alternate Sort Algorithms
      3. numpy.searchsorted: Finding elements in a Sorted Array
    7. NumPy Matrix Class
    8. Advanced Array Input and Output
      1. Memory-mapped Files
      2. HDF5 and Other Array Storage Options
    9. Performance Tips
      1. The Importance of Contiguous Memory
      2. Other Speed Options: Cython, f2py, C
  16. A. Python Language Essentials
    1. The Python Interpreter
    2. The Basics
      1. Language Semantics
      2. Scalar Types
      3. Control Flow
    3. Data Structures and Sequences
      1. Tuple
      2. List
      3. Built-in Sequence Functions
      4. Dict
      5. Set
      6. List, Set, and Dict Comprehensions
    4. Functions
      1. Namespaces, Scope, and Local Functions
      2. Returning Multiple Values
      3. Functions Are Objects
      4. Anonymous (lambda) Functions
      5. Closures: Functions that Return Functions
      6. Extended Call Syntax with *args, **kwargs
      7. Currying: Partial Argument Application
      8. Generators
    5. Files and the operating system
  17. Index
  18. About the Author
  19. Colophon
  20. SPECIAL OFFER: Upgrade this ebook with O’Reilly
  21. Copyright
O'Reilly logo

Chapter 3. IPython: An Interactive Computing and Development Environment

Act without doing; work without effort. Think of the small as large and the few as many. Confront the difficult while it is still easy; accomplish the great task by a series of small acts.


People often ask me, “What is your Python development environment?” My answer is almost always the same, “IPython and a text editor”. You may choose to substitute an Integrated Development Environment (IDE) for a text editor in order to take advantage of more advanced graphical tools and code completion capabilities. Even if so, I strongly recommend making IPython an important part of your workflow. Some IDEs even provide IPython integration, so it’s possible to get the best of both worlds.

The IPython project began in 2001 as Fernando Pérez’s side project to make a better interactive Python interpreter. In the subsequent 11 years it has grown into what’s widely considered one of the most important tools in the modern scientific Python computing stack. While it does not provide any computational or data analytical tools by itself, IPython is designed from the ground up to maximize your productivity in both interactive computing and software development. It encourages an execute-explore workflow instead of the typical edit-compile-run workflow of many other programming languages. It also provides very tight integration with the operating system’s shell and file system. Since much of data analysis coding involves exploration, ...

The best content for your career. Discover unlimited learning on demand for around $1/day.