Python Data Science Handbook

Book Description

The Python Data Science Handbook provides a reference to the breadth of computational and statistical methods that are central to data-intensive science, research, and discovery. People with a programming background who want to use Python effectively for data science tasks will learn how to tackle a variety of problems, such as: How can I read this data format into my script? How can I manipulate, transform, and clean this data? How can I visualize this type of data? How can I use this data to gain insight, answer questions, or build statistical or machine learning models?

This book is a reference for day-to-day Python-enabled data science, covering both the computational and statistical skills necessary to work effectively with data. The discussion is augmented with frequent example applications, showing how the breadth of open source Python tools can be used together to analyze, manipulate, visualize, and learn from data.

Table of Contents

  1. Preface
    1. What Is Data Science?
    2. Who Is This Book For?
    3. What to Expect from This Book
    4. Why Python?
      1. Python 2 vs Python 3
    5. Other Miscellany
    6. Setting Up Your Computer
      1. Installing from Source
      2. Using the pip: the Python Package Index
      3. Using System Distributions
      4. Third-party Distributions
      5. My Recommendation: Anaconda & conda
  2. 1. A Whirlwind Tour of the Python Language
    1. Python is Glue
    2. The Zen of Python
    3. How to Run Python Code
      1. The Python Interpreter
    4. A Quick Tour of Python Language Syntax
      1. Comments are marked by #
      2. End-of-line Terminates a Statement
      3. Semicolon can Optionally Terminate a Statement
      4. Indentation: Whitespace Matters!
      5. Whitespace Within Lines Does Not Matter
      6. Parentheses are for Grouping or Calling
      7. Finishing Up and Learning More
      8. Sidebar: Note on the print() Function
    5. Basic Python Semantics: Variables and Objects
      1. Python Variables are Pointers
      2. Everything is an Object
    6. Basic Python Semantics: Operators
      1. Arithmetic Operations
      2. Bitwise Operations
      3. Assignment Operations
      4. Comparison Operations
      5. Boolean Operations
      6. Identity and Membership Operators
      7. Summary
    7. Built-in Types: Simple Values
      1. Numeric Types
      2. String Type
      3. None Type
      4. Boolean Type
    8. Built-in Data Structures
      1. Lists
      2. Tuples
      3. Dictionaries
      4. Sets
      5. More Specialized Data Structures
    9. Control Flow
      1. Conditional Statements: if-elif-else:
      2. for loops
      3. while loops
      4. break and continue: Fine Tuning Your Loops
      5. Loops With an else Block
    10. Defining and Using Functions
      1. Using Functions
      2. Defining Functions
      3. Default Argument Values
      4. *args and **kwargs: Flexible Arguments
      5. Anonymous (lambda) Functions
    11. Errors and Exceptions
      1. Runtime Errors
      2. Catching Exceptions: try and except
      3. Raising Exceptions: raise
      4. Advanced Topics
      5. try...except...else...finally
    12. Iterators
      1. Iterating over lists
      2. range(): A List is Not Always a List
      3. Useful Iterators
      4. Advanced Iterators: itertools
    13. List Comprehensions
      1. Basic List Comprehensions
      2. Multiple Iteration
      3. Conditionals on the Iterator
      4. Conditionals on the Value
      5. Other Types of Comprehensions
      6. Dict Comprehension
      7. Generator Expressions
    14. Generators
      1. List Comprehensions vs Generator Expressions
      2. Generator Functions: yield
      3. Example: Prime Number Generator
    15. Modules and Packages
      1. Loading Modules: the import Statement
      2. Python’s Standard Library
      3. Third-party modules
    16. String Manipulation and Regular Expressions
      1. Simple String Manipulation in Python
      2. Format Strings
      3. Flexible Pattern Matching with Regular Expressions
    17. Further Python Resources
      1. More Advanced Python Language Features
      2. More Built-in Modules
      3. More Third-Party Modules
  3. 2. IPython: Beyond Normal Python
    1. Shell or Notebook?
      1. Launching the IPython Shell
      2. Launching the IPython Notebook
    2. Help and Documentation in IPython
      1. Accessing Documentation with "?"
      2. Accessing Source Code with "??"
      3. Exploring Modules with Tab-Completion
    3. Keyboard Shortcuts in the IPython Shell
      1. Navigation shortcuts
      2. Text Entry Shortcuts
      3. Command History Shortcuts
      4. Miscellaneous Shortcuts
    4. IPython Magic Commands
      1. Pasting Code Blocks: %paste and %cpaste
      2. Running External Code: %run
      3. Timing Code Execution: %timeit
      4. Help on Magic Functions: ?, %magic, and %lsmagic
    5. Input and Output History
      1. IPython’s In and Out Objects
      2. Underscore Shortcuts and Previous Outputs
      3. Suppressing Output
      4. Related Magic Commands
    6. IPython and Shell Commands
      1. Quick Introduction to the Shell
      2. Shell Commands in IPython
      3. Passing Values To and From the Shell
      4. Shell-related Magic Commands
    7. Errors and Debugging
      1. Controlling Exceptions: %xmode
      2. Debugging: When Reading Tracebacks is Not Enough
    8. Profiling and Timing Code
      1. Timing Code Snippets: %timeit and %time
      2. Profiling Full Scripts: %prun
      3. Line-by-line Profiling with %lprun
      4. Profiling Memory Use: %memit and %mprun
    9. More IPython Resources
      1. Web Resources
      2. Books
  4. 3. Introduction to NumPy
    1. Reminder about Built-in Documentation
    2. Understanding Data Types in Python
      1. A Python Integer is More than just an Integer
      2. A Python List is More than just a List
      3. Fixed-type arrays in Python
      4. Creating Arrays from Python Lists
      5. Creating arrays from scratch
      6. NumPy Standard Data Types
    3. The Basics of NumPy Arrays
      1. NumPy Array Attributes
      2. Array Indexing: Accessing Single Elements
      3. Array Slicing: Accessing Subarrays
      4. Reshaping of Arrays
      5. Array Concatenation and Splitting
      6. Summary
    4. Random Number Generation
      1. Understanding a Simple “Random” Sequence
      2. Built-in tools: Python’s random module
      3. Efficient Random Numbers: numpy.random
      4. Simultaneously Using Multiple Chains
      5. Random Numbers: Further Resources
    5. Computation on NumPy Arrays: Universal Functions
      1. The Slowness of Loops
      2. Introducing UFuncs
      3. Exploring NumPy’s UFuncs
      4. Advanced Ufunc Features
      5. Finding More
    6. Aggregations: Min, Max, and Everything In Between
      1. Examples of NumPy Aggregates
      2. Example: How Tall is the Average US President?
    7. Computation on Arrays: Broadcasting
      1. Introducing Broadcasting
      2. Rules of Broadcasting
      3. Broadcasting in Practice
      4. Utility Routines for Broadcasting
    8. Comparisons, Masks, and Boolean Logic
      1. Example: Counting Rainy Days
      2. Comparison Operators as ufuncs
      3. Working with Boolean Arrays
      4. Returning to Seattle’s Rain
      5. Boolean Arrays as Masks
      6. Sidebar: "&" vs. "and"...
    9. Fancy Indexing
      1. Exploring Fancy Indexing
      2. Combined Indexing
      3. Generating Indices: np.where
      4. Example: Selecting Random Points
      5. Modifying values with Fancy Indexing
      6. Example: Binning data
    10. Numpy Indexing Tricks
      1. np.mgrid: Convenient Multi-dimensional Mesh Grids
      2. np.ogrid: Convenient Open Grids
      3. np.ix_: Open Index Grids
      4. np.r_: concatenation along rows
      5. np.c_: concatenation along columns
      6. Why Index Tricks?
    11. Sorting Arrays
      1. Sidebar: Big-O Notation
      2. Fast Sorts in Python
      3. Fast Sorts in NumPy: np.sort and np.argsort
      4. Partial Sorts: Partitioning
      5. Example: K Nearest Neighbors
    12. Searching and Counting Values In Arrays
      1. Python Standard Library Tools
      2. Searching for Values in NumPy Arrays
      3. Counting and Binning
    13. Structured Data: NumPy’s Structured Arrays
      1. Creating Structured Arrays
      2. More Advanced Compound Types
      3. RecordArrays: Structured Arrays with a Twist
      4. On to Pandas
  5. 4. Introduction to Pandas
    1. Installing and Using Pandas
    2. Reminder about Built-in Documentation
    3. Introducing Pandas Objects
      1. Pandas Series
      2. Pandas DataFrame
      3. Pandas Index
      4. Looking Forward
    4. Data Indexing and Selection
      1. Data Selection in Series
      2. Data Selection in DataFrame
    5. Operations in Pandas
      1. Ufuncs: Index Preservation
      2. UFuncs: Index Alignment
      3. Ufuncs: Operations between DataFrame and Series
      4. Summary
    6. Handling Missing Data
      1. Tradeoffs in Missing Data Conventions
      2. Missing Data in Pandas
      3. Operating on Null Values
      4. Summary
    7. Hierarchical Indexing
      1. A Multiply-Indexed Series
      2. Aside: Panel Data
      3. Methods of MultiIndex Creation
      4. Indexing and Slicing a MultiIndex
      5. Rearranging Multi-Indices
      6. Data Aggregations on Multi-Indices
      7. Summary
    8. Combining Datasets: Concat & Append
      1. Recall: Concatenation of NumPy Arrays
      2. Simple Concatenation with pd.concat
    9. Combining Datasets: Merge and Join
      1. Relational Algebra
      2. Categories of Joins
      3. Specification of the Merge Key
      4. Specifying Set Arithmetic for Joins
      5. Overlapping Column Names: The suffixes Keyword
      6. Example: US States Data
    10. Aggregation and Grouping
      1. Planets Data
      2. Simple Aggregation in Pandas
      3. Group By: Split, Apply, Combine
    11. Pivot Tables
      1. Motivating Pivot Tables
      2. Pivot Tables By Hand
      3. Pivot Table Syntax
      4. Example: Birthrate Data
    12. Vectorized String Operations
      1. Introducing Pandas String Operations
      2. Tables of Pandas String Methods
      3. Further Information
      4. Example: Recipe Database
    13. Working with Time Series
      1. Dates and Times in Python
      2. Pandas TimeSeries: Indexing by Time
      3. Pandas TimeSeries Data Structures
      4. Frequencies and Offsets
      5. Resampling, Shifting, and Windowing
      6. Where to Learn More
      7. Example: Visualizing Seattle Bicycle Counts
    14. High-Performance Pandas: eval() and query()
      1. Motivating query() and eval(): Compound Expressions
      2. pandas.eval() for Efficient Operations
      3. DataFrame.eval() for Column-wise Operations
      4. DataFrame.query() Method
      5. Performance: When to Use these functions
      6. Learning More
    15. Further Resources
  6. 5. Introduction to Matplotlib
    1. General matplotlib tips
      1. Importing Matplotlib
      2. show() or no show()? How to Display your Plots
      3. Saving Figures to File
      4. Learning More about matplotlib
    2. Sidebar: Two Interfaces for the Price of One
      1. MatLab-style Interface
    3. Simple Line Plots
      1. Adjusting the Plot: Line Colors and Styles
      2. Adjusting the Plot: Axes limits
      3. Labeling Plots
      4. Sidebar: Gotchas
    4. Simple Scatter Plots
      1. Scatter Plots with plt.plot
      2. Scatter Plots with plt.scatter
      3. A Note on Efficiency
    5. Visualizing Errors
      1. Basic Errorbars
      2. Continuous Errors
    6. Density and Contour Plots
      1. Visualizing a 3D function
    7. Histograms and Binnings
      1. Two-dimensional Histograms and Binnings
      2. Binned Statistics
    8. Customizing Legends
      1. Choosing Elements for the Legend
      2. Faking the Legend
      3. Multiple Legends
    9. Customizing Colorbars
      1. Customizing Colorbars
      2. Example: Hand-written Digits
    10. Multiple Subplots
      1. plt.axes: subplots by-hand
      2. plt.subplot: simple grids of subplots
      3. plt.subplots: the whole grid in one go
    11. Text and Annotation
      1. The Cost of Storage over Time
      2. Arrows and Annotation
    12. Customizing Ticks
      1. Removing Ticks or Labels
      2. Reducing or Increasing the Number of Ticks
      3. Fancy Tick Formats
      4. Summary of Formatters and Locators
    13. Customizing Matplotlib: Configurations and Style Sheets
      1. Plot Customization By Hand
      2. Changing the Defaults: rcParams
      3. Stylesheets
    14. Three-dimensional Plotting in Matplotlib
      1. 3D Points and Lines
      2. 3D Contour Plots
      3. Wireframes and Surface Plots
      4. Surface Triangulations
    15. Geographic Data with Basemap
      1. Map Projections
      2. Drawing a Map Background
      3. Plotting Data On Maps
      4. Example: California Cities
      5. Example: Surface Temperature Data
    16. Visualization With Seaborn
      1. Seaborn vs. Matplotlib
      2. Exploring Seaborn Plots
      3. Example: Exploring New York City Marathon Data
  7. 6. Machine Learning
    1. What Is Machine Learning?
      1. Machine Learning vs. Statistical Modeling
      2. Categories of Machine Learning
      3. Qualitative Examples of Machine Learning Applications
      4. Summary
      5. Figure Code
    2. Introducing Scikit-Learn
      1. Data Representation in Scikit-Learn
      2. Scikit-Learn’s Estimator API
      3. Application: Exploring Hand-written Digits
    3. Hyperparameters and Model Validation
      1. Thinking About Model Validation
      2. Selecting the Best Model
      3. Learning Curves
      4. Validation in Practice: Grid Search
      5. Summary
      6. Figure Code
    4. In Depth: Naive Bayes Classification
      1. Bayesian Classification
      2. Gaussian Naive Bayes
      3. Multinomial Naive Bayes
      4. When to Use Naive Bayes
      5. Figures
    5. In Depth: Linear Regression
      1. Simple Linear Regression
      2. Basis Function Regression
      3. Regularization
      4. Example: Predicting Bicycle Traffic
      5. Figures
    6. In-Depth: Support Vector Machines
      1. Motivating Support Vector Machines
      2. Support Vector Machines: Maximizing the Margin
      3. Example: Face Recognition
      4. Support Vector Machine Summary
    7. In-Depth: Decision Trees and Random Forests
      1. Motivating Random Forests: Decision Trees
      2. Ensembles of Estimators: Random Forests
      3. Random Forest Regression
      4. Example: Random Forest for Classifying Digits
      5. Summary of Random Forests
      6. Figure Code
    8. In Depth: Principal Component Analysis
      1. Introducing Principal Component Analysis
      2. PCA as Noise Filtering
      3. Example: Eigenfaces
      4. Principal Component Analysis Summary
      5. Figures
    9. In-Depth: Manifold Learning
      1. Manifold Learning: “HELLO”
      2. Multidimensional Scaling (MDS)
      3. MDS as Manifold Learning
      4. Nonlinear Embeddings: Where MDS Fails
      5. Nonlinear Manifolds: Locally Linear Embedding
      6. Some Thoughts on Manifold Methods
      7. Example: Isomap on Faces
      8. Example: Visualizing Structure in Digits
      9. Figures
    10. In Depth: K-Means Clustering
      1. Introducing K-Means
      2. K-Means Algorithm: Expectation Maximization
    11. In Depth: Gaussian Mixture Models
      1. Motivating GMM: Weaknesses of K Means
      2. Generalizing E-M: Gaussian Mixture Models
      3. GMM as Density Estimation
      4. Example: GMM for Generating New Data
      5. Figures
      6. Covariance Type
    12. In-Depth: Kernel Density Estimation
      1. Motivating KDE: Histograms
      2. Kernel Density Estimation in Practice
      3. Example: KDE on a Sphere
      4. Example: Not-So-Naive Bayes
    13. Feature Engineering: Working with Images
      1. HOG features
      2. HOG in Action: A Simple Face Detector
      3. Caveats and Improvements
    14. Learning More
      1. Machine Learning in Python
      2. General Machine Learning