Mastering pandas

Book Description

Master the features and capabilities of pandas, a data analysis toolkit for Python

In Detail

Python is a groundbreaking language known for its simplicity and succinctness: it lets the user achieve a great deal in a few lines of code, especially compared to other programming languages. The pandas library brings these qualities into the data analysis realm by providing expressiveness, simplicity, and powerful capabilities for the task. By mastering pandas, users can perform complex data analysis in a short period of time and illustrate their findings using the rich visualization capabilities of related tools such as IPython and matplotlib.

This book is an in-depth guide to the use of pandas for data analysis, for both the seasoned data analysis practitioner and the novice user. It provides a basic introduction to the pandas framework and takes you through the installation of the library and the IPython interactive environment. Thereafter, you will learn basic as well as advanced features, such as MultiIndexing, modifying data structures, and sampling data, which provide powerful capabilities for data analysis.

What You Will Learn

  • Download, install, and set up Python, pandas, and related tools to perform data analysis for different operating environments
  • Practice using IPython as an interactive environment for doing data analysis using pandas
  • Master the core features of pandas used in data analysis
  • Get to grips with the more advanced features of pandas
  • Understand the basics of using matplotlib to plot data analysis results
  • Analyze real-world datasets using pandas
  • Acquire knowledge of using pandas for basic statistical analysis

Downloading the example code for this book

You can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to have the files e-mailed directly to you.

Table of Contents

  1. Mastering pandas
    1. Table of Contents
    2. Mastering pandas
    3. Credits
    4. About the Author
    5. About the Reviewers
    6. www.PacktPub.com
      1. Support files, eBooks, discount offers, and more
        1. Why subscribe?
        2. Free access for Packt account holders
    7. Preface
      1. What this book covers
      2. What you need for this book
      3. Who this book is for
      4. Conventions
      5. Reader feedback
      6. Customer support
        1. Downloading the example code
        2. Errata
        3. Piracy
        4. Questions
    8. 1. Introduction to pandas and Data Analysis
      1. Motivation for data analysis
        1. We live in a big data world
        2. 4 V's of big data
          1. Volume of big data
          2. Velocity of big data
          3. Variety of big data
          4. Veracity of big data
        3. So much data, so little time for analysis
        4. The move towards real-time analytics
      2. How Python and pandas fit into the data analytics mix
      3. What is pandas?
      4. Benefits of using pandas
      5. Summary
    9. 2. Installation of pandas and the Supporting Software
      1. Selecting a version of Python to use
      2. Python installation
        1. Linux
          1. Installing Python from compressed tarball
        2. Windows
          1. Core Python installation
          2. Third-party Python software installation
        3. Mac OS X
          1. Installation using a package manager
      3. Installation of Python and pandas from a third-party vendor
      4. Continuum Analytics Anaconda
        1. Installing Anaconda
          1. Linux
          2. Mac OS X
          3. Windows
          4. Final step for all platforms
      5. Other numeric or analytics-focused Python distributions
      6. Downloading and installing pandas
        1. Linux
          1. Ubuntu/Debian
          2. Red Hat
          3. Ubuntu/Debian
          4. Fedora
          5. OpenSuse
        2. Mac
          1. Source installation
          2. Binary installation
        3. Windows
          1. Binary Installation
          2. Source installation
          3. IPython
          4. IPython Notebook
      7. IPython installation
        1. Linux
        2. Windows
        3. Mac OS X
        4. Install via Anaconda (for Linux/Mac OS X)
        5. Wakari by Continuum Analytics
        6. Virtualenv
          1. Virtualenv installation and usage
      8. Summary
    10. 3. The pandas Data Structures
      1. NumPy ndarrays
        1. NumPy array creation
          1. NumPy arrays via numpy.array
          2. NumPy array via numpy.arange
          3. NumPy array via numpy.linspace
          4. NumPy array via various other functions
            1. numpy.ones
            2. numpy.zeros
            3. numpy.eye
            4. numpy.diag
            5. numpy.random.rand
            6. numpy.empty
            7. numpy.tile
        2. NumPy datatypes
        3. NumPy indexing and slicing
          1. Array slicing
          2. Array masking
          3. Complex indexing
        4. Copies and views
        5. Operations
          1. Basic operations
          2. Reduction operations
          3. Statistical operators
          4. Logical operators
        6. Broadcasting
        7. Array shape manipulation
          1. Flattening a multi-dimensional array
          2. Reshaping
          3. Resizing
          4. Adding a dimension
        8. Array sorting
      2. Data structures in pandas
        1. Series
          1. Series creation
            1. Using numpy.ndarray
            2. Using Python dictionary
            3. Using scalar values
          2. Operations on Series
            1. Assignment
            2. Slicing
            3. Other operations
        2. DataFrame
          1. DataFrame Creation
            1. Using dictionaries of Series
            2. Using a dictionary of ndarrays/lists
            3. Using a structured array
            4. Using a Series structure
          2. Operations
            1. Selection
            2. Assignment
            3. Deletion
            4. Alignment
            5. Other mathematical operations
        3. Panel
          1. Using 3D NumPy array with axis labels
          2. Using a Python dictionary of DataFrame objects
          3. Using the DataFrame.to_panel method
          4. Other operations
      3. Summary
    11. 4. Operations in pandas, Part I – Indexing and Selecting
      1. Basic indexing
        1. Accessing attributes using dot operator
        2. Range slicing
      2. Label, integer, and mixed indexing
        1. Label-oriented indexing
          1. Selection using a Boolean array
        2. Integer-oriented indexing
        3. The .iat and .at operators
        4. Mixed indexing with the .ix operator
        5. MultiIndexing
        6. Swapping and reordering levels
        7. Cross sections
      3. Boolean indexing
        1. The isin and any/all methods
        2. Using the where() method
        3. Operations on indexes
      4. Summary
    12. 5. Operations in pandas, Part II – Grouping, Merging, and Reshaping of Data
      1. Grouping of data
        1. The groupby operation
          1. Using groupby with a MultiIndex
          2. Using the aggregate method
          3. Applying multiple functions
          4. The transform() method
          5. Filtering
      2. Merging and joining
        1. The concat function
        2. Using append
        3. Appending a single row to a DataFrame
        4. SQL-like merging/joining of DataFrame objects
          1. The join function
      3. Pivots and reshaping data
        1. Stacking and unstacking
          1. The stack() function
        2. Other methods to reshape DataFrames
          1. Using the melt function
            1. The pandas.get_dummies() function
      4. Summary
    13. 6. Missing Data, Time Series, and Plotting Using Matplotlib
      1. Handling missing data
        1. Handling missing values
      2. Handling time series
        1. Reading in time series data
          1. DateOffset and TimeDelta objects
        2. Time series-related instance methods
          1. Shifting/lagging
          2. Frequency conversion
          3. Resampling of data
          4. Aliases for Time Series frequencies
        3. Time series concepts and datatypes
          1. Period and PeriodIndex
            1. PeriodIndex
          2. Conversions between Time Series datatypes
      3. A summary of Time Series-related objects
        1. Plotting using matplotlib
      4. Summary
    14. 7. A Tour of Statistics – The Classical Approach
      1. Descriptive statistics versus inferential statistics
      2. Measures of central tendency and variability
        1. Measures of central tendency
          1. The mean
          2. The median
          3. The mode
          4. Computing measures of central tendency of a dataset in Python
        2. Measures of variability, dispersion, or spread
          1. Range
          2. Quartile
          3. Deviation and variance
      3. Hypothesis testing – the null and alternative hypotheses
        1. The null and alternative hypotheses
          1. The alpha and p-values
          2. Type I and Type II errors
        2. Statistical hypothesis tests
          1. Background
          2. The z-test
          3. The t-test
            1. Types of t-tests
          4. A t-test example
        3. Confidence intervals
          1. An illustrative example
        4. Correlation and linear regression
          1. Correlation
          2. Linear regression
          3. An illustrative example
      4. Summary
    15. 8. A Brief Tour of Bayesian Statistics
      1. Introduction to Bayesian statistics
      2. Mathematical framework for Bayesian statistics
        1. Bayes theory and odds
        2. Applications of Bayesian statistics
      3. Probability distributions
        1. Fitting a distribution
          1. Discrete probability distributions
          2. Discrete uniform distributions
            1. The Bernoulli distribution
            2. The binomial distribution
            3. The Poisson distribution
            4. The Geometric distribution
            5. The negative binomial distribution
          3. Continuous probability distributions
            1. The continuous uniform distribution
            2. The exponential distribution
            3. The normal distribution
      4. Bayesian statistics versus Frequentist statistics
        1. What is probability?
        2. How the model is defined
        3. Confidence (Frequentist) versus Credible (Bayesian) intervals
      5. Conducting Bayesian statistical analysis
      6. Monte Carlo estimation of the likelihood function and PyMC
        1. Bayesian analysis example – Switchpoint detection
      7. References
      8. Summary
    16. 9. The pandas Library Architecture
      1. Introduction to pandas' file hierarchy
      2. Description of pandas' modules and files
        1. pandas/core
        2. pandas/io
        3. pandas/tools
        4. pandas/sparse
        5. pandas/stats
        6. pandas/util
        7. pandas/rpy
        8. pandas/tests
        9. pandas/compat
        10. pandas/computation
        11. pandas/tseries
        12. pandas/sandbox
      3. Improving performance using Python extensions
      4. Summary
    17. 10. R and pandas Compared
      1. R data types
        1. R lists
        2. R DataFrames
      2. Slicing and selection
        1. R-matrix and NumPy array compared
        2. R lists and pandas series compared
          1. Specifying column name in R
          2. Specifying column name in pandas
        3. R's DataFrames versus pandas' DataFrames
          1. Multicolumn selection in R
          2. Multicolumn selection in pandas
      3. Arithmetic operations on columns
      4. Aggregation and GroupBy
        1. Aggregation in R
        2. The pandas' GroupBy operator
      5. Comparing matching operators in R and pandas
        1. R %in% operator
        2. The pandas isin() function
      6. Logical subsetting
        1. Logical subsetting in R
        2. Logical subsetting in pandas
      7. Split-apply-combine
        1. Implementation in R
        2. Implementation in pandas
      8. Reshaping using melt
        1. The R melt() function
        2. The pandas melt() function
      9. Factors/categorical data
        1. An R example using cut()
        2. The pandas solution
      10. Summary
    18. 11. Brief Tour of Machine Learning
      1. Role of pandas in machine learning
      2. Installation of scikit-learn
        1. Installing via Anaconda
        2. Installing on Unix (Linux/Mac OS X)
        3. Installing on Windows
      3. Introduction to machine learning
        1. Supervised versus unsupervised learning
        2. Illustration using document classification
          1. Supervised learning
          2. Unsupervised learning
        3. How machine learning systems learn
      4. Application of machine learning – Kaggle Titanic competition
        1. The Titanic: Machine Learning from Disaster problem
        2. The problem of overfitting
      5. Data analysis and preprocessing using pandas
        1. Examining the data
        2. Handling missing values
      6. A naïve approach to the Titanic problem
      7. The scikit-learn ML/classifier interface
      8. Supervised learning algorithms
        1. Constructing a model using Patsy for scikit-learn
        2. General boilerplate code explanation
        3. Logistic regression
        4. Support vector machine
        5. Decision trees
        6. Random forest
      9. Unsupervised learning algorithms
        1. Dimensionality reduction
        2. K-means clustering
      10. Summary
    19. Index