Mastering Python Data Analysis

Book Description

Become an expert at using Python for advanced statistical analysis of data using real-world examples

About This Book

  • Clean, format, and explore data using graphical and numerical summaries

  • Leverage the IPython environment to efficiently analyze data with Python

  • Packed with easy-to-follow examples to develop advanced computational skills for the analysis of complex data

Who This Book Is For

If you are a competent Python developer who wants to take your data analysis skills to the next level by solving complex problems, then this advanced guide is for you. Familiarity with the basics of applying Python libraries to data sets is assumed.

What You Will Learn

  • Read, sort, and map various data into Python and Pandas

  • Recognise patterns so you can understand and explore data

  • Use statistical models to discover patterns in data

  • Review classical statistical inference using Python, Pandas, and SciPy

  • Detect similarities and differences in data with clustering

  • Clean your data to make it useful

  • Work in Jupyter Notebook to produce publication-ready figures for inclusion in reports

In Detail

Python, a multi-paradigm programming language, has become the language of choice among data scientists for data analysis, visualization, and machine learning. Have you ever wondered how to become an expert at approaching data analysis problems, solving them, and extracting all of the available information from your data? Look no further: this is the book you want!

Through this comprehensive guide, you will explore data and present the results and conclusions of statistical analysis in a meaningful way. You’ll be able to quickly and accurately perform hands-on sorting, reduction, and subsequent analysis of data, and fully appreciate how data analysis methods can support business decision-making.

You’ll start off by learning about the tools available for data analysis in Python and will then explore the statistical models that are used to identify patterns in data. Gradually, you’ll move on to review statistical inference using Python, Pandas, and SciPy. After that, you’ll focus on performing regression using computational tools and come to understand the problem of identifying clusters in data algorithmically. Finally, you’ll delve into advanced techniques to quantify cause and effect using Bayesian methods and discover how to use Python’s tools for supervised machine learning.

Style and Approach

This book takes a step-by-step approach to reading, processing, and analyzing data in Python using various methods and tools. Rich in examples, each topic connects to real-world cases and retrieves data directly online where possible. The book gives you the knowledge and tools to explore any data on your own, and encourages the curiosity that befits a data scientist.

Downloading the example code for this book: you can download the example code files for all Packt books you have purchased from your Packt account. If you purchased this book elsewhere, you can register with the publisher to obtain the code files.

Table of Contents

    1. Mastering Python Data Analysis
      1. Mastering Python Data Analysis
      2. Credits
      3. About the Authors
      4. About the Reviewer
        1. Why subscribe?
        2. Free access for Packt account holders
      6. Preface
        1. What this book covers
        2. What you need for this book
        3. Who this book is for
        4. Conventions
        5. Reader feedback
        6. Customer support
          1. Downloading the example code
          2. Downloading the color images of this book
          3. Errata
          4. Piracy
          5. Questions
      7. Chapter 1: Tools of the Trade
        1. Before you start
        2. Using the notebook interface
        3. Imports
        4. An example using the Pandas library
        5. Summary
      8. Chapter 2: Exploring Data
        1. The General Social Survey
          1. Obtaining the data
          2. Reading the data
        2. Univariate data
          1. Histograms
            1. Making things pretty
            2. Characterization
          2. Concept of statistical inference
          3. Numeric summaries and boxplots
        3. Relationships between variables – scatterplots
        4. Summary
      9. Chapter 3: Learning About Models
        1. Models and experiments
        2. The cumulative distribution function
        3. Working with distributions
        4. The probability density function
        5. Where do models come from?
        6. Multivariate distributions
        7. Summary
      10. Chapter 4: Regression
        1. Introducing linear regression
          1. Getting the dataset
          2. Testing with linear regression
        2. Multivariate regression
          1. Adding economic indicators
          2. Taking a step back
        3. Logistic regression
          1. Some notes
        4. Summary
      11. Chapter 5: Clustering
        1. Introduction to cluster finding
          1. Starting out simple – John Snow on cholera
        2. K-means clustering
          1. Suicide rate versus GDP versus absolute latitude
        3. Hierarchical clustering analysis
          1. Reading in and reducing the data
          2. Hierarchical cluster algorithm
        4. Summary
      12. Chapter 6: Bayesian Methods
        1. The Bayesian method
          1. Credible versus confidence intervals
          2. Bayes formula
          3. Python packages
        2. U.S. air travel safety record
          1. Getting the NTSB database
          2. Binning the data
          3. Bayesian analysis of the data
            1. Binning by month
          4. Plotting coordinates
            1. Cartopy
            2. Mpl toolkits – basemap
        3. Climate change – CO2 in the atmosphere
          1. Getting the data
          2. Creating and sampling the model
        4. Summary
      13. Chapter 7: Supervised and Unsupervised Learning
        1. Introduction to machine learning
        2. Scikit-learn
        3. Linear regression
          1. Climate data
          2. Checking with Bayesian analysis and OLS
        4. Clustering
        5. Seeds classification
          1. Visualizing the data
          2. Feature selection
          3. Classifying the data
            1. The SVC linear kernel
            2. The SVC Radial Basis Function
            3. The SVC polynomial
            4. K-Nearest Neighbour
            5. Random Forest
          4. Choosing your classifier
        6. Summary
      14. Chapter 8: Time Series Analysis
        1. Introduction
        2. Pandas and time series data
        3. Indexing and slicing
        4. Resampling, smoothing, and other estimates
        5. Stationarity
        6. Patterns and components
          1. Decomposing components
          2. Differencing
        7. Time series models
          1. Autoregressive – AR
          2. Moving average – MA
          3. Selecting p and q
            1. Automatic function
            2. The (Partial) AutoCorrelation Function
          4. Autoregressive Integrated Moving Average – ARIMA
        8. Summary
      15. Appendix A: More on Jupyter Notebook and matplotlib Styles
        1. Jupyter Notebook
          1. Useful keyboard shortcuts
            1. Command mode shortcuts
            2. Edit mode shortcuts
          2. Markdown cells
          3. Notebook Python extensions
            1. Installing the extensions
            2. Codefolding
            3. Collapsible headings
            4. Help panel
            5. Initialization cells
            6. NbExtensions menu item
            7. Ruler
            8. Skip-traceback
            9. Table of contents
          4. Other Jupyter Notebook tips
            1. External connections
            2. Export
            3. Additional file types
        2. Matplotlib styles
        3. Useful resources
          1. General resources
          2. Packages
          3. Data repositories
          4. Visualization of data
        4. Summary