Python: Advanced Predictive Analytics

Book description

Gain practical insights by exploiting data in your business to build advanced predictive modeling applications

About This Book

  • A step-by-step guide to predictive modeling including lots of tips, tricks, and best practices
  • Learn how to use popular predictive modeling algorithms such as Linear Regression, Decision Trees, Logistic Regression, and Clustering
  • Master open source Python tools to build sophisticated predictive models

Who This Book Is For

This book is designed for business analysts, BI analysts, data scientists, and junior-level data analysts who are ready to move on from a conceptual understanding of advanced analytics and become experts in designing and building advanced analytics solutions using Python. If you are familiar with coding in Python (or another programming, statistical, or scripting language) but have never used or read about predictive analytics algorithms, this book will also help you.

What You Will Learn

  • Understand the statistical and mathematical concepts behind predictive analytics algorithms and implement them using Python libraries
  • Get to know various methods for importing, cleaning, sub-setting, merging, joining, concatenating, exploring, grouping, and plotting data with pandas and NumPy
  • Master the use of Python notebooks for exploratory data analysis and rapid prototyping
  • Get to grips with applying regression, classification, clustering, and deep learning algorithms
  • Discover advanced methods to analyze structured and unstructured data
  • Visualize the performance of models and the insights they produce
  • Ensure the robustness of your analytic applications by mastering the best practices of predictive analysis

In Detail

Social media and the Internet of Things have resulted in an avalanche of data. Data is powerful, but not in its raw form; it needs to be processed and modeled, and Python is one of the most robust tools available for doing so. It has an array of packages for predictive modeling and a suite of IDEs to choose from. Using the Python programming language, analysts can apply sophisticated predictive methods to build scalable analytic applications. This book is your guide to getting started with predictive analytics using Python.

You'll balance statistical and mathematical concepts, and implement them in Python using libraries such as pandas, scikit-learn, and NumPy. Through case studies and code examples using popular open-source Python libraries, this book illustrates the complete development process for analytic applications. Covering a wide range of algorithms for classification, regression, and clustering, as well as cutting-edge techniques such as deep learning, this book explains how these methods work. You will learn how to choose the right approach for your problem and how to develop engaging visualizations that bring the insights of predictive modeling to life.

Finally, you will learn best practices in predictive modeling, as well as the different applications of predictive modeling in the modern world. The course provides you with highly practical content from the following Packt books:

1. Learning Predictive Analytics with Python

2. Mastering Predictive Analytics with Python

Style and approach

This course aims to create a smooth learning path that will teach you how to effectively perform predictive analytics using Python. Through this comprehensive course, you'll learn the basics of predictive analytics and progress to predictive modeling in the modern world.

Table of contents

  1. Python: Advanced Predictive Analytics
    1. Table of Contents
    2. Python: Advanced Predictive Analytics
    3. Credits
    4. Preface
      1. What this learning path covers
      2. What you need for this learning path
      3. Who this learning path is for
      4. Reader feedback
      5. Customer support
        1. Downloading the example code
        2. Errata
        3. Piracy
        4. Questions
    5. 1. Module 1
      1. 1. Getting Started with Predictive Modelling
        1. Introducing predictive modelling
          1. Scope of predictive modelling
            1. Ensemble of statistical algorithms
            2. Statistical tools
            3. Historical data
            4. Mathematical function
            5. Business context
          2. Knowledge matrix for predictive modelling
          3. Task matrix for predictive modelling
        2. Applications and examples of predictive modelling
          1. LinkedIn's "People also viewed" feature
            1. What does it do?
            2. How is it done?
          2. Correct targeting of online ads
            1. How is it done?
          3. Santa Cruz predictive policing
            1. How is it done?
          4. Determining the activity of a smartphone user using accelerometer data
            1. How is it done?
          5. Sport and fantasy leagues
            1. How was it done?
        3. Python and its packages – download and installation
          1. Anaconda
          2. Standalone Python
          3. Installing a Python package
            1. Installing pip
            2. Installing Python packages with pip
        4. Python and its packages for predictive modelling
        5. IDEs for Python
        6. Summary
      2. 2. Data Cleaning
        1. Reading the data – variations and examples
          1. Data frames
          2. Delimiters
        2. Various methods of importing data in Python
          1. Case 1 – reading a dataset using the read_csv method
            1. The read_csv method
            2. Use cases of the read_csv method
              1. Passing the directory address and filename as variables
              2. Reading a .txt dataset with a comma delimiter
              3. Specifying the column names of a dataset from a list
          2. Case 2 – reading a dataset using the open method of Python
            1. Reading a dataset line by line
            2. Changing the delimiter of a dataset
          3. Case 3 – reading data from a URL
          4. Case 4 – miscellaneous cases
            1. Reading from an .xls or .xlsx file
            2. Writing to a CSV or Excel file
        3. Basics – summary, dimensions, and structure
        4. Handling missing values
          1. Checking for missing values
          2. What constitutes missing data?
            1. How missing values are generated and propagated
          3. Treating missing values
            1. Deletion
            2. Imputation
        5. Creating dummy variables
        6. Visualizing a dataset by basic plotting
          1. Scatter plots
          2. Histograms
          3. Boxplots
        7. Summary
      3. 3. Data Wrangling
        1. Subsetting a dataset
          1. Selecting columns
          2. Selecting rows
          3. Selecting a combination of rows and columns
          4. Creating new columns
        2. Generating random numbers and their usage
          1. Various methods for generating random numbers
          2. Seeding a random number
          3. Generating random numbers following probability distributions
            1. Probability density function
            2. Cumulative distribution function
            3. Uniform distribution
            4. Normal distribution
          4. Using the Monte-Carlo simulation to find the value of pi
            1. Geometry and mathematics behind the calculation of pi
          5. Generating a dummy data frame
        3. Grouping the data – aggregation, filtering, and transformation
          1. Aggregation
          2. Filtering
          3. Transformation
          4. Miscellaneous operations
        4. Random sampling – splitting a dataset in training and testing datasets
          1. Method 1 – using the Customer Churn Model
          2. Method 2 – using sklearn
          3. Method 3 – using the shuffle function
        5. Concatenating and appending data
        6. Merging/joining datasets
          1. Inner Join
          2. Left Join
          3. Right Join
          4. An example of the Inner Join
          5. An example of the Left Join
          6. An example of the Right Join
          7. Summary of Joins in terms of their length
        7. Summary
      4. 4. Statistical Concepts for Predictive Modelling
        1. Random sampling and the central limit theorem
        2. Hypothesis testing
          1. Null versus alternate hypothesis
          2. Z-statistic and t-statistic
          3. Confidence intervals, significance levels, and p-values
          4. Different kinds of hypothesis test
          5. A step-by-step guide to do a hypothesis test
          6. An example of a hypothesis test
        3. Chi-square tests
        4. Correlation
        5. Summary
      5. 5. Linear Regression with Python
        1. Understanding the maths behind linear regression
          1. Linear regression using simulated data
            1. Fitting a linear regression model and checking its efficacy
            2. Finding the optimum value of variable coefficients
        2. Making sense of result parameters
          1. p-values
          2. F-statistics
          3. Residual Standard Error
        3. Implementing linear regression with Python
          1. Linear regression using the statsmodels library
          2. Multiple linear regression
          3. Multi-collinearity
            1. Variance Inflation Factor
        4. Model validation
          1. Training and testing data split
          2. Summary of models
          3. Linear regression with scikit-learn
          4. Feature selection with scikit-learn
        5. Handling other issues in linear regression
          1. Handling categorical variables
          2. Transforming a variable to fit non-linear relations
          3. Handling outliers
          4. Other considerations and assumptions for linear regression
        6. Summary
      6. 6. Logistic Regression with Python
        1. Linear regression versus logistic regression
        2. Understanding the math behind logistic regression
          1. Contingency tables
          2. Conditional probability
          3. Odds ratio
          4. Moving on to logistic regression from linear regression
          5. Estimation using the Maximum Likelihood Method
            1. Likelihood function:
            2. Log likelihood function:
            3. Building the logistic regression model from scratch
          6. Making sense of logistic regression parameters
            1. Wald test
            2. Likelihood Ratio Test statistic
            3. Chi-square test
        3. Implementing logistic regression with Python
          1. Processing the data
          2. Data exploration
          3. Data visualization
          4. Creating dummy variables for categorical variables
          5. Feature selection
          6. Implementing the model
        4. Model validation and evaluation
          1. Cross validation
        5. Model validation
          1. The ROC curve
            1. Confusion matrix
        6. Summary
      7. 7. Clustering with Python
        1. Introduction to clustering – what, why, and how?
          1. What is clustering?
          2. How is clustering used?
          3. Why do we do clustering?
        2. Mathematics behind clustering
          1. Distances between two observations
            1. Euclidean distance
            2. Manhattan distance
            3. Minkowski distance
            4. The distance matrix
          2. Normalizing the distances
          3. Linkage methods
            1. Single linkage
            2. Complete linkage
            3. Average linkage
            4. Centroid linkage
            5. Ward's method
          4. Hierarchical clustering
          5. K-means clustering
        3. Implementing clustering using Python
          1. Importing and exploring the dataset
          2. Normalizing the values in the dataset
          3. Hierarchical clustering using scikit-learn
          4. K-Means clustering using scikit-learn
            1. Interpreting the cluster
        4. Fine-tuning the clustering
          1. The elbow method
          2. Silhouette Coefficient
        5. Summary
      8. 8. Trees and Random Forests with Python
        1. Introducing decision trees
          1. A decision tree
        2. Understanding the mathematics behind decision trees
          1. Homogeneity
          2. Entropy
          3. Information gain
          4. ID3 algorithm to create a decision tree
          5. Gini index
          6. Reduction in Variance
          7. Pruning a tree
          8. Handling a continuous numerical variable
          9. Handling a missing value of an attribute
        3. Implementing a decision tree with scikit-learn
          1. Visualizing the tree
          2. Cross-validating and pruning the decision tree
        4. Understanding and implementing regression trees
          1. Regression tree algorithm
          2. Implementing a regression tree using Python
        5. Understanding and implementing random forests
          1. The random forest algorithm
          2. Implementing a random forest using Python
          3. Why do random forests work?
          4. Important parameters for random forests
        6. Summary
      9. 9. Best Practices for Predictive Modelling
        1. Best practices for coding
          1. Commenting the codes
          2. Defining functions for substantial individual tasks
            1. Example 1
            2. Example 2
            3. Example 3
          3. Avoid hard-coding of variables as much as possible
          4. Version control
          5. Using standard libraries, methods, and formulas
        2. Best practices for data handling
        3. Best practices for algorithms
        4. Best practices for statistics
        5. Best practices for business contexts
        6. Summary
      10. A. A List of Links
    6. 2. Module 2
      1. 1. From Data to Decisions – Getting Started with Analytic Applications
        1. Designing an advanced analytic solution
          1. Data layer: warehouses, lakes, and streams
          2. Modeling layer
          3. Deployment layer
          4. Reporting layer
        2. Case study: sentiment analysis of social media feeds
          1. Data input and transformation
          2. Sanity checking
          3. Model development
          4. Scoring
          5. Visualization and reporting
        3. Case study: targeted e-mail campaigns
          1. Data input and transformation
          2. Sanity checking
          3. Model development
          4. Scoring
          5. Visualization and reporting
        4. Summary
      2. 2. Exploratory Data Analysis and Visualization in Python
        1. Exploring categorical and numerical data in IPython
          1. Installing IPython notebook
          2. The notebook interface
          3. Loading and inspecting data
          4. Basic manipulations – grouping, filtering, mapping, and pivoting
          5. Charting with Matplotlib
        2. Time series analysis
          1. Cleaning and converting
          2. Time series diagnostics
          3. Joining signals and correlation
        3. Working with geospatial data
          1. Loading geospatial data
          2. Working in the cloud
        4. Introduction to PySpark
          1. Creating the SparkContext
          2. Creating an RDD
          3. Creating a Spark DataFrame
        5. Summary
      3. 3. Finding Patterns in the Noise – Clustering and Unsupervised Learning
        1. Similarity and distance metrics
          1. Numerical distance metrics
          2. Correlation similarity metrics and time series
          3. Similarity metrics for categorical data
          4. K-means clustering
        2. Affinity propagation – automatically choosing cluster numbers
        3. k-medoids
        4. Agglomerative clustering
          1. Where agglomerative clustering fails
        5. Streaming clustering in Spark
        6. Summary
      4. 4. Connecting the Dots with Models – Regression Methods
        1. Linear regression
          1. Data preparation
          2. Model fitting and evaluation
          3. Statistical significance of regression outputs
          4. Generalized estimating equations
          5. Mixed effects models
          6. Time series data
          7. Generalized linear models
          8. Applying regularization to linear models
        2. Tree methods
          1. Decision trees
          2. Random forest
        3. Scaling out with PySpark – predicting year of song release
        4. Summary
      5. 5. Putting Data in its Place – Classification Methods and Analysis
        1. Logistic regression
          1. Multiclass logistic classifiers: multinomial regression
          2. Formatting a dataset for classification problems
          3. Learning pointwise updates with stochastic gradient descent
          4. Jointly optimizing all parameters with second-order methods
        2. Fitting the model
        3. Evaluating classification models
          1. Strategies for improving classification models
        4. Separating nonlinear boundaries with support vector machines
          1. Fitting an SVM to the census data
          2. Boosting – combining small models to improve accuracy
          3. Gradient boosted decision trees
        5. Comparing classification methods
        6. Case study: fitting classifier models in PySpark
        7. Summary
      6. 6. Words and Pixels – Working with Unstructured Data
        1. Working with textual data
          1. Cleaning textual data
          2. Extracting features from textual data
          3. Using dimensionality reduction to simplify datasets
        2. Principal component analysis
          1. Latent Dirichlet Allocation
          2. Using dimensionality reduction in predictive modeling
        3. Images
          1. Cleaning image data
          2. Thresholding images to highlight objects
          3. Dimensionality reduction for image analysis
        4. Case Study: Training a Recommender System in PySpark
        5. Summary
      7. 7. Learning from the Bottom Up – Deep Networks and Unsupervised Features
        1. Learning patterns with neural networks
          1. A network of one – the perceptron
          2. Combining perceptrons – a single-layer neural network
          3. Parameter fitting with back-propagation
          4. Discriminative versus generative models
          5. Vanishing gradients and explaining away
          6. Pretraining belief networks
          7. Using dropout to regularize networks
          8. Convolutional networks and rectified units
          9. Compressing Data with autoencoder networks
          10. Optimizing the learning rate
        2. The TensorFlow library and digit recognition
          1. The MNIST data
          2. Constructing the network
        3. Summary
      8. 8. Sharing Models with Prediction Services
        1. The architecture of a prediction service
        2. Clients and making requests
          1. The GET requests
          2. The POST request
          3. The HEAD request
          4. The PUT request
          5. The DELETE request
        3. Server – the web traffic controller
          1. Application – the engine of the predictive services
        4. Persisting information with database systems
        5. Case study – logistic regression service
          1. Setting up the database
          2. The web server
          3. The web application
            1. The flow of a prediction service – training a model
            2. On-demand and bulk prediction
        6. Summary
      9. 9. Reporting and Testing – Iterating on Analytic Systems
        1. Checking the health of models with diagnostics
          1. Evaluating changes in model performance
          2. Changes in feature importance
          3. Changes in unsupervised model performance
        2. Iterating on models through A/B testing
          1. Experimental allocation – assigning customers to experiments
          2. Deciding a sample size
          3. Multiple hypothesis testing
        3. Guidelines for communication
          1. Translate terms to business values
          2. Visualizing results
            1. Case Study: building a reporting service
          3. The report server
          4. The report application
          5. The visualization layer
        4. Summary
    7. Bibliography
    8. Index

Product information

  • Title: Python: Advanced Predictive Analytics
  • Author(s): Ashish Kumar, Joseph Babcock
  • Release date: December 2017
  • Publisher(s): Packt Publishing
  • ISBN: 9781788992367