Learning Predictive Analytics with R

Book Description

Get to grips with key data visualization and predictive analytic skills using R

About This Book

  • Acquire predictive analytic skills using various tools of R

  • Make predictions about future events by discovering valuable information from data using R

  • Comprehensible guidelines that focus on predictive model design with real-world data

Who This Book Is For

    If you are a statistician, chief information officer, data scientist, ML engineer, ML practitioner, quantitative analyst, or student of machine learning, this is the book for you. Basic knowledge of R is helpful, but readers without previous experience of programming in R will also be able to use the tools in the book.

What You Will Learn

  • Customize R by installing and loading new packages

  • Explore the structure of data using clustering algorithms

  • Turn unstructured text into ordered data, and acquire knowledge from the data

  • Classify your observations using Naïve Bayes, k-NN, and decision trees

  • Reduce the dimensionality of your data using principal component analysis

  • Discover association rules using Apriori

  • Understand how statistical distributions can help retrieve information from data using correlations, linear regression, and multilevel regression

  • Use PMML to deploy the models generated in R

In Detail

    R is statistical software used for data analysis. There are two main types of learning from data: unsupervised learning, where the structure of the data is extracted automatically; and supervised learning, where a labeled part of the data is used to learn the relationship between the attributes and the classes or scores of a target attribute. Because important information is often hidden in large amounts of data, R helps to extract it with its many standard and cutting-edge statistical functions.
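As a rough illustration of the two types of learning, the sketch below uses only base R. The built-in iris dataset and the choice of variables are assumptions made for this example and are not taken from the book.

```r
# Unsupervised vs. supervised learning on R's built-in iris data.
# Illustrative only: the dataset and variable choices are assumptions.
set.seed(42)                       # k-means starts from random centroids

features <- iris[, 1:4]            # four numeric measurements, no labels

# Unsupervised: k-means groups the rows without ever seeing Species.
clusters <- kmeans(features, centers = 3, nstart = 10)
cluster_vs_species <- table(cluster = clusters$cluster,
                            species = iris$Species)
print(cluster_vs_species)

# Supervised: a linear regression learns a known target (Petal.Width)
# from a predictor (Petal.Length).
fit <- lm(Petal.Width ~ Petal.Length, data = iris)
print(coef(fit))
```

The cross-tabulation shows how far the automatically extracted clusters line up with the known species, while the regression coefficients describe the learned relationship to the target attribute.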

    This book is packed with easy-to-follow guidelines that explain how many of the key data mining tools of R work and how they are used to discover knowledge from your data.

    You will learn how to perform key predictive analytics tasks using R, such as training and testing predictive models for classification and regression tasks, scoring new data sets, and so on. All chapters will guide you in acquiring the skills in a practical way. Most chapters also include a theoretical introduction that will sharpen your understanding of the subject matter and invite you to go further.
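A minimal sketch of the train, test, and score workflow for a regression task, written in base R. The mtcars dataset, the 70/30 split, the predictors, and the `new_cars` data frame are all assumptions chosen for illustration rather than examples from the book.

```r
# Train / test / score workflow with base R (illustrative assumptions only).
set.seed(1)                                    # reproducible random split

n <- nrow(mtcars)
train_idx <- sample(n, size = round(0.7 * n))  # about 70% of rows for training
train <- mtcars[train_idx, ]
test  <- mtcars[-train_idx, ]

# Train: fit a linear regression of fuel consumption on weight and horsepower.
fit <- lm(mpg ~ wt + hp, data = train)

# Test: predict the held-out rows and compute the root mean squared error.
pred <- predict(fit, newdata = test)
rmse <- sqrt(mean((test$mpg - pred)^2))
print(rmse)

# Score: apply the fitted model to new, unlabeled cases (hypothetical values).
new_cars <- data.frame(wt = c(2.5, 3.5), hp = c(100, 180))
scores <- predict(fit, newdata = new_cars)
print(scores)
```

Keeping the test rows out of the fitting step is what makes the error estimate an honest indication of how the model may behave on data it has not seen.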

    The book familiarizes you with the most common data mining tools of R, such as k-means, hierarchical clustering, linear regression, association rules, principal component analysis, multilevel modeling, k-NN, Naïve Bayes, decision trees, and text mining. It also describes visualization techniques using the basic visualization tools of R, as well as lattice for visualizing patterns in data organized in groups. This book is invaluable for anyone fascinated by the data mining opportunities offered by GNU R and its packages.

Style and Approach

    This is a practical book that uses tutorials to analyze compelling data about life, health, and death. It offers a way of interpreting the data used in this book that can also be applied to other data.

    Downloading the example code for this book: You can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to receive the code files.

Table of Contents

    1. Learning Predictive Analytics with R
      1. Table of Contents
      2. Learning Predictive Analytics with R
      3. Credits
      4. About the Author
      5. About the Reviewers
      6. www.PacktPub.com
        1. Support files, eBooks, discount offers, and more
          1. Why subscribe?
          2. Free access for Packt account holders
      7. Preface
        1. Prediction
        2. Supervised and unsupervised learning
          1. Unsupervised learning
          2. Supervised learning
        3. Classification and regression problems
          1. Classification
          2. Regression
        4. The role of field knowledge in data modeling
        5. Caveats
        6. What this book covers
        7. What you need for this book
        8. Who this book is for
        9. Conventions
        10. Reader feedback
        11. Customer support
          1. Downloading the example code
          2. Downloading the color images of this book
          3. Errata
          4. Piracy
          5. eBooks, discount offers, and more
          6. Questions
      8. 1. Setting GNU R for Predictive Analytics
        1. Installing GNU R
        2. The R graphic user interface
        3. The menu bar of the R console
          1. A quick look at the File menu
          2. A quick look at the Misc menu
        4. Packages
          1. Installing packages in R
          2. Loading packages in R
        5. Summary
      9. 2. Visualizing and Manipulating Data Using R
        1. The roulette case
        2. Histograms and bar plots
        3. Scatterplots
        4. Boxplots
        5. Line plots
        6. Application – Outlier detection
        7. Formatting plots
        8. Summary
      10. 3. Data Visualization with Lattice
        1. Loading and discovering the lattice package
        2. Discovering multipanel conditioning with xyplot()
        3. Discovering other lattice plots
          1. Histograms
          2. Stacked bars
          3. Dotplots
          4. Displaying data points as text
        4. Updating graphics
        5. Case study – exploring cancer-related deaths in the US
          1. Discovering the dataset
          2. Integrating supplementary external data
        6. Summary
      11. 4. Cluster Analysis
        1. Distance measures
        2. Learning by doing – partition clustering with kmeans()
          1. Setting the centroids
          2. Computing distances to centroids
          3. Computing the closest cluster for each case
          4. Tasks performed by the main function
            1. Internal validation
        3. Using k-means with public datasets
          1. Understanding the data with the all.us.city.crime.1970 dataset
          2. Finding the best number of clusters in the life.expectancy.1971 dataset
            1. External validation
        4. Summary
      12. 5. Agglomerative Clustering Using hclust()
        1. The inner working of agglomerative clustering
        2. Agglomerative clustering with hclust()
          1. Exploring the results of votes in Switzerland
          2. The use of hierarchical clustering on binary attributes
        3. Summary
      13. 6. Dimensionality Reduction with Principal Component Analysis
        1. The inner working of Principal Component Analysis
        2. Learning PCA in R
          1. Dealing with missing values
          2. Selecting how many components are relevant
          3. Naming the components using the loadings
          4. PCA scores
            1. Accessing the PCA scores
          5. PCA scores for analysis
          6. PCA diagnostics
        3. Summary
      14. 7. Exploring Association Rules with Apriori
        1. Apriori – basic concepts
          1. Association rules
          2. Itemsets
          3. Support
          4. Confidence
          5. Lift
        2. The inner working of apriori
          1. Generating itemsets with support-based pruning
          2. Generating rules by using confidence-based pruning
        3. Analyzing data with apriori in R
          1. Using apriori for basic analysis
          2. Detailed analysis with apriori
            1. Preparing the data
            2. Analyzing the data
            3. Coercing association rules to a data frame
            4. Visualizing association rules
        4. Summary
      15. 8. Probability Distributions, Covariance, and Correlation
        1. Probability distributions
          1. Introducing probability distributions
            1. Discrete uniform distribution
          2. The normal distribution
          3. The Student's t-distribution
          4. The binomial distribution
          5. The importance of distributions
        2. Covariance and correlation
          1. Covariance
          2. Correlation
            1. Pearson's correlation
            2. Spearman's correlation
        3. Summary
      16. 9. Linear Regression
        1. Understanding simple regression
          1. Computing the intercept and slope coefficient
          2. Obtaining the residuals
          3. Computing the significance of the coefficient
        2. Working with multiple regression
        3. Analyzing data in R: correlation and regression
          1. First steps in the data analysis
          2. Performing the regression
          3. Checking for the normality of residuals
          4. Checking for variance inflation
          5. Examining potential mediations and comparing models
          6. Predicting new data
        4. Robust regression
        5. Bootstrapping
        6. Summary
      17. 10. Classification with k-Nearest Neighbors and Naïve Bayes
        1. Understanding k-NN
        2. Working with k-NN in R
          1. How to select k
        3. Understanding Naïve Bayes
        4. Working with Naïve Bayes in R
        5. Computing the performance of classification
        6. Summary
      18. 11. Classification Trees
        1. Understanding decision trees
        2. ID3
          1. Entropy
          2. Information gain
        3. C4.5
          1. The gain ratio
          2. Post-pruning
        4. C5.0
        5. Classification and regression trees and random forest
          1. CART
          2. Random forest
            1. Bagging
        6. Conditional inference trees and forests
        7. Installing the packages containing the required functions
          1. Installing C4.5
          2. Installing C5.0
          3. Installing CART
          4. Installing random forest
          5. Installing conditional inference trees
          6. Loading and preparing the data
        8. Performing the analyses in R
          1. Classification with C4.5
            1. The unpruned tree
            2. The pruned tree
          2. C50
          3. CART
            1. Pruning
            2. Random forests in R
          4. Examining the predictions on the testing set
          5. Conditional inference trees in R
        9. Caret – a unified framework for classification
        10. Summary
      19. 12. Multilevel Analyses
        1. Nested data
        2. Multilevel regression
          1. Random intercepts and fixed slopes
          2. Random intercepts and random slopes
        3. Multilevel modeling in R
          1. The null model
          2. Random intercepts and fixed slopes
          3. Random intercepts and random slopes
        4. Predictions using multilevel models
          1. Using the predict() function
          2. Assessing prediction quality
        5. Summary
      20. 13. Text Analytics with R
        1. An introduction to text analytics
        2. Loading the corpus
        3. Data preparation
          1. Preprocessing and inspecting the corpus
          2. Computing new attributes
        4. Creating the training and testing data frames
        5. Classification of the reviews
          1. Document classification with k-NN
          2. Document classification with Naïve Bayes
          3. Classification using logistic regression
          4. Document classification with support vector machines
        6. Mining the news with R
          1. A successful document classification
          2. Extracting the topics of the articles
          3. Collecting news articles in R from the New York Times article search API
        7. Summary
      21. 14. Cross-validation and Bootstrapping Using Caret and Exporting Predictive Models Using PMML
        1. Cross-validation and bootstrapping of predictive models using the caret package
          1. Cross-validation
          2. Performing cross-validation in R with caret
          3. Bootstrapping
          4. Performing bootstrapping in R with caret
          5. Predicting new data
        2. Exporting models using PMML
          1. What is PMML?
          2. A brief description of the structure of PMML objects
          3. Examples of predictive model exportation
            1. Exporting k-means objects
            2. Hierarchical clustering
            3. Exporting association rules (apriori objects)
            4. Exporting Naïve Bayes objects
            5. Exporting decision trees (rpart objects)
            6. Exporting random forest objects
            7. Exporting logistic regression objects
            8. Exporting support vector machine objects
        3. Summary
      22. A. Exercises and Solutions
        1. Exercises
          1. Chapter 1 – Setting GNU R for Predictive Modeling
          2. Chapter 2 – Visualizing and Manipulating Data Using R
          3. Chapter 3 – Data Visualization with Lattice
          4. Chapter 4 – Cluster Analysis
          5. Chapter 5 – Agglomerative Clustering Using hclust()
          6. Chapter 6 – Dimensionality Reduction with Principal Component Analysis
          7. Chapter 7 – Exploring Association Rules with Apriori
          8. Chapter 8 – Probability Distributions, Covariance, and Correlation
          9. Chapter 9 – Linear Regression
          10. Chapter 10 – Classification with k-Nearest Neighbors and Naïve Bayes
          11. Chapter 11 – Classification Trees
          12. Chapter 12 – Multilevel Analyses
          13. Chapter 13 – Text Analytics with R
        2. Solutions
          1. Chapter 1 – Setting GNU R for Predictive Modeling
          2. Chapter 2 – Visualizing and Manipulating Data Using R
          3. Chapter 3 – Data Visualization with Lattice
          4. Chapter 4 – Cluster Analysis
          5. Chapter 5 – Agglomerative Clustering Using hclust()
          6. Chapter 6 – Dimensionality Reduction with Principal Component Analysis
          7. Chapter 7 – Exploring Association Rules with Apriori
          8. Chapter 8 – Probability Distributions, Covariance, and Correlation
          9. Chapter 9 – Linear Regression
          10. Chapter 10 – Classification with k-Nearest Neighbors and Naïve Bayes
          11. Chapter 11 – Classification Trees
          12. Chapter 12 – Multilevel Analyses
          13. Chapter 13 – Text Analytics with R
      23. B. Further Reading and References
        1. Preface
        2. Chapter 1 – Setting GNU R for Predictive Modeling
        3. Chapter 2 – Visualizing and Manipulating Data Using R
        4. Chapter 3 – Data Visualization with Lattice
        5. Chapter 4 – Cluster Analysis
        6. Chapter 5 – Agglomerative Clustering Using hclust()
        7. Chapter 6 – Dimensionality Reduction with Principal Component Analysis
        8. Chapter 7 – Exploring Association Rules with Apriori
        9. Chapter 8 – Probability Distributions, Covariance, and Correlation
        10. Chapter 9 – Linear Regression
        11. Chapter 10 – Classification with k-Nearest Neighbors and Naïve Bayes
        12. Chapter 11 – Classification Trees
        13. Chapter 12 – Multilevel Analyses
        14. Chapter 13 – Text Analytics with R
        15. Chapter 14 – Cross-validation and Bootstrapping Using Caret and Exporting Predictive Models Using PMML
      24. Index