Machine Learning with R

Book Description

R gives you access to the cutting-edge software you need to prepare data for machine learning. No previous knowledge required – this book will take you methodically through every stage of applying machine learning.

  • Harness the power of R for statistical computing and data science

  • Use R to apply common machine learning algorithms with real-world applications

  • Prepare, examine, and visualize data for analysis

  • Understand how to choose between machine learning models

  • Packed with clear instructions to explore, forecast, and classify data

In Detail

    Machine learning, at its core, is concerned with transforming data into actionable knowledge. This fact makes machine learning well-suited to the present-day era of "big data" and "data science". Given the growing prominence of R—a cross-platform, zero-cost statistical programming environment—there has never been a better time to start applying machine learning. Whether you are new to data science or a veteran, machine learning with R offers a powerful set of methods for quickly and easily gaining insight from your data.

    "Machine Learning with R" is a practical tutorial that uses hands-on examples to step through real-world applications of machine learning. Without shying away from the technical details, it explains each method using clear and practical examples. The book is well suited both to machine learning beginners and to readers with prior experience. Explore R to find answers to the questions you have about your data.

    How can we use machine learning to transform data into action? Using practical examples, we will explore how to prepare data for analysis, choose a machine learning method, and measure the success of the process.

    We will learn how to apply machine learning methods to a variety of common tasks including classification, prediction, forecasting, market basket analysis, and clustering. By applying the most effective machine learning methods to real-world problems, you will gain hands-on experience that will transform the way you think about data.

    "Machine Learning with R" will provide you with the analytical tools you need to quickly gain insight from complex data.

    Table of Contents

    1. Machine Learning with R
      1. Table of Contents
      2. Machine Learning with R
      3. Credits
      4. About the Author
      5. About the Reviewers
      6. www.PacktPub.com
        1. Support files, eBooks, discount offers and more
          1. Why Subscribe?
          2. Free Access for Packt account holders
      7. Preface
        1. What this book covers
        2. What you need for this book
        3. Who this book is for
        4. Conventions
        5. Reader feedback
        6. Customer support
          1. Downloading the example code
          2. Errata
          3. Piracy
          4. Questions
      8. 1. Introducing Machine Learning
        1. The origins of machine learning
        2. Uses and abuses of machine learning
          1. Ethical considerations
        3. How do machines learn?
          1. Abstraction and knowledge representation
          2. Generalization
          3. Assessing the success of learning
        4. Steps to apply machine learning to your data
        5. Choosing a machine learning algorithm
          1. Thinking about the input data
          2. Thinking about types of machine learning algorithms
          3. Matching your data to an appropriate algorithm
        6. Using R for machine learning
          1. Installing and loading R packages
            1. Installing an R package
            2. Installing a package using the point-and-click interface
            3. Loading an R package
        7. Summary
      9. 2. Managing and Understanding Data
        1. R data structures
        2. Vectors
        3. Factors
          1. Lists
          2. Data frames
          3. Matrixes and arrays
        4. Managing data with R
          1. Saving and loading R data structures
          2. Importing and saving data from CSV files
          3. Importing data from SQL databases
        5. Exploring and understanding data
          1. Exploring the structure of data
          2. Exploring numeric variables
            1. Measuring the central tendency – mean and median
            2. Measuring spread – quartiles and the five-number summary
            3. Visualizing numeric variables – boxplots
            4. Visualizing numeric variables – histograms
            5. Understanding numeric data – uniform and normal distributions
            6. Measuring spread – variance and standard deviation
          3. Exploring categorical variables
            1. Measuring the central tendency – the mode
          4. Exploring relationships between variables
            1. Visualizing relationships – scatterplots
            2. Examining relationships – two-way cross-tabulations
        6. Summary
      10. 3. Lazy Learning – Classification Using Nearest Neighbors
        1. Understanding classification using nearest neighbors
          1. The kNN algorithm
            1. Calculating distance
            2. Choosing an appropriate k
            3. Preparing data for use with kNN
          2. Why is the kNN algorithm lazy?
        2. Diagnosing breast cancer with the kNN algorithm
          1. Step 1 – collecting data
          2. Step 2 – exploring and preparing the data
            1. Transformation – normalizing numeric data
            2. Data preparation – creating training and test datasets
          3. Step 3 – training a model on the data
          4. Step 4 – evaluating model performance
          5. Step 5 – improving model performance
            1. Transformation – z-score standardization
            2. Testing alternative values of k
        3. Summary
      11. 4. Probabilistic Learning – Classification Using Naive Bayes
        1. Understanding naive Bayes
          1. Basic concepts of Bayesian methods
            1. Probability
            2. Joint probability
            3. Conditional probability with Bayes' theorem
          2. The naive Bayes algorithm
            1. The naive Bayes classification
            2. The Laplace estimator
            3. Using numeric features with naive Bayes
        2. Example – filtering mobile phone spam with the naive Bayes algorithm
          1. Step 1 – collecting data
          2. Step 2 – exploring and preparing the data
          3. Data preparation – processing text data for analysis
            1. Data preparation – creating training and test datasets
            2. Visualizing text data – word clouds
            3. Data preparation – creating indicator features for frequent words
          4. Step 3 – training a model on the data
          5. Step 4 – evaluating model performance
          6. Step 5 – improving model performance
        3. Summary
      12. 5. Divide and Conquer – Classification Using Decision Trees and Rules
        1. Understanding decision trees
          1. Divide and conquer
          2. The C5.0 decision tree algorithm
            1. Choosing the best split
            2. Pruning the decision tree
        2. Example – identifying risky bank loans using C5.0 decision trees
          1. Step 1 – collecting data
          2. Step 2 – exploring and preparing the data
            1. Data preparation – creating random training and test datasets
          3. Step 3 – training a model on the data
          4. Step 4 – evaluating model performance
          5. Step 5 – improving model performance
            1. Boosting the accuracy of decision trees
            2. Making some mistakes more costly than others
        3. Understanding classification rules
          1. Separate and conquer
          2. The One Rule algorithm
          3. The RIPPER algorithm
          4. Rules from decision trees
        4. Example – identifying poisonous mushrooms with rule learners
          1. Step 1 – collecting data
          2. Step 2 – exploring and preparing the data
          3. Step 3 – training a model on the data
          4. Step 4 – evaluating model performance
          5. Step 5 – improving model performance
        5. Summary
      13. 6. Forecasting Numeric Data – Regression Methods
        1. Understanding regression
          1. Simple linear regression
          2. Ordinary least squares estimation
          3. Correlations
          4. Multiple linear regression
        2. Example – predicting medical expenses using linear regression
          1. Step 1 – collecting data
          2. Step 2 – exploring and preparing the data
            1. Exploring relationships among features – the correlation matrix
            2. Visualizing relationships among features – the scatterplot matrix
          3. Step 3 – training a model on the data
          4. Step 4 – evaluating model performance
          5. Step 5 – improving model performance
            1. Model specification – adding non-linear relationships
            2. Transformation – converting a numeric variable to a binary indicator
            3. Model specification – adding interaction effects
            4. Putting it all together – an improved regression model
        3. Understanding regression trees and model trees
          1. Adding regression to trees
        4. Example – estimating the quality of wines with regression trees and model trees
          1. Step 1 – collecting data
          2. Step 2 – exploring and preparing the data
          3. Step 3 – training a model on the data
            1. Visualizing decision trees
          4. Step 4 – evaluating model performance
            1. Measuring performance with mean absolute error
          5. Step 5 – improving model performance
        5. Summary
      14. 7. Black Box Methods – Neural Networks and Support Vector Machines
        1. Understanding neural networks
          1. From biological to artificial neurons
          2. Activation functions
          3. Network topology
            1. The number of layers
            2. The direction of information travel
            3. The number of nodes in each layer
          4. Training neural networks with backpropagation
        2. Modeling the strength of concrete with ANNs
          1. Step 1 – collecting data
          2. Step 2 – exploring and preparing the data
          3. Step 3 – training a model on the data
          4. Step 4 – evaluating model performance
          5. Step 5 – improving model performance
        3. Understanding Support Vector Machines
          1. Classification with hyperplanes
          2. Finding the maximum margin
            1. The case of linearly separable data
            2. The case of non-linearly separable data
          3. Using kernels for non-linear spaces
        4. Performing OCR with SVMs
          1. Step 1 – collecting data
          2. Step 2 – exploring and preparing the data
          3. Step 3 – training a model on the data
          4. Step 4 – evaluating model performance
          5. Step 5 – improving model performance
        5. Summary
      15. 8. Finding Patterns – Market Basket Analysis Using Association Rules
        1. Understanding association rules
          1. The Apriori algorithm for association rule learning
            1. Measuring rule interest – support and confidence
            2. Building a set of rules with the Apriori principle
        2. Example – identifying frequently purchased groceries with association rules
          1. Step 1 – collecting data
          2. Step 2 – exploring and preparing the data
            1. Data preparation – creating a sparse matrix for transaction data
            2. Visualizing item support – item frequency plots
            3. Visualizing transaction data – plotting the sparse matrix
          3. Step 3 – training a model on the data
          4. Step 4 – evaluating model performance
          5. Step 5 – improving model performance
            1. Sorting the set of association rules
            2. Taking subsets of association rules
            3. Saving association rules to a file or data frame
        3. Summary
      16. 9. Finding Groups of Data – Clustering with k-means
        1. Understanding clustering
          1. Clustering as a machine learning task
          2. The k-means algorithm for clustering
            1. Using distance to assign and update clusters
            2. Choosing the appropriate number of clusters
          3. Finding teen market segments using k-means clustering
          4. Step 1 – collecting data
          5. Step 2 – exploring and preparing the data
            1. Data preparation – dummy coding missing values
            2. Data preparation – imputing missing values
          6. Step 3 – training a model on the data
          7. Step 4 – evaluating model performance
          8. Step 5 – improving model performance
        2. Summary
      17. 10. Evaluating Model Performance
        1. Measuring performance for classification
          1. Working with classification prediction data in R
          2. A closer look at confusion matrices
          3. Using confusion matrices to measure performance
          4. Beyond accuracy – other measures of performance
            1. The kappa statistic
            2. Sensitivity and specificity
            3. Precision and recall
            4. The F-measure
          5. Visualizing performance tradeoffs
            1. ROC curves
        2. Estimating future performance
          1. The holdout method
          2. Cross-validation
          3. Bootstrap sampling
        3. Summary
      18. 11. Improving Model Performance
        1. Tuning stock models for better performance
          1. Using caret for automated parameter tuning
            1. Creating a simple tuned model
            2. Customizing the tuning process
        2. Improving model performance with meta-learning
          1. Understanding ensembles
          2. Bagging
          3. Boosting
          4. Random forests
            1. Training random forests
            2. Evaluating random forest performance
        3. Summary
      19. 12. Specialized Machine Learning Topics
        1. Working with specialized data
          1. Getting data from the Web with the RCurl package
          2. Reading and writing XML with the XML package
          3. Reading and writing JSON with the rjson package
          4. Reading and writing Microsoft Excel spreadsheets using xlsx
          5. Working with bioinformatics data
          6. Working with social network data and graph data
        2. Improving the performance of R
          1. Managing very large datasets
            1. Making data frames faster with data.table
            2. Creating disk-based data frames with ff
            3. Using massive matrices with bigmemory
          2. Learning faster with parallel computing
            1. Measuring execution time
            2. Working in parallel with foreach
            3. Using a multitasking operating system with multicore
            4. Networking multiple workstations with snow and snowfall
            5. Parallel cloud computing with MapReduce and Hadoop
          3. GPU computing
          4. Deploying optimized learning algorithms
            1. Building bigger regression models with biglm
            2. Growing bigger and faster random forests with bigrf
            3. Training and evaluating models in parallel with caret
        3. Summary
      20. Index