Practical Machine Learning

Book Description

Tackle the real-world complexities of modern machine learning with innovative, cutting-edge techniques.

About This Book

  • Fully coded working examples using a wide range of machine learning libraries and tools, including Python, R, Julia, and Spark

  • Comprehensive practical solutions taking you into the future of machine learning

  • Go a step further and integrate your machine learning projects with Hadoop

    Who This Book Is For

    This book has been created for data scientists who want to see machine learning in action and explore its real-world application. With guidance on everything from the fundamentals of machine learning and predictive analytics to the latest innovations set to lead the big data revolution into the future, this is an unmissable resource for anyone dedicated to tackling current big data challenges. Knowledge of programming (Python and R) and mathematics is advisable if you want to get started immediately.

    What You Will Learn

  • Implement a wide range of algorithms and techniques for tackling complex data

  • Get to grips with some of the most powerful languages in data science, including R, Python, and Julia

  • Harness the capabilities of Spark and Hadoop to manage and process data successfully

  • Apply the appropriate machine learning technique to address real-world problems

  • Get acquainted with deep learning and find out how neural networks are being used at the cutting edge of machine learning

  • Explore the future of machine learning and dive deeper into polyglot persistence, semantic data, and more

    In Detail

    Finding meaning in increasingly large and complex datasets is a growing demand of the modern world. Machine learning and predictive analytics have become some of the most important approaches for uncovering data gold mines. Machine learning uses complex algorithms to make improved predictions of outcomes based on historical patterns and the behaviour of data sets. It can deliver dynamic insights into trends, patterns, and relationships within data, which are immensely valuable to business growth and development.

    This book explores an extensive range of machine learning techniques, uncovering hidden tricks and tips for several types of data using practical, real-world examples. While machine learning can be highly theoretical, this book offers a refreshing hands-on approach without losing sight of the underlying principles. Inside, a full exploration of the various algorithms gives you high-quality guidance so you can begin to see just how effective machine learning is at tackling the contemporary challenges of big data.

    This is the only book you need to implement a whole suite of open source tools, frameworks, and languages in machine learning. We will cover the leading data science languages, Python and R, and the underrated but powerful Julia, as well as a range of big data platforms including Spark, Hadoop, and Mahout. Practical Machine Learning is an essential resource for modern data scientists who want to get to grips with the real-world application of machine learning.

    With this book, you will not only learn the fundamentals of machine learning but also dive deep into the complexities of real-world data before moving on to using Hadoop and its wider ecosystem of tools to process and manage your structured and unstructured data.

    You will explore different machine learning techniques for both supervised and unsupervised learning. From decision trees to Naïve Bayes classifiers and linear and clustering methods, you will learn strategies for a truly advanced approach to the statistical analysis of data. The book also explores cutting-edge advancements in machine learning, with worked examples and guidance on deep learning and reinforcement learning, providing practical demonstrations and samples that help take the theory and the mystery out of even the most advanced machine learning methodologies.

    Style and approach

    A practical data science tutorial designed to give you an insight into the practical application of machine learning, this book takes you through complex concepts and tasks in an accessible way. Featuring information on a wide range of data science techniques, Practical Machine Learning is a comprehensive data science resource.

    Downloading the example code for this book: you can download the example code files for all Packt books you have purchased from your Packt account. If you purchased this book elsewhere, you can register to receive the code files.

    Table of Contents

    1. Practical Machine Learning
      1. Table of Contents
      2. Practical Machine Learning
      3. Credits
      4. Foreword
      5. About the Author
      6. Acknowledgments
      7. About the Reviewers
        1. Support files, eBooks, discount offers, and more
          1. Why subscribe?
          2. Free access for Packt account holders
      9. Preface
        1. What this book covers
        2. What you need for this book
        3. Who this book is for
        4. Conventions
        5. Reader feedback
        6. Customer support
          1. Downloading the example code
          2. Downloading the color images of this book
          3. Errata
          4. Piracy
          5. Questions
      10. 1. Introduction to Machine learning
        1. Machine learning
          1. Definition
          2. Core Concepts and Terminology
          3. What is learning?
            1. Data
            2. Labeled and unlabeled data
            3. Tasks
            4. Algorithms
            5. Models
              1. Logical models
              2. Geometric models
              3. Probabilistic models
          4. Data and inconsistencies in Machine learning
            1. Under-fitting
            2. Over-fitting
            3. Data instability
            4. Unpredictable data formats
          5. Practical Machine learning examples
          6. Types of learning problems
            1. Classification
            2. Clustering
            3. Forecasting, prediction or regression
            4. Simulation
            5. Optimization
            6. Supervised learning
            7. Unsupervised learning
            8. Semi-supervised learning
            9. Reinforcement learning
            10. Deep learning
        2. Performance measures
          1. Is the solution good?
            1. Mean squared error (MSE)
            2. Mean absolute error (MAE)
            3. Normalized MSE and MAE (NMSE and NMAE)
            4. Solving the errors: bias and variance
        3. Some complementing fields of Machine learning
          1. Data mining
          2. Artificial intelligence (AI)
          3. Statistical learning
          4. Data science
        4. Machine learning process lifecycle and solution architecture
        5. Machine learning algorithms
          1. Decision tree based algorithms
          2. Bayesian method based algorithms
          3. Kernel method based algorithms
          4. Clustering methods
          5. Artificial neural networks (ANN)
          6. Dimensionality reduction
          7. Ensemble methods
          8. Instance based learning algorithms
          9. Regression analysis based algorithms
          10. Association rule based learning algorithms
        6. Machine learning tools and frameworks
        7. Summary
      11. 2. Machine learning and Large-scale datasets
        1. Big data and the context of large-scale Machine learning
          1. Functional versus Structural – A methodological mismatch
            1. Commoditizing information
            2. Theoretical limitations of RDBMS
            3. Scaling-up versus Scaling-out storage
            4. Distributed and parallel computing strategies
          2. Machine learning: Scalability and Performance
            1. Too many data points or instances
            2. Too many attributes or features
            3. Shrinking response time windows – need for real-time responses
            4. Highly complex algorithm
            5. Feed forward, iterative prediction cycles
          3. Model selection process
          4. Potential issues in large-scale Machine learning
        2. Algorithms and Concurrency
          1. Developing concurrent algorithms
        3. Technology and implementation options for scaling-up Machine learning
          1. MapReduce programming paradigm
          2. High Performance Computing (HPC) with Message Passing Interface (MPI)
          3. Language Integrated Queries (LINQ) framework
          4. Manipulating datasets with LINQ
          5. Graphics Processing Unit (GPU)
          6. Field Programmable Gate Array (FPGA)
          7. Multicore or multiprocessor systems
        4. Summary
      12. 3. An Introduction to Hadoop's Architecture and Ecosystem
        1. Introduction to Apache Hadoop
          1. Evolution of Hadoop (the platform of choice)
          2. Hadoop and its core elements
        2. Machine learning solution architecture for big data (employing Hadoop)
          1. The Data Source layer
          2. The Ingestion layer
          3. The Hadoop Storage layer
          4. The Hadoop (Physical) Infrastructure layer – supporting appliance
          5. Hadoop platform / Processing layer
          6. The Analytics layer
          7. The Consumption layer
            1. Explaining and exploring data with Visualizations
            2. Security and Monitoring layer
            3. Hadoop core components framework
              1. Hadoop Distributed File System (HDFS)
                1. Secondary Namenode and Checkpoint process
                2. Splitting large data files
                3. Block loading to the cluster and replication
            4. Writing to and reading from HDFS
            5. Handling failures
            6. HDFS command line
            7. RESTful HDFS
          8. MapReduce
            1. MapReduce architecture
            2. What makes MapReduce cater to the needs of large datasets?
            3. MapReduce execution flow and components
            4. Developing MapReduce components
              1. InputFormat
              2. OutputFormat
              3. Mapper implementation
        3. Hadoop 2.x
          1. Hadoop ecosystem components
          2. Hadoop installation and setup
            1. Installing JDK 1.7
            2. Creating a system user for Hadoop (dedicated)
            3. Disable IPv6
            4. Steps for installing Hadoop 2.6.0
            5. Starting Hadoop
          3. Hadoop distributions and vendors
        4. Summary
      13. 4. Machine Learning Tools, Libraries, and Frameworks
        1. Machine learning tools – A landscape
        2. Apache Mahout
          1. How does Mahout work?
          2. Installing and setting up Apache Mahout
            1. Setting up Maven
            2. Setting-up Apache Mahout using Eclipse IDE
            3. Setting up Apache Mahout without Eclipse
          3. Mahout Packages
          4. Implementing vectors in Mahout
        3. R
          1. Installing and setting up R
          2. Integrating R with Apache Hadoop
            1. Approach 1 – Using R and Streaming APIs in Hadoop
            2. Approach 2 – Using the Rhipe package of R
            3. Approach 3 – Using RHadoop
            4. Summary of R/Hadoop integration approaches
            5. Implementing in R (using examples)
              1. R Expressions
                1. Assignments
                2. Functions
              2. R Vectors
                1. Assigning, accessing, and manipulating vectors
              3. R Matrices
              4. R Factors
              5. R Data Frames
              6. R Statistical frameworks
        4. Julia
          1. Installing and setting up Julia
            1. Downloading and using the command line version of Julia
            2. Using Juno IDE for running Julia
            3. Using Julia via the browser
          2. Running the Julia code from the command line
          3. Implementing in Julia (with examples)
          4. Using variables and assignments
            1. Numeric primitives
            2. Data structures
            3. Working with Strings and String manipulations
            4. Packages
            5. Interoperability
              1. Integrating with C
              2. Integrating with Python
              3. Integrating with MATLAB
            6. Graphics and plotting
          5. Benefits of adopting Julia
          6. Integrating Julia and Hadoop
        5. Python
          1. Toolkit options in Python
          2. Implementation of Python (using examples)
            1. Installing Python and setting up scikit-learn
              1. Loading data
        6. Apache Spark
          1. Scala
          2. Programming with Resilient Distributed Datasets (RDD)
        7. Spring XD
        8. Summary
      14. 5. Decision Tree based learning
        1. Decision trees
          1. Terminology
          2. Purpose and uses
          3. Constructing a Decision tree
            1. Handling missing values
            2. Considerations for constructing Decision trees
              1. Choosing the appropriate attribute(s)
                1. Information gain and Entropy
                2. Gini index
                3. Gain ratio
              2. Termination Criteria / Pruning Decision trees
            3. Decision trees in a graphical representation
            4. Inducing Decision trees – Decision tree algorithms
              1. CART
              2. C4.5
            5. Greedy Decision trees
            6. Benefits of Decision trees
          4. Specialized trees
            1. Oblique trees
            2. Random forests
            3. Evolutionary trees
            4. Hellinger trees
        2. Implementing Decision trees
          1. Using Mahout
          2. Using R
          3. Using Spark
          4. Using Python (scikit-learn)
          5. Using Julia
        3. Summary
      15. 6. Instance and Kernel Methods Based Learning
        1. Instance-based learning (IBL)
          1. Nearest Neighbors
            1. Value of k in KNN
            2. Distance measures in KNN
              1. Euclidean distance
              2. Hamming distance
              3. Minkowski distance
            3. Case-based reasoning (CBR)
            4. Locally weighted regression (LWR)
          2. Implementing KNN
            1. Using Mahout
            2. Using R
            3. Using Spark
            4. Using Python (scikit-learn)
            5. Using Julia
        2. Kernel methods-based learning
          1. Kernel functions
          2. Support Vector Machines (SVM)
            1. Inseparable Data
          3. Implementing SVM
            1. Using Mahout
            2. Using R
            3. Using Spark
            4. Using Python (scikit-learn)
            5. Using Julia
        3. Summary
      16. 7. Association Rules based learning
        1. Association rules based learning
          1. Association rule – a definition
          2. Apriori algorithm
            1. Rule generation strategy
              1. Rules for defining appropriate minsup
              2. Apriori – the downside
          3. FP-growth algorithm
          4. Apriori versus FP-growth
        2. Implementing Apriori and FP-growth
          1. Using Mahout
          2. Using R
          3. Using Spark
          4. Using Python (scikit-learn)
          5. Using Julia
        3. Summary
      17. 8. Clustering based learning
        1. Clustering-based learning
        2. Types of clustering
          1. Hierarchical clustering
          2. Partitional clustering
        3. The k-means clustering algorithm
          1. Convergence or stopping criteria for the k-means clustering
            1. K-means clustering on disk
          2. Advantages of the k-means approach
          3. Disadvantages of the k-means algorithm
          4. Distance measures
          5. Complexity measures
        4. Implementing k-means clustering
          1. Using Mahout
          2. Using R
          3. Using Spark
          4. Using Python (scikit-learn)
          5. Using Julia
        5. Summary
      18. 9. Bayesian learning
        1. Bayesian learning
          1. Statistician's thinking
            1. Important terms and definitions
            2. Probability
              1. Types of events
                1. Mutually exclusive or disjoint events
                2. Independent events
                3. Dependent events
            3. Types of probability
            4. Distribution
            5. Bernoulli distribution
            6. Binomial distribution
              1. Poisson probability distribution
              2. Exponential distribution
              3. Normal distribution
              4. Relationship between the distributions
          2. Bayes' theorem
          3. Naïve Bayes classifier
            1. Multinomial Naïve Bayes classifier
            2. The Bernoulli Naïve Bayes classifier
        2. Implementing Naïve Bayes algorithm
          1. Using Mahout
          2. Using R
          3. Using Spark
          4. Using scikit-learn
          5. Using Julia
        3. Summary
      19. 10. Regression based learning
        1. Regression analysis
          1. Revisiting statistics
            1. Properties of expectation, variance, and covariance
              1. Properties of variance
              2. Properties of covariance
              3. Example
            2. ANOVA and F Statistics
          2. Confounding
          3. Effect modification
        2. Regression methods
          1. Simple regression or simple linear regression
          2. Multiple regression
          3. Polynomial (non-linear) regression
          4. Generalized Linear Models (GLM)
          5. Logistic regression (logit link)
            1. Odds ratio in logistic regression
              1. Model
          6. Poisson regression
        3. Implementing linear and logistic regression
          1. Using Mahout
          2. Using R
          3. Using Spark
          4. Using scikit-learn
          5. Using Julia
        4. Summary
      20. 11. Deep learning
        1. Background
          1. The human brain
          2. Neural networks
            1. Neuron
            2. Synapses
            3. Artificial neurons or perceptrons
              1. Linear neurons
              2. Rectified linear neurons / linear threshold neurons
              3. Binary threshold neurons
              4. Sigmoid neurons
              5. Stochastic binary neurons
            4. Neural Network size
              1. An example
            5. Neural network types
              1. Multilayer fully connected feedforward networks or Multilayer Perceptrons (MLP)
              2. Jordan networks
              3. Elman networks
              4. Radial Basis Function (RBF) networks
              5. Hopfield networks
              6. Dynamic Learning Vector Quantization (DLVQ) networks
              7. Gradient descent method
          3. Backpropagation algorithm
          4. Softmax regression technique
        2. Deep learning taxonomy
          1. Convolutional neural networks (CNN/ConvNets)
            1. Convolutional layer (CONV)
            2. Pooling layer (POOL)
            3. Fully connected layer (FC)
          2. Recurrent Neural Networks (RNNs)
          3. Restricted Boltzmann Machines (RBMs)
          4. Deep Boltzmann Machines (DBMs)
          5. Autoencoders
        3. Implementing ANNs and Deep learning methods
          1. Using Mahout
          2. Using R
          3. Using Spark
          4. Using Python (scikit-learn)
          5. Using Julia
        4. Summary
      21. 12. Reinforcement learning
        1. Reinforcement Learning (RL)
          1. The context of Reinforcement Learning
            1. Examples of Reinforcement Learning
            2. Evaluative Feedback
              1. n-Armed Bandit problem
              2. Action-value methods
              3. Reinforcement comparison methods
            3. The Reinforcement Learning problem – the world grid example
            4. Markov Decision Process (MDP)
            5. Basic RL model – agent-environment interface
            6. Delayed rewards
            7. The policy
          2. Reinforcement Learning – key features
        2. Reinforcement learning solution methods
          1. Dynamic Programming (DP)
            1. Generalized Policy Iteration (GPI)
          2. Monte Carlo methods
          3. Temporal difference (TD) learning
            1. Sarsa - on-Policy TD
          4. Q-Learning – off-Policy TD
          5. Actor-critic methods (on-policy)
          6. R Learning (Off-policy)
          7. Implementing Reinforcement Learning algorithms
            1. Using Mahout
            2. Using R
            3. Using Spark
            4. Using Python (scikit-learn)
            5. Using Julia
        3. Summary
      22. 13. Ensemble learning
        1. Ensemble learning methods
          1. The wisdom of the crowd
          2. Key use cases
            1. Recommendation systems
            2. Anomaly detection
            3. Transfer learning
            4. Stream mining or classification
          3. Ensemble methods
            1. Supervised ensemble methods
              1. Boosting
                1. AdaBoost
              2. Bagging
              3. Wagging
                1. Random forests
                2. Gradient boosting machines (GBM)
            2. Unsupervised ensemble methods
        2. Implementing ensemble methods
          1. Using Mahout
          2. Using R
          3. Using Spark
          4. Using Python (scikit-learn)
          5. Using Julia
        3. Summary
      23. 14. New generation data architectures for Machine learning
        1. Evolution of data architectures
        2. Emerging perspectives & drivers for new age data architectures
        3. Modern data architectures for Machine learning
          1. Semantic data architecture
            1. The business data lake
            2. Semantic Web technologies
              1. Ontology and data integration
            3. Vendors
          2. Multi-model database architecture / polyglot persistence
            1. Vendors
          3. Lambda Architecture (LA)
            1. Vendors
        4. Summary
      24. Index