Apache Mahout Essentials

Book Description

Implement top-notch machine learning algorithms for classification, clustering, and recommendations with Apache Mahout

In Detail

Apache Mahout is a scalable machine learning library with algorithms for clustering, classification, and recommendations. It empowers users to analyze patterns in large, diverse, and complex datasets faster and at greater scale.

This book is an all-inclusive guide to analyzing large and complex datasets using Apache Mahout. It explains complicated but very effective machine learning algorithms in simple terms, using real-world practical examples.

Starting from the fundamental concepts of machine learning and Apache Mahout, this book guides you through Apache Mahout's implementations of machine learning techniques including classification, clustering, and recommendations. For each technique, this walkthrough covers real-world applications, a diverse range of popular algorithms and their implementations, code examples, evaluation strategies, and best practices. Finally, you will learn data visualization techniques for Apache Mahout to bring your data to life.

What You Will Learn

  • Get started with the fundamentals of Big Data, batch, and real-time data processing with an introduction to Mahout and its applications

  • Understand the key machine learning concepts behind algorithms in Apache Mahout

  • Apply machine learning algorithms provided by Apache Mahout in real-world practical scenarios

  • Implement and evaluate widely used clustering, classification, and recommendation algorithms using Apache Mahout

  • Discover tips and tricks to improve the accuracy and performance of your results

  • Set up Apache Mahout in a production environment with Apache Hadoop

  • Glance at the Spark DSL advancements in Apache Mahout 1.0

  • Provide dynamic and interactive data visualizations for Apache Mahout

  • Build a recommendation engine for real-time use cases, and apply user-based and item-based recommendation algorithms

    Table of Contents

    1. Apache Mahout Essentials
      1. Table of Contents
      2. Apache Mahout Essentials
      3. Credits
      4. About the Author
      5. About the Reviewers
        1. Support files, eBooks, discount offers, and more
          1. Why subscribe?
          2. Free access for Packt account holders
      7. Preface
        1. What this book covers
        2. What you need for this book
        3. Who this book is for
        4. Conventions
        5. Reader feedback
        6. Customer support
          1. Downloading the example code
          2. Downloading the color images of this book
          3. Errata
          4. Piracy
          5. Questions
      8. 1. Introducing Apache Mahout
        1. Machine learning in a nutshell
          1. Features
          2. Supervised learning versus unsupervised learning
        2. Machine learning applications
          1. Information retrieval
          2. Business
            1. Market segmentation (clustering)
            2. Stock market predictions (regression)
          3. Health care
            1. Using a mammogram for cancer tissue detection
        3. Machine learning libraries
          1. Open source or commercial
          2. Scalability
          3. Languages used
          4. Algorithm support
          5. Batch processing versus stream processing
        4. The story so far
        5. Apache Mahout
          1. Setting up Apache Mahout
        6. How Apache Mahout works?
          1. The high-level design
          2. The distribution
        7. From Hadoop MapReduce to Spark
          1. Problems with Hadoop MapReduce
          2. In-memory data processing with Spark and H2O
          3. Why is Mahout shifting from Hadoop MapReduce to Spark?
        8. When is it appropriate to use Apache Mahout?
        9. Summary
      9. 2. Clustering
        1. Unsupervised learning and clustering
        2. Applications of clustering
          1. Computer vision and image processing
        3. Types of clustering
          1. Hard clustering versus soft clustering
          2. Flat clustering versus hierarchical clustering
          3. Model-based clustering
        4. K-Means clustering
          1. Getting your hands dirty!
          2. Running K-Means using Java programming
            1. Data preparation
            2. Understanding important parameters
          3. Cluster visualization
        5. Distance measure
          1. Writing a custom distance measure
        6. K-Means clustering with MapReduce
          1. MapReduce in Apache Mahout
          2. The map function
          3. The reduce function
        7. Additional clustering algorithms
          1. Canopy clustering
          2. Fuzzy K-Means
          3. Streaming K-Means
            1. The streaming step
            2. The ball K-Means step
          4. Spectral clustering
          5. Dirichlet clustering
        8. Text clustering
          1. The vector space model and TF-IDF
          2. N-grams and collocations
          3. Preprocessing text with Lucene
          4. Text clustering with the K-Means algorithm
          5. Topic modeling
        9. Optimizing clustering performance
          1. Selecting the right features
          2. Selecting the right algorithms
          3. Selecting the right distance measure
          4. Evaluating clusters
          5. The initialization of centroids and the number of clusters
          6. Tuning up parameters
            1. The decision on infrastructure
        10. Summary
      10. 3. Regression and Classification
        1. Supervised learning
          1. Target variables and predictor variables
        2. Predictive analytics' techniques
          1. Regression-based prediction
          2. Model-based prediction
          3. Tree-based prediction
        3. Classification versus regression
        4. Linear regression with Apache Spark
          1. How does linear regression work?
          2. A real-world example
            1. The impact of smoking on mortality and different diseases
          3. Linear regression with one variable and multiple variables
          4. The integration of Apache Spark
            1. Setting up Apache Spark with Apache Mahout
          5. An example script
            1. Distributed row matrix
            2. An explanation of the code
          6. Mahout references
          7. The bias-variance trade-off
          8. How to avoid over-fitting and under-fitting
        5. Logistic regression with SGD
          1. Logistic functions
          2. Minimizing the cost function
          3. Multinomial logistic regression versus binary logistic regression
          4. A real-world example
          5. An example script
          6. Testing and evaluation
            1. The confusion matrix
            2. The area under the curve
        6. The Naïve Bayes algorithm
          1. The Bayes theorem
          2. Text classification
          3. Naïve assumption and its pros and cons in text classification
          4. Improvements that Apache Mahout has made to the Naïve Bayes classification
          5. A text classification coding example using the 20 newsgroups' example
            1. Understand the 20 newsgroups' dataset
          6. Text classification using Naïve Bayes – a MapReduce implementation with Hadoop
          7. Text classification using Naïve Bayes – the Spark implementation
          8. The Markov chain
        7. Hidden Markov Model
          1. A real-world example – developing a POS tagger using HMM supervised learning
          2. POS tagging
          3. HMM for POS tagging
          4. HMM implementation in Apache Mahout
          5. HMM supervised learning
            1. The important parameters
            2. Returns
          6. The Baum Welch algorithm
            1. A code example
            2. The important parameters
          7. The Viterbi evaluator
          8. The Apache Mahout references
        8. Summary
      11. 4. Recommendations
        1. Collaborative versus content-based filtering
          1. Content-based filtering
          2. Collaborative filtering
          3. Hybrid filtering
        2. User-based recommenders
          1. A real-world example – movie recommendations
          2. Data models
          3. The similarity measure
          4. The neighborhood
          5. Recommenders
          6. Evaluation techniques
            1. The IR-based method (precision/recall)
          7. Addressing the issues with inaccurate recommendation results
        3. Item-based recommenders
          1. Item-based recommenders with Spark
        4. Matrix factorization-based recommenders
          1. Alternative least squares
        5. Singular value decomposition
          1. Algorithm usage tips and tricks
        6. Summary
      12. 5. Apache Mahout in Production
        1. Introduction
        2. Apache Mahout with Hadoop
          1. YARN with MapReduce 2.0
            1. The resource manager
            2. The application manager
            3. A node manager
            4. The application master
            5. Containers
          2. Managing storage with HDFS
          3. The life cycle of a Hadoop application
        3. Setting up Hadoop
          1. Setting up Mahout in local mode
            1. Prerequisites
              1. Java installation
          2. Setting up Mahout in Hadoop distributed mode
            1. Prerequisites
              1. Creating a Hadoop user
              2. Passwordless SSH configuration
            2. The pseudo-distributed mode
              1. Configuration changes
              2. Formatting the DFS filesystem
              3. Starting the servers
            3. The fully-distributed mode
              1. Prerequisites
              2. Host file configuration
              3. Hadoop configuration changes
              4. Formatting the DFS filesystem
              5. Starting servers
        4. Monitoring Hadoop
          1. Commands/scripts
          2. Data nodes
          3. Node managers
          4. Web UIs
        5. Setting up Mahout with Hadoop's fully-distributed mode
        6. Troubleshooting Hadoop
        7. Optimization tips
        8. Summary
      13. 6. Visualization
        1. The significance of visualization in machine learning
        2. D3.js
        3. A visualization example for K-Means clustering
        4. Summary
      14. Index