Learning Apache Mahout

Book Description

Acquire practical skills in Big Data Analytics and explore data science with Apache Mahout

In Detail

In the past few years, the generation of data and our capability to store and process it have grown exponentially. There is a need for scalable analytics frameworks, and for people with the right skills to extract the information hidden in this Big Data. Apache Mahout is one of the first and most prominent Big Data machine learning platforms. It implements machine learning algorithms on top of distributed processing platforms such as Hadoop and Spark.

Starting with the basics of Mahout and machine learning, you will explore prominent algorithms and their implementation in Mahout. You will learn about Mahout's building blocks, addressing feature extraction, feature reduction, and the curse of dimensionality; delve into classification use cases with the random forest and naïve Bayes classifiers, as well as item-based and user-based recommendation; and then work with clustering in Mahout using the k-means algorithm and implement Mahout without MapReduce. Finish with a flourish by exploring end-to-end use cases on customer analytics and text analytics to gain real-life, practical know-how of analytics projects.

What You Will Learn

  • Configure Mahout on Linux systems and set up the development environment

  • Become familiar with the Mahout command line utilities and Java APIs

  • Understand the core concepts of machine learning and the classes that implement them

  • Integrate Apache Mahout with newer platforms such as Apache Spark

  • Solve classification, clustering, and recommendation problems with Mahout

  • Explore frequent pattern mining and topic modeling, two important application areas of machine learning

  • Understand feature extraction, reduction, and the curse of dimensionality

Downloading the example code for this book: you can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to have the code files sent to you.

Table of Contents

    1. Learning Apache Mahout
      1. Table of Contents
      2. Learning Apache Mahout
      3. Credits
      4. About the Author
      5. About the Reviewers
      6. www.PacktPub.com
        1. Support files, eBooks, discount offers, and more
          1. Why subscribe?
          2. Free access for Packt account holders
      7. Preface
        1. What this book covers
        2. What you need for this book
        3. Who this book is for
        4. Conventions
        5. Reader feedback
        6. Customer support
          1. Downloading the example code
          2. Errata
          3. Piracy
          4. Questions
      8. 1. Introduction to Mahout
        1. Why Mahout
          1. Simple techniques and more data is better
          2. Sampling is difficult
          3. Community and license
        2. When Mahout
          1. Data too large for single machine
          2. Data already on Hadoop
          3. Algorithms implemented in Mahout
        3. How Mahout
          1. Setting up the development environment
            1. Configuring Maven
            2. Configuring Mahout
            3. Configuring Eclipse with the Maven plugin and Mahout
          2. Mahout command line
            1. A clustering example
              1. Reuter's raw data file
            2. A classification example
          3. Mahout API – a Java program example
            1. The dataset
          4. Parallel versus in-memory execution mode
        4. Summary
      9. 2. Core Concepts in Machine Learning
        1. Supervised learning
          1. Determine the objective
          2. Decide the training data
          3. Create and clean the training set
          4. Feature extraction
          5. Train the models
            1. Bagging
            2. Boosting
          6. Validation
            1. Holdout-set validation
            2. K-fold cross validation
          7. Evaluation
            1. Bias-variance trade-off
            2. Function complexity and amount of training data
            3. Dimensionality of the input space
            4. Noise in data
        2. Unsupervised learning
          1. Cluster analysis
            1. Objective
            2. Feature representation
              1. Feature normalization
                1. Row normalization
                2. Column normalization
                  1. Rescaling
                  2. Standardization
              2. A notion of similarity and dissimilarity
                1. Euclidean distance measure
                2. Squared Euclidean distance measure
                3. Manhattan distance measure
                4. Tanimoto distance measure
            3. Algorithm for clustering
            4. A stopping criteria
          2. Frequent pattern mining
            1. Measures for identifying interesting rules
              1. Support
              2. Confidence
              3. Lift
              4. Conviction
            2. Things to consider
              1. Actionable rules
              2. What association to look for
        3. Recommender system
          1. Collaborative filtering
            1. Cold start
            2. Scalability
            3. Sparsity
          2. Content-based filtering
        4. Model efficacy
          1. Classification
            1. Confusion matrix
            2. ROC curve and AUC
              1. Features of ROC graphs
              2. Evaluating classifier using the ROC curve
                1. Area-based accuracy measure
                2. Euclidean distance comparison
          2. Regression
            1. Mean absolute error
            2. Root mean squared error
            3. R-square
            4. Adjusted R-square
          3. Recommendation system
            1. Score difference
            2. Precision and recall
          4. Clustering
            1. The internal evaluation
              1. The intra-cluster distance
              2. The inter-cluster distance
              3. The Davies–Bouldin index
              4. The Dunn index
            2. The external evaluation
              1. The Rand index
              2. F-measure
        5. Summary
      10. 3. Feature Engineering
        1. Feature engineering
          1. Feature construction
            1. Categorical features
              1. Merging categories
              2. Converting to binary variables
              3. Converting to continuous variables
            2. Continuous features
              1. Binning
              2. Binarization
              3. Feature standardization
                1. Rescaling
                2. Mean standardization
                3. Scaling to unit norm
              4. Feature transformation derived from the problem domain
                1. Ratios
                2. Frequency
                3. Aggregate transformations
                4. Normalization
              5. Mathematical transformations
          2. Feature extraction
          3. Feature selection
            1. Filter-based feature selection
            2. Wrapper-based feature selection
              1. Backward selection
              2. Forward selection
              3. Recursive feature elimination
            3. Embedded feature selection
          4. Dimensionality reduction
        2. Summary
      11. 4. Classification with Mahout
        1. Classification
          1. White box models
          2. Black box models
        2. Logistic regression
          1. Mahout logistic regression command line
            1. Getting the data
            2. Model building via command line
              1. Splitting the dataset
            3. Train the model command line option
              1. Interpreting the output
            4. Testing the model
          2. Prediction
        3. Adaptive regression model
        4. Code example with logistic regression
          1. Train the model
            1. The LogisticRegressionParameter and CsvRecordFactory classes
            2. A code example without the parameter class
          2. Testing the online regression model
          3. Getting predictions from OnlineLogisticRegression
          4. A CrossFoldLearner example
        5. Random forest
          1. Bagging
          2. Random subsets of features
          3. Out-of-bag error estimate
          4. Random forest using the command line
          5. Predictions from random forest
        6. Naïve Bayes classifier
          1. Numeric features with naïve Bayes
            1. Command line
        7. Summary
      12. 5. Frequent Pattern Mining and Topic Modeling
        1. Frequent pattern mining
          1. Building FP Tree
          2. Constructing the tree
          3. Identifying frequent patterns from FP Tree
        2. Importing the Mahout source code into Eclipse
        3. Frequent pattern mining with Mahout
          1. Extending the command line of Mahout
          2. Getting the data
            1. Data description
            2. Frequent pattern mining with Mahout API
              1. MapReduce execution
              2. Linear execution
                1. Formatting the results and computing metrics
          3. Topic modeling using LDA
            1. LDA using the Mahout command line
        4. Summary
      13. 6. Recommendation with Mahout
        1. Collaborative filtering
          1. Similarity measures
            1. Pearson correlation similarity
            2. Euclidean distance similarity
            3. Computing similarity without a preference value
              1. Tanimoto coefficient similarity
              2. Log-likelihood similarity
          2. Evaluating recommender
          3. User-based recommender system
            1. User neighborhood
              1. Fixed size neighborhood
              2. Threshold-based neighborhood
            2. The dataset
            3. Mahout code example
              1. Building the recommender
              2. Evaluating the recommender
          4. Item-based recommender system
            1. Mahout code example
              1. Building the recommender
              2. Evaluating the recommender
          5. Inferring preferences
        2. Summary
      14. 7. Clustering with Mahout
        1. k-means
          1. Deciding the number of clusters
          2. Deciding the initial centroid
            1. Random points
            2. Points from the dataset
            3. Partition by range
            4. Canopy centroids
          3. Advantages and disadvantages
        2. Canopy clustering
        3. Fuzzy k-means
          1. Deciding the fuzzy factor
        4. A Mahout command-line example
          1. Getting the data
          2. Preprocessing the data
          3. k-means
          4. Canopy clustering
          5. Fuzzy k-means
          6. Streaming k-means
        5. A Mahout Java example
          1. k-means
            1. Cluster evaluation
        6. Summary
      15. 8. New Paradigm in Mahout
        1. Moving beyond MapReduce
        2. Apache Spark
          1. Configuring Spark with Mahout
          2. Basics of Mahout Scala DSL
            1. Imports
        3. In-core types
          1. Vector
            1. Initializing a vector inline
            2. Accessing elements of a vector
            3. Setting values of an element
            4. Vector arithmetic
            5. Vector operations with a scalar
          2. Matrix
            1. Initializing the matrix
            2. Accessing elements of a matrix
            3. Setting the matrix column
            4. Copy by reference
        4. Spark Mahout basics
          1. Initializing the Spark context
          2. Optimizer actions
          3. Computational actions
          4. Caching in Spark's block manager
        5. Linear regression with Mahout Spark
        6. Summary
      16. 9. Case Study – Churn Analytics and Customer Segmentation
        1. Churn analytics
          1. Getting the data
          2. Data exploration
            1. Installing R
              1. Summary statistics
              2. Correlation
          3. Feature engineering
          4. Model training and validation
            1. Logistic regression
            2. Adaptive logistic regression
            3. Random forest
          5. Customer segmentation
          6. Preprocessing
            1. Feature extraction
              1. Day calls
              2. Evening calls
              3. International calls
              4. Preprocessing the files
            2. Creating the clusters using fuzzy k-means
            3. Clustering using k-means
            4. Evaluation
        2. Summary
      17. 10. Case Study – Text Analytics
        1. Text analytics
          1. Vector space model
            1. Preprocessing
              1. Tokenization
              2. Stop word removal
              3. Stemming
              4. Preprocessing example
            2. Document indexing
            3. TF-IDF weighting
            4. n-grams
            5. Normalization
        2. Clustering text
          1. The dataset
          2. Feature extraction
          3. The clustering job
        3. Categorizing text
          1. The dataset
          2. Feature extraction
          3. The classification job
        4. Summary
      18. Index