Apache Mahout Cookbook

Book Description

Whether you’re a beginner or an advanced user of Apache Mahout, this cookbook will expand your skills through a host of recipes, illustrations, and real-world examples, taking your data mining to a new level of capability.

  • Learn how to set up a Mahout development environment

  • Start testing Mahout in a standalone Hadoop cluster

  • Learn to find stock market direction using logistic regression

  • Over 35 recipes with real-world examples to help both experienced and novice developers get the hang of Mahout's different features

In Detail

    The rise of the Internet and social networks has created a new demand for software that can analyze large datasets scaling up to 10 billion rows. Apache Hadoop was created to handle such heavy computational tasks. Mahout gained recognition for providing data mining classification algorithms that can be used with datasets of this kind.

    "Apache Mahout Cookbook" provides a fresh, scope-oriented approach to the Mahout world for both beginners and advanced users. The book gives an insight into how to write different data mining algorithms for use in the Hadoop environment and how to choose the one best suited to the task at hand.

    "Apache Mahout Cookbook" looks at the various Mahout algorithms available, and gives the reader a fresh, solution-centered approach to solving different data mining tasks. The recipes start easy but get progressively more complicated. A step-by-step approach guides the developer through the different tasks involved in mining a huge dataset. You will also learn how to code your own Mahout data mining algorithms and determine the best one for a particular task. In addition, a whole chapter is dedicated to loading data into Mahout from an external RDBMS. Considerable attention is also given to using your data mining algorithm inside your own code so that it can run in a Hadoop environment. Theoretical aspects of the algorithms are covered for information purposes, but every chapter is written to allow the developer to get into the code as quickly and smoothly as possible. This means that every recipe comes with code that can be reused through Maven, as well as the Maven Mahout source code.

    By the end of this book, you will be able to code your own procedures for various data mining tasks using different algorithms, and to evaluate and choose the best ones for your needs.

    Table of Contents

    1. Apache Mahout Cookbook
      1. Table of Contents
      2. Apache Mahout Cookbook
      3. Credits
      4. About the Author
      5. Acknowledgments
      6. About the Reviewers
      7. www.PacktPub.com
        1. Support files, eBooks, discount offers and more
          1. Why subscribe?
          2. Free Access for Packt account holders
      8. Preface
        1. What this book covers
        2. What you need for this book
        3. Who this book is for
        4. Conventions
        5. Reader feedback
        6. Customer support
          1. Downloading the example code
          2. Errata
          3. Piracy
          4. Questions
      9. 1. Mahout is Not So Difficult!
        1. Introduction
        2. Installing Java and Hadoop
          1. Getting ready
          2. How to do it...
        3. Setting up a Maven and NetBeans development environment
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
        4. Coding a basic recommender
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. See also
      10. 2. Using Sequence Files – When and Why?
        1. Introduction
        2. Creating sequence files from the command line
          1. Getting ready
          2. How to do it...
          3. How it works...
        3. Generating sequence files from code
          1. Getting ready
          2. How to do it...
          3. How it works...
        4. Reading sequence files from code
          1. Getting ready
          2. How to do it…
          3. How it works…
      11. 3. Integrating Mahout with an External Datasource
        1. Introduction
        2. Importing an external datasource into HDFS
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
        3. Exporting data from HDFS to RDBMS
          1. How to do it…
          2. How it works...
        4. Creating a Sqoop job to deal with RDBMS
          1. How to do it...
          2. How it works...
          3. There's more...
        5. Importing data using Sqoop API
          1. Getting ready
          2. How to do it…
          3. How it works...
      12. 4. Implementing the Naïve Bayes classifier in Mahout
        1. Introduction
        2. Using the Mahout text classifier to demonstrate the basic use case
          1. Getting ready
          2. How to do it…
          3. How it works...
          4. There's more
        3. Using the Naïve Bayes classifier from code
          1. Getting ready
          2. How to do it…
          3. How it works...
          4. There's more
        4. Using Complementary Naïve Bayes from the command line
          1. Getting ready
          2. How to do it…
          3. How it works…
          4. See also
        5. Coding the Complementary Naïve Bayes classifier
          1. Getting ready
          2. How to do it…
          3. How it works...
      13. 5. Stock Market Forecasting with Mahout
        1. Introduction
        2. Preparing data for logistic regression
          1. Getting ready
          2. How to do it…
          3. How it works…
        3. Predicting GOOG movements using logistic regression
          1. Getting ready
          2. How to do it…
          3. How it works…
            1. The confusion matrix
        4. Using adaptive logistic regression in Java code
          1. Getting ready
          2. How to do it…
          3. How it works…
        5. Using logistic regression on large-scale datasets
          1. Getting ready
          2. How to do it…
          3. How it works...
          4. See also
        6. Using Random Forest to forecast market movements
          1. Getting ready
          2. How to do it…
          3. How it works…
          4. See also
      14. 6. Canopy Clustering in Mahout
        1. Introduction
        2. Command-line-based Canopy clustering
          1. Getting ready
          2. How to do it…
          3. How it works...
        3. Command-line-based Canopy clustering with parameters
          1. Getting ready
          2. How to do it…
          3. How it works...
        4. Using Canopy clustering from the Java code
          1. Getting ready
          2. How to do it…
          3. How it works...
        5. Coding your own cluster distance evaluation
          1. Getting ready
          2. How to do it…
          3. How it works...
          4. See also
      15. 7. Spectral Clustering in Mahout
        1. Introduction
        2. Using EigenCuts from the command line
          1. Getting ready
          2. How to do it…
        3. Using EigenCuts from Java code
          1. Getting ready
          2. How to do it…
          3. How it works…
        4. Creating a similarity matrix from raw data
          1. Getting ready
          2. How to do it…
          3. How it works…
        5. Using spectral clustering with image segmentation
          1. Getting ready
          2. How to do it…
          3. How it works
      16. 8. K-means Clustering
        1. Introduction
        2. Using K-means clustering from Java code
          1. Getting started
          2. How to do it…
          3. How it works…
        3. Clustering traffic accidents using K-means
          1. Getting ready
          2. How to do it…
          3. How it works…
          4. See also
        4. K-means clustering using MapReduce
          1. Getting ready
          2. How to do it…
          3. How it works…
        5. Using K-means clustering from the command line
          1. Getting ready
          2. How to do it…
          3. How it works…
          4. See also
      17. 9. Soft Computing with Mahout
        1. Introduction
        2. Frequent Pattern Mining with Mahout
          1. Getting ready
          2. How to do it…
          3. How it works…
        3. Creating metrics for Frequent Pattern Mining
          1. Getting ready
          2. How to do it…
          3. How it works…
        4. Using Frequent Pattern Mining from Java code
          1. Getting ready
          2. How to do it…
        5. Using LDA for creating topics
          1. Getting ready
          2. How to do it…
          3. How it works...
      18. 10. Implementing the Genetic Algorithm in Mahout
        1. Introduction
        2. Setting up Mahout for using GA
          1. Getting ready
          2. How to do it…
        3. Using the genetic algorithm over graphs
          1. Getting ready
          2. How to do it…
          3. How it works...
        4. Using the genetic algorithm from Java code
          1. Getting ready
          2. How to do it…
          3. How it works...
          4. There's more...
      19. Index