Programming MapReduce with Scalding

Book Description

A practical guide to designing, testing, and implementing complex MapReduce applications in Scala

  • Develop lightweight, high-performance, and testable MapReduce applications in a functional programming language

  • Learn how Scalding communicates with external data stores and performs machine learning operations

  • Includes illustrations, diagrams, practical examples, and tips for a deeper understanding of MapReduce application development

In Detail

Programming MapReduce with Scalding is a practical guide to setting up a development environment and implementing simple and complex MapReduce transformations in Scalding, using a test-driven development methodology and other best practices.

This book first introduces how the Cascading framework lets you reason about MapReduce applications at a higher level of abstraction, and then shows how Scalding, a Scala DSL for Cascading, enables you to develop elegant and testable applications. It then teaches you how to test Scalding jobs and how to define specifications and apply behavior-driven development (BDD) with Scalding. It also demonstrates how to monitor and maintain cluster stability and how to efficiently access SQL, NoSQL, and search platforms.

Programming MapReduce with Scalding provides hands-on guidance, starting from proof-of-concept applications and progressing to production-ready implementations.

Table of Contents

  1. Programming MapReduce with Scalding
    1. Table of Contents
    2. Programming MapReduce with Scalding
    3. Credits
    4. About the Author
    5. About the Reviewers
      1. Support files, eBooks, discount offers, and more
        1. Why subscribe?
        2. Free access for Packt account holders
    7. Preface
      1. What this book covers
      2. What you need for this book
      3. Who this book is for
      4. Conventions
      5. Reader feedback
      6. Customer support
        1. Downloading the example code
        2. Errata
        3. Piracy
        4. Questions
    8. 1. Introduction to MapReduce
      1. The Hadoop platform
      2. MapReduce
        1. A MapReduce example
      3. MapReduce abstractions
      4. Introducing Cascading
        1. What happens inside a pipe
        2. Pipe assemblies
        3. Cascading extensions
      5. Summary
    9. 2. Get Ready for Scalding
      1. Why Scala?
      2. Scala basics
      3. Scala build tools
      4. Hello World in Scala
      5. Development editors
      6. Installing Hadoop in five minutes
      7. Running our first Scalding job
      8. Submitting a Scalding job in Hadoop
      9. Summary
    10. 3. Scalding by Example
      1. Reading and writing files
        1. Best practices to read and write files
        2. TextLine parsing
        3. Executing in the local and Hadoop modes
      2. Understanding the core capabilities of Scalding
        1. Map-like operations
        2. Join operations
        3. Pipe operations
        4. Grouping/reducing functions
      3. Operations on groups
        1. Composite operations
      4. A simple example
      5. Typed API
      6. Summary
    11. 4. Intermediate Examples
      1. Logfile analysis
        1. Completing the implementation
      2. Exploring ad targeting
        1. Calculating daily points
        2. Calculating historic points
        3. Generating targeted ads
      3. Summary
    12. 5. Scalding Design Patterns
      1. The external operations pattern
      2. The dependency injection pattern
      3. The late bound dependency pattern
      4. Summary
    13. 6. Testing and TDD
      1. Introduction to testing
      2. MapReduce testing challenges
      3. Development lifecycle with testing strategy
      4. TDD for Scalding developers
        1. Implementing the TDD methodology
          1. Decomposing the algorithm
          2. Defining acceptance tests
          3. Implementing integration tests
          4. Implementing unit tests
          5. Implementing the MapReduce logic
          6. Defining and performing system tests
      5. Black box testing
      6. Summary
    14. 7. Running Scalding in Production
      1. Executing Scalding in a Hadoop cluster
      2. Scheduling execution
      3. Coordinating job execution
      4. Configuring using a property file
      5. Configuring using Hadoop parameters
      6. Monitoring Scalding jobs
      7. Using slim JAR files
      8. Scalding execution throttling
      9. Summary
    15. 8. Using External Data Stores
      1. Interacting with external systems
      2. SQL databases
      3. NoSQL databases
        1. Understanding HBase
        2. Reading from HBase
        3. Writing in HBase
        4. Using advanced HBase features
      4. Search platforms
        1. Elasticsearch
      5. Summary
    16. 9. Matrix Calculations and Machine Learning
      1. Text similarity using TF-IDF
      2. Setting a similarity using the Jaccard index
      3. K-Means using Mahout
      4. Other libraries
      5. Summary
    17. Index