Data Algorithms

Book Description

If you are ready to dive into the MapReduce framework for processing large datasets, this practical book takes you step by step through the algorithms and tools you need to build distributed MapReduce applications with Apache Hadoop or Apache Spark. Each chapter provides a recipe for solving a massive computational problem, such as building a recommendation system. You’ll learn how to implement the appropriate MapReduce solution with code that you can use in your projects.

Dr. Mahmoud Parsian covers basic design patterns, optimization techniques, and data mining and machine learning solutions for problems in bioinformatics, genomics, statistics, and social network analysis. This book also includes an overview of MapReduce, Hadoop, and Spark.
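For readers new to the model, the map → shuffle → reduce flow the book builds on can be sketched without any framework at all. The following plain-Python sketch (not from the book; the word-count problem and the helper names map_phase, shuffle, and reduce_phase are illustrative assumptions) shows the three phases every recipe elaborates on:

```python
from collections import defaultdict

def map_phase(records):
    # map(): emit a (key, value) pair -- here (word, 1) -- for each input record
    for record in records:
        for word in record.split():
            yield (word, 1)

def shuffle(pairs):
    # shuffle/sort: group all emitted values by key, as the framework does
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # reduce(): aggregate the grouped values for each key
    return {key: sum(values) for key, values in groups.items()}

records = ["a rose is a rose", "a daisy is not"]
counts = reduce_phase(shuffle(map_phase(records)))
# counts == {'a': 3, 'rose': 2, 'is': 2, 'daisy': 1, 'not': 1}
```

In Hadoop or Spark the same three roles are distributed across a cluster; the book's chapters supply the map() and reduce() logic for each problem.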

Table of Contents

  1. Foreword
  2. Preface
    1. What Is MapReduce?
      1. Simple Explanation of MapReduce
      2. When to Use MapReduce
      3. What MapReduce Isn’t
      4. Why Use MapReduce?
    2. Hadoop and Spark
    3. What Is in This Book?
    4. What Is the Focus of This Book?
    5. Who Is This Book For?
    6. Online Resources
    7. What Software Is Used in This Book?
    8. Conventions Used in This Book
    9. Using Code Examples
    10. Safari® Books Online
    11. How to Contact Us
    12. Acknowledgments
    13. Comments and Questions for This Book
  3. 1. Secondary Sort: Introduction
    1. Solutions to the Secondary Sort Problem
      1. Implementation Details
      2. Data Flow Using Plug-in Classes
    2. MapReduce/Hadoop Solution to Secondary Sort
      1. Input
      2. Expected Output
      3. map() Function
      4. reduce() Function
      5. Hadoop Implementation Classes
      6. Sample Run of Hadoop Implementation
      7. How to Sort in Ascending or Descending Order
    3. Spark Solution to Secondary Sort
      1. Time Series as Input
      2. Expected Output
      3. Option #1: Secondary Sorting in Memory
      4. Spark Sample Run
      5. Option #2: Secondary Sorting Using the Spark Framework
      6. Further Reading on Secondary Sorting
  4. 2. Secondary Sort: A Detailed Example
    1. Secondary Sorting Technique
    2. Complete Example of Secondary Sorting
      1. Input Format
      2. Output Format
      3. Composite Key
    3. Sample Run—Old Hadoop API
      1. Input
      2. Running the MapReduce Job
      3. Output
    4. Sample Run—New Hadoop API
      1. Input
      2. Running the MapReduce Job
      3. Output
  5. 3. Top 10 List
    1. Top N, Formalized
    2. MapReduce/Hadoop Implementation: Unique Keys
      1. Implementation Classes in MapReduce/Hadoop
      2. Top 10 Sample Run
      3. Finding the Top 5
      4. Finding the Bottom 10
    3. Spark Implementation: Unique Keys
      1. RDD Refresher
      2. Spark’s Function Classes
      3. Review of the Top N Pattern for Spark
      4. Complete Spark Top 10 Solution
      5. Sample Run: Finding the Top 10
      6. Parameterizing Top N
      7. Finding the Bottom N
    4. Spark Implementation: Nonunique Keys
      1. Complete Spark Top 10 Solution
      2. Sample Run
    5. Spark Top 10 Solution Using takeOrdered()
      1. Complete Spark Implementation
      2. Finding the Bottom N
      3. Alternative to Using takeOrdered()
    6. MapReduce/Hadoop Top 10 Solution: Nonunique Keys
      1. Sample Run
  6. 4. Left Outer Join
    1. Left Outer Join Example
      1. Example Queries
    2. Implementation of Left Outer Join in MapReduce
      1. MapReduce Phase 1: Finding Product Locations
      2. MapReduce Phase 2: Counting Unique Locations
      3. Implementation Classes in Hadoop
      4. Sample Run
    3. Spark Implementation of Left Outer Join
      1. Spark Program
      2. Running the Spark Solution
      3. Running Spark on YARN
    4. Spark Implementation with leftOuterJoin()
      1. Spark Program
      2. Sample Run on YARN
  7. 5. Order Inversion
    1. Example of the Order Inversion Pattern
    2. MapReduce/Hadoop Implementation of the Order Inversion Pattern
      1. Custom Partitioner
      2. Relative Frequency Mapper
      3. Relative Frequency Reducer
      4. Implementation Classes in Hadoop
    3. Sample Run
      1. Input
      2. Running the MapReduce Job
      3. Generated Output
  8. 6. Moving Average
    1. Example 1: Time Series Data (Stock Prices)
    2. Example 2: Time Series Data (URL Visits)
    3. Formal Definition
    4. POJO Moving Average Solutions
      1. Solution 1: Using a Queue
      2. Solution 2: Using an Array
      3. Testing the Moving Average
      4. Sample Run
    5. MapReduce/Hadoop Moving Average Solution
      1. Input
      2. Output
      3. Option #1: Sorting in Memory
      4. Sample Run
      5. Option #2: Sorting Using the MapReduce Framework
      6. Sample Run
  9. 7. Market Basket Analysis
    1. MBA Goals
    2. Application Areas for MBA
    3. Market Basket Analysis Using MapReduce
      1. Input
      2. Expected Output for Tuple2 (Order of 2)
      3. Expected Output for Tuple3 (Order of 3)
      4. Informal Mapper
      5. Formal Mapper
      6. Reducer
      7. MapReduce/Hadoop Implementation Classes
      8. Sample Run
    4. Spark Solution
      1. MapReduce Algorithm Workflow
      2. Input
      3. Spark Implementation
      4. YARN Script for Spark
      5. Creating Item Sets from Transactions
  10. 8. Common Friends
    1. Input
    2. POJO Common Friends Solution
    3. MapReduce Algorithm
      1. The MapReduce Algorithm in Action
    4. Solution 1: Hadoop Implementation Using Text
      1. Sample Run for Solution 1
    5. Solution 2: Hadoop Implementation Using ArrayListOfLongsWritable
      1. Sample Run for Solution 2
    6. Spark Solution
      1. Spark Program
      2. Sample Run of Spark Program
  11. 9. Recommendation Engines Using MapReduce
    1. Customers Who Bought This Item Also Bought
      1. Input
      2. Expected Output
      3. MapReduce Solution
    2. Frequently Bought Together
      1. Input and Expected Output
      2. MapReduce Solution
    3. Recommend Connection
      1. Input
      2. Output
      3. MapReduce Solution
      4. Spark Implementation
      5. Sample Run of Spark Program
  12. 10. Content-Based Recommendation: Movies
    1. Input
    2. MapReduce Phase 1
    3. MapReduce Phases 2 and 3
      1. MapReduce Phase 2: Mapper
      2. MapReduce Phase 2: Reducer
      3. MapReduce Phase 3: Mapper
      4. MapReduce Phase 3: Reducer
      5. Similarity Measures
    4. Movie Recommendation Implementation in Spark
      1. High-Level Solution in Spark
      2. Sample Run of Spark Program
  13. 11. Smarter Email Marketing with the Markov Model
    1. Markov Chains in a Nutshell
    2. Markov Model Using MapReduce
      1. Generating Time-Ordered Transactions with MapReduce
      2. Hadoop Solution 1: Time-Ordered Transactions
      3. Hadoop Solution 2: Time-Ordered Transactions
      4. Generating State Sequences
      5. Generating a Markov State Transition Matrix with MapReduce
      6. Using the Markov Model to Predict the Next Smart Email Marketing Date
    3. Spark Solution
      1. Input Format
      2. High-Level Steps
      3. Spark Program
      4. Script to Run the Spark Program
      5. Sample Run
  14. 12. K-Means Clustering
    1. What Is K-Means Clustering?
    2. Application Areas for Clustering
    3. Informal K-Means Clustering Method: Partitioning Approach
    4. K-Means Distance Function
    5. K-Means Clustering Formalized
    6. MapReduce Solution for K-Means Clustering
      1. MapReduce Solution: map()
      2. MapReduce Solution: combine()
      3. MapReduce Solution: reduce()
    7. K-Means Implementation by Spark
      1. Sample Run of Spark K-Means Implementation
  15. 13. k-Nearest Neighbors
    1. kNN Classification
    2. Distance Functions
    3. kNN Example
    4. An Informal kNN Algorithm
    5. Formal kNN Algorithm
    6. Java-like Non-MapReduce Solution for kNN
    7. kNN Implementation in Spark
      1. Formalizing kNN for the Spark Implementation
      2. Input Data Set Formats
      3. Spark Implementation
      4. YARN Shell Script
  16. 14. Naive Bayes
    1. Training and Learning Examples
      1. Numeric Training Data
      2. Symbolic Training Data
    2. Conditional Probability
    3. The Naive Bayes Classifier in Depth
      1. Naive Bayes Classifier Example
    4. The Naive Bayes Classifier: MapReduce Solution for Symbolic Data
      1. Stage 1: Building a Classifier Using Symbolic Training Data
      2. Stage 2: Using the Classifier to Classify New Symbolic Data
    5. The Naive Bayes Classifier: MapReduce Solution for Numeric Data
    6. Naive Bayes Classifier Implementation in Spark
      1. Stage 1: Building a Classifier Using Training Data
      2. Stage 2: Using the Classifier to Classify New Data
    7. Using Spark and Mahout
      1. Apache Spark
      2. Apache Mahout
  17. 15. Sentiment Analysis
    1. Sentiment Examples
    2. Sentiment Scores: Positive or Negative
    3. A Simple MapReduce Sentiment Analysis Example
      1. map() Function for Sentiment Analysis
      2. reduce() Function for Sentiment Analysis
    4. Sentiment Analysis in the Real World
  18. 16. Finding, Counting, and Listing All Triangles in Large Graphs
    1. Basic Graph Concepts
    2. Importance of Counting Triangles
    3. MapReduce/Hadoop Solution
      1. Step 1: MapReduce in Action
      2. Step 2: Identify Triangles
      3. Step 3: Remove Duplicate Triangles
      4. Hadoop Implementation Classes
      5. Sample Run
    4. Spark Solution
      1. High-Level Steps
      2. Sample Run
  19. 17. K-mer Counting
    1. Input Data for K-mer Counting
      1. Sample Data for K-mer Counting
    2. Applications of K-mer Counting
    3. K-mer Counting Solution in MapReduce/Hadoop
      1. The map() Function
      2. The reduce() Function
      3. Hadoop Implementation Classes
    4. K-mer Counting Solution in Spark
      1. Spark Solution
      2. Sample Run
  20. 18. DNA Sequencing
    1. Input Data for DNA Sequencing
    2. Input Data Validation
    3. DNA Sequence Alignment
    4. MapReduce Algorithms for DNA Sequencing
      1. Step 1: Alignment
      2. Step 2: Recalibration
      3. Step 3: Variant Detection
  21. 19. Cox Regression
    1. The Cox Model in a Nutshell
      1. Cox Regression Basic Terminology
    2. Cox Regression Using R
      1. Expression Data
    3. Cox Regression Application
    4. Cox Regression POJO Solution
    5. Input for MapReduce
      1. Input Format
    6. Cox Regression Using MapReduce
      1. Cox Regression Phase 1: map()
      2. Cox Regression Phase 1: reduce()
      3. Cox Regression Phase 2: map()
      4. Sample Output Generated by Phase 1 reduce() Function
      5. Sample Output Generated by the Phase 2 map() Function
      6. Cox Regression Script for MapReduce
  22. 20. Cochran-Armitage Test for Trend
    1. Cochran-Armitage Algorithm
    2. Application of Cochran-Armitage
    3. MapReduce Solution
      1. Input
      2. Expected Output
      3. Mapper
      4. Reducer
      5. MapReduce/Hadoop Implementation Classes
      6. Sample Run
  23. 21. Allelic Frequency
    1. Basic Definitions
      1. Chromosome
      2. Bioset
      3. Allele and Allelic Frequency
      4. Source of Data for Allelic Frequency
      5. Allelic Frequency Analysis Using Fisher’s Exact Test
      6. Fisher’s Exact Test
    2. Formal Problem Statement
    3. MapReduce Solution for Allelic Frequency
    4. MapReduce Solution, Phase 1
      1. Input
      2. Output/Result
      3. Phase 1 Mapper
      4. Phase 1 Reducer
      5. Sample Run of Phase 1 MapReduce/Hadoop Implementation
      6. Sample Plot of P-Values
    5. MapReduce Solution, Phase 2
      1. Phase 2 Mapper for Bottom 100 P-Values
      2. Phase 2 Reducer for Bottom 100 P-Values
      3. Is Our Bottom 100 List a Monoid?
      4. Hadoop Implementation Classes for Bottom 100 List
    6. MapReduce Solution, Phase 3
      1. Phase 3 Mapper for Bottom 100 P-Values
      2. Phase 3 Reducer for Bottom 100 P-Values
      3. Hadoop Implementation Classes for Bottom 100 List for Each Chromosome
    7. Special Handling of Chromosomes X and Y
  24. 22. The T-Test
    1. Performing the T-Test on Biosets
    2. MapReduce Problem Statement
    3. Input
    4. Expected Output
    5. MapReduce Solution
      1. Hadoop Implementation Classes
    6. Spark Implementation
      1. High-Level Steps
      2. T-Test Algorithm
      3. Sample Run
  25. 23. Pearson Correlation
    1. Pearson Correlation Formula
    2. Pearson Correlation Example
    3. Data Set for Pearson Correlation
    4. POJO Solution for Pearson Correlation
    5. POJO Solution Test Drive
    6. MapReduce Solution for Pearson Correlation
      1. map() Function for Pearson Correlation
      2. reduce() Function for Pearson Correlation
    7. Hadoop Implementation Classes
    8. Spark Solution for Pearson Correlation
      1. Input
      2. Output
      3. Spark Solution
      4. High-Level Steps
      5. Step 1: Import required classes and interfaces
      6. smaller() method
      7. MutableDouble class
      8. toMap() method
      9. toListOfString() method
      10. readBiosets() method
      11. Step 2: Handle input parameters
      12. Step 3: Create a Spark context object
      13. Step 4: Create list of input files/biomarkers
      14. Step 5: Broadcast reference as global shared object
      15. Step 6: Read all biomarkers from HDFS and create the first RDD
      16. Step 7: Filter biomarkers by reference
      17. Step 8: Create (Gene-ID, (Patient-ID, Gene-Value)) pairs
      18. Step 9: Group by gene
      19. Step 10: Create Cartesian product of all genes
      20. Step 11: Filter redundant pairs of genes
      21. Step 12: Calculate Pearson correlation and p-value
      22. Pearson Correlation Wrapper Class
      23. Testing the Pearson Class
      24. Pearson Correlation Using R
      25. YARN Script to Run Spark Program
    9. Spearman Correlation Using Spark
      1. Spearman Correlation Wrapper Class
      2. Testing the Spearman Correlation Wrapper Class
  26. 24. DNA Base Count
    1. FASTA Format
      1. FASTA Format Example
    2. FASTQ Format
      1. FASTQ Format Example
    3. MapReduce Solution: FASTA Format
      1. Reading FASTA Files
      2. MapReduce FASTA Solution: map()
      3. MapReduce FASTA Solution: reduce()
    4. Sample Run
      1. Log of sample run
      2. Generated output
      3. Custom Sorting
      4. Custom Partitioning
    5. MapReduce Solution: FASTQ Format
      1. Reading FASTQ Files
      2. MapReduce FASTQ Solution: map()
      3. MapReduce FASTQ Solution: reduce()
      4. Hadoop Implementation Classes: FASTQ Format
      5. Sample Run
    6. Spark Solution: FASTA Format
      1. High-Level Steps
      2. Sample Run
    7. Spark Solution: FASTQ Format
      1. High-Level Steps
      2. Step 1: Import required classes and interfaces
      3. Step 2: Handle input parameters
      4. Step 3: Create a JavaPairRDD from FASTQ input
      5. Step 4: Map partitions
      6. Step 5: Collect all DNA base counts
      7. Step 6: Emit final counts
      8. Sample Run
  27. 25. RNA Sequencing
    1. Data Size and Format
    2. MapReduce Workflow
      1. Input Data Validation
    3. RNA Sequencing Analysis Overview
    4. MapReduce Algorithms for RNA Sequencing
      1. Step 1: MapReduce TopHat Mapping
      2. Step 2: MapReduce Calling Cuffdiff
  28. 26. Gene Aggregation
    1. Input
    2. Output
    3. MapReduce Solutions (Filter by Individual and by Average)
      1. Mapper: Filter by Individual
      2. Reducer: Filter by Individual
      3. Mapper: Filter by Average
      4. Reducer: Filter by Average
      5. Computing Gene Aggregation
      6. Hadoop Implementation Classes
      7. Analysis of Output
    4. Gene Aggregation in Spark
    5. Spark Solution: Filter by Individual
      1. Sharing Data Between Cluster Nodes
      2. High-Level Steps
      3. Utility Functions
      4. Sample Run
    6. Spark Solution: Filter by Average
      1. High-Level Steps
      2. Utility Functions
      3. Sample Run
  29. 27. Linear Regression
    1. Basic Definitions
    2. Simple Example
    3. Problem Statement
    4. Input Data
    5. Expected Output
    6. MapReduce Solution Using SimpleRegression
    7. Hadoop Implementation Classes
    8. MapReduce Solution Using R’s Linear Model
      1. Phase 1
      2. Phase 2
      3. Hadoop Implementation Classes
  30. 28. MapReduce and Monoids
    1. Introduction
    2. Definition of Monoid
      1. How to Form a Monoid
    3. Monoidic and Non-Monoidic Examples
      1. Maximum over a Set of Integers
      2. Subtraction over a Set of Integers
      3. Addition over a Set of Integers
      4. Multiplication over a Set of Integers
      5. Mean over a Set of Integers
      6. Non-Commutative Example
      7. Median over a Set of Integers
      8. Concatenation over Lists
      9. Union/Intersection over Integers
      10. Functional Example
      11. Matrix Example
    4. MapReduce Example: Not a Monoid
    5. MapReduce Example: Monoid
      1. Hadoop Implementation Classes
      2. Sample Run
      3. View Hadoop Output
    6. Spark Example Using Monoids
      1. High-Level Steps
      2. Sample Run
    7. Conclusion on Using Monoids
    8. Functors and Monoids
  31. 29. The Small Files Problem
    1. Solution 1: Merging Small Files Client-Side
      1. Input Data
      2. Solution with SmallFilesConsolidator
      3. Solution Without SmallFilesConsolidator
    2. Solution 2: Solving the Small Files Problem with CombineFileInputFormat
      1. Custom CombineFileInputFormat
      2. Sample Run Using CustomCFIF
    3. Alternative Solutions
  32. 30. Huge Cache for MapReduce
    1. Implementation Options
    2. Formalizing the Cache Problem
    3. An Elegant, Scalable Solution
    4. Implementing the LRUMap Cache
      1. Extending the LRUMap Class
      2. Testing the Custom Class
      3. The MapDBEntry Class
      4. Using MapDB
      5. Testing MapDB: put()
      6. Testing MapDB: get()
    5. MapReduce Using the LRUMap Cache
      1. CacheManager Definition
      2. Initializing the Cache
      3. Using the Cache
      4. Closing the Cache
  33. 31. The Bloom Filter
    1. Bloom Filter Properties
    2. A Simple Bloom Filter Example
    3. Bloom Filters in Guava Library
    4. Using Bloom Filters in MapReduce
  34. A. Bioset
  35. B. Spark RDDs
    1. Spark Operations
    2. Tuple<N>
    3. RDDs
      1. How to Create RDDs
      2. Creating RDDs Using Collection Objects
      3. Collecting Elements of an RDD
      4. Transforming an Existing RDD into a New RDD
      5. Creating RDDs by Reading Files
      6. Grouping by Key
      7. Mapping Values
      8. Reducing by Key
      9. Combining by Key
      10. Filtering an RDD
      11. Saving an RDD as an HDFS Text File
      12. Saving an RDD as an HDFS Sequence File
      13. Reading an RDD from an HDFS Sequence File
      14. Counting RDD Items
      15. Spark RDD Examples in Scala
      16. PySpark Examples
      17. How to Package and Run Spark Jobs
      18. Creating the JAR for Data Algorithms
      19. Running a Job in a Spark Cluster
      20. Running a Job in Hadoop’s YARN Environment
  36. Bibliography
  37. Index