You are previewing MapReduce Design Patterns.

MapReduce Design Patterns

Cover of MapReduce Design Patterns by Donald Miner... Published by O'Reilly Media, Inc.
  1. MapReduce Design Patterns
  2. Dedication
  3. Preface
    1. Intended Audience
    2. Pattern Format
    3. The Examples in This Book
    4. Conventions Used in This Book
    5. Using Code Examples
    6. Safari® Books Online
    7. How to Contact Us
    8. Acknowldgements
  4. 1. Design Patterns and MapReduce
    1. Design Patterns
    2. MapReduce History
    3. MapReduce and Hadoop Refresher
    4. Hadoop Example: Word Count
    5. Pig and Hive
  5. 2. Summarization Patterns
    1. Numerical Summarizations
      1. Pattern Description
      2. Numerical Summarization Examples
    2. Inverted Index Summarizations
      1. Pattern Description
      2. Inverted Index Example
    3. Counting with Counters
      1. Pattern Description
      2. Counting with Counters Example
  6. 3. Filtering Patterns
    1. Filtering
      1. Pattern Description
      2. Filtering Examples
    2. Bloom Filtering
      1. Pattern Description
      2. Bloom Filtering Examples
    3. Top Ten
      1. Pattern Description
      2. Top Ten Examples
    4. Distinct
      1. Pattern Description
      2. Distinct Examples
  7. 4. Data Organization Patterns
    1. Structured to Hierarchical
      1. Pattern Description
      2. Structured to Hierarchical Examples
    2. Partitioning
      1. Pattern Description
      2. Partitioning Examples
    3. Binning
      1. Pattern Description
      2. Binning Examples
    4. Total Order Sorting
      1. Pattern Description
      2. Total Order Sorting Examples
    5. Shuffling
      1. Pattern Description
      2. Shuffle Examples
  8. 5. Join Patterns
    1. A Refresher on Joins
    2. Reduce Side Join
      1. Pattern Description
      2. Reduce Side Join Example
      3. Reduce Side Join with Bloom Filter
    3. Replicated Join
      1. Pattern Description
      2. Replicated Join Examples
    4. Composite Join
      1. Pattern Description
      2. Composite Join Examples
    5. Cartesian Product
      1. Pattern Description
      2. Cartesian Product Examples
  9. 6. Metapatterns
    1. Job Chaining
      1. With the Driver
      2. Job Chaining Examples
      3. With Shell Scripting
      4. With JobControl
    2. Chain Folding
      1. The ChainMapper and ChainReducer Approach
      2. Chain Folding Example
    3. Job Merging
      1. Job Merging Examples
  10. 7. Input and Output Patterns
    1. Customizing Input and Output in Hadoop
      1. InputFormat
      2. RecordReader
      3. OutputFormat
      4. RecordWriter
    2. Generating Data
      1. Pattern Description
      2. Generating Data Examples
    3. External Source Output
      1. Pattern Description
      2. External Source Output Example
    4. External Source Input
      1. Pattern Description
      2. External Source Input Example
    5. Partition Pruning
      1. Pattern Description
      2. Partition Pruning Examples
  11. 8. Final Thoughts and the Future of Design Patterns
    1. Trends in the Nature of Data
      1. Images, Audio, and Video
      2. Streaming Data
    2. The Effects of YARN
    3. Patterns as a Library or Component
    4. How You Can Help
  12. A. Bloom Filters
    1. Overview
    2. Use Cases
      1. Representing a Data Set
      2. Reduce Queries to External Database
      3. Google BigTable
    3. Downsides
    4. Tweaking Your Bloom Filter
  13. Index
  14. About the Authors
  15. Colophon
  16. Copyright
O'Reilly logo

Appendix A. Bloom Filters

Overview

Conceived by Burton Howard Bloom in 1970, a Bloom filter is a probabilistic data structure used to test whether a member is an element of a set. Bloom filters have a strong space advantage over other data structures such as a Java Set, in that each element uses the same amount of space, no matter its actual size. For example, a string of 32 characters takes up the same amount of memory in a Bloom filter as a string of 1024 characters, which is drastically different than other data structures. Bloom filters are introduced as part of a pattern in Bloom Filtering.

While the data structure itself has vast memory advantages, it is not always 100% accurate. While false positives are possible, false negatives are not. This means the result of each test is either a definitive “no” or “maybe.” You will never get a definitive “yes.” With a traditional Bloom filter, elements can be added to the set, but not removed. There are a number of Bloom filter implementations that address this limitation, such as a Counting Bloom Filter, but they typically require more memory. As more elements are added to the set, the probability of false positives increases. Bloom filters cannot be resized like other data structures. Once they have been sized and trained, they cannot be reverse-engineered to achieve the original set nor resized and still maintain the same data set representation.

The following variables are used in the more detailed explanation of a Bloom filter below: ...

The best content for your career. Discover unlimited learning on demand for around $1/day.