Chapter 31. The Bloom Filter

This chapter introduces the concept of a Bloom filter and uses it in a reduce-side join, which engages the Bloom filter in the map phase of a MapReduce job. So what is a Bloom filter? How it can be used in a MapReduce environment? How can it speed up a join operation between two big relations/tables? According to Wikipedia:

A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not....In other words, a query returns either “possibly in set” or “definitely not in set.” Elements can be added to the set, but not removed (though this can be addressed with a “counting” filter). The more elements that are added to the set, the larger the probability of false positives.

The Bloom filter data structure may return true for elements that are not actually members of the set (this is called a false-positive error), but it will never return false for elements that are in the set; for each element in the set, the Bloom filter must return true. There is a very nice tutorial on the Bloom filter by Bill Mill. If you’d like to explore this data structure in more depth, Jacob Honoroff has written a good introduction to the Bloom filter data structure.

Bloom Filter Properties

We can summarize Bloom filter properties as follows:

  • Given a big set S = {x1, x2, ..., xn}, the Bloom filter is a probabilistic, ...

Get Data Algorithms now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.