Using a bloom filter to remove unique items

A bloom filter is a probabilistic data structure that tests whether an item is a member of a set. Unlike a typical hash map, a bloom filter does not store the items themselves, only a bit array whose size is fixed up front, so its memory footprint does not grow as items are added. This advantage comes in handy when dealing with billions of items, such as DNA strands represented as strings: "GATA", "CTGCTA", and so on.
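To make the fixed-space idea concrete, here is a minimal sketch of a bloom filter that depends only on base. It is an illustration, not the recipe's code or any library's API: the names Bloom, emptyBloom, insertBloom, memberBloom, and the toy hashWith function are assumptions introduced here, and a practical filter would use stronger hash functions and a packed bit array.

```haskell
import Data.Bits (setBit, testBit)
import Data.Char (ord)
import Data.List (foldl')

-- Hypothetical filter type: the bit array is an Integer used as a bitset.
data Bloom = Bloom
  { bloomSize   :: Int      -- m: number of bits, fixed at creation
  , bloomHashes :: Int      -- k: bit positions probed per item
  , bloomBits   :: Integer  -- only bits 0 .. m-1 are ever set
  }

emptyBloom :: Int -> Int -> Bloom
emptyBloom m k = Bloom m k 0

-- Toy polynomial string hash (an assumption for illustration only).
hashWith :: Int -> String -> Int
hashWith mult = foldl' (\h c -> (h * mult + ord c) `mod` 1000000007) 5381

-- Derive k bit positions from two base hashes (double hashing).
positions :: Bloom -> String -> [Int]
positions (Bloom m k _) s = [ (h1 + i * h2) `mod` m | i <- [0 .. k - 1] ]
  where
    h1 = hashWith 31 s
    h2 = 1 + hashWith 131 s

-- Inserting an item sets each of its k bits.
insertBloom :: String -> Bloom -> Bloom
insertBloom s b = b { bloomBits = foldl' setBit (bloomBits b) (positions b s) }

-- An item is reported present only if all of its k bits are set.
memberBloom :: String -> Bloom -> Bool
memberBloom s b = all (testBit (bloomBits b)) (positions b s)
```

With this sketch, memberBloom "GATA" (insertBloom "GATA" (emptyBloom 1024 3)) evaluates to True, because inserting an item sets every bit that a later lookup of the same item probes. The filter never uses more than bloomSize bits, however many strands are inserted.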

In this recipe, we will use a bloom filter to try to remove unique DNA strands from a list. This is often desirable because a typical DNA sample may contain thousands of strands that appear only once. The major disadvantage of a bloom filter is that false positives are possible: the filter may claim that an element is in the set when it was never added. Though false negatives ...
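One way to approach the task described above can be sketched as follows, under stated assumptions. The first time a strand is seen it is recorded only in the filter's bits; a strand that the filter already reports as present is treated as a repeat and kept. The function removeUniques, the constants filterBits and probes, and the toy hashWith function are hypothetical names chosen for this sketch, which uses Data.Set from the containers package that ships with GHC. It is not presented as the recipe's own implementation.

```haskell
import Data.Bits (setBit, testBit)
import Data.Char (ord)
import Data.List (foldl')
import qualified Data.Set as Set

-- Hypothetical filter parameters: 2^16 bits, 3 probes per strand.
filterBits, probes :: Int
filterBits = 65536
probes = 3

-- Toy polynomial string hash (an assumption for illustration only).
hashWith :: Int -> String -> Int
hashWith mult = foldl' (\h c -> (h * mult + ord c) `mod` 1000000007) 5381

-- Bit positions probed for a strand, derived by double hashing.
positions :: String -> [Int]
positions s = [ (h1 + i * h2) `mod` filterBits | i <- [0 .. probes - 1] ]
  where
    h1 = hashWith 31 s
    h2 = 1 + hashWith 131 s

-- Keep only the strands that the filter reports as already seen.
removeUniques :: [String] -> [String]
removeUniques strands = Set.toList repeated
  where
    (_, repeated) = foldl' step (0 :: Integer, Set.empty) strands
    step (bits, seen) s
      | all (testBit bits) ps = (bits, Set.insert s seen)      -- probably a repeat
      | otherwise             = (foldl' setBit bits ps, seen)  -- first sighting
      where ps = positions s
```

For example, removeUniques ["GATA", "CTGCTA", "GATA", "TTAGG", "CTGCTA"] always contains "GATA" and "CTGCTA", since a repeated strand finds all of its own bits already set. Whether a strand that appears once, such as "TTAGG", slips through depends on whether it happens to collide with earlier strands, which is the false positive case described above. Only the repeated strands are stored in full, so memory stays small when most strands are unique.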

