Chapter 10Approximating Streaming Data with Sketching

Questions about a set of distinct elements often arise in analytics, streaming or otherwise. Common questions are things like “Have I seen this before?” (set membership), “How many different things have I seen?” (cardinality estimation), or “How often do I see this thing?” (frequency).

When the number of different elements is small (low cardinality), this can be computed directly even in a streaming system. Some of the storage mechanisms introduced in earlier chapters, such as Redis, even have specialized data structures to allow for maintaining sets and histograms with real-time updates.

However, when the cardinality of the set becomes large (that is, there are many distinct items) direct maintenance of these sets becomes problematic. These data structures require O(log n) update time and O(n) storage, which can be infeasible in a streaming setting and expensive (at the very least) in terms of hardware costs.

To combat these problems, a number of algorithms, collectively known as sketch algorithms, have been developed to approximate the answers to these questions. Sketch algorithms have three features that make them desirable. The first feature is constant time updates of the data, which allows them to be easily maintained in a streaming setting. Secondly, the storage space needed is usually independent of the amount of data. Finally, querying the data structure can be completed in at worst linear time. The downside is that ...

Get Real-Time Analytics: Techniques to Analyze and Visualize Streaming Data now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.