When I first started learning about joins, it was an intimidating topic; LEFT, OUTER, SEMI, INNER, CROSS: the language of joins is expressive and expansive. Add on top of that the dimension of time that streaming brings to the table, and you’re left with what appears to be a challengingly complex topic. The good news is that joins really aren’t the frightening beast with nasty, pointy teeth that they may initially appear to be. As is the case with so many other complex topics, once you understand the central ideas and themes of joins, the broader landscape that’s built on top of these basics suddenly becomes so much more accessible. So please join me now as we explore the fascinating topic of, well, joins.
What does it mean to join two datasets? We understand intuitively that joins are just a specific type of grouping operation: by joining together data that share some property (i.e., a key), we collect some number of previously unrelated individual data elements into a group of related elements. And as we learned in Chapter 6, grouping operations always consume a stream and yield a table. Knowing these two things, it’s only a small leap to then arrive at the conclusion that forms the basis for this entire chapter: at heart, all joins are streaming joins.
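To make the "join as grouping" idea concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not how any particular system implements joins: the record shape `(side, key, value)`, the function names `join_as_grouping` and `inner_join_rows`, and the sample events are all hypothetical. The first function consumes a stream of records one at a time and yields a table keyed by the join key; the second reads inner-join rows out of that table.

```python
from collections import defaultdict


def join_as_grouping(stream):
    """Consume a stream of (side, key, value) records and return a table.

    The table maps each key to the left and right values seen so far,
    which is the grouped state that join results are read from.
    """
    table = defaultdict(lambda: {"left": [], "right": []})
    for side, key, value in stream:
        # Grouping step: file each record under its key and input side.
        table[key][side].append(value)
    return dict(table)


def inner_join_rows(table):
    """Derive inner-join rows (key, left_value, right_value) from the table."""
    rows = []
    for key, group in table.items():
        for left_value in group["left"]:
            for right_value in group["right"]:
                rows.append((key, left_value, right_value))
    return rows


# Hypothetical input: records from two sources, interleaved in arrival order.
events = [
    ("left", "a", "L1"),
    ("right", "a", "R1"),
    ("left", "b", "L2"),
    ("right", "a", "R2"),
]

table = join_as_grouping(events)
rows = inner_join_rows(table)
```

Nothing in the sketch cares whether `events` is a finite list or an unbounded iterator observed partway through; the grouping logic is the same either way, and only the moment at which we choose to read the table differs.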
What’s great about this fact is that it actually makes the topic of streaming joins that much more tractable. All the tools we’ve learned for reasoning ...