Chapter 8. Data Processing

Now that we’ve talked through considerations around building data pipelines, we’ll wrap up with a discussion of processing and analyzing the data gathered via those pipelines. Considerations around collecting, storing, and managing data provide the foundation for any data architecture, but it’s the processing of that data that will allow you to derive value.

Just as with other components in a distributed data architecture, the challenge with processing is the large number of options available, many of which have different goals and target different use cases. As in Chapter 5, the goal of this chapter is to provide a list of criteria for categorizing processing systems, giving you a framework for evaluating them.

Ultimately, the selection of specific engines will depend on considerations such as your use cases, the experience and knowledge of your team, your target users, SLAs, and the components used elsewhere in your architecture. Our hope in this chapter is to provide an understanding of where different tools fit so that you can make more informed decisions when planning your projects.

Attributes of Processing Engines

The following are attributes that we’ll use throughout this chapter to distinguish various processing engines:

Directed acyclic graph (DAG) management

How does the engine process an execution plan? We’ll provide more detail on what this means momentarily.

Concurrency ...
