The Dataflow programming model

The Cloud Dataflow runner service executes data processing jobs that are created with the Dataflow SDK, using a programming model that simplifies large-scale data processing.

The programming model is divided into four major components:

  • Pipelines: A pipeline represents a single, repeatable job from start to finish
  • PCollections: A PCollection represents a set of data in your pipeline
  • Transforms: A transform performs processing on the elements of a PCollection
  • I/O Sources and Sinks: Sources and sinks provide data input and output APIs for pipeline I/O

Let's discuss them one by one in the following topics.
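Before going component by component, it may help to see how the four pieces fit together. The following is a minimal, self-contained Python sketch using toy classes — it is *not* the real Dataflow/Beam SDK API, only an illustration of the relationships: a pipeline reads from a source into a PCollection, transforms produce new PCollections, and results flow out to a sink.

```python
# Illustrative sketch only -- toy classes, NOT the actual Dataflow SDK.

class PCollection:
    """A (toy) immutable collection of elements flowing through a pipeline."""
    def __init__(self, elements):
        self.elements = list(elements)

    def apply(self, transform):
        """Apply a transform, producing a NEW PCollection (the original is untouched)."""
        return PCollection(transform(self.elements))


class Pipeline:
    """A (toy) single, repeatable job: source -> transforms -> sink."""
    def read(self, source):
        # I/O source: ingest external data as a PCollection
        return PCollection(source)

    def write(self, pcollection, sink):
        # I/O sink: emit the final elements somewhere external
        sink.extend(pcollection.elements)


def to_upper(elements):
    # A simple element-wise transform
    return [e.upper() for e in elements]


p = Pipeline()
output = []                                # stand-in for a real sink
records = p.read(["hello", "dataflow"])    # source -> PCollection
shouted = records.apply(to_upper)          # transform -> new PCollection
p.write(shouted, output)                   # PCollection -> sink
print(output)
```

In the real SDK the same shape holds: you construct a pipeline, apply transforms that turn one PCollection into another, and connect sources and sinks at the edges; the runner service then executes the whole graph.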
