Tiering flows

In Chapter 1, Overview and Architecture, we talked about tiering your data flows. There are several reasons for you to want to do this. You may want to limit the number of Flume agents that directly connect to your Hadoop cluster, to limit the number of parallel requests. You may also lack sufficient disk space on your application servers to store a significant amount of data while you are performing maintenance on your Hadoop cluster. Whatever your reason or use case, the most common mechanism to chain Flume agents is to use the Avro source/sink pair.

The Avro source/sink

We covered Avro a bit in Chapter 4, Sinks and Sink Processors, when we discussed how to use it as an on-disk serialization format for files stored in HDFS. Here, ...

Get Apache Flume: Distributed Log Collection for Hadoop - Second Edition now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.

Start your free trial

Apache Flume: Distributed Log Collection for Hadoop - Second Edition by Steve Hoffman

Tiering flows

The Avro source/sink

Don’t leave empty-handed

It’s yours, free.

Check it out now on O’Reilly