Summary

In this chapter, we covered the HDFS sink in depth: the Flume output that writes streaming data into HDFS. We covered how Flume can separate data into different HDFS paths based on time or on the contents of Flume headers. Several file-rolling techniques were also discussed (a configuration sketch follows the list), including the following:

  • Time rotation
  • Event count rotation
  • Size rotation
  • Rotation on idle only
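
As a rough sketch, the following Flume agent configuration combines these rolling options with time- and header-based path partitioning. The names agent1 and k1 are placeholders, and %{logType} is a hypothetical header used only to illustrate header-based partitioning; the hdfs.* properties themselves are standard HDFS sink settings:

    agent1.sinks.k1.type = hdfs
    # Time- and header-based path: %Y/%m/%d/%H are filled in from the
    # event's timestamp header; %{logType} is a hypothetical custom header.
    agent1.sinks.k1.hdfs.path = /flume/events/%{logType}/%Y/%m/%d/%H
    # Time rotation: roll a new file every 30 minutes (0 disables).
    agent1.sinks.k1.hdfs.rollInterval = 1800
    # Event count rotation: roll after 10,000 events (0 disables).
    agent1.sinks.k1.hdfs.rollCount = 10000
    # Size rotation: roll at roughly 128 MB, specified in bytes (0 disables).
    agent1.sinks.k1.hdfs.rollSize = 134217728
    # Idle rotation: close the file after 60 seconds with no new events.
    agent1.sinks.k1.hdfs.idleTimeout = 60

To rotate on idle only, you would set the other three roll properties to 0 and rely on hdfs.idleTimeout alone.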

Compression was discussed as a means of reducing storage requirements in HDFS, and it should be used when possible. Beyond the storage savings, it is often faster to read a compressed file and decompress it in memory than it is to read an uncompressed file, which can improve the performance of MapReduce jobs run on this data. The splittability of compressed data was also covered as a factor in choosing a compression codec.
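
Enabling compression on the HDFS sink comes down to two properties. This minimal sketch assumes the same placeholder names as above and picks gzip purely for illustration:

    # CompressedStream requires a codec to be set via hdfs.codeC.
    agent1.sinks.k1.hdfs.fileType = CompressedStream
    # Codec to apply; note that gzip files are not splittable,
    # while bzip2 files are.
    agent1.sinks.k1.hdfs.codeC = gzip

If the files will be processed by MapReduce, a splittable codec (or a container format such as SequenceFile) avoids having each large file read by a single mapper.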
