Book description
Stream data to Hadoop using Apache Flume
- Integrate Flume with your data sources
- Transcode your data en route in Flume
- Route and separate your data using regular expression matching
- Configure failover paths and load-balancing to remove single points of failure
- Use gzip compression for files written to HDFS
In Detail
Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. Its main goal is to deliver data from applications to Apache Hadoop's HDFS. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with many failover and recovery mechanisms.
Apache Flume: Distributed Log Collection for Hadoop covers the problems with HDFS and streaming data and logs, and how Flume can resolve them. The book explains the generalized architecture of Flume, including moving data to and from databases and NoSQL-style data stores, as well as optimizing performance. It also includes real-world scenarios of Flume implementation.
Apache Flume: Distributed Log Collection for Hadoop starts with an architectural overview of Flume and then discusses each component in detail. It guides you through the complete process of installing and compiling Flume.
The book introduces channels and channel selectors. For each architectural component (Sources, Channels, Sinks, Channel Processors, Sink Groups, and so on), it covers the various implementations in detail, along with their configuration options, so that you can customize Flume to your specific needs. It also gives pointers on writing your own custom implementations.
By the end, you should be able to construct a series of Flume agents to transport your streaming data and logs from your systems into Hadoop in near real time.
Table of contents
- Apache Flume: Distributed Log Collection for Hadoop
- Table of Contents
- Apache Flume: Distributed Log Collection for Hadoop
- Credits
- About the Author
- About the Reviewers
- www.PacktPub.com
- Preface
- 1. Overview and Architecture
- 2. Flume Quick Start
- 3. Channels
- 4. Sinks and Sink Processors
- 5. Sources and Channel Selectors
- 6. Interceptors, ETL, and Routing
- 7. Monitoring Flume
- 8. There Is No Spoon – The Realities of Real-time Distributed Data Collection
- Index
Product information
- Title: Apache Flume: Distributed Log Collection for Hadoop
- Release date: July 2013
- Publisher(s): Packt Publishing
- ISBN: 9781782167914