Using Flume

Book Description

How can you get your data from frontend servers to Hadoop in near real time? With this complete reference guide, you’ll learn Flume’s rich set of features for collecting, aggregating, and writing large amounts of streaming data to the Hadoop Distributed File System (HDFS), Apache HBase, SolrCloud, Elastic Search, and other systems.

Table of Contents

  1. Foreword
  2. Preface
    1. Conventions Used in This Book
    2. Using Code Examples
    3. Safari® Books Online
    4. How to Contact Us
    5. Acknowledgments
  3. 1. Apache Hadoop and Apache HBase: An Introduction
    1. HDFS
      1. HDFS Data Formats
      2. Processing Data on HDFS
    2. Apache HBase
    3. Summary
    4. References
  4. 2. Streaming Data Using Apache Flume
    1. The Need for Flume
    2. Is Flume a Good Fit?
    3. Inside a Flume Agent
    4. Configuring Flume Agents
    5. Getting Flume Agents to Talk to Each Other
    6. Complex Flows
    7. Replicating Data to Various Destinations
    8. Dynamic Routing
    9. Flume’s No Data Loss Guarantee, Channels, and Transactions
      1. Transactions in Flume Channels
    10. Agent Failure and Data Loss
    11. The Importance of Batching
    12. What About Duplicates?
    13. Running a Flume Agent
    14. Summary
    15. References
  5. 3. Sources
    1. Lifecycle of a Source
    2. Sink-to-Source Communication
      1. Avro Source
      2. Thrift Source
      3. Failure Handling in RPC Sources
    3. HTTP Source
      1. Writing Handlers for the HTTP Source*
    4. Spooling Directory Source
      1. Reading Custom Formats Using Deserializers*
      2. Spooling Directory Source Performance
    5. Syslog Sources
    6. Exec Source
    7. JMS Source
      1. Converting JMS Messages into Flume Events*
    8. Writing Your Own Sources*
      1. Event-Driven and Pollable Sources
    9. Summary
    10. References
  6. 4. Channels
    1. Transaction Workflow
    2. Channels Bundled with Flume
      1. Memory Channel
      2. File Channel
    3. Summary
    4. References
  7. 5. Sinks
    1. Lifecycle of a Sink
    2. Optimizing the Performance of Sinks
    3. Writing to HDFS: The HDFS Sink
      1. Understanding Buckets
      2. Configuring the HDFS Sink
      3. Controlling the Data Format Using Serializers*
    4. HBase Sinks
      1. Translating Flume Events to HBase Puts and Increments Using Serializers*
    5. RPC Sinks
      1. Avro Sink
      2. Thrift Sink
    6. Morphline Solr Sink
    7. Elastic Search Sink
      1. Customizing the Data Format*
    8. Other Sinks: Null Sink, Rolling File Sink, Logger Sink
    9. Writing Your Own Sink*
    10. Summary
    11. References
  8. 6. Interceptors, Channel Selectors, Sink Groups, and Sink Processors
    1. Interceptors
      1. Timestamp Interceptor
      2. Host Interceptor
      3. Static Interceptor
      4. Regex Filtering Interceptor
      5. Morphline Interceptor
      6. UUID Interceptor
      7. Writing Interceptors*
    2. Channel Selectors
      1. Replicating Channel Selector
      2. Multiplexing Channel Selector
      3. Custom Channel Selectors*
    3. Sink Groups and Sink Processors
      1. Load-Balancing Sink Processor
      2. Failover Sink Processor
    4. Summary
    5. References
  9. 7. Getting Data into Flume*
    1. Building Flume Events
    2. Flume Client SDK
      1. Building Flume RPC Clients
      2. RPC Client Interface
      3. Configuration Parameters Common to All RPC Clients
      4. Default RPC Client
      5. Load-Balancing RPC Client
      6. Failover RPC Client
      7. Thrift RPC Client
    3. Embedded Agent
      1. Configuring an Embedded Agent
    4. log4j Appenders
      1. Load-Balancing log4j Appender
    5. Summary
    6. References
  10. 8. Planning, Deploying, and Monitoring Flume
    1. Planning a Flume Deployment
      1. Time to Repair
      2. How Much Capacity Do I Need in My Flume Channels?
      3. How Many Tiers?
      4. Sending Data over Cross–Data Center Links
      5. Sharding Tiers
    2. Deploying Flume
      1. Deploying Custom Code
    3. Monitoring Flume
      1. Reporting Metrics from Custom Components
    4. Summary
    5. References
  11. Index