Kafka: The Definitive Guide

Book Description

Learn how to take full advantage of Apache Kafka, the distributed, publish-subscribe queue for handling real-time data feeds. With this comprehensive book, you’ll understand how Kafka works and how it’s designed. Authors Neha Narkhede, Gwen Shapira, and Todd Palino show you how to deploy production Kafka clusters; secure, tune, and monitor them; write rock-solid applications that use Kafka; and build scalable stream-processing applications.

Table of Contents

  1. Meet Kafka
    1. Publish / Subscribe Messaging
      1. How It Starts
      2. Individual Queue Systems
    2. Enter Kafka
      1. Messages and Batches
      2. Schemas
      3. Topics and Partitions
      4. Producers and Consumers
      5. Brokers and Clusters
      6. Multiple Clusters
    3. Why Kafka?
      1. Multiple Producers
      2. Multiple Consumers
      3. Disk-based Retention
      4. Scalable
      5. High Performance
    4. The Data Ecosystem
      1. Use Cases
    5. The Origin Story
      1. LinkedIn’s Problem
      2. The Birth of Kafka
      3. Open Source
      4. The Name
    6. Getting Started With Kafka
  2. Installing Kafka
    1. First Things First
      1. Choosing an Operating System
      2. Installing Java
      3. Installing Zookeeper
    2. Installing a Kafka Broker
    3. Broker Configuration
      1. General Broker
      2. Topic Defaults
    4. Hardware Selection
      1. Disk Throughput
      2. Disk Capacity
      3. Memory
      4. Networking
      5. CPU
    5. Kafka in the Cloud
    6. Kafka Clusters
      1. How Many Brokers
      2. Broker Configuration
      3. Operating System Tuning
    7. Production Concerns
      1. Garbage Collector Options
      2. Datacenter Layout
      3. Colocating Applications on Zookeeper
    8. Getting Started With Clients
  3. Kafka Producers - Writing Messages to Kafka
    1. Producer Overview
    2. Constructing a Kafka Producer
    3. Sending a Message to Kafka
    4. Serializers
    5. Partitions
    6. Configuring Producers
    7. Old Producer APIs
  4. Kafka Consumers - Reading Data from Kafka
    1. KafkaConsumer Concepts
      1. Consumers and Consumer Groups
      2. Consumer Groups - Partition Rebalance
    2. Creating a Kafka Consumer
    3. Subscribing to Topics
    4. The Poll Loop
    5. Commits and Offsets
      1. Automatic Commit
      2. Commit Current Offset
      3. Asynchronous Commit
      4. Combining Synchronous and Asynchronous Commits
      5. Commit Specified Offset
    6. Rebalance Listeners
    7. Seek and Exactly Once Processing
    8. But How Do We Exit?
    9. Deserializers
    10. Configuring Consumers
      1. fetch.min.bytes
      2. fetch.max.wait.ms
      3. max.partition.fetch.bytes
      4. session.timeout.ms
      5. auto.offset.reset
      6. enable.auto.commit
      7. partition.assignment.strategy
      8. client.id
    11. Standalone Consumer - Why and How to Use a Consumer without a Group
    12. Older Consumer APIs
  5. Kafka Internals
    1. Cluster Membership
    2. Replication
      1. The Controller
      2. In-Sync Replica
    3. Request Processing
      1. Produce Requests
      2. Fetch Requests
      3. Other Requests
    4. Physical Storage
      1. Partition Allocation
      2. File Management
      3. File Format
      4. Indexes
      5. Compaction
      6. How Compaction Works
      7. Deleted Events
      8. When Are Topics Compacted
    5. Summary
  6. Reliable Data Delivery
    1. Reliability Guarantees
    2. Replication
    3. Broker Configuration
      1. Replication Factor
      2. Unclean Leader Election
      3. Minimum In-Sync Replicas
    4. Using Producers in a Reliable System
      1. Send Acknowledgements
      2. Configuring Producer Retries
      3. Additional Error Handling
    5. Using Consumers in a Reliable System
      1. Important Consumer Configuration for Reliable Processing
      2. Explicitly Committing Offsets in Consumer
    6. Validating System Reliability
      1. Validating Configuration
      2. Validating Applications
      3. Monitoring Reliability in Production
    7. Final Notes
  7. Building Data Pipelines
    1. Considerations When Building Data Pipelines
      1. Timeliness
      2. Reliability
      3. High and Varying Throughput
      4. Data Formats
      5. Transformations
      6. Security
      7. Failure Handling
      8. Coupling and Agility
    2. When to Use Kafka Connect vs. Producer and Consumer
    3. Kafka Connect
      1. Running Connect
      2. Connectors Example - File source and File sink
      3. Connectors Example - MySQL to ElasticSearch
      4. A Deeper Look at Connect
    4. Alternatives to Kafka Connect
      1. Ingest Frameworks for Other Data Stores
      2. GUI-based ETL Tools
      3. Stream Processing Frameworks
    5. Summary