Kafka: The Definitive Guide

Book description

Every enterprise application creates data, whether it’s log messages, metrics, user activity, outgoing messages, or something else. And how to move all of this data becomes nearly as important as the data itself. If you’re an application architect, developer, or production engineer new to Apache Kafka, this practical guide shows you how to use this open source streaming platform to handle real-time data feeds.

Engineers from Confluent and LinkedIn who are responsible for developing Kafka explain how to deploy production Kafka clusters, write reliable event-driven microservices, and build scalable stream-processing applications with this platform. Through detailed examples, you’ll learn Kafka’s design principles, reliability guarantees, key APIs, and architecture details, including the replication protocol, the controller, and the storage layer.

  • Understand publish-subscribe messaging and how it fits in the big data ecosystem
  • Explore Kafka producers and consumers for writing and reading messages
  • Understand Kafka patterns and use-case requirements to ensure reliable data delivery
  • Get best practices for building data pipelines and applications with Kafka
  • Manage Kafka in production, and learn to perform monitoring, tuning, and maintenance tasks
  • Learn the most critical metrics among Kafka’s operational measurements
  • Explore how Kafka’s stream delivery capabilities make it a perfect source for stream processing systems

Table of contents

  1. Foreword
  2. Preface
    1. Who Should Read This Book
    2. Conventions Used in This Book
    3. Using Code Examples
    4. O’Reilly Online Learning
    5. How to Contact Us
    6. Acknowledgments
  3. 1. Meet Kafka
    1. Publish/Subscribe Messaging
      1. How It Starts
      2. Individual Queue Systems
    2. Enter Kafka
      1. Messages and Batches
      2. Schemas
      3. Topics and Partitions
      4. Producers and Consumers
      5. Brokers and Clusters
      6. Multiple Clusters
    3. Why Kafka?
      1. Multiple Producers
      2. Multiple Consumers
      3. Disk-Based Retention
      4. Scalable
      5. High Performance
    4. The Data Ecosystem
      1. Use Cases
    5. Kafka’s Origin
      1. LinkedIn’s Problem
      2. The Birth of Kafka
      3. Open Source
      4. The Name
    6. Getting Started with Kafka
  4. 2. Installing Kafka
    1. First Things First
      1. Choosing an Operating System
      2. Installing Java
      3. Installing Zookeeper
    2. Installing a Kafka Broker
    3. Broker Configuration
      1. General Broker
      2. Topic Defaults
    4. Hardware Selection
      1. Disk Throughput
      2. Disk Capacity
      3. Memory
      4. Networking
      5. CPU
    5. Kafka in the Cloud
    6. Kafka Clusters
      1. How Many Brokers?
      2. Broker Configuration
      3. OS Tuning
    7. Production Concerns
      1. Garbage Collector Options
      2. Datacenter Layout
      3. Colocating Applications on Zookeeper
    8. Summary
  5. 3. Kafka Producers: Writing Messages to Kafka
    1. Producer Overview
    2. Constructing a Kafka Producer
    3. Sending a Message to Kafka
      1. Sending a Message Synchronously
      2. Sending a Message Asynchronously
    4. Configuring Producers
    5. Serializers
      1. Custom Serializers
      2. Serializing Using Apache Avro
      3. Using Avro Records with Kafka
    6. Partitions
    7. Old Producer APIs
    8. Summary
  6. 4. Kafka Consumers: Reading Data from Kafka
    1. Kafka Consumer Concepts
      1. Consumers and Consumer Groups
      2. Consumer Groups and Partition Rebalance
    2. Creating a Kafka Consumer
    3. Subscribing to Topics
    4. The Poll Loop
    5. Configuring Consumers
    6. Commits and Offsets
      1. Automatic Commit
      2. Commit Current Offset
      3. Asynchronous Commit
      4. Combining Synchronous and Asynchronous Commits
      5. Commit Specified Offset
    7. Rebalance Listeners
    8. Consuming Records with Specific Offsets
    9. But How Do We Exit?
    10. Deserializers
    11. Standalone Consumer: Why and How to Use a Consumer Without a Group
    12. Older Consumer APIs
    13. Summary
  7. 5. Kafka Internals
    1. Cluster Membership
    2. The Controller
    3. Replication
    4. Request Processing
      1. Produce Requests
      2. Fetch Requests
      3. Other Requests
    5. Physical Storage
      1. Partition Allocation
      2. File Management
      3. File Format
      4. Indexes
      5. Compaction
      6. How Compaction Works
      7. Deleted Events
      8. When Are Topics Compacted?
    6. Summary
  8. 6. Reliable Data Delivery
    1. Reliability Guarantees
    2. Replication
    3. Broker Configuration
      1. Replication Factor
      2. Unclean Leader Election
      3. Minimum In-Sync Replicas
    4. Using Producers in a Reliable System
      1. Send Acknowledgments
      2. Configuring Producer Retries
      3. Additional Error Handling
    5. Using Consumers in a Reliable System
      1. Important Consumer Configuration Properties for Reliable Processing
      2. Explicitly Committing Offsets in Consumers
    6. Validating System Reliability
      1. Validating Configuration
      2. Validating Applications
      3. Monitoring Reliability in Production
    7. Summary
  9. 7. Building Data Pipelines
    1. Considerations When Building Data Pipelines
      1. Timeliness
      2. Reliability
      3. High and Varying Throughput
      4. Data Formats
      5. Transformations
      6. Security
      7. Failure Handling
      8. Coupling and Agility
    2. When to Use Kafka Connect Versus Producer and Consumer
    3. Kafka Connect
      1. Running Connect
      2. Connector Example: File Source and File Sink
      3. Connector Example: MySQL to Elasticsearch
      4. A Deeper Look at Connect
    4. Alternatives to Kafka Connect
      1. Ingest Frameworks for Other Datastores
      2. GUI-Based ETL Tools
      3. Stream-Processing Frameworks
    5. Summary
  10. 8. Cross-Cluster Data Mirroring
    1. Use Cases of Cross-Cluster Mirroring
    2. Multicluster Architectures
      1. Some Realities of Cross-Datacenter Communication
      2. Hub-and-Spokes Architecture
      3. Active-Active Architecture
      4. Active-Standby Architecture
      5. Stretch Clusters
    3. Apache Kafka’s MirrorMaker
      1. How to Configure
      2. Deploying MirrorMaker in Production
      3. Tuning MirrorMaker
    4. Other Cross-Cluster Mirroring Solutions
      1. Uber uReplicator
      2. Confluent Replicator
    5. Summary
  11. 9. Administering Kafka
    1. Topic Operations
      1. Creating a New Topic
      2. Adding Partitions
      3. Deleting a Topic
      4. Listing All Topics in a Cluster
      5. Describing Topic Details
    2. Consumer Groups
      1. List and Describe Groups
      2. Delete Group
      3. Offset Management
    3. Dynamic Configuration Changes
      1. Overriding Topic Configuration Defaults
      2. Overriding Client Configuration Defaults
      3. Describing Configuration Overrides
      4. Removing Configuration Overrides
    4. Partition Management
      1. Preferred Replica Election
      2. Changing a Partition’s Replicas
      3. Changing Replication Factor
      4. Dumping Log Segments
      5. Replica Verification
    5. Consuming and Producing
      1. Console Consumer
      2. Console Producer
    6. Client ACLs
    7. Unsafe Operations
      1. Moving the Cluster Controller
      2. Killing a Partition Move
      3. Removing Topics to Be Deleted
      4. Deleting Topics Manually
    8. Summary
  12. 10. Monitoring Kafka
    1. Metric Basics
      1. Where Are the Metrics?
      2. Internal or External Measurements
      3. Application Health Checks
      4. Metric Coverage
    2. Kafka Broker Metrics
      1. Under-Replicated Partitions
      2. Broker Metrics
      3. Topic and Partition Metrics
      4. JVM Monitoring
      5. OS Monitoring
      6. Logging
    3. Client Monitoring
      1. Producer Metrics
      2. Consumer Metrics
      3. Quotas
    4. Lag Monitoring
    5. End-to-End Monitoring
    6. Summary
  13. 11. Stream Processing
    1. What Is Stream Processing?
    2. Stream-Processing Concepts
      1. Time
      2. State
      3. Stream-Table Duality
      4. Time Windows
    3. Stream-Processing Design Patterns
      1. Single-Event Processing
      2. Processing with Local State
      3. Multiphase Processing/Repartitioning
      4. Processing with External Lookup: Stream-Table Join
      5. Streaming Join
      6. Out-of-Sequence Events
      7. Reprocessing
    4. Kafka Streams by Example
      1. Word Count
      2. Stock Market Statistics
      3. Click Stream Enrichment
    5. Kafka Streams: Architecture Overview
      1. Building a Topology
      2. Scaling the Topology
      3. Surviving Failures
    6. Stream Processing Use Cases
    7. How to Choose a Stream-Processing Framework
    8. Summary
  14. A. Installing Kafka on Other Operating Systems
    1. Installing on Windows
      1. Using Windows Subsystem for Linux
      2. Using Native Java
    2. Installing on MacOS
      1. Using Homebrew
      2. Installing Manually
  15. Index

Product information

  • Title: Kafka: The Definitive Guide
  • Author(s): Neha Narkhede, Gwen Shapira, Todd Palino
  • Release date: September 2017
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781491936160