Designing Data-Intensive Applications

Book Description

Want to know how the best software engineers and architects structure their applications to make them scalable, reliable, and maintainable in the long term? This book examines the key principles, algorithms, and trade-offs of data systems, using the internals of various popular software packages and frameworks as examples. You’ll learn how to determine what kind of tool is appropriate for which purpose, and how certain tools can be combined to form the foundation of a good application architecture.

Table of Contents

  1. About this Book
    1. Who Should Read this Book?
    2. Scope of this Book
    3. Outline of this Book
    4. Early Release Status and Feedback
  2. I. Foundations of Data Systems
  3. 1. Reliable, Scalable and Maintainable Applications
    1. Thinking About Data Systems
    2. Reliability
      1. Hardware faults
      2. Software errors
      3. Human errors
      4. How important is reliability?
    3. Scalability
      1. Describing load
      2. Describing performance
      3. Approaches for coping with load
    4. Maintainability
      1. Operability: making life easy for operations
      2. Simplicity: managing complexity
      3. Evolvability: making change easy
    5. Summary
  4. 2. Data Models and Query Languages
    1. Relational Model vs. Document Model
      1. The birth of NoSQL
      2. The object-relational mismatch
      3. Many-to-one and many-to-many relationships
      4. Are document databases repeating history?
      5. Relational vs. document databases today
    2. Query Languages for Data
      1. Declarative queries on the web
      2. MapReduce querying
    3. Graph-like Data Models
      1. Property graphs
      2. The Cypher query language
      3. Graph queries in SQL
      4. Triple-stores and SPARQL
      5. The foundation: Datalog
    4. Summary
  5. 3. Storage and Retrieval
    1. Data Structures that Power Your Database
      1. Hash indexes
      2. SSTables and LSM-trees
      3. B-trees
      4. Other indexing structures
      5. Keeping everything in memory
    2. Transaction Processing or Analytics?
      1. Data warehousing
      2. Stars and snowflakes: schemas for analytics
    3. Column-oriented storage
      1. Column compression
      2. Sort order in column storage
      3. Writing to column-oriented storage
      4. Aggregation: Data cubes and materialized views
    4. Summary
  6. 4. Encoding and Evolution
    1. Formats for Encoding Data
      1. Language-specific formats
      2. JSON, XML and binary variants
      3. Thrift and Protocol Buffers
      4. Avro
      5. The merits of schemas
    2. Modes of Data Flow
      1. Data flow through databases
      2. Data flow through services: REST and RPC
      3. Message passing data flow
    3. Summary
  7. II. Distributed Data
  8. 5. Replication
    1. Leaders and Followers
      1. Synchronous vs. asynchronous replication
      2. Setting up new followers
      3. Handling node outages
      4. Implementation of replication logs
    2. Problems With Replication Lag
      1. Reading your own writes
      2. Monotonic reads
      3. Consistent prefix reads
      4. Solutions for replication lag
    3. Multi-leader replication
      1. Use cases for multi-leader replication
      2. Handling write conflicts
      3. Multi-leader replication topologies
    4. Leaderless replication
      1. Writing to the database when a node is down
      2. Limitations of quorum consistency
      3. Sloppy quorums and hinted handoff
      4. Detecting concurrent writes
    5. Summary
  9. 6. Partitioning
    1. Partitioning and replication
    2. Partitioning of key-value data
      1. Partitioning by key range
      2. Partitioning by hash of key
      3. Skewed workloads and relieving hot spots
    3. Partitioning and secondary indexes
      1. Partitioning secondary indexes by document
      2. Partitioning secondary indexes by term
    4. Rebalancing partitions
      1. Strategies for rebalancing
      2. Operations: automatic or manual rebalancing
    5. Request routing
      1. Parallel query execution
    6. Summary
  10. 7. Transactions
    1. The slippery concept of a transaction
      1. The meaning of ACID
      2. Single-object and multi-object operations
    2. Weak isolation levels
      1. Read committed
      2. Snapshot isolation and repeatable read
      3. Preventing lost updates
      4. Preventing write skew and phantoms
    3. Serializability
      1. Actual serial execution
      2. Two-phase locking (2PL)
      3. Serializable snapshot isolation (SSI)
    4. Summary
  11. 8. The Trouble with Distributed Systems
    1. Faults and Partial Failures
      1. Cloud computing and supercomputing
    2. Unreliable Networks
      1. Network faults in practice
      2. Detecting faults
      3. Timeouts and unbounded delays
      4. Synchronous vs. asynchronous networks
    3. Unreliable Clocks
      1. Monotonic vs. time-of-day clocks
      2. Clock synchronization and accuracy
      3. Relying on synchronized clocks
      4. Process pauses
    4. Knowledge, Truth and Lies
      1. The truth is defined by the majority
      2. Byzantine faults
      3. System model and reality
    5. Summary
  12. 9. Consistency and Consensus
    1. Consistency Guarantees
    2. Linearizability
      1. What makes a system linearizable?
      2. Relying on linearizability
      3. Implementing linearizable systems
      4. The cost of linearizability
    3. Ordering Guarantees
      1. Ordering and causality
      2. Sequence number ordering
      3. Total order broadcast
    4. Distributed Transactions and Consensus
      1. Atomic commit and two-phase commit (2PC)
      2. Distributed transactions in practice
      3. Fault-tolerant consensus
      4. Membership and coordination services
    5. Summary
  13. III. Derived Data
  14. 10. Batch Processing
    1. Batch Processing with Unix Tools
      1. Simple log analysis
      2. The Unix philosophy
    2. MapReduce and Distributed Filesystems
      1. MapReduce job execution
      2. Reduce-side joins and grouping
      3. Map-side joins
      4. The output of batch workflows
      5. Comparing MapReduce to distributed databases
    3. Beyond MapReduce
      1. Materialization of intermediate state
      2. Graphs and iterative processing
      3. High-level APIs and languages
    4. Summary
  15. 11. Stream Processing
    1. Transmitting Event Streams
      1. Messaging systems
      2. Partitioned logs
    2. Databases and streams
      1. Keeping systems in sync
      2. Change data capture
      3. Event sourcing
      4. State, streams, and immutability
    3. Processing Streams
      1. Uses of stream processing
      2. Reasoning about time
      3. Stream joins
      4. Fault tolerance
    4. Summary