The Practice of Cloud System Administration: DevOps and SRE Practices for Web Services, Volume 2

Table of contents

  1. About This eBook
  2. Title Page
  3. Copyright Page
  4. Contents at a Glance
  5. Contents
  6. Preface
    1. About This Book
    2. Acknowledgments
      1. Part I Design: Building It
      2. Part II Operations: Running It
      3. Part III Appendices
  7. About the Authors
  8. Introduction
    1. Business Objectives
    2. Ideal System Architecture
    3. Ideal Release Process
    4. Ideal Operations
  9. Part I Design: Building It
    1. Chapter 1. Designing in a Distributed World
      1. 1.1 Visibility at Scale
      2. 1.2 The Importance of Simplicity
      3. 1.3 Composition
        1. 1.3.1 Load Balancer with Multiple Backend Replicas
        2. 1.3.2 Server with Multiple Backends
        3. 1.3.3 Server Tree
      4. 1.4 Distributed State
      5. 1.5 The CAP Principle
        1. 1.5.1 Consistency
        2. 1.5.2 Availability
        3. 1.5.3 Partition Tolerance
      6. 1.6 Loosely Coupled Systems
      7. 1.7 Speed
      8. 1.8 Summary
      9. Exercises
    2. Chapter 2. Designing for Operations
      1. 2.1 Operational Requirements
        1. 2.1.1 Configuration
        2. 2.1.2 Startup and Shutdown
        3. 2.1.3 Queue Draining
        4. 2.1.4 Software Upgrades
        5. 2.1.5 Backups and Restores
        6. 2.1.6 Redundancy
        7. 2.1.7 Replicated Databases
        8. 2.1.8 Hot Swaps
        9. 2.1.9 Toggles for Individual Features
        10. 2.1.10 Graceful Degradation
        11. 2.1.11 Access Controls and Rate Limits
        12. 2.1.12 Data Import Controls
        13. 2.1.13 Monitoring
        14. 2.1.14 Auditing
        15. 2.1.15 Debug Instrumentation
        16. 2.1.16 Exception Collection
        17. 2.1.17 Documentation for Operations
      2. 2.2 Implementing Design for Operations
        1. 2.2.1 Build Features in from the Beginning
        2. 2.2.2 Request Features as They Are Identified
        3. 2.2.3 Write the Features Yourself
        4. 2.2.4 Work with a Third-Party Vendor
      3. 2.3 Improving the Model
      4. 2.4 Summary
      5. Exercises
    3. Chapter 3. Selecting a Service Platform
      1. 3.1 Level of Service Abstraction
        1. 3.1.1 Infrastructure as a Service
        2. 3.1.2 Platform as a Service
        3. 3.1.3 Software as a Service
      2. 3.2 Type of Machine
        1. 3.2.1 Physical Machines
        2. 3.2.2 Virtual Machines
        3. 3.2.3 Containers
      3. 3.3 Level of Resource Sharing
        1. 3.3.1 Compliance
        2. 3.3.2 Privacy
        3. 3.3.3 Cost
        4. 3.3.4 Control
      4. 3.4 Colocation
      5. 3.5 Selection Strategies
      6. 3.6 Summary
      7. Exercises
    4. Chapter 4. Application Architectures
      1. 4.1 Single-Machine Web Server
      2. 4.2 Three-Tier Web Service
        1. 4.2.1 Load Balancer Types
        2. 4.2.2 Load Balancing Methods
        3. 4.2.3 Load Balancing with Shared State
        4. 4.2.4 User Identity
        5. 4.2.5 Scaling
      3. 4.3 Four-Tier Web Service
        1. 4.3.1 Frontends
        2. 4.3.2 Application Servers
        3. 4.3.3 Configuration Options
      4. 4.4 Reverse Proxy Service
      5. 4.5 Cloud-Scale Service
        1. 4.5.1 Global Load Balancer
        2. 4.5.2 Global Load Balancing Methods
        3. 4.5.3 Global Load Balancing with User-Specific Data
        4. 4.5.4 Internal Backbone
      6. 4.6 Message Bus Architectures
        1. 4.6.1 Message Bus Designs
        2. 4.6.2 Message Bus Reliability
        3. 4.6.3 Example 1: Link-Shortening Site
        4. 4.6.4 Example 2: Employee Human Resources Data Updates
      7. 4.7 Service-Oriented Architecture
        1. 4.7.1 Flexibility
        2. 4.7.2 Support
        3. 4.7.3 Best Practices
      8. 4.8 Summary
      9. Exercises
    5. Chapter 5. Design Patterns for Scaling
      1. 5.1 General Strategy
        1. 5.1.1 Identify Bottlenecks
        2. 5.1.2 Reengineer Components
        3. 5.1.3 Measure Results
        4. 5.1.4 Be Proactive
      2. 5.2 Scaling Up
      3. 5.3 The AKF Scaling Cube
        1. 5.3.1 x: Horizontal Duplication
        2. 5.3.2 y: Functional or Service Splits
        3. 5.3.3 z: Lookup-Oriented Split
        4. 5.3.4 Combinations
      4. 5.4 Caching
        1. 5.4.1 Cache Effectiveness
        2. 5.4.2 Cache Placement
        3. 5.4.3 Cache Persistence
        4. 5.4.4 Cache Replacement Algorithms
        5. 5.4.5 Cache Entry Invalidation
        6. 5.4.6 Cache Size
      5. 5.5 Data Sharding
      6. 5.6 Threading
      7. 5.7 Queueing
        1. 5.7.1 Benefits
        2. 5.7.2 Variations
      8. 5.8 Content Delivery Networks
      9. 5.9 Summary
      10. Exercises
    6. Chapter 6. Design Patterns for Resiliency
      1. 6.1 Software Resiliency Beats Hardware Reliability
      2. 6.2 Everything Malfunctions Eventually
        1. 6.2.1 MTBF in Distributed Systems
        2. 6.2.2 The Traditional Approach
        3. 6.2.3 The Distributed Computing Approach
      3. 6.3 Resiliency through Spare Capacity
        1. 6.3.1 How Much Spare Capacity
        2. 6.3.2 Load Sharing versus Hot Spares
      4. 6.4 Failure Domains
      5. 6.5 Software Failures
        1. 6.5.1 Software Crashes
        2. 6.5.2 Software Hangs
        3. 6.5.3 Query of Death
      6. 6.6 Physical Failures
        1. 6.6.1 Parts and Components
        2. 6.6.2 Machines
        3. 6.6.3 Load Balancers
        4. 6.6.4 Racks
        5. 6.6.5 Datacenters
      7. 6.7 Overload Failures
        1. 6.7.1 Traffic Surges
        2. 6.7.2 DoS and DDoS Attacks
        3. 6.7.3 Scraping Attacks
      8. 6.8 Human Error
      9. 6.9 Summary
      10. Exercises
  10. Part II Operations: Running It
    1. Chapter 7. Operations in a Distributed World
      1. 7.1 Distributed Systems Operations
        1. 7.1.1 SRE versus Traditional Enterprise IT
        2. 7.1.2 Change versus Stability
        3. 7.1.3 Defining SRE
        4. 7.1.4 Operations at Scale
      2. 7.2 Service Life Cycle
        1. 7.2.1 Service Launches
        2. 7.2.2 Service Decommissioning
      3. 7.3 Organizing Strategy for Operational Teams
        1. 7.3.1 Team Member Day Types
        2. 7.3.2 Other Strategies
      4. 7.4 Virtual Office
        1. 7.4.1 Communication Mechanisms
        2. 7.4.2 Communication Policies
      5. 7.5 Summary
      6. Exercises
    2. Chapter 8. DevOps Culture
      1. 8.1 What Is DevOps?
        1. 8.1.1 The Traditional Approach
        2. 8.1.2 The DevOps Approach
      2. 8.2 The Three Ways of DevOps
        1. 8.2.1 The First Way: Workflow
        2. 8.2.2 The Second Way: Improve Feedback
        3. 8.2.3 The Third Way: Continual Experimentation and Learning
        4. 8.2.4 Small Batches Are Better
        5. 8.2.5 Adopting the Strategies
      3. 8.3 History of DevOps
        1. 8.3.1 Evolution
        2. 8.3.2 Site Reliability Engineering
      4. 8.4 DevOps Values and Principles
        1. 8.4.1 Relationships
        2. 8.4.2 Integration
        3. 8.4.3 Automation
        4. 8.4.4 Continuous Improvement
        5. 8.4.5 Common Nontechnical DevOps Practices
        6. 8.4.6 Common Technical DevOps Practices
        7. 8.4.7 Release Engineering DevOps Practices
      5. 8.5 Converting to DevOps
        1. 8.5.1 Getting Started
        2. 8.5.2 DevOps at the Business Level
      6. 8.6 Agile and Continuous Delivery
        1. 8.6.1 What Is Agile?
        2. 8.6.2 What Is Continuous Delivery?
      7. 8.7 Summary
      8. Exercises
    3. Chapter 9. Service Delivery: The Build Phase
      1. 9.1 Service Delivery Strategies
        1. 9.1.1 Pattern: Modern DevOps Methodology
        2. 9.1.2 Anti-pattern: Waterfall Methodology
      2. 9.2 The Virtuous Cycle of Quality
      3. 9.3 Build-Phase Steps
        1. 9.3.1 Develop
        2. 9.3.2 Commit
        3. 9.3.3 Build
        4. 9.3.4 Package
        5. 9.3.5 Register
      4. 9.4 Build Console
      5. 9.5 Continuous Integration
      6. 9.6 Packages as Handoff Interface
      7. 9.7 Summary
      8. Exercises
    4. Chapter 10. Service Delivery: The Deployment Phase
      1. 10.1 Deployment-Phase Steps
        1. 10.1.1 Promotion
        2. 10.1.2 Installation
        3. 10.1.3 Configuration
      2. 10.2 Testing and Approval
        1. 10.2.1 Testing
        2. 10.2.2 Approval
      3. 10.3 Operations Console
      4. 10.4 Infrastructure Automation Strategies
        1. 10.4.1 Preparing Physical Machines
        2. 10.4.2 Preparing Virtual Machines
        3. 10.4.3 Installing OS and Services
      5. 10.5 Continuous Delivery
      6. 10.6 Infrastructure as Code
      7. 10.7 Other Platform Services
      8. 10.8 Summary
      9. Exercises
    5. Chapter 11. Upgrading Live Services
      1. 11.1 Taking the Service Down for Upgrading
      2. 11.2 Rolling Upgrades
      3. 11.3 Canary
      4. 11.4 Phased Roll-outs
      5. 11.5 Proportional Shedding
      6. 11.6 Blue-Green Deployment
      7. 11.7 Toggling Features
      8. 11.8 Live Schema Changes
      9. 11.9 Live Code Changes
      10. 11.10 Continuous Deployment
      11. 11.11 Dealing with Failed Code Pushes
      12. 11.12 Release Atomicity
      13. 11.13 Summary
      14. Exercises
    6. Chapter 12. Automation
      1. 12.1 Approaches to Automation
        1. 12.1.1 The Left-Over Principle
        2. 12.1.2 The Compensatory Principle
        3. 12.1.3 The Complementarity Principle
        4. 12.1.4 Automation for System Administration
        5. 12.1.5 Lessons Learned
      2. 12.2 Tool Building versus Automation
        1. 12.2.1 Example: Auto Manufacturing
        2. 12.2.2 Example: Machine Configuration
        3. 12.2.3 Example: Account Creation
        4. 12.2.4 Tools Are Good, But Automation Is Better
      3. 12.3 Goals of Automation
      4. 12.4 Creating Automation
        1. 12.4.1 Making Time to Automate
        2. 12.4.2 Reducing Toil
        3. 12.4.3 Determining What to Automate First
      5. 12.5 How to Automate
      6. 12.6 Language Tools
        1. 12.6.1 Shell Scripting Languages
        2. 12.6.2 Scripting Languages
        3. 12.6.3 Compiled Languages
        4. 12.6.4 Configuration Management Languages
      7. 12.7 Software Engineering Tools and Techniques
        1. 12.7.1 Issue Tracking Systems
        2. 12.7.2 Version Control Systems
        3. 12.7.3 Software Packaging
        4. 12.7.4 Style Guides
        5. 12.7.5 Test-Driven Development
        6. 12.7.6 Code Reviews
        7. 12.7.7 Writing Just Enough Code
      8. 12.8 Multitenant Systems
      9. 12.9 Summary
      10. Exercises
    7. Chapter 13. Design Documents
      1. 13.1 Design Documents Overview
        1. 13.1.1 Documenting Changes and Rationale
        2. 13.1.2 Documentation as a Repository of Past Decisions
      2. 13.2 Design Document Anatomy
      3. 13.3 Template
      4. 13.4 Document Archive
      5. 13.5 Review Workflows
        1. 13.5.1 Reviewers and Approvers
        2. 13.5.2 Achieving Sign-off
      6. 13.6 Adopting Design Documents
      7. 13.7 Summary
      8. Exercises
    8. Chapter 14. Oncall
      1. 14.1 Designing Oncall
        1. 14.1.1 Start with the SLA
        2. 14.1.2 Oncall Roster
        3. 14.1.3 Onduty
        4. 14.1.4 Oncall Schedule Design
        5. 14.1.5 The Oncall Calendar
        6. 14.1.6 Oncall Frequency
        7. 14.1.7 Types of Notifications
        8. 14.1.8 After-Hours Maintenance Coordination
      2. 14.2 Being Oncall
        1. 14.2.1 Pre-shift Responsibilities
        2. 14.2.2 Regular Oncall Responsibilities
        3. 14.2.3 Alert Responsibilities
        4. 14.2.4 Observe, Orient, Decide, Act (OODA)
        5. 14.2.5 Oncall Playbook
        6. 14.2.6 Third-Party Escalation
        7. 14.2.7 End-of-Shift Responsibilities
      3. 14.3 Between Oncall Shifts
        1. 14.3.1 Long-Term Fixes
        2. 14.3.2 Postmortems
      4. 14.4 Periodic Review of Alerts
      5. 14.5 Being Paged Too Much
      6. 14.6 Summary
      7. Exercises
    9. Chapter 15. Disaster Preparedness
      1. 15.1 Mindset
        1. 15.1.1 Antifragile Systems
        2. 15.1.2 Reducing Risk
      2. 15.2 Individual Training: Wheel of Misfortune
      3. 15.3 Team Training: Fire Drills
        1. 15.3.1 Service Testing
        2. 15.3.2 Random Testing
      4. 15.4 Training for Organizations: Game Day/DiRT
        1. 15.4.1 Getting Started
        2. 15.4.2 Increasing Scope
        3. 15.4.3 Implementation and Logistics
        4. 15.4.4 Experiencing a DiRT Test
      5. 15.5 Incident Command System
        1. 15.5.1 How It Works: Public Safety Arena
        2. 15.5.2 How It Works: IT Operations Arena
        3. 15.5.3 Incident Action Plan
        4. 15.5.4 Best Practices
        5. 15.5.5 ICS Example
      6. 15.6 Summary
      7. Exercises
    10. Chapter 16. Monitoring Fundamentals
      1. 16.1 Overview
        1. 16.1.1 Uses of Monitoring
        2. 16.1.2 Service Management
      2. 16.2 Consumers of Monitoring Information
      3. 16.3 What to Monitor
      4. 16.4 Retention
      5. 16.5 Meta-monitoring
      6. 16.6 Logs
        1. 16.6.1 Approach
        2. 16.6.2 Timestamps
      7. 16.7 Summary
      8. Exercises
    11. Chapter 17. Monitoring Architecture and Practice
      1. 17.1 Sensing and Measurement
        1. 17.1.1 Blackbox versus Whitebox Monitoring
        2. 17.1.2 Direct versus Synthesized Measurements
        3. 17.1.3 Rate versus Capability Monitoring
        4. 17.1.4 Gauges versus Counters
      2. 17.2 Collection
        1. 17.2.1 Push versus Pull
        2. 17.2.2 Protocol Selection
        3. 17.2.3 Server Component versus Agent versus Poller
        4. 17.2.4 Central versus Regional Collectors
      3. 17.3 Analysis and Computation
      4. 17.4 Alerting and Escalation Manager
        1. 17.4.1 Alerting, Escalation, and Acknowledgments
        2. 17.4.2 Silence versus Inhibit
      5. 17.5 Visualization
        1. 17.5.1 Percentiles
        2. 17.5.2 Stack Ranking
        3. 17.5.3 Histograms
      6. 17.6 Storage
      7. 17.7 Configuration
      8. 17.8 Summary
      9. Exercises
    12. Chapter 18. Capacity Planning
      1. 18.1 Standard Capacity Planning
        1. 18.1.1 Current Usage
        2. 18.1.2 Normal Growth
        3. 18.1.3 Planned Growth
        4. 18.1.4 Headroom
        5. 18.1.5 Resiliency
        6. 18.1.6 Timetable
      2. 18.2 Advanced Capacity Planning
        1. 18.2.1 Identifying Your Primary Resources
        2. 18.2.2 Knowing Your Capacity Limits
        3. 18.2.3 Identifying Your Core Drivers
        4. 18.2.4 Measuring Engagement
        5. 18.2.5 Analyzing the Data
        6. 18.2.6 Monitoring the Key Indicators
        7. 18.2.7 Delegating Capacity Planning
      3. 18.3 Resource Regression
      4. 18.4 Launching New Services
      5. 18.5 Reduce Provisioning Time
      6. 18.6 Summary
      7. Exercises
    13. Chapter 19. Creating KPIs
      1. 19.1 What Is a KPI?
      2. 19.2 Creating KPIs
        1. 19.2.1 Step 1: Envision the Ideal
        2. 19.2.2 Step 2: Quantify Distance to the Ideal
        3. 19.2.3 Step 3: Imagine How Behavior Will Change
        4. 19.2.4 Step 4: Revise and Select
        5. 19.2.5 Step 5: Deploy the KPI
      3. 19.3 Example KPI: Machine Allocation
        1. 19.3.1 The First Pass
        2. 19.3.2 The Second Pass
        3. 19.3.3 Evaluating the KPI
      4. 19.4 Case Study: Error Budget
        1. 19.4.1 Conflicting Goals
        2. 19.4.2 A Unified Goal
        3. 19.4.3 Everyone Benefits
      5. 19.5 Summary
      6. Exercises
    14. Chapter 20. Operational Excellence
      1. 20.1 What Does Operational Excellence Look Like?
      2. 20.2 How to Measure Greatness
      3. 20.3 Assessment Methodology
        1. 20.3.1 Operational Responsibilities
        2. 20.3.2 Assessment Levels
        3. 20.3.3 Assessment Questions and Look-For’s
      4. 20.4 Service Assessments
        1. 20.4.1 Identifying What to Assess
        2. 20.4.2 Assessing Each Service
        3. 20.4.3 Comparing Results across Services
        4. 20.4.4 Acting on the Results
        5. 20.4.5 Assessment and Project Planning Frequencies
      5. 20.5 Organizational Assessments
      6. 20.6 Levels of Improvement
      7. 20.7 Getting Started
      8. 20.8 Summary
      9. Exercises
    15. Epilogue
  11. Part III Appendices
    1. Appendix A. Assessments
      1. A.1 Regular Tasks (RT)
        1. Sample Assessment Questions
        2. Level 1: Initial
        3. Level 2: Repeatable
        4. Level 3: Defined
        5. Level 4: Managed
        6. Level 5: Optimizing
      2. A.2 Emergency Response (ER)
        1. Sample Assessment Questions
        2. Level 1: Initial
        3. Level 2: Repeatable
        4. Level 3: Defined
        5. Level 4: Managed
        6. Level 5: Optimizing
      3. A.3 Monitoring and Metrics (MM)
        1. Sample Assessment Questions
        2. Level 1: Initial
        3. Level 2: Repeatable
        4. Level 3: Defined
        5. Level 4: Managed
        6. Level 5: Optimizing
      4. A.4 Capacity Planning (CP)
        1. Sample Assessment Questions
        2. Level 1: Initial
        3. Level 2: Repeatable
        4. Level 3: Defined
        5. Level 4: Managed
        6. Level 5: Optimizing
      5. A.5 Change Management (CM)
        1. Sample Assessment Questions
        2. Level 1: Initial
        3. Level 2: Repeatable
        4. Level 3: Defined
        5. Level 4: Managed
        6. Level 5: Optimizing
      6. A.6 New Product Introduction and Removal (NPI/NPR)
        1. Sample Assessment Questions
        2. Level 1: Initial
        3. Level 2: Repeatable
        4. Level 3: Defined
        5. Level 4: Managed
        6. Level 5: Optimizing
      7. A.7 Service Deployment and Decommissioning (SDD)
        1. Sample Assessment Questions
        2. Level 1: Initial
        3. Level 2: Repeatable
        4. Level 3: Defined
        5. Level 4: Managed
        6. Level 5: Optimizing
      8. A.8 Performance and Efficiency (PE)
        1. Sample Assessment Questions
        2. Level 1: Initial
        3. Level 2: Repeatable
        4. Level 3: Defined
        5. Level 4: Managed
        6. Level 5: Optimizing
      9. A.9 Service Delivery: The Build Phase
        1. Sample Assessment Questions
        2. Level 1: Initial
        3. Level 2: Repeatable
        4. Level 3: Defined
        5. Level 4: Managed
        6. Level 5: Optimizing
      10. A.10 Service Delivery: The Deployment Phase
        1. Sample Assessment Questions
        2. Level 1: Initial
        3. Level 2: Repeatable
        4. Level 3: Defined
        5. Level 4: Managed
        6. Level 5: Optimizing
      11. A.11 Toil Reduction
        1. Sample Assessment Questions
        2. Level 1: Initial
        3. Level 2: Repeatable
        4. Level 3: Defined
        5. Level 4: Managed
        6. Level 5: Optimizing
      12. A.12 Disaster Preparedness
        1. Sample Assessment Questions
        2. Level 1: Initial
        3. Level 2: Repeatable
        4. Level 3: Defined
        5. Level 4: Managed
        6. Level 5: Optimizing
    2. Appendix B. The Origins and Future of Distributed Computing and Clouds
      1. B.1 The Pre-Web Era (1985–1994)
        1. Availability Requirements
        2. Technology
        3. Scaling
        4. High Availability
        5. Costs
      2. B.2 The First Web Era: The Bubble (1995–2000)
        1. Availability Requirements
        2. Technology
        3. Scaling
        4. High Availability
        5. N + 1 Configurations
        6. N + 2 Configurations
        7. Costs
      3. B.3 The Dot-Bomb Era (2000–2003)
        1. Availability Requirements
        2. Technology
        3. High Availability
        4. Scaling
        5. Data Scaling
        6. Applicability
        7. Costs
      4. B.4 The Second Web Era (2003–2010)
        1. Availability Requirements
        2. Technology
        3. High Availability
        4. Scaling
        5. Costs
      5. B.5 The Cloud Computing Era (2010–present)
        1. Availability Requirements
        2. Costs
        3. Scaling and High Availability
        4. Technology
      6. B.6 Conclusion
      7. Exercises
    3. Appendix C. Scaling Terminology and Concepts
      1. C.1 Constant, Linear, and Exponential Scaling
      2. C.2 Big O Notation
      3. C.3 Limitations of Big O Notation
    4. Appendix D. Templates and Examples
      1. D.1 Design Document Template
      2. D.2 Design Document Example
      3. D.3 Sample Postmortem Template
    5. Appendix E. Recommended Reading
      1. DevOps:
      2. ITIL:
      3. Theory:
      4. Classic Google Papers:
      5. Classic Facebook Papers:
      6. Scalability:
      7. UNIX Internals:
      8. UNIX Systems Programming:
      9. Network Protocols:
  12. Bibliography
  13. Index

Product information

  • Title: The Practice of Cloud System Administration: DevOps and SRE Practices for Web Services, Volume 2
  • Publisher(s): Addison-Wesley Professional