Book description

Learn what it takes to build and maintain high-traffic websites with Web Operations. Featuring essays from today's top web veterans, this insightful book shows you how to run your web ops as reliably and effectively as Google, Microsoft, and Yahoo run theirs. Even if your site never gets that big, you'll profit from the experience and knowledge of the people who created sites for these and other industry giants.

Table of Contents

  1. Web Operations: Keeping the Data on Time
    1. SPECIAL OFFER: Upgrade this ebook with O’Reilly
    2. Foreword
    3. Preface
      1. How This Book Is Organized
      2. Who This Book Is For
      3. Conventions Used in This Book
      4. Using Code Examples
      5. How to Contact Us
      6. Safari® Books Online
      7. Acknowledgments
    4. 1. Web Operations: The Career
      1. Why Does Web Operations Have It Tough?
        1. A Strong Background in Computing
        2. Practiced Decisiveness
        3. A Calm Disposition
      2. From Apprentice to Master
        1. Knowledge
        2. Tools
        3. Experience
          1. The organizational challenge of inexperience
          2. The concept of "senior operations"
        4. Discipline
      3. Conclusion
    5. 2. How Picnik Uses Cloud Computing: Lessons Learned
      1. Where the Cloud Fits (and Why!)
        1. Storage
        2. Hybrid Computing with EC2
      2. Where the Cloud Doesn't Fit (for Picnik)
      3. Conclusion
    6. 3. Infrastructure and Application Metrics
      1. Time Resolution and Retention Concerns
      2. Locality of Metrics Collection and Storage
      3. Layers of Metrics
        1. High-Level Business or Feature-Specific Metrics
        2. System- and Service-Level Metrics
      4. Providing Context for Anomaly Detection and Alerts
      5. Log Lines Are Metrics, Too
      6. Correlation with Change Management and Incident Timelines
      7. Making Metrics Available to Your Alerting Mechanisms
      8. Using Metrics to Guide Load-Feedback Mechanisms
      9. A Metrics Collection System, Illustrated: Ganglia
        1. Background
        2. A Quick Introduction to Ganglia
          1. The need to keep collection and aggregation costs low
          2. The need to automatically discover new nodes and metrics
          3. The need to match network transport with your metrics collection task
          4. The need to implicitly prioritize cluster metrics
          5. The need to aggregate and organize metrics once they're collected
          6. The need to provide convenient interfaces for creating new metrics and pulling out existing metrics for correlation against other data
      10. Conclusion
    7. 4. Continuous Deployment
      1. Small Batches Mean Faster Feedback
      2. Small Batches Mean Problems Are Instantly Localized
      3. Small Batches Reduce Risk
      4. Small Batches Reduce Overhead
      5. The Quality Defenders' Lament
        1. Why Does It Work?
      6. Getting Started
        1. Step 1: Continuous Integration Server
        2. Step 2: Source Control Commit Check
        3. Step 3: Simple Deployment Script
        4. Step 4: Real-Time Alerting
        5. Step 5: Root-Cause Analysis (Five Whys)
      7. Continuous Deployment Is for Mission-Critical Applications
        1. Another Release? Do I Have To?
        2. The QA Dilemma
      8. Conclusion
    8. 5. Infrastructure As Code
      1. Service-Oriented Architecture
        1. Configuration Management
          1. Configuration management is policy driven
          2. System automation is configuration management policy made into code
          3. Configuration management in system administration
        2. System Integration
          1. Step 1: Break the infrastructure down into reusable, network-accessible services
            1. The bootstrapping service.
            2. The configuration service.
          2. Step 2: Integrate the services together
      2. Conclusion
    9. 6. Monitoring
      1. Story: "The Start of a Journey"
      2. Step 1: Understand What You Are Monitoring
      3. Step 2: Understand Normal Behavior
      4. Step 3: Be Prepared and Learn
      5. Conclusion
    10. 7. How Complex Systems Fail
      1. How Complex Systems Fail
        1. (Being a Short Treatise on the Nature of Failure; How Failure Is Evaluated; How Failure Is Attributed to Proximate Cause; and the Resulting New Understanding of Patient Safety)
          1. Complex systems are intrinsically hazardous systems
          2. Complex systems are heavily and successfully defended against failure
          3. Catastrophe requires multiple failures – single-point failures are not enough
          4. Complex systems contain changing mixtures of failures latent within them
          5. Complex systems run in degraded mode
          6. Catastrophe is always just around the corner
          7. Post-accident attribution to a "root cause" is fundamentally wrong
          8. Hindsight biases post-accident assessments of human performance
          9. Human operators have dual roles: as producers and as defenders against failure
          10. All practitioner actions are gambles
          11. Actions at the sharp end resolve all ambiguity
          12. Human practitioners are the adaptable element of complex systems
          13. Human expertise in complex systems is constantly changing
          14. Change introduces new forms of failure
          15. Views of "cause" limit the effectiveness of defenses against future events
          16. Safety is a characteristic of systems and not of their components
          17. People continuously create safety
          18. Failure-free operations require experience with failure
        2. As It Pertains Specifically to Web Operations
          1. It will be difficult to tell that the system has failed
          2. It will be difficult to tell what has failed
          3. Meaningful response will be delayed
          4. Communications will be strained and tempers will flare
          5. Maintenance will be a major source of new failures
          6. Recovery from backup is itself difficult and potentially dangerous
          7. Create test procedures that front-line people can use to verify system status
          8. Manage operations on a daily basis
          9. Control maintenance
          10. Assess performance at regular intervals
          11. Be a (unique) customer
      2. Further Reading
    11. 8. Community Management and Web Operations
    12. 9. Dealing with Unexpected Traffic Spikes
      1. How It All Started
      2. Alarms Abound
      3. Putting Out the Fire
      4. Surviving the Weekend
      5. Preparing for the Future
      6. CDN to the Rescue
      7. Proxy Servers
      8. Corralling the Stampede
      9. Streamlining the Codebase
      10. How Do We Know It Works?
      11. The Real Test
      12. Lessons Learned
      13. Improvements Since Then
    13. 10. Dev and Ops Collaboration and Cooperation
      1. Deployment
      2. Shared, Open Infrastructure
      3. Trust
      4. On-call Developers
        1. Live Debugging Tools
        2. Feature Flags
      5. Avoiding Blame
      6. Conclusion
    14. 11. How Your Visitors Feel: User-Facing Metrics
      1. Why Collect User-Facing Metrics?
        1. Successful Start-ups Learn and Adapt
        2. Performance Matters
        3. Recent Research Quantifies the Relationship
      2. What Makes a Site Slow?
        1. Service Discovery
        2. Sending the Request
        3. Thinking About the Response
        4. Delivering the Response
        5. Asynchronous Traffic and Refresh
        6. Rendering Time
      3. Measuring Delay
        1. Synthetic Monitoring
          1. When to use synthetic monitoring
          2. Limitations of synthetic monitoring
          3. Configuring synthetic monitoring
        2. Real User Monitoring
          1. When to use RUM
          2. Limitations of RUM
          3. Configuring RUM
      4. Building an SLA
        1. Apdex
      5. Visitor Outcomes: Analytics
        1. How Marketing Defines Success
        2. The Four Kinds of Sites
        3. A (Very) Basic Model of Analytics
        4. Correlating Performance and Analytics by Time
        5. Correlating Performance and Analytics by Visits
      6. Other Metrics Marketing Cares About
        1. Web Interaction Analytics
        2. Voice of the Customer
      7. How User Experience Affects Web Ops
        1. Many More Stakeholders
        2. Monitoring As Part of the Life Cycle, Not Just QA
      8. The Future of Web Monitoring
        1. Moving from Parts to Users
        2. Service-Centric Architectures
        3. Clouds and Monitoring
        4. APIs and RSS Feeds
          1. Delivering an API to others
          2. Consuming an API from someone else
        5. Rich Internet Applications
        6. HTML5: Server-Sent Events and WebSockets
        7. Online Communities and the Long Funnel
        8. Tying Together Mail and Conversion Loops
        9. The Capacity/Cost/Revenue Equation
      9. Conclusion
    15. 12. Relational Database Strategy and Tactics for the Web
      1. Requirements for Web Databases
        1. Always On
        2. Mostly Transactional Workload
        3. Simple Data, Simple Queries
        4. Availability Trumps Consistency
        5. Rapid Development
        6. Online Deployment
        7. Built by Developers
      2. How Typical Web Databases Grow
        1. Single Server
        2. Master and Replication Slaves
        3. Functional Partitioning
        4. Sharding, or Horizontal Partitioning
        5. Caching Layer
      3. The Yearning for a Cluster
        1. The CAP Theorem and ACID Versus BASE
        2. State of MySQL Clustering
          1. DRBD and Heartbeat
          2. Master-Master Replication Manager (MMM)
          3. Heartbeat with replication
          4. Proxy-based solutions
          5. InfiniDB, Galera, Tungsten, and ScaleDB
          6. Summary
      4. Database Strategy
        1. Architecture Requirements
          1. Easy wins
        2. Safe-Bet Architectures
        3. Risky Architectures
          1. Sharding
          2. Writing to more than one master
          3. Multilevel replication
          4. Ring replication (beyond two nodes)
          5. Reliance on DNS
          6. The so-called Entity-Attribute-Value (EAV) design pattern
      5. Database Tactics
        1. Taking Backups on a Slave
        2. Online Schema Changes
        3. Monitoring, Graphing, and Instrumentation
        4. Analyzing Performance
        5. Archiving and Purging Data
      6. Conclusion
    16. 13. How to Make Failure Beautiful: The Art and Science of Postmortems
      1. The Worst Postmortem
      2. What Is a Postmortem?
      3. When to Conduct a Postmortem
      4. Who to Invite to a Postmortem
      5. Running a Postmortem
      6. Postmortem Follow-Up
      7. Conclusion
    17. 14. Storage
      1. Data Asset Inventory
      2. Data Protection
      3. Capacity Planning
      4. Storage Sizing
      5. Operations
      6. Conclusion
    18. 15. Nonrelational Databases
      1. NoSQL Database Overview
        1. Pure Key/Value
        2. Data Structure
        3. Graph
        4. Document Oriented
        5. Highly Distributed
      2. Some Systems in Detail
        1. Cassandra
        2. HBase
        3. Riak
        4. CouchDB
        5. MongoDB
        6. Redis
      3. Conclusion
    19. 16. Agile Infrastructure
      1. Agile Infrastructure
        1. But Agile Is Not the Only Thing That Has Evolved
        2. Some People Are Born to Web Operations, Some People Have Web Operations Thrust upon Them...
        3. Working Software Is the Primary Measure of Progress
        4. The Application Is the Infrastructure, the Infrastructure Is the Application
      2. So, What's the Problem?
        1. Talk Does Not Cook Rice
          1. The infrastructure is an application
          2. Version control: The foundation of sanity
          3. Configuration management and automated deployments
          4. Monitoring
          5. Dev-test-prod life cycle, continuous integration, and disaster recovery
          6. Radiate information
          7. Reflective process improvement
          8. Incremental changes and refactoring
          9. The simplest thing that could work
          10. Separation of concerns
          11. Technical debt
          12. Continuous deployment
          13. Pairing
          14. Managing flow
      3. Communities of Interest and Practice
      4. Trading Zones and Apologies
        1. What to Do?
      5. Conclusion
    20. 17. Things That Go Bump in the Night (and How to Sleep Through Them)
      1. Definitions
      2. How Many 9s?
      3. Impact Duration Versus Incident Duration
      4. Datacenter Footprint
      5. Gradual Failures
      6. Trust Nobody
      7. Failover Testing
      8. Monitoring and History of Patterns
      9. Getting a Good Night's Sleep
    21. A. Contributors
    22. Index
    23. About the Authors
    24. Colophon