Effective Monitoring and Alerting

Book description

The book describes a data-driven approach to optimal monitoring and alerting in distributed computer systems. It interprets monitoring as a continuous process aimed at extracting meaning from a system's data. The resulting wisdom drives effective maintenance and fast recovery, the bread and butter of web operations.

The book offers a scalable perspective on the following topics:

  • anatomy of monitoring and alerting
  • conclusive interpretation of time series
  • data-driven approach to setting up monitors
  • addressing system failures by their impact
  • applications of monitoring in automation
  • reporting on quality with quantitative means
  • and more!

Table of Contents

  1. Effective Monitoring and Alerting
  2. Preface
    1. Who Should Read This Book
    2. Conventions Used in This Book
    3. Using Code Examples
    4. Safari® Books Online
    5. How to Contact Us
    6. Acknowledgements
  3. 1. Introduction
    1. Monitoring, Alerting, and What They Can Do for You
      1. Early Problem Detection
        1. Availability
        2. Performance
      2. Decision Making
        1. Baselining
        2. Predictions
      3. Automation
        1. Admission Control
        2. Autonomic Computing
    2. Monitoring and Alerting in a Nutshell
      1. Metrics and Timeseries
      2. Alarms, Alerts, and Monitors
      3. Monitoring System
      4. The Process of Alerting
      5. Issue Tracking
        1. Tickets and queues
    3. The Challenges
    4. Important Terms
  4. 2. Monitoring
    1. The Building Blocks
      1. Data Collection
      2. Coverage
        1. Resources
          1. Network
          2. Computational resources
        2. Solution stack
          1. Operating system
          2. Middleware
          3. Application
        3. User experience
      3. Metrics
        1. Summary statistics
          1. Frequency distribution and percentiles
          2. Rate of change
        2. Time granularity
        3. Metric aggregation
      4. Example: Inputs, Metrics, and Timeseries
      5. Understanding Metrics
        1. Type of unit
        2. Data Collection Mode
        3. Data Source
        4. Number of Inputs per Data Point
        5. Type of Quantity
      6. Timeseries Patterns
    2. Drawing Conclusions from Timeseries Plots
      1. Interpretation of Anomalies
        1. Flow
        2. Stock
        3. Availability
        4. Throughput
        5. Applications of quantities
      2. Frequently Encountered Anomalies
        1. Flattening Effect
        2. Warm-Up Effect
        3. Regular Anomalies
        4. Spikes During Troughs
      3. Determining Causality
      4. Capturing the Daily Cycle, Trends, and Seasonal Changes
  5. 3. Alerting
    1. The Challenge
    2. Prerequisites
      1. Monitoring and Alerting Platform
      2. Audit Trail
      3. Issue Tracking
    3. Understanding Failure and Its Impact
      1. Establishing Significance
      2. Identifying Causes
    4. Anatomy of an Alarm
      1. Boolean Function
        1. Metric Monitor
          1. Upper Limit
          2. Lower Limit
          3. Outside Range
          4. Data Points Not Recorded
        2. Time Evaluation
        3. Another Alarm as Input Source
      2. Suppression
      3. Aggregation
    5. Case Study: A Data Pipeline
    6. Types of Alerts
    7. Setting Up Alarms
      1. Identifying Impact
      2. Establishing Severity
      3. Picking the Right Timeseries
      4. Configuring Monitors
        1. Coming Up with a Threshold
          1. Static thresholds
          2. Data-driven thresholds
        2. Breach and Clear Delay
      5. Setting Up Alarms
      6. Testing Alerting Configurations
    8. Alerting Suggestions
  6. 4. At Scale
    1. Implications of Scale
    2. Composition of Large-Scale Systems
    3. Commonalities of Large-Scale Alerting Configurations
    4. Monitoring Coverage
      1. Reflecting Dimensions in Metrics
    5. Managing Large Alerting Configurations
      1. Addressing the Problems
        1. Organize alarms and monitors in a namespace
        2. Calculate threshold values from metric data
        3. Periodically refresh and clean up the configuration
      2. Suggested Solution
        1. Refresh intervals
          1. Running the engine
          2. Naming
          3. Alarm creation and threshold calculation
          4. Cleanup procedures
          5. Writing Modules
          6. Suppression
          7. Extra Features
      3. Result
  7. 5. Monitoring in System Automation
    1. Choosing Appropriate Maintenance Times Automatically
    2. Controlling the Rate of Upgrade
    3. Recovery-Oriented Admission Control
    4. Automated Deployment and Rollback
  8. 6. The Work Environment
    1. Keeping an Audit Trail
    2. Working with Tickets
      1. Root Cause Analysis
        1. The Five Whys
          1. Extracting Categories
    3. Dealing with Anomalies
    4. Learning from Outages
    5. Using Checklists
    6. Creating Dashboards
    7. Service-Level Agreements
    8. Preventing the Ironies of Automation
    9. Culture
  9. 7. Measuring Success
    1. The Feedback Loop
      1. Root Cause Classification
        1. A Short Story of a Long Classifier List
      2. Timing
    2. Ticket Reporting
      1. Frequency of Incidence
      2. Incidence Times
      3. Time to Respond and Time to Resolution
    3. Measuring Detectability
      1. False Positives and False Negatives
      2. Precision and Recall
      3. The F-Measure
    4. Transition to Automated Alarms
    5. Maintenance Overhead
    6. How (Not) to Measure
  10. 8. The Principles
    1. Get in the Habit of Measuring
    2. Draw Conclusions Reliably
    3. Monitor Extensively
    4. Alarm Selectively
    5. Work Smart, Not Hard
      1. Learn from the Experience of Others
      2. Have a Tactic
      3. Run a Bank of Cases
      4. Enjoy the Process
  11. A. Setting Up OpenTSDB
    1. The Software
      1. Architecture
      2. Getting OpenTSDB
    2. First Steps
      1. Starting TSD
      2. Pushing Data
      3. Input Tagging
        1. Tag Wildcards
      4. Temporal Aggregation
      5. Summary Statistics
      6. Rate of Change
    3. Gathering Data System-Wide
      1. Running tcollector
      2. Writing a Custom Collector
    4. Timeseries Plots
      1. Plotting Tips
    5. Get Involved
  12. About the Author
  13. Copyright