Effective Monitoring and Alerting

By Slawek Ligus. Published by O'Reilly Media, Inc.
  1. Effective Monitoring and Alerting
  2. Preface
    1. Who Should Read This Book
    2. Conventions Used in This Book
    3. Using Code Examples
    4. Safari® Books Online
    5. How to Contact Us
    6. Acknowledgements
  3. 1. Introduction
    1. Monitoring, Alerting, and What They Can Do for You
      1. Early Problem Detection
      2. Decision Making
      3. Automation
    2. Monitoring and Alerting in a Nutshell
      1. Metrics and Timeseries
      2. Alarms, Alerts, and Monitors
      3. Monitoring System
      4. The Process of Alerting
      5. Issue Tracking
    3. The Challenges
    4. Important Terms
  4. 2. Monitoring
    1. The Building Blocks
      1. Data Collection
      2. Coverage
      3. Metrics
      4. Example: Inputs, Metrics, and Timeseries
      5. Understanding Metrics
      6. Timeseries Patterns
    2. Drawing Conclusions from Timeseries Plots
      1. Interpretation of Anomalies
      2. Frequently Encountered Anomalies
      3. Determining Causality
      4. Capturing the Daily Cycle, Trends, and Seasonal Changes
  5. 3. Alerting
    1. The Challenge
    2. Prerequisites
      1. Monitoring and Alerting Platform
      2. Audit Trail
      3. Issue Tracking
    3. Understanding Failure and Its Impact
      1. Establishing Significance
      2. Identifying Causes
    4. Anatomy of an Alarm
      1. Boolean Function
      2. Suppression
      3. Aggregation
    5. Case Study: A Data Pipeline
    6. Types of Alerts
    7. Setting Up Alarms
      1. Identifying Impact
      2. Establishing Severity
      3. Picking the Right Timeseries
      4. Configuring Monitors
      5. Setting Up Alarms
      6. Testing Alerting Configurations
    8. Alerting Suggestions
  6. 4. At Scale
    1. Implications of Scale
    2. Composition of Large-Scale Systems
    3. Commonalities of Large-Scale Alerting Configurations
    4. Monitoring Coverage
      1. Reflecting Dimensions in Metrics
    5. Managing Large Alerting Configurations
      1. Addressing the Problems
      2. Suggested Solution
      3. Result
  7. 5. Monitoring in System Automation
    1. Choosing Appropriate Maintenance Times Automatically
    2. Controlling the Rate of Upgrade
    3. Recovery-Oriented Admission Control
    4. Automated Deployment and Rollback
  8. 6. The Work Environment
    1. Keeping an Audit Trail
    2. Working with Tickets
      1. Root Cause Analysis
    3. Dealing with Anomalies
    4. Learning from Outages
    5. Using Checklists
    6. Creating Dashboards
    7. Service-Level Agreements
    8. Preventing the Ironies of Automation
    9. Culture
  9. 7. Measuring Success
    1. The Feedback Loop
      1. Root Cause Classification
      2. Timing
    2. Ticket Reporting
      1. Frequency of Incidence
      2. Incidence Times
      3. Time to Respond and Time to Resolution
    3. Measuring Detectability
      1. False Positives and False Negatives
      2. Precision and Recall
      3. The F-Measure
    4. Transition to Automated Alarms
    5. Maintenance Overhead
    6. How (Not) to Measure
  10. 8. The Principles
    1. Get in the Habit of Measuring
    2. Draw Conclusions Reliably
    3. Monitor Extensively
    4. Alarm Selectively
    5. Work Smart, Not Hard
      1. Learn from the Experience of Others
      2. Have a Tactic
      3. Run a Bank of Cases
      4. Enjoy the Process
  11. A. Setting Up OpenTSDB
    1. The Software
      1. Architecture
      2. Getting OpenTSDB
    2. First Steps
      1. Starting TSD
      2. Pushing Data
      3. Input Tagging
      4. Temporal Aggregation
      5. Summary Statistics
      6. Rate of Change
    3. Gathering Data System-Wide
      1. Running tcollector
      2. Writing a Custom Collector
    4. Timeseries Plots
      1. Plotting Tips
    5. Get Involved
  12. About the Author
  13. Copyright
Chapter 7. Measuring Success

One can recognize the quality of collaboration by how success is measured. The task of fostering the culture rests mainly with managers. The onus of maintaining a top-notch configuration and event response is on operators. Measuring success, however, is a collaborative effort, one that requires a fair amount of common sense and resistance to the urge to inflate results. Effective alerting configurations are hard to build and even harder to measure. This chapter deals with measuring qualitative changes by quantitative means.

Managers and their employees, here typically systems engineers, work toward common goals, but some of their priorities may differ. Engineers like getting things done; managers like it when things get done. In other words, engineers focus on problem solving, while managers emphasize efficiency of execution and reporting. The two priorities do not conflict as long as a healthy balance is maintained. When it is not, reporting can become burdensome.

The problem is real: the more time engineers spend on accounting, the less of it goes into doing actual work. But it’s not just about overhead. When people see no value in accounting (and if you hire smart people, as most IT businesses do, they most likely have valid reasons for it), they might still carry out the tasks you ask of them but are more likely to do so carelessly. Ironically, inaccurate accounting makes it harder to reliably identify pain points, which, when resolved, ...
