Effective Monitoring and Alerting

By Slawek Ligus. Published by O'Reilly Media, Inc.

Chapter 6. The Work Environment

Humans follow incentives, get distracted easily, and are forgetful. Systems keep evolving. Remember this whenever a human operator is expected to become an integral part of an operational process. Some fundamental problems in monitoring and alerting stem from false assumptions about human nature; others stem from underestimating the importance of change. In general, the problem lies in perceiving things as they ought to be rather than as they actually are. The system is dynamic, has many moving parts, and is predictable only to a certain degree. The people who designed it are most often not the ones in charge of 24/7 operations. For that reason, the work environment should foster a flexible culture, one that supports adaptability and encourages growth.

Keeping an Audit Trail

Responding to alerts means dealing with uncertainty. Even in mature IT organizations, outages resulting from changes made by operators, such as new software rollouts, configuration updates, and infrastructure upgrades, account for more than 50% of all outages. Keeping an audit trail and consulting it at the first signs of an outage can therefore reduce the initial uncertainty in every second case, giving the troubleshooter a massive advantage.

An accurate and complete audit trail does not have to come at the cost of high manual overhead. It can be largely automated with the help of a publish-subscribe style messaging ...
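
To make the idea concrete, here is a minimal sketch of an audit trail fed by publish-subscribe messaging. It is an illustration under stated assumptions, not the book's implementation: the `Broker`, `AuditTrail`, and `ChangeEvent` names, the "changes" topic, and the sample events are all hypothetical, and the toy in-process broker stands in for whatever messaging system an organization actually runs.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class ChangeEvent:
    """One operator-initiated change (hypothetical schema)."""
    timestamp: datetime
    operator: str
    kind: str          # e.g. "rollout", "config", "infrastructure"
    description: str


class Broker:
    """Toy in-process publish-subscribe broker, assumed for illustration."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, event):
        # Deliver the event to every subscriber of the topic.
        for callback in self._subscribers[topic]:
            callback(event)


class AuditTrail:
    """Subscriber that records every change event it receives."""

    def __init__(self):
        self._events = []

    def record(self, event):
        self._events.append(event)

    def recent(self, now, window):
        # Changes made within `window` before `now`: the first thing
        # a troubleshooter would consult when an alert fires.
        return [e for e in self._events if now - window <= e.timestamp <= now]


broker = Broker()
trail = AuditTrail()
broker.subscribe("changes", trail.record)

# Deployment and configuration tools would publish these automatically.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)  # arbitrary example time
broker.publish("changes", ChangeEvent(
    now - timedelta(hours=5), "alice", "config", "raised connection pool size"))
broker.publish("changes", ChangeEvent(
    now - timedelta(minutes=10), "bob", "rollout", "deployed new frontend build"))

# An alert fires at `now`: list the changes from the last hour.
for event in trail.recent(now, timedelta(hours=1)):
    print(event.timestamp.isoformat(), event.operator, event.kind, event.description)
```

Because the tools that make changes publish the events themselves, operators do not have to log anything by hand, and the troubleshooter only has to query a time window around the alert.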