You are previewing Effective Monitoring and Alerting.

Effective Monitoring and Alerting

Cover of Effective Monitoring and Alerting by Slawek Ligus Published by O'Reilly Media, Inc.
  1. Effective Monitoring and Alerting
  2. Preface
    1. Who Should Read This Book
    2. Conventions Used in This Book
    3. Using Code Examples
    4. Safari® Books Online
    5. How to Contact Us
    6. Acknowledgements
  3. 1. Introduction
    1. Monitoring, Alerting, and What They Can Do for You
      1. Early Problem Detection
      2. Decision Making
      3. Automation
    2. Monitoring and Alerting in a Nutshell
      1. Metrics and Timeseries
      2. Alarms, Alerts, and Monitors
      3. Monitoring System
      4. The Process of Alerting
      5. Issue Tracking
    3. The Challenges
    4. Important Terms
  4. 2. Monitoring
    1. The Building Blocks
      1. Data Collection
      2. Coverage
      3. Metrics
      4. Example: Inputs, Metrics, and Timeseries
      5. Understanding Metrics
      6. Timeseries Patterns
    2. Drawing Conclusions from Timeseries Plots
      1. Interpretation of Anomalies
      2. Frequently Encountered Anomalies
      3. Determining Causality
      4. Capturing the Daily Cycle, Trends, and Seasonal Changes
  5. 3. Alerting
    1. The Challenge
    2. Prerequisites
      1. Monitoring and Alerting Platform
      2. Audit Trail
      3. Issue Tracking
    3. Understanding Failure and Its Impact
      1. Establishing Significance
      2. Identifying Causes
    4. Anatomy of an Alarm
      1. Boolean Function
      2. Suppression
      3. Aggregation
    5. Case Study: A Data Pipeline
    6. Types of Alerts
    7. Setting Up Alarms
      1. Identifying Impact
      2. Establishing Severity
      3. Picking the Right Timeseries
      4. Configuring Monitors
      5. Setting Up Alarms
      6. Testing Alerting Configurations
    8. Alerting Suggestions
  6. 4. At Scale
    1. Implications of Scale
    2. Composition of Large-Scale Systems
    3. Commonalities of Large-Scale Alerting Configurations
    4. Monitoring Coverage
      1. Reflecting Dimensions in Metrics
    5. Managing Large Alerting Configurations
      1. Addressing the Problems
      2. Suggested Solution
      3. Result
  7. 5. Monitoring in System Automation
    1. Choosing Appropriate Maintenance Times Automatically
    2. Controlling the Rate of Upgrade
    3. Recovery-Oriented Admission Control
    4. Automated Deployment and Rollback
  8. 6. The Work Environment
    1. Keeping an Audit Trail
    2. Working with Tickets
      1. Root Cause Analysis
    3. Dealing with Anomalies
    4. Learning from Outages
    5. Using Checklists
    6. Creating Dashboards
    7. Service-Level Agreements
    8. Preventing the Ironies of Automation
    9. Culture
  9. 7. Measuring Success
    1. The Feedback Loop
      1. Root Cause Classification
      2. Timing
    2. Ticket Reporting
      1. Frequency of Incidence
      2. Incidence Times
      3. Time to Respond and Time to Resolution
    3. Measuring Detectability
      1. False Positives and False Negatives
      2. Precision and Recall
      3. The F-Measure
    4. Transition to Automated Alarms
    5. Maintenance Overhead
    6. How (Not) to Measure
  10. 8. The Principles
    1. Get in the Habit of Measuring
    2. Draw Conclusions Reliably
    3. Monitor Extensively
    4. Alarm Selectively
    5. Work Smart, Not Hard
      1. Learn from the Experience of Others
      2. Have a Tactic
      3. Run a Bank of Cases
      4. Enjoy the Process
  11. A. Setting Up OpenTSDB
    1. The Software
      1. Architecture
      2. Getting OpenTSDB
    2. First Steps
      1. Starting TSD
      2. Pushing Data
      3. Input Tagging
      4. Temporal Aggregation
      5. Summary Statistics
      6. Rate of Change
    3. Gathering Data System-Wide
      1. Running tcollector
      2. Writing a Custom Collector
    4. Timeseries Plots
      1. Plotting Tips
    5. Get Involved
  12. About the Author
  13. Copyright

Chapter 2. Monitoring

Some benefits of monitoring are immediate, such as early detection, evidence-based decision making, and automation. But its full value extends beyond that. Monitoring plays a central role in the absorption of job knowledge and driving innovation. You can’t manage what you don’t measure. A widely deployed monitoring solution keeps everyone on the same page. Timeseries plots allow for the exchange of complex ideas that would otherwise take a thousand words. Monitoring adds great value to the system and helps to foster the culture of rapid and informed learning.

The Building Blocks

The main purpose of monitoring is to gain near real-time insight into the current state of the system, in the context of its recent performance. The extracted information helps to answer many important questions, assists in the verification of nonstandard behavior, lets you drill down for more information on an issue that has been reported, and helps you estimate the capacity of the system. Before I move on to discussing all these useful aspects, I think it will help if I discuss the fundamental building blocks of a system from the bottom up.

Data Collection

The process of monitoring starts with gathering data by collection agents, specialized software programs running on monitored entities such as hosts, databases, or network devices. Agents capture meaningful system information, encapsulate it into quantitative data inputs, and then report these data inputs to the monitoring system at ...

The best content for your career. Discover unlimited learning on demand for around $1/day.