APM Best Practices: Realizing Application Performance Management

Book Description

The objective of APM Best Practices is to establish reliable Application Performance Management practices: to demonstrate value, to do it quickly, and to adapt to the client's circumstances. This means balancing long-term goals with short-term deliverables without compromising usefulness or correctness. The successful strategy is to establish a few reasonable goals, achieve them quickly, and then iterate over the same topics two more times, with each successive iteration expanding the skills and capabilities of the APM team. This strategy is referred to as "Good, Better, Best". The expanding capabilities of the team are immortalized as milestones toward achieving a Catalog of Services that becomes the interface with your application and business sponsors.

The Application Performance Monitoring marketplace is very focused on ease of installation, rapid time to usefulness, and overall ease of use. But these worthy platitudes do not really address the Application Performance Management processes that ensure you will deploy effectively, coordinate on QA test plans, triage accurately, and encourage collaboration across the application life cycle, which ultimately lowers overall application cost and ensures a quality user experience. These are also fine platitudes, but they are the ones that interest your application sponsors, and the ones for which you need to show value. So this book employs the iterative approach, adapted pragmatically to the realities of your organizational and operational constraints, to realize a future state that your sponsors will find useful, predictable, and manageable, and that they will want to fund. Along the way, you will learn the techniques needed to set up and maintain a useful performance management system using best practices, regardless of the software provider(s).

Table of Contents

  1. Copyright
  2. About the Author
  3. Acknowledgments
  4. Introduction
    1. Vendor Neutrality
    2. APM Starts Here
    3. Organization of this Book
      1. Introduction to Part 1: Planning
      2. Introduction to Part 2: Implementation
      3. Introduction to Part 3: Practitioner's Guide
  5. 1. Getting Started with APM
    1. 1.1. The Challenge
    2. 1.2. Reference Architecture
      1. 1.2.1. Life Cycles
        1. 1.2.1.1. The Software Development Life Cycle
        2. 1.2.1.2. The Application Life Cycle
        3. 1.2.1.3. The APM Life Cycle
      2. 1.2.2. Organizational Maturity
    3. 1.3. The Five APM Competencies
      1. 1.3.1. Monitoring Architectures
      2. 1.3.2. APM Architecture
      3. 1.3.3. Application Characteristics
      4. 1.3.4. Reporting Characteristics
    4. 1.4. The First Meeting
    5. 1.5. Meeting Themes
      1. 1.5.1. What is APM?
      2. 1.5.2. Realities of Packaged Software Solutions
      3. 1.5.3. How the Monitoring Game has Changed
      4. 1.5.4. Limitations of Availability Monitoring
        1. 1.5.4.1. Enter Two-Tier Client Server and FCAPS
        2. 1.5.4.2. The Need for a New Perspective
        3. 1.5.4.3. Emphasizing Transactions in Addition to Resources
      5. 1.5.5. The Impact of APM Visibility
      6. 1.5.6. Addressing the Visibility Gap
        1. 1.5.6.1. Getting Those New Metrics
        2. 1.5.6.2. Understanding Those New Metrics
        3. 1.5.6.3. Managing Those New Metrics
        4. 1.5.6.4. Exploiting Those New Metrics across the Application Life Cycle
      7. 1.5.7. Demonstrating Value
        1. 1.5.7.1. Good-Better-Best
      8. 1.5.8. Establishing an APM Project
        1. 1.5.8.1. Initial Iteration
        2. 1.5.8.2. Follow-on Iteration
        3. 1.5.8.3. Closing Iteration
    6. 1.6. Summary
  6. I. Planning
    1. 2. Business Justification
      1. 2.1. Justification Is Not ROI
      2. 2.2. Entry Points
      3. 2.3. Initiative Themes
        1. 2.3.1. Availability vs. Performance Monitoring
        2. 2.3.2. Resolving Application Incidents and Outages
        3. 2.3.3. Improving Application Software Quality
        4. 2.3.4. Pre-production Readiness and Deployment
        5. 2.3.5. Managing Service Level Agreements
        6. 2.3.6. Enhancing the Value of the Monitoring Tool Investment
        7. 2.3.7. Proactive Monitoring
        8. 2.3.8. Trending and Analysis
        9. 2.3.9. Single-View of Service Performance (Dashboards)
      4. 2.4. Summary
    2. 3. Assessments
      1. 3.1. Overview
        1. 3.1.1. Visibility
      2. 3.2. Assessment Dimensions
        1. 3.2.1. Application Survey
          1. 3.2.1.1. Survey Dimensions
          2. 3.2.1.2. Business
          3. 3.2.1.3. Environment
          4. 3.2.1.4. Resources
          5. 3.2.1.5. Software Architecture
          6. 3.2.1.6. Existing Monitoring (Optional)
          7. 3.2.1.7. Avoiding Scope-Creep During Deployment Planning
        2. 3.2.2. Skills and Processes
        3. 3.2.3. Incidents
          1. 3.2.3.1. Cookbook
        4. 3.2.4. Stakeholder Interviews
          1. 3.2.4.1. APM Roles
          2. 3.2.4.2. Topics by Stakeholder
          3. 3.2.4.3. Reporting Dimensions
          4. 3.2.4.4. Cookbook
          5. 3.2.4.5. Management Capabilities
            1. 3.2.4.5.1. Cookbook
          6. 3.2.4.6. End-to-End Visibility
            1. 3.2.4.6.1. Cookbook
          7. 3.2.4.7. Life Cycle Visibility
            1. 3.2.4.7.1. Cookbook
          8. 3.2.4.8. Monitoring Tools
            1. 3.2.4.8.1. Executive Summary
            2. 3.2.4.8.2. Cookbook
        5. 3.2.5. Detailed Findings (Alternate Technique)
        6. 3.2.6. Solution Sizing
          1. 3.2.6.1. Synthetic Transactions
          2. 3.2.6.2. Real Transactions
          3. 3.2.6.3. Instrumentation
          4. 3.2.6.4. Cookbook
      3. 3.3. Summarizing Your Findings and Recommendations
        1. 3.3.1.
          1. 3.3.1.1. Cookbook
      4. 3.4. Conclusion
    3. 4. Staffing and Responsibilities
      1. 4.1. The Staffing Question
        1. 4.1.1. Who Has Responsibility for Monitoring?
          1. 4.1.1.1. The APM Expert
        2. 4.1.2. Roles and Responsibilities
          1. 4.1.2.1. APM Administrator
          2. 4.1.2.2. APM Project Manager
          3. 4.1.2.3. Application Specialist
          4. 4.1.2.4. APM Architect
          5. 4.1.2.5. APM Specialist (Triage/Firefighter)
          6. 4.1.2.6. APM Evangelist
        3. 4.1.3. Staffing Strategies
          1. 4.1.3.1. Adding New Personnel
          2. 4.1.3.2. Product Training
          3. 4.1.3.3. Repurposing Existing Personnel
          4. 4.1.3.4. Best Practice Mentoring
          5. 4.1.3.5. Staff Augmentation
      2. 4.2. Staffing an APM Initiative
        1. 4.2.1. What Staffing Is Appropriate for APM Success?
        2. 4.2.2. The FTE Approach
        3. 4.2.3. Evolving Your Organization's Use of APM
        4. 4.2.4. Building a Scalable Monitoring Organization
        5. 4.2.5. Real-World Staffing Evolution Scenarios?
          1. 4.2.5.1. First APM, Accepted but Constrained Budget
          2. 4.2.5.2. Significant APM, Committed but Has Gaps with Current Practice
          3. 4.2.5.3. Significant Existing APM, Recommitted but Constrained Budget
          4. 4.2.5.4. Significant Existing APM, Recommitted and Budgeted for APM Practice
          5. 4.2.5.5. Significant Existing APM, Retrenching, Maintain Existing Footprint Only
      3. 4.3. Summary
    4. 5. APM Patterns
      1. 5.1. Processes, Skills, and Competencies for APM
      2. 5.2. Demonstrating Value
      3. 5.3. Management Maturity
      4. 5.4. Deployment Scenarios and Priorities
        1. 5.4.1. Small Deployment Footprint Scenario
        2. 5.4.2. Multiple Independent Initiatives Scenario
        3. 5.4.3. Service Bureau for a Line of Business Scenario
        4. 5.4.4. Center of Excellence for a Corporation Scenario
      5. 5.5. Defining Your Services Catalog
        1. 5.5.1. Why is a Services Catalog Such a Valuable Strategy?
        2. 5.5.2. Assessing Role Competency
      6. 5.6. The Last APM Pattern
      7. 5.7. Cookbook
      8. 5.8. Summary
    5. 6. The Pilot Evaluation
      1. 6.1. The Pilot Evaluation
        1. 6.1.1. Participation
        2. 6.1.2. Goals
          1. 6.1.2.1. Connectivity
          2. 6.1.2.2. Platform Suitability
          3. 6.1.2.3. Application Suitability
          4. 6.1.2.4. Pre-Production Visibility
          5. 6.1.2.5. Production Visibility
        3. 6.1.3. Criteria
          1. 6.1.3.1. Compatible with your Environment
          2. 6.1.3.2. Ease of Initial Installation
          3. 6.1.3.3. Flexibility of Technology Configuration
          4. 6.1.3.4. Assessing Overhead
          5. 6.1.3.5. Assessing Deployability and Scalability
          6. 6.1.3.6. Solution Certification
        4. 6.1.4. Cookbook
          1. 6.1.4.1. Planning Phase
          2. 6.1.4.2. Scope Document
          3. 6.1.4.3. Implementation
          4. 6.1.4.4. Review
      2. 6.2. Summary
  7. II. Implementation
    1. 7. Deployment Strategies
      1. 7.1. Stand-Alone
      2. 7.2. Phased Deployments
        1. 7.2.1.
          1. 7.2.1.1. Preproduction
          2. 7.2.1.2. Preproduction Review
          3. 7.2.1.3. Operations
      3. 7.3. Realizing QA and Triage
        1. 7.3.1. Understanding Preproduction
        2. 7.3.2. The Problem with Preproduction
          1. 7.3.2.1. Repurposing QA Equipment for Production
        3. 7.3.3. Evolving QA and Triage
          1. 7.3.3.1. Acceptance Criteria
          2. 7.3.3.2. Triage
        4. 7.3.4. Evolving Management Capabilities
          1. 7.3.4.1. Reactive Management
          2. 7.3.4.2. Reactive Alerting
          3. 7.3.4.3. Predictive/Directed Management
          4. 7.3.4.4. Proactive Management
      4. 7.4. Deployment Necessities
        1. 7.4.1. Kickoff Meeting
        2. 7.4.2. Phased Deployment Schedule
        3. 7.4.3. Preproduction Review
      5. 7.5. Postproduction Review
      6. 7.6. Install/Update Validation
        1. 7.6.1. Call Stack Visibility
        2. 7.6.2. Transaction Definition
      7. 7.7. Updates
        1. 7.7.1. Alternate Agent Configurations and Versions
      8. 7.8. Alert Integration
      9. 7.9. Summary
    2. 8. Essential Processes
      1. 8.1. Monitoring Runbook
      2. 8.2. Pre-Production Practice
      3. 8.3. Operations
      4. 8.4. Improving Software Quality
        1. 8.4.1.
          1. 8.4.1.1. Define Acceptance Criteria
          2. 8.4.1.2. Audit the Application
          3. 8.4.1.3. Onboard the Application
          4. 8.4.1.4. Supervise the Third Party Testing Process
          5. 8.4.1.5. Empower the Third Party to Deliver Proactive Performance Management
      5. 8.5. Summary
    3. 9. Essential Service Capabilities
      1. 9.1. Triage
        1. 9.1.1.
          1. 9.1.1.1. Metrics Storage Dashboard
          2. 9.1.1.2. Pilot Findings
          3. 9.1.1.3. Metrics Archive Triage and Presentation
          4. 9.1.1.4. Baselines
          5. 9.1.1.5. Characterization
          6. 9.1.1.6. Cookbook
      2. 9.2. Application Audit
        1. 9.2.1. Why Audits Matter
      3. 9.3. Pre-production Review
      4. 9.4. Capacity Management of APM Metrics
      5. 9.5. Solution Sizing
      6. 9.6. Capacity Forecast
        1. 9.6.1. Application
        2. 9.6.2. Monitoring Environment
      7. 9.7. Summary
  8. III. Practitioners
    1. 10. Solution Sizing
      1. 10.1. Kick Off Meeting
        1. 10.1.1. Solution Architecture
          1. 10.1.1.1.
            1. 10.1.1.1.1. Stand-alone
            2. 10.1.1.1.2. Failover
            3. 10.1.1.1.3. Federated
          2. 10.1.1.2. Metrics Storage and Agent Capacities and Realities
        2. 10.1.2. Metrics Storage Sizing
          1. 10.1.2.1. Application Realities
            1. 10.1.2.1.1. Not everything needs instrumentation monitoring
            2. 10.1.2.1.2. Not all apps need detailed application monitoring.
            3. 10.1.2.1.3. Some apps need KPIs.
            4. 10.1.2.1.4. Not all apps function the same way.
            5. 10.1.2.1.5. Most apps have different monitoring requirements across the application life cycle.
          2. 10.1.2.2. Sizing Attributes
        3. 10.1.3. Deployment Strategy
          1. 10.1.3.1. Ideal APM Architecture
        4. 10.1.4. Monitoring the Metrics Storage
          1. 10.1.4.1. Why do MS Components Fail?
        5. 10.1.5. Communicating Sizing Results and Recommendations
        6. 10.1.6. Process Considerations
        7. 10.1.7. Deployment Survey and Sizing
        8. 10.1.8. Solution Certification
          1. 10.1.8.1. Test Architecture
          2. 10.1.8.2. Load Profiles
          3. 10.1.8.3. Reporting
          4. 10.1.8.4. Analysis
      2. 10.2. Competency
      3. 10.3. Artifacts
      4. 10.4. Summary
    2. 11. Load Generation
      1. 11.1. Kick Off Meeting
        1. 11.1.1. Why Simulate Load?
        2. 11.1.2. Types of Testing
          1. 11.1.2.1. Test Anti-patterns
          2. 11.1.2.2. Test Evolution
            1. 11.1.2.2.1. Maximum Load Considerations
            2. 11.1.2.2.2. Defining a Test Plan
            3. 11.1.2.2.3. User Population
          3. 11.1.2.3. Accuracy and Reproducibility
      2. 11.2. Process
        1. 11.2.1. Test Plans
          1. 11.2.1.1. Test Evolution—Gold Configuration (sequence of configurations)
            1. 11.2.1.1.1. Test Plan Cookbook
            2. 11.2.1.1.2. Gold Configuration Review
            3. 11.2.1.1.3. Load Reproducibility Review
          2. 11.2.1.2. Application Baseline
            1. 11.2.1.2.1. Heartbeat Metrics
          3. 11.2.1.3. Performance Baseline
          4. 11.2.1.4. Load Reproducibility Analysis
          5. 11.2.1.5. General Acceptance Criteria
      3. 11.3. Competency
      4. 11.4. Artifacts
      5. 11.5. Summary
    3. 12. Baselines
      1. 12.1. Kick-Off Meeting
        1. 12.1.1. Terms of Endearment
          1. 12.1.1.1. Configuration Tuning
          2. 12.1.1.2. Baseline
          3. 12.1.1.3. Configuration Baseline
          4. 12.1.1.4. Application Baseline
          5. 12.1.1.5. Performance Baseline
          6. 12.1.1.6. Capacity Planning
          7. 12.1.1.7. Capacity Forecast
        2. 12.1.2. The Fab Four
          1. 12.1.2.1. Reporting
            1. 12.1.2.1.1. Frequency
            2. 12.1.2.1.2. Baseline
            3. 12.1.2.1.3. HealthCheck
            4. 12.1.2.1.4. Summary
      2. 12.2. Process
        1. 12.2.1. Collecting Baselines
          1. 12.2.1.1. Configuration
          2. 12.2.1.2. Application
          3. 12.2.1.3. Performance
      3. 12.3. Competency
      4. 12.4. Summary
    4. 13. The Application Audit
      1. 13.1. Kick Off Meeting
        1. 13.1.1. Compare and Contrast
          1. 13.1.1.1. What to Look for
            1. 13.1.1.1.1. GC/Heap
            2. 13.1.1.1.2. EJB, Servlet, JSP
            3. 13.1.1.1.3. Concurrency
            4. 13.1.1.1.4. Stalls
            5. 13.1.1.1.5. Optional
          2. 13.1.1.2. Auditing Pre-Production
          3. 13.1.1.3. Auditing Operationally
        2. 13.1.2. Configuration Baseline
          1. 13.1.2.1. Project Scope
          2. 13.1.2.2. Test Plan
          3. 13.1.2.3. Analysis of Results
          4. 13.1.2.4. Memory Utilization
          5. 13.1.2.5. CPU Utilization
          6. 13.1.2.6. I/O Throughput
          7. 13.1.2.7. I/O Contention
          8. 13.1.2.8. Conclusions
        3. 13.1.3. HealthCheck
        4. 13.1.4. Dashboard and Alerts
          1. 13.1.4.1.
            1. 13.1.4.1.1. Managing Alerts
          2. 13.1.4.2. Getting Production Baselines
      2. 13.2. Process
        1. 13.2.1. Scope
          1. 13.2.1.1. Necessities
        2. 13.2.2. Load Generation
          1. 13.2.2.1. Production-Only Situation
        3. 13.2.3. Transaction Definition
        4. 13.2.4. Acceptance Criteria
        5. 13.2.5. Reporting
      3. 13.3. Competency
      4. 13.4. Summary
    5. 14. Triage with Single Metrics
      1. 14.1. Triage with Single Metrics
        1. 14.1.1.
          1. 14.1.1.1. The Program
      2. 14.2. Kick Off Meeting
        1. 14.2.1. Motivation
        2. 14.2.2. Why Triage is Difficult to Master
          1. 14.2.2.1. Triage with Single Metrics
            1. 14.2.2.1.1. Navigation Among Single Metrics
            2. 14.2.2.1.2. Organization via Metrics Collections
            3. 14.2.2.1.3. Presentation via Dashboards
            4. 14.2.2.1.4. Presentation via Reports
          2. 14.2.2.2. Metric Categories
            1. 14.2.2.2.1. Memory Profile
            2. 14.2.2.2.2. Response Times
            3. 14.2.2.2.3. Concurrency
            4. 14.2.2.2.4. Stalls
            5. 14.2.2.2.5. SQL
      3. 14.3. Process
        1. 14.3.1. Scope
        2. 14.3.2. Triage with Single Metrics
          1. 14.3.2.1. Single Incident
          2. 14.3.2.2. Performance Test
            1. 14.3.2.2.1. Initial
            2. 14.3.2.2.2. Baseline Report Available
          3. 14.3.2.3. HealthCheck
            1. 14.3.2.3.1. QA HealthCheck
            2. 14.3.2.3.2. Production HealthCheck
      4. 14.4. Competency
      5. 14.5. Artifacts
        1. 14.5.1. Triage Scope Template
      6. 14.6. Summary
    6. 15. Triage with Baselines
      1. 15.1. Kick Off Meeting
        1. 15.1.1. Motivation
        2. 15.1.2. Triage with Baselines
          1. 15.1.2.1. Reporting Conventions
          2. 15.1.2.2. Achieving Consistent Results
          3. 15.1.2.3. Component Analysis
          4. 15.1.2.4. Initial Automation
      2. 15.2. Process
        1. 15.2.1. Triage with Baselines
      3. 15.3. Summary
    7. 16. Triage with Trends
      1. 16.1. Kick Off Meeting
        1. 16.1.1. Motivation
          1. 16.1.1.1. Types of Trends: Correlation
          2. 16.1.1.2. Trending Scenario
          3. 16.1.1.3. Preparing for Triage with Trends
            1. 16.1.1.3.1. Data Integration Strategies
            2. 16.1.1.3.2. Extending Visibility
            3. 16.1.1.3.3. Reporting Integration
          4. 16.1.1.4. The Future of Monitoring Integration
            1. 16.1.1.4.1. Integration Strategies
            2. 16.1.1.4.2. APM-Specific Applications for CMDB
          5. 16.1.1.5. Root-cause Analysis
          6. 16.1.1.6. Implementing the Fix
          7. 16.1.1.7. Service Level Management
            1. 16.1.1.7.1. Defining the SLA
      2. 16.2. Process
        1. 16.2.1. Triage with Trends
      3. 16.3. Competency
      4. 16.4. Artifacts
      5. 16.5. Summary
    8. 17. Firefighting and Critical Situations
      1. 17.1. Kick Off Meeting
        1. 17.1.1. Firefighter, Smoke-jumpers, Triage
        2. 17.1.2. Before You Jump
          1. 17.1.2.1. Do You Have the Right Tools?
          2. 17.1.2.2. Do You Have Potential Data?
          3. 17.1.2.3. Do You Have an Appropriate Environment?
          4. 17.1.2.4. Have You Set Appropriate Expectations?
        3. 17.1.3. Why Do Applications Fail?
        4. 17.1.4. What May Firefighting Achieve?
        5. 17.1.5. Resistance Sounds Like...
        6. 17.1.6. Success Sounds Like...
        7. 17.1.7. Firefighting is Temporary
      2. 17.2. Process
        1. 17.2.1. Firefighting Process Documentation
          1. 17.2.1.1. Scope Document
          2. 17.2.1.2. Scope Template
          3. 17.2.1.3. Rules of Engagement (internal to the firefight organization)
        2. 17.2.2. Forensic Analysis
        3. 17.2.3. Establishing the Firefighting Discipline
          1. 17.2.3.1. Get Visibility
            1. 17.2.3.1.1. Checklist
          2. 17.2.3.2. Audit the Production Experience
          3. 17.2.3.3. Audit the QA Experience with Recommendations
      3. 17.3. Competency
        1. 17.3.1. Rapid Deployment
        2. 17.3.2. Analysis and Presentation
        3. 17.3.3. Application Audit
        4. 17.3.4. Performance Tuning
      4. 17.4. Summary