Chapter 15. Disaster Preparedness

Failure is not falling down but refusing to get back up.

—Theodore Roosevelt

Disasters and major outages happen. Everyone in the company from the top down needs to recognize that fact and adopt a mindset that accepts outages and learns from them. An operations organization needs to be able to handle outages well and avoid repeating past mistakes.

Earlier chapters examined technology for staying resilient to failures and outages, as well as organizational strategies such as oncall. In this chapter we discuss disaster preparedness at the individual, team, procedural, and organizational levels. People must be trained until they know the procedures well enough to execute them with confidence. Teams need ...

