Recovering the system

One thing we have yet to touch on is how to debug your application and do the technical work of responding to an incident. This is the third of the three pillars we mentioned when defining incident response. We were alerted that things were not great. We communicated that we were on the case. Now we need to make things better.

How do we do that? We will talk about measuring mean time to recovery (MTTR) in Chapter 4, Postmortems, but the strategy we kept mentioning earlier in this chapter was bringing the system back to a working state. That's because you don't necessarily want to go into bug-hunting mode immediately. Instead, you want to find what has changed in the system and revert it. Let ...
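To make the "find what changed and revert" idea concrete, here is a minimal Python sketch. It assumes you keep some record of changes (deploys, config edits, flag flips) with timestamps. The `Change` record, the `rollback_candidates` function, and the six-hour lookback window are all hypothetical names and values chosen for illustration, not something prescribed by this chapter.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Change:
    """One recorded change to the system (hypothetical record shape)."""
    when: datetime
    kind: str          # e.g. "deploy", "config", "feature-flag"
    description: str


def rollback_candidates(changes, incident_start, lookback=timedelta(hours=6)):
    """Return changes made shortly before the incident, newest first.

    The lookback window is an assumption; pick one that matches how
    often your system actually changes.
    """
    window_start = incident_start - lookback
    recent = [c for c in changes if window_start <= c.when <= incident_start]
    # The newest change is usually the first thing worth reverting.
    return sorted(recent, key=lambda c: c.when, reverse=True)


if __name__ == "__main__":
    incident_start = datetime(2024, 1, 15, 14, 30)
    change_log = [
        Change(datetime(2024, 1, 13, 9, 0), "deploy", "api v41"),
        Change(datetime(2024, 1, 15, 11, 45), "config", "raise cache TTL"),
        Change(datetime(2024, 1, 15, 14, 20), "deploy", "api v42"),
    ]
    for change in rollback_candidates(change_log, incident_start):
        print(change.when.isoformat(), change.kind, change.description)
```

The point of the sketch is the ordering: during an incident you look at the most recent changes first and consider reverting them before you start hunting for the underlying bug.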
