Conclusion

In the end, postmortems are the most useful way to avoid repeat incidents. It's understandable for an incident to happen once in a fast-paced environment, but there is no excuse for it to happen again. Taking the time to clearly understand the facts of the issue, and then deciding on, recording, and implementing high-impact remediation items, is what keeps an incident from recurring.

Here's how the "worst postmortem" example might have played out if these guidelines were followed:

The VP opens by saying, "Thank you for resolving the incident last night. The point of this postmortem is to understand exactly what happened and identify the best ways to make sure it doesn't happen again. In our business, incidents will occasionally occur, but it is our job to make sure they don't repeat." The manager then pieces together a detailed timeline based on the actions of each participant in the incident. The group discovers that the incident was originally thought to have been caused by a network issue, but the cause was later identified as a bad code push, which was fixed by rolling back. Once all the facts are straight, the team spends its time brainstorming remediations that both prevent the incident from repeating and reduce the TTR of future incidents. The remediations are then prioritized by weighing their ability to reduce the likelihood of future incidents against the level of effort they require. High-priority remediations are then assigned owners who are responsible for allocating resources ...
