Running a Postmortem

The first thing to do when you start a postmortem is lay down the ground rules. Tell the participants that the postmortem is not about assigning blame; its primary purpose is to prevent the incident, or similar incidents, from recurring. Incidents will occur on fast-moving Internet sites. What is important is that we learn from our mistakes.

Start by getting the facts concerning root cause, stabilizing steps, and timeline. Agreeing on these facts first is necessary for a productive discussion of corrective actions, and it should calm the nerves of people who might be afraid that the meeting will turn into a witch hunt.

Once the facts are straight, start to discuss what can be done to keep the incident from happening again. Make sure you address the root cause, but also look for ways to fix the problem faster (lower the TTR). The remediation stage should also consider potential similar incidents. If you ran out of capacity on one set of servers, for example, plan remediation that adds capacity to those servers and also investigates other servers for similar weaknesses.
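
The capacity example above can be turned into a simple check across every set of servers, not only the one that failed. The following is a minimal sketch in Python under stated assumptions: the `ServerGroup` record, the `groups_at_risk` function, and the 80 percent threshold are all hypothetical names and values chosen for illustration, and the load figures would have to come from whatever monitoring data you already collect.

```python
from dataclasses import dataclass


@dataclass
class ServerGroup:
    """Hypothetical record for one set of servers."""
    name: str
    capacity: float   # maximum sustainable load, in any consistent unit
    peak_load: float  # observed peak load, in the same unit


def groups_at_risk(groups, threshold=0.8):
    """Return (name, utilization) pairs for groups whose peak utilization
    is at or above the threshold, most heavily loaded first.

    The 0.8 default is an assumed starting point, not a recommendation.
    """
    flagged = []
    for group in groups:
        if group.capacity <= 0:
            raise ValueError(f"capacity must be positive for {group.name}")
        utilization = group.peak_load / group.capacity
        if utilization >= threshold:
            flagged.append((group.name, utilization))
    # Sort so the groups closest to (or past) their limit come first.
    return sorted(flagged, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    fleet = [
        ServerGroup("web", capacity=100, peak_load=100),
        ServerGroup("db", capacity=200, peak_load=170),
        ServerGroup("cache", capacity=50, peak_load=20),
    ]
    for name, utilization in groups_at_risk(fleet):
        print(f"{name}: {utilization:.0%} of capacity at peak")
```

A list like this gives the postmortem a concrete remediation item for each group that shares the weakness, rather than only for the servers that actually ran out of capacity.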

Avoid personal attacks. Humans make mistakes, and if a human mistake is what caused the incident, move on and look at how the human element can be made more fail-safe. Look for automation opportunities, better planning, or simplification of the process.

Make sure participants are coming up with remediations for their own areas. Sometimes people will try to blame other groups or individuals. ...
