Appendix C. Example Incident State Document
Shakespeare Sonnet++ Overload: 2015-10-21
Incident management info: http://incident-management-cheat-sheet
(Communications lead to keep summary updated.)
Summary: Shakespeare search service in cascading failure due to newly discovered sonnet not in search index.
Status: active, incident #465
Command Post(s): #shakespeare on IRC
Command Hierarchy (all responders)
- Current Incident Commander: jennifer
  - Operations lead: docbrown
  - Planning lead: jennifer
  - Communications lead: jennifer
- Next Incident Commander: to be determined
(Update at least every four hours and at handoff of Comms Lead role.)
Detailed Status (last updated at 2015-10-21 15:28 UTC by jennifer)
Exit Criteria:
- New sonnet added to Shakespeare search corpus TODO
- Within availability (99.99%) and latency (99%ile < 100 ms) SLOs for 30+ minutes TODO
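The second exit criterion can be checked mechanically. The following is a minimal sketch, not part of the original incident document: it assumes metrics arrive as one sample per minute and interprets "30+ minutes" as the 30 most recent consecutive samples all being within SLO. The names `MinuteSample`, `within_slo`, and `exit_criterion_met` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MinuteSample:
    """One minute of serving metrics (hypothetical shape, assumed for illustration)."""
    total_requests: int
    failed_requests: int
    latencies_ms: List[float]


def percentile(values: List[float], pct: int) -> float:
    """Nearest-rank percentile for an integer pct in 1..100."""
    ordered = sorted(values)
    # Ceiling of pct * n / 100, computed with integers to avoid float rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


def within_slo(sample: MinuteSample,
               availability_target: float = 0.9999,
               latency_limit_ms: float = 100.0) -> bool:
    """True if this minute meets both the availability and the latency SLO."""
    if sample.total_requests == 0 or not sample.latencies_ms:
        # Assumption: a minute with no traffic does not count toward recovery.
        return False
    availability = 1 - sample.failed_requests / sample.total_requests
    p99 = percentile(sample.latencies_ms, 99)
    return availability >= availability_target and p99 < latency_limit_ms


def exit_criterion_met(samples: List[MinuteSample], window_minutes: int = 30) -> bool:
    """True if the most recent `window_minutes` samples are all within SLO."""
    if len(samples) < window_minutes:
        return False
    return all(within_slo(s) for s in samples[-window_minutes:])
```

A single out-of-SLO minute inside the window keeps the criterion unmet, so the item stays TODO until 30 clean minutes have accumulated.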
TODO list and bugs filed:
- Run MapReduce job to reindex Shakespeare corpus DONE
- Borrow emergency resources to bring up extra capacity DONE
- Enable flux capacitor to balance load between clusters (Bug 5554823) TODO
Incident timeline (most recent first: times are in UTC)
- 2015-10-21 15:28 UTC jennifer
  - Increasing serving capacity globally by 2x
- 2015-10-21 15:21 UTC jennifer
  - Directing all traffic to USA-2 sacrificial cluster and draining traffic from other clusters so they can recover from cascading failure while spinning up more tasks
  - MapReduce index job complete, awaiting Bigtable replication to all clusters ...