Appendix F. Example Production Meeting Minutes
Date: 2015-10-23
Attendees: agoogler, clarac, docbrown, jennifer, martym
Announcements:
- Major outage (#465), blew through error budget
Previous Action Item Review
- Certify Goat Teleporter for use with cattle (bug 1011101)
  - Nonlinearities in mass acceleration now predictable, should be able to target accurately in a few days.
Outage Review
- New Sonnet (outage 465)
  - 1.21B queries lost due to cascading failure after interaction between latent bug (leaked file descriptor on searches with no results) + not having new sonnet in corpus + unprecedented & unexpected traffic volume
  - File descriptor leak bug fixed (bug 5554825) and deployed to prod
  - Looking into using flux capacitor for load balancing (bug 5554823) and using load shedding (bug 5554826) to prevent recurrence
  - Annihilated availability error budget; pushes to prod frozen for 1 month unless docbrown can obtain exception on grounds that event was bizarre & unforeseeable (but consensus is that exception is unlikely)
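Note on load shedding (bug 5554826): the minutes do not say how it would be implemented. As a minimal sketch of the general idea only, the Python below rejects new requests once a fixed number are already in flight, so excess traffic fails fast instead of piling up. The names `LoadShedder`, `max_in_flight`, and `handle` are hypothetical and do not come from the minutes or from any real service.

```python
import threading


class LoadShedder:
    """Rejects new requests once too many are already in flight (illustrative only)."""

    def __init__(self, max_in_flight):
        self.max_in_flight = max_in_flight  # hypothetical capacity limit
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_acquire(self):
        """Return True if the request may proceed, False if it should be shed."""
        with self._lock:
            if self._in_flight >= self.max_in_flight:
                return False
            self._in_flight += 1
            return True

    def release(self):
        """Mark one admitted request as finished."""
        with self._lock:
            if self._in_flight > 0:
                self._in_flight -= 1


def handle(shedder, work):
    """Run work() if admitted; otherwise return a cheap rejection."""
    if not shedder.try_acquire():
        return "rejected"
    try:
        return work()
    finally:
        shedder.release()
```

A rejected request costs almost nothing to serve, which is what lets the remaining capacity keep answering admitted traffic during a spike.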
Paging Events
- AnnotationConsistencyTooEventual: paged 5 times this week, likely due to cross-regional replication delay between Bigtables.
  - Investigation still ongoing, see bug 4821600
  - No fix expected soon, will raise acceptable consistency threshold to reduce unactionable alerts
Nonpaging Events
- None
Monitoring Changes and/or Silences
- AnnotationConsistencyTooEventual, acceptable delay threshold raised from 60s to 180s, see bug 4821600; ...
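Note on the threshold change: the effect of moving from 60s to 180s can be sketched as a simple comparison. This is an illustration, not the actual alert definition. The function name `should_page` is hypothetical, and the strict greater-than comparison is an assumption, since the minutes do not say how the alert evaluates the delay.

```python
OLD_THRESHOLD_S = 60   # previous acceptable delay, from the minutes
NEW_THRESHOLD_S = 180  # new acceptable delay, from the minutes


def should_page(replication_delay_s, threshold_s=NEW_THRESHOLD_S):
    """Return True when the observed delay exceeds the acceptable threshold.

    Assumption: the alert fires on a strictly greater-than comparison.
    """
    return replication_delay_s > threshold_s
```

Under this sketch, a delay between the two thresholds would have paged under the old setting and no longer pages under the new one.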