When you reach for the stars you may not quite get one, but you won’t come up with a handful of mud either.
I have participated in an on-call pager rotation for over 10 years, and I have never heard anyone say that they have a goal of zero alerts. It’s always been accepted that a week on-call will be anywhere between “alright” and “terrible”. It’s normal in this field to be woken by an event that you know is transient and will resolve itself shortly, so you do nothing and go back to bed. It’s normal in this field to monitor and alert on anything and everything, as if all events had the same level of urgency. And so we paint ourselves into a corner: we ignore the alerts because they are mostly noise and not actionable, but we are too afraid of missing a critical event to turn down the noise and adjust our thresholds.
That’s why I was excited when I saw a talk by Ryan Frantz and Laurie Denness at Velocity Conference 2014 in Santa Clara, and the corresponding blog post about measuring the on-call experience, because once you measure a problem, you then have a chance of making it better.
Why does this matter?
Because being on-call can be an awful experience. I firmly believe that a poor work life will lead to a poor home life: if you have a crummy day at work, you bring that stress home with you, and it has a negative impact on the people who really matter in your life (friends, family). Being on-call pulls you away from the things that really matter.
Because being on-call interrupts your sleep. Lack of sleep can negatively impact your mood and productivity. We have a name for this: “Pager Brain”, where you just can’t think clearly because you are so sleep deprived.
Because being on-call interrupts your work while you are at work. When you are concentrating on a problem, the worst thing that can happen is to be distracted and have to change contexts; getting your head back into the problem afterwards wastes time.
Why a goal of zero alerts?
Because we have to have some goal, some metric that we can measure to know if we are making things better. I don’t really expect us to get to zero, but I do expect the act of aiming for that goal to improve our overall number, and to improve the lives of those on-call people on our team.
What are we doing to make it better?
Weekly Pager Hand-off meeting.
All the on-call people on our team meet once a week for thirty minutes. Before we rotate the on-call person, we review the alert events that happened during the week: which events were actionable and which were not. Every week we ask “what actions can we take to make this better?” Here we find the events that are not actionable and adjust their alerting thresholds or remove them. This gives everyone in the group an idea of what normal means for being on-call. We can also catch alerting trends that span several on-call weeks, which an individual with just a one-week vantage cannot recognize.
We use OpsWeekly from Etsy to categorize our alert events. OpsWeekly allows us to categorize events based on whether action was taken or not. Ideally we want to see the number of actionable alerts go up, and the number of non-actionable alerts go down. The alert events that reach the on-call person should all mean something; they should be real, actionable events. Below you can see our trend for the last few months. OpsWeekly also allows us to measure which hosts are the most problematic, and which services are alerting the most. With this data we can focus our energies on fixing the big problems that will garner the most return.
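To make the idea concrete, here is a rough sketch of the kind of tally we look at in the hand-off meeting. It is not OpsWeekly's code or data model; the `Alert` record, the `summarize` function, and the host and service names are all made up for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    """Hypothetical record of one alert that reached the on-call person."""
    host: str
    service: str
    actionable: bool  # True if the on-call person had to take action


def summarize(alerts):
    """Count actionable vs. non-actionable alerts and rank the noisiest sources."""
    total = len(alerts)
    actionable = sum(1 for a in alerts if a.actionable)
    noise = [a for a in alerts if not a.actionable]
    return {
        "total": total,
        "actionable": actionable,
        "non_actionable": total - actionable,
        # Share of alerts that were worth waking someone for.
        "actionable_ratio": actionable / total if total else 0.0,
        # Hosts and services producing the most non-actionable alerts.
        "noisiest_hosts": Counter(a.host for a in noise).most_common(3),
        "noisiest_services": Counter(a.service for a in noise).most_common(3),
    }


# Made-up example week, not real data.
week = [
    Alert("web01", "disk", False),
    Alert("web01", "disk", False),
    Alert("db01", "replication", True),
    Alert("web02", "load", False),
]
report = summarize(week)
print(report)
```

In this made-up week, the disk check on `web01` is the obvious candidate for a threshold change, which is exactly the sort of thing the meeting is meant to surface.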
To measure sleep and the impact a waking event can have, we use the Jawbone UP24 band. We can see if the on-call person had a rough night and adjust our schedules to provide support and rest for the on-call person. Below you can see an example of one of my on-call nights. What I’ve learned from tracking my sleep over weeks of on-call is that a single waking event will cost me at least one hour of sleep, even if I spend only five minutes addressing the problem.
We use PagerDuty to leverage our teammates across time zones. We use schedule overrides to assist the on-call person after a rough night, to help them get some more sleep. In the early morning our East Coast team members are on-call while our West Coast team gets extra sleep, and in the evening our West Coast team are on-call while the East Coast folks commute home. We schedule overrides in the on-call schedule for when life happens: appointments, or brief times when trying to handle an alert yourself would be inappropriate. We plan on using overrides to share the on-call schedule during the upcoming holidays. No longer will one person get the burden of an entire week on-call during the winter holidays.
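The logic of an override is simple enough to sketch in a few lines. This is only a toy model of the concept, not PagerDuty's API; the `Override` type, the `on_call_at` function, the names, and the dates are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Override:
    """Hypothetical override: `person` holds the pager from start up to (not including) end."""
    start: datetime
    end: datetime
    person: str


def on_call_at(moment, primary, overrides):
    """Return who holds the pager at `moment`; the first matching override wins."""
    for override in overrides:
        if override.start <= moment < override.end:
            return override.person
    # No override covers this moment, so the weekly primary is on-call.
    return primary


# Made-up example: an East Coast teammate covers the early morning
# so that the West Coast primary can sleep in after a rough night.
overrides = [
    Override(datetime(2024, 1, 8, 5, 0), datetime(2024, 1, 8, 9, 0), "east_coast_teammate"),
]

print(on_call_at(datetime(2024, 1, 8, 6, 0), "west_coast_primary", overrides))
print(on_call_at(datetime(2024, 1, 8, 10, 0), "west_coast_primary", overrides))
```

The point is that the weekly rotation stays the default, and overrides are small, explicit exceptions layered on top of it.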
So what have the last couple of months been like?
Over the last few months the number of critical alerts that reach the on-call people has gone down, and even our spikes are less prominent. We’re still not at zero, and I dream of a day when we go weeks without getting a waking event and we don’t need to have the pager hand-off meeting, because there is nothing to talk about.