  • Brian Cunningham thinks this is interesting:

To avoid this fate, the team tasked with managing a service needs to code or it will drown. Therefore, Google places a 50% cap on the aggregate “ops” work for all SREs—tickets, on-call, manual tasks, etc. This cap ensures that the SRE team has enough time in their schedule to make the service stable and operable. This cap is an upper bound; over time, left to their own devices, the SRE team should end up with very little operational load and almost entirely engage in development tasks, because the service basically runs and repairs itself: we want systems that are automatic, not just automated. In practice, scale and new features keep SREs on their toes.


From Site Reliability Engineering


Interesting practice: measure how much time goes to operational work, then cap it at 50% so that at least half of the team's time is spent on coding and automation. Over time, this keeps you from needing to hire more and more people to perform manual, repetitive tasks.
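
To make the measurement concrete, here is a minimal sketch of how a team might check its aggregate ops load against the 50% cap. It is not how Google tracks this. The category names, the `(category, hours)` log format, and the sample numbers are all assumptions made up for illustration; only the 50% figure and the examples of "ops" work (tickets, on-call, manual tasks) come from the quoted passage.

```python
OPS_CAP = 0.50  # the 50% upper bound on aggregate ops work from the quoted passage

# Hypothetical category labels; the quote names tickets, on-call, and manual tasks as "ops".
OPS_CATEGORIES = {"tickets", "on-call", "manual"}


def ops_fraction(entries):
    """Return the share of all logged hours that went to ops work.

    entries: iterable of (category, hours) pairs covering the whole team.
    """
    total_hours = 0.0
    ops_hours = 0.0
    for category, hours in entries:
        total_hours += hours
        if category in OPS_CATEGORIES:
            ops_hours += hours
    if total_hours == 0:
        return 0.0  # nothing logged, so nothing to report
    return ops_hours / total_hours


def over_cap(entries, cap=OPS_CAP):
    """True when the aggregate ops share is strictly above the cap."""
    return ops_fraction(entries) > cap


# Made-up sample week, for illustration only.
week = [
    ("tickets", 60.0),
    ("on-call", 40.0),
    ("manual", 20.0),
    ("development", 80.0),
]
print(f"ops share: {ops_fraction(week):.0%}, over cap: {over_cap(week)}")
```

The check is aggregate on purpose: the passage describes a cap across all SREs, so an individual week that is heavy on on-call does not by itself breach it.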