Chapter 18. Introduction to Machine Learning for SRE

Why Use Machine Learning for SRE?

In plain words: because it makes sense, and mostly because we (now) can.

SRE, fundamentally, is what happens when you ask a software engineer to design an operations function.1

This chapter is based on a presentation I gave at DrupalCon Vienna. Here we explore machine learning solutions to a few open SRE questions:

  • How do we automate those repetitive tasks that just generate toil and that no one wants to do?

  • How do we look at data and predict what is going to happen to our system in the future?

  • How do we reinforce “applying software engineering to an operations function”?

Automating operations processes is a critical goal for us. As artificial intelligence (AI) and machine learning improve, the range of tasks we can automate grows. If we keep historical data and use it to react programmatically to new events, we can fix an issue before it occurs: the system alerts us to what is about to happen, instead of someone manually analyzing past results and trying to predict the future.
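As a minimal sketch of that idea, assuming nothing about the tooling used later in the chapter: the Python below fits a straight line to a short history of disk-usage samples and extrapolates when usage would reach 100%. The function names (fit_line, hours_until_full), the sample values, and the seven-day alert threshold are all hypothetical and chosen only for illustration. A real system would likely need a richer model than a linear trend.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def hours_until_full(hours, usage_pct, limit=100.0):
    """Extrapolate the fitted trend and return the hours remaining
    (counted from the last sample) until usage reaches `limit`.
    Returns None when usage is flat or shrinking."""
    slope, intercept = fit_line(hours, usage_pct)
    if slope <= 0:
        return None
    return (limit - intercept) / slope - hours[-1]


# Hypothetical history: disk usage (percent) sampled every 24 hours.
hours = [0, 24, 48, 72, 96]
usage_pct = [60.0, 64.0, 68.0, 72.0, 76.0]

remaining = hours_until_full(hours, usage_pct)
# Assumed policy: alert when the projected fill time is under seven days.
if remaining is not None and remaining < 7 * 24:
    print(f"ALERT: disk projected to fill in about {remaining:.0f} hours")
```

With these made-up samples the trend is 4 percentage points per day, so the projection lands at roughly 144 hours from the last sample and the alert fires. The point is only that stored history lets the system raise the alarm ahead of the failure rather than after it.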

I’ve just picked up a fault in the AE35 unit. It’s going to go 100% failure in 72 hours.

HAL 9000, 2001: A Space Odyssey

That gives us the chance to use our time for more innovative tasks and feature development. Although this certainly is not an overnight achievement, lately we have seen the line between the work of machines ...
