13.3. Automation

Availability management aims to respond to anomalous events quickly enough that the damage is contained and repaired before the SLA criteria are missed. Humans typically need at least a few seconds to observe, analyze, and react to a situation.[24] At the speed of modern processors, a lot of damage can occur in those few seconds.

[24] As today's systems continue to improve, these anomalous events happen less and less frequently, which is a good thing. As a result, however, operators tend to forget the proper procedure for responding to such an event, which is a bad thing. Typically, there is a written recovery procedure, but a site is almost guaranteed to have a lengthy outage if the operator first has to read up on what to do.

Automation refers to a program ...
