Defining Fault Management

Detecting and reporting unusual or unacceptable behavior is generally referred to as fault management (or event management). A fault is any behavior that differs from specified or expected behavior, although the term is generally used to refer to the complete failure of a hardware component or software product.

Fault conditions can be characterized in many different ways. Faults can be caused by the failure of hardware components in the environment, or by the failure of software running on systems within the environment. A computer depends on more than its CPU and memory; power supplies and fans, for example, can also fail. Loss of power in the data center, natural disasters, and the failure of air conditioning units are just a few examples ...
