Recovering from System Problems

When a server fails and cannot be repaired immediately, high availability cluster software such as MC/ServiceGuard can reduce the resulting downtime and keep services available. MC/ServiceGuard detects the failure of an application and automatically restarts it on another system, a process that can complete in under one minute.
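
The detect-and-restart cycle can be illustrated with a small sketch. This is not MC/ServiceGuard's interface or configuration. The names `Node`, `Package`, `pick_failover_node`, and `monitor_once` are hypothetical, and the `healthy` flag stands in for whatever probe a real cluster would use. It shows only the general idea: check the application, and if it or its host has failed, restart it on another healthy system.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """A cluster member (hypothetical model, not a real cluster API)."""
    name: str
    healthy: bool = True  # stand-in for a real hardware/OS health probe


@dataclass
class Package:
    """An application together with the node it currently runs on."""
    name: str
    node: Node
    running: bool = True


def pick_failover_node(nodes: List[Node], current: Node) -> Optional[Node]:
    """Return the first healthy node other than the current one, if any."""
    for node in nodes:
        if node is not current and node.healthy:
            return node
    return None


def monitor_once(pkg: Package, nodes: List[Node]) -> str:
    """Run one monitoring pass and report what happened."""
    if pkg.node.healthy and pkg.running:
        return "ok"
    # Failure detected: look for another system to host the application.
    target = pick_failover_node(nodes, pkg.node)
    if target is None:
        pkg.running = False
        return "down"
    # "Restart" the application on the chosen node.
    pkg.node = target
    pkg.running = True
    return "failed-over"


if __name__ == "__main__":
    nodes = [Node("node-a"), Node("node-b")]
    pkg = Package("db", nodes[0])
    nodes[0].healthy = False  # simulate a server failure
    print(monitor_once(pkg, nodes), pkg.node.name)  # prints "failed-over node-b"
```

A real cluster would run such a check continuously and would also have to move the application's data and network identity to the new node; the sketch leaves those parts out.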

However, even with the kernel's capability to mask certain failures and high availability software's capability to move applications to redundant servers, ultimately you still need to repair the failed components. For hardware ...
