Recipe 20-1: Process Control
A number of commercial clustering products are available that offer features such as monitoring and failing over network services, storage services, and processes. They also offer advanced protection against failure scenarios specific to multiple-node clusters, commonly known as “split brain” and “amnesia.” This recipe is nowhere near that level of complexity, but it provides simple monitoring and restarting of services on a single server.
Clustering and High Availability are huge topics in their own right. This recipe looks only at monitoring processes and restarting them when necessary. At first this seems a fairly trivial task, but there are a few subtleties to deal with. One is how to handle a persistently failing process: this recipe notes when the service last failed, and if it has already been restarted recently, it disables the service and does not restart it again. Of course, just because something failed two weeks ago and has now failed again, that does not mean it should be abandoned, so a timeout of 3 minutes (180 seconds) is defined in the script; only failures within that window count as recent. By setting hard-coded values such as this timeout and the debug value in the script just before its configuration file is read, the defaults apply if no other value is chosen, but the ...
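The restart decision and the defaults-then-configuration pattern described above can be sketched as follows. This is a minimal illustration, not the recipe's actual script: the names DEBUG, TIMEOUT, CONFIG, STATEDIR, and decide, as well as the file paths, are assumptions made for the example. Only the 180-second value comes from the text.

```shell
#!/bin/bash
# Defaults are set first, so that the configuration file (if present)
# can override them. Variable names here are hypothetical.
DEBUG=0
TIMEOUT=180                              # seconds; the 3-minute window from the text
CONFIG=${CONFIG:-/etc/procmon.conf}      # hypothetical configuration file
if [ -r "$CONFIG" ]; then
  . "$CONFIG"
fi

STATEDIR=${STATEDIR:-/tmp/procmon}       # hypothetical place for failure timestamps
mkdir -p "$STATEDIR"

# decide SERVICE NOW
# Records NOW (seconds since the epoch) as the latest failure time of SERVICE
# and prints "disable" if the previous failure was less than TIMEOUT seconds
# earlier, or "restart" otherwise.
decide()
{
  local service=$1 now=$2
  local stamp="$STATEDIR/$service.lastfail"
  local last=0
  if [ -f "$stamp" ]; then
    last=$(cat "$stamp")
  fi
  echo "$now" > "$stamp"
  if [ "$last" -gt 0 ] && [ $(( now - last )) -lt "$TIMEOUT" ]; then
    echo disable
  else
    echo restart
  fi
}
```

A monitoring loop might call decide myservice "$(date +%s)" whenever it finds that the process is missing, and then act on the word printed. Passing the time in as an argument, rather than calling date inside the function, keeps the decision easy to exercise with fixed values.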