Uptime versus Downtime

Generally, if high availability is desired, both uptime and downtime should be addressed. Uptime is a function of components and architecture, whereas downtime is a function of a process that includes monitoring, fault and failure detection, diagnosis, repair preparation, repair processes, testing, and service restoration. To understand these individual terms, consider what needs to happen when a service goes down.

  • Monitoring. If you don’t monitor status and performance, there will be delays in recognizing that there has been an outage.
  • Detection. Monitoring alone does not guarantee that you will realize that there is an actual problem.
  • Diagnosis. Realizing that there is a problem does not mean that the root cause will be immediately obvious. Often an issue in one area—say, network congestion—will cause a problem in another—say, service unavailability due to an application time-out.
  • Repair preparation. The fact that a root cause has been identified does not mean that a repair will occur immediately. For hardware problems, spare parts may need to be ordered or correctly retrieved from spares inventory; for software problems, a patch may need to be written.
  • Repair. The repair process may require time: disassembling components, shutting down zones, and so forth.
  • Testing. Ensuring that the repair was conducted properly and that the component, subsystem, or system is ready for use requires time as well.
  • Restoration. Finally, a cutover of ...
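
The phases listed above can be treated as additive contributions to the length of a single outage. The following Python sketch is an illustration only, not something taken from the book: the class name OutagePhases, the availability helper, and every number in it are hypothetical, and the ratio uptime / (uptime + downtime) is assumed here as the conventional definition of availability.

```python
from dataclasses import dataclass, fields


@dataclass
class OutagePhases:
    """Minutes spent in each phase of one outage (hypothetical values)."""
    monitoring: float
    detection: float
    diagnosis: float
    repair_preparation: float
    repair: float
    testing: float
    restoration: float

    def total_downtime(self) -> float:
        # Downtime for the outage is the sum of the time spent in every phase.
        return sum(getattr(self, f.name) for f in fields(self))


def availability(uptime_minutes: float, downtime_minutes: float) -> float:
    # Assumed conventional definition: the fraction of total time in service.
    return uptime_minutes / (uptime_minutes + downtime_minutes)


# Illustrative figures only; they are not measurements from any real system.
outage = OutagePhases(
    monitoring=5,
    detection=10,
    diagnosis=30,
    repair_preparation=60,
    repair=45,
    testing=20,
    restoration=10,
)

period_minutes = 30 * 24 * 60  # a 30-day observation window
downtime = outage.total_downtime()
uptime = period_minutes - downtime

print(f"downtime: {downtime:.0f} minutes")
print(f"availability: {availability(uptime, downtime):.4%}")
```

In this made-up breakdown, the repair itself accounts for only a quarter of the outage, which is the point of the list: shortening any phase, not just the repair, reduces downtime.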

