Mean Time to Failure and Mean Time to Recover

The two most common metrics used to measure fault tolerance and avoidance are the following:

  • Mean time to failure (MTTF): The mean time until the device will fail

  • Mean time to recover (MTTR): The mean time it takes to recover once a failure has occurred

Although a great deal of time and energy is often spent trying to raise the MTTF, it’s important to keep in mind that even with a finite failure rate, an MTTR at or near zero can make the system indistinguishable from one that hasn’t failed. Downtime, expressed as a fraction of operating time, is generally approximated as MTTR/MTTF, but because it can be prohibitively expensive to increase MTTF beyond a certain point, you should spend both time and resources on managing and reducing ...
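To make the MTTR/MTTF relationship concrete, here is a minimal Python sketch. The function names and the sample figures (10,000 hours between failures, one hour to recover) are illustrative assumptions, not values from the book. The second function uses the standard steady-state availability formula, MTTF/(MTTF + MTTR), which the text does not state but which the MTTR/MTTF ratio closely approximates whenever MTTR is much smaller than MTTF.

```python
def downtime_ratio(mttf_hours: float, mttr_hours: float) -> float:
    """Approximate fraction of time spent down, using the MTTR/MTTF ratio."""
    if mttf_hours <= 0:
        raise ValueError("MTTF must be positive")
    if mttr_hours < 0:
        raise ValueError("MTTR cannot be negative")
    return mttr_hours / mttf_hours


def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: uptime divided by uptime plus recovery time."""
    if mttf_hours <= 0:
        raise ValueError("MTTF must be positive")
    if mttr_hours < 0:
        raise ValueError("MTTR cannot be negative")
    return mttf_hours / (mttf_hours + mttr_hours)


if __name__ == "__main__":
    # Hypothetical baseline: one failure per 10,000 hours, one hour to recover.
    baseline = downtime_ratio(10_000, 1.0)

    # Two ways to improve on the baseline (both figures are assumptions):
    doubled_mttf = downtime_ratio(20_000, 1.0)   # hardware that fails half as often
    faster_mttr = downtime_ratio(10_000, 0.1)    # recovery that is ten times faster

    print(f"baseline downtime ratio:     {baseline:.6f}")
    print(f"after doubling MTTF:         {doubled_mttf:.6f}")
    print(f"after cutting MTTR tenfold:  {faster_mttr:.6f}")
    print(f"baseline availability:       {availability(10_000, 1.0):.6f}")
```

With these assumed numbers, cutting recovery time tenfold reduces the downtime ratio more than doubling the MTTF does, which is the trade-off the paragraph above describes. An MTTR of zero yields a downtime ratio of zero no matter how often failures occur.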
