13.3. FAULT-TOLERANT SYSTEMS AND RAPIDIO

Victor Menasce

At one time fault tolerance was considered the domain of the lunatic fringe, telecom central office applications, the space shuttle, and so on. Today, the requirement for high availability is mainstream. It's easy to see why. When one adds up all of the network elements that a data packet traverses from source to destination, one commonly crosses 20 or more hardware network elements. While the trend is toward a flatter network, significant hierarchy still exists. The problem with this hierarchy is that it adversly affects reliability. Let's look at a simple example:

  • Our example network element has an availability probability of 0.99. A network made up of two of these elements has an availability of 0.992 = 0.9801. A network made up of 20 of these elements has an availability of 0.9920 = 0.82, which most users would consider blatantly unacceptable.

This fact means that each individual network element must achieve sufficiently high availability so that the product of the availabilities yields an acceptable result at the network level. This has become the generally accepted 'five nines', or 0.99999 availability. It is important to note that this availability is measured at the application level. This level of availability equates to 5 min outage downtime per system per year. Let's examine a typical analysis of fault tolerant network element failures shown in Figure 13.11. The systems analyzed in the figure achieve five nines ...

Get RapidIO: The Next Generation Communication Fabric For Embedded Application now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.