What Is “Reliability”?

Most people who speak of “reliability” don’t really know what they mean by it. We can only define reliability in terms of failure. That is, if we can handle a certain set of well-defined and understood failures, we are reliable with respect to those failures. No more, no less. So let’s look at the possible causes of failure in a distributed ØMQ application, in roughly descending order of probability:

  1. Application code is the worst offender. It can crash and exit, freeze and stop responding to input, run too slowly for its input, exhaust all memory, and so on.

  2. System code (such as brokers we write using ØMQ) can die for the same reasons as application code. System code should be more reliable than application code, but it can still crash and burn, and it is especially likely to run out of memory if it tries to queue messages for slow clients.

  3. Message queues can overflow, typically in system code that has learned to deal brutally with slow clients. When a queue overflows, it starts to discard messages, so we get “lost” messages.

  4. Networks can fail (e.g., WiFi gets switched off or goes out of range). ØMQ will automatically reconnect in such cases, but in the meantime, messages may get lost.

  5. Hardware can fail and take with it all the processes running on that box.

  6. Networks can fail in exotic ways; e.g., some ports on a switch may die and those parts of the network become inaccessible.

  7. Entire data centers can be struck by lightning, earthquakes, fire, or more mundane power or cooling failures.
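
To make the third case concrete, here is a minimal sketch of a bounded queue that discards new messages once it is full, which is one way a queue can "deal brutally" with a slow consumer. It is plain Python rather than ØMQ code; the names `BoundedQueue`, `simulate`, `limit`, and `consume_every` are hypothetical and chosen only for illustration, and real ØMQ sockets may behave differently depending on socket type and configuration.

```python
from collections import deque


class BoundedQueue:
    """Hypothetical in-memory queue with a fixed limit (not a ØMQ API)."""

    def __init__(self, limit):
        self.limit = limit
        self.messages = deque()
        self.dropped = 0

    def send(self, message):
        # Once the queue is full, discard the new message instead of growing.
        if len(self.messages) >= self.limit:
            self.dropped += 1
            return False
        self.messages.append(message)
        return True

    def recv(self):
        # Return the oldest queued message, or None if the queue is empty.
        return self.messages.popleft() if self.messages else None


def simulate(limit, produced, consume_every):
    """Send `produced` messages while the consumer reads only one
    message for every `consume_every` messages sent."""
    queue = BoundedQueue(limit)
    received = []
    for seq in range(produced):
        queue.send(seq)
        if (seq + 1) % consume_every == 0:
            received.append(queue.recv())
    # Drain whatever is left once the producer stops.
    while (message := queue.recv()) is not None:
        received.append(message)
    return received, queue.dropped


if __name__ == "__main__":
    received, dropped = simulate(limit=3, produced=10, consume_every=3)
    print("received:", received)
    print("dropped:", dropped)
```

With a limit of three messages, ten messages sent, and a consumer that reads one message per three sent, the consumer sees messages 0, 1, 2, 3, 6, and 9, and four messages are silently lost. The sender gets no error in this sketch, which is why such losses have to be treated as a failure the application is designed to detect.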

Making a software system fully reliable against all of these possible failures is an enormously difficult and expensive job and goes beyond the scope of this modest tome.

Because the first five cases in the preceding list cover 99.9% of real-world requirements outside large companies (according to a highly scientific study I just ran, which also told me that 78% of statistics are made up on the spot), that’s what we’ll examine here. If you’re a large company with money to spend on the last two cases, contact my company immediately! There’s a large hole behind my beach house waiting to be converted into an executive swimming pool.
