The Recovery Challenge

Proper error handling and recovery is the Achilles’ heel of many applications. When an application fails to perform a particular operation, you should recover from the failure and restore the system (that is, the collection of interacting services and clients) to a consistent state, usually the state the system was in before the operation that caused the error took place. Operations that can fail typically consist of multiple, potentially concurrent, smaller steps. Some of those steps can fail while others succeed. The problem with recovery is the sheer number of partial-success and partial-failure permutations that you have to code against. For example, an operation comprising 10 smaller concurrent steps has more than three million recovery scenarios, because for the recovery logic the order in which the suboperations fail matters as well, and the factorial of 10 is 3,628,800.
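The arithmetic behind that figure can be checked with a short sketch. It is only an illustration of the counting argument, not code from the book: the step names and the `failure_orderings` helper are hypothetical, and it assumes, as the text does, that every distinct order in which the steps fail is a separate recovery scenario.

```python
import math
from itertools import permutations


def failure_orderings(steps):
    """Return every order in which all of the given steps could fail."""
    return list(permutations(steps))


# Hypothetical step names, used only to make the small case readable.
steps = ["debit", "credit", "log"]
orderings = failure_orderings(steps)

# With 3 steps there are 3! = 6 failure orders to handle.
print(len(orderings))

# With 10 steps the count is 10!, too many to enumerate by hand.
print(math.factorial(10))
```

Even the three-step case already yields six distinct orders, and each added step multiplies the count by the new number of steps.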

Trying to handcraft recovery code in a decent-sized application is often futile. The result is fragile code that is very susceptible to any change in the application's execution or the business use case, and it incurs both productivity and performance penalties. The productivity penalty results from all the effort required to handcraft the recovery logic. The performance penalty is inherent in such an approach because you need to execute huge amounts of code after every operation to verify that all is well. In reality, developers tend to deal only with the easy recovery cases; ...
