Chapter 5. Fault Tolerance and Catastrophe-Preparedness

A production-ready microservice is fault tolerant and prepared for any catastrophe. Microservices will fail, they will fail often, and any potential failure scenario can and will happen at some point within the microservice’s lifetime. Ensuring availability across the microservice ecosystem requires planning carefully for failure, preparing for catastrophes, and actively pushing the microservice to fail in real time to ensure that it can recover from failures gracefully.

This chapter covers avoiding single points of failure, common catastrophes and failure scenarios, handling failure detection and remediation, implementing different types of resiliency testing, and ways to handle incidents and outages at the organizational level when failures do occur.

Principles of Building Fault-Tolerant Microservices

The reality of building large-scale distributed systems is that individual components can fail, they will fail, and they will fail often. No microservice ecosystem is an exception to this rule. Any possible failure scenario can and will happen at some point in a microservice’s lifetime, and these failures are made worse by the complex dependency chains within microservice ecosystems: if one service in the dependency chain fails, all of the upstream clients will suffer, and the end-to-end availability of the entire system will be compromised.
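To make the effect of dependency chains concrete, here is a minimal sketch of how availability compounds along a chain. It assumes that every service in the chain is required for a request to succeed and that services fail independently of one another; real ecosystems often violate both assumptions. The function names and the 99.9% figure are illustrative and do not come from the text.

```python
from math import prod

HOURS_PER_YEAR = 365 * 24


def chain_availability(availabilities):
    """End-to-end availability of a serial dependency chain.

    Assumes every service is required and failures are independent,
    so the individual availabilities multiply.
    """
    return prod(availabilities)


def expected_downtime_hours(availability):
    """Expected hours of unavailability per (non-leap) year."""
    return (1.0 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    # Hypothetical chain: five services, each available 99.9% of the time.
    chain = [0.999] * 5
    end_to_end = chain_availability(chain)
    print(f"single service: {expected_downtime_hours(0.999):.1f} h/year down")
    print(f"whole chain:    {expected_downtime_hours(end_to_end):.1f} h/year down")
    print(f"end-to-end availability: {end_to_end:.4%}")
```

Under these assumptions, a chain of five services at 99.9% each delivers roughly 99.5% end to end, so a client at the top of the chain sees about five times the downtime of any single service beneath it.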

The only way to mitigate catastrophic failures and avoid compromising the availability ...
