Chapter 12. Disaster Recovery

If you’re like most users, you have probably looked to Kubernetes, at least in part, for its ability to recover from failure automatically. And, of course, Kubernetes does a great job of keeping your workloads up and running. However, as with any complex system, there is always room for failure. Whether that failure is a hardware fault on a node or data loss in the etcd cluster, we want systems in place that let us recover in a timely and reliable fashion.

High Availability

A first principle in any disaster recovery strategy is to design your systems to minimize the possibility of failure in the first place. Naturally, designing a foolproof system is an impossibility, but we should always build with the worst-case scenarios in mind.

When building production-grade Kubernetes clusters, best practice dictates that critical components be highly available. Some components, such as the API server, can run in an active-active configuration, whereas others, such as the scheduler and controller manager, operate in an active-passive manner. When these control plane components are deployed properly, a user should not even notice that a failure has occurred.
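To make the active-passive idea concrete, here is a minimal sketch of lease-based leader election, the general pattern by which only one replica of a component acts at a time while the others stand by. It is an in-memory illustration under stated assumptions, not the Kubernetes implementation: the Lease class, its fields, and the try_acquire_or_renew method are hypothetical names, and time is passed in explicitly so the behavior is easy to follow.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Lease:
    """Hypothetical in-memory lease; a real cluster would store this in a shared, consistent store."""
    duration: float                 # seconds a lease stays valid without renewal
    holder: Optional[str] = None    # identity of the current leader, if any
    renewed_at: float = 0.0         # time of the last successful acquire or renew

    def try_acquire_or_renew(self, candidate: str, now: float) -> bool:
        """Return True if `candidate` is the leader after this call."""
        expired = self.holder is None or now - self.renewed_at > self.duration
        if expired or self.holder == candidate:
            # Either nobody holds a valid lease, or the leader is renewing its own.
            self.holder = candidate
            self.renewed_at = now
            return True
        # Another replica holds a valid lease, so this one stays passive.
        return False


if __name__ == "__main__":
    lease = Lease(duration=15.0)
    print(lease.try_acquire_or_renew("replica-a", now=0.0))   # replica-a becomes leader
    print(lease.try_acquire_or_renew("replica-b", now=5.0))   # replica-b stays passive
    print(lease.try_acquire_or_renew("replica-b", now=16.0))  # replica-a stopped renewing, so replica-b takes over
```

The passive replica does no work until the leader stops renewing its lease, which is why a failover of this kind can go unnoticed by users.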

Similarly, we recommend that your etcd backing store is deployed in a three- or five-node cluster configuration. You may certainly deploy larger clusters (always with an odd number of members), but clusters of this size should suffice for the vast ...
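The preference for odd-sized etcd clusters follows from majority quorum arithmetic, which the short sketch below works through. etcd needs a majority of members to agree before it commits a write; the function names quorum and fault_tolerance are illustrative choices, not part of any etcd API.

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with `members` voting members."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def fault_tolerance(members: int) -> int:
    """Number of members that can fail while a majority still remains."""
    return members - quorum(members)


if __name__ == "__main__":
    for n in range(1, 8):
        print(f"members={n} quorum={quorum(n)} tolerated_failures={fault_tolerance(n)}")
```

Going from three members to four raises the quorum from two to three but still tolerates only one failure, so the extra even-numbered member adds coordination cost without adding resilience.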
