Chapter 1. Introduction

Kubernetes is an open source orchestrator for deploying containerized applications. The system was open sourced by Google, inspired by a decade of experience deploying scalable, reliable systems in containers via application-oriented APIs, and developed over the last four years by a vibrant community of open source contributors.

It is used by a large and growing number of developers to deploy reliable distributed systems, as well as to run machine learning, big data, and other batch workloads. A Kubernetes cluster provides an orchestration API that enables applications to be defined and deployed with simple declarative syntax. Further, the Kubernetes cluster itself provides numerous online, self-healing control algorithms that repair applications in the presence of failures. Finally, the Kubernetes API exposes concepts like Deployments that make it easier to perform zero-downtime updates of your software and Service load balancers that make it easy to spread traffic across a number of replicas of your service. Additionally, Kubernetes provides tools for naming and discovery of services so that you can build loosely coupled microservice architectures. Kubernetes is widely used across public and private clouds, as well as physical infrastructure.
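
To make that concrete, here is a minimal sketch of the declarative syntax: a Deployment that keeps three replicas of a (hypothetical) web server running, and a Service that spreads traffic across them. The names and container image are placeholders:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: hello-web            # hypothetical application name
    spec:
      replicas: 3                # Kubernetes keeps three copies running
      selector:
        matchLabels:
          app: hello-web
      template:
        metadata:
          labels:
            app: hello-web
        spec:
          containers:
          - name: web
            image: nginx:1.25    # placeholder container image
            ports:
            - containerPort: 80
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: hello-web
    spec:
      selector:
        app: hello-web           # routes traffic to the Deployment's Pods
      ports:
      - port: 80
        targetPort: 80

If a replica fails, the Deployment’s controller creates a replacement, and the Service automatically routes traffic to whichever replicas are healthy.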

This book is dedicated to the topic of managing a Kubernetes cluster. You might be managing your own cluster on your own hardware, part of a team managing a cluster for a larger organization, or a Kubernetes user who wants to go beyond the APIs and learn more about the internals of the system. Regardless of where you are in that journey, deepening your knowledge of how to manage the system can make you more capable of accomplishing all of the things you need to do with Kubernetes.

Note

When we speak of a cluster, we’re referring to a collection of machines that work together to provide the aggregate computing power that Kubernetes makes available to its end users: a set of machines that are all controlled by a single API and can be used by consumers of that API.

There are a variety of topics that make up the necessary skills for managing a Kubernetes cluster:

  • How the cluster operates

  • How to adjust, secure, and tune the cluster

  • How to understand your cluster and respond when things go wrong

  • How to extend your cluster with new and custom functionality

How the Cluster Operates

Ultimately, if you are going to manage a system, you need to understand how that system operates. What are the pieces that it is made up of, and how do they fit together? Without at least a rough understanding of the components and how they interoperate, you are unlikely to be successful at managing any system. Managing a piece of software, especially one as complex as Kubernetes, without this understanding is like attempting to repair a car without knowing how the tail pipe relates to the engine. It’s a bad idea.

However, in addition to understanding how all the pieces fit together, it’s also essential to understand how the user consumes the Kubernetes cluster. Only by knowing how a tool like Kubernetes should be used can you truly understand the needs and demands required for its successful management. To revisit our analogy of the car, without understanding the way in which a driver sits in the vehicle and guides it down the road, you are unlikely to successfully manage the vehicle. The same is true of a Kubernetes cluster.

Finally, it is critical that you understand the role that the Kubernetes cluster plays in a user’s daily existence. What is the cluster accomplishing for the end user? Which applications are they deploying on it? What complexity and hardship is the cluster removing? What complexity is the Kubernetes API adding? To complete the car analogy, in order to understand the importance of a car to its end user, it is critical to know that it is the thing that ensures a person shows up to work on time. Likewise with Kubernetes, if you don’t understand that the cluster is the place where a user’s mission-critical application runs, and that the Kubernetes API is what a developer relies on to fix a problem when something goes wrong at 3 a.m., you won’t really grasp what is needed to successfully manage that cluster.

Adjust, Secure, and Tune the Cluster

In addition to knowing how the pieces of the cluster fit together and how the Kubernetes API is used by developers to build and deploy applications, it is also critical to understand the various APIs and configuration options to adjust, secure, and tune your cluster. A Kubernetes cluster—or really any significant piece of software—is not something that you simply turn up, start running, and walk away from.

The cluster and its usage have a lifecycle. Developers join and leave teams. New teams are formed and old ones die. The cluster scales with the growth of the business. New Kubernetes releases come out to fix bugs, add new features, and improve stability. Increased demand on the cluster exposes performance problems that had previously been ignored. Responding to all of these changes in the lifespan of your cluster requires an understanding of the ways in which Kubernetes can be configured via command line flags, deployment options, and API configurations.
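
As a small illustration, in a cluster deployed with kubeadm, many of these command-line flags are set through the ClusterConfiguration API object. This sketch assumes kubeadm’s v1beta3 configuration API, and the flag values are examples rather than recommendations:

    apiVersion: kubeadm.k8s.io/v1beta3
    kind: ClusterConfiguration
    apiServer:
      extraArgs:
        # example flag values only; tune these for your own cluster
        authorization-mode: "Node,RBAC"
        enable-admission-plugins: "NodeRestriction"
        audit-log-path: "/var/log/kubernetes/audit.log"
        max-requests-inflight: "800"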

Additionally, your cluster is not just a target for application deployment. It can also be a vector for attacking the security of your applications. Configuring your cluster to be secure against many different attacks—from application compromises to denial of service—is a critical component of successfully managing a cluster. Much of the time, this hardening exists simply to prevent mistakes: in many cases, its value is that it prevents one team or user from accidentally “attacking” another team’s service. However, active attacks do sometimes happen, and the configuration of the cluster is critical both to detecting attacks when they occur and to preventing them from happening in the first place.
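
Role-based access control (RBAC) is one of the primary tools for this kind of hardening. The sketch below, with a hypothetical team-a namespace and group, grants a team full control of its own workloads while leaving every other namespace off-limits:

    apiVersion: rbac.authorization.k8s.io/v1
    kind: Role
    metadata:
      namespace: team-a              # hypothetical team namespace
      name: team-a-developer
    rules:
    - apiGroups: ["", "apps"]
      resources: ["pods", "services", "deployments"]
      verbs: ["get", "list", "watch", "create", "update", "delete"]
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: RoleBinding
    metadata:
      namespace: team-a
      name: team-a-developers
    subjects:
    - kind: Group
      name: team-a                   # hypothetical group from your identity provider
      apiGroup: rbac.authorization.k8s.io
    roleRef:
      kind: Role
      name: team-a-developer
      apiGroup: rbac.authorization.k8s.io

Because the Role is namespaced, a slip of the keyboard by one team cannot delete another team’s Deployments.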

Finally, depending on the usage of the cluster, you may need to demonstrate compliance with various security standards that are required for application developers in many industries, such as healthcare, finance, or government. When you understand how to build a compliant cluster, you can put Kubernetes to work in these environments.

Responding When Things Go Wrong

If things never went wrong, it would be a great world to live in. Sadly, of course, that is not the way things are, especially not with any computer system I’ve ever helped to manage. What’s critical when things go wrong is that you learn of it quickly, that you find out through automation and alerts (rather than from a user), and that you are capable of responding and restoring the system as quickly as possible.

The first step in detecting when things break and in understanding why they are broken is to have the right metrics in place. Fortunately, two characteristics of a Kubernetes cluster make this job easier. The first is that Kubernetes itself is generally deployed inside containers. In addition to the value of reliable packaging and deployment, the container itself forms a boundary at which basic metrics such as CPU, memory, network, and disk usage can be observed. These metrics can then be recorded into a monitoring system for both alerting and introspection.

In addition to these container-generated metrics, the Kubernetes codebase itself has been instrumented with a significant number of application metrics. These include things like the number of requests sent or received by various components, as well as the latency of those requests. These metrics are expressed using a format popularized by the Prometheus open source project, and they can easily be collected by Prometheus, which can be used directly or together with tools like Grafana for visualization and introspection.
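
For example, scraping a component’s /metrics endpoint returns plain text in the Prometheus exposition format. An abbreviated, illustrative sample from the API server might look like this (the counts and help text are made up, but apiserver_request_total and its code, resource, and verb labels are real Kubernetes metrics):

    # HELP apiserver_request_total Counter of apiserver requests by verb, resource, and response code.
    # TYPE apiserver_request_total counter
    apiserver_request_total{code="200",resource="pods",verb="GET"} 104732
    apiserver_request_total{code="201",resource="pods",verb="POST"} 1534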

Combined, the baseline operating system metrics from the containers, as well as the application metrics from Kubernetes itself, provide a rich set of data that can be used to generate alerts, which tell you when the system isn’t working properly, along with the historical data necessary to debug and determine what went wrong and when.
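
As a sketch of how these pieces fit together, here is a Prometheus alerting rule (the thresholds, labels, and group name are illustrative) that fires when the API server’s 99th-percentile request latency stays above one second for ten minutes:

    groups:
    - name: kubernetes-apiserver          # illustrative rule group name
      rules:
      - alert: APIServerHighLatency
        expr: |
          histogram_quantile(0.99,
            sum(rate(apiserver_request_duration_seconds_bucket[5m])) by (le, verb)
          ) > 1
        for: 10m                          # must be sustained before firing
        labels:
          severity: warning
        annotations:
          summary: "API server p99 latency above 1s for verb {{ $labels.verb }}"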

Of course, understanding the problem is only the first half of the battle. The next step is to respond to and recover from problems with the system. Fortunately, Kubernetes was built in a decoupled, modular manner, with minimal state in the system. This means that, generally, at any given time, it is safe to restart any component that may be overloaded or misbehaving. This modularity and idempotency mean that, once you determine the problem, developing a solution is often as straightforward as restarting a few applications.
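
For example, if CoreDNS (which runs as a Deployment in the kube-system namespace in a typical cluster) were misbehaving, a rolling restart is often all that is needed:

    # Restart every Pod in the Deployment, one at a time
    kubectl -n kube-system rollout restart deployment coredns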

Of course, in some cases, something truly terrible happens, and your only recourse is to restore the cluster from a disaster recovery backup somewhere. This presumes that you have enabled such backups in the first place. In addition to all of the monitoring to show you what is happening, the alerts to tell you when something breaks, and the playbooks to tell you how to repair it, successfully managing a cluster requires that you develop and exercise a disaster response and recovery procedure. It’s important to remember that simply developing this plan is insufficient. You need to practice it regularly, or you will not be ready (and the plan itself may be flawed) when a real problem occurs.
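
In Kubernetes, the state worth backing up lives almost entirely in etcd, so a disaster recovery procedure usually centers on etcd snapshots. A minimal sketch, assuming etcdctl version 3 and placeholder certificate paths:

    # Take a point-in-time snapshot of the cluster's state store
    ETCDCTL_API=3 etcdctl snapshot save /var/backups/etcd-snapshot.db \
      --endpoints=https://127.0.0.1:2379 \
      --cacert=/etc/kubernetes/pki/etcd/ca.crt \
      --cert=/etc/kubernetes/pki/etcd/server.crt \
      --key=/etc/kubernetes/pki/etcd/server.key

The restore side (etcdctl snapshot restore) deserves the same regular rehearsal as the snapshot itself.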

Extending the System with New and Custom Functionality

One of the most important strengths of the Kubernetes open source project has been the explosive growth of libraries, tools, and platforms that build on, extend, or otherwise improve the usage of a Kubernetes cluster.

There are tools like Spinnaker or Jenkins for continuous deployment, and tools like Helm that make it easy to package and deploy complete applications. Platforms like Deis provide Git push–style developer workflows, and numerous functions as a service (FaaS) platforms build on top of Kubernetes to enable users to consume it via simple functions. There are even tools for automating the creation and rotation of certificates, in addition to service mesh technologies that make it easy to link and introspect a myriad of microservices.

All of these tools in the ecosystem can be used to enhance, extend, and improve the Kubernetes cluster that you are managing. They can provide new functionality to make your users’ lives easier and make the software that they deploy more robust and more manageable.

However, these tools can also make your cluster more unstable, less secure, and more prone to failures. They can expose your users to immature, poorly supported software that feels like an “official” part of the cluster but actually serves to make their lives more difficult.

Part of managing a Kubernetes cluster is knowing how and when to add these tools, platforms, and projects into the cluster. It requires an exploration and understanding of not only what a particular project is attempting to accomplish but also of the other solutions that exist in the ecosystem. Often, users will come to you with a request for a particular tool based on some video or blog that they happened across. In truth, they are often asking for a capability like continuous integration and continuous delivery (CI/CD) or certificate rotation.

It is your job as a cluster manager to act as a curator of such projects. You are also an editor and an advisor who can recommend alternate solutions or determine whether a particular project is a good fit for your cluster or if there is a better way of accomplishing the same goal for the end user.

Additionally, the Kubernetes API itself contains rich tools for extending and enhancing the API. A Kubernetes cluster is not limited solely to the APIs that are built into it. Instead, new APIs can be dynamically added and removed. Besides the existing extensions just mentioned, sometimes the job of managing a Kubernetes cluster involves developing new code and new extensions that enhance your cluster in ways that were previously impossible. Part of managing a cluster may very well be developing new tooling. Of course, once developed, sharing that tooling with the growing Kubernetes ecosystem is a great way to give back to the community that brought you the Kubernetes software in the first place.
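
The primary mechanism for adding such APIs is the CustomResourceDefinition (CRD). Here is a minimal sketch, using a hypothetical example.com group and Backup kind:

    apiVersion: apiextensions.k8s.io/v1
    kind: CustomResourceDefinition
    metadata:
      name: backups.example.com        # must be <plural>.<group>
    spec:
      group: example.com               # hypothetical API group
      scope: Namespaced
      names:
        plural: backups
        singular: backup
        kind: Backup
      versions:
      - name: v1
        served: true
        storage: true
        schema:
          openAPIV3Schema:
            type: object
            properties:
              spec:
                type: object
                properties:
                  schedule:
                    type: string       # e.g., a cron expression

Once this CRD is applied, kubectl get backups works like any built-in resource, and a custom controller you write can watch for Backup objects and act on them.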

Summary

Managing a Kubernetes cluster is more than just the act of installing some software on a set of machines. Successful management requires a solid grasp of how Kubernetes is put together and how it is put to use by the developers who are Kubernetes users. It requires that you understand how to maintain, adjust, and improve the cluster over time as its usage patterns change. Additionally, you need to know how to monitor the information emitted by the cluster in operation and how to develop the alerts and dashboards that tell you when the cluster is sick and how to make it healthy again. Finally, you need to understand when and how to extend the Kubernetes cluster with other tools to make it even more helpful to your users. We hope that within this book you find answers to all of these topics and more, and that, at completion, you find yourself with the skills to be successful at Managing Kubernetes.
