Back in Transparency, we saw how individual instances can reveal their state. That’s only the start of the transparency story. Now we look at how to assemble a picture of system-wide health from the individual instances’ information.
The place to start is to define what we need from our efforts. When dealing with the system as a whole, two fundamental questions need to be answered:
Are users receiving a good experience?
Is the system creating the economic value we want?
Notice that the question “Is everything running?” isn’t on that list. Even at small scale, we should be able to survive periods when not everything is running. At scale, “partially broken” is the normal state of operation. It’s ...