I would like to close this book with a few principles that I consider fundamental for effective work with monitoring and alerting. I put them at the end as a summary of the message I’m trying to send. If I were to boil these principles down to just a few sentences, they would read as follows.
Monitoring is about detecting state changes from fluctuating timeseries or, more generally, about extracting meaning from the data in real time. The first step on the way to systematic discovery of useful information is to make a habit of measuring relevant information.
Collect the data starting with important metrics. Focus on top-level performance indicators and keep adding related ones as necessary. Try to understand the relationships between subsystems and their components. Do strong relationships exist? Are they invariant? Are they of linear or exponential nature? Do they have a confounding factor?
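As a rough illustration of the questions above, the following Python sketch checks how strongly two metrics move together and whether the relationship looks more linear or more exponential. The function names and the sample metrics (requests per second against CPU load and against queue depth) are hypothetical and chosen only for this example. The comparison of correlation on raw values against correlation on log-transformed values is a simple heuristic, not a rigorous test, and it says nothing about confounding factors.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


def relationship_shape(xs, ys):
    """Guess whether ys grows linearly or exponentially with xs.

    Heuristic: an exponential relationship becomes linear once ys is
    log-transformed, so we compare the correlation before and after the
    transform. Assumes every value in ys is strictly positive.
    """
    r_linear = pearson(xs, ys)
    r_exponential = pearson(xs, [math.log(y) for y in ys])
    return "exponential" if abs(r_exponential) > abs(r_linear) else "linear"


# Hypothetical measurements: load in requests per second, and two
# dependent metrics sampled at the same moments.
requests_per_sec = [1, 2, 3, 4, 5, 6, 7, 8]
cpu_percent = [3 * x + 2 for x in requests_per_sec]              # grows linearly
queue_depth = [2 * math.exp(0.5 * x) for x in requests_per_sec]  # grows exponentially

print(relationship_shape(requests_per_sec, cpu_percent))
print(relationship_shape(requests_per_sec, queue_depth))
```

A strong correlation found this way is only a starting point: checking whether it stays the same across time windows addresses the question of invariance, and a third metric that drives both series would still need to be ruled out separately.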
Secondly, it is very important to look at the gathered data. Too often the measurements are never analyzed. It’s okay not to look at all the generated metrics; you want these to be there for you just in case. If, however, data collection involves human effort, then not looking at the outcome renders that effort pointless.
Then, discern signal and discard noise. Not all data will be rich enough for extraction of relevant information at the cost you’re willing to pay, but be careful not to disregard information-rich outliers. In many cases, ...