Chapter 21. Designing for Incident Investigation

Incident investigation usually begins when an error or abnormal event is raised. Developers are involved in incident investigation throughout the lifecycle, not just in live service. Defects raised during testing need to be resolved quickly and efficiently, and it is often during testing that you discover that not only the events but also the logging, tracing, auditing, and tooling are inadequate. Developers often spend so much time getting the functionality right that they neglect the instrumentation and diagnostics. I've mentioned this before, but there's nothing worse than being woken at 3 A.M. to investigate a problem, only to find that the event itself contains no real information and, to add insult to injury, that the logs tell you nothing conclusive either.

Incident investigation is all about getting to the root cause of the problem, re-creating the issue, analyzing and defining a solution, and, ultimately, implementing that solution. The quicker you can achieve this goal, the sooner you can go (back) to bed. You've seen how good diagnostics provide a great starting point. This chapter looks a little further into the actions that follow and at what you can do to add real value to this process.

This chapter is organized into the following sections:

  • Tracing — Examines the tracing that should be included in the solution components. It also looks at some important practices that you should consider ...
