Distributed Logging and Monitoring

Let’s look at logging and monitoring. If you’ve ever managed a real server (such as a web server), you know how vital it is to have a record of what is going on. There’s a long list of reasons, not least:

  • To measure the performance of the system over time

  • To see what kinds of work are done the most, to optimize performance

  • To track errors and how often they occur

  • To do postmortems of failures

  • To provide an audit trail in case of dispute

Let’s scope this in terms of the problems we think we’ll have to solve:

  • We want to track key events (such as nodes leaving and rejoining the network).

  • For each event, we want to track a consistent set of data: the date/time, node that observed the event, peer that created the event, type of the event itself, and other event data.

  • We want to be able to switch logging on and off at any time.

  • We want to be able to process log data mechanically, since it will be sizable.

  • We want to be able to monitor a running system; that is, collect logs and analyze them in real time.

  • We want log traffic to have minimal effect on the network.

  • We want to be able to collect log data at a single point on the network.
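A few of these requirements can be made concrete before any network code exists. The following is a minimal sketch, not a definitive design: the names `LogEvent` and `EventLogger`, the field names, and the choice of one JSON object per event are all assumptions made for illustration. It shows a consistent record per event, an on/off switch, and an encoding that is easy to process mechanically. The `send` argument stands in for whatever would carry the event to the collection point.

```python
import json
import time
from dataclasses import dataclass, field, asdict


@dataclass
class LogEvent:
    """One log record with a consistent set of fields (names are hypothetical)."""
    node: str        # node that observed the event
    peer: str        # peer that created the event
    event: str       # type of the event, e.g. "JOIN" or "LEAVE"
    data: str = ""   # any other event data
    time: float = field(default_factory=time.time)  # seconds since the epoch

    def encode(self) -> bytes:
        # One JSON object per event keeps the log easy to parse mechanically.
        return json.dumps(asdict(self), sort_keys=True).encode("utf-8")

    @classmethod
    def decode(cls, raw: bytes) -> "LogEvent":
        return cls(**json.loads(raw.decode("utf-8")))


class EventLogger:
    """Encodes events and hands them to a caller-supplied send function."""

    def __init__(self, node, send):
        self.node = node
        self.send = send      # assumption: any callable that accepts bytes
        self.enabled = True   # logging can be switched on and off at any time

    def log(self, peer, event, data=""):
        if not self.enabled:
            return False      # disabled: nothing is built or sent
        self.send(LogEvent(self.node, peer, event, data).encode())
        return True
```

For example, passing `collected.append` as `send` gathers encoded events in a list, and setting `logger.enabled = False` stops further events from being produced. In a real system `send` would write to the network, which is where the requirement of minimal log traffic starts to matter.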

As in any design, some of these requirements are hostile to each other. For example, collecting log data in real time means sending it over the network, which will affect network traffic to some extent. However, as in any design, these requirements are also hypothetical until we have running code, so we can’t take them too seriously. ...
