Detecting Disappearances

Heartbeating sounds simple, but it’s not. UDP packets get dropped when there’s a lot of TCP traffic, so if we depend on UDP beacons alone we’ll get false disconnections. TCP traffic can be delayed for 5, 10, even 30 seconds if the network is really busy, so if we kill peers as soon as they go quiet on TCP, we’ll get false disconnections there too.

Since UDP beacons aren’t reliable, it’s tempting to add in TCP beacons. After all, TCP will deliver them reliably. However, there’s one little problem. Imagine you have 100 nodes on a network, and each node sends a TCP beacon to every other node once a second. Each beacon is 22 bytes, not counting TCP’s framing overhead. That is 100 * 99 * 22 bytes per second, or 217,800 bytes/second just for heartbeating. That’s about 1–2% of a typical WiFi network’s ideal capacity, which sounds OK. But when a network is stressed, or fighting other networks for airspace, that extra 200K a second will break what’s left. UDP broadcasts are at least low cost.
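
To see how quickly this cost grows, here is a small sketch of the arithmetic in Python. The function name and its parameters are made up for illustration, and it assumes every node sends one beacon to every other node at a fixed rate, ignoring TCP framing overhead as the text does.

```python
def heartbeat_bytes_per_second(nodes, beacon_bytes, beacons_per_second=1):
    """Total heartbeat payload on the network for all-to-all TCP beacons.

    Each of the `nodes` peers sends a beacon to each of the other
    (nodes - 1) peers, so traffic grows roughly with the square of
    the number of nodes.
    """
    return nodes * (nodes - 1) * beacon_bytes * beacons_per_second


if __name__ == "__main__":
    # The example from the text: 100 nodes, 22-byte beacons, once a second.
    print(heartbeat_bytes_per_second(100, 22))
    # Doubling the number of nodes roughly quadruples the cost.
    print(heartbeat_bytes_per_second(200, 22))
```

The quadratic term is the real problem: the per-beacon cost is tiny, but every node you add sends beacons to all the others.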

So what we do is switch to TCP heartbeats only when a specific peer hasn’t sent us any UDP beacons in a while. And then, we send TCP heartbeats only to that one peer. If the peer continues to be silent, we conclude it’s gone away. If the peer comes back with a different IP address and/or port, we have to disconnect our DEALER socket and reconnect to the new endpoint.
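
One way to picture this fallback is a small per-peer record that decides what to do based on how long the peer has been silent. This is only a sketch, not the code this chapter builds: the `Peer` class, the action names, and the two thresholds are hypothetical, and it simplifies by treating any received traffic (UDP beacon or TCP message) as proof of life. It deliberately leaves out the sockets, so the caller would do the actual pinging, disconnecting, and reconnecting.

```python
from dataclasses import dataclass

# Hypothetical thresholds in seconds; the text only says "in a while".
PING_AFTER = 5.0     # silent this long: start TCP heartbeats to this peer
EXPIRE_AFTER = 10.0  # silent this long: conclude the peer has gone away


@dataclass
class Peer:
    endpoint: str     # where our DEALER socket is currently connected
    last_seen: float  # time we last heard anything from this peer

    def on_traffic(self, now, endpoint=None):
        """Record activity from the peer.

        Returns True if the peer came back on a different endpoint,
        in which case the caller has to disconnect and reconnect.
        """
        self.last_seen = now
        if endpoint is not None and endpoint != self.endpoint:
            self.endpoint = endpoint
            return True
        return False

    def action(self, now):
        """Decide what to do about this peer at time `now`."""
        silent = now - self.last_seen
        if silent >= EXPIRE_AFTER:
            return "expire"  # give up on the peer
        if silent >= PING_AFTER:
            return "ping"    # send a TCP heartbeat to this one peer only
        return "none"        # beacons are arriving; nothing to do
```

Passing `now` in explicitly, rather than reading the clock inside the class, keeps the logic easy to test. In a real node you would call `action()` for each known peer on a timer, and call `on_traffic()` whenever a beacon or message arrives.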

This gives us a set of states for each peer, though at this stage the code doesn’t use a formal state machine:

  • Peer visible thanks to ...
