> Consider a system with 1000 nodes where each node sends heartbeats to a central monitor every 500 milliseconds. This results in 2000 heartbeat messages per second just for health monitoring. In a busy production environment, this overhead can interfere with actual application traffic.
If your 1000-node busy production environment is run so close to the edge that 2000 heartbeat messages per second push it into overload, that's impressive resource scheduling.
Really, setting the interval balances speed of detection/cost of slow detection vs cost of reacting to a momentary interruption. If the node actually dies, you'd like to react as soon as possible; but if it's something like a link flap or system pause (GC or otherwise), most applications would prefer to wait and not transition state. Some applications, like live broadcast, are better served by moving very rapidly, and 500 ms might be too long.
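For concreteness, here is a minimal sketch of that trade-off (Python; the interval and miss threshold are illustrative, not taken from the article): the same two numbers set both the detection delay and how long a momentary interruption can last before you transition state.

```python
import time

class HeartbeatMonitor:
    """Marks a node dead after `misses` consecutive intervals with no heartbeat."""

    def __init__(self, interval_s=0.5, misses=4):
        self.interval_s = interval_s   # how often nodes are expected to report
        self.misses = misses           # how many silent intervals we tolerate
        self.last_seen = {}            # node_id -> timestamp of last heartbeat

    def record(self, node_id):
        self.last_seen[node_id] = time.monotonic()

    def dead_nodes(self):
        deadline = self.interval_s * self.misses
        now = time.monotonic()
        return [n for n, t in self.last_seen.items() if now - t > deadline]

# Detection delay is roughly interval_s * misses (2 s here), which is also how
# long a GC pause or link flap can last before the cluster reacts to it.
```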
Re: network partitioning, the author left out the really fun splits. Say you have servers in DC, TX, and CA. If there's a damaged (but not severed) link between TX and CA, there's a good chance that DC can talk to everyone, but TX and CA can't communicate. You can have that inside a datacenter too: maybe each node can only reach 75% of the other nodes, and A reaching B while B reaches C does not mean A can reach C. Lots of fun times there.
> ... push it into overload ...
Oh, oh, I get to talk about my favorite bug!
I was working on network-booting servers with iPXE and we got a bug saying that things were working fine until the cluster size went over 4/5 machines. In a larger cluster, machines would not come up from a reboot. I thought QA was just being silly, why would the size of the cluster matter? I took a closer look and, sure enough, was able to reproduce the bug. Basically, the machine would sit there stuck trying to download the boot image over TCP from the server.
After some investigation, it turned out to be related to the heartbeats sent between machines (they were ICMP pings). Since iPXE is a very nice and fancy bootloader, it will happily respond to ICMP pings. Note that, in order to do this, it would do an ARP to find the address to send the response to. Unfortunately, the size of the ARP cache was pretty small since this was "embedded" software (take a guess how big the cache was...). Essentially, while iPXE was downloading the image, the address of the image server would get pushed out of the ARP cache by all these heartbeats. Thus, the download would suffer since it had to constantly pause to redo the ARP request. So, things would work with a smaller cluster size since the ARP cache was big enough to keep track of the download server and the peers in the cluster.
I think I "fixed" it by responding to the ICMP using the source MAC address (making sure it wasn't broadcast) rather than doing an ARP.
Yeah, broadcast with iPXE commonly has this issue; I've run into it more than once in my career.
When systems were smaller I tried to push for the realization that I don’t need a heartbeat from a machine that is currently returning status 200 for 60 req/s. The evidence of work is already there, and more meaningful than the status check.
We end up adding real work to the status checks often enough anyway, to make sure the database is still visible and other services. So inference has a lot of power that a heartbeat does not.
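A hedged sketch of that inference (the thresholds and names are arbitrary, not from any particular system): treat any recent successful response as an implicit heartbeat, and only spend an explicit probe on nodes that have gone quiet.

```python
import time

RECENT_OK_WINDOW_S = 5.0   # arbitrary: a 2xx in the last 5 s counts as alive

last_success = {}          # node_id -> timestamp of last successful response

def observe_response(node_id, status_code):
    """Called from the normal request path; real work doubles as the heartbeat."""
    if 200 <= status_code < 300:
        last_success[node_id] = time.monotonic()

def needs_explicit_probe(node_id):
    """Only spend a health check on nodes with no recent evidence of work."""
    seen = last_success.get(node_id)
    return seen is None or time.monotonic() - seen > RECENT_OK_WINDOW_S
```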
I'm amused that you got pushback. Just going on the metaphor, the very way that we check heartbeats in people is by a proof of work in some other part of the body. It isn't like we are directly hooked into the heart. It isn't like the heart sends out an otherwise useless signal.
> If your 1000-node busy production environment is run so close to the edge that 2000 heartbeat messages per second push it into overload, that's impressive resource scheduling.
Eeeh I’m not so sure. The overhead of handling a hello-world heartbeat request is negligible, sure, but what about the overhead of having the connections open (file descriptors, maybe >1 per request), client tracking metadata for what are necessarily all different client location identifiers, and so on?
That’s still cheap stuff, but at 2krps there are totally realistic scenarios where a system with decent capacity budgeting could still be adversely affected by heartbeats.
And what if a heartbeat client’s network link is degraded and there’s a super long time between first byte and last? Whether or not that client gets evicted from the cluster, if it’s basically slowloris-ing the server that can cause issues too.
File descriptors are a limited resource, but the limits are huge. My little 2GB instances on GCP claim a limit of 1M; FreeBSD autotunes my 16GB servers to 0.5M (but I could increase it if I needed).
I just don't know how you have a 1000-node system and you can't manage to heartbeat everything 2x a second; I don't think you need that many heartbeats in most systems, but it's just not that much work. The only way I can see it being a lot of work is if your nodes are very small; but do you really need a 1000-node ESP8266 cluster where you can't get anything bigger for the management node?
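For reference, those limits are easy to inspect; something like this (Python's `resource` module, on Linux/BSD) shows what you actually have to play with:

```python
import resource

# RLIMIT_NOFILE: max open file descriptors for this process (soft, hard).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open-file limit: soft={soft} hard={hard}")

# A thousand idle heartbeat connections is small change against limits in the
# hundreds of thousands, though the per-process soft limit can be much lower.
```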
> And what if a heartbeat client’s network link is degraded and there’s a super long time between first byte and last? Whether or not that client gets evicted from the cluster, if it’s basically slowloris-ing the server that can cause issues too.
How big is your heartbeat? Is the response really going to be in multiple packets? With a 500ms heartbeat and a typical 3-5x no response => dead, you're going to hold onto the partial heartbeat for like 2 seconds.
> File descriptors are a limited resource, but the limits are huge.
Maybe in the linuxy servery context we're talking about, but I'll note I've encountered EMFILE and ENFILE errors attempting to open a mere couple hundred files on iOS simultaneously. And while my own experience is over a decade old, here's someone hitting those limits this year: https://github.com/bevyengine/bevy/pull/17377
(Context for my own encounter: gamedev, with massively parallel asset streaming and multiple layers of direct blocking dependencies in deserialization. One of my hackarounds involved temporarily closing the least recently read files, and then reopening them on demand in our VFS abstraction. Of course, for non-dev builds, you can just use a pack file and `pread` a single fd from multiple threads, but loose files were handy for incremental dev updates...)
> The only way I can see it being a lot of work is if your nodes are very small
Which is a totally valid use case for a huge-nodecount cluster! Geo distribution or similar use cases employ this pattern often.
> you're going to hold onto the partial heartbeat for like 2 seconds
And if a region goes down such that a large number of nodes are tarpit-heartbeating at the same time? Your cluster can fail to accept real traffic because it’s clogged with heartbeats.
Like sure, generally 2krps should be fine. But in clusters this large it is totally reasonable to tune heartbeat behavior. I’ve seen plenty of real distributed system failures along these same exact lines.
For Linux, the limit on FDs is just memory.
Kingsbury & Bailis's paper on the topic of network partitions: https://github.com/aphyr/partitions-post
I've dealt with exactly this. We had a couple thousand webapp server instances that had connections to a MySQL database. Each one only polled its connection for liveness once per second, but those little interruptions were poking at the servers and showed up at the top of time-consuming request charts.
> Really, setting the interval balances speed of detection/cost of slow detection vs cost of reacting to a momentary interruption.
Another option is dynamically adjusting the heartbeat interval based on cluster size to ensure processing heartbeats has a fixed cost. That's what Nomad does, and in my fuzzy 10-year memory heartbeating has never caused resource constraints on the schedulers: https://developer.hashicorp.com/nomad/docs/configuration/ser... For reference, clusters are commonly over 10k nodes and to my knowledge peak between 20k-30k. If anyone is running Nomad larger than that, I'd love to hear from them!
That being said the default of 50/s is probably too low, and the liveness tradeoff we force on users is probably not articulated clearly enough.
As an off-the-shelf scheduler we can't encode liveness costs for our users unfortunately, but we try to offer the right knobs to adjust it including per-workload parameters for what to do when heartbeats fail: https://developer.hashicorp.com/nomad/docs/job-specification...
(Disclaimer: I'm on the Nomad team)
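A back-of-the-envelope version of that scaling (a sketch, not Nomad's actual code; the 50/s target is just the default mentioned above):

```python
def heartbeat_interval_s(num_nodes, target_heartbeats_per_s=50, floor_s=10.0):
    """Spread heartbeats so the servers see a roughly fixed processing rate,
    trading slower failure detection for stable cost as the cluster grows."""
    return max(floor_s, num_nodes / target_heartbeats_per_s)

for n in (1_000, 10_000, 30_000):
    print(n, heartbeat_interval_s(n))   # 20 s, 200 s, 600 s between heartbeats
```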
> You can have that inside a datacenter too: maybe each node can only reach 75% of the other nodes, and A reaching B while B reaches C does not mean A can reach C. Lots of fun times there
At BigCloud in the early days, things went berserk with a gossip system when A could reach B but B couldn't reach A.
Cloudflare hit something similar though they misclassified the failure mode: https://blog.cloudflare.com/a-byzantine-failure-in-the-real-...
Related advice based on my days working at Basho: find a way to recognize, and terminate, slow-running (or erratically-behaving) servers.
A dead server is much better for a distributed system than a misbehaving one. The latter can bring down your entire application.
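A hedged sketch of that policy (the outlier factor and metric are invented for illustration): compare each node's recent tail latency to the cluster median and fence clear outliers rather than letting them limp along.

```python
import statistics

def nodes_to_fence(recent_p99_ms, factor=5.0, min_samples=3):
    """Return nodes whose tail latency is a clear outlier vs. the cluster.
    Terminating (fencing) them beats letting a sick node poison every request."""
    if len(recent_p99_ms) < min_samples:
        return []
    median = statistics.median(recent_p99_ms.values())
    return [node for node, p99 in recent_p99_ms.items() if p99 > factor * median]

print(nodes_to_fence({"a": 12.0, "b": 15.0, "c": 400.0}))   # ['c']
```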
Indeed, which is why I've heard of failover setups where the backup has a means to make very sure that the main system is off before it takes over (often by cutting the power).
Usually we call this STONITH (Shoot The Other Node In The Head).
I did this systematically: at the first sign of an outlier in performance, one system would move itself to another platform and shut itself down. The shutdown meant turning all services off and letting someone log in to investigate and rearrange it again. This system allowed different roles to be assigned to different platforms. The platform was bare metal or a bhyve VM. It worked perfectly.
Docker and Kubernetes have health check mechanisms to help solve for this;
Docker docs > Dockerfile HEALTHCHECK instruction: https://docs.docker.com/reference/dockerfile/#healthcheck
Podman docs > podman-healthcheck-run, docker-healthcheck-run: https://docs.podman.io/en/v5.4.0/markdown/podman-healthcheck...
Kubernetes docs > "Configure Liveness, Readiness and Startup Probes" https://kubernetes.io/docs/tasks/configure-pod-container/con...
Always thought Riak was a really cool database; distributed by default kinda blew my mind at the time! Would love to hear more about Basho.
10 years later, any details I could offer would be relatively suspect; my memory is pretty poor. Still, I'll make an attempt. Don't write a book based on anything below.
I can honestly say it's the best job I ever had, working with the smartest collection of people (I've been fortunate to work at multiple startups, and while Ian Murdock put together a solid crew at Progeny, Basho was just chock full of brilliant people).
Riak was designed for rock solid, massive scale data. The problem was that most companies didn't really need that, and by the time someone reached that size, they had a significant investment in databases that were much more developer-friendly.
We attempted to make it easier to work with by introducing CRDTs, so developers wouldn't necessarily have to write as much application logic to handle conflicts.
Still, it's hard to sell a database when you can't demonstrate simple queries, because your choices were primarily "update static views as you add new data so you don't actually have to run queries in real-time" or "write map-reduce logic in Erlang".
And, it was hard to simulate massive scale databases, so while our resilience and operations stories were pretty good, it was challenging to anticipate how, and where, performance might start breaking down.
I miss it dearly, but I was also lucky enough to never be on a support call when someone was trying to rescue a failing cluster that served millions of customers.
I will say that Erlang made that last scenario much more manageable: the ability to connect to a live node and see exactly what was happening, and the ability to hot patch a function on the fly, was incredibly helpful.
> When a system uses very short intervals, such as sending heartbeats every 500 milliseconds
500 milliseconds is a very long interval on a CPU timescale. Funny how we all tend to judge intervals based on human timescales.
Of course, the best way to choose heartbeat intervals is based on metrics like transaction failure rate or latency.
Top shelf would be noticing an anomaly in behavior for a node and then interrogating it to see what’s wrong.
Automatic load balancing always gets weird, because it can end up sending more traffic to the sick server instead of less, because the results come back faster. So you have to be careful with status codes.
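One way to avoid that trap (a sketch, not any particular load balancer's algorithm): score backends on successful responses only, so a node that fails fast doesn't look attractive.

```python
import random

def pick_backend(stats):
    """stats: backend -> (ok_count, err_count, avg_ok_latency_ms).
    Weight by success rate over successful latency; fast errors earn no credit."""
    weights = {}
    for backend, (ok, err, ok_latency_ms) in stats.items():
        total = ok + err
        success_rate = ok / total if total else 0.0
        weights[backend] = success_rate / max(ok_latency_ms, 1.0)
    backends = list(weights)
    if sum(weights.values()) == 0:
        return random.choice(backends)          # everything is failing; pick any
    return random.choices(backends, weights=[weights[b] for b in backends])[0]

# A backend returning fast 500s has success_rate ~0 and gets almost no traffic,
# instead of "winning" because its responses come back quickly.
```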
You have to consider the tail latencies of the system responding plus the network in between. The p99 is typically much higher than the average. Also, you may have to account for GC, as was mentioned in the article. 500ms gets used up pretty fast.
500ms is actually a very short interval for heartbeats in modern distributed systems. Kubernetes nodes out of the box send heartbeats every 10s, and Kubernetes only declares a node as dead when there's no heartbeat for 40s.
The relevant timescale here is not CPU time but network time. There's so much jitter in networks that if your heartbeats are on CPU scale (even, say, 100ms) and you wait for 4 missed before declaring dead, you'd just be constantly failing over.
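A hedged sketch of sizing the timeout from observed jitter rather than a fixed constant (the idea behind phi-accrual-style detectors, heavily simplified; the k and floor values are arbitrary):

```python
import statistics

def suspect_after_s(interarrival_s, k=6.0, floor_s=1.0):
    """Declare a node suspect only after k standard deviations beyond the mean
    heartbeat gap, so ordinary network jitter doesn't cause constant failovers."""
    mean = statistics.fmean(interarrival_s)
    stdev = statistics.pstdev(interarrival_s)
    return max(floor_s, mean + k * stdev)

# ~10 s heartbeats with ~0.5 s of jitter -> suspect after roughly 13 s
print(suspect_after_s([9.5, 10.2, 11.0, 9.8, 10.5]))
```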
cat /proc/sys/net/ipv4/tcp_keepalive_time
7200
That is two hours in seconds.
Well, it is called a heartbeat after all, not an oscillator beat :-)
Does anyone have recommendations on books/papers/articles which cover gossip protocols?
I have been more interested in learning about gossip protocols and how they are used, different tradeoffs, etc.
Two interesting papers:
* Epidemic broadcast trees: https://asc.di.fct.unl.pt/~jleitao/pdf/srds07-leitao.pdf
* HyParView: https://asc.di.fct.unl.pt/~jleitao/pdf/dsn07-leitao.pdf
The iroh-gossip implementation is based on those: https://docs.rs/iroh-gossip/latest/iroh_gossip/
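For a feel of the mechanics, a minimal push-style gossip round looks something like this (a toy sketch, not the protocols from those papers):

```python
import random

def gossip_round(peers_of, state, fanout=3):
    """One synchronous round: every node pushes what it knows to `fanout`
    random peers; a rumor reaches everyone in O(log N) rounds w.h.p."""
    updates = []
    for node, known in state.items():
        targets = random.sample(peers_of[node], min(fanout, len(peers_of[node])))
        for target in targets:
            updates.append((target, set(known)))
    for target, known in updates:       # apply after the round, so it's synchronous
        state[target] |= known

nodes = [f"n{i}" for i in range(16)]
peers_of = {n: [p for p in nodes if p != n] for n in nodes}
state = {n: set() for n in nodes}
state["n0"].add("rumor-1")              # one node starts with the rumor
rounds = 0
while not all(state[n] for n in nodes):
    gossip_round(peers_of, state)
    rounds += 1
print(f"everyone heard the rumor after {rounds} rounds")
```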
Thank you
10 years ago I implemented SCAMP (a gossip protocol) in Clojure; you might find it interesting, the implementation is quite small: https://github.com/cipherself/gossip
https://thesecretlivesofdata.com/raft/
> https://thesecretlivesofdata.com/raft/
Are you suggesting using Raft as a gossip protocol? Running a replicated state machine with leader election, replicated logs, and stable storage?
Raft is a consensus protocol, which is very different from a gossip protocol.
I'm sorry, I got confused.
I’ve worked with a few, and seen a few built. IME they are never really worth it. They make the system much more complex, fragile, and hard to debug. The scale limits they aim to address are just not a big deal anymore. A small group of “coordinator” nodes to do this is much simpler and easier to build and scale.
> The scale limits they aim to address are just not a big deal anymore. A small group of “coordinator” nodes to do this is much simpler and easier to build and scale.
What changed that made it easier?
Faster hardware (CPU and NIC) and better low-overhead networking APIs in Linux. One machine can really manage a ton of low-volume connections.
Small example from CZMQ - High-level C Binding for ZeroMQ: http://czmq.zeromq.org/manual:zgossip . Perhaps not self-contained; one might have to read the manuals of other resources, such as zactor and zsock, or of the application Zyre - Local Area Clustering for Peer-to-Peer Applications: https://github.com/zeromq/zyre .
While not a book/paper/article, this is good implementation practice: https://fly.io/dist-sys/
Check out SWIM and its extensions: https://en.wikipedia.org/wiki/SWIM_Protocol https://arxiv.org/abs/1707.00788
Some fuzzy thinking in here. "A heartbeat sent from a node in California to a monitor in Virginia might take 80 milliseconds under normal conditions, but could spike to 200 milliseconds during periods of congestion." This is not really the effect of congestion, or at best this sentence misleads the reader. The mechanism that causes high latency during congestion is dropped frames, which are retried at the protocol level based on timers. You can get a 200ms delay between two nodes even if they are adjacent, because the TCP minimum RTO is 200ms.
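For reference, the retransmission timer behind that floor is computed roughly like this (the RFC 6298 smoothing with a Linux-style 200 ms minimum; a sketch, not kernel code):

```python
def rto_update(srtt, rttvar, rtt_sample, min_rto=0.200):
    """One RFC 6298 update step: smoothed RTT plus 4x the RTT variance,
    clamped here to the Linux-style 200 ms minimum RTO."""
    alpha, beta = 1 / 8, 1 / 4
    if srtt is None:                     # first measurement
        srtt, rttvar = rtt_sample, rtt_sample / 2
    else:
        rttvar = (1 - beta) * rttvar + beta * abs(srtt - rtt_sample)
        srtt = (1 - alpha) * srtt + alpha * rtt_sample
    return srtt, rttvar, max(min_rto, srtt + 4 * rttvar)

# Even adjacent nodes with ~1 ms RTTs pay at least min_rto (200 ms) for a
# single dropped heartbeat packet before the retransmit goes out.
print(rto_update(None, None, 0.001))     # RTO floors at 0.2 s
```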
Congestion manifests as packet queueing as well as packet dropping. 120 ms would be a lot of queuing, especially if we assume the 1000 node cluster is servers on high bandwidth networks, but some network elements are happy to buffer that much without dropping packets.
You could also get a jump to 200 ms round trip if a link in the path goes down and a significantly less optimal route is chosen. Again, 120 ms is a large change, but routing policies don't always result in the best real world results; and while the link status propagates, packets may take longer paths or loop.
I've been noodling a lot on how IP/ARP works as a "distributed system". Are there any reference distributed systems that have a similar setup of "optimistic"/best effort delivery? IPv6 and NDP seem like they could scale a lot, what would be the negatives about using a similar design for RPC?
Why can't network time synchronization services like SPTP and WhiteRabbit also solve for heartbeats in distributed systems?
A few years ago, I created a simple Broker (middle layer) program with NetMQ (C# implementation of ZeroMQ)
I did not have to worry about a "Heartbeat" because the way the...
Client <--> Broker <--> Worker
...would communicate already handled it for me.
As for the Worker, it would send a message to the Broker like "Do you have anything for me?" and the Broker would either respond with "Nothing at the moment" or "Yes.. here you go!"
Either way, we have confirmation between the two services. The Broker keeps a log of the last interactions.. and so does the Worker. If the Worker has not heard back from the Broker in about 10 mins, it would attempt to reconnect. If this fails 3 times, it would send out an email.
Same for the Broker. If it has not heard anything from the Worker, it will send out an email.
As for the Client, that will always send new messages to the Broker. It also keeps tabs on all messages, ensuring the Broker confirms a message is Completed before removing it from the Client system. Every now and then the Client asks if a message was processed, to which the Broker responds with "No.. still processing" or "Successful/Error".
Regardless, it has a "heartbeat" by default, as well as reliability and low chance of data loss (well, data can get lost... trying to work out a perfect system is difficult but we had something decent beyond the scope of this post)
It was a good system. It was light, and requests from the client or worker were answered quickly by the broker. It was pretty fast compared to the old system, which relied on SQL Server queries to check the state of data, with SQL Server replication on the client systems. That was a "quick and easy" solution when you had 10 clients; once it got to over 40 it concerned me a lot.
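The bookkeeping described above boils down to something like this (a plain-Python sketch of the idea, not the actual NetMQ code; names and thresholds follow the description above):

```python
import time

RECONNECT_AFTER_S = 600        # ~10 minutes of silence before trying to reconnect
MAX_RECONNECT_TRIES = 3

class PeerWatch:
    """Tracks last contact with the other side of a request/reply link.
    The normal poll/ack traffic doubles as the heartbeat."""

    def __init__(self, alert):
        self.alert = alert                 # e.g. a send-email callback
        self.last_contact = time.monotonic()
        self.failed_reconnects = 0

    def saw_message(self):                 # call on every request or reply
        self.last_contact = time.monotonic()
        self.failed_reconnects = 0

    def tick(self, reconnect):             # call periodically
        if time.monotonic() - self.last_contact < RECONNECT_AFTER_S:
            return
        if reconnect():                    # returns True if the link came back
            self.saw_message()
        else:
            self.failed_reconnects += 1
            if self.failed_reconnects >= MAX_RECONNECT_TRIES:
                self.alert("peer unreachable after 3 reconnect attempts")
```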
Nice article, I will use the concept for my own network node bots ^^