Linux-HA Heartbeat

There are two kinds of heartbeats that come to mind for use in Linux-HA.  The first is a ring heartbeat, and the second is a broadcast (IP) heartbeat.  These names relate to the interconnection topology necessary to support them.

Ring Heartbeats

In this topology, each machine is connected to its neighbors in a ring, and each machine forwards the heartbeats it receives on to its neighbors.  A two-node network requires one serial port per machine.  Larger networks need two ports per machine.  If more than three nodes are in the network, the ring can become separated into two disconnected segments if more than one machine fails.  If the two failed machines are neighbors, however, the remaining machines continue to be interconnected.
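The partitioning behavior described above can be checked with a small sketch.  It assumes nodes are numbered 0 to n-1 around the ring and that a failed node forwards nothing; the function name surviving_segments is hypothetical and is not part of any Linux-HA code.

```python
def surviving_segments(n, failed):
    """Group the surviving nodes of an n-node ring by mutual reachability.

    Nodes are numbered 0..n-1 and node i is wired to (i - 1) % n and
    (i + 1) % n.  A failed node forwards nothing, so it breaks the ring
    at its position.
    """
    failed = set(failed)
    alive = [i for i in range(n) if i not in failed]
    if not alive:
        return []
    if not failed:
        return [alive]
    # Start walking just after a failed node so that every run of
    # surviving nodes is seen as one contiguous segment.
    start = (min(failed) + 1) % n
    segments, current = [], []
    for step in range(n):
        node = (start + step) % n
        if node in failed:
            if current:
                segments.append(current)
                current = []
        else:
            current.append(node)
    if current:
        segments.append(current)
    return segments


# Two non-adjacent failures split a five-node ring into two groups.
print(surviving_segments(5, {0, 2}))
# Two adjacent failures leave the survivors in one group.
print(surviving_segments(5, {0, 1}))
```

More than one segment in the result means the cluster has been divided, which is the case the manual serial switches are meant to repair.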

For the ring to remain robust with respect to failures, it is necessary for heartbeats to travel around in both directions.  If the machines are interconnected with the proper serial switches, failed machines can be manually switched out of the ring, restoring integrity to the ring, and rendering it failsafe with regard to multiple-node failures.  Such switches are unnecessary for networks of three or fewer nodes.  A diagram of such a switching arrangement is shown below:
 
 

The placement of the null modem cables is important to the switching arrangement.  One must be between the switch and the host, and one must be between the switch and one of the neighbors.  This ensures that each "signal path" contains exactly one null modem.  This arrangement is probably difficult to cable correctly the first time, but adds a certain amount of (manually triggered) additional protection from failures for rings of more than three nodes.  Depending on how the "double cross" switch is wired, it is likely that the "current" machine will be connected to itself when it is switched out of the ring.  The heartbeat software should be prepared to deal with such a situation.

Since most PC motherboards come with two serial ports, this mechanism is ideal for small HA clusters.  Since serial ports are highly reliable, they make a good mechanism for heartbeat communication.
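One way the heartbeat software might cope with both ring forwarding and a machine that is connected to itself is sketched below.  It assumes each heartbeat carries the name of the node that originated it.  The Heartbeat record, the port names, and the forward_port function are all hypothetical illustrations, not an existing Linux-HA interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Heartbeat:
    origin: str  # name of the node that generated this heartbeat (assumed field)
    seq: int     # per-origin sequence number (assumed field)


# Hypothetical names for the two serial ports on each machine.
PORTS = ("ttyS0", "ttyS1")


def forward_port(local_node, arrived_on, hb):
    """Return the port to forward hb on, or None if it should be dropped."""
    if hb.origin == local_node:
        # Our own heartbeat came back.  Either it travelled all the way
        # around the ring, or the switch has looped this machine back to
        # itself.  In both cases there is nothing left to forward.
        return None
    # Pass the heartbeat on in the direction it was already travelling,
    # so that heartbeats circulate both ways around the ring.
    return PORTS[1] if arrived_on == PORTS[0] else PORTS[0]


print(forward_port("node-a", "ttyS0", Heartbeat("node-b", 1)))
print(forward_port("node-a", "ttyS0", Heartbeat("node-a", 7)))
```

Dropping heartbeats by origin also keeps a message from circulating forever once it has completed a full trip around the ring.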
 

Broadcast Heartbeats

Broadcast heartbeats are used on a medium like Ethernet which supports broadcast connectivity.  Each node broadcasts its status to the cluster, and the other members of the cluster receive the heartbeat broadcast directly.  I use the term "broadcast" heartbeat rather than "IP" heartbeat, because some media (like IrDA) have broadcast capabilities, but are not based on IP.  Broadcast heartbeats are prone to being delayed or dropped when traffic is heavy on the network that connects the endpoints, but they don't have the network divisibility problems that a ring-based architecture has.  For large numbers of nodes, broadcast heartbeats are significantly simpler to wire than ring networks.  However, network cards and hubs are generally less reliable than simple serial port communications.
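Because broadcast heartbeats can be delayed or dropped, a receiver probably should not declare a node dead after a single missed heartbeat.  The sketch below shows one possible approach: tolerate a fixed number of missed intervals before reporting a node as dead.  The HeartbeatMonitor class and its interval and missed_allowed parameters are assumptions made for illustration, and the values shown are not recommendations.

```python
class HeartbeatMonitor:
    """Track the last heartbeat seen from each node (illustrative only)."""

    def __init__(self, interval=1.0, missed_allowed=3):
        # A node is reported dead only after this many seconds of silence.
        self.deadline = interval * missed_allowed
        self.last_seen = {}

    def record(self, node, now):
        """Note that a heartbeat from node arrived at time now (seconds)."""
        self.last_seen[node] = now

    def dead_nodes(self, now):
        """Return the nodes that have been silent for longer than the deadline."""
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.deadline
        )


monitor = HeartbeatMonitor(interval=1.0, missed_allowed=3)
monitor.record("node-a", 0.0)
monitor.record("node-b", 0.0)
monitor.record("node-a", 2.5)    # node-a is still sending; node-b has gone quiet
print(monitor.dead_nodes(2.0))   # inside the tolerance window for both nodes
print(monitor.dead_nodes(3.5))   # node-b has now exceeded the deadline
```

A larger missed_allowed value makes the cluster slower to notice a real failure but less likely to react to a burst of network congestion.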
 

Summary

Neither form of heartbeat is superior to the other in all configurations.  It is desirable for Linux-HA to eventually support both forms of heartbeat.  It may even be desirable to support both simultaneously.

Comments, questions (on this rough draft)?
Send mail to alanr@unix.sh or the Linux-HA mailing list  (though if you don't like it, the members of the list will claim they never heard of this document :-))