STONITH: Shoot The Other Node In The Head

STONITH is a graphic term for a very effective, simple, and useful hardware technique used in High-Availability clusters, especially those with shared data.

Motivations for STONITH

To understand why STONITH is a useful technique, it is appropriate to look at some situations in which it is used. If two nodes in a cluster share a filesystem, for example, by shared SCSI, then it is essential to the integrity of the data that no more than one of the two machines writes to the disk at a time.

The situation where two different systems each falsely believe the other to be dead is called a partitioned cluster. If a cluster becomes partitioned and a shared disk is involved, the result is usually disastrous for the data on the disk. There are many different techniques for minimizing the chances of a partitioned cluster occurring, but few that can guarantee data integrity if one does occur. STONITH is perhaps the most effective of these techniques. If the integrity of shared data in your cluster is a high enough priority, the chances are that you'll eventually consider using STONITH to guarantee it. With STONITH, it is certain that the other machine isn't writing to the disk, because it has been killed - shot in the head, so to speak.

Even a single write from the other machine can irreparably damage the data on the disk. One way of attempting to guarantee exclusive access is to use SCSI reserve and release commands. Unfortunately, these require proper OS support, only work on some kinds of devices, and can be defeated by the system on the other side if each system believes the other has died. There are many kinds of communication difficulties and hardware or software hangs which can cause each of two machines to believe the other is dead when, in fact, it is not. It is imperative that a machine which hung, and whose shared disk has since been taken over by another machine, not later wake up and continue writing to the disk. STONITH is a technique which guarantees this, reflecting the principle that although high availability is desired, data integrity is an even higher priority.

Quorum

A related cluster concept is quorum. Quorum is the idea that when a cluster splits into several partitions, by agreement, only the majority partition will continue to operate. So, if a machine isn't part of the majority (i.e., doesn't have quorum), it stops providing service. Although quorum is a necessary safeguard for the correct operation of many kinds of cluster services, it is not sufficient to guarantee the correct operation of physically shared disks. This is because it operates at too high a level, and takes too long to act. It may take a cluster partition many seconds to decide it has lost quorum, and if it was hung, it may complete many queued I/O operations after it resumes working, but before it realizes it has lost quorum. If the hang was long enough, the shared device will have already been remounted by another cluster member, and these I/O operations will damage the data on the disk.
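
As a rough illustration (not from the original text), the quorum decision boils down to a strict majority test over the configured membership; the function and variable names below are hypothetical:

    /* Hypothetical sketch of a strict-majority quorum test.
     * total_nodes and visible_nodes are assumed to come from the cluster
     * membership layer; they are not part of any real API.
     */
    #include <stdbool.h>

    bool have_quorum(int total_nodes, int visible_nodes)
    {
        /* A partition has quorum only if it can see MORE than half of the
         * configured nodes; exactly half is not enough, because the other
         * half could make the same claim.
         */
        return 2 * visible_nodes > total_nodes;
    }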

Risks associated with STONITH

There is a risk associated with STONITH: two machines may each try to power the other off simultaneously, and then both go off and stay off. Although this does ensure that no harm is done to the data, it is not the most desirable outcome. There are several ways of minimizing this risk. One method is to use STONITH hardware which can only be operated by one machine at a time. This prevents more than one machine from being reset at once. Another method is to use variable delays: for example, node one waits one extra second, node two waits two extra seconds, and so on. In this way, the probability of simultaneous resets due to communication failures is minimized.
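
As a minimal sketch of the variable-delay idea, assuming each node knows its own small integer ID, something like the following could be wrapped around the reset call (shoot() is the primitive named later in this article; the rest is invented):

    /* Hypothetical sketch: stagger reset attempts so that two nodes which
     * each believe the other is dead are unlikely to fire at the same
     * instant.  node_id is assumed to be a small, unique integer.
     */
    #include <unistd.h>

    extern int shoot(char *node);   /* the STONITH primitive from this article */

    int staggered_shoot(int node_id, char *victim)
    {
        sleep((unsigned int) node_id);  /* node 1 waits 1s extra, node 2 waits 2s, ... */
        return shoot(victim);
    }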

Another factor to consider is that the STONITH hardware itself must not become a single point of failure, and that it must reliably report whether the other machine was reset. A STONITH device which requires the attached machines to share a common power source is not acceptable, as it introduces an SPOF (Single Point of Failure) in system power. Before taking over a shared device, it is necessary to STONITH the previous owner, and not to proceed with taking over ownership of the shared data unless it is reliably known that the previous owner cannot continue writing to it. One must obviously check the return code from the STONITH operation, and not proceed unless the reset was carried out. Because of the power entrusted to the STONITH device, it is vital that it not become a security hole. It is important that the STONITH device be protected by adequate security measures, and not be usable as part of a denial-of-service attack.
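
To make the "check the return code" rule concrete, here is a hedged sketch of a takeover sequence; shoot() is the primitive described later in this article, and take_over_shared_disk() is an invented placeholder for claiming the shared storage:

    /* Hypothetical takeover sequence: never touch the shared disk unless the
     * STONITH operation reports that the previous owner was really reset.
     */
    #include <stdio.h>

    extern int shoot(char *node);             /* assumed to return 0 on a confirmed reset */
    extern int take_over_shared_disk(void);   /* hypothetical helper */

    int fail_over_from(char *dead_node)
    {
        if (shoot(dead_node) != 0) {
            /* The reset could not be confirmed: do NOT take over the data.
             * Proceeding here would risk two writers on the shared disk.
             */
            fprintf(stderr, "STONITH of %s failed; refusing takeover\n", dead_node);
            return -1;
        }
        return take_over_shared_disk();
    }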

Implementations

There are basically two approaches that can be used to implement STONITH.

To reset a machine, you can either duplicate the operation of the reset switch (activate the reset lead), or you can simply power the machine off and back on. For PCs with Intel motherboards, one could instead use the EMP port to reset nodes, using the IPMI (Intelligent Platform Management Interface) protocol. If you're simply power cycling machines with ATX power supplies, you must ensure that the power supply is configured to power the machine back up when power is restored.

IPMI

FermiLab has implemented a version of IPMI (ftp://linux-rep.fnal.gov/pub/ipmi/), and VA Linux has implemented it as part of their VACM cluster management software. IPMI requires a serial connection to each machine being monitored. This works fine for two-node systems (where each can be directly connected to the other), but for larger systems, a serial concentrator attached to a control machine may be required.

External Power Control

In an ideal STONITH implementation, the reset mechanism is exclusive, and any machine can directly reset any other machine in the cluster. Furthermore, you still want each machine to be able to be powered by a separate UPS (power source), so that the UPS does not become a single point of failure.

If your cluster is large enough to require multiple independent switches, then it is desirable for each machine to incorporate delays to minimize the probability of two machines simultaneously shutting each other down.

The most common implementations of remotely controlled power switches support serial or telnet connections. Serial connections are secure, but allow only a single machine to connect to the switch. Telnet connections allow every machine to reach the switch over the network. However, telnet has significant security issues associated with it (passwords are sent in the clear). An ideal implementation would use ssh or other strong authentication mechanisms to access the switch. Unfortunately, at this point, no such switches are commercially available. If a small, inexpensive computer can be dedicated to controlling the switch via serial ports, this could be easily accomplished. In some sense, a network-attached STONITH device becomes a quorum device. If the STONITH device cannot be accessed, in effect, you do not have quorum, and cannot proceed to take control of the shared storage device.
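
As an illustration only, a power cycle through such a switch usually amounts to sending an "outlet off" command, waiting for the supply to drain, and then sending "outlet on" over the serial or network connection. The command strings below are placeholders, not the syntax of any real switch:

    /* Hypothetical power cycle through a remote power switch.  fd is an
     * already-open serial port or TCP connection to the switch; the command
     * strings are placeholders, since each switch model has its own syntax.
     */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int power_cycle_outlet(int fd, int outlet)
    {
        char cmd[64];

        snprintf(cmd, sizeof(cmd), "OFF %d\r\n", outlet);   /* placeholder syntax */
        if (write(fd, cmd, strlen(cmd)) < 0)
            return -1;

        sleep(5);   /* let the supply drain so the node really resets */

        snprintf(cmd, sizeof(cmd), "ON %d\r\n", outlet);    /* placeholder syntax */
        if (write(fd, cmd, strlen(cmd)) < 0)
            return -1;

        /* A real implementation must also read back and verify the switch's
         * status reply before reporting success to the cluster software.
         */
        return 0;
    }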

STONITH API

    int shoot(char * node);

Well, in fact, the actual stonith API is somewhat more complex than this ;-), but not that much.
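
To give a feel for what "somewhat more complex" might mean, here is a hypothetical sketch of the kinds of operations such an API tends to need. This is not the actual stonith interface, only an illustration; every name below is invented.

    /* Hypothetical sketch of a richer STONITH API.  A plug-in describes one
     * reset device, knows which hosts it can reset, and can report whether
     * the device itself is healthy.  All names are illustrative only.
     */
    struct stonith_device;

    /* Create/destroy a handle for one physical reset device. */
    struct stonith_device *stonith_open(const char *device_type,
                                        const char *config_string);
    void stonith_close(struct stonith_device *dev);

    /* Is the reset device itself reachable and working? */
    int stonith_device_status(struct stonith_device *dev);

    /* Which hosts can this device reset?  Returns a NULL-terminated list. */
    char **stonith_hostlist(struct stonith_device *dev);

    /* The operation this article calls shoot(): returns 0 only if the
     * named node was verifiably reset or powered off. */
    int stonith_reset(struct stonith_device *dev, const char *node);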

Summary

The STONITH technique has many advantages in High-Availability systems. It is a simple, reliable, and easy-to-understand method for ensuring data integrity in a shared storage environment. On the other hand, there are still many issues which must be taken into consideration; otherwise data integrity, security, or cluster robustness can be unintentionally compromised.


Alan Robertson <alanr@unix.sh>
