1-800-MAGIC: An observation for fault tolerant systems

One morning during our vacation stay in California I have awoken to an unpleasant surprise: a front tire went flat overnight. A trivial event, in theory, as the car is fault tolerant when it comes to tires: it has a full size spare.

Unfortunately, the spare was cold: even though the right way to rotate tires is by including the spare in the rotation cycle, in practice I failed to do this, so the spare was simply hanging in its suspension system for several years, untouched and untested.

When I installed it, the very first thing that I discovered was that the pressure in the tire was very low. On this particular car (Toyota Sienna) the spare is well hidden, and it is very easy to forget to check its pressure, so I did. It was not completely flat, but it was not drivable. Luckily, hotel was close to a gas station with an air compressor, but if a tire were to burst far away from the civilization, driving with this spare long distance would have been slow, painful, and unsafe.

Second problem that arose from incorrect rotation schedule was that while the rest of the tires were well worn, this one was completely new. Which means that it was appreciably larger in diameter, making the car asymmetric, and it was on the front wheel. A tire shop could have swapped one of the rear wheels for the flat front, and have the spare installed at the rear, where it would have been less critical, but doing this on a hotel parking lot with one jack was out of question.

By the time the problem was resolved, it took me a good part of the day waiting for Costco to replace four tires instead of enjoying a bike ride across the Golden Gate Bridge with the rest of my family.

So what does all this have to do with the design of fault tolerant systems?

Basically, if the system relies on a cold spare – a replacement part that is squared away, but is not part of the day to day operation, there is a good chance that the spare won’t work – and you will find that out at the worst possible moment, exactly when you need to use it.

Defective spares are not the only source of problems during failure recovery. The recovery process itself is subject to bugs and operator errors. Usually code paths that are activated during recovery are not exercised daily, and can and often do contain bugs that are not ferreted out during regular testing.

If the repair process involves an operator, things can get even worse: an operator also does not execute failure recovery process often enough to be familiar with it, and the probability of a fat-fingered action skyrockets. I personally once lost a whole RAID array at home by replacing a working, instead of the failed, disk.

Most fault tolerance models presume that the failures are independent, and the probability distribution of the second failure is the same as the probability distribution of the first. In practice it is usually not true.

In a fault tolerant system, the mean time to second failure is shorter than the mean time to the first failure.

Since failure recovery adds new code paths and new processes, it is impossible to achieve complete independence of the primary and secondary faults. So… what to do?

A typical reaction would be to ensure that testing failure code paths happens regularly. For instance, the example with the car above had a reasonably simple process-based solution: the pressure should be tested before every long trip. Likewise, testing a master service failure could (and should) be a part of acceptance tests before the release to production.

A better way to handle the situation, however, is by a more careful design.

If at all possible, prefer the design where there is only one role, and if one machine fails, the rest just get higher QPS. This should be the default for services that do not require state preservation, like most frontends. In this case the divergent code path is simply absent, and the code that tests whether to eject a failed system from the query path is always active.

This is not always an option, however. Most backend systems require state persistence, and implement a variation of Paxos, Zookeeper, or simple master-slave protocol where there is a defined leader and one or more followers.

Here a failure triggers a complex leader re-election protocol, and the new leader may exercise different hardware components, which may have already failed, but because the follower did not use them, it is not discovered by the time the election happens.

If the system has distinct roles for primary and secondaries, the simplest way to ensure that all machines can execute all roles is to have it rotate the roles during the normal course of operations. This way a premature switch away from a failed master would be likely to be as uneventful as a routine switch that would have happened just half an hour later.

The leader election protocol would be tested not just a few times in the lab and once in production, but exhaustively, many times under all conditions that arise in real life.

In conclusion: choose a car with a full spare, rotate your tires periodically, and have the spare participate in rotation schedule.