Unfortunately, the spare was cold: even though the right way
to rotate tires is to include the spare in the rotation cycle, in practice I
failed to do this, so the spare had simply hung in its carrier under the car for
several years, untouched and untested.
When I installed it, the very first thing I discovered
was that the pressure in the tire was very low. On this particular car (a Toyota
Sienna) the spare is well hidden, and it is very easy to forget to check its
pressure, which is exactly what I had done. It was not completely flat, but it was not drivable.
Luckily, the hotel was close to a gas station with an air compressor, but if a tire
were to burst far away from civilization, driving any long distance on this spare
would have been slow, painful, and unsafe.
The second problem that arose from the incorrect rotation schedule
was that while the rest of the tires were well worn, this one was completely
new, which meant it was appreciably larger in diameter, making the car asymmetric,
and it was on a front wheel. A tire shop could have swapped one of the rear
wheels for the flat front and installed the spare at the rear, where it
would have been less critical, but doing this in a hotel parking lot with one
jack was out of the question.
By the time the problem was resolved, I had spent a good part
of the day waiting for Costco to replace all four tires instead of enjoying a bike
ride across the Golden Gate Bridge with the rest of my family.
So what does all this have to do with the design of fault-tolerant systems?
Basically, if a system relies on a cold spare (a replacement
part that is squared away but is not part of the day-to-day operation), there
is a good chance that the spare won't work, and you will find that out at the
worst possible moment: exactly when you need to use it.
Defective spares are not the only source of problems during
failure recovery. The recovery process itself is subject to bugs and operator
errors. Code paths that are activated only during recovery are not exercised
daily, and they can, and often do, contain bugs that regular testing never
ferrets out.
If the repair process involves an operator, things can get
even worse: an operator also does not execute the failure recovery process often
enough to be familiar with it, and the probability of a fat-fingered action
skyrockets. I personally once lost a whole RAID array at home by replacing a
working disk instead of the failed one.
Most fault tolerance models presume that failures are
independent and that the probability distribution of the second failure is the same
as that of the first. In practice this is usually not true:
in a fault-tolerant system, the mean time to the second failure is shorter
than the mean time to the first.
Since failure recovery adds new code paths and new processes,
it is impossible to achieve complete independence of the primary and secondary faults.
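To make this concrete, here is a minimal simulation sketch, with all rates invented purely for illustration: if the spare's effective failure rate after a failover is higher than the primary's (extra load, untested hardware, freshly exercised recovery code), the mean time to the second failure comes out correspondingly shorter.

    import random

    # Illustrative sketch only; the rates below are made up, not measured.
    # Model the primary's time to failure as exponential with rate LAMBDA,
    # and the time from failover to the second failure as exponential with
    # a higher rate K * LAMBDA (extra load plus unexercised recovery code).
    LAMBDA = 1.0 / 1000.0   # hypothetical: one failure per 1000 hours
    K = 3.0                 # hypothetical: the spare fails 3x as often
    TRIALS = 100_000

    first = [random.expovariate(LAMBDA) for _ in range(TRIALS)]
    second = [random.expovariate(K * LAMBDA) for _ in range(TRIALS)]

    print(f"mean time to first failure:  {sum(first) / TRIALS:8.1f} hours")
    print(f"mean time to second failure: {sum(second) / TRIALS:8.1f} hours")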
So… what to do?
A typical reaction would be to ensure that the failure
code paths are tested regularly. For instance, the car example above has a
reasonably simple process-based solution: the spare's pressure should be checked before
every long trip. Likewise, testing a master service failure could (and should) be
part of the acceptance tests before a release to production.
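Here is a sketch of what such an acceptance test might look like. The acceptance_harness module, the start_test_cluster helper, and the cluster methods below are hypothetical placeholders for whatever deployment tooling the service actually has; the point is only that killing the master is part of the routine test suite.

    import time

    import pytest  # assuming a pytest-based acceptance suite

    # Hypothetical harness; substitute the real cluster-management tooling.
    from acceptance_harness import start_test_cluster


    def test_master_failure_is_survivable():
        """Kill the current master and verify that the service recovers."""
        cluster = start_test_cluster(replicas=3)
        try:
            cluster.write("key", "value")            # baseline sanity check
            cluster.kill(cluster.current_master())   # inject the failure

            deadline = time.time() + 30              # allow time for re-election
            while time.time() < deadline:
                if cluster.has_master() and cluster.read("key") == "value":
                    break
                time.sleep(1)
            else:
                pytest.fail("no usable master 30 seconds after killing the old one")
        finally:
            cluster.teardown()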
A better way to handle the situation, however, is through more careful
design.
If at all possible, prefer a design in which there is only
one role: if one machine fails, the rest simply absorb a higher QPS. This should be
the default for services that do not require state preservation, such as most
frontends. In this case the divergent code path is simply absent, and the code
that decides whether to eject a failed machine from the query path is always
active.
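A minimal sketch of that idea, with the backend hostnames invented for the example: every replica is interchangeable, and the check that skips a dead one sits on the normal request path rather than in a recovery-only branch.

    import urllib.request

    # Hostnames are placeholders; in a real deployment this list would come
    # from service discovery or a load balancer configuration.
    BACKENDS = ["http://fe1:8080", "http://fe2:8080", "http://fe3:8080"]


    def fetch(path, timeout=2.0):
        """Try backends in order; a failed one is skipped and the survivors
        simply absorb its share of the traffic."""
        last_error = None
        for backend in BACKENDS:
            try:
                with urllib.request.urlopen(backend + path, timeout=timeout) as resp:
                    return resp.read()
            except OSError as err:   # refused connection, timeout, DNS failure
                last_error = err     # this backend is effectively ejected
        raise RuntimeError(f"all backends failed: {last_error}")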
This is not always an option, however. Most backend systems require
state persistence and implement a variation of Paxos, ZooKeeper, or a simple
master-slave protocol in which there is a designated leader and one or more
followers.
Here a failure triggers a complex leader re-election
protocol, and the new leader may exercise hardware components that it never used
as a follower; if those components have already failed, nobody discovers that
until the election happens.
If the system has distinct roles for the primary and the secondaries,
the simplest way to ensure that every machine can execute every role is to
rotate the roles during the normal course of operations. This way a premature
switch away from a failed master is likely to be just as uneventful as the routine
switch that would have happened half an hour later anyway.
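Here is a sketch of such routine rotation. The cluster object and its members, leader, and transfer_leadership methods stand in for whatever coordination layer the system actually uses (a Raft leadership transfer, a ZooKeeper-mediated re-election, and so on), and the interval is picked arbitrarily.

    import random
    import time

    ROTATION_INTERVAL_S = 6 * 3600   # hypothetical: hand off leadership every 6 hours


    def rotation_loop(cluster):
        """Periodically hand leadership to a healthy follower, so the code
        path used for an emergency failover also runs several times a day."""
        while True:
            time.sleep(ROTATION_INTERVAL_S)
            followers = [n for n in cluster.members() if n != cluster.leader()]
            if not followers:
                continue                            # nothing to rotate to
            target = random.choice(followers)
            cluster.transfer_leadership(target)     # same path as an unplanned election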
The leader election protocol would be tested not just a few
times in the lab and once in production, but exhaustively, many times under all
conditions that arise in real life.
In conclusion: choose a car with a full-size spare, rotate your
tires periodically, and have the spare participate in the rotation schedule.