Saturday, June 13, 2009

What is your number of nines?

Ran into an interesting page today - a list of scheduled downtimes for Blogger: http://status.blogger.com/.

It looks like Blogger is down for roughly 10 minutes once a month (in addition to a Picasa downtime that impairs its ability to accept images).

10 minutes a month does not look like much, but it does amount to about 2 hours of downtime per year. Is two hours a year good or bad?

The system's availability is defined as the ratio of uptime to the total time:

    Availability = MTTF / (MTTF + MTTR)

where MTTF is the mean time to failure, and MTTR is the mean time to repair, that is, the time it takes to bring the system back online. "Failure" here means any period during which the system cannot process requests, not only a fault: a scheduled downtime is not a bug, but the system is unavailable nevertheless.

Two hours of downtime in a year yield an availability of 365.25 * 24 / (2 + 365.25 * 24) = 99.977%, that is, "3 nines" (above 99.9% but short of 99.99%), which puts Blogger in the category of "Well-managed" systems.
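The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function names `availability` and `nines` are made up for this post, and the two hours of downtime are treated as the MTTR against one year of uptime, exactly as in the formula above.

```python
import math

HOURS_PER_YEAR = 365.25 * 24  # 8766 hours, the same year length used in the text


def availability(mttf_hours, mttr_hours):
    """Fraction of time the system is up: MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


def nines(avail):
    """Number of whole nines, e.g. 0.9997 -> 3.

    The small epsilon guards against floating-point error for inputs
    that sit exactly on a boundary, such as 0.999.
    """
    return math.floor(-math.log10(1.0 - avail) + 1e-9)


# Blogger: roughly 2 hours of scheduled downtime per year of uptime.
blogger = availability(HOURS_PER_YEAR, 2)
print(f"{blogger:.3%} -> {nines(blogger)} nines")
```

Plugging in your own system's yearly downtime in place of the `2` gives a quick answer to the question in the title.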

Here are the definitions of various levels of availability given in Jim Gray and Andreas Reuter's famous book on transaction processing (http://www.amazon.com/Transaction-Processing-Concepts-Techniques-Management/dp/1558601902):


System type              Unavailability (min/year)   Availability   Class
Unmanaged                52560                       90%            1
Managed                  5256                        99%            2
Well-managed             526                         99.9%          3
Fault-tolerant           53                          99.99%         4
High-availability        5                           99.999%        5
Very-high-availability   0.5                         99.9999%       6
Ultra-availability       0.05                        99.99999%      7
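The unavailability column follows directly from the class number: class N allows a fraction 10^-N of the year to be spent down. A rough sketch that regenerates the figures is below. It assumes a 365-day year (525600 minutes), which is what the numbers in the table appear to be based on; the helper name `unavailability_minutes` is invented for this example.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525600; assumed year length that matches the table

CLASS_NAMES = [
    "Unmanaged",
    "Managed",
    "Well-managed",
    "Fault-tolerant",
    "High-availability",
    "Very-high-availability",
    "Ultra-availability",
]


def unavailability_minutes(availability_class):
    """Allowed downtime in minutes per year for a class with that many nines."""
    return MINUTES_PER_YEAR * 10 ** -availability_class


# Print one row per class; the table above shows these values rounded.
for cls, name in enumerate(CLASS_NAMES, start=1):
    print(f"{name:<24}{unavailability_minutes(cls):>12.3f}   class {cls}")
```

Blogger's 120 minutes a year falls between the class 3 limit (526) and the class 4 limit (53), which is why it lands in class 3.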


As Blogger's example shows, it is fairly hard to create a fault-tolerant (or better) system - you have to account for everything from OS and software patching to the maintenance of the power equipment in the data centers.

One might think that hardware failures and software bugs cause most of the availability problems, but it is actually scheduled maintenance that creates the majority of the work, because it causes a lot of downtime. Once you have figured out how to deal with maintenance, the unavailability due to bugs is probably already taken care of by the same measures.

And at a server MTTF of roughly 14 years, one only needs to start worrying about hardware when availability approaches 5 nines (assuming that a failure can be detected and the job reallocated within one hour): one hour lost every 14 years averages out to a little over 4 minutes of downtime per year.
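Here is the same back-of-the-envelope estimate as a small script. It is a sketch under the assumptions stated above (a single server with no redundancy, an MTTF of 14 years, and one hour to detect the failure and move the job elsewhere); the variable names are just illustrative.

```python
HOURS_PER_YEAR = 365.25 * 24

mttf_hours = 14 * HOURS_PER_YEAR  # server MTTF of roughly 14 years, as in the text
mttr_hours = 1.0                  # assumed: detection plus reallocation within one hour

hw_availability = mttf_hours / (mttf_hours + mttr_hours)

# Expected downtime caused by hardware alone, in minutes per year.
downtime_min_per_year = (1.0 - hw_availability) * HOURS_PER_YEAR * 60

print(f"hardware-only availability: {hw_availability:.5%}")
print(f"hardware-only downtime: {downtime_min_per_year:.1f} min/year")
```

The result is a bit over 4 minutes a year, right around the 5 minutes allowed at 5 nines, so hardware only becomes the limiting factor at that level.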

How many nines does your system have?
