Monday, May 17, 2010

Solution to the slow formatting puzzle

A couple of weeks back I posted a puzzle about an experience that I had just had with the Server 2008 R2 software RAID subsystem: the speed of formatting a new, very large RAID-5 array was highly unpredictable. It advanced by a mere 12% a day for the first 4 days, then suddenly sped up 4x and finished the last 50% of the formatting in just one day.

Furthermore, I expected precisely this behavior. Why?

This weirdness occurs because of a typical Microsoft phenomenon, best expressed by an acquaintance of mine who works in Windows Mobile. He thought that the biggest difference between Microsoft's and Apple's approaches to development is this:

Microsoft attacks problems horizontally: it builds a core system, then builds layers on top of it. When it's time to ship and something needs to get cut, it's the top of the stack that goes first - and more often than not it's the user experience.

Apple develops software vertically - it enables a user experience, from top to bottom. When they need to cut, they cut entire experiences (and also maybe the bug fixing, judging from the fact that my iPhone seems to crash considerably more often than most of the Windows Mobile phones I owned in the past). So, for example, the iPhone would ship with Bluetooth support just for the mono phone headset, but no stereo profile. But the experiences that are left are implemented completely, to the maximum possible level of usability.

Back to our RAID problem. The formatting is so slow in the very beginning because two things happen at the same time: the RAID resync and the formatting. Resync gets kicked off immediately after the RAID array is built, and it simply ensures that the parity data (in the case of RAID-5) is correct, or that the mirror (in the case of RAID-1) holds a correct copy of the primary volume.

The other thing that happens once you create a RAID volume from the UI is, of course, formatting. One can select quick format, which only creates a file system. However, most people probably prefer to run a full format when a new drive is added, just to make sure that it is not full of bad sectors.

So we have two write operations going on at once - the first is creating the redundant information, the other is filling the disk with zeroes.

Sequential writing to the disk is very fast these days, but seek time has barely improved in the last decade, and because the two writes happen in different places on the disk, the whole thing is completely dominated by the disk head moving from one track to another. And with a seek time of 15ms you can only have about 67 of these per second.
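To put rough numbers on this, here is a back-of-the-envelope sketch in Python. Only the 15ms seek time comes from the discussion above; the sequential write speed and the chunk sizes are made-up values for illustration, and the model (one seek before every chunk written) is a simplification, not a description of what the Windows RAID driver actually does.

```python
# Toy model of two writers interleaving on one disk.
# Only SEEK_TIME_S comes from the post; everything else is an assumption.

SEEK_TIME_S = 0.015          # 15 ms per seek (from the post)
SEQUENTIAL_MB_S = 100.0      # assumed sequential write speed, illustration only


def seeks_per_second(seek_time_s):
    """Upper bound on the number of seeks the head can do in one second."""
    return 1.0 / seek_time_s


def interleaved_throughput_mb_s(chunk_mb, sequential_mb_s, seek_time_s):
    """Effective throughput when every chunk written costs one seek first."""
    transfer_time_s = chunk_mb / sequential_mb_s
    return chunk_mb / (seek_time_s + transfer_time_s)


if __name__ == "__main__":
    print("seeks per second: %.1f" % seeks_per_second(SEEK_TIME_S))
    # Hypothetical chunk sizes: the smaller the chunk written between
    # seeks, the more the seek time dominates.
    for chunk_mb in (0.064, 1.0, 16.0):
        rate = interleaved_throughput_mb_s(chunk_mb, SEQUENTIAL_MB_S, SEEK_TIME_S)
        print("chunk %6.3f MB -> %.1f MB/s" % (chunk_mb, rate))
```

Under these assumptions, the smaller the amount written between two seeks, the further the effective rate falls below the sequential rate, which is the effect described above.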

Of course, if the format is in progress, there is absolutely no sense in doing a resync at the same time - whatever redundancy data the resync creates is going to be instantly overwritten by the format.

But there was no Steve Jobs standing over the devs' shoulders, and getting the (filesystem) format in sync with the (block device) RAID required two different teams to do something together, so it probably got punted. I am sure it's in a readme somewhere... and now in at least one blog :-).

So to avoid unnecessary delays, select the quick format option when adding the RAID volume, wait for the resync to finish, THEN format it again with the full format.


Eric Lee Green said...

AH! So *that* is the solution to the mystery. It didn't even occur to me, because no Unix filesystem all the way back to 1975 has ever had any concept of "formatting" that included writing zeros to every block. So when I rebuilt my large 1TB Linux RAID volume after a disk crash, I didn't do a mkfs on it, but if I *had* done that to create an XFS filesystem on it, the XFS mkfs operation would have completed almost immediately (since it only has to write a few hundred blocks of metadata), and the remainder of the RAID rebuild would have continued on at full speed.

Another optimization that at least recent versions of the Linux RAID system will do on an in-production system is elevator optimizations *of RAID REBUILD OPERATIONS too*. That is, even if Unix had the concept of a "full format", your thousands of blocks of zeros written to the drive would end up getting written out in order, rather than the drive head thrashing back and forth. There would be *some* thrashing, certainly, because the elevator would eventually shuffle back to where the rebuild is happening and then continue from there, but not the sort of super thrashing that you're describing. The RAID rebuild slows *way* down while filesystem operations are underway because of the operations of the elevator, but that was a design decision made by the authors of the Linux RAID system to keep the filesystem usable while RAID rebuilds are going on. This is just one of the advantages of a unified buffer cache operating between users of a driver and the driver itself: the unified buffer cache can merge writes from multiple writers and do the proper write ordering based on the write hints given in the write requests.

Of course, Linux has one advantage over Microsoft here -- virtually none of the system internals is ever exported to userland in any documented way; userland programs are expected to be written to the POSIX API, so the authors of Linux feel free to change underlying kernel APIs in a sweeping way on a regular basis. When I ported network drivers from 2.6.7 to 2.6.18, it was like porting them to an almost-new operating system -- the entire driver-layer networking API had undergone sweeping changes, and even simple operations like getting the PCI register space for a device or exporting the proper functions for the module API had undergone significant changes because the function signatures had all changed. Going from 2.6.18 to 2.6.33 is proving to be a similarly giant change in the driver-layer networking API. I've not been in the disk API for some years, but my suspicion is that it's seen similar changes. Microsoft would get roundly blasted for breaking driver compatibility if it dared do such a thing... but with Linux, those of us who need, say, a driver for a circa-2002 network card in order to deal with customers who have an installed base of the things, simply buck up and do the necessary work to port it, rather than kvetch about how Linus's internal kernel APIs are made of jello. The end result is that Linux's performance continues to improve even where major internal reshufflings are necessary to make it happen. Just one of the advantages of the open source model for operating systems... though there are, of course, an equal and opposite number of disadvantages too :).

Anonymous said...

Well I'm not sure that the incredibly slow formatting is entirely a RAID problem.
My scenario is either a Promise TX2650 or an Adaptec 1405 controller with a Seagate Constellation 2TB HDD. Windows Server 2008 formats at 15%/day with no RAID involved.
On another system with Server 2003 the HDD formatted in under 4 hours. I am still doing some testing.
Promise Support stated that the controller only supports 500GB HDDs, but the literature states that it will support drives over 2TB with 64-bit LBA. Adaptec Support just states they haven't tested with that particular drive. Seagate support just referred to tools from Acronis that perform a quick format.
Trying to get a full format is driving me crazy!
Any suggestions?

Sergey Solyanik said...

If it's a single drive, it's an entirely orthogonal problem. 15% a day is way, way too slow for one drive. Way, way, way, way too slow.

Since you are getting the same results with two controllers, and the same drive works fine on a different system, maybe the next thing to look at is the SATA cable. If the cable is the same on both systems, too, the only (crazy!) theory I can imagine is that something (insufficient power supply?) is forcing the drive into a low power mode.

There are utilities that report S.M.A.R.T. status on the drive. This may help you diagnose the problem a bit better.