Friday, February 29, 2008

Testing disk backup

Now that I've written a simple disk backup problem, I need to ensure that it is correct. After all, we're dealing with the user data here - terrible things will happen if it will read (or write) a wrong cluster once in a while!

So, obviously, before I can use it, I need to persuade myself that it won't eat my data.

What makes a great test? The best test of all runs the software on all possible data sets in all possible environments, and verifies that it ran correctly. In our case, verification means that the backup can be restored, and the restored files match the files that were backed up perfectly.

Verification is actually is quite straightforward - we'll take a backup of one partition, restore it to another, and then compare the files.

Much harder is to generate all combinations of file systems and all environments (or partition sizes, in this case).

When there's a test that needs a lot of various combination, and I can't foresee the "interesting" data sets, I like the random number generator. I like it for two reasons. First, given enough time, it will come up with things I could not foresee. Second, most random number generators aren't really random - they generate a fixed sequence of numbers, given the starting number (a seed). For testing, this is actually an asset - this means that the test that is driven by a pseudorandom number generator is REPRODUCIBLE. Why is this good (actually, necessary)? Because when the test does not work, you want to repeat the conditions under which the software broke.

What about running in all possible environments? For us, the environment is the partition size. Here we will have to yield to reality - there is no way we can test a lot of partition configurations. We'll just have to go with the sequence I consider "interesting" - partition sizes of powers of two, powers of two +1, powers of two -1, and a random size in between.

Ok, then, the test. It was written in python and run from the same directory where the source for the program is.
"""Test for discio.

Usage:
python.exe test.py disk_no letter1 letter2
Where:
disk_no The number of the disk to use.
The disk must contain no information.
Everything on the disk will be wiped out!
letter1 The letter for the first volume to be
created on the test disk. Must not be
an existing volume, or else it will be
wiped out.
letter2 The letter for the second volume to be
created on the test disk. Must not be
an existing volume, or else it will be
wiped out.

WARNING: If run incorrectly, this test can
be enormously destructive! It cleans disks,
formats volumes etc. If you call it with an
incorrect disk number, that disk will be gone!
"""

import os
import random
import sys
import tempfile
import time

class Error(Exception):
"""Base error for the test."""
pass


class GenericFailure(Error):
"""Test has failed. The string describes the failure."""
pass


__DISKPART = 'c:\\windows\\system32\\diskpart.exe'
__FORMAT = 'c:\\windows\\system32\\format.com'
__DISKIO = 'Release\\discio.exe'
__BACKUPFILE = 'test.bkf'
__SVI = 'System Volume Information'

A function that compares directories and files. If given two files, it just keeps reading them in 2K chunks, and comparing the results. If given two directories, it lists them, sorts them, and then compares the sorted lists in the merge-sort fashion: if the names are the same, match the files (or directories). If not, say that the smaller (in lexicographical order) does not exists, and the advance the pointer to the smaller one.
def Compare(f1, f2):
"""Compares two filesystem objects.

Prints out the differences.
Returns:
True if the files compare OK.
"""
ret = True

if (__SVI in f1 and
__SVI in f2):
return ret

if os.path.isfile(f1) and os.path.isfile(f2):
fd1 = open(f1, 'rb')
fd2 = open(f2, 'rb')
while True:
s1 = fd1.read(2048)
s2 = fd2.read(2048)
if s1 != s2:
ret = False
break
if not s1:
break
fd1.close()
fd2.close()
if not ret:
print '%s and %s are different' % (f1, f2)
return ret

if os.path.isfile(f2):
print '%s is a dir, %s is a file' % (f1, f2)
return False

if os.path.isfile(f1):
print '%s is a dir, %s is a file' % (f2, f1)
return False

#print "%s <-> %s" % (f1, f2)

files1 = os.listdir(f1)
files2 = os.listdir(f2)

files1.sort()
files2.sort()

i = 0
j = 0

while i < len(files1) and j < len(files2):
if files1[i] == files2[j]:
if not Compare(os.path.join(f1, files1[i]),
os.path.join(f2, files2[j])):
ret = False
i = i + 1
j = j + 1
continue

ret = False

if files1[i] < files2[j]:
print 'Missing ' + os.path.join(f2, files1[i])
i = i + 1
else:
print 'Missing ' + os.path.join(f1, files2[j])
j = j + 1

while i < len(files1):
print 'Missing ' + os.path.join(f2, files1[i])
i = i + 1
ret = False

while j < len(files2):
print 'Missing ' + os.path.join(f1, files2[j])
j = j + 1
ret = False

return ret
This one is the biggest bottleneck in the whole system: imagine generating 4G of data and writing it to files in a SCRIPT! Yet that's exactly what it does, computer time be damned!
def GenerateFiles(base, size):
"""Generates a bunch of random files and directories.

Args:
base The base directory where the files should be.
Must end with '\\'.
size The size in kilobytes of files that it should
not exceed.
"""
if not os.path.exists(base):
os.mkdir(base)

if size > 100:
GenerateFiles(os.path.join(base, 'dir1'), size / 6)
GenerateFiles(os.path.join(base, 'dir2'), size / 6)
GenerateFiles(os.path.join(base, 'dir3'), size / 6)
size = size / 2

file_num = 0
size = size * 1024

while size > 0:
if size < 2000:
file_size = size
else:
file_size = random.randint(0, size)
fd = open(os.path.join(base, ('file%d.txt' % file_num)),
'w')

str = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
str = str + str.lower() + '\n'

try:
for i in xrange(0, file_size/len(str)):
fd.write(str)

fd.flush()
finally:
fd.close()

file_num = file_num + 1
size = size - file_size
The reason I remove some files is to generate holes - unused clusters between the used ones. Otherwise I would be backing up a completely filled disk - almost as uninsteresting as testing backup of an empty disk!
def RemoveSomeFiles(base):
"""Removes every third file from a tree."""
counter = 0
for (basepath, dirs, files) in os.walk(base):
if __SVI in basepath:
continue
for file in files:
if counter % 3 == 0:
full_file_name = os.path.join(basepath, file)
os.remove(full_file_name)
counter = counter + 1


def GenerateFileSystem(base, size):
"""Generates a bunch of random files and directories.

Args:
base The base directory where the files should be.
Must end with '\\'.
size The size in kilobytes of files that it should
not exceed.
"""
try:
GenerateFiles(base, int(size * 9 / 10))
except IOError:
pass

RemoveSomeFiles(base)
Finally, the test itself. We use diskpart to wipe and repartition the disk on every iteration - which means that we need the whole disk to ourselves to run. If you haven't figured this out on your own by now :-).
def RunTest(size, fs, disk, source, dest):
"""The test.

Creates a two partitions, on the disk,
formats one, fills it with a bunch of random
files, backups it, and restores the results
on the second partition.

Then verifies that the files are identical
on both.

Args:
size The size of an individual partition
fs The file system to use, FAT, FAT32,
or NTFS
disk The number of the disk to use in testing
Everything on this disk will be gone!
source The letter to use for volume on which
the test data will be generated
dest The letter to use for volume to which
the backup will be restored.

Throws:
GenericFailure if the test fails

"""
print 'Running the test.'
print ' Physical disk: %d ' % disk
print ' Parition size: %dM' % size
print ' File system: %s' % fs

(fi, script) = tempfile.mkstemp()
os.close(fi)
fd = open(script, 'w')
fd.write('SELE DIS %d\n' % disk)
fd.write('CLEAN\n')
fd.write('CREAT PART PRI SIZE=%d\n' % size)
fd.write('ASSIGN LETTER=%s\n' % source)
fd.write('CREAT PART PRI SIZE=%d\n' % size)
fd.write('ASSIGN LETTER=%s\n' % dest)
fd.write('EXIT\n')
fd.close()

print 'Preparing the partitions'

ret = os.spawnv(os.P_WAIT, __DISKPART,
[__DISKPART,
'/s',
script])

os.remove(script)

if ret != 0:
raise GenericFailure(
'Could not partition the disk!')

ret = os.spawnv(os.P_WAIT, __FORMAT,
[__FORMAT,
source + ':',
'/fs:' + fs,
'/v:test',
'/y'])

if ret != 0:
raise GenericFailure(
'Could not format the test partition')

print 'Generating test files'

GenerateFileSystem(source + ':\\', size * 1024)

time.sleep(20)

print 'Running backup'

ret = os.spawnv(os.P_WAIT, __DISKIO,
[__DISKIO,
source + ':\\',
'readto',
__BACKUPFILE]);

if ret != 0:
raise GenericFailure('Backup failed!')

print 'Running restore'

ret = os.spawnv(os.P_WAIT, __DISKIO,
[__DISKIO,
dest + ':\\',
'writefrom',
__BACKUPFILE]);

if ret != 0:
raise GenericFailure('Restore failed!')

if not Compare(source + ':\\', dest + ':\\'):
raise GenericFailure(
'Restore does not match backup!')

print 'Test succeeded!'
print
And finally, the main. First and foremost it is trying to keep you safe: prevent you from wiping out some useful data.
def main(args):
"""Main entry point.

Args:
args: Command line arguments. See usage.

"""

random.seed(0)

if len(args) < 3:
print __doc__
return

try:
disk = int(args[0])
except ValueError:
print 'Disk should be a number!'
return

if disk <= 0:
print 'Disk cannot be a system disk (0).'
return

source = args[1].upper()
dest = args[2].upper()
if len(source) != 1 or len(dest) != 1:
print 'Only use 1 letter to specify drives'
return

if (dest < 'K' or dest > 'O'
or source < 'K' or source > 'O'):
print 'Use letters K through O for test drives.'
print 'This is to minimize potential of a'
print 'collision with the real drives.'
return

if dest == source:
print 'Source and destination should be different!'
return

if os.path.exists(source + ':\\'):
print 'Cannot continue: %s:\\ exists!' % source
return

if os.path.exists(dest + ':\\'):
print 'Cannot continue: %s:\\ exists!' % dest
return

print 'Running the test on physical disk %d' % disk
print 'and using drive letters %s and %s' % (source,
dest)
And here we run through a bunch of partition sizes (in Megabytes) I thought were interesting, and run backup test on 3 file systems: FAT16, FAT32, and NTFS.
  try:
for size in [50, 100, 128, 200, 256, 300, 315,
1000, 1024, 2047, 2048, 2049, 4095,
4096, 4097]:
RunTest(size, "FAT", disk, source, dest)
RunTest(size, "FAT32", disk, source, dest)
RunTest(size, "NTFS", disk, source, dest)
except GenericFailure, f:
print "Test failed: " + f.message


if __name__ == '__main__':
main(sys.argv[1:])

And - voila! This ran overnight and succeeded!

No comments: