Saturday, December 8, 2007

Sparse arrays

Suppose you need to store a lot of things that are indexed by integer numbers. You however know absolutely nothing about the range of the index - it could be anything. You hope that whatever the range is, there's a certain density to it - it is not all over the map. How do you store this data?

One way to do it would be to employ the hash map. For many cases it is a very plausible solution. Hashes could be wasteful, however, if there is a lot of data - every time anything is stored in a hash, a whole structure is allocated, and it's not less than double, and more like triple or quadruple of the amount of data the can be stored.

Another way to do it is with sparse arrays. Sparse array is implemented by a sequence of cascading arrays and breaking the index into groups of bits. The top level array is indexed by the very first group of bits, and contains pointer to the second level array, that is indexed by the second group of bits and contains pointers to the last set of arrays that are indexed by the last group of bits that contains the actual objects.

The example has 3 levels of arrays, and so every access requires 3 lookups. This of course can be changed to contain any level of cascading. For example, let's implement an array that would take any 32-bit integer as an index.

For simplicity we will have 4 levels of arrays, each covering one byte of the index's 32 bits.

This is one of the few examples where template usage is indispensible, so I am going to be using templates.

// Free code! No attribution necessary!
// Caution: this has not been debugged - I just
// typed the code directly in here.
template class SparseArray {
void *top_tier[256];

SparseArray() {
memset(top_tier, 0, sizeof(top_tier));
}

T& operator[](const unsigned int index) {
void **second_tier = (void **)top_tier[index >> 24];
if (!second_tier) {
// in our case all tiers are the same size
// if you are adapting this code, you may have
// to change this
second_tier = (void **)malloc(sizeof(top_tier));
memset(second_tier, 0, sizeof(sizeof(top_tier));
top_tier[index >> 24] = second_tier;
}
void **third_tier = (void **)second_tier[(index >> 16)
& 0xff];
if (!third_tier) {
third_tier = (void **)malloc(sizeof(top_tier));
memset(third_tier, 0, sizeof(sizeof(top_tier));
second_tier[(index >> 16) & 0xff] = third_tier;
}
T *fourth_tier = (T *)third_tier[(index >> 8)
& 0xff];
if (!fourth_tier) {
fourth_tier = (T *)malloc(sizeof(T) * 256);
memset(fourth_tier, 0, sizeof(sizeof(T) * 256);
third_tier[(index >> 8) & 0xff] = fourth_tier;
}
return fourth_tier[index & 0xff];
}
}

So, four pointer indirections, and you can store any index from 0 to 2^32-1!
Note however, that if you happen to access it like this:

SparseArray array;
for (unsigned int i = 0; i < 0xffffffff; i += 256)
array[i] = i;

it will happily consume all the memory while being less than 1/256 as efficient as it should have been. In this case hash is a LOT better!

No comments: