Wednesday, November 21, 2007

Guerilla guide to native memory management

DISCLAIMER: The vast majority of this article is about fighting the heap fragmentation issues that plague native environments. Managed code that uses generational (or generational-like) garbage collectors is not nearly as susceptible to heap fragmentation, and a lot of what is described here does not apply to it.

Once upon a time I was working on (initially, just managing) a project that, to protect the guilty, shall remain nameless. The project was a very graphics-intensive application running on Windows CE, ported from an identical feature in regular Windows. Very soon after RTM (release to manufacturing) a huge customer came back and told us that there was no way they were going to upgrade to the new version of the product - it was at least 3 times slower than the previous version on their tests.

As a side note, there is an important lesson here - one needs performance testing as part of the release criteria. But that was aeons ago, in a relatively small team, and our processes were not nearly as sophisticated as they should have been.

Anyway, we did run their test, and - lo and behold - version 2 took 3 minutes to complete, and version 3 took over 10 minutes. WTF?

Between v2 and v3 we had ported the next version of the app from Windows NT. Both versions of the product ran just fine in that environment. After poking around, we found that graphics calls - specifically, the creation of graphics primitives - were becoming extremely slow after a while, and the test created a lot of them. That was actually the main difference between v2 and v3: v2 reused the objects (brushes, pens, DCs, fonts, and the other things that you use in normal Windows graphics), whereas v3 recreated them.

OK, it was time to look at the OS code. Here's what I found: most of the graphical primitives were implemented as very small (16-24 byte) structures. For example, a brush might hold a color, an index to a pattern, and a size. That takes ~12 bytes. The objects were allocated from the regular heap of the video subsystem using LocalAlloc.
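For illustration, such a primitive could look roughly like this (a hypothetical layout of mine, not the actual CE structure):

#include <windows.h>  // for COLORREF

struct Brush {
    COLORREF color;         // 4 bytes
    unsigned short pattern; // index into a pattern table
    unsigned short size;
    unsigned flags;
};                          // ~12 bytes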

If you allocate and free a lot of small objects in random succession, the heap experiences a phenomenon called "heap fragmentation": it ends up containing a lot of small objects interspersed with a lot of small empty spaces. Naive memory allocation algorithms (like the one we had in Windows CE) take time that is roughly linear in the number of blocks, because they look through a linked list of blocks to find an empty block of sufficient size. If you have 10000 objects of 12-28 bytes, and between them another 10000 empty spaces of 12-28 bytes, it takes a while to find a place (at the end of this list) for a 512 byte block.
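If you want to see the shape of the problem for yourself, a sketch like the following is enough to provoke it on a naive first-fit allocator of that era (the sizes and counts are arbitrary choices of mine):

#include <stdlib.h>

int main(void) {
    enum { N = 10000 };
    static void *blocks[N];
    // Fill the heap with small objects...
    for (int i = 0; i < N; ++i)
        blocks[i] = malloc(16 + (i % 3) * 8);
    // ...then free every other one, punching N/2 small holes.
    for (int i = 0; i < N; i += 2) {
        free(blocks[i]);
        blocks[i] = NULL;
    }
    // A first-fit allocator walks past every hole above
    // before it finds room for this block at the end.
    void *big = malloc(512);
    free(big);
    for (int i = 1; i < N; i += 2)
        free(blocks[i]);
    return 0;
}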

When I replaced LocalAlloc with a fixed-block allocation algorithm (see below), the test went back to 3 minutes, erasing ALL of the performance difference between the two versions.

Before we get to techniques that can be used to avoid or minimize heap fragmentation, it is important to list the conditions under which heap fragmentation occurs.

Specifically, heap fragmentation matters if and only if a program allocates and frees a lot of small objects and stays resident in memory for a long period of time.

Let me give an example where heap fragmentation does NOT matter. Suppose you are converting a document from one format to another. This involves parsing the source format and then generating the output. Let's suppose that you have to parse the entire document before starting the conversion, because certain globally-computed parameters of the output format are only known once the entire document has been seen. Let's also assume that the task runs as a single process that exits when the job is done.

In this example we are clearly allocating a lot of small memory blocks. However, (1) we do not free them, so they are all allocated contiguously in the heap, without creating holes, and (2) the program does not stay resident for long. So allocations already take as little time as we can make them, and heap fragmentation does not occur.

Now suppose the same task runs as a web service, processing a lot of simultaneous requests from different users. Now one thread can be allocating at the same time as another thread is freeing, and the heap can survive for a long, long time. At a minimum we have to consider heap fragmentation as one of the possible performance bottlenecks.

So how does one fight heap fragmentation? By minimizing or containing small allocations. Large allocations are less of a problem: when freed, they leave holes in the heap big enough to accommodate a variety of subsequent allocations.

Containing allocations



Containing allocations means replacing many small allocations with fewer large pools, and managing small objects out of these pools. How the pools are constructed depends on the specifics of the task.

Fixed block allocator



One technique for containment is called a fixed heap, or a fixed-size allocation pool. This technique is widely used by compilers and OS kernels. Here, if you expect that you will need a lot of small objects of equal size, you pre-allocate a number of them (a pool), dispense the objects one by one, and then allocate another pool as you run out. The code for this looks roughly as follows:


// Free code! Free for all! Public domain!
// No need for attribution!
// That is - if you debug it. I have not.
// I just typed it right into the blog.
#include <assert.h>
#include <stddef.h>  // offsetof, size_t
#include <stdlib.h>  // malloc, free

struct Pool {
    Pool *next;
    void *free_objects;  // head of this pool's free list
    int used_objects;
    union {
        unsigned char memory[0];
        void *__enforce_ptr_alignment;
    };
};

struct PoolDescriptor {
    Pool *pools;
    size_t object_size;
    size_t alloc_step;   // objects per pool
};

void *GetPool(size_t size, size_t alloc_step) {
    PoolDescriptor *descr = (PoolDescriptor *)malloc(
        sizeof(PoolDescriptor));
    if (!descr)
        return NULL;

    descr->pools = NULL;

    // Ensure pointer-sized alignment, since we will be
    // cross-linking the empty blocks.
    descr->object_size = (size + sizeof(void *) - 1) &
        ~(sizeof(void *) - 1);
    descr->alloc_step = alloc_step;
    return descr;
}

void *PoolAlloc(void *pool_descr) {
    PoolDescriptor *descr = (PoolDescriptor *)pool_descr;
    Pool *pool = descr->pools;
    while (pool && !pool->free_objects)
        pool = pool->next;

    if (!pool) {
        pool = (Pool *)malloc(offsetof(Pool, memory) +
            descr->object_size * descr->alloc_step);

        if (!pool)
            return NULL;

        pool->next = descr->pools;
        descr->pools = pool;

        // Populate the free list. Objects are linked in
        // address order to mimic the cache locality
        // of the regular heap.
        pool->free_objects = (void *)pool->memory;
        unsigned char *ptr = pool->memory;
        for (size_t i = 0; i < descr->alloc_step - 1; ++i) {
            unsigned char *next = ptr + descr->object_size;
            *(unsigned char **)ptr = next;
            ptr = next;
        }
        *(void **)ptr = NULL;

        pool->used_objects = 0;
    }
    // Pop one object off the free list.
    void *result = pool->free_objects;
    pool->free_objects = *(void **)pool->free_objects;
    pool->used_objects++;
    return result;
}

static int PoolContains(Pool *pool, void *block,
                        size_t size) {
    unsigned char *p = (unsigned char *)block;
    return pool->memory <= p
        && p < pool->memory + size;
}

static int NumFreePools(PoolDescriptor *descr) {
    Pool *pool = descr->pools;
    int count = 0;
    while (pool) {
        if (pool->used_objects == 0)
            count++;
        pool = pool->next;
    }
    return count;
}

void PoolFree(void *pool_descr, void *block) {
    PoolDescriptor *descr = (PoolDescriptor *)pool_descr;
    Pool *pool = descr->pools;
    size_t size = descr->alloc_step * descr->object_size;
    while (pool && !PoolContains(pool, block, size))
        pool = pool->next;
    if (!pool) {
        assert(0);  // block did not come from this descriptor
        return;
    }
    // Push the block back onto the pool's free list.
    *(void **)block = pool->free_objects;
    pool->free_objects = block;
    pool->used_objects--;
    // Unlink and release one pool if more than one became free.
    if (pool->used_objects == 0 &&
        NumFreePools(descr) > 1) {
        Pool *runner = descr->pools;
        Pool *parent = NULL;
        while (runner && runner->used_objects) {
            parent = runner;
            runner = runner->next;
        }
        if (parent)
            parent->next = runner->next;
        else
            descr->pools = runner->next;
        free(runner);
    }
}

void DeletePool(void *pool_descr) {
    ... left as an exercise for the reader
}



As you can see, allocations from the fixed heap are almost always O(1), and frees are O(N). (It is actually easy to make frees O(1) as well, if you do not need to reclaim free pools: instead of keeping a per-pool free list, keep a single global one in the PoolDescriptor.)
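For illustration, here is a minimal sketch of that O(1)-free variant (the SimplePool names are mine; the Pool type is the one defined above). PoolAlloc changes accordingly: it pops from the same global list, and carves a fresh pool into it when the list runs dry.

struct SimplePoolDescriptor {
    Pool *pools;         // retained only so DeletePool can free them
    void *free_objects;  // one global free list across all pools
    size_t object_size;
    size_t alloc_step;
};

void SimplePoolFree(SimplePoolDescriptor *descr, void *block) {
    // O(1): push onto the global list. We never look up the
    // owning pool, so empty pools are never reclaimed.
    *(void **)block = descr->free_objects;
    descr->free_objects = block;
}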

Arena allocator



Another widely used technique is called an 'arena allocator'. Arenas are used whenever you expect to make a lot of small allocations of different sizes that you will throw away all at once. In the example above of a web service converting documents, one conversion requires many allocations that do not need to be freed until the processing of a document is completely done. The same is true in many other cases - rendering an HTML page, printing a document, rendering a map, etc.

Arena allocators allocate a big block of memory and then give away parts of this block as requested. Because individual objects are never freed, an arena allocator does not need to keep any data about allocated objects - only the current pointer and the size of the remainder of the allocated block of memory. (It still needs to heed the alignment requirements of the blocks it hands out, e.g. memory that contains doubles needs to be aligned on an 8-byte boundary.) This makes arenas extremely space-efficient.

The code for an arena allocator might look like this example.


// Free code! Free for all! Public domain!
// No need for attribution!
// That is - if you debug it. I have not.
// I just typed it right into the blog.
#include <assert.h>
#include <stddef.h>  // offsetof, size_t
#include <stdlib.h>  // malloc

struct Arena {
    Arena *next;
    size_t bytes_allocated;
    size_t bytes_used;
    char memory[0];
};

struct ArenaDescriptor {
    Arena *arenas;
    size_t alloc_step;
};

// alloc_step should be big, e.g. a MB
void *GetArena(size_t alloc_step) {
    ArenaDescriptor *descr = (ArenaDescriptor *)
        malloc(sizeof(ArenaDescriptor));
    if (!descr)
        return NULL;
    descr->alloc_step = alloc_step;
    descr->arenas = NULL;
    return descr;
}

void *ArenaAlloc(void *arena_descr, size_t size,
                 int align) {
    // align must be a power of 2
    assert(align && (align & (align - 1)) == 0);
    ArenaDescriptor *descr = (ArenaDescriptor *)arena_descr;
    Arena *a = descr->arenas;
    while (a) {
        // Does the request fit after aligning the used mark?
        if (((a->bytes_used + align - 1) & ~(size_t)(align - 1))
                + size <= a->bytes_allocated)
            break;
        a = a->next;
    }

    if (!a) {
        size_t bytes = (size + align - 1) & ~(size_t)(align - 1);
        if (bytes < descr->alloc_step)
            bytes = descr->alloc_step;
        a = (Arena *)malloc(offsetof(Arena, memory) + bytes);
        if (!a)
            return NULL;
        a->next = descr->arenas;
        descr->arenas = a;
        a->bytes_allocated = bytes;
        a->bytes_used = 0;
    }

    size_t offset = (a->bytes_used + align - 1)
        & ~(size_t)(align - 1);
    a->bytes_used = offset + size;
    return a->memory + offset;
}

void DeleteArena(void *arena_descr) {
    ...left as an exercise for the reader
}


Sounds way too complicated? Well, efficiency is not free :-). However, there is a simpler way, if you are prepared to pay a slightly higher per-object allocation cost in memory.

Starting with Windows XP, NT implements a so-called "low-fragmentation heap", which behaves like a block heap. See the description here: http://msdn2.microsoft.com/en-us/library/Aa366750.aspx. The LFH behaves similarly to a fixed-block allocator; it just spends more time (and memory) managing the objects, because an object of any size can be allocated out of it and the heap still has to service the request. With the fixed allocator we avoid this complexity by restricting the problem space.
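Opting a heap into the LFH is a one-call affair (see the MSDN page above for details; the value 2 is the documented LFH compatibility mode):

#include <windows.h>

void EnableLfh(void) {
    ULONG mode = 2;  // 2 = low-fragmentation heap
    HeapSetInformation(GetProcessHeap(),
                       HeapCompatibilityInformation,
                       &mode, sizeof(mode));
}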

Also, for a poor man's arena implementation, consider creating a new heap (HeapCreate on NT) whenever you need an arena. It is not as memory-efficient, because the heap still keeps around all the data necessary to delete individual blocks, but as long as you never delete them, there is no fragmentation, and at the end of your processing you can throw away the entire heap as one block.
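A minimal sketch of this poor man's arena (ConvertDocument is a hypothetical per-job function, not code from any product mentioned above):

#include <windows.h>

void ConvertDocument(void) {
    // One private heap per job. Individual blocks are never
    // freed, so the heap never fragments.
    HANDLE arena = HeapCreate(0, 0, 0);  // max size 0 => growable
    char *buffer = (char *)HeapAlloc(arena, 0, 4096);
    // ... all per-document allocations come from 'arena' ...
    HeapDestroy(arena);  // throw everything away in one call
}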

Minimizing allocations



Using programming practices that minimize the number of small allocations can go a long way towards a healthy heap even without the help of advanced allocators. Here are several tips:

(1) Try to embed the data in your structures instead of using pointers.
For example, this:

struct A {
    ...
    char name[256];
    ...
};

is preferable to

struct B {
    ...
    char *name;
    ...
};


Yes, if the string is small, you could save a few bytes. Remember, however, that a heap allocation costs 8-16 bytes just in alignment losses, because heap-allocated objects must be aligned for the most restrictive data type on your computer - the heap manager does not know what you are going to store there and whether it requires natural alignment. There are another 8-16 bytes for the heap header. Plus, you store a pointer in your structure (another 4-8 bytes). So the savings are actually smaller than they might appear.
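To put numbers on it: say the name is 20 characters. Out of line it costs ~21 bytes of string, rounded up to perhaps 24-32 bytes by the allocator, plus 8-16 bytes of heap header, plus the 4-8 byte pointer in the structure - roughly 40-56 bytes against 256 inline, and every such allocation is one more potential hole in the heap.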

Here is actual code that I have seen in a commercial product, and that should have never, ever been written:


struct A {
    ...
    int num_elements;
    int *array;
    ...
};


where array was always malloc'ed to contain either 1 or 2 integers. This is what it should have been - always cheaper in every respect (time and memory) than the code above:


struct A {
    ...
    int num_elements;
    int array[2];
    ...
};


(2) Use stack allocation whenever possible. Stack allocation is very quick, since it is handled by the compiler, and is cleaned up automatically once the function exits.

For example, this:


#define ARRLEN(c) (sizeof(c)/sizeof(c[0]))

void func(WCHAR *dir) {
    WCHAR mask[_MAX_PATH];
    StringCchCopyW(mask, ARRLEN(mask), dir);
    StringCchCatW(mask, ARRLEN(mask), L"\\*.txt");
    ...


is always better than this:

void func(WCHAR *dir) {
    // 7 = 6 characters for "\*.txt" plus the terminator
    size_t cchMask = wcslen(dir) + 7;
    WCHAR *mask = (WCHAR *)malloc(sizeof(WCHAR) * cchMask);
    StringCchCopyW(mask, cchMask, dir);
    StringCchCatW(mask, cchMask, L"\\*.txt");
    ...


Yes, it takes more space, but it is temporary space on the stack, which has no implications for the long-term behavior of the program.

Sometimes it is hard to predict the maximum size of the buffer, but possible to predict it for the vast majority of cases. Consider the following rewrite of the function above, which is no longer limited to _MAX_PATH characters in the file name (so it can support long file names on NTFS):


void func(WCHAR *dir) {
    WCHAR mask_buffer[_MAX_PATH]; // 99% of our cases
    WCHAR *mask = mask_buffer;
    size_t cchNeeded = wcslen(dir) + 7;
    if (cchNeeded > ARRLEN(mask_buffer))
        mask = (WCHAR *)malloc(cchNeeded * sizeof(WCHAR));
    ...
    if (mask != mask_buffer)
        free(mask);
}


This uses stack space most of the time, except in really exceptional cases.

(3) Finally, do not be afraid to trade slightly higher memory consumption for less fragmentation. For example, allocate bigger string buffers up front, so that you are less likely to have to reallocate them multiple times as user input grows; see the sketch below.
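Here is a minimal sketch of the idea (EnsureCapacity is a name I made up): by growing the buffer geometrically, a buffer that ends up N bytes long is reallocated only O(log N) times.

#include <stdlib.h>

// Grow 'buf' to hold at least 'needed' bytes, doubling the
// capacity so that repeated appends rarely hit the heap.
// On failure returns NULL and leaves the old buffer intact.
char *EnsureCapacity(char *buf, size_t *capacity, size_t needed) {
    if (needed <= *capacity)
        return buf;
    size_t new_capacity = *capacity ? *capacity : 64;
    while (new_capacity < needed)
        new_capacity *= 2;
    char *new_buf = (char *)realloc(buf, new_capacity);
    if (new_buf)
        *capacity = new_capacity;
    return new_buf;
}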

Other considerations



If you've read this far, you may be wondering why on Earth I omitted the most frequently used way to "optimize" allocation speed - free lists. For those who do not know: a free list is a technique whereby malloc and free are overridden in such a way that free does not actually free (up to a point), but instead puts memory blocks on a list of free blocks. malloc then takes the closest bigger match to the size that is needed from that list.
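To make the discussion concrete, here is a minimal (and deliberately naive) sketch of the technique; the Fl names are mine:

#include <stdlib.h>

// Each block carries a small header so FlFree can recover its
// size; while the block is parked, it sits on g_free_list.
typedef struct FlNode {
    size_t size;          // payload size in bytes
    struct FlNode *next;  // meaningful only while on the list
} FlNode;

static FlNode *g_free_list;

void *FlAlloc(size_t size) {
    // Find the closest bigger match on the free list.
    FlNode **best = NULL;
    for (FlNode **link = &g_free_list; *link; link = &(*link)->next)
        if ((*link)->size >= size &&
            (!best || (*link)->size < (*best)->size))
            best = link;
    if (best) {
        FlNode *hit = *best;
        *best = hit->next;  // unlink from the free list
        return hit + 1;     // payload follows the header
    }
    FlNode *n = (FlNode *)malloc(sizeof(FlNode) + size);
    if (!n)
        return NULL;
    n->size = size;
    return n + 1;
}

void FlFree(void *block) {
    FlNode *n = (FlNode *)block - 1;  // recover the header
    n->next = g_free_list;            // park it instead of freeing
    g_free_list = n;
}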

One reason is that the simplest implementations of the free list are quite inefficient. For example, a block on the free list can be much larger than the block that is requested, leading to quite a bit of wasted memory. This can, of course, be overcome by a smarter heuristic.

But the main reason is that a free list does not really help with heap fragmentation when there are a lot of small allocations. When the free list runs out of objects, they are allocated from the heap (and when the free list grows too big, they are freed back to the heap), leading to the same set of problems that we set out to solve at the beginning of this article.
