Tuesday, February 26, 2008

STL vector performance

I was going through an interview package today, where the interviewee returned a whole STL vector as the result of a function. I wondered whether this is a reasonable thing to do, and whether it would cause a lot of data copying.

So I cranked up Visual Studio 2008, jotted down a few lines of code, and looked at the disassembly.

Turns out, the compiler does a really good job with the aliasing, so the code it generates (in the optimized version) is very efficient. What drew my attention was the nearly 30 instructions generated for "vector.push_back(i)". I expected the optimized code to be much shorter, and started digging.

As it turned out, it was a lot worse. The inline-generated code was just the tip of the iceberg - another function was called, and not just in the case where array reallocation was required. I stepped through it in assembly several times to make sure.

It looked pretty bad, so I decided to measure how it compares to non-STL programming idioms. I compared pushing 10000 numbers into a pre-allocated STL vector, an auto-expanding STL vector, an auto-expanding C array, and a pre-allocated C array.

The code was compiled in Release (optimized) mode on Visual Studio 2008.

Here is the test. Getting the boring stuff out of the way first:

#include <windows.h>

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>

#include <vector>

using namespace std;

I used the Pentium clock-counter instruction (RDTSC) to minimize overhead. This function returns the 64-bit clock count since the CPU was last reset. Thank you, Wikipedia!

__declspec(naked)
unsigned __int64 __cdecl rdtsc(void) {
    __asm {
        rdtsc
        ret
    }
}

However, if the thread is rescheduled onto a different core, the clock count returned is for a different CPU, which makes the time measurement completely random. So I restricted the code to a single core, and, for good measure, pre-committed the heap to ensure that paging does not spoil the results:

#define NUM_ITERATIONS 10000

void PrepEnvironment(void) {
    // Move everything to one CPU; we're going
    // to be using its clock counter.
    SetThreadAffinityMask(GetCurrentThread(), 1);

    // Commit all the memory.
    void *p = malloc(sizeof(int) * 4 * NUM_ITERATIONS);
    memset(p, 0, sizeof(int) * 4 * NUM_ITERATIONS);
    free(p);
}

For stats, I calculated the average, the median, and the histogram on the log scale:

int __cdecl comparator(const void *p1, const void *p2) {
    const unsigned int *pi1 = (const unsigned int *)p1;
    const unsigned int *pi2 = (const unsigned int *)p2;
    return (int)(*pi1 - *pi2);
}

#define BUCKETS 20

void DumpStats(unsigned int *clocks, unsigned int cptr,
               unsigned int iters) {
    qsort(clocks, cptr, sizeof(unsigned int), comparator);

    int buckets[BUCKETS];
    memset(buckets, 0, sizeof(buckets));

    double average = 0;
    double log2 = log(2.0);

    for (unsigned int i = 0; i < cptr; ++i) {
        average += (double)clocks[i];
        int ndx = (int)(log((double)clocks[i] / 1000.0) / log2);
        if (ndx < 0)
            buckets[0]++;
        else if (ndx < BUCKETS - 1)
            buckets[ndx + 1]++;
        else
            buckets[BUCKETS - 1]++;
    }

    average /= cptr;

    printf("Median: %d\n", (int)clocks[cptr / 2]);
    printf("Median, CPI: %d\n", clocks[cptr / 2] / iters);
    printf("Average: %d\n", (int)average);
    printf("Average, CPI: %d\n", (int)(average / iters));

    printf("Histogram:\n");
    int prev = 0;
    int curr = 1000;
    for (int i = 0; i < BUCKETS - 1; ++i) {
        printf("%d - %d: %d\n", prev, curr, buckets[i]);
        prev = curr;
        curr *= 2;
    }

    printf(" > %d: %d\n", prev, buckets[BUCKETS - 1]);
}

Finally, the 4 different pieces of code each filled an array with 10000 numbers, and repeated this operation 1000 times (so the standard error is about 3%). I made the C array reallocation strategy match the STL vector implementation.

#define NUM_SAMPLES 1000
int main(int argc, char **argv) {
    PrepEnvironment();

    static unsigned int clocks[NUM_SAMPLES];
    for (int k = 0; k < 4; ++k) {
        unsigned int cptr = 0;

        switch (k) {
        case 0:
            printf("STL re-alloc\n");
            for (int i = 0; i < NUM_SAMPLES; ++i) {
                vector<int> work;
                unsigned __int64 begin = rdtsc();
                for (int j = 0; j < NUM_ITERATIONS; ++j)
                    work.push_back(i);
                unsigned __int64 end = rdtsc();
                clocks[cptr++] = (unsigned int)(end - begin);
            }
            break;
        case 1:
            printf("STL pre-alloc\n");
            for (int i = 0; i < NUM_SAMPLES; ++i) {
                vector<int> work;
                work.reserve(NUM_ITERATIONS);
                unsigned __int64 begin = rdtsc();
                for (int j = 0; j < NUM_ITERATIONS; ++j)
                    work.push_back(i);
                unsigned __int64 end = rdtsc();
                clocks[cptr++] = (unsigned int)(end - begin);
            }
            break;
        case 2:
            printf("C re-alloc\n");
            for (int i = 0; i < NUM_SAMPLES; ++i) {
                int size = 20;
                int *work = (int *)malloc(size * sizeof(int));
                unsigned __int64 begin = rdtsc();
                for (int j = 0; j < NUM_ITERATIONS; ++j) {
                    if (j >= size) {
                        // Match STL vector allocation algorithm
                        size = size + size / 2;
                        work = (int *)realloc(work, size * sizeof(int));
                    }
                    work[j] = j;
                }
                unsigned __int64 end = rdtsc();
                clocks[cptr++] = (unsigned int)(end - begin);
                free(work);
            }
            break;
        case 3:
            printf("C pre-alloc\n");
            for (int i = 0; i < NUM_SAMPLES; ++i) {
                int *work = (int *)malloc(NUM_ITERATIONS * sizeof(int));
                unsigned __int64 begin = rdtsc();
                for (int j = 0; j < NUM_ITERATIONS; ++j)
                    work[j] = j;
                unsigned __int64 end = rdtsc();
                clocks[cptr++] = (unsigned int)(end - begin);
                free(work);
            }
            break;
        }

        DumpStats(clocks, cptr, NUM_ITERATIONS);
        printf("\n\n");
    }

    return 0;
}

So the results are as follows (this is on my ThinkPad T60p with a T2600 Core 2 Duo running at 2.16 GHz, plugged in):

STL re-alloc
Median: 1511328
Median, CPI: 151
Average: 1540473
Average, CPI: 154
Histogram:
0 - 1000: 0
1000 - 2000: 0
2000 - 4000: 0
4000 - 8000: 0
8000 - 16000: 0
16000 - 32000: 0
32000 - 64000: 0
64000 - 128000: 0
128000 - 256000: 0
256000 - 512000: 0
512000 - 1024000: 123
1024000 - 2048000: 835
2048000 - 4096000: 35
4096000 - 8192000: 6
8192000 - 16384000: 1
16384000 - 32768000: 0
32768000 - 65536000: 0
65536000 - 131072000: 0
131072000 - 262144000: 0
> 262144000: 0


STL pre-alloc
Median: 80171
Median, CPI: 8
Average: 82007
Average, CPI: 8
Histogram:
0 - 1000: 0
1000 - 2000: 0
2000 - 4000: 0
4000 - 8000: 0
8000 - 16000: 0
16000 - 32000: 0
32000 - 64000: 0
64000 - 128000: 995
128000 - 256000: 2
256000 - 512000: 1
512000 - 1024000: 2
1024000 - 2048000: 0
2048000 - 4096000: 0
4096000 - 8192000: 0
8192000 - 16384000: 0
16384000 - 32768000: 0
32768000 - 65536000: 0
65536000 - 131072000: 0
131072000 - 262144000: 0
> 262144000: 0


C re-alloc
Median: 698529
Median, CPI: 69
Average: 728341
Average, CPI: 72
Histogram:
0 - 1000: 0
1000 - 2000: 0
2000 - 4000: 0
4000 - 8000: 0
8000 - 16000: 0
16000 - 32000: 0
32000 - 64000: 0
64000 - 128000: 0
128000 - 256000: 0
256000 - 512000: 0
512000 - 1024000: 972
1024000 - 2048000: 27
2048000 - 4096000: 1
4096000 - 8192000: 0
8192000 - 16384000: 0
16384000 - 32768000: 0
32768000 - 65536000: 0
65536000 - 131072000: 0
131072000 - 262144000: 0
> 262144000: 0


C pre-alloc
Median: 20124
Median, CPI: 2
Average: 21572
Average, CPI: 2
Histogram:
0 - 1000: 0
1000 - 2000: 0
2000 - 4000: 0
4000 - 8000: 0
8000 - 16000: 0
16000 - 32000: 980
32000 - 64000: 15
64000 - 128000: 2
128000 - 256000: 2
256000 - 512000: 0
512000 - 1024000: 1
1024000 - 2048000: 0
2048000 - 4096000: 0
4096000 - 8192000: 0
8192000 - 16384000: 0
16384000 - 32768000: 0
32768000 - 65536000: 0
65536000 - 131072000: 0
131072000 - 262144000: 0
> 262144000: 0

The results are quite damning: the most commonly accepted C idiom of pre-allocating the array costs 2 clocks per integer inserted, compared to 8 for the pre-allocated vector, and 150 for the most commonly used dynamically expanding vector. That makes it 4 to 75 times(!) faster.

NOTE: The original version of the article contained an error: I pre-allocated the array incorrectly. The problem is now fixed and the results updated. The fix resulted in a dramatic improvement in STL performance, although it still did not come close to matching raw C performance.

NOTE2: I also tested with the security checks disabled (#define _SECURE_SCL 0, verifying that the generated assembly did not contain calls to the boundary checker), but it did not change the performance.

Even the dynamically expanding C array, at roughly 70 clocks per insert, is about twice as fast as its dynamically expanding STL counterpart.

But who cares, right? Computers are fast, we'll optimize the tight loops, and nobody would notice things that are not on the critical path, anyway. Well, yes and no. If your software is slow across the board, you'll be hit with what I call the Vista syndrome - there isn't one loop to optimize if it's all sluggish. More on this here: http://1-800-magic.blogspot.com/2007/12/memory-is-not-free-more-on-vista.html.

For a company like Google, it also means higher CPU utilization, more machines, and ultimately more energy used. The difference in power, by the way, is very real - Google clusters do have a certain maximum CPU utilization threshold beyond which the fuses get blown.

Next on my plate is to measure Java performance, and my gut feeling is that it's not going to be much slower than STL. Which brings up the question: why use C++ at all, if the only advantage it has over Java and C# - the performance - is eaten by the runtime?


-----
Update: There IS a way to get STL to match C in the pre-allocated case. Here's how it's done. Ouch!


...
#define _SECURE_SCL 0
#include <vector>
...

printf("STL pre-alloc\n");
for (int i = 0; i < NUM_SAMPLES; ++i) {
    vector<int> work(NUM_ITERATIONS);
    work.reserve(NUM_ITERATIONS);
    unsigned __int64 begin = rdtsc();
    for (int j = 0; j < NUM_ITERATIONS; ++j)
        work[j] = j;
    unsigned __int64 end = rdtsc();
    clocks[cptr++] = (unsigned int)(end - begin);
}

...and here are the results:

STL pre-alloc
Median: 20124
Median, CPI: 2
Average: 21668
Average, CPI: 2
Histogram:
0 - 1000: 0
1000 - 2000: 0
2000 - 4000: 0
4000 - 8000: 0
8000 - 16000: 0
16000 - 32000: 979
32000 - 64000: 11
64000 - 128000: 5
128000 - 256000: 4
256000 - 512000: 1
512000 - 1024000: 0
1024000 - 2048000: 0
2048000 - 4096000: 0
4096000 - 8192000: 0
8192000 - 16384000: 0
16384000 - 32768000: 0
32768000 - 65536000: 0
65536000 - 131072000: 0
131072000 - 262144000: 0
> 262144000: 0

20 comments:

Anonymous said...

Your STL-preallocated test is not written correctly. You have this:

vector<int> work(NUM_ITERATIONS);
for (int j = 0; j < NUM_ITERATIONS; ++j)
    work.push_back(i);

while it should be this:

vector<int> work;
work.reserve(NUM_ITERATIONS);
for (...) work.push_back(i);

or this:

vector<int> work(NUM_ITERATIONS);
for (...) work[i] = i;

Essentially, in your test you preallocate the block and then immediately start allocating again. Passing an int to vector's c'tor is a way to allocate, not to pre-allocate. You should use vector::reserve for pre-allocation.

Looking forward to seeing the updated test results.

Anonymous said...

I also forgot to ask if you #defined this symbol in your test program:

_SECURE_SCL=0

The STL included in VS 2005 and 2008 by default performs additional security checks, even in optimized release build. With _SECURE_SCL=0 you should get better performing code.

Yuri Chebotarev said...

I changed your "STL pre-alloc" code to use "reserve".

I also extended your code to test boost::array and boost::scoped_array.
Results (Average CPI only):

STL re-alloc - Average, CPI: 65
C re-alloc - Average, CPI: 62

STL pre-alloc - Average, CPI: 10
C pre-alloc - Average, CPI: 9
boost.array - Average, CPI: 2
scoped_array - Average, CPI: 10

I tested this for different values of NUM_ITERATIONS (from 10000 to 15000, step 10000), and the result was always the same:

The "STL re-alloc" average is always almost the same as the "C re-alloc" average. I never got the 2.5x difference (and I'm a bit surprised about this). Update - I _did_ get a 2x difference when I set NUM_ITERATIONS to 5000 or less... So size matters ;)...

And "C pre-alloc" is always almost the same as "STL pre-alloc", "boost.array", or "boost::scoped_array".

So the problem is only with "STL re-alloc" on relatively small arrays.

And the problem is: they don't "realloc" memory, they always allocate a new block of memory. To me this makes sense for non-POD types.

When I changed the "C re-alloc" code to (pseudocode):

int *newwork = malloc(size);
memcpy(newwork, work);
free(work);
work = newwork;

there was no significant difference in performance.

Sergey Solyanik said...

You are correct, I should have used reserve. Code and results updated, and STL pre-allocated case went from 60 to 8 clocks on my box - which is much more reasonable.

Sergey Solyanik said...

Another update - tested Yuri's suggestion - changed the realloc code like so:

int sz_sav = size;
size = size + size / 2;
int *newwork = (int *)malloc(size * sizeof(int));
memcpy(newwork, work, sz_sav * sizeof(int));
free(work);
work = newwork;

On my machine it went approximately half-way towards STL realloc case: to 110.

Anonymous said...

Your STL prealloc test is not equivalent to your C one.

vector<int> work;
work.resize(NUM_ITERATIONS);
unsigned __int64 begin = rdtsc();

for (int j = 0; j < NUM_ITERATIONS; ++j)
    work[j] = j;

unsigned __int64 end = rdtsc();
clocks[cptr++] = (unsigned int)(end - begin);

There is no point using push_back in this situation at all.

Anonymous said...

Interesting microbenchmark. I ran it on on my own system (Linux 2.4 GHz Xeon, GCC 4.1.2; changed the reallocation policy to match libstdc++), and I got the following averages:

STL re-alloc: 141631 (14 CPI)
STL pre-alloc: 69655 (6 CPI)
C re-alloc: 35222 (3 CPI)
C pre-alloc: 22926 (2 CPI)
C re-alloc with copy: 119871 (11 CPI)

The use of memcpy() instead of realloc() really kills the performance. Unfortunately, in a "generic" application there is a good chance that the call to realloc will need to call memcpy anyway, so I think the reality would be somewhere between the two.

My results look a lot more like Yuri's results than your Windows results. I wonder what Microsoft's STL implementation is doing that makes such a big difference?

The plain C versions are certainly faster. However, in most cases, I am willing to accept the convenience and familiarity of the STL vector over coding my own.

Anonymous said...

Your updated code is still wrong. You should just create vector using default constructor, then call vector::reserve.

Anonymous said...

Interesting. I've never been a fan of STL myself.

And you should clearly benchmark realloc against malloc in the C case. Realloc should never be slower - likely STL is using realloc, so it's not a fair test yet.

Finally, I had *awful* performance problems with the CLib allocator on Linux between versions 2.4 and 2.5. There's a patch that can be applied for 2.5, but just getting the 2.6 CLib made a huge difference (about 3x) for my performance-critical application.

2.6 CLib is only available in the newest distros - Ubuntu 7.10 has it, and I think FC9.

I would be interested to know the CLib versions of your Linux tests.

Anonymous said...

This is really a poor benchmark to test the STL against plain old C.

First of all, as pointed out by many commenters here and on reddit, your original tests are at best misleading and inconclusive. I note, however, that you updated your STL pre-alloc test with a version that matches the C version in performance. The reason this works is that you are using arrays of primitive types in your benchmarks. This, however, is not a typical usage scenario. In a realistic situation in a real application, you are much more likely to use arrays of user-defined types than arrays of primitive types. The STL containers will call new/delete as items are inserted/removed, which means each item is guaranteed to be initialized/destroyed. A C version would have to do this plumbing manually.

This is why:

vector<int> work(NUM_ITERATIONS);
for (...) work[i] = i;

is not equivalent to your C pre-alloc test, because the vector constructor doesn't just allocate memory for NUM_ITERATIONS ints, it also initializes them.

At any rate, your benchmark is an extremely poor way to measure up the STL vs plain old C. Dynamic arrays are meant to be allocated, *iterated over several times*, and then freed, as opposed to allocated, *iterated over once*, and freed. Memory allocation is rarely the bottleneck in a typical application (i.e. if you spend too much time allocating memory it is not because your allocator is slow, it is because you're treating memory and allocation performance as infinite resources).

With plain old C constructs you are also almost guaranteed to lose the other benefits of the STL (type safety, algorithms, etc.).

About 99% of C features have an equivalent (in both functionality and performance) in C++. The opposite is not true.

Sergey Solyanik said...

Anonymous,

> Your updated code is still wrong. You should just create vector using default constructor, then call vector::reserve.

Reserve does not resize. You can do push_back with it, but can't do index access.

Another Anonymous,
I guess what I tested here are different paradigms (which is why I didn't change the conclusion of this article). STL allows you to not worry about allocations, and suffer the performance. It is possible to get the same performance, but vast majority of the code that I've see does not do this:
vector<x> y;
for (...)
y.push_back(...);
is the most common C++ idiom I've seen at Google, where STL is very popular.

In C-ish C++ you don't get this option, so you HAVE to pay attention, and you are more likely to get implementation (if not algorithmic) efficiency.

Now, if you only care about algorithmic efficiency, why not just use Java or C#? Both are somewhat nicer as languages - better tools, smaller difference between the full language and the commonly used subset, easier to understand, etc.

You've made a great point, though: I've tested code that is favorable to STL. Had I used something that actually invokes the copy constructor on assignment, the performance would have been worse. This, again, is a difference in approaches - with STL it's easy to write a = b without realizing that you've just copied 20M of data. In C, you would have to write memcpy(&a, &b, 20M), which is bound to draw attention (and then maybe you'll be storing pointers, not objects, in your vector).

To your other point, the memory bottlenecks are usually due to two things: cache locality and fragmented heaps. I have a write-up on this here:

http://1-800-magic.blogspot.com/2007/11/guerilla-guide-to-native-memory.html

Anonymous said...

STL would be a lot less scary for you if you just spent some more time going over the basics.

Even your final updated version is not quite correct, you've got these 2 lines in there:

vector<int> work(NUM_ITERATIONS);
work.reserve(NUM_ITERATIONS);

The first line creates a vector populated with NUM_ITERATIONS default-constructed objects.

Your second line reserves a chunk of memory to allow up to NUM_ITERATIONS objects to fit within the vector without it having to reallocate.

Since that number of items is already in the vector, your call to reserve does nothing at all; it can just be removed in this case.

You only call reserve to improve performance for the case where you want to populate the vector in other ways such as by individual push_backs.

One of the best things about STL is that you can get really great performance out of it if you know what you are doing.

It provides a lot of important things such as a fully debugged red-black tree implementation and algorithms with guaranteed running complexity.

Like any kind of higher level wrapper, it is easy to lose performance if you don't really understand what is happening behind the scenes.

This is solved by learning what is happening behind the scenes, and of course profiling to determine areas where you need to apply extra optimizations such as a memory pool.

STL is your friend for high performance programming, not your enemy!

Anonymous said...

"Another Anonymous,"

Another Anonymous here.

"vector <x> y;
for (...)
y.push_back(...);
is the most common C++ idiom I've seen at Google, where STL is very popular."

There is nothing wrong with this per se. Since push_back grows the vector by a factor of its current size, this is unlikely to cause a bottleneck in the vast majority of cases. For example, a hundred push_backs will only cause about 13 reallocations. This quickly becomes relatively negligible as the number of objects increases. Also, just because a seemingly inefficient idiom is used a lot in a codebase doesn't mean it is a performance bottleneck. Maybe there is great variability in the number of objects that will be held in the vector. Maybe the idiom is used in insignificant, performance-wise, portions of the codebase. Maybe the people who used this idiom *know what they are doing*.

"Now, if you only care about algorithmic efficiency, why not just use Java or C#?"

There are many reasons why one would choose C++ over Java or C# (both of which have many of the same issues that you claim are easier to spot in C because "you have to pay attention"), but I don't see what that has to do with my original comment or with this discussion. Besides, the STL provides other benefits over C besides containers, e.g. type safety, algorithms. I would certainly take these over a *hypothetical* and (if any) probably marginal improvement in performance in going all C.

"This, again, is a difference in approaches - with STL it's easy to write a = b without realizing that you've just copied 20M of data."

Why would you not realize that? This strikes me as saying that it is very easy to new or malloc memory without remembering to delete or free it. The STL makes it clear that it has value semantics. And with idioms like return value optimization it is often possible to avoid needless copying. However I can't say if that would apply to your original example of the interviewee returning a vector.

"To your other point, the memory bottlnecks are usually due to two things: cache locality and fragmented heaps. I have a write-up on this here:"

Right, and the STL allows you to supply your containers with custom allocators which would help avoid many of the problems you listed in your write-up.

You are making it sound like these are obscure issues of the STL that belong in the dark and far reaching corners of C++, but this is not the case. All of these issues are pretty basic. The STL has been part of the C++ standard for over a decade, yet it is pretty clear that you did not take the time to learn it. It is not without its flaws, but your criticism is not constructive.

Anonymous said...

@Sergey:
> Reserve does not resize. You can do
> push_back with it, but can't do
> index access.

For index access, you don't need the reserve call in the code included in the final paragraph of your post. It's a no-op anyway, because the vector already contains NUM_ITERATIONS elements.

Anonymous said...

Using subtraction in a comparator is a very bad habit. One day your subtraction will silently overflow, and you no longer have a valid ordering on the numbers you are trying to sort.

Yuri Chebotarev said...

Maybe this will be interesting to you. I tried to compare string performance using C, C++ with STL and boost, and C#. See the results here.

Anonymous said...

Besides trying to reinvent the wheel (and making a very weak attempt to beat the STL and the C++ people, who thought of everything the Java and C folks never did), you are also not reading that counter correctly, because you need a serializing instruction.

C++ beats so many C algorithms and use cases that it is scary to read this coming from Google, a company optimizing for speed.

Anonymous said...

I truly hope that performance testing or tuning is not your day job.

I wouldn't accept such a report/conclusion from my interns.

Mr. Hyde said...

It's advisable to surround your rdtsc with cpuid instructions; otherwise caching/pipelining will throw your timing off.

http://www.df.lth.se/~john_e/gems/gem0029.html

Anonymous said...

The code is still not as it should be.

The reserve in the following code fragment is useless.

vector<int> work(NUM_ITERATIONS);
work.reserve(NUM_ITERATIONS);
unsigned __int64 begin = rdtsc();
for (int j = 0; j < NUM_ITERATIONS; ++j)
    work[j] = j;

vector<int> work(NUM_ITERATIONS); will create a vector of NUM_ITERATIONS elements, so there is no point in using reserve.

Either write

vector<int> work(NUM_ITERATIONS);
for (int j = 0; j < NUM_ITERATIONS; ++j)
    work[j] = j;

or

vector<int> work;
work.reserve(NUM_ITERATIONS);
for (int j = 0; j < NUM_ITERATIONS; ++j)
    work.push_back(j);