Wednesday, February 27, 2008

STL vector perf redux

Alright, my exercise in STL vector perf analysis was to a large extent an exercise in n00bness. Nevertheless, I learned a lot of stuff, thanks to the readers, and here's the summary.

If you are looking for the gory details, read this: http://1-800-magic.blogspot.com/2008/02/stl-vector-performance.html and the commentary.

The task: stuff an STL vector with 10000 numbers, measure what it takes per iteration.

The performance of course will vary with the computer. Mine is T2600 @ 2.16GHz.

C reference point - pre-allocated array.
2 clocks per iteration:
int *work = (int *)malloc(10000 * sizeof(int));
for (int j = 0; j < 10000; ++j)
work[j] = j;

C reference point - dynamically allocated array using STL allocation policy.
70 clocks per iteration:
int size = 20;
int *work = (int *)malloc(size * sizeof(int));
for (int j = 0; j < 10000; ++j) {
    if (j >= size) {
        size = size + size / 2;
        work = (int *)realloc(work, size * sizeof(int));
    }
    work[j] = j;
}

STL, dynamically allocated vector, with or without security checks.
140 clocks per iteration.
vector<int> work;
for (int j = 0; j < 10000; ++j)
    work.push_back(j);

STL, pre-allocated, with security checks, push_back.
9 clocks per iteration.
vector<int> work;
work.reserve(10000);
for (int j = 0; j < 10000; ++j)
    work.push_back(j);

STL, pre-allocated, no security checks, push_back.
8 clocks per iteration.
#define _SECURE_SCL 0 // must be defined before #include <vector>
vector<int> work;
work.reserve(10000);
for (int j = 0; j < 10000; ++j)
    work.push_back(j);

STL, pre-allocated, with security checks, [].
6 clocks per iteration.
vector<int> work(10000);
for (int j = 0; j < 10000; ++j)
    work[j] = j;

STL, pre-allocated, no security checks, [].
2 clocks per iteration.
#define _SECURE_SCL 0 // must be defined before #include <vector>
vector<int> work(10000);
for (int j = 0; j < 10000; ++j)
    work[j] = j;


Résumé: know what you're doing. Performance can be anywhere from 3x to 70x slower if you're not careful.

For those who didn't read the previous post, and want to argue that none of this is important, read this http://1-800-magic.blogspot.com/2007/12/memory-is-not-free-more-on-vista.html for a fun real life story when you wouldn't think that performance matters until the very end, when it suddenly does, but then it's too late :-).

Another observation: the fact that most STL idioms are not THAT much slower is an accolade to modern CPU design.

If you actually look at the generated code, the STL (especially push_back) compiles to much bigger assembly than you could guess from the results above - even without the security check it is ~30 instructions plus the call into reallocating logic.
; for (int j = 0; j < NUM_ITERATIONS; ++j)
mov edx,dword ptr [esp+3Ch]
mov ecx,dword ptr [esp+38h]
xor esi,esi
mov dword ptr [esp+1Ch],esi
mov edi,edi
; work.push_back(j);
L0:
cmp ecx,edi
jne L1
xor eax,eax
jmp L2
L1:
mov eax,dword ptr [esp+40h]
sub eax,ecx
sar eax,2
L2:
mov ebx,edx
sub ebx,ecx
sar ebx,2
cmp ebx,eax
jae L3
mov dword ptr [edx],esi
add edx,4
mov dword ptr [esp+3Ch],edx
jmp L4
L3:
lea ecx,[esp+1Ch]
push ecx
push edx
lea eax,[esp+3Ch]
call std::vector<int,std::allocator<int> >::_Insert_n
mov edx,dword ptr [esp+3Ch]
mov ecx,dword ptr [esp+38h]
L4:
inc esi
cmp esi,2710h
mov dword ptr [esp+1Ch],esi
jl L0

This is only 4 times slower than the C loop, which compiles to so much less code:
xor         eax,eax 
lea esp,[esp]
L0:
mov dword ptr [ecx+eax*4],eax
inc eax
cmp eax,2710h
jl L0

This is because all these heaps of STL code execute in parallel, speculatively, whereas the straightforward C loop is not parallelizable at all.

7 comments:

Anonymous said...

You may also want to try measuring the performance of accessing vector elements with iterators (which should be plain pointers in release build):

vector<int> work(NUM_ITER);
int val = 0;
for (vector<int>::iterator it = work.begin();
     it != work.end();
     ++it, ++val)
    *it = val;

Unknown said...

If you're into optimizing array ops, you'll probably want to perform loop unrolling. In this case, the raw C version will probably gain the advantage again, because less code is generated, and the amount of code in the instruction pipeline does matter a lot for loop unrolling.

Ex:
int i = 0;
while (i < 10000) {
    work[i] = i;
    work[i+1] = i+1; i += 2;
    work[i] = i;
    work[i+1] = i+1; i += 2;
}

and of course other variations on the places where i is incremented and the size of the inner loop.

Anonymous said...

Anonymous again. I've just noticed that you compare the footprint of the code using vector::push_back with iteration over the elements of a plain C array. This particular comparison makes no sense, since push_back must handle the reallocations.

On VS 2008 with _SECURE_SCL defined to zero, the code below compiles to exactly the same four opcodes in the loop body.

vector<int> work(NUM_ITER);
for (int i = 0; i < NUM_ITER; i++)
    work[i] = i;


mov ecx, DWORD PTR _work$[esp+36]
xor eax, eax
$LL3@wmain:
mov DWORD PTR [ecx+eax*4], eax
inc eax
cmp eax, 10000
jl SHORT $LL3@wmain

Sergey Solyanik said...

Anonymous,

Yes, I know. Again, I compare the idiom most frequently used in one world vs. the idiom most frequently used in the other world. They don't have to be identical.

Besides, the point here is not to knock down STL, but to point out how processors compensate for the bigger code.

Anonymous said...

Please write more about STL and C++. I have a lot of fun reading it :-)

Anonymous said...

Stick to your other job.

Beware that 'Thinking against C++ and STL' is so common it is bound to be wrong.

Especially by people who don't, in technical speak, 'like' it.

Anonymous said...

I also want to mention that you should factor in the cost of STL destructors.

Try creating a large 2D grid (a vector of vectors), where each element is a vector.

Once the function returns, you will see the destructor kick in, and that can be a bottleneck (it certainly was in my case!)