## Wednesday, February 27, 2008

### STL vector perf redux

Alright, my exercise in STL vector perf analysis was to a large extent an exercise in n00bness. Nevertheless, I learned a lot of stuff, thanks to the readers, and here's the summary.

If you are looking for the gory details, read this: http://1-800-magic.blogspot.com/2008/02/stl-vector-performance.html and the commentary.

The task: stuff an STL vector with 10000 numbers, measure what it takes per iteration.

The performance will of course vary with the computer. Mine is a T2600 @ 2.16GHz.

C reference point - pre-allocated array.
2 clocks per iteration:
```cpp
int *work = (int *)malloc(10000 * sizeof(int));
for (int j = 0; j < 10000; ++j)
    work[j] = j;
```

C reference point - dynamically allocated array using STL allocation policy.
70 clocks per iteration:
```cpp
int size = 20;
int *work = (int *)malloc(size * sizeof(int));
for (int j = 0; j < 10000; ++j) {
    if (j >= size) {
        size = size + size / 2;
        work = (int *)realloc(work, size * sizeof(int));
    }
    work[j] = j;
}
```

STL, dynamically allocated vector, with or without security checks.
140 clocks per iteration.
```cpp
vector<int> work;
for (int j = 0; j < 10000; ++j)
    work.push_back(j);
```

STL, pre-allocated, with security checks, push_back.
9 clocks per iteration.
```cpp
vector<int> work;
work.reserve(10000);
for (int j = 0; j < 10000; ++j)
    work.push_back(j);
```

STL, pre-allocated, no security checks, push_back.
8 clocks per iteration.
```cpp
#define _SECURE_SCL 0
vector<int> work;
work.reserve(10000);
for (int j = 0; j < 10000; ++j)
    work.push_back(j);
```

STL, pre-allocated, with security checks, [].
6 clocks per iteration.
```cpp
vector<int> work(10000);
for (int j = 0; j < 10000; ++j)
    work[j] = j;
```

STL, pre-allocated, no security checks, [].
2 clocks per iteration.
```cpp
#define _SECURE_SCL 0
vector<int> work(10000);
for (int j = 0; j < 10000; ++j)
    work[j] = j;
```

Summary: know what you're doing. The same task can run anywhere from 3x to 70x slower if you're not careful.

For those who didn't read the previous post and want to argue that none of this is important, read this: http://1-800-magic.blogspot.com/2007/12/memory-is-not-free-more-on-vista.html. It's a fun real-life story where you wouldn't think performance matters until the very end, when it suddenly does, and by then it's too late :-).

Another observation: the fact that most of the STL variants are not THAT much slower is a testament to modern CPU design.

If you actually look at the generated code, the STL (especially push_back) compiles to much bigger assembly than you would guess from the results above: even without the security check it is ~30 instructions, plus the call into the reallocation logic.
```asm
; for (int j = 0; j < NUM_ITERATIONS; ++j)
mov         edx,dword ptr [esp+3Ch]
mov         ecx,dword ptr [esp+38h]
xor         esi,esi
mov         dword ptr [esp+1Ch],esi
mov         edi,edi
; work.push_back(j);
L0:
cmp         ecx,edi
jne         L1
xor         eax,eax
jmp         L2
L1:
mov         eax,dword ptr [esp+40h]
sub         eax,ecx
sar         eax,2
L2:
mov         ebx,edx
sub         ebx,ecx
sar         ebx,2
cmp         ebx,eax
jae         L3
mov         dword ptr [edx],esi
add         edx,4
mov         dword ptr [esp+3Ch],edx
jmp         L4
L3:
lea         ecx,[esp+1Ch]
push        ecx
push        edx
lea         eax,[esp+3Ch]
call        std::vector<int,std::allocator<int> >::_Insert_n
mov         edx,dword ptr [esp+3Ch]
mov         ecx,dword ptr [esp+38h]
L4:
inc         esi
cmp         esi,2710h
mov         dword ptr [esp+1Ch],esi
jl          $LN22+5Ah (4013E0h)
```

Yet it is only about 4 times slower than the C loop, which compiles to far less code:
```asm
xor         eax,eax
lea         esp,[esp]
L0:
mov         dword ptr [ecx+eax*4],eax
inc         eax
cmp         eax,2710h
jl          L0
```

This is because all these heaps of STL code execute in parallel, speculatively, whereas the straightforward C loop is not parallelizable at all.

Anonymous said...

You may also want to try measuring the performance of accessing vector elements with iterators (which should be plain pointers in release build):

```cpp
vector<int> work(NUM_ITER);
int val = 0;
for (vector<int>::iterator it = work.begin();
     it != work.end();
     ++it, ++val)
    *it = val;
```

thbb said...

If you're into optimizing array ops, you'll probably want to perform loop unrolling. In this case, the raw C version will probably gain the advantage again, because less code is generated, and the amount of code in the instruction pipeline does matter a lot for loop unrolling.

Ex:
```cpp
int i = 0;
while (i < 10000) {
    work[i] = i;
    work[i + 1] = i + 1; i += 2;
    work[i] = i;
    work[i + 1] = i + 1; i += 2;
}
```

and of course other variations on the places where i is incremented and the size of the inner loop.

Anonymous said...

Anonymous again. I've just noticed that you compare the footprint of the code that uses vector::push_back with iteration over the elements of a plain C array. That particular comparison makes no sense, since push_back must also handle the reallocations.

On VS 2008 with _SECURE_SCL defined to zero, the code below compiles to exactly the same four opcodes in the loop body.

```cpp
vector<int> work(NUM_ITER);
for (int i = 0; i < NUM_ITER; i++)
    work[i] = i;
```

```asm
        mov     ecx, DWORD PTR _work$[esp+36]
        xor     eax, eax
$LL3@wmain:
        mov     DWORD PTR [ecx+eax*4], eax
        inc     eax
        cmp     eax, 10000
        jl      SHORT $LL3@wmain
```

Sergey Solyanik said...

Anonymous,

Yes, I know. Again, I compare the idiom most frequently used in one world vs. the idiom most frequently used in the other world. They don't have to be identical.

Besides, the point here is not to knock down STL, but to point out how processors compensate for the bigger code.

Anonymous said...

Beware that arguing against C++ and the STL is so common that it is bound to be wrong.

Especially when it comes from people who, in technical speak, don't 'like' it.

Shivesh W. said...

I also want to mention that you should factor in the cost of the vector's destructor.

Try creating a large 2D grid (a vector of vectors), where each element is itself a vector.

Once the function returns, you will see the destructor kick in, and that can be a bottleneck (it certainly is in my case!)