1-800-MAGIC: STL vector perf redux

Wednesday, February 27, 2008

STL vector perf redux

Alright, my exercise in STL vector perf analysis was to a large extent an exercise in n00bness. Nevertheless, I learned a lot of stuff, thanks to the readers, and here's the summary.

If you are looking for the gory details, read this: http://1-800-magic.blogspot.com/2008/02/stl-vector-performance.html and the commentary.

The task: stuff an STL vector with 10000 numbers, measure what it takes per iteration.

The performance of course will vary with the computer. Mine is T2600 @ 2.16GHz.

C reference point - pre-allocated array.
2 clocks per iteration:

int *work = (int *)malloc(10000 * sizeof(int));
for (int j = 0; j < 10000; ++j)
  work[j] = j;

C reference point - dynamically allocated array using STL allocation policy.
70 clocks per iteration:

int size = 20;
int *work = (int *)malloc(size * sizeof(int));
for (int j = 0; j < 10000; ++j) {
  if (j >= size) {
    size = size + size / 2;
    work = (int *)realloc(work,
        size * sizeof(int));
  }
  work[j] = j;
}

STL, dynamically allocated vector, with or without security checks.
140 clocks per iteration.

vector work;
for (int j = 0; j < 10000; ++j)
  work.push_back(j);

STL, pre-allocated, with security checks, push_back.
9 clocks per iteration.

vector work;
work.reserve(10000);
for (int j = 0; j < 10000; ++j)
  work.push_back(j);

STL, pre-allocated, no security checks, push_back.
8 clocks per iteration.

#define _SECURE_SCL 0
vector work;
work.reserve(10000);
for (int j = 0; j < 10000; ++j)
  work.push_back(j);

STL, pre-allocated, with security checks, [].
6 clocks per iteration.

vector work(10000);
for (int j = 0; j < 10000; ++j)
  work[j] = j;

STL, pre-allocated, no security checks, [].
2 clocks per iteration.

#define _SECURE_SCL 0
vector work(10000);
for (int j = 0; j < 10000; ++j)
  work[j] = j;

Resume: know what you're doing. Performance can be from 3x to 70x slower if you're not careful.

For those who didn't read the previous post, and want to argue that none of this is important, read this http://1-800-magic.blogspot.com/2007/12/memory-is-not-free-more-on-vista.html for a fun real life story when you wouldn't think that performance matters until the very end, when it suddenly does, but then it's too late :-).

Another observation: the fact that most STL implementations are not THAT much smaller is an accolade to the modern CPU design.

If you actually look at the generated code, the STL (especially push_back) compiles to much bigger assembly than you could guess from the results above - even without the security check it is ~30 instructions plus the call into reallocating logic.

; for (int j = 0; j < NUM_ITERATIONS; ++j)
mov         edx,dword ptr [esp+3Ch] 
mov         ecx,dword ptr [esp+38h] 
xor         esi,esi 
mov         dword ptr [esp+1Ch],esi 
mov         edi,edi 
; work.push_back(j);
L0:
cmp         ecx,edi 
jne         L1
xor         eax,eax 
jmp         L2
L1:
mov         eax,dword ptr [esp+40h] 
sub         eax,ecx 
sar         eax,2 
L2:
mov         ebx,edx 
sub         ebx,ecx 
sar         ebx,2 
cmp         ebx,eax 
jae         L3
mov         dword ptr [edx],esi 
add         edx,4 
mov         dword ptr [esp+3Ch],edx 
jmp         L4
L3:
lea         ecx,[esp+1Ch] 
push        ecx  
push        edx  
lea         eax,[esp+3Ch] 
call        std::vector<int,std::allocator<int> >::_Insert_n
mov         edx,dword ptr [esp+3Ch] 
mov         ecx,dword ptr [esp+38h] 
L4:
inc         esi  
cmp         esi,2710h 
mov         dword ptr [esp+1Ch],esi 
jl          $LN22+5Ah (4013E0h)

This is only 4 times slower than the C loop, which is so much less:

xor         eax,eax 
lea         esp,[esp]
L0: 
mov         dword ptr [ecx+eax*4],eax 
inc         eax  
cmp         eax,2710h 
jl          L0

This is because all these heaps of STL code execute in parallel, speculatively, whereas the straightforward C loop is not parallelizeable at all.

7 comments:

Anonymous said...: You may also want to try measuring the performance of accessing vector elements with iterators (which should be plain pointers in release build):

vector work(NUM_ITER);
int val = 0;
for (vector::iterator it = work.begin();
it != work.end;
++ it, ++ val)
*it = val;; February 28, 2008 at 3:07 AM
Unknown said...: If you're into optimizing array ops, you'll probably want to perform loop unrolling. In this case, the raw C version will probably gain the advantage again, because less code is generated, and the amount of code in the instruction pipeline does matter a lot for loop unrolling.

Ex:
int i=0;
while(i<10000) {
work[i]=i;
work[i+1]=i+1; i+=2;
work[i]=i;
work[i+1]=i+1; i+=2;
}

and of course other variations on the places where i is incremented and the size of the inner loop.; February 28, 2008 at 3:34 AM
Anonymous said...: Anonymous again. I've just noticed that you compare the footprint of the code using vector::push_back with iteration over elements of plain C array. This particular comparison makes no sense, since push_back must handle the reallocations.

On VS 2008 with _SECURE_SCL defined to zero, the code below compiles to exactly the same four opcodes in the loop body.

vector work(NUM_ITER);
for (int i = 0; i < NUM_ITER; i ++)
work[i] = i;

mov ecx, DWORD PTR _work$[esp+36]
xor eax, eax
$LL3@wmain:
mov DWORD PTR [ecx+eax*4], eax
inc eax
cmp eax, 10000
jl SHORT $LL3@wmain; February 28, 2008 at 4:47 AM
Sergey Solyanik said...: Anonymous,

Yes, I know. Again, I compare the idiom most frequently used in one world vs. the idiom most frequently used in the other world. They don't have to be identical.

Besides, the point here is not to knock down STL, but to point out how processors compensate for the bigger code.; February 28, 2008 at 9:12 AM
Anonymous said...: Please write more about STL and C++. I have a lot of fun reading it :-); February 28, 2008 at 10:05 AM
Anonymous said...: Stick to your other job.

Beware that 'Thinking against C++ and STL' is so common it is bound to be wrong.

Especially by people who don't, in technical speak, 'like' it.; March 1, 2008 at 8:32 AM
Anonymous said...: I want to also mention that you should also factor in the cost of STL's destructor as well.

Try creating a 2D large grid (a vector of vector), where each element is a vector.

Once the function returns, you will see the destructor kicks in, and that could be a bottleneck (certainly is the case of mine!); November 9, 2008 at 7:13 PM

1-800-MAGIC

Wednesday, February 27, 2008

STL vector perf redux

7 comments:

NIST Clock

About Me

Blog Archive

Links

Visitors