Friday, December 21, 2007

Integer multiplication and dynamic programming (Part 1 of 2)

Integer multiplication is one of the operations that benefited a lot from the transition to the Core architecture. Its latency dropped from 10 clocks on the Pentium 4 to 3 on the Core 2 Duo. As it happens, the Pentium 4 was a step backwards from the PIII, where it was 4 clocks. But those were slower clocks :-).

I happen to look at Microsoft Visual Studio-generated assembly a lot, and for a while I kept noticing that whenever a program needed to multiply by an integer constant, the multiplication was almost always replaced by other instructions - and this was true not just for multiplying by a power of two, where a simple shift suffices.

(With VS 2008 this is no longer true - the 3-clock latency of imul is cheaper than the instruction-cache pressure of replacing it with two or more instructions.)

The x86 instruction set has a number of other operations that can be used in lieu of integer multiplication, and you can combine them to reach fairly complex multipliers. For example, suppose you want to multiply EAX by an integer constant.

You can:
SHL EAX, n               ; multiplies by 2 to the power of n
LEA EAX, [EAX + 2 * EAX] ; multiplies by 3 (*)
LEA EAX, [EAX + 4 * EAX] ; multiplies by 5
LEA EAX, [EAX + 8 * EAX] ; multiplies by 9

(*) LEA = load (really, compute) effective address. It is an indexing instruction designed to quickly add the index of an array element times the element size (which can be 1, 2, 4, or 8) to the starting address of the array, yielding the address of the element. Usually the start of the array lives in one register and the index in another, so a typical use looks like lea eax, [edx + 8*esi]. But with the right registers it can be used for multiplication as well.

Now, suppose you can use another register (say, the number you're multiplying can be copied to EDX).

Then you can do even more:
ADD EAX, EDX             ; +1
SUB EAX, EDX             ; -1
LEA EAX, [EAX + 2 * EDX] ; +2
LEA EAX, [EAX + 4 * EDX] ; +4
LEA EAX, [EAX + 8 * EDX] ; +8
LEA EAX, [EDX + 2 * EAX] ; *2 + 1
LEA EAX, [EDX + 4 * EAX] ; *4 + 1
LEA EAX, [EDX + 8 * EAX] ; *8 + 1
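
Notice that every one of these steps fits a single pattern: it turns the current value in EAX (always some multiple of the original value x) into mul*EAX + add*x. Here is a minimal sketch of that model in C - the struct, table, and names are my own illustration, not from any compiler source:

#include <stdio.h>
#include <stdint.h>

/* Each step maps the accumulator a (in EAX) and the original value x
   (in EDX) to mul*a + add*x; in multiplier terms, k becomes
   mul*k + add. A real table would have one row per shift amount;
   only shift-by-1 is listed here to keep it short. */
typedef struct {
    int64_t mul, add;
    const char *text;       /* the instruction this step stands for */
} Op;

static const Op ops[] = {
    { 2, 0, "shl eax, 1"             },  /* *2     */
    { 3, 0, "lea eax, [eax + 2*eax]" },  /* *3     */
    { 5, 0, "lea eax, [eax + 4*eax]" },  /* *5     */
    { 9, 0, "lea eax, [eax + 8*eax]" },  /* *9     */
    { 1, 1, "add eax, edx"           },  /* +1     */
    { 1,-1, "sub eax, edx"           },  /* -1     */
    { 1, 2, "lea eax, [eax + 2*edx]" },  /* +2     */
    { 1, 4, "lea eax, [eax + 4*edx]" },  /* +4     */
    { 1, 8, "lea eax, [eax + 8*edx]" },  /* +8     */
    { 2, 1, "lea eax, [edx + 2*eax]" },  /* *2 + 1 */
    { 4, 1, "lea eax, [edx + 4*eax]" },  /* *4 + 1 */
    { 8, 1, "lea eax, [edx + 8*eax]" },  /* *8 + 1 */
};

static int64_t apply(Op op, int64_t k) { return op.mul * k + op.add; }

int main(void)
{
    /* In multiplier space: 1 -> 5 (*5) -> 13 (+8), i.e. the
       "multiply by 13" chain shown below. Prints 13. */
    printf("%lld\n", (long long)apply(ops[8], apply(ops[2], 1)));
    return 0;
}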


By using combinations of these operations, you can achieve amazing things:

Multiply by 45 (5 * 9):
lea eax, [eax + 8*eax]   ; eax = 9x
lea eax, [eax + 4*eax]   ; eax = 45x

Multiply by 78:
mov ecx, eax             ; ecx = x
lea eax, [ecx + 4*eax]   ; eax = 5x
lea eax, [eax + 8*ecx]   ; eax = 13x
lea eax, [eax + 2*eax]   ; eax = 39x
shl eax, 1               ; eax = 78x

Multiply by 13:
mov ecx, eax             ; ecx = x
lea eax, [ecx + 4*eax]   ; eax = 5x
lea eax, [eax + 8*ecx]   ; eax = 13x

Multiply by 241:
mov ecx, eax             ; ecx = x
lea eax, [ecx + 4*eax]   ; eax = 5x
lea eax, [eax + 2*eax]   ; eax = 15x
shl eax, 4               ; eax = 240x
add eax, ecx             ; eax = 241x

Multiply by 189:
mov ecx, eax             ; ecx = x
lea eax, [ecx + 4*eax]   ; eax = 5x
lea eax, [ecx + 4*eax]   ; eax = 21x
lea eax, [eax + 8*eax]   ; eax = 189x

Multiply by 1023:
mov ecx, eax             ; ecx = x
shl eax, 10              ; eax = 1024x
sub eax, ecx             ; eax = 1023x
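
If you want to convince yourself a sequence is right, the arithmetic is easy to replay in C. A tiny self-check for the "multiply by 78" chain (mul78 is just my name for it, mirroring the instructions line by line):

#include <assert.h>
#include <stdint.h>

/* Each line mirrors one instruction of the sequence above. */
static uint32_t mul78(uint32_t x)
{
    uint32_t a = x;        /* mov ecx, eax  (x is the saved copy) */
    a = x + 4 * a;         /* lea eax, [ecx + 4*eax]  ->  5x      */
    a = a + 8 * x;         /* lea eax, [eax + 8*ecx]  -> 13x      */
    a = a + 2 * a;         /* lea eax, [eax + 2*eax]  -> 39x      */
    a <<= 1;               /* shl eax, 1              -> 78x      */
    return a;
}

int main(void)
{
    /* Unsigned arithmetic wraps identically on both sides, so a
       spot check over a range is enough. */
    for (uint32_t x = 0; x < 100000; x++)
        assert(mul78(x) == 78 * x);
    return 0;
}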


How does the compiler find the right decomposition of the multiplier into these operations? The answer is dynamic programming.

The essence of dynamic programming is decomposing a problem into simpler subtasks that can be computed recursively by decomposing them into still simpler ones, reusing the answer whenever the same subtask shows up more than once.

In this case, if we have a multiplier we want to decompose, we can try every possible operation from the list above. 'Trying an operation' actually means applying its inverse to the multiplier - e. g. if we want to see if we can use the 'multiply by 9' operation (lea eax, [eax + 8 * eax]), we divide our multiplier by 9 (decomposing it into *9 and other operations), then call Decompose on this new number - but only if it exists, i. e. is a whole number.

For every multiplier we go through all the inverse operations and note which one results in the shortest chain. That is the operation we will end up using.

The algorithm is as follows:

Let F(1)..F(N) be all of the above operations and F⁻¹(1)..F⁻¹(N) their inverses (so if, for example, F(3) = lea eax, [eax + 8 * eax] - multiplication by 9 - then F⁻¹(3) would be division by 9).

Decompose (multiplier, recursion_level):
    if recursion_level > 10 return infinite (*)
    if multiplier == 1 return 0
    min_cost = infinite
    best_op = none
    for n = 1..N
        next_multiplier = F⁻¹(n)(multiplier)
        if next_multiplier is a whole number:
            cost = 1 + Decompose(next_multiplier, recursion_level + 1)
            if cost < min_cost:
                min_cost = cost
                best_op = F(n)
    decomposition[multiplier] = best_op
    return min_cost

(*) This makes sure the recursion terminates. Because some of our inverse operations can actually make the result bigger (specifically, the inverse of -1 is +1), the recursion would otherwise be infinite.
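
Before the real code, here is a compact, runnable sketch of this search - my own simplification, not tomorrow's code. It reuses the (mul, add) table from the earlier sketch (repeated so the block stands alone), and takes one liberty: instead of storing decomposition[multiplier] like the pseudocode, it shrinks the remaining budget to the best cost found so far, which keeps the memo-less version fast:

#include <stdio.h>
#include <stdint.h>

typedef struct { int64_t mul, add; const char *text; } Op;

static const Op ops[] = {
    { 2, 0, "shl eax, 1"             },
    { 3, 0, "lea eax, [eax + 2*eax]" },
    { 5, 0, "lea eax, [eax + 4*eax]" },
    { 9, 0, "lea eax, [eax + 8*eax]" },
    { 1, 1, "add eax, edx"           },
    { 1,-1, "sub eax, edx"           },
    { 1, 2, "lea eax, [eax + 2*edx]" },
    { 1, 4, "lea eax, [eax + 4*edx]" },
    { 1, 8, "lea eax, [eax + 8*edx]" },
    { 2, 1, "lea eax, [edx + 2*eax]" },
    { 4, 1, "lea eax, [edx + 4*eax]" },
    { 8, 1, "lea eax, [edx + 8*eax]" },
};
enum { NOPS = sizeof ops / sizeof ops[0], INFINITE = 1000 };

/* Returns the length of the shortest chain for 'multiplier' and
   stores the last operation of that chain in *best (-1 at the base
   case). An operation maps k to mul*k + add, so its inverse is
   (k - add) / mul, usable only when the division is exact. */
static int decompose(int64_t multiplier, int budget, int *best)
{
    *best = -1;
    if (multiplier == 1) return 0;     /* nothing left to do     */
    if (budget <= 0) return INFINITE;  /* guarantees termination */

    int min_cost = INFINITE;
    for (int n = 0; n < NOPS; n++) {
        int64_t prev = multiplier - ops[n].add;       /* undo the add */
        if (prev < 1 || prev % ops[n].mul) continue;  /* not whole    */
        prev /= ops[n].mul;                           /* undo the mul */

        /* Only spend what could still beat the best chain so far. */
        int sub, sub_budget = budget - 1;
        if (min_cost - 2 < sub_budget) sub_budget = min_cost - 2;
        int cost = 1 + decompose(prev, sub_budget, &sub);
        if (cost < min_cost) { min_cost = cost; *best = n; }
    }
    return min_cost;
}

/* Prints the chain in execution order by walking the inverses back. */
static void print_chain(int64_t multiplier)
{
    int best;
    if (decompose(multiplier, 10, &best) >= INFINITE || best < 0)
        return;
    print_chain((multiplier - ops[best].add) / ops[best].mul);
    printf("%s\n", ops[best].text);
}

int main(void)
{
    print_chain(45);  /* prints the two LEAs from the example above */
    return 0;
}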

The actual C code is, of course, much more involved. I will post it tomorrow.
