<<BASM for Beginners>> Lesson 3

http://www.cnpack.org
QQ Group: 130970
Translator: SkyJacker
Version: draft
Status: not proofread
Date: 2007

In this third lesson, topics such as MMX and SSE2 will be introduced together with Int64 arithmetic.
This is the first time we will see processor-dependent optimisations.

The example looks like this:
function AddInt64_1(A, B : Int64) : Int64;
begin
 Result := A + B;
end;

Let us jump straight into the asm code:
function AddInt64_2(A, B : Int64) : Int64;
begin
 {
 push ebp
 mov ebp,esp
 add esp,-$08 // allocate temporary space
 }
 Result := A + B;
 {
 mov eax,[ebp+$10]
 mov edx,[ebp+$14]
 add eax,[ebp+$08] // add the low 32 bits
 adc edx,[ebp+$0c] // add the high 32 bits, with carry
 mov [ebp-$08],eax // store the low 32 bits
 mov [ebp-$04],edx // store the high 32 bits
 mov eax,[ebp-$08]
 mov edx,[ebp-$04]
 }
 {
 pop ecx
 pop ecx
 pop ebp
 //ret
 }
end;
The first three lines of code are recognized as setting up a stack frame, as in the previous lessons.
This time we know that the compiler might add the first two for us.
The last three lines are also a well-known pattern. Again the compiler might add pop ebp for us.
This brings us to the meat, which is these 8 lines implementing Result := A + B;
 {
 mov eax,[ebp+$10]
 mov edx,[ebp+$14]
 add eax,[ebp+$08]
 adc edx,[ebp+$0c]
 mov [ebp-$08],eax
 mov [ebp-$04],edx
 mov eax,[ebp-$08]
 mov edx,[ebp-$04]
 }

They can be analysed in pairs because they work in tandem, doing 64-bit math by splitting
the problem up into 32-bit pieces. The first two lines load A into the register pair edx:eax
(low half in eax, high half in edx).
They load a contiguous 64-bit block of data from the previous stack frame,
showing us that A was transferred on the stack.

The two load pointers are separated by 4 bytes.
One of them points to the beginning of A and the other one points into the middle of A.
Then come two add instructions. The first is a normal add and the second one is an add with carry (adc).
The pointers in these two lines point to B in the same fashion as the two previous ones pointed at A.
The first add adds the lower 32 bits of B to the lower 32 bits of A.
This might lead to a carry if the sum is too big to fit into 32 bits.
This carry is included in the addition of the higher 32 bits.
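The add/adc pair can be mimicked in plain Pascal. This is only a sketch to illustrate the carry: the record TInt64Halves and the function AddHalves are hypothetical names that do not appear in the lesson, and overflow and range checking are switched off so that the 32-bit halves may wrap.

```pascal
program AddWithCarrySketch;
{$APPTYPE CONSOLE}
{$Q-,R-} // wraparound of the 32-bit halves is intended here

uses
  SysUtils;

type
  // Hypothetical helper type: a 64-bit value as two 32-bit halves.
  TInt64Halves = record
    Lo, Hi: Cardinal;
  end;

function AddHalves(A, B: TInt64Halves): TInt64Halves;
var
  Carry: Cardinal;
begin
  Result.Lo := A.Lo + B.Lo;          // plays the role of: add eax,[ebp+$08]
  if Result.Lo < A.Lo then           // an unsigned wrap means a carry occurred
    Carry := 1
  else
    Carry := 0;
  Result.Hi := A.Hi + B.Hi + Carry;  // plays the role of: adc edx,[ebp+$0c]
end;

var
  A, B, S: TInt64Halves;
begin
  A.Lo := High(Cardinal); A.Hi := 0; // A = $00000000FFFFFFFF
  B.Lo := 1;              B.Hi := 0; // B = 1
  S := AddHalves(A, B);
  Writeln(IntToHex(S.Hi, 8), IntToHex(S.Lo, 8)); // prints 0000000100000000
end.
```

The low halves wrap to zero, and the carry is what turns the high half from 0 into 1.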

To make things totally clear, let's do a simple example on decimal numbers.
We have the addition 1+2=3.
The imaginary data type for this in our brain CPU is two digits wide.
This means that the addition actually looks like this: 01+02=03.
There is no carry from the addition of the lower digits into the higher ones, which are zero.

Decimal example two: 13+38=? First we add 3+8=11.
This results in a carry and a 1 in the lower half of the result.
Then we add Carry+1+3=1+1+3=5.
The result is 51.
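The two decimal examples can be replayed digit by digit. The procedure AddTwoDigits below is a hypothetical helper written for this illustration only; it models the two-digit "brain CPU" and also reports whether a carry came out of the higher digit.

```pascal
program DecimalCarrySketch;
{$APPTYPE CONSOLE}

// Hypothetical helper: adds two two-digit decimal numbers (0..99)
// one digit at a time, the way add/adc work on 32-bit halves.
procedure AddTwoDigits(A, B: Integer; out Sum: Integer; out CarryOut: Boolean);
var
  LowSum, HighSum, Carry: Integer;
begin
  LowSum := (A mod 10) + (B mod 10);          // add the lower digits
  Carry := LowSum div 10;                     // 1 if the lower digits overflowed
  HighSum := (A div 10) + (B div 10) + Carry; // add the higher digits plus carry
  Sum := (HighSum mod 10) * 10 + (LowSum mod 10);
  CarryOut := HighSum >= 10;                  // the result did not fit in two digits
end;

var
  Sum: Integer;
  CarryOut: Boolean;
begin
  AddTwoDigits(1, 2, Sum, CarryOut);
  Writeln(Sum, ' carry=', Ord(CarryOut));   // prints 3 carry=0
  AddTwoDigits(13, 38, Sum, CarryOut);
  Writeln(Sum, ' carry=', Ord(CarryOut));   // prints 51 carry=0
end.
```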

In the third example we provoke an overflow: 50+51=101.
101 is too big to fit in two digits and our brain CPU cannot perform the calculation.
There was a carry on the addition of the two higher digits. Back to code.
Two things can happen now.
If we compiled without overflow checking (the {$Q} directive, often lumped together with range checking),
the result wraps around. With checking enabled, an exception is raised.
We see that there is no checking code in our listing, so wraparound will occur.
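Both outcomes can be observed from Pascal. In Delphi the directive that governs arithmetic overflow is {$Q} (overflow checking). The sketch below assumes that a checked Int64 addition raises EIntOverflow; AddWrap and AddChecked are hypothetical names used only here.

```pascal
program OverflowCheckSketch;
{$APPTYPE CONSOLE}

uses
  SysUtils;

{$Q-}
function AddWrap(A, B: Int64): Int64;
begin
  Result := A + B; // no check is emitted: the result wraps around
end;

{$Q+}
function AddChecked(A, B: Int64): Int64;
begin
  Result := A + B; // a check is emitted: overflow raises an exception
end;
{$Q-}

begin
  if AddWrap(High(Int64), 1) = Low(Int64) then
    Writeln('unchecked: wrapped around to Low(Int64)');
  try
    AddChecked(High(Int64), 1);
    Writeln('checked: no exception');
  except
    on E: EIntOverflow do
      Writeln('checked: EIntOverflow raised');
  end;
end.
```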

The next two lines save the result into the current stack frame.
The last two lines load the result from the stack frame back into eax and edx, where it already was.
These 4 lines are redundant. They can be removed, and this also removes the need for local space in the stack frame.
It is so easy to be an optimizer ;-)

function AddInt64_6(A, B : Int64) : Int64;
asm
 mov eax,[ebp+$10]
 mov edx,[ebp+$14]
 add eax,[ebp+$08]
 adc edx,[ebp+$0c]
end;
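A quick way to gain confidence in the shortened function is to compare it with the compiler's own addition on a value that forces a carry out of the low 32 bits. This is a minimal sketch, assuming a 32-bit Delphi compiler that sets up the ebp frame for an asm routine with stack parameters, as described above.

```pascal
program AddInt64Check;
{$APPTYPE CONSOLE}

function AddInt64_1(A, B: Int64): Int64;
begin
  Result := A + B;
end;

function AddInt64_6(A, B: Int64): Int64;
asm
  mov eax,[ebp+$10]  // low half of A
  mov edx,[ebp+$14]  // high half of A
  add eax,[ebp+$08]  // add the low half of B
  adc edx,[ebp+$0c]  // add the high half of B plus carry
end;

var
  A, B: Int64;
begin
  A := High(Cardinal); // $00000000FFFFFFFF, so adding 1 carries into edx
  B := 1;
  if AddInt64_1(A, B) = AddInt64_6(A, B) then
    Writeln('match')
  else
    Writeln('mismatch');
end.
```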

This is a nice small function.
The compiler-generated code consisted of 16 lines and we came down to 4 with only a little effort.
Today Delphi was really sleepy.
Now we think like this: if we had 64-bit registers, the addition could be done with two lines of code.
But the MMX registers are 64 bits wide, and this might be worth taking advantage of.

In the Intel SW Developer's Manual, instructions are not marked as belonging to IA32, MMX, SSE or SSE2.
This information would be nice to have, but we have to look elsewhere for it.
I normally use three small programs from Intel, the so-called computer-based tutorials on MMX, SSE & SSE2.
I do not know where to find them on the Intel website now, but mail me if you want them.
They are simple and nice, and very illustrative.
In these I find that a 64-bit mov from memory into an MMX register is movq.
Q stands for quad word. The MMX registers are named mm0, mm1, ..., mm7.
They are not arranged as a stack, as the FP registers are, and we can pick whichever one we like.

Let's pick mm0. The first instruction looks like this:
movq    mm0, [ebp+$10]
There are two ways to go now. We can load B into a register too.
This makes it easy to see what is going on by using the FPU window.
The MMX registers are aliased onto the FP registers and the FPU view can show both sets.
Switch between FP and MMX view by selecting "Display as words/Display as extendeds" in the shortcut menu.
The second way to go is to use the pattern from the IA32 implementation and perform the addition with
the memory location of B as source.

The two solutions are expected to perform identically, because the CPU needs to load B into a register
before doing the addition, and whether this is done explicitly with mov or implicitly by the add instruction,
the number of micro-instructions will be the same. We use the more illustrative first way.

The next line is then a movq again:
movq    mm1, [ebp+$08]

Then we have to go looking for an add instruction, which would be something like this: paddq.
P for packed (the prefix of the MMX integer instructions), add for addition and q for quad word.
Now we get disappointed, because there is no such MMX instruction. What about SSE?
One more disappointment.

Finally SSE2 has it, and we are happy, or are we? If we use it, the code will target the P4 and will not run on P3 or Athlon.
Like the P4 lovers we are, we proceed anyway.
paddq   mm0, mm1
This line is very intuitive: it adds mm1 to mm0.
The only thing left is to copy the result from mm0 into edx:eax.
To do this we need a double word mov instruction that can take 32 bits from an MMX register as source and
an IA32 register as destination.
movd    eax, mm0
This MMX instruction does the job. It copies the lower 32 bits of mm0 to eax.

Then we need to copy the upper 32 bits of the result to edx. I could not find an instruction for that, so instead
I shift the upper 32 bits down into the lower 32 bits using a 64-bit MMX right shift instruction.

psrlq   mm0, 32
Then we copy:
movd    edx, mm0
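The same copy, shift, copy sequence can be written in plain Pascal to see what ends up in each register. This is an illustrative sketch only; the variables Lo and Hi stand in for eax and edx, and V stands in for mm0.

```pascal
program SplitInt64Sketch;
{$APPTYPE CONSOLE}
{$R-}

uses
  SysUtils;

var
  V: Int64;
  Lo, Hi: Cardinal;
begin
  V := (Int64(1) shl 32) + 2;        // V = $0000000100000002
  Lo := Cardinal(V and $FFFFFFFF);   // like: movd  eax, mm0 (low 32 bits)
  V := V shr 32;                     // like: psrlq mm0, 32  (high half moves down)
  Hi := Cardinal(V and $FFFFFFFF);   // like: movd  edx, mm0 (former high 32 bits)
  Writeln(IntToHex(Hi, 8), ' ', IntToHex(Lo, 8)); // prints 00000001 00000002
end.
```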
Then we are done? Unfortunately not: because we have used MMX instructions, we have to issue the emms instruction.

It cleans up the FP stack and leaves it in a well-defined empty state. emms burns 23 cycles on a P4.
Together with the shift, which is also inefficient on P4 (2 cycles throughput and latency), our solution is not
especially fast, and it will only run on P4 and this AMD thing nobody has yet :-(
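Put together, the instructions chosen in this lesson give a function along the following lines. It is a sketch under assumptions: the name AddInt64_MMX is hypothetical, the compiler is a 32-bit Delphi whose built-in assembler accepts the MMX and SSE2 mnemonics, and the CPU supports SSE2 (paddq on MMX registers needs it).

```pascal
program AddInt64MmxSketch;
{$APPTYPE CONSOLE}

// Hypothetical name; assembled from the instructions picked in the text.
function AddInt64_MMX(A, B: Int64): Int64;
asm
  movq  mm0, [ebp+$10]  // mm0 := A
  movq  mm1, [ebp+$08]  // mm1 := B
  paddq mm0, mm1        // mm0 := A + B (SSE2 instruction)
  movd  eax, mm0        // eax := low 32 bits of the sum
  psrlq mm0, 32         // shift the high 32 bits down
  movd  edx, mm0        // edx := high 32 bits of the sum
  emms                  // leave the FP/MMX state empty again
end;

var
  A, B: Int64;
begin
  A := High(Cardinal); // forces a carry out of the low 32 bits
  B := 1;
  // Compare against the compiler's own Int64 addition.
  if AddInt64_MMX(A, B) = A + B then
    Writeln('match')
  else
    Writeln('mismatch');
end.
```

The second way mentioned earlier would drop the second movq and add straight from memory, along the lines of paddq mm0, [ebp+$08]; whether the assembler wants an explicit size on that operand is not covered in this lesson.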

This ends the third lesson. We left the ball hanging in the air: can we come up with a more efficient solution?
Moving data between MMX registers and IA32 registers is expensive. The calling convention is no good either, because the data
were transferred on the stack and not in registers. eax->mm0 is 2 cycles.
The other way is 5 cycles. emms is 23 cycles. The addition is only 2 cycles. Overhead is plenty.
   
   
Translator's notes (on the English)

1. "This brings us into the meat, which are these 8 lines Result := A + B;"
"The meat" means the substance, the useful content: here, the 8 lines that implement Result := A + B.

2. "We left the ball hanging in the air."
The problem is left open for now.

3. "Today Delphi was really sleepy."
Delphi was being lazy here.