BASM for Beginners, Lesson 1

http://www.cnpack.org
Translator: SkyJacker
Version: draft
УԣԺ
Date: 2007

Introduction to BASM for Beginners 

The series of articles named BASM for Beginners currently consists of 7 articles; numbers 8 and 9 are in progress.
What the published and upcoming articles have in common is that each explains some BASM issues by means of an example function.
Most often this function is first implemented in Pascal; the compiler-generated assembler code is then copied from the CPU view in Delphi, analyzed and optimized.
Sometimes the optimization involves MMX, SSE or SSE2 instructions.

By starting from the code the compiler generates for a Pascal function, the beginner first meets the most commonly used instructions of the large 32-bit Intel Architecture instruction set. Seeing which code the compiler generates also gives valuable insight into the efficiency of compiler-generated code in general and of the Delphi compiler in particular.

As specific assembly code optimizations are introduced, they are generalized where suitable. These general optimizations are suitable for implementation in compilers, and most compilers, including Delphi, have them. At some point in the future a tool that automatically optimizes assembler code will be developed.

Knowledge about the target processor is often needed when optimizing code, and therefore a lot of CPU details, such as pipelines, are explained in the series too.
As far as I know, there is only little literature available that explains all these issues at a level beginners can follow. I hope this series will help fill this void.

Best regards
Dennis Kjaer Christensen.
Lesson 1
The first little example gets us started. It is a simple Pascal function which multiplies an integer by the constant 2.

function MulInt2(I : Integer) : Integer;
begin
 Result := I * 2;
end;

Let's steal the BASM from the CPU view. I compiled with optimizations turned on.
function MulInt2_BASM(I : Integer) : Integer;
begin
 Result := I * 2;
 {
 add eax,eax
 ret
 }
end;

From this we see that I is passed to the function in eax and that the result is passed back to the caller in eax too. This is the rule of the register calling convention, which is the default in Delphi. The actual code is very simple: the multiplication by 2 is obtained by adding I to itself, I + I = 2I. The ret instruction returns execution to the instruction after the call that invoked the function.
 
Let's code the function as a pure asm function.
function MulInt2_BASM2(I : Integer) : Integer;
asm
 //Result := I * 2;
 add eax,eax
 //ret
end;
Observe that the ret instruction is supplied by the inline assembler.
Let us take a look at the calling code.
This is the Pascal code:
procedure TForm1.Button1Click(Sender: TObject);
var
 I, J : Integer;
begin
 I := StrToInt(IEdit.Text);
 J := MulInt2_BASM2(I);
 JEdit.Text := IntToStr(J);
end;

The important line is:
 J := MulInt2_BASM2(I);
From the CPU view:
 call StrToInt
 call MulInt2_BASM2
 mov esi,eax
 
After the call to StrToInt, on the line before the one that calls our function, I is in eax (StrToInt also follows the register calling convention). MulInt2_BASM2 is then called and returns its result in eax, which is copied to esi in the next line.

Optimization issues: Multiplication by 2 can be done in two more ways: use the mul instruction, or shift left by one. mul is described in the Intel IA32 SW developers manual 2, page 536. It multiplies the value in eax by another register, and the result is returned in the register pair edx:eax. A register pair is needed because multiplying two 32-bit numbers gives a result of up to 64 bits, just as 9*9=81 shows that two one-digit numbers can give a two-digit result.
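
The need for a register pair can be checked from plain Pascal. The following is only a minimal sketch, not part of the original article: the program name MulWide and the sample values are assumptions chosen for illustration. It multiplies two 32-bit values in 64-bit arithmetic and prints the two halves that mul would leave in edx and eax.

program MulWide;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}
var
  A, B: Cardinal;  // two unsigned 32-bit operands (illustrative values)
  P: Int64;        // wide enough to hold the full product
begin
  A := $FFFFFFFF;
  B := 2;
  // Widen one operand first so the multiplication is done in 64 bits.
  P := Int64(A) * B;
  // $FFFFFFFF * 2 = $1FFFFFFFE: high half 1, low half $FFFFFFFE.
  Writeln('high half (edx): ', P shr 32);
  Writeln('low half  (eax): ', P and $FFFFFFFF);
end.

With these values the high half is not zero, which is exactly the case where a single 32-bit register cannot hold the product.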

This raises the issue of which registers must be preserved by a function and which can be used freely. This is explained in the Delphi help:
"An asm statement must preserve the EDI, ESI, ESP, EBP, and EBX registers, but can freely modify the EAX, ECX, and EDX registers."

We can conclude that it is no problem that edx is modified by the mul instruction, so our function can also be implemented like this:
function MulInt2_BASM3(I : Integer) : Integer;
asm
 //Result := I * 2;
 mov ecx, 2
 mul ecx
end;
ecx is used too, but this is also fine. As long as the result is within the range of Integer, it is returned correctly in eax. If I is bigger than half the range of Integer, overflow occurs and the result is incorrect.
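
The overflow case can be made visible with a small Pascal program. This is a sketch under stated assumptions, not code from the original article: the program name MulOverflow is made up, and overflow and range checking are switched off explicitly so that the Pascal version behaves like the unchecked assembler versions.

program MulOverflow;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}
{$Q-,R-}  // no overflow or range checks, like the bare asm versions

function MulInt2(I : Integer) : Integer;
begin
  Result := I * 2;
end;

begin
  // MaxInt div 2 = 1073741823: the result 2147483646 still fits.
  Writeln(MulInt2(1073741823));
  // One more than half the range: the 32-bit result wraps to -2147483648.
  Writeln(MulInt2(1073741824));
end.

The second call is the smallest positive I for which the returned value no longer equals 2I.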

Implementation with shift:
function MulInt2_BASM4(I : Integer) : Integer;
asm
 //Result := I * 2;
 shl eax,1
end;
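
That the shift gives the same result as the multiplication can be checked in plain Pascal before trusting the asm version. The program below is an illustrative sketch only: the program name ShiftCheck and the sample values are assumptions, and checks are switched off as in the asm functions.

program ShiftCheck;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}
{$Q-,R-}
const
  // A few arbitrary test values, including a negative one.
  Samples : array[0..3] of Integer = (0, 1, 12345, -7);
var
  K : Integer;
begin
  for K := Low(Samples) to High(Samples) do
    // Print both forms side by side; they should agree for every sample.
    Writeln(Samples[K], ' * 2 = ', Samples[K] * 2,
            ', shl 1 = ', Samples[K] shl 1);
end.
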
Timing can reveal which implementation is fastest. We can also consult the Intel or AMD documents with latency and throughput tables. add and mov have a latency and throughput of 0.5 cycles, mul has a latency of 14-18 cycles and a throughput of 5 cycles, and shl has a latency of 4 cycles and a throughput of 1 cycle. The version chosen by Delphi is the most efficient on P4, and this will probably also be the case on Athlon and P3.
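
A rough way to time the variants is sketched below. It is not from the original article and makes several assumptions: a 32-bit Delphi compiler on Windows, the GetTickCount function from the Windows unit as a coarse millisecond timer, and the made-up names TimeMul, MulInt2_Add, MulInt2_Shl and Runs. Because each function is only one instruction long, the call and ret overhead will dominate the measurement, so the numbers should be read as a comparison of whole calls rather than of the single instructions.

program TimeMul;
{$APPTYPE CONSOLE}
uses
  Windows;  // for GetTickCount (assumed coarse timer)

function MulInt2_Add(I : Integer) : Integer;
asm
  add eax,eax
end;

function MulInt2_Shl(I : Integer) : Integer;
asm
  shl eax,1
end;

const
  Runs = 100000000;  // arbitrary repeat count, large enough to measure
var
  K, Sum : Integer;
  T0 : Cardinal;
begin
  // Sum keeps the results "used" so the calls are not pointless.
  Sum := 0;
  T0 := GetTickCount;
  for K := 1 to Runs do
    Sum := Sum xor MulInt2_Add(K);
  Writeln('add: ', GetTickCount - T0, ' ms (checksum ', Sum, ')');

  Sum := 0;
  T0 := GetTickCount;
  for K := 1 to Runs do
    Sum := Sum xor MulInt2_Shl(K);
  Writeln('shl: ', GetTickCount - T0, ' ms (checksum ', Sum, ')');
end.

Both loops should print the same checksum, which is a cheap sanity check that the two functions really compute the same values.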

Issues not covered: mul versus imul and range checking, other calling conventions, benchmarking, clock counts on other processors, clock count for call + ret, location of the return address for ret, etc.


Translator's notes

1. Full names of the CPU extension instruction sets
MMX (Multi Media eXtension)
SSE (Streaming SIMD Extensions)
SSE2 (Streaming SIMD Extensions 2)

2. Latency and throughput
http://www.yesky.com/120/1645120.shtml

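Latency: the meaning of this term is somewhat hard to grasp. In practice it denotes the number of clock cycles needed to execute an instruction completely; the lower the latency, the better.
Strictly speaking, latency covers the whole path of an instruction from the moment it is received until it is finished. Most x86 instructions need about 5 clock cycles, but some of these cycles overlap with other instructions, so the latency figures stated by CPU manufacturers are longer than the time actually observed.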

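Throughput has two meanings:
First: the minimum number of clock cycles needed to execute one instruction; the lower, the better. The faster an instruction executes, the less chance the next instruction has to compete with it for resources.
Second: the number of instructions that can be executed within a given number of clock cycles; here, of course, more is better.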

3 ʵ 1A, 1B, 1C ɡ 1A