Lesson 2
<<BASM ѧ>>  2 

http://www.cnpack.org
QQ Group: 130970
룺SkyJacker
汾ݸ
״̬δУ
ʱ䣺2007.04.23

This is the second chapter of the introduction to BASM programming with Delphi. 
The first chapter was a short introduction to integer code and this second one is about floating point code.
Our example functions can evaluate a second order polynomial. The parameters A, B and C that defines 
the polynomial is coded as local constants. Input to the function is the variable X of type double and 
the result is also of type double. The function looks like this.
 Delphi BASM ĵڶ¡
һ¼򵥵Ľָڶ½ܹڸָ
ǵһ2ζʽ
A,B,CǶʽϵĲǸ X䷵ֵҲǸ
ʾ

function SecondOrderPolynomial1(X : Double) : Double;
const
 A : Double = 1;
 B : Double = 2;
 C : Double = 3;
begin
 Result := A*X*X + B*X + C;
end;
Copying the assembler code from the CPU view gives us this.
 CPU view еĻ룺
function SecondOrderPolynomial2(X : Double) : Double;
const
 A : Double = 1;
 B : Double = 2;
 C : Double = 3;

begin
 {
 push  ebp
 mov   ebp,esp
 add   esp,-$08
 }
 Result := A*X*X + B*X + C;
 {
 fld   qword ptr [A]
 fmul  qword ptr [ebp+$08] 
 fmul  qword ptr [ebp+$08]
 fld   qword ptr [B]
 fmul  qword ptr [ebp+$08]
 faddp st(1)
 fadd  qword ptr [C]
 fstp  qword ptr [ebp-$08]
 wait
 fld   qword ptr [ebp-$08]
 }
 {
 pop   ecx
 pop   ecx
 pop   ebp
 }
end;
Lets explain the asm code line by line.
һʹ.

The begin results in this code.
begin ָĴ
begin
 {
 push  ebp
 mov   ebp,esp
 add   esp,-$08
 }

which sets up a stack frame for the function.
A stack frame is just a piece of memory that is reserved for the stack.
A stack frame is accessed through two pointers, the base pointer and the stack pointer.
The base pointer is in ebp and the stack pointer is in esp. These two registers are reserved for use by these
pointers only.
The line push ebp backsup the base pointer. The line mov ebp, esp sets up a new base pointer,
which is pointing to the top of the stack. The line add esp, -$08 moves the stack pointer 8 bytes down.
As a curiosity the stack grows downward and the last line could more intuitively have been sub esp,8.
The new stack frame that was created by these three lines is standing on top of, or actually hanging under,
the last stack frame,  which was probably allocated by the function that called our SecondOrderPolynomial function.
ⲿִúĶջ
ջڴбһ򣬶ջָͨʵģַָջָ롣
ַָebpУջָespС
Ĵרʶջġ
Push esb ebpָ룬
Mov ebp, esp ַָָǰջ,
add esp, -$08 ջָ8ֽڡ
ջµģһ sub esp, 8 ֱ۵ıʾ˼
µĶջռ䱻3д봴ʵڵ SecondOrderPolynomial ִ֮С

The next line of Pascal was compiled into no less than 9 lines of ASM.
һ pascal 뱻ɲС 9 е ASM 롣

Result := A*X*X + B*X + C;
 {
 fld   qword ptr [A]       // A װ st(0)
 fmul  qword ptr [ebp+$08] //st(0) * X ->st(0)
 fmul  qword ptr [ebp+$08] //st(0) * X ->st(0)  A * X * X
 fld   qword ptr [B]  	   //st(0)Ϊգִһջst(0)->st(1)
								           //	B -> st(0)St(0) Ϊջ
 fmul  qword ptr [ebp+$08] //st(0) * X ->st(0) = B * X
 faddp st(1)               //st(1) + st(0) -> st(1)ִгջst(1)->st(0)
 fadd  qword ptr [C]       //st(0) + C -> st(0)
 fstp  qword ptr [ebp-$08] //st(0)[ebp-$08](X)
 wait                      //ͬFPUCPUֹͣCPUУֱFPUɵǰ   
 fld   qword ptr [ebp-$08] //X - > st(0)
}
For those that is used to HP calculators floating point code is very easy to understand. The first line,
fld   qword ptr [A], loads the constant A onto the floating-point register stack. The line,
fmul  qword ptr [ebp+$08], multiplies A with X. This makes sense by watching the Pascal code,
but what means "qword ptr [ebp+$08]". qword ptr says "pointer to a quad word,
which is the size of a double. (64 bit).
Щָ⡣
һ fld   qword ptr [A]װس A Ĵջ
 fmul  qword ptr [ebp+$08] X  AЩͨ pascal ֱ京,
 "qword ptr [ebp+$08]" ʾʲôء
qword ptr ʾָһĸֵָ룬һĴС(64 λ)

The value of the pointer is between the brackets in [ebp+$08].
ebp is the base pointer and $08 is - well just 8.
Because the stack grows down the memory location 8 bytes above the base pointer is in the previous stack frame.
Here X was placed by the function, which called our function.
The register calling convention decides this placement of a double variable.
ֵֵָ֮ ebp+$8
ebp ǻַƫ 8ΪһջУջӻַ 8 ֽڡ
һú X 
ĴԼһ Double λá

A double variable does not fit into the 32 bit integer registers,
but it fits perfectly onto the floating-point registers. Borland decided to pass double variables via the stack,
but passing them in floating point registers would have been more efficient.
The next 3 lines need no further explanation, but the line, faddp st(1), needs some.
All floating-point instructions starts with an f. add is addition. st(1) is floating point register 1,
which is the second because st(0) is the first!
һ 32 λͼĴװ, ÿװ븡Ĵ
Borland ͨջ Double ʵͨĴӦøЧʡ
вҪĽֻͣһ faddp st(1) Ҫ˵һ¡
еĸָһ f ͷ add ʾӷ st(1)  1 Ÿǵڶ
ĴΪ st(0) ǵһ

The floating point registers are combined into a stack and instructions implicitly works on the top of the stack, 
which is st(0).faddp st(1) is the same as faddp st(0), st(1) and it adds register st(0) to register st(1)
and place the result in st(1).
The p in faddp means pop st(0) of the stack. This way the result ends up in st(0).
The line  fadd  qword ptr [C] completes the calculations and the only thing left is to place the result in st(0).
It is actually already there and the two lines
ĴΪһջָջ st(0) в
faddp st(1)  faddp st(1), st(0) ͬĴ st(0) ӵĴ st(1) st(1)
fddp  p ʾ st(0) ջʹ st(1) Ƶ st(0) С
fadd  qword ptr [C] ɼ㣬 st(0)
ΪѾ st(0) ˣģ

fstp  qword ptr [ebp-$08]
fld   qword ptr [ebp-$08]

are redundant.
They just copy the result into the stack frame and loads it back in again.
Such a waste of precious time and energy :-).
The line wait makes sure that any exceptions that might have been raised by one of the floating-point
instructions are checked. See Intel SW Developers Manual Volume 2 page 822 for the full explanation.
ǽ st(0) ƵջȻٴӶջװ st(0)
Ȼ˷ʱ; :)
wait ָ鸡ָǷ⡣
Բο Intel ֲ 2  822 ҳϸ͡

Then there are only three lines of asm back to explain.
зشҪһ£
 {
 pop   ecx
 pop   ecx
 pop   ebp
 }
end;
These are removing the stack frame, by restoring the values of esp and ebp back to the values they
had when the function was entered. This code is much more intuitive and does the same thing
add esp, 4
pop ebp
it is also more effective and I do not know why the compiler is incrementing the stack pointer
in this cumbersome way. Remember that ecx can be used for free and assigning values to it is just
like pouring them into a waste bucket.

ǵջָ뺯ʱ espebp ֵıʾ
add esp, 4
pop ebp
ҲǸЧʵġ
Ҳ֪Ϊʲô鷳ķӶջָ롣
ǵ ecx Աʹú͸ƣpop ecx ݷŵͰ

Now we only need to investigate what is hiding behind the [A] in the line fld qword ptr [A].
We know that A must be a pointer to the place where A is placed in memory.
The address of A is coded in the instruction. This is the full line from the cpu view.
00451E40 DD05803C4500     fld qword ptr [B]
00451E40 is the address of this instruction in the exe file.
DD05803C4500 is the machine code for the line and fld qword ptr [B] is the more human readable format of it.
By consulting the Intel SW Developers Manual Volume 2 on page 280 we see that the opcode for fld is D9, DD,
DB or D9C0 depending on the type of data it should load. We recognize DD that is the opcode for fld double.
What is left is 05803C4500. 05 is (Somebody help me ! ). 803C4500 is the 32-bit address of A.
ֻҪо fld qword ptr [A]  [A] Ķ
֪ A ϶һָ A ڴλõָ롣
A ĵַָб롣cpu view е£
00451E40 DD05803C4500     fld qword ptr [B]
00451E40  exe ָĵַ
DD05803C4500 ǻ룬fld qword ptr [B] ҲǿԶĸʽ
ͨ鿴 Intel ֲ 2 ĵ 280 ҳ֪ fld Ĳ D9,DD,DB  D9C0
װ͡϶ 05803C4500 ߵ DD ʾ fld 
05 ʲôأ(˭ܸ!)803C4500  A  32 λַ

Let us convert the function into a pure BASM function now that we have finished analyzing it.
Ѿ˷ڽתΪ BASM ģ
function SecondOrderPolynomial3(X : Double) : Double;
const
 A : Double = 1;
 B : Double = 2;
 C : Double = 3;
asm
 push  ebp
 mov   ebp,esp
 add   esp,-$08
 //Result := A*X*X + B*X + C;
 fld   qword ptr [A]
 fmul  qword ptr [ebp+$08]
 fmul  qword ptr [ebp+$08]
 fld   qword ptr [B]
 fmul  qword ptr [ebp+$08]
 faddp //st(1)
 fadd  qword ptr [C]
 fstp  qword ptr [ebp-$08]
 wait
 fld   qword ptr [ebp-$08]
 pop   ecx
 pop   ecx
 pop   ebp
end;
Now come a few surprises. First the function will not compile. faddp st(1) is not
recognized as a valid combination of opcode and operands. By again consulting the Intel manual we 
learn that faddp comes in one version only. It operates on st(0), st(1) and it is not necessary to write faddp st(0),
st(1) and the short form faddp is the only valid one. We comment out st(1) and it compiles now.
ڷһЩ⡣
ȣܱ롣faddp st(1) ǺϷĲͲϡ
ٴβο Intel ֲᣬ֪ faddp ֻһ汾
 st(0), st(1) ҲҪд faddp st(0), st(1)
̵ faddp ΨһϷһע͵ st(1)ڱͨˡ

Second surprise.
Calling the function with X = 2 yields the calculation Y = 2^2+2*2+3 = 11.
SecondOrderPolynomial3 returns 3!
We must open the FPU view as well as the CPU view and trace through the code and watch what is happening.
It is seen that A=1 is correctly loaded into st(0) by the 4 line,but the 5 line that should
multiplicate A by X, 1 by 2, is resulting in st(0) being a very small number,in effect 0.
This tells us that X is near zero instead of 2.
Two things can be wrong. The calling code is transferring a wrong value of X or we are addressing X incorrectly.
By comparing the calling code when calling function SecondOrderPolynomial3 and
SecondOrderPolynomial1 we see that it is the same and this is not the bug.
It would also be quite surprising if Delphi were suddenly getting this wrong!
Try to step through the calling code while watching the memory pane in the CPU view.
The little green arrow is the position of the stack pointer.

һǸβ X =2Ӧ Y = 2^2+2*2+3 = 11 SecondOrderPolynomial3 Ȼ 3
Ǳ FPU ٴһ۲췢ʲô
ǿн A=1 װ st(0) ȷ, ǵ 5  A *X  1*2
Ľ st(0) бһǳСӽ 0  X ǽӽ 2
طܳô봫һ X ֵѰַ X 
ͨȽ SecondOrderPolynomial3  SecondOrderPolynomial1 ĵô룬ǿͬģ
ûκδ Delphi Ҳ൱˾ȡ
 CPU view иٴ룬۲ڴ塣ǸСɫͷջλá 
   
The calling code looks like this
ô뿴
push dword ptr [ebp-$0c]
push dword ptr [ebp-$10]

call SecondOrderPolynomial1 two pointers are pushed onto the stack. One of them is a pointer to X.
I do not what the other one is.
By watching the memory pane we see that the first one is the pointer to X and the second one is a nil pointer.
When we trace into the function we see that the first two lines are repeated.
The compiler automatically inserted the push ebp and mov ebp, esp lines. Because push decrements the stack pointer by 4,
  the reference to X went wrong. These two first lines are removed and everything is ok again.
Now we have finished analyzing the code and know what it does, we can begin optimizing it.
Let us first change the two fstp/fld lines that we already have seen is redundant.

 SecondOrderPolynomial1 ʱָָݱѹջ
еһָ XҲ֪һʲô
ͨ۲ڴ洰Ƿֵһָ X ָ룬ڶһ nil ָ롣
ǸٽʱǿǰظġԶĲ push ebp  mov ebp, esp
Ϊ push ʹջ 4 X ʱ˴ɾˣ
ڣȫ˴룬֪ЩʲôǿʼŻ
ȸѾֵ fstp  fldС

function SecondOrderPolynomial4(X : Double) : Double;
const
 A : Double = 1;
 B : Double = 2;
 C : Double = 3;
asm
 //push  ebp
 //mov   ebp,esp
 add   esp,-$08
 //Result := A*X*X + B*X + C;
 fld   qword ptr [A]
 fmul  qword ptr [ebp+$08]
 fmul  qword ptr [ebp+$08]
 fld   qword ptr [B]
 fmul  qword ptr [ebp+$08]
 faddp //st(1)
 fadd  qword ptr [C]
 //fstp  qword ptr [ebp-$08]
 wait
 //fld   qword ptr [ebp-$08]
 pop   ecx
 pop   ecx
 pop   ebp
end;
This was the only reference to the stack frame, which is not needed now.
ջصҲҪ
function SecondOrderPolynomial5(X : Double) : Double;
const
 A : Double = 1;
 B : Double = 2;
 C : Double = 3;
asm
 //push  ebp
 //mov   ebp,esp
 //add   esp,-$08
 //Result := A*X*X + B*X + C;
 fld   qword ptr [A]
 fmul  qword ptr [ebp+$08]
 fmul  qword ptr [ebp+$08]
 fld   qword ptr [B]
 fmul  qword ptr [ebp+$08]
 faddp //st(1)
 fadd  qword ptr [C]

 wait

 //pop   ecx
 //pop   ecx
 //pop   ebp
end;
That removed another 6 lines and reduces the function to this.
ɾ 6 УΪ
function SecondOrderPolynomial6(X : Double) : Double;
const
 A : Double = 1;
 B : Double = 2;
 C : Double = 3;
asm
 //Result := A*X*X + B*X + C;
 fld   qword ptr [A]
 fmul  qword ptr [ebp+$08]
 fmul  qword ptr [ebp+$08]
 fld   qword ptr [B]
 fmul  qword ptr [ebp+$08]
 faddp 
 fadd  qword ptr [C]
 wait
end;
X is loaded from memory into the FPU 3 times. It would be more effective to load it once and then reuse it.
X ڴװ FPU 3ΡʵװһθЧ޸£
function SecondOrderPolynomial7(X : Double) : Double;
const
 A : Double = 1;
 B : Double = 2;
 C : Double = 3;
asm
 //Result := A*X*X + B*X + C;
 fld   qword ptr [ebp+$08]
 fld   qword ptr [A]
 fmul  st(0), st(1)
 fmul  st(0), st(1)
 fld   qword ptr [B]
 fmul  st(0), st(2)
 ffree st(2)
 faddp
 fadd  qword ptr [C]
 wait
end;
I magically came up with this code. The first line loads X. The second line loads A.
The third line multiplies A with X. The 4. line multiplies a*X know in st(0) with X.
Then we have calculated the first term. Loading B and multiplication it with X does calculating the second term.
This was the last time we needed X in we free the register, st(2), holding it.
Now adding term 1 and 2 and popping term 2 of the stack. The only thing left to do is adding C.
The result is now in st(0) and the other registers are empty.
Then we check for exceptions with wait and are done.
It is seen that no redundant work is done and this implementation is near optimal.

ħĴЩ롣һת Xڶװ Aִ A * X
 4 н A*X Ľ st(0) ٳ X
ѾһΡװ B ҳ B ڶΡ
һʹ XͷżĴ st(2)
ڽ 1 κ͵ 2 ӣ 2 ΡʣµǼ C
 st(0) УĴǿյġ
Ȼ wait ⡣
ڴ뿴ûˣͬʱеҲܺá

There exits seven instructions for loading often used constants into the FPU.
One of these constants is 1, which can be loaded with the instruction fld1.
Using it saves one load from memory, which can be costly in terms of clock cycles if data are not properly aligned.
7 ָװس FPUЩ֮һ 1ָܱ fld1 װ롣
ʹʡ˴ڴתװһΡݲȫ룬ԭָķѲʱڡ

function SecondOrderPolynomial8(X : Double) : Double;
const
 //A : Double = 1;
 B : Double = 2;
 C : Double = 3;
asm
 //Result := A*X*X + B*X + C;
 fld   qword ptr [ebp+$08]
 //fld   qword ptr [A]
 fld1
 fmul  st(0), st(1)
 fmul  st(0), st(1)
 fld   qword ptr [B]
 fmul  st(0), st(2)
 ffree st(2)
 faddp
 fadd  qword ptr [C]
 wait
end;
This ended the second lesson. Stay tuned for more.

ڶνˡ΢Ϣһ£滹С