CnPack Forum » 技术板块灌水区 » <<BASM 初学者入门>> 第 2 课

2007-4-23 09:54 skyjacker
<<BASM 初学者入门>> 第 2 课

Lesson 2
<<BASM 初学者入门>> 第 2 课

http://www.cnpack.org
QQ Group: 130970

This is the second chapter of the introduction to BASM programming with Delphi.
The first chapter was a short introduction to integer code and this second one is about floating point code.
Our example functions can evaluate a second order polynomial. The parameters A, B and C that defines
the polynomial is coded as local constants. Input to the function is the variable X of type double and
the result is also of type double. The function looks like this.

function SecondOrderPolynomial1(X : Double) : Double;
const
A : Double = 1;
B : Double = 2;
C : Double = 3;
begin
Result := A*X*X + B*X + C;
end;
Copying the assembler code from the CPU view gives us this.

function SecondOrderPolynomial2(X : Double) : Double;
const
A : Double = 1;
B : Double = 2;
C : Double = 3;

begin
{
push  ebp
mov   ebp,esp
add   esp,-\$08
}
Result := A*X*X + B*X + C;
{
fld   qword ptr [A]
fmul  qword ptr [ebp+\$08]
fmul  qword ptr [ebp+\$08]
fld   qword ptr [B]
fmul  qword ptr [ebp+\$08]
faddp st(1)
fadd  qword ptr [C]
fstp  qword ptr [ebp-\$08]
wait
fld   qword ptr [ebp-\$08]
}
{
pop   ecx
pop   ecx
pop   ebp
}
end;
Lets explain the asm code line by line.

The begin results in this code.
begin 指令产生下面的代码
begin
{
push  ebp
mov   ebp,esp
add   esp,-\$08
}

which sets up a stack frame for the function.
A stack frame is just a piece of memory that is reserved for the stack.
A stack frame is accessed through two pointers, the base pointer and the stack pointer.
The base pointer is in ebp and the stack pointer is in esp. These two registers are reserved for use by these
pointers only.
The line push ebp backsup the base pointer. The line mov ebp, esp sets up a new base pointer,
which is pointing to the top of the stack. The line add esp, -\$08 moves the stack pointer 8 bytes down.
As a curiosity the stack grows downward and the last line could more intuitively have been sub esp,8.
The new stack frame that was created by these three lines is standing on top of, or actually hanging under,
the last stack frame,  which was probably allocated by the function that called our SecondOrderPolynomial function.

Push esb 保存ebp指针，
Mov ebp, esp 将基址指针指向当前的栈顶,
add esp, -\$08 将栈顶指针往下移8个字节。

The next line of Pascal was compiled into no less than 9 lines of ASM.

Result := A*X*X + B*X + C;
{
fld   qword ptr [A]       // 将A 装入 st(0)
fmul  qword ptr [ebp+\$08] //st(0) * X ->st(0)
fmul  qword ptr [ebp+\$08] //st(0) * X ->st(0) 完成 A * X * X
fld   qword ptr [B]         //st(0)不为空，因此执行一次入栈操作：将st(0)->st(1)
//    B -> st(0)。St(0) 总是为栈顶。
fmul  qword ptr [ebp+\$08] //st(0) * X ->st(0) = B * X
faddp st(1)               //st(1) + st(0) -> st(1)，执行出栈，因此st(1)->st(0)
fadd  qword ptr [C]       //st(0) + C -> st(0)
fstp  qword ptr [ebp-\$08] //结果st(0)存入[ebp-\$08](X)。
wait                      //同步FPU与CPU：停止CPU的运行，直到FPU完成当前操作码
fld   qword ptr [ebp-\$08] //X - > st(0)
}
For those that is used to HP calculators floating point code is very easy to understand. The first line,
fld   qword ptr [A], loads the constant A onto the floating-point register stack. The line,
fmul  qword ptr [ebp+\$08], multiplies A with X. This makes sense by watching the Pascal code,
but what means "qword ptr [ebp+\$08]". qword ptr says "pointer to a quad word,
which is the size of a double. (64 bit).

qword ptr 表示指向一个四个字的指针，它是一个浮点数的大小(64 位)。

The value of the pointer is between the brackets in [ebp+\$08].
ebp is the base pointer and \$08 is - well just 8.
Because the stack grows down the memory location 8 bytes above the base pointer is in the previous stack frame.
Here X was placed by the function, which called our function.
The register calling convention decides this placement of a double variable.

ebp 是基址，偏移量是 8。因为在上一个堆栈中，堆栈从基址向下增长了 8 个字节。

A double variable does not fit into the 32 bit integer registers,
but it fits perfectly onto the floating-point registers. Borland decided to pass double variables via the stack,
but passing them in floating point registers would have been more efficient.
The next 3 lines need no further explanation, but the line, faddp st(1), needs some.
All floating-point instructions starts with an f. add is addition. st(1) is floating point register 1,
which is the second because st(0) is the first!

Borland 通过堆栈传输 Double 变量，其实通过浮点数寄存器传输它们应该更有效率。

The floating point registers are combined into a stack and instructions implicitly works on the top of the stack,
which is st(0).faddp st(1) is the same as faddp st(0), st(1) and it adds register st(0) to register st(1)
and place the result in st(1).
The p in faddp means pop st(0) of the stack. This way the result ends up in st(0).
The line  fadd  qword ptr [C] completes the calculations and the only thing left is to place the result in st(0).
It is actually already there and the two lines

faddp st(1) 与 faddp st(1), st(0) 相同，它将寄存器 st(0) 加到寄存器 st(1)，结果在 st(1)。
fddp 的 p 表示将 st(0) 弹出堆栈。这样结果就从 st(1) 移到了 st(0) 中。
fadd  qword ptr [C] 完成计算，并将结果存入 st(0)。

fstp  qword ptr [ebp-\$08]
fld   qword ptr [ebp-\$08]

are redundant.
They just copy the result into the stack frame and loads it back in again.
Such a waste of precious time and energy :-).
The line wait makes sure that any exceptions that might have been raised by one of the floating-point
instructions are checked. See Intel SW Developers Manual Volume 2 page 822 for the full explanation.

wait 指令检查浮点数指令是否产生的意外。

Then there are only three lines of asm back to explain.

{
pop   ecx
pop   ecx
pop   ebp
}
end;
These are removing the stack frame, by restoring the values of esp and ebp back to the values they
had when the function was entered. This code is much more intuitive and does the same thing
add esp, 4
pop ebp
it is also more effective and I do not know why the compiler is incrementing the stack pointer
in this cumbersome way. Remember that ecx can be used for free and assigning values to it is just
like pouring them into a waste bucket.

add esp, 4
pop ebp

Now we only need to investigate what is hiding behind the [A] in the line fld qword ptr [A].
We know that A must be a pointer to the place where A is placed in memory.
The address of A is coded in the instruction. This is the full line from the cpu view.
00451E40 DD05803C4500     fld qword ptr [B]
00451E40 is the address of this instruction in the exe file.
DD05803C4500 is the machine code for the line and fld qword ptr [B] is the more human readable format of it.
By consulting the Intel SW Developers Manual Volume 2 on page 280 we see that the opcode for fld is D9, DD,
DB or D9C0 depending on the type of data it should load. We recognize DD that is the opcode for fld double.
What is left is 05803C4500. 05 is (Somebody help me ! ). 803C4500 is the 32-bit address of A.

A 的地址在指令中被编码。cpu view 中的完整代码如下：
00451E40 DD05803C4500     fld qword ptr [B]
00451E40 是 exe 中指令的地址。
DD05803C4500 是机器代码，fld qword ptr [B] 也是我们可以读懂的格式。

05 是什么呢？(谁能告诉我!)。803C4500 是 A 的 32 位地址。

Let us convert the function into a pure BASM function now that we have finished analyzing it.

function SecondOrderPolynomial3(X : Double) : Double;
const
A : Double = 1;
B : Double = 2;
C : Double = 3;
asm
push  ebp
mov   ebp,esp
add   esp,-\$08
//Result := A*X*X + B*X + C;
fld   qword ptr [A]
fmul  qword ptr [ebp+\$08]
fmul  qword ptr [ebp+\$08]
fld   qword ptr [B]
fmul  qword ptr [ebp+\$08]
faddp //st(1)
fadd  qword ptr [C]
fstp  qword ptr [ebp-\$08]
wait
fld   qword ptr [ebp-\$08]
pop   ecx
pop   ecx
pop   ebp
end;
Now come a few surprises. First the function will not compile. faddp st(1) is not
recognized as a valid combination of opcode and operands. By again consulting the Intel manual we
learn that faddp comes in one version only. It operates on st(0), st(1) and it is not necessary to write faddp st(0),
st(1) and the short form faddp is the only valid one. We comment out st(1) and it compiles now.

Second surprise.
Calling the function with X = 2 yields the calculation Y = 2^2+2*2+3 = 11.
SecondOrderPolynomial3 returns 3!
We must open the FPU view as well as the CPU view and trace through the code and watch what is happening.
It is seen that A=1 is correctly loaded into st(0) by the 4 line,but the 5 line that should
multiplicate A by X, 1 by 2, is resulting in st(0) being a very small number,in effect 0.
This tells us that X is near zero instead of 2.
Two things can be wrong. The calling code is transferring a wrong value of X or we are addressing X incorrectly.
By comparing the calling code when calling function SecondOrderPolynomial3 and
SecondOrderPolynomial1 we see that it is the same and this is not the bug.
It would also be quite surprising if Delphi were suddenly getting this wrong!
Try to step through the calling code while watching the memory pane in the CPU view.
The little green arrow is the position of the stack pointer.

The calling code looks like this

push dword ptr [ebp-\$0c]
push dword ptr [ebp-\$10]

call SecondOrderPolynomial1 two pointers are pushed onto the stack. One of them is a pointer to X.
I do not what the other one is.
By watching the memory pane we see that the first one is the pointer to X and the second one is a nil pointer.
When we trace into the function we see that the first two lines are repeated.
The compiler automatically inserted the push ebp and mov ebp, esp lines. Because push decrements the stack pointer by 4,
the reference to X went wrong. These two first lines are removed and everything is ok again.
Now we have finished analyzing the code and know what it does, we can begin optimizing it.
Let us first change the two fstp/fld lines that we already have seen is redundant.

function SecondOrderPolynomial4(X : Double) : Double;
const
A : Double = 1;
B : Double = 2;
C : Double = 3;
asm
//push  ebp
//mov   ebp,esp
add   esp,-\$08
//Result := A*X*X + B*X + C;
fld   qword ptr [A]
fmul  qword ptr [ebp+\$08]
fmul  qword ptr [ebp+\$08]
fld   qword ptr [B]
fmul  qword ptr [ebp+\$08]
faddp //st(1)
fadd  qword ptr [C]
//fstp  qword ptr [ebp-\$08]
wait
//fld   qword ptr [ebp-\$08]
pop   ecx
pop   ecx
pop   ebp
end;
This was the only reference to the stack frame, which is not needed now.

function SecondOrderPolynomial5(X : Double) : Double;
const
A : Double = 1;
B : Double = 2;
C : Double = 3;
asm
//push  ebp
//mov   ebp,esp
//add   esp,-\$08
//Result := A*X*X + B*X + C;
fld   qword ptr [A]
fmul  qword ptr [ebp+\$08]
fmul  qword ptr [ebp+\$08]
fld   qword ptr [B]
fmul  qword ptr [ebp+\$08]
faddp //st(1)
fadd  qword ptr [C]

wait

//pop   ecx
//pop   ecx
//pop   ebp
end;
That removed another 6 lines and reduces the function to this.

function SecondOrderPolynomial6(X : Double) : Double;
const
A : Double = 1;
B : Double = 2;
C : Double = 3;
asm
//Result := A*X*X + B*X + C;
fld   qword ptr [A]
fmul  qword ptr [ebp+\$08]
fmul  qword ptr [ebp+\$08]
fld   qword ptr [B]
fmul  qword ptr [ebp+\$08]
faddp
fadd  qword ptr [C]
wait
end;
X is loaded from memory into the FPU 3 times. It would be more effective to load it once and then reuse it.
X 从内存装入 FPU 3次。其实装入一次更有效，修改如下：
function SecondOrderPolynomial7(X : Double) : Double;
const
A : Double = 1;
B : Double = 2;
C : Double = 3;
asm
//Result := A*X*X + B*X + C;
fld   qword ptr [ebp+\$08]
fld   qword ptr [A]
fmul  st(0), st(1)
fmul  st(0), st(1)
fld   qword ptr [B]
fmul  st(0), st(2)
ffree st(2)
faddp
fadd  qword ptr [C]
wait
end;
I magically came up with this code. The first line loads X. The second line loads A.
The third line multiplies A with X. The 4. line multiplies a*X know in st(0) with X.
Then we have calculated the first term. Loading B and multiplication it with X does calculating the second term.
This was the last time we needed X in we free the register, st(2), holding it.
Now adding term 1 and 2 and popping term 2 of the stack. The only thing left to do is adding C.
The result is now in st(0) and the other registers are empty.
Then we check for exceptions with wait and are done.
It is seen that no redundant work is done and this implementation is near optimal.

There exits seven instructions for loading often used constants into the FPU.
One of these constants is 1, which can be loaded with the instruction fld1.
Using it saves one load from memory, which can be costly in terms of clock cycles if data are not properly aligned.

function SecondOrderPolynomial8(X : Double) : Double;
const
//A : Double = 1;
B : Double = 2;
C : Double = 3;
asm
//Result := A*X*X + B*X + C;
fld   qword ptr [ebp+\$08]
//fld   qword ptr [A]
fld1
fmul  st(0), st(1)
fmul  st(0), st(1)
fld   qword ptr [B]
fmul  st(0), st(2)
ffree st(2)
faddp
fadd  qword ptr [C]
wait
end;
This ended the second lesson. Stay tuned for more.

[[i] 本帖最后由 skyjacker 于 2007-4-23 21:24 编辑 [/i]]

2007-4-23 11:05 Passion

2007-4-23 11:32 skyjacker

2007-4-23 11:59 Passion

2007-4-23 21:27 zzzl

2007-4-23 21:43 Passion

2007-4-23 22:35 skyjacker

Powered by Discuz! Archiver 5.0.0  © 2001-2006 Comsenz Inc.