

2007-4-23 09:54 skyjacker
<<BASM for Beginners>> Lesson 2


http://www.cnpack.org
QQ Group: 130970
Translator: SkyJacker
Version: draft
Status: proofreading in progress
Date: 2007.04.23

This is the second chapter of the introduction to BASM programming with Delphi.
The first chapter was a short introduction to integer code; this second one is about floating-point code.
Our example function evaluates a second-order polynomial. The coefficients A, B and C that define
the polynomial are coded as local constants. The input to the function is the variable X of type Double, and
the result is also of type Double. The function looks like this.

function SecondOrderPolynomial1(X : Double) : Double;
const
A : Double = 1;
B : Double = 2;
C : Double = 3;
begin
Result := A*X*X + B*X + C;
end;
Copying the assembler code from the CPU view gives us this.
function SecondOrderPolynomial2(X : Double) : Double;
const
A : Double = 1;
B : Double = 2;
C : Double = 3;

begin
{
push  ebp
mov   ebp,esp
add   esp,-$08
}
Result := A*X*X + B*X + C;
{
fld   qword ptr [A]
fmul  qword ptr [ebp+$08]
fmul  qword ptr [ebp+$08]
fld   qword ptr [B]
fmul  qword ptr [ebp+$08]
faddp st(1)
fadd  qword ptr [C]
fstp  qword ptr [ebp-$08]
wait
fld   qword ptr [ebp-$08]
}
{
pop   ecx
pop   ecx
pop   ebp
}
end;
Let's explain the asm code line by line.

The begin results in this code.
begin
{
push  ebp
mov   ebp,esp
add   esp,-$08
}

which sets up a stack frame for the function.
A stack frame is just the piece of the stack that is reserved for one call of a function.
A stack frame is accessed through two pointers, the base pointer and the stack pointer.
The base pointer is in ebp and the stack pointer is in esp. These two registers are reserved for use by these
pointers only.
The line push ebp backs up the caller's base pointer. The line mov ebp, esp sets up a new base pointer,
which points to the top of the stack. The line add esp, -$08 moves the stack pointer 8 bytes down.
As a curiosity, the stack grows downward, and the last line could more intuitively have been written sub esp, 8.
The new stack frame created by these three lines stands on top of, or actually hangs under,
the previous stack frame, which was probably allocated by the function that called our SecondOrderPolynomial function.

The next line of Pascal was compiled into no fewer than 10 lines of ASM.

Result := A*X*X + B*X + C;
{
fld   qword ptr [A]       // load A into st(0)
fmul  qword ptr [ebp+$08] // st(0) * X -> st(0)
fmul  qword ptr [ebp+$08] // st(0) * X -> st(0), which completes A*X*X
fld   qword ptr [B]       // st(0) is in use, so the load pushes: the old st(0) becomes st(1)
                          // and B goes into st(0); st(0) is always the top of the stack
fmul  qword ptr [ebp+$08] // st(0) * X -> st(0) = B*X
faddp st(1)               // st(1) + st(0) -> st(1), then pop, so st(1) becomes st(0)
fadd  qword ptr [C]       // st(0) + C -> st(0)
fstp  qword ptr [ebp-$08] // store st(0) in the local at [ebp-$08] and pop it
wait                      // synchronize CPU and FPU: the CPU waits until pending FPU exceptions are handled
fld   qword ptr [ebp-$08] // load the stored result from [ebp-$08] back into st(0)
}
For those who are used to HP calculators, floating-point code is very easy to understand. The first line,
fld   qword ptr [A], loads the constant A onto the floating-point register stack. The line
fmul  qword ptr [ebp+$08] multiplies A by X. This makes sense when looking at the Pascal code,
but what does "qword ptr [ebp+$08]" mean? qword ptr says "pointer to a quad word",
which is the size of a Double (64 bits).

The value of the pointer is between the brackets in [ebp+$08].
ebp is the base pointer and $08 is, well, just 8.
Because the stack grows down, the memory location 8 bytes above the base pointer is in the previous stack frame.
The 8 bytes in between hold the saved ebp at [ebp] and the return address at [ebp+4].
Here X was placed by the function that called our function.
The register calling convention decides this placement of a Double variable.
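
The layout described above can be replayed in plain Pascal. The following is only a sketch: the stack is modeled as an array, and StackTop, ReturnAddress, the caller's ebp value and the helper names are made-up values for illustration, not something taken from the CPU view. It assumes a little-endian machine such as x86. It performs the caller's two pushes, the call and the three prologue lines, and then reads the 8 bytes at [ebp+8].

```pascal
program FrameLayoutSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

const
  StackTop = $1000;            // made-up initial value of esp
  ReturnAddress = $00451E90;   // made-up return address

var
  Mem: array[0..StackTop div 4 - 1] of Cardinal; // simulated stack, one dword per slot
  Esp, Ebp: Cardinal;

procedure Push(Value: Cardinal);
begin
  Dec(Esp, 4);                 // the stack grows downward
  Mem[Esp div 4] := Value;
end;

// Run the caller's two pushes, the call and the prologue, then read X at [ebp+8].
function ReadXFromFrame(X: Double): Double;
var
  Halves: array[0..1] of Cardinal;  // [0] = lower half, [1] = upper half (little-endian)
begin
  Esp := StackTop;
  Ebp := $0FF0;                     // made-up base pointer of the caller
  Move(X, Halves, SizeOf(X));
  Push(Halves[1]);                  // caller: push upper half of X
  Push(Halves[0]);                  // caller: push lower half of X
  Push(ReturnAddress);              // call: pushes the return address
  Push(Ebp);                        // push ebp
  Ebp := Esp;                       // mov  ebp, esp
  Dec(Esp, 8);                      // add  esp, -$08
  // Now [ebp] = saved ebp, [ebp+4] = return address, [ebp+8] = X.
  Halves[0] := Mem[(Ebp + 8) div 4];
  Halves[1] := Mem[(Ebp + 12) div 4];
  Move(Halves, Result, SizeOf(Result));
end;

begin
  WriteLn(ReadXFromFrame(2.0):0:1);   // the value found at [ebp+8]
end.
```

Run with X = 2, the sketch finds 2.0 again at [ebp+8], which is the offset the compiled code uses.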

A Double variable does not fit into a 32-bit integer register,
but it fits perfectly into a floating-point register. Borland decided to pass Double variables via the stack,
but passing them in floating-point registers would have been more efficient.
The next 3 lines need no further explanation, but the line faddp st(1) needs some.
All floating-point instructions start with an f. add is addition. st(1) is floating-point register 1,
which is the second one because st(0) is the first!

The floating-point registers are combined into a stack, and instructions implicitly work on the top of the stack,
which is st(0). faddp st(1) is the same as faddp st(1), st(0): it adds register st(0) to register st(1)
and places the result in st(1).
The p in faddp means pop st(0) off the stack. This way the result ends up in st(0).
The line fadd  qword ptr [C] completes the calculation, and the only thing left is to place the result in st(0).
It is actually already there, and the two lines

fstp  qword ptr [ebp-$08]
fld   qword ptr [ebp-$08]

are redundant.
Apart from rounding the result to Double precision, they just copy it into the stack frame and load it back in again.
Such a waste of precious time and energy :-).
The line wait makes sure that any exceptions that might have been raised by one of the floating-point
instructions are checked. See Intel SW Developers Manual Volume 2 page 822 for the full explanation.
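
To follow the register stack step by step, the instruction sequence can be mimicked with a small array in Pascal. This is a simplified model and not how the FPU is actually built: the procedure names Fld, FmulMem, FaddMem and Faddp are invented for this sketch, only Double precision is used, and the fstp/wait/fld lines are left out.

```pascal
program FpuStackSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

const
  A: Double = 1;
  B: Double = 2;
  C: Double = 3;

var
  St: array[0..7] of Double;   // St[0] models st(0), the top of the register stack
  Depth: Integer = 0;          // number of registers in use

procedure Fld(Value: Double);  // a load pushes: every register moves one place down
var
  I: Integer;
begin
  for I := Depth downto 1 do
    St[I] := St[I - 1];
  St[0] := Value;
  Inc(Depth);
end;

procedure FmulMem(Value: Double);  // fmul qword ptr [...]: st(0) := st(0) * operand
begin
  St[0] := St[0] * Value;
end;

procedure FaddMem(Value: Double);  // fadd qword ptr [...]: st(0) := st(0) + operand
begin
  St[0] := St[0] + Value;
end;

procedure Faddp;  // st(1) := st(1) + st(0), then pop, so the sum ends up in st(0)
var
  I: Integer;
begin
  St[1] := St[1] + St[0];
  for I := 0 to Depth - 2 do
    St[I] := St[I + 1];
  Dec(Depth);
end;

function Evaluate(X: Double): Double;
begin
  Depth := 0;
  Fld(A);        // fld   qword ptr [A]
  FmulMem(X);    // fmul  qword ptr [ebp+$08]
  FmulMem(X);    // fmul  qword ptr [ebp+$08]
  Fld(B);        // fld   qword ptr [B]
  FmulMem(X);    // fmul  qword ptr [ebp+$08]
  Faddp;         // faddp st(1)
  FaddMem(C);    // fadd  qword ptr [C]
  Result := St[0];
end;

begin
  WriteLn(Evaluate(2.0):0:1);
end.
```

For X = 2 the model ends with 11.0 in St[0] and a depth of 1, which matches the statement that the result is already in st(0) before the fstp/fld pair runs.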

Then there are only three lines of asm left to explain.
{
pop   ecx
pop   ecx
pop   ebp
}
end;
These remove the stack frame by restoring esp and ebp to the values they
had when the function was entered. This code is much more intuitive and does the same thing:
add esp, 8
pop ebp
I do not know why the compiler increments the stack pointer
in this cumbersome way. A likely reason is code size: each pop ecx is a one-byte instruction, while add esp, 8 takes three bytes.
Remember that ecx can be used for free, so assigning values to it is just
like pouring them into a waste bucket.
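
That the two pops and a single add move the stack pointer by the same 8 bytes can be checked with a small simulation. This is a sketch under assumptions: the record TRegs, the constants StackTop and CallerEbp, and all function names are invented here, and the stack is again just an array.

```pascal
program EpilogueSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

const
  StackTop = $1000;   // made-up value of esp on entry
  CallerEbp = $0FF0;  // made-up value of ebp on entry

type
  TRegs = record
    Esp, Ebp, Ecx: Cardinal;
  end;

var
  Mem: array[0..StackTop div 4 - 1] of Cardinal;  // simulated stack, one dword per slot

procedure Push(var R: TRegs; Value: Cardinal);
begin
  Dec(R.Esp, 4);
  Mem[R.Esp div 4] := Value;
end;

function Pop(var R: TRegs): Cardinal;
begin
  Result := Mem[R.Esp div 4];
  Inc(R.Esp, 4);
end;

function AfterPrologue: TRegs;
begin
  Result.Esp := StackTop;
  Result.Ebp := CallerEbp;
  Result.Ecx := 0;
  Push(Result, Result.Ebp);  // push ebp
  Result.Ebp := Result.Esp;  // mov  ebp, esp
  Dec(Result.Esp, 8);        // add  esp, -$08
end;

function PopEpilogue(R: TRegs): TRegs;   // what the compiler generated
begin
  R.Ecx := Pop(R);   // pop ecx (the value is thrown away)
  R.Ecx := Pop(R);   // pop ecx
  R.Ebp := Pop(R);   // pop ebp
  Result := R;
end;

function AddEpilogue(R: TRegs): TRegs;   // the more intuitive variant
begin
  Inc(R.Esp, 8);     // add esp, 8
  R.Ebp := Pop(R);   // pop ebp
  Result := R;
end;

function EpiloguesAgree: Boolean;
var
  A, B: TRegs;
begin
  A := PopEpilogue(AfterPrologue);
  B := AddEpilogue(AfterPrologue);
  Result := (A.Esp = B.Esp) and (A.Ebp = B.Ebp)
    and (A.Esp = StackTop) and (A.Ebp = CallerEbp);
end;

begin
  if EpiloguesAgree then
    WriteLn('both epilogues restore esp and ebp')
  else
    WriteLn('the epilogues differ');
end.
```

The only difference between the two variants in this model is the garbage left in ecx, which nobody reads.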

Now we only need to investigate what is hiding behind the [A] in the line fld qword ptr [A].
We know that A must be a pointer to the place where A is stored in memory.
The address of the constant is coded in the instruction. This is a full line from the CPU view.
00451E40 DD05803C4500     fld qword ptr [B]
00451E40 is the address of this instruction in the running program.
DD05803C4500 is the machine code for the line, and fld qword ptr [B] is the more human-readable form of it.
By consulting the Intel SW Developers Manual Volume 2 on page 280 we see that the opcode for fld is D9, DD,
DB or D9C0 depending on the type of data it should load. We recognize DD, which is the opcode for fld of a Double.
What is left is 05803C4500. 05 is the ModR/M byte: its mod field (00) and r/m field (101) say that the operand
is a memory location given by a plain 32-bit address, and its middle field (000) selects fld among the
instructions that share the DD opcode. 803C4500 is that 32-bit address, stored low byte first, i.e. $00453C80.
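
The byte 05 that follows the opcode is the ModR/M byte of the instruction encoding, and it can be taken apart with a few shifts. The sketch below does that for the six bytes shown in the CPU view line; the function names are invented for this illustration.

```pascal
program DecodeFldSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

const
  // The six machine code bytes from the CPU view line.
  Code: array[0..5] of Byte = ($DD, $05, $80, $3C, $45, $00);

function ModField(ModRM: Byte): Byte;
begin
  Result := ModRM shr 6;             // top two bits
end;

function RegField(ModRM: Byte): Byte;
begin
  Result := (ModRM shr 3) and 7;     // middle three bits
end;

function RMField(ModRM: Byte): Byte;
begin
  Result := ModRM and 7;             // bottom three bits
end;

// Assemble a 32-bit value from four bytes stored low byte first.
function Disp32(const Bytes: array of Byte; Offset: Integer): Cardinal;
begin
  Result := Cardinal(Bytes[Offset])
    or (Cardinal(Bytes[Offset + 1]) shl 8)
    or (Cardinal(Bytes[Offset + 2]) shl 16)
    or (Cardinal(Bytes[Offset + 3]) shl 24);
end;

begin
  WriteLn('opcode  = $', IntToHex(Integer(Code[0]), 2));
  WriteLn('mod     = ', ModField(Code[1]));
  WriteLn('reg     = ', RegField(Code[1]));
  WriteLn('r/m     = ', RMField(Code[1]));
  WriteLn('address = $', IntToHex(Int64(Disp32(Code, 2)), 8));
end.
```

With mod = 0 and r/m = 5 the operand is a bare 32-bit address, and reg = 0 is the opcode extension that picks fld among the DD instructions. The address bytes 80 3C 45 00 read back as $00453C80.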

Now that we have finished analyzing the function, let us convert it into a pure BASM function.
function SecondOrderPolynomial3(X : Double) : Double;
const
A : Double = 1;
B : Double = 2;
C : Double = 3;
asm
push  ebp
mov   ebp,esp
add   esp,-$08
//Result := A*X*X + B*X + C;
fld   qword ptr [A]
fmul  qword ptr [ebp+$08]
fmul  qword ptr [ebp+$08]
fld   qword ptr [B]
fmul  qword ptr [ebp+$08]
faddp //st(1)
fadd  qword ptr [C]
fstp  qword ptr [ebp-$08]
wait
fld   qword ptr [ebp-$08]
pop   ecx
pop   ecx
pop   ebp
end;
Now come a few surprises. First, the function will not compile. faddp st(1) is not
recognized as a valid combination of opcode and operands. By consulting the Intel manual again we
learn that faddp without operands operates on st(1) and st(0), so there is no need to write the registers out
(the manual's explicit form is faddp st(1), st(0)). We comment out st(1), and now it compiles.

Second surprise.
Calling the function with X = 2 should yield Y = 1*2^2 + 2*2 + 3 = 11.
SecondOrderPolynomial3 returns 3!
We must open the FPU view as well as the CPU view, trace through the code and watch what is happening.
We see that A = 1 is correctly loaded into st(0) by the fourth line, but the fifth line, which should
multiply A by X, 1 by 2, leaves a very small number in st(0), in effect 0.
This tells us that X is near zero instead of 2.
Two things can be wrong: the calling code is passing a wrong value of X, or we are addressing X incorrectly.
By comparing the calling code for SecondOrderPolynomial3 and
SecondOrderPolynomial1 we see that it is the same, so this is not the bug.
It would also be quite surprising if Delphi were suddenly getting this wrong!
Try to step through the calling code while watching the memory pane in the CPU view.
The little green arrow is the position of the stack pointer.
   
The calling code looks like this:
push dword ptr [ebp-$0c]
push dword ptr [ebp-$10]
call SecondOrderPolynomial1

Two 32-bit values are pushed onto the stack. They are not pointers: together they are the 8 bytes of X itself,
the upper half first and the lower half second.
By watching the memory pane we see that for X = 2 the first one is $40000000 and the second one is 0.
When we trace into the function we see that the first two lines are repeated.
The compiler automatically inserted the push ebp and mov ebp, esp lines. Because the extra push decrements the stack pointer by 4,
[ebp+$08] ends up 4 bytes below X and the reference to X went wrong.
We remove these two first lines and everything is OK again.
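
Why does the function see a value near zero rather than some random number? A Double occupies two 32-bit stack slots, and for X = 2 the upper half is $40000000 and the lower half is 0. If the extra push ebp shifts ebp by 4 bytes, [ebp+$08] covers the return address and the lower half of X instead. The sketch below reinterprets such a pair of slots as a Double. ReturnAddress is a made-up value and the helper names are hypothetical; a little-endian machine is assumed.

```pascal
program MisreadSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

const
  ReturnAddress = $00451E90;  // made-up return address, for illustration only

// Combine two 32-bit stack slots into a Double the way fld qword ptr [...]
// reads them on x86: the slot at the lower address is the lower half.
function DoubleFromHalves(LowHalf, HighHalf: Cardinal): Double;
var
  Halves: array[0..1] of Cardinal;
begin
  Halves[0] := LowHalf;
  Halves[1] := HighHalf;
  Move(Halves, Result, SizeOf(Result));
end;

function Poly(X: Double): Double;
begin
  Result := 1.0*X*X + 2.0*X + 3.0;
end;

var
  Good, Bad: Double;
begin
  Good := DoubleFromHalves($00000000, $40000000);     // the two halves of X = 2
  Bad := DoubleFromHalves(ReturnAddress, $00000000);  // return address + lower half of X
  WriteLn('correct read: ', Good:0:1, '  polynomial: ', Poly(Good):0:1);
  WriteLn('shifted read: ', Bad, '  polynomial: ', Poly(Bad):0:1);
end.
```

The correct read gives 2.0 and a polynomial value of 11.0. The shifted read has an upper half of 0, so it is a tiny denormal number whatever the return address is, and the polynomial collapses to C = 3.0, which is the value the function returned.
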
Now that we have finished analyzing the code and know what it does, we can begin optimizing it.
Let us first remove the two fstp/fld lines that we have already seen are redundant.

function SecondOrderPolynomial4(X : Double) : Double;
const
A : Double = 1;
B : Double = 2;
C : Double = 3;
asm
//push  ebp
//mov   ebp,esp
add   esp,-$08
//Result := A*X*X + B*X + C;
fld   qword ptr [A]
fmul  qword ptr [ebp+$08]
fmul  qword ptr [ebp+$08]
fld   qword ptr [B]
fmul  qword ptr [ebp+$08]
faddp //st(1)
fadd  qword ptr [C]
//fstp  qword ptr [ebp-$08]
wait
//fld   qword ptr [ebp-$08]
pop   ecx
pop   ecx
//pop   ebp  // the compiler-generated epilogue already restores ebp
end;
The fstp/fld pair was the only reference to the local variable space in the stack frame, which is therefore no longer needed.
function SecondOrderPolynomial5(X : Double) : Double;
const
A : Double = 1;
B : Double = 2;
C : Double = 3;
asm
//push  ebp
//mov   ebp,esp
//add   esp,-$08
//Result := A*X*X + B*X + C;
fld   qword ptr [A]
fmul  qword ptr [ebp+$08]
fmul  qword ptr [ebp+$08]
fld   qword ptr [B]
fmul  qword ptr [ebp+$08]
faddp //st(1)
fadd  qword ptr [C]

wait

//pop   ecx
//pop   ecx
//pop   ebp
end;
That removes another 6 lines and reduces the function to this.
function SecondOrderPolynomial6(X : Double) : Double;
const
A : Double = 1;
B : Double = 2;
C : Double = 3;
asm
//Result := A*X*X + B*X + C;
fld   qword ptr [A]
fmul  qword ptr [ebp+$08]
fmul  qword ptr [ebp+$08]
fld   qword ptr [B]
fmul  qword ptr [ebp+$08]
faddp
fadd  qword ptr [C]
wait
end;
X is loaded from memory into the FPU 3 times. It would be more efficient to load it once and then reuse it.
function SecondOrderPolynomial7(X : Double) : Double;
const
A : Double = 1;
B : Double = 2;
C : Double = 3;
asm
//Result := A*X*X + B*X + C;
fld   qword ptr [ebp+$08]
fld   qword ptr [A]
fmul  st(0), st(1)
fmul  st(0), st(1)
fld   qword ptr [B]
fmul  st(0), st(2)
ffree st(2)
faddp
fadd  qword ptr [C]
wait
end;
I magically came up with this code. The first line loads X. The second line loads A.
The third line multiplies A by X. The fourth line multiplies A*X, now in st(0), by X.
Then we have calculated the first term. Loading B and multiplying it by X calculates the second term.
This was the last time we needed X, so we free the register holding it, st(2).
Now we add terms 1 and 2 and pop term 2 off the stack. The only thing left to do is to add C.
The result is now in st(0) and the other registers are empty.
Then we check for exceptions with wait and are done.
We see that no redundant work is done, and this implementation is near optimal.
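
When a function is rewritten step by step like this, it is easy to break it along the way, as the second surprise showed. One way to guard against that is to compare each new version with the Pascal original over a range of inputs. The sketch below is only an assumption of how such a check could look: TPolyFunc, Reference, Candidate and MaxDifference are names invented here, and Candidate is a pure Pascal stand-in that would be replaced by the BASM function under test.

```pascal
program CompareSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

type
  TPolyFunc = function(X: Double): Double;

// The Pascal original, used as the reference.
function Reference(X: Double): Double;
const
  A: Double = 1;
  B: Double = 2;
  C: Double = 3;
begin
  Result := A*X*X + B*X + C;
end;

// Stand-in for the rewritten function; it follows the operation order of the
// hand-written asm version.
function Candidate(X: Double): Double;
var
  Term1, Term2: Double;
begin
  Term1 := 1.0 * X * X;   // fld A / fmul / fmul
  Term2 := 2.0 * X;       // fld B / fmul
  Result := Term1 + Term2 + 3.0;
end;

// Largest absolute difference between F and G on FromX, FromX+Step, ..., ToX.
function MaxDifference(F, G: TPolyFunc; FromX, ToX, Step: Double): Double;
var
  X, D: Double;
begin
  Result := 0;
  X := FromX;
  while X <= ToX do
  begin
    D := Abs(F(X) - G(X));
    if D > Result then
      Result := D;
    X := X + Step;
  end;
end;

begin
  WriteLn('max difference: ', MaxDifference(Reference, Candidate, -10, 10, 0.25));
end.
```

A bug like the shifted reference to X would show up here as a large difference at the very first input. For real BASM versions a small tolerance is more sensible than an exact comparison, because the FPU keeps intermediate results in extended precision.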

There exist seven instructions for loading often-used constants into the FPU.
One of these constants is 1, which can be loaded with the instruction fld1.
Using it saves one load from memory, which can be costly in terms of clock cycles if the data is not properly aligned.
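
For reference, the seven constant-loading instructions are fld1, fldz, fldpi, fldl2e, fldl2t, fldlg2 and fldln2. The sketch below lists the value each one pushes, computed in Pascal with the Math unit. The function name ConstantFor is invented for this illustration, and the values are only Double approximations of what the FPU loads in extended precision.

```pascal
program FpuConstantsSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils, Math;

const
  Mnemonics: array[0..6] of string =
    ('fld1', 'fldz', 'fldpi', 'fldl2e', 'fldl2t', 'fldlg2', 'fldln2');

// Value that the given constant-loading instruction pushes onto the FPU stack,
// computed here in Pascal (to Double precision only).
function ConstantFor(const Mnemonic: string): Double;
begin
  if Mnemonic = 'fld1' then Result := 1.0
  else if Mnemonic = 'fldz' then Result := 0.0
  else if Mnemonic = 'fldpi' then Result := Pi
  else if Mnemonic = 'fldl2e' then Result := Log2(Exp(1.0))   // log2(e)
  else if Mnemonic = 'fldl2t' then Result := Log2(10.0)       // log2(10)
  else if Mnemonic = 'fldlg2' then Result := Log10(2.0)       // log10(2)
  else if Mnemonic = 'fldln2' then Result := Ln(2.0)          // ln(2)
  else raise Exception.Create('not a constant-loading instruction: ' + Mnemonic);
end;

var
  I: Integer;
begin
  for I := Low(Mnemonics) to High(Mnemonics) do
    WriteLn(Mnemonics[I]:8, ' ', ConstantFor(Mnemonics[I]):0:15);
end.
```

Only fld1 is useful for our polynomial, because A = 1 is the only coefficient that happens to be one of these constants.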

function SecondOrderPolynomial8(X : Double) : Double;
const
//A : Double = 1;
B : Double = 2;
C : Double = 3;
asm
//Result := A*X*X + B*X + C;
fld   qword ptr [ebp+$08]
//fld   qword ptr [A]
fld1
fmul  st(0), st(1)
fmul  st(0), st(1)
fld   qword ptr [B]
fmul  st(0), st(2)
ffree st(2)
faddp
fadd  qword ptr [C]
wait
end;
This ends the second lesson. Stay tuned for more.

[Last edited by skyjacker on 2007-4-23 21:24]

2007-4-23 11:05 Passion
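"符点数" should be "浮点数" (floating-point number), shouldn't it?

2007-4-23 11:32 skyjacker
Yes.

I need to slow down. :P

2007-4-23 11:59 Passion
Typos are easy enough to fix; what worries me is errors in understanding the logic. :lol

2007-4-23 21:27 zzzl
Translators like you are the ones I appreciate most: Chinese and English side by side. Haha, you have my support.

2007-4-23 21:43 Passion
It is rare to see 空气 praise anybody, haha. :lol

2007-4-23 22:35 skyjacker
Thank you!

I'm blushing :loveliness: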
