In this lesson we will learn about branching, illustrated by an if-else construct.
Conditional floating point move will also be introduced.
The example function of this lesson is the Min function from the Delphi Math unit.
在这一课中,我们将学习分支语句,通过一个 if-else 结构来举例说明。
同时介绍浮点数条件转移。
这一课的例子函数是 Delphi Math 单元中的 Min 函数。
function Min1(const A, B: Single) : Single;
begin
if A < B then
Result := A
else
Result := B;
end;
The compiler generated asm for this function looks like this
汇编代码如下:
function Min2(const A, B: Single) : Single;
begin
{
00452458 55 push ebp
00452459 8BEC mov ebp,esp
0045245B 51 push ecx
}
if A < B then
{
0045245C D9450C fld dword ptr [ebp+$0c]
0045245F D85D08 fcomp dword ptr [ebp+$08]
00452462 DFE0 fstsw ax
00452464 9E sahf
00452465 7308 jnb +$08
}
Result := A
{
00452467 8B450C mov eax,[ebp+$0c]
0045246A 8945FC mov [ebp-$04],eax
0045246D EB06 jmp +$06
}
else
Result := B;
{
0045246F 8B4508 mov eax,[ebp+$08]
00452472 8945FC mov [ebp-$04],eax
}
{
00452475 D945FC fld dword ptr [ebp-$04]
00452478 59 pop ecx
00452479 5D pop ebp
}
end;
This time I included the address and opcode columns, because we need them later.
Lets analyze it line by line, like we always do.
这一次我复制了地址和机器码列,因为后面会用到。
像之前一样,我们来一行行分析。
function Min3(const A, B: Single) : Single;
begin
{
push ebp // Save ebp on stack 保存 ebp
mov ebp,esp // New basepointer is the old stackpointer 堆栈指针赋给基址指针
push ecx // subtract 4 from esp esp 减 4
}
if A < B then
{
fld dword ptr [ebp+$0c] // Load A on FP stack A 到浮点数寄存器堆栈
fcomp dword ptr [ebp+$08] // FP compare A to B and pop A from stack A 与 B 比较之后,将 A 从 FP 中弹出
fstsw ax // Store FP statusword in ax 将 FP 的状态字存入 ax
sahf // Store ah into EFlags register 存储 ah 到标志寄存器
jnb +$08 // If not below jump 8 bytes forward 如果不小于,则向前跳 8 个字节
}
Result := A
{
mov eax,[ebp+$0c] // Copy A into eax A 到 eax
mov [ebp-$04],eax // Copy A into stackframe A 到堆栈
jmp +$06 // Jmp 6 bytes forward 向前跳 6 个字节
}
else
Result := B;
{
mov eax,[ebp+$08] // Copy B into eax B 到 eax
mov [ebp-$04],eax // Copy B into stackframe B 到堆栈
}
{
fld dword ptr [ebp-$04] // Load A or B from stackframe onto FP stack 装载 A 或 B 到 FP 堆栈
pop ecx // Add 4 to esp esp 加 4
pop ebp // Restore ebp 恢复 ebp
}
end;
This time I commented every line of BASM. Read the details there.
The first new instruction introduced by this example is fcomp. F says floating-point instruction as always.
Com says compare and p says pop FP stack. Fcom compares two floating-point values and sets the condition code
flags of the floating-point unit, named C0, C1, C2 and C3.
这一次我注释了每一行 BASM。从这可以看到详细说明。
这个例子中第一个被介绍的新指令是 fcomp。
F 表示浮点数指令,Com 表示比较,p 表示从浮点数寄存器堆栈弹出。
FCom 比较两个浮点数,并且设置被命名为 C0, C1, C2, C3的浮点数单元的条件代码标志。
These flags are the equivalents of the EFlags register of the CPU.
The flags are checked by conditional jump instructions, which jump or not depending on the type of jump and
the status of the flags.
这些标志等同于 CPU 的标志寄存器。
条件跳转指令检查这些标志,跳不跳转,依赖于跳转的类型和这些标志的状态。
Conditional jump instructions check the CPU flags and not the FPU flags and therefore it is
necessary to copy the flags from the FPU to the CPU.
条件跳转指令检查 CPU 的标志而不是 FPU 的标志,因此需要将标志从 FPU 拷贝到 CPU。
This is done by the next two instructions.
fstsw stores the FP flags into the ax register and sahf copy the 8 bits from ah to the EFlags register.
This is a long way for the flags to travel before they can be used by the jnb instruction.
Jnb is short for jump not below. In the Intel SW Developers Manual Vol 2 at page 394 there is a table with all
the different jump instructions and a description of the flags they test.
下两行指令实现这个功能。
fstsw 存放 FP 标志到 ax 寄存器,sahf 从 ah 复制 8 位到状态标志寄存器。
在它们被 jnb 指令使用之前,经历了一个比较长的步骤。
Jnb 是一个短跳转,如果不低于则跳转。在 Intel 软件开发手册卷 2 的第 394 页中有一张表描述了
所有跳转指令的不同以及它们所检测的标志寄存器。
Here it is seen that jnb jumps if CF=1 and ZF=1.
Try tracing through the code with the FPU view and the CPU view open.
Watch how the FPU flags are set, their values are copied into ah and then copied to the CPU EFlags register.
If the jnb jump is not taken execution continues in the line after it.
This is the if part of the if-else construct.
If the jump is taken execution is continued at the line 8 bytes ahead.
This is at the start of the else part. The if and the else part is very much the same.
这儿看起来如果 CF =1 并且 ZF =1 时 jnb 跳转。
通过 FPU 窗口和 CPU 窗口跟踪代码。
观察 FPU 状态怎样被设置,它们的值如何拷贝到 ah,并且再拷贝到 CPU 的状态寄存器。
如果 jnb 跳转没有执行,那么将继续执行它之后的指令。这是 if-else 结构的 if 部分。
如果 jnb 跳转执行了,那么将从这一行前面的 8 个字节处执行。
这是 else 部分的开始。if 和 else 部分非常相似。
As seen in the Pascal code A or B is copied to the Result variable depending on the if condition.
Instead of copying A or B directly on the top of the FP stack, which is the place of a FP Result from a function
due to the register calling convention, the Delphi compiler chose to use the stack as a temporary location.
The fld dword ptr [ebp-$04] instruction copies the result in the right place.
从 Pascal 代码中可以看到,A 或者 B 被拷贝到结果变量依赖于 if 条件。
不是从 FP 堆栈的顶部直接复制 A 或 B,其栈顶存放由寄存器调用预定产生的结果,Delphi 编译器选择使用堆栈作为临时本地变量。
fld dword ptr [ebp-$04] 装载指令右面的结果。
Observe that an unconditional jump is needed at the end of the if block such that execution does not continue in the else block.
This way there is a jump no matter with branch is taken. A short word on branch prediction.
Branch prediction comes in two flavours, static and dynamic. The first time a branch is executed the CPU can have no
knowledge about the probability it will be taken or not. In this situation it uses the static predictor,
which says that forward jumps are not taken and backward branches are taken.
显然,如果不执行 else 块,则在 if 块的结尾需要一个无条件跳转指令。
不论是否进入分支都有一个跳转。是分支预测中的短字。
分支预测有两种方式,静态预测和动态预测。
分支一开始执行,CPU 不知道分支是否能够执行。
这种情况下,它使用静态预测,表示向前的跳转不被执行,向后的跳转分支被执行。
Our example jump no 1 is a forward branch and it will be predicted not taken the first time through.
If we have knowledge about the values A and B, we can use it to code the if-else such that
the if part is the most often executed and static prediction is optimized.
An unconditional jump needs no prediction - it is always taken ;-)
A backward jump is often part of a loop and most loops will run more than once (why else have a loop?).
This way it makes sense to statically predict backward jumps taken.
Dynamic branch prediction tries to accumulate knowledge about the probability that a certain jump is taken
or not and then predict it correctly as often as possible.
例子中的跳转 1 是向前分支,第一次不会进入它。
如果我们知道 A 和 B 的值,我们可以这样使用 if-else 编码,if 部分放置最经常被执行的代码,静态预测被优化了 。
一个无条件跳转不需要预测 - 它总是被执行
一个向后的跳转常常是循环中的一部分,大部分循环都将运行多次(为什么 else 有一个循环)。
这种方式对于静态预测向后跳转比较有意义。
动态分支预测累计某种跳转是否执行,然后尽可能的正确预测它。
Now it is time to transform the function into a pure asm one.
是把函数转换成纯 asm 代码的时候了。
function Min4(const A, B: Single) : Single;
asm
//push ebp
//mov ebp,esp
push ecx
//if A < B then
fld dword ptr [ebp+$0c]
fcomp dword ptr [ebp+$08]
fstsw ax
sahf
jnb @ElseBegin
//Result := A
mov eax,[ebp+$0c]
mov [ebp-$04],eax
jmp @ElseEnd
//else
//Result := B;
@ElseBegin :
mov eax,[ebp+$08]
mov [ebp-$04],eax
@ElseEnd :
fld dword ptr [ebp-$04]
pop ecx
//pop ebp
end;
The new thing this time is that we need some labels.
Our analyse of the function made it clear where we jumped to when,I hope.
In general when placing labels it is a good thing to use the knowledge we have about the structure of the code.
You can open the FPU view and just trace through the code and see where you go when jumps are taken.
If you want to place labels without having to think, then use some math. Example follows
We have this jump
这一次我们需要加上一些标签。
我希望已经清楚地分析了这个函数什么时候跳转到哪里。
通常,放置标签的时候,使用我们已知的代码结构命名是个好习惯。
你可以打开 FPU 窗口,并跟踪代码,查看跳转执行的过程。
如果你想放置没有任何意义的标签,可以使用数学的方法。
比如下面的跳转例子:
00452465 7308 jnb +$08
This is the line after it 这一行之后
00452467 8B450C mov eax,[ebp+$0c]
This is the line 8 byte after that line 判断行之后的 8 个字节处
0045246F 8B4508 mov eax,[ebp+$08]
Take the address of the line after the jump and add the jump offset to it and you have the jump target address.
跳转行之后的地址加上跳转的偏移量等于要跳转到的目标地址。
This is the math: 00452467 + 8 = 0045246F.
公式为 00452467 + 8 = 0045246F.
Why do we add the jump offset to the address of the line after the jump and not to the address of the line with the
jump?
为什么我们用跳转指令之后的地址来加偏移量,而不是用跳转指令这一行的地址呢?
It's time to optimise. 现在开始优化:
function Min5(const A, B: Single) : Single;
asm
push ecx
//if A < B then
fld dword ptr [ebp+$0c]
fcomp dword ptr [ebp+$08]
fstsw ax
sahf
jnb @ElseBegin
//Result := A
mov eax,[ebp+$0c]
mov [ebp-$04],eax
jmp @ElseEnd
//else
//Result := B;
@ElseBegin :
mov eax,[ebp+$08]
mov [ebp-$04],eax
@ElseEnd :
fld dword ptr [ebp-$04]
pop ecx
end;
This is just a cleaned up version of the function. Lets change the push ecx,
pop ecx instructions with instructions that manipulate the esp register directly and do not move data between
ecx and the stack.
这只是整理之后的代码。
让我们来优化掉 push ecx, pop ecx 这些直接操作 esp 寄存器的指令, ecx 和堆栈之间每有传递数据。
function Min6(const A, B: Single) : Single;
asm
//push ecx
sub esp, 4
//if A < B then
fld dword ptr [ebp+$0c]
fcomp dword ptr [ebp+$08]
fstsw ax
sahf
jnb @ElseBegin
//Result := A
mov eax,[ebp+$0c]
mov [ebp-$04],eax
jmp @ElseEnd
//else
//Result := B;
@ElseBegin :
mov eax,[ebp+$08]
mov [ebp-$04],eax
@ElseEnd :
fld dword ptr [ebp-$04]
//pop ecx
add esp, 4
end;
When the code was analyzed we saw that the flags were traveling a long way and 2 instructions were needed for this.
What about a floating point compare instruction that sets the EFlags directly?
This is the fcomi instruction, which was introduced in the P6 architecture.
Lets use it and abandon those CPUs older then the Pro. It would be fair to believe that these lines
当我们分析传输标志指令时,发现是用两个指令来实现的。
一个浮点数指令怎样来直接设置 CPU 的标志寄存器呢?
通过 fcomi 指令, 它在 P6 结构中被介绍.
fcomp dword ptr [ebp+$08]
fstsw ax
sahf
could be substituted by this 可以被下面的代码替代
fcomip dword ptr [ebp+$08]
The fcomi instruction does however not accept a memory reference as operand.
Therefore it is necessary to load data before issuing it.
fcomi 指令不能引用内存操作数。因此使用它之前需要装载数据:
fld dword ptr [ebp+$0c]
fcomip st(0), st(1)
Then because we loaded data we have to remove them again with the ffree instruction.
An fcomipp instruction would have been nice to have.
之后,因为我们装入了数据,因此必须用 ffree 指令移除。
fcomipp 指令可能更适合做这个。
fld dword ptr [ebp+$0c]
fcomip st(0), st(1)
ffree st(0)
This is a hell of an optimization, substituting three lines with 3 other.
It is necessary to time the two versions to know whether it was an optimization.
Now the function looks like this.
这是一个地狱般的优化,用另外 3 行代替原来的 3 行。
需要来测试这两个版本,看看哪个是比较有效率的。
此函数如下:
function Min7(const A, B: Single) : Single;
asm
sub esp, 4
//if A < B then
fld dword ptr [ebp+$08]
fld dword ptr [ebp+$0c]
fcomip st(0), st(1)
ffree st(0)
//fstsw ax
//sahf
jnb @ElseBegin
//Result := A
mov eax,[ebp+$0c]
mov [ebp-$04],eax
jmp @ElseEnd
//else
//Result := B;
@ElseBegin :
mov eax,[ebp+$08]
mov [ebp-$04],eax
@ElseEnd :
fld dword ptr [ebp-$04]
add esp, 4
end;
Now we apply some logic thinking. Why copy the result around? Both A & B is on the stack for use by the comparison
by fcom and the result should be delivered on the stack. The only thing needed is to delete either
A or B and leave the smallest one of them on the stack
现在我们进行一些逻辑上的思考。为什么要反复复制结果呢?
A 和 B 通过堆栈传递进行 fcom 比较,结果也在堆栈上传输。
只需要删除 A 或者 B,将最小的留在堆栈上。
function Min8(const A, B: Single) : Single;
asm
sub esp, 4
//if A < B then
fld dword ptr [ebp+$08]
fld dword ptr [ebp+$0c]
//fcomip st(0), st(1)
fcomi st(0), st(1)
//ffree st(0)
jnb @ElseBegin
//Result := A
//mov eax,[ebp+$0c]
//mov [ebp-$04],eax
ffree st(1)
jmp @ElseEnd
//else
//Result := B;
@ElseBegin :
//mov eax,[ebp+$08]
//mov [ebp-$04],eax
fxch
ffree st(1)
@ElseEnd :
//fld dword ptr [ebp-$04]
add esp, 4
end;
Fcomip is replaced with fcomi because we do not want to remove B from the stack at this point.
Ffree is deleted because it deleted A. Then all the lines that copied the result around are removed.
In the if block A is the result and B should be deleted. B is in st(1) and ffree st(1) does the job.
In the else block we want to delete A and leave B in st(0). Swapping A and B do this with the fxch
instruction and then delete A in st(1) with ffree. Fxch is nearly free (takes 0 cycles),
because if works by register renaming instead of actually copying data.
Fcomip 被 fcomi 替代,因为我们不想从堆栈上移除 B。Ffree 被删除,因为它释放了 A。
所有的重复的复制都被删除。
在 if 块中, A 是结果,那么 B 会被删除。B 是在 st(1) 中, ffree st(1) 将其删除。
在 else 块中,我们想删除 A, 将 B 留在 st(0) 中。用 fxch 指令交换 A 和 B 的值,然后用
ffree 删除 st(1) 中的 A。
Fxch 是很节省的(花费 0 个周期),因为它是重命名寄存器而不是复制数据。
function Min9(const A, B: Single) : Single;
asm
//sub esp, 4
//if A < B then
fld dword ptr [ebp+$08]
fld dword ptr [ebp+$0c]
fcomi st(0), st(1)
jnb @ElseBegin
//Result := A
ffree st(1)
jmp @ElseEnd
//else
//Result := B;
@ElseBegin :
fxch
ffree st(1)
@ElseEnd :
//add esp, 4
end;
Now the stack frame is not needed and we delete the code that set it up.
操作堆栈的代码也不需要:
function Min10(const A, B: Single) : Single;
asm
//if A < B then
fld dword ptr [ebp+$08]
fld dword ptr [ebp+$0c]
fcomi st(0), st(1)
jnb @ElseBegin
//Result := A
ffree st(1)
jmp @ElseEnd
//else
//Result := B;
@ElseBegin :
fxch
ffree st(1)
@ElseEnd :
end;
This is a pretty nice function, but somebody in the newsgroup has told us about conditional moves.
fcmovnb is such a thing - floating point conditional move not below. It copies data from st(1)-st(7) to st(0)
if the condition is fulfilled. The Eflags are tested for the condition. Fcmov was introduced in the P6
architecture as well fcomi was.
现在是一个非常简洁的函数了,但是在新闻组里有人告诉我关于条件移动指令。
fcmovnb 做这样一件事 - 不低于则浮点数条件移动。
如果条件满足,它从 st(1)-st(7) 拷贝数据到 st(0)。
Eflags 用于条件测试。 Fcmov 和 fcomi 一样在 P6 体系结构中被介绍。
function Min11(const A, B: Single) : Single;
asm
fld dword ptr [ebp+$08]
fld dword ptr [ebp+$0c]
fcomi st(0), st(1)
fcmovnb st(0), st(1)
ffree st(1)
end;
Instead of all the jumping around we copy A to the top of the stack where B is,
but only if A is the smaller of the two.Delete B and done.
This is a nice clean function with only one piece of redundancy left.
The compiler still push/ pop ebp even though it is not modified in the function.
如果 A 是两者中最小的,则不需要将 B 移到栈顶,删除 B,结束。
这是一个简洁的函数,只有一点点冗余。
因为,编译器仍然会加入 push/ pop ebp,尽管函数中没有改变 ebp 寄存器的值。