CnPack Forum » 技术板块灌水区 » <<BASM 初学者入门>> 第 5 课


2007-4-30 12:46 skyjacker
<<BASM 初学者入门>> 第 5 课

Lesson 5
<<BASM 初学者入门>> 第 5 课

[url=http://www.cnpack.org]http://www.cnpack.org[/url]
QQ Group: 130970
翻译:SkyJacker
版本:草稿版
状态:未校对
时间:2007

Welcome to lesson number 5. Today topic is loops.
We will take a look on how the compiler implements for loops, and which optimizations it applies on them.
We will evaluate the efficiency of these optimizations.
欢迎来到第 5 课。今天的主题是循环。
我们将看一下编译器如何执行循环,同时做了哪些优化。
我们将评价这些优化的效率。

function ForLoop(Start, Stop : Integer) : Integer;
var
I : Integer;
begin
Result := 0;
for I := Start to Stop do
  begin
    Result := Result + I;
  end;
end;
This example function does nothing useful except giving us an example for loop to examine.
Lets see what the compiler translates the function into.
In this lesson we try something new and compile with optimizations off.
这个例子除了表示一个循环的例子之外没有什么用处。我们来看看编译器是怎样编译的。
在这一课中我们试试一些新东西,把编译器优化选项关了。

function ForLoopNonOpt(Start, Stop : Integer) : Integer;
var
I : Integer;
begin
{
push ebp
mov ebp,esp
add esp,-$14
mov [ebp-$08],edx
mov [ebp-$04],eax
}
Result := 0;
{
xor eax,eax
mov [ebp-$0c],eax
}
for I := Start to Stop do
{
mov eax,[ebp-$04]
mov edx,[ebp-$08]
sub edx,eax
jl +$15
inc edx
mov [ebp-$14],edx
mov [ebp-$10],eax
}
  begin
    Result := Result + I;
    {
    mov eax,[ebp-$10]
    add [ebp-$0c],eax
    inc dword ptr [ebp-$10]
    }
  end;
  {
  dec dword ptr [ebp-$14]
  jnz -$0e
  mov eax,[ebp-$0c]
  }
{
mov esp,ebp
pop ebp
ret
}
end;
It is seen that the compiler generates a lot of code when no or few optimizations are applied.
As usual the first 3 lines set up a stack frame. This time it is 20 byte big (16 hex).
The two next lines copy the Start and Stop variables unto the stack. Start is transferred in eax and
Stop is transferred in edx to the function due to the register calling convention.
The next two lines create a zero in eax and copy it to the stack frame at [ebp-$0c],
which is the place of the result variable.
在没有进行优化时,编译器产生了许多代码。
前三行设置堆栈。这一次开辟了 20 个字节的大小(16 hex)
接下来的两行复制 Start 和 Stop 变量到堆栈中。根据寄存器调用约定, Start 通过 eax 传递,Stop 通过 edx 传递。
然后下面的两行代码将 eax 清零并复制到堆栈 [ebp-$0c] 中,用于存放结果变量。

Then we are ready to enter the for loop.
Before entering a for loop it is necessary to test whether the loop will iterate at all.
If Stop is bigger than Start this is the case.
Start and Stop is fetched from their stack locations into eax & edx.
We compute Stop-Start and if this is negative the loop should not be entered
and execution is transferred past the end of the loop by the jl (jump low) instruction.
然后准备进入循环。
在进入循环之前需要检测一下循环是否需要进入。如果 Stop 大于 Start 则需要。
Start 和 Stop 的值是通过将本地堆栈中数值复制到 eax & edx 中获得。
计算 Stop-Start,如果是负数,则不进入循环,并且通过 jl(低则跳转) 指令到达循环的末尾处执行指令。

The next line increments Stop and then it is copied to location [ebp-$14].
We have no name for the local variable at this location.
The purpose of it also needs some further explanations.
It is a variable introduced by the compiler as an optimization and this is weird because
we compiled with optimizations off.
下一行 Stop 加 1,然后将它复制到局部变量 [ebp-$14]。
我们没有声明这个局部变量。
关于它的目的需要更深入的解释。
编译器将它作为优化使用,真是令人不可思议,因为我们已经关闭了编译选项。

There is only one more line accessing the NoName variable and this is the dec dword ptr [ebp-$14] line.
dec dword ptr [ebp-$14] 也访问了这个未命名的局部变量(NoName),
This line decrements NoName by one at the end of each loop iteration
and the loop closing test is a test that it has not reached zero.
在每次循环的末尾它将未命名的变量(NoName)减 1,通过检查它是否是零来判断循环是否结束。

The dec instruction sets the flags and the jnz jumps back to the top of the loop if NoName <> 0.
We would expect that I was used as the loop counter and that it was running from Start to Stop.
It is in fact doing so, but it is not used for control of the loop.
This is the sole purpose of NoName.
dec 指令设置标志,如果 NoName <> 0,jnz 将跳转回循环的开始。
我们期望它用于循环计数,从 Start to Stop。
事实上,它确实是这样做的,但是它不用于循环的控制。这是 NoName 唯一的目的。

The advantage of this is that it saves an instruction for comparing I to Stop.
There is also a cost and this is the dec NoName instruction.
On P4 the latency/throughput of cmp is  0.5/0.5 clock cycles and for dec they are 1/0.5.
Therefore we would expect this optimization to be a "deoptimization".
The latency and throughput numbers for P4 is found in the
"Intel Pentium 4 and Xeon Processor Optimization" manual from Intel.
这样做的优点是,它节省了一个 I 与 Stop 的比较指令。
在 P4 上, cmp 的潜伏期/吞吐量是 0.5/0.5 时钟周期,dec 指令是 1/0.5。
因此,我们预期的优化变成了一个 “非优化”
在 Intel 的 "Intel Pentium 4 和 Xeon 处理器优化"手册上有 P4 指令的潜伏期和吞吐量。

Back to the mov [ebp-$10], eax line. It copies I to the stack.
The loop body consist of only one line of Pascal Result := Result + I;.
This is translated into 3 lines of ASM.
The first two lines are loading I into eax and then adding it to Result at the stack at [ebp-$0c].
The third line increments I.
This ended the explanation of the loop code and only two things are left.
The Result must be copied to eax, which is the register to use for returning the result from
a register calling convention function.
The last three lines remove the stack frame and return execution to the line after the line that called the function.
回到 mov [ebp-$10], eax 这一行,它复制 I 到堆栈。
循环体只由一行 Pascal 代码 Result := Result + I;
它被编译成了 3 行 ASM 代码。
前两行将 I 装入 eax,然后将它累加到结果堆栈 [ebp-$0c]。
第三行是 I 加 1。
只剩下下面两个问题就解释完循环代码了:
结果必须被复制到 eax,eax 是被用来存放返回结果。
最后三行用于移除堆栈,并且返回到调用此函数的代码的下一行。

As an exercise let us change the ASM code such that it follows the Pascal code and our understanding of a for loop.
We start by turning the function into a pure BASM one.
This is done by out commenting the Pascal code and "in commenting" the ASM code.
Defining the two labels LoopEnd and LoopStart is also necessary.
The two jumps are edited such that they jump to the labels.
作为一个练习,我们将自己理解的 pascal 循环代码转为 ASM 代码。
首先将函数转为纯 BASM。
注释掉 pascal 代码,使用 ASM 代码。
也需要定义两个标签 LoopEnd 和 LoopStart。这两个标签是根据它们的跳转意义来命名的。

function ForLoopBASM1(Start, Stop : Integer) : Integer;
asm
push ebp
mov ebp,esp
add esp,-$14
mov [ebp-$08],edx
mov [ebp-$04],eax
  //Result := 0;
  xor eax,eax
  mov [ebp-$0c],eax
  //for I := Start to Stop do
  mov eax,[ebp-$04]
  mov edx,[ebp-$08]
  sub edx,eax
  jl @LoopEnd
  inc edx
  mov [ebp-$14],edx
  mov [ebp-$10],eax
   //begin
   @LoopStart :
     //Result := Result + I;
     mov eax,[ebp-$10]
     add [ebp-$0c],eax
     inc dword ptr [ebp-$10]
   //end;
   dec dword ptr [ebp-$14]
   jnz @LoopStart
  @LoopEnd :
   mov eax,[ebp-$0c]
  mov esp,ebp
  pop ebp
  //ret
end;
The first thing we do is removing the NoName variable.
首先,我们移除 NoName 变量
function ForLoopBASM2(Start, Stop : Integer) : Integer;
asm
push ebp
push ebx                      //New 新
mov ebp,esp
add esp,-$14
mov [ebp-$08],edx
mov [ebp-$04],eax
  //Result := 0;
  xor eax,eax
  mov [ebp-$0c],eax
  //for I := Start to Stop do
  mov eax,[ebp-$04]
  mov edx,[ebp-$08]
  sub edx,eax
  jl @LoopEnd
  //inc edx                    //NoName intialize NoName 初始化
  //mov [ebp-$14],edx          //NoName intialize NoName 初始化
  mov [ebp-$10],eax
   //begin
   @LoopStart :
     //Result := Result + I;
     mov eax,[ebp-$10]
     add [ebp-$0c],eax
     inc dword ptr [ebp-$10]
   //end;
   //dec dword ptr [ebp-$14]  //NoName decrement NoName 减一
   mov ebx, [ebp-$10]         //New
   mov ecx, [ebp-$08]         //New
   cmp ebx, ecx               //New
   //jnz @LoopStart
   jbe @LoopStart             //New
  @LoopEnd :
   mov eax,[ebp-$0c]
  mov esp,ebp
  pop ebx                     //New
  pop ebp
  //ret
end;
The lines marked "New" are introduced to make I the loop control variable.
The mov ebx, [ebp-$10] line copies I into ebx. The next line copies Stop into ecx.
Then the line cmp ebx, ecx compare them and jbe @LoopStart transfer execution to the start of the loop if
I is below Stop or equal to it. Because we use ebx and it is not free for use we remember to push and pop it.
We expect a for loop to evaluate the loop condition at the top.
This test is split in two by the compiler implementation.
Before entering the loop it is tested that it will execute at least once and the actual loop closing test is done
at the bottom. This is an optimization technique called loop inversion. Now we change the loop such that
this optimization is removed. Then we see what was the benefit of the optimization.
被标记为 "New"  的行是用来制造 I 循环控制变量。
mov ebx, [ebp-$10] 将 I 复制到 ebx。下一行将 Stop 复制到 ecx。
cmp ebx, ecx 进行比较,如果 I 小于等于 Stop ,则 jbe @LoopStart 执行到循环的开始。
因为我们使用了 ebx,它不能被自由的使用,因此我们要记得 push 和 pop 它。
一个循环是在开始就计算循环条件的。
可以将编译器执行过程被分为两步。
在进入循环之前测试,它将至少被执行一次,然后在结尾判断是否循环结束。这种优化技术叫做循环倒置。
现在我们开始修改这个没被优化的循环,来看看优化的好处。
  
function ForLoopBASM4(Start, Stop : Integer) : Integer;
asm
push ebp
push ebx
mov ebp,esp
add esp,-$14
mov [ebp-$08],edx
mov [ebp-$04],eax
  //Result := 0;
  xor eax,eax
  mov [ebp-$0c],eax
  //for I := Start to Stop do
  mov eax,[ebp-$04]
  mov edx,[ebp-$08]
  //sub edx,eax
  //jl @LoopEnd
  mov [ebp-$10],eax
   //begin
   @LoopStart :
     mov ebx, [ebp-$10]
     mov ecx, [ebp-$08]
     cmp ebx, ecx
     ja  @LoopEnd
     //Result := Result + I;
     mov eax,[ebp-$10]
     add [ebp-$0c],eax
     inc dword ptr [ebp-$10]
   //end;
   //mov ebx, [ebp-$10]
   //mov ecx, [ebp-$08]
   //cmp ebx, ecx
   //jbe @LoopStart
   jmp @LoopStart
  @LoopEnd :
   mov eax,[ebp-$0c]
  mov esp,ebp
  pop ebx
  pop ebp
end;
The loop-closing test has been moved to the top and the test has been inverted.
At the place of the test there now is an unconditional jump to the top.
This jump is what the loop inversion optimization is all about.
In the unoptimized for loop there is two jumps and only one in the optimized one.
The test at the top that tested whether Start was above Stop is now redundant and is removed.
Before making any timing to evaluate the effect of the two optimisations it is a good idea to optimise away some,
or all if possible, of the stack to register and register to stack moves.
循环结束的检测被移到循环开始,做了一个反转。
在原先检测的位置用了一个无条件跳转到循环开始。
这个跳转说明了整个循环倒置优化。
在没有优化的循环里有两个跳转,在优化的循环里只有一个跳转。
在开头的测试已经测试了是否 Start,Stop 现在是冗余的因此被移除。
在比较这两个优化的效率之前,最好能尽可能的优化掉比如堆栈到寄存器,寄存器到堆栈的数据移动。

This process is called register allocation and is one of the most important optimisations on all architectures,
but it is even more important on the Intel architecture because of the low number of available registers.
If there is not a register available to all variables it is crucial which variables get a register.
The mov instructions inside the loop body are the most important ones to get rid of.
They are executed as many times as the number of loop iterations.
The instructions outside the loop are only executed once.
The variables used inside the loop should be allocated to registers first. This is I, Stop and Result.
At this point we could be smart and take a look at the use of registers as temporaries.
If a variable were always copied into the same temp register it would be smart to allocate this register for
the variable.
Stop is in the edx register when we enter the function and this register is also used as temporary
register for it in all but two lines. This is the two lines of the loop test that we added.
这个处理叫做寄存器分配,它是所有体系结构中最重要的优化之一。
因为寄存器数量很少,因此在 Intel 体系结构中它显得很重要。
如果寄存器不能满足所有的变量,因此哪个变量获得寄存器是相当重要的。
在循环体内的 mov 指令是最重要的一个,需要用到一个寄存器。
因为在循环迭代中,它们需要执行很多次。
循环以外的指令只执行一次。
用在循环内部的变量应该首先被分配寄存器。这有 I, Stop 和 Result。
在这一点上,我们可以动动脑筋,观察一下寄存器作为临时变量的使用。
如果一个变量总是被复制到相同的临时寄存器,它将被聪明的分配一个寄存器。
当我们进入函数时, Stop 在 edx 中,这个寄存器也总是作为临时寄存器使用,但是只有两行。
下面是我们为循环测试增加的两行:
      
Let us change 让我们把
mov ecx, [ebp-$08]
cmp ebx, ecx
to           改为
mov edx, [ebp-$08]
cmp ebx, edx
Eax is used by Start in the top of the function and by Result in the rest of the function.
If there is no overlap in usage we can allocate eax for Result as soon as Start has finished using it.
After Start is assigned to I (mov [ebp-$10], eax) it is not used any more and eax is free to use by Result,
if it was not for those lines where eax is used as temp for I.
Eax 在函数开始被作为 Start ,在函数结尾作为返回结果。
一旦 Start 用完它,如果再没有覆盖 eax 的操作了,那么可以将 eax 作为返回结果了。
在 Start 被赋值为 I (mov [ebp-$10], eax) 之后,如果没有那些将 eax 作为 I 的临时变量,则可以将 eax 作为返回结果了。

mov eax,[ebp-$10]
add [ebp-$0c],eax
inc dword ptr [ebp-$10]
After ecx got out of use by the last change we can use that as temp for I instead of eax.
在 ecx 不被使用之后,我们可以用它作为临时变量 I 来代替 eax 。
mov ecx,[ebp-$10]
add [ebp-$0c],ecx
inc dword ptr [ebp-$10]
The summary of the first part of the register allocation is: Result in eax, I in ecx and Stop in edx.
Lets first change the lines with Stop. [ebp-$08] is replaced by edx everywhere.
寄存器分配的第一部分的主题是: 结果在 eax 中, I 在 ecx 中, Stop 在 edx 中。
首先修改 Stop 部分,用 edx 替换所有的 [ebp-$08]。

function ForLoopBASM6(Start, Stop : Integer) : Integer;
asm
push ebp
push ebx
mov ebp,esp
add esp,-$14
//mov [ebp-$08],edx
mov edx,edx
mov [ebp-$04],eax
  //Result := 0;
  xor eax,eax
  mov [ebp-$0c],eax
  //for I := Start to Stop do
  mov eax,[ebp-$04]
  //mov edx,[ebp-$08]
  mov edx,edx
  mov [ebp-$10],eax
   //begin
   @LoopStart :
     mov ebx, [ebp-$10]
     //mov edx, [ebp-$08]
     mov edx, edx
     cmp ebx, edx
     ja  @LoopEnd
     //Result := Result + I;
     mov ecx,[ebp-$10]
     add [ebp-$0c],ecx
     inc dword ptr [ebp-$10]
   //end;
   jmp @LoopStart
  @LoopEnd :
   mov eax,[ebp-$0c]
  mov esp,ebp
  pop ebx
  pop ebp
end;
Then allocate ecx for I by replacing [ebp-$10] by ecx.
用 ecx 作为 I 来替换 [ebp-$10]
function ForLoopBASM7(Start, Stop : Integer) : Integer;
asm
push ebp
push ebx
mov ebp,esp
add esp,-$14
mov edx,edx
mov [ebp-$04],eax
  //Result := 0;
  xor eax,eax
  mov [ebp-$0c],eax
  //for I := Start to Stop do
  mov eax,[ebp-$04]
  mov edx,edx
  //mov [ebp-$10],eax
  mov ecx,eax
   //begin
   @LoopStart :
     //mov ebx, [ebp-$10]
     mov ebx, ecx
     mov edx, edx
     cmp ebx, edx
     ja  @LoopEnd
     //Result := Result + I;
     //mov ecx,[ebp-$10]
     mov ecx,ecx
     add [ebp-$0c],ecx
     //inc dword ptr [ebp-$10]
     inc ecx
   //end;
   jmp @LoopStart
  @LoopEnd :
   mov eax,[ebp-$0c]
  mov esp,ebp
  pop ebx
  pop ebp
end;
最后,让 eax 存放结果。
因为 eax 在函数的开始用来传递了参数 Start,因此需要将它初始化为零。
如果 eax 没有再被用于其他目的,我们需要增加一行来将结果复制到 eax。
function ForLoopBASM8(Start, Stop : Integer) : Integer;
asm
push ebp
push ebx
mov ebp,esp
add esp,-$14
mov edx,edx
mov [ebp-$04],eax
  //Result := 0;
  xor eax,eax
  mov [ebp-$0c],eax
  //for I := Start to Stop do
  mov eax,[ebp-$04]
  mov edx,edx
  mov ecx,eax
  mov eax, [ebp-$0c]                //New
   //begin
   @LoopStart :
     mov ebx, ecx
     mov edx, edx
     cmp ebx, edx
     ja  @LoopEnd
     //Result := Result + I;
     mov ecx,ecx
     //add [ebp-$0c],ecx
     add eax,ecx
     inc ecx
   //end;
   jmp @LoopStart
  @LoopEnd :
   //mov eax,[ebp-$0c]
   mov eax,eax
  mov esp,ebp
  pop ebx
  pop ebp
end;
Because we were pretty smart when we chose the registers there is a lot of lines like mov eax, eax.
It is easy to see how redundant they are ;-). Let us remove them.
我们应该灵活的选择像 mov eax, eax 这样的寄存器操作行。
很容易发现他们是否冗余。移除它们:
function ForLoopBASM9(Start, Stop : Integer) : Integer;
asm
push ebp
push ebx
mov ebp,esp
add esp,-$14
//mov edx,edx
mov [ebp-$04],eax
  //Result := 0;
  xor eax,eax
  mov [ebp-$0c],eax
  //for I := Start to Stop do
  mov eax,[ebp-$04]
  //mov edx,edx
  mov ecx,eax
  mov eax, [ebp-$0c]               
   //begin
   @LoopStart :
     mov ebx, ecx
     //mov edx, edx
     cmp ebx, edx
     ja  @LoopEnd
     //Result := Result + I;
     //mov ecx,ecx
     add eax,ecx
     inc ecx
   //end;
   jmp @LoopStart
  @LoopEnd :
   //mov eax,eax
  mov esp,ebp
  pop ebx
  pop ebp
end;
When optimizing ASM code there is generally two lines of thinking we can choose to follow.
We can think as the human beings we are and try to be smart, use whatever information we can get from the code.
We did some of this when we selected the registers here.
Another line of thought is trying to do it systematically as an optimizer/compiler has to.
This way we develop algorithms that can be coded in a tool.
This tool can later take over the most boring of optimizations, these we do over and over again.
Removal of the most obviously redundant line of code of all, the mov eax, eax, was an example of dead code removal,
which is basic to any optimiser.
当我们优化 ASM 代码是通常有以下两种考虑。
我们可以发挥我们的聪明才智尽可能的从这些代码中获取优化信息,就像我们上面所做的用寄存器处理。
另一个考虑是去实现一个系统的优化器或编译器。开发一个工具来定义优化规则。
这个工具可能代替大部分繁琐的优化,就是我们一而再,再而三做得这些人工优化。
一个优化器需要实现的基本功能:删除代码中所有的明显的冗余,mov eax, eax 是一行典型的无效代码。

In the top of the function we still have some references to the stack.
To get rid of these we allocate registers for those variables too.
This time we just pick edi and esi, which are not used elsewhere.
Allocate esi for [ebp-$04] and edi for [ebp-$0c].
Because esi and edi must be preserved by the function we must push and pop them.
在这个函数的开始我们仍然有一些对堆栈的引用。
为了去除这些,我们也为这些变量分配寄存器。
这时候我们可以使用 edi, esi ,他们还没有在别的地方被使用。
为 [ebp-$04] 分配 esi,为 [ebp-$0c] 分配 edi。
因为 esi, edi 是函数所保留的,因此必须 push 和 pop 它们。
function ForLoopBASM10(Start, Stop : Integer) : Integer;
asm
push ebp
push ebx
push esi
push edi
mov ebp,esp
add esp,-$14
//mov [ebp-$04],eax
mov esi,eax
  //Result := 0;
  xor eax,eax
  //mov [ebp-$0c],eax
  mov edi,eax
  //for I := Start to Stop do
  //mov eax,[ebp-$04]
  mov eax,esi
  mov ecx,eax
  //mov eax, [ebp-$0c]
  mov eax, edi
   //begin
   @LoopStart :
     mov ebx, ecx
     cmp ebx, edx
     ja  @LoopEnd
     //Result := Result + I;
     add eax,ecx
     inc ecx
   //end;
   jmp @LoopStart
  @LoopEnd :
  mov esp,ebp
  pop edi
  pop esi
  pop ebx
  pop ebp
end;
The stack frame is not used anymore and there is no need to set it up.
This removes 4 instructions.
Then we observe that the two lines
堆栈不再被使用,不需要设置它。删除这 4 行指令。
接下来,我们观察下面两行.
mov eax,esi
mov ecx,eax
Can be replaced by one.
可以被一行代替:
mov ecx, esi
This is an example of a simple copy propagation followed by dead code removal.
Any other lines do not use the value in eax than the next line that copies it back to ecx.
It is in fact immediately overwritten by the line mov eax, edi.
Therefore we can replace the second line by mov ecx,esi and remove the first one, which becomes dead.
这时一个关于移除无效代码复制繁殖的例子。
没有任何一行使用 eax 的值,在第二行又被拷贝回 ecx。
事实上,eax 马上就被 mov eax, edi 重写。
因此,我们用 mov ecx,esi 替换第二行,删除无效的第一行。

function ForLoopBASM11(Start, Stop : Integer) : Integer;
asm
//push ebp
push ebx
push esi
push edi
//mov ebp,esp
//add esp,-$14
mov esi,eax
  //Result := 0;
  xor eax,eax
  mov edi,eax
  //for I := Start to Stop do
  //mov eax,esi
  //mov ecx,eax
  mov ecx, esi
  mov eax, edi
   //begin
   @LoopStart :
     //mov ebx, ecx
     //cmp ebx, edx
     cmp ecx, edx
     ja  @LoopEnd
     //Result := Result + I;
     add eax,ecx
     inc ecx
   //end;
   jmp @LoopStart
  @LoopEnd :
  //mov esp,ebp
  pop edi
  pop esi
  pop ebx
  //pop ebp
end;
The line xor eax, eax that initialises Result to zero can together with the line right
after it be moved some lines down closer to the place where eax is used for the first time.
It should not be moved into the loop that would change the logic of the function,
but just before the line before loopStart.
This removes the need for copying eax into edi and back into eax again in the line just before
the comment line //for I := Start to Stop do, and in the line before the out commented begin.
xor eax, eax 将结果初始化为零。它应当被放在第一次使用 eax 的地方。
它不应该放入循环中,因为会改变函数的逻辑,仅用于 loopStart 之前。
需要移除注释行 //for I := Start to Stop do 之前的复制 eax 到 edi ,然后再复制回 eax 的代码。

function ForLoopBASM12(Start, Stop : Integer) : Integer;
asm
push ebx
push esi
push edi
mov esi,eax
  //for I := Start to Stop do
  mov ecx, esi
  //Result := 0;
  xor eax,eax
  //mov edi,eax
  //mov eax, edi
   //begin
   @LoopStart :
     cmp ecx, edx
     ja  @LoopEnd
     //Result := Result + I;
     add eax,ecx
     inc ecx
   //end;
   jmp @LoopStart
  @LoopEnd :
  pop edi
  pop esi
  pop ebx
end;
After having cleaned up we spot two lines of mov which together copy eax to ecx via esi.
This leaves a copy of eax in esi, which is not used. Therefore one that moves eax directly into ecx can replace
these two lines. This is also copy propagation + dead code removal.
清除之后我们发现两行 mov 指令,是将 eax 通过 esi 复制到 ecx。
这样在 esi 中留下了一个 eax 的拷贝,它没有被使用。
因此,用直接将 eax 移入 ecx 的方法替换这两行。
这也是一个复制繁殖 + 无效代码移除。

function ForLoopBASM13(Start, Stop : Integer) : Integer;
asm
push ebx
//push esi
push edi
//mov esi,eax
  //for I := Start to Stop do
  //mov ecx, esi
  mov ecx, eax
  //Result := 0;
  xor eax,eax
   //begin
   @LoopStart :
     cmp ecx, edx
     ja  @LoopEnd
     //Result := Result + I;
     add eax,ecx
     inc ecx
   //end;
   jmp @LoopStart
  @LoopEnd :
  pop edi
  //pop esi
  pop ebx
end;
After having removed the only use of esi there is no need to push and pop it.
移除之后只需要使用 esi,不需要 push 和 pop 它。
function ForLoopBASM14(Start, Stop : Integer) : Integer;
asm
//push ebx
//push edi
  //for I := Start to Stop do
  mov ecx, eax
  //Result := 0;
  xor eax,eax
   //begin
   @LoopStart :
     cmp ecx, edx
     ja  @LoopEnd
     //Result := Result + I;
     add eax,ecx
     inc ecx
   //end;
   jmp @LoopStart
  @LoopEnd :
  //pop edi
  //pop ebx
end;
We also, a little late perhaps, observe that ebx and edi is not used either. After having cleaned up
and relocated the comments a little, there is a nice clean function as a result.
还有一个可能,就是我们发现 ebx 和 edi 也没有被使用。
整理代码,并进行少量注释之后,一个简洁的函数就产生了:
function ForLoopBASM15(Start, Stop : Integer) : Integer;
asm
  mov ecx, eax
  //Result := 0;
  xor eax,eax
  //for I := Start to Stop do
@LoopStart :
  cmp ecx, edx
  ja  @LoopEnd
  //Result := Result + I;
  add eax,ecx
  inc ecx
  jmp @LoopStart
@LoopEnd :
end;
It took a long time and a lot of optimisations to get here because we started with the non-optimised output from
the compiler. This long process illustrated the amount of work the compiler leaves to the optimizer.
Sometimes we did not use the algorithmic approach to optimization,
but we could have achieved the same result by doing it.
Instead of taking the same long road again with the function with loop inversion present,
we can cheat and compile the Pascal function with optimizations on.
The compiler might do all the optimisation we did.
经过了长时间和大量的优化我们终于得到这个函数。当然,这是因为我们在开始的时候没有选择优化选项。
这个长过程展示了编译器将大量的工作留给了优化器。
有时,我们不使用优化选项,但是我们可以自己优化它,从而得到相同优化结果。
我们可以打开编译器优化选项来优化代码,编译器内部可能也使用了我们用的循环倒置来进行这么长的优化步骤。

function ForLoopOpt(Start, Stop : Integer) : Integer;
var
I : Integer;
begin
{
}
Result := 0;
{
xor ecx,ecx
}
for I := Start to Stop do
{
sub edx,eax
jl +$08
inc edx
xchg eax,edx
}
  begin
   Result := Result + I;
   {
   add ecx,edx
   inc edx
   }
  end;
  {
  dec eax
  jnz -$06
  }
{
mov eax,ecx
}
end;
This time Delphi did a really nice job. Only two lines jump into our eyes as possibly redundant.
xchg eax, edx simply exchange the values in eax and edx and mov eax, ecx copy the Result into eax.
Both lines are outside the loop and make little harm.
这次 Delphi 做的真实非常的好。在我们看来,只有两行冗余的跳转代码。
xchg 简单的交换 eax, edx 的值,mov eax, ecx 将结果复制到 eax 中。
这两行在循环的外面,会有一点效率上的损害。

We now have two functions - one with no loop optimizations and one with two.
To make things complete we need two more functions, one with loop inversion only and one with
the NoName variable optimization only.
In the beginning of the lesson we saw how to remove the two optimisations and this is what
I have done to get to the last 2 functions.
In the Delphi optimized function above, I optimized away the xchg instruction by swapping the use of
the two registers it exchanged.
Because we want to se the maximum effect of the loop optimizations I have removed the loop body code doing
the operation Result := Result + I;
Here are the four final functions
现在,我们有两个函数 - 没有循环优化有一个,循环优化的有两个。
为了完成我们要讲的,我们还需要两个函数。一个是只有循环倒置,一个是只有 NoName 变量优化.
在这课的开始我们已经讲了如何进行这两种优化, 并写了两个函数.
在 Delphi 优化函数的基础上, 我通过使用两个寄存器进行交换来优化掉 xchg 指令.
因为我们想看看循环体 Result := Result + I 的最高循环效率。
下面是四个最终的函数:

function ForLoopNoLoopInverNoNoName(Start, Stop : Integer) : Integer;
asm
  mov ecx, eax
  //Result := 0;
  xor eax,eax
  //for I := Start to Stop do
@LoopStart :
  cmp ecx, edx
  ja  @LoopEnd
  inc ecx
  jmp @LoopStart
@LoopEnd :
end;
循环由 4 个指令组成。cmp, ja, inc 和 jmp。
在 P4 上这些指令的潜伏期和吞吐量是:cmp 0.5/0.5, ja X/0.5, inc 0.5/1 and jmp X/0.5 。
X 表示 "潜伏期不能应用于这条指令"。
累加潜伏期,我们得到:0.5 + X + 0.5 + X = ? 时钟周期
function ForLoopLoopInverNoNoName(Start, Stop : Integer) : Integer;
asm
  mov ecx, eax
  //Result := 0;
  xor eax,eax
  //for I := Start to Stop do
  cmp ecx, edx
  ja  @LoopEnd
@LoopStart :
  inc ecx
  cmp ecx, edx
  jbe @LoopStart
@LoopEnd :
end;
This loop consists of 3 instructions also with unknown sum of latency.
这个循环由三条指令组成,也不知道潜伏期的和。
function ForLoopNoLoopInverNoName(Start, Stop : Integer) : Integer;
asm
  //Result := 0;
  xor ecx,ecx
  //for I := Start to Stop do
  sub edx,eax
  cmp edx, 0
@LoopStart :
  jz  @LoopEnd
  inc eax
  dec edx
  jmp @LoopStart
@LoopEnd :
  mov eax,ecx
end;
This loop consists of 4 instructions also with unknown sum of latency.
We observe that the two inc/dec instructions are able to execute in parallel.
Because the dec NoName instruction is not followed by the conditional jmp it would look
like we throw away the benefit of not needing a cmp or test instruction to set the flags,
but the jmp instruction does not change the flags and they are valid when we reach the jz instruction at the top of the loop.
Only at the first iteration a cmp edx,0 instruction is needed.
这个循环由 4 条指令组成,也不知道潜伏期的和。
我们发现 inc/dec 指令能够并行执行。
因为 dec NoName 指令后面不是一个条件 jmp,因此不需要一个 cmp 或 test 指令来设置标志。
又因为 jmp 指令不能改变标志, 因此当到达循环的开始 jz 指令时,它们都是正确的。
只是在迭代的开始需要一个 cmp edx,0 指令。  

function ForLoopLoopInverNoName(Start, Stop : Integer) : Integer;
asm
  //Result := 0;
  xor ecx,ecx
  //for I := Start to Stop do
  sub edx,eax
  jl @LoopEnd
  inc edx
@LoopStart :
  inc eax
  dec edx
  jnz @LoopStart
@LoopEnd :
  mov eax,ecx
end;
这个循环由 3 条指令组成,也不知道潜伏期的和。
这也有一对独立的 inc/dec。
This is the simple benchmark I have used to find the performance of the 4 functions
我使用下面的基准测试方法来检测 4 函数的性能。
const
LOOPSTART : Integer = 1;
LOOPEND : Integer = 2000000000;
LOOPEND2 : Integer = 20;
CLOCK : Double = 1920E6;
var
Starttime, Endtime, Runtime : TDateTime;
I, LoopResult : Integer;
RunTimeSec, NoOfLoopsPerSec, NoOfLoops, ClockCount, LoopEnd2Float, LoopEndFloat, LoopStartFloat : Double;
begin
Starttime := Time;
for I := 1 to LOOPEND2 do
  begin
   LoopResult := ForLoopNoLoopInverNoName(LOOPSTART, LOOPEND);
  end;
Endtime := Time;
Runtime := Endtime - Starttime;
CEdit.Text := IntToStr(LoopResult);
RuntimeEdit4.Text := TimeToStr(Runtime);
RunTimeSec := RunTime*24*60*60;
LoopEnd2Float := LOOPEND2;
LoopEndFloat := LOOPEND;
LoopStartFloat := LOOPSTART;
NoOfLoops := LoopEnd2Float * (LoopEndFloat - LoopStartFloat);
NoOfLoopsPerSec := NoOfLoops / RunTimeSec;
ClockCount := CLOCK / NoOfLoopsPerSec;
ClockCountEdit4.Text := FloatToStrf(ClockCount, ffFixed, 9, 1);
end;
Results on P4 1920 are 在 P4 1920 上的结果
No Loop Inversion and No NoName variable    00:00:55  (2.7 Clock cycles)
无循环倒置,无未命名变量
Loop Inversion but No NoName variable       00:00:39  (1.9 Clock cycles)
循环倒置,无未命名变量
No Loop Inversion but NoName variable       00:01:02  (3.0 Clock cycles)
无循环倒置,有未命名变量
Loop Inversion + NoName                     00:00:41  (2.0 Clock cycles)
循环倒置,有未命名变量
Results on P3 1400 are 在 P3 1400 上的结果
No Loop Inversion and No NoName variable    00:01:26  (3.0 Clock cycles)
无循环倒置,无未命名变量
Loop Inversion but No NoName variable       00:01:26  (3.0 Clock cycles)
循环倒置,无未命名变量
No Loop Inversion but NoName variable       00:01:55  (4.0 Clock cycles)
无循环倒置,有未命名变量
Loop Inversion + NoName                     00:01:26  (3.0 Clock cycles)
循环倒置,有未命名变量

Of course the clock count numbers should be integers.
On P4 half cycles are possible due to the double-clocked ALU.
Our timings are not as good as we could wish, but for comparing the performance of the loops they are OK.
Conclusion on P4. Apply loop inversion only or loop inversion with NoName variable optimization.
Conclusion on P3. Do not apply the NoName variable optimization alone.
Conclusion on both targets. Apply both optimizations as Delphi does.
Also observe how efficient P4 is on this code.
当然,时间计数值应该是整数。
由于双时钟的 ALU, 因此在 P4 上可能有半个时钟周期。
我们的测试没有我们想象的那样好,但是比较循环的执行效率足够了。
在 P4 上,只应用了循环倒置和使用未命名(NoName)变量的循环倒置优化。
在 P3 上,没有单独的应用未命名变量优化。
结果是两个平台上的。Delphi 做了这两种优化。用这个代码也可以观察在 P4 上运行的效率。

2007-5-2 11:24 Passion
有关 latency/throughput 的补充解释:

某条指令的 latency (潜伏期、延迟),外头解释的是“完全执行一个指令所需的时钟周期”,可能包括取指,排队等额外操作。
而某条Throughput (吞吐量)包括两种含义:
  
     第一种:执行一条指令所需的最少时钟周期数,越少越好。执行的速度越快,下一条指令和它抢占资源的机率也越少。   
     第二种:在一定时间内可以执行最多指令数,当然是越大越好。

本文中的应该是第一种解释。

2007-5-2 11:27 Passion
咳贴重复了,第一课里头你就贴了。:lol

页: [1]


Powered by Discuz! Archiver 5.0.0  © 2001-2006 Comsenz Inc.