Lesson 5
<<BASM ѧ>>  5 

http://www.cnpack.org
QQ Group: 130970
룺SkyJacker
汾ݸ
״̬δУ
ʱ䣺2007

Welcome to lesson number 5. Today topic is loops. 
We will take a look on how the compiler implements for loops, and which optimizations it applies on them. 
We will evaluate the efficiency of these optimizations.
ӭ 5 Ρѭ
ǽһ±ִѭͬʱЩŻ
ǽЩŻЧʡ

function ForLoop(Start, Stop : Integer) : Integer;
var
 I : Integer;
begin
 Result := 0;
 for I := Start to Stop do
  begin
    Result := Result + I;
  end;
end;

This example function does nothing useful except giving us an example for loop to examine.
Lets see what the compiler translates the function into. 
In this lesson we try something new and compile with optimizations off.
ӳ˱ʾһѭ֮ûʲôôġ
һһЩ¶ѱŻѡˡ

function ForLoopNonOpt(Start, Stop : Integer) : Integer;
var
 I : Integer;
begin
{
push ebp
mov ebp,esp
add esp,-$14
mov [ebp-$08],edx
mov [ebp-$04],eax
}
 Result := 0;
 {
 xor eax,eax
 mov [ebp-$0c],eax
 }
 for I := Start to Stop do
 {
 mov eax,[ebp-$04]
 mov edx,[ebp-$08]
 sub edx,eax
 jl +$15
 inc edx
 mov [ebp-$14],edx
 mov [ebp-$10],eax
 }
  begin
    Result := Result + I;
    {
    mov eax,[ebp-$10]
    add [ebp-$0c],eax
    inc dword ptr [ebp-$10]
    }
  end;
  {
  dec dword ptr [ebp-$14]
  jnz -$0e
  mov eax,[ebp-$0c]
  }
 {
 mov esp,ebp
 pop ebp
 ret
 }
end;

It is seen that the compiler generates a lot of code when no or few optimizations are applied.
As usual the first 3 lines set up a stack frame. This time it is 20 byte big (16 hex).
The two next lines copy the Start and Stop variables unto the stack. Start is transferred in eax and
Stop is transferred in edx to the function due to the register calling convention.
The next two lines create a zero in eax and copy it to the stack frame at [ebp-$0c],
which is the place of the result variable.
ûнŻʱ롣
ǰöջһο 20 ֽڵĴС(16 hex)
и Start  Stop ջСݼĴԼ, Start ͨ eax ݣStop ͨ edx ݡ
Ȼд뽫 eax 㲢Ƶջ [ebp-$0c] УڴŽ

Then we are ready to enter the for loop.
Before entering a for loop it is necessary to test whether the loop will iterate at all. 
If Stop is bigger than Start this is the case.
Start and Stop is fetched from their stack locations into eax & edx.
We compute Stop-Start and if this is negative the loop should not be entered
and execution is transferred past the end of the loop by the jl (jump low) instruction.
Ȼ׼ѭ
ڽѭ֮ǰҪһѭǷҪ롣 Stop  Start Ҫ
Start  Stop ֵͨضջֵƵ eax & edx лá
 Stop-StartǸ򲻽ѭͨ jl(ת) ָѭĩβִָ

The next line increments Stop and then it is copied to location [ebp-$14].
We have no name for the local variable at this location.
The purpose of it also needs some further explanations.
It is a variable introduced by the compiler as an optimization and this is weird because
we compiled with optimizations off.
һ Stop  1ȻƵֲ [ebp-$14]
ûֲ
ĿҪĽ͡
ΪŻʹã˲˼飬ΪѾر˱ѡ

There is only one more line accessing the NoName variable and this is the dec dword ptr [ebp-$14] line.
dec dword ptr [ebp-$14] Ҳδľֲ(NoName)

This line decrements NoName by one at the end of each loop iteration
and the loop closing test is a test that it has not reached zero.
ÿѭĩβδı(NoName) 1ͨǷжѭǷ

The dec instruction sets the flags and the jnz jumps back to the top of the loop if NoName <> 0.
We would expect that I was used as the loop counter and that it was running from Start to Stop.
It is in fact doing so, but it is not used for control of the loop.
This is the sole purpose of NoName. 
dec ָñ־ NoName <> 0jnz תѭĿʼ
ѭ Start to Stop
ʵϣȷʵģѭĿơ NoName ΨһĿġ

The advantage of this is that it saves an instruction for comparing I to Stop.
There is also a cost and this is the dec NoName instruction.
On P4 the latency/throughput of cmp is  0.5/0.5 clock cycles and for dec they are 1/0.5.
Therefore we would expect this optimization to be a "deoptimization".
The latency and throughput numbers for P4 is found in the
"Intel Pentium 4 and Xeon Processor Optimization" manual from Intel.
ŵǣʡһ I  Stop ıȽָ
 P4 , cmp Ǳ/ 0.5/0.5 ʱڣdec ָ 1/0.5
ˣԤڵŻһ Ż
 Intel  "Intel Pentium 4  Xeon Ż"ֲ P4 ָǱں

Back to the mov [ebp-$10], eax line. It copies I to the stack.
The loop body consist of only one line of Pascal Result := Result + I;.
This is translated into 3 lines of ASM.
The first two lines are loading I into eax and then adding it to Result at the stack at [ebp-$0c].
The third line increments I.
This ended the explanation of the loop code and only two things are left.
The Result must be copied to eax, which is the register to use for returning the result from
a register calling convention function.
The last three lines remove the stack frame and return execution to the line after the line that called the function.
ص mov [ebp-$10], eax һУ I ջ
ѭֻһ Pascal  Result := Result + I;
 3  ASM 롣
ǰн I װ eaxȻۼӵջ [ebp-$0c]
 I  1
ֻʣͽѭˣ
뱻Ƶ eaxeax Ǳŷؽ
Ƴջҷصô˺ĴһС

As an exercise let us change the ASM code such that it follows the Pascal code and our understanding of a for loop.
We start by turning the function into a pure BASM one.
This is done by out commenting the Pascal code and "in commenting" the ASM code.
Defining the two labels LoopEnd and LoopStart is also necessary.
The two jumps are edited such that they jump to the labels.
ΪһϰǽԼ pascal ѭתΪ ASM 롣
ȽתΪ BASM
ע͵ pascal 룬ʹ ASM 롣
ҲҪǩ LoopEnd  LoopStartǩǸǵתġ

function ForLoopBASM1(Start, Stop : Integer) : Integer;
asm
 push ebp
 mov ebp,esp
 add esp,-$14
 mov [ebp-$08],edx
 mov [ebp-$04],eax
  //Result := 0;
  xor eax,eax
  mov [ebp-$0c],eax
  //for I := Start to Stop do
  mov eax,[ebp-$04]
  mov edx,[ebp-$08]
  sub edx,eax
  jl @LoopEnd
  inc edx
  mov [ebp-$14],edx
  mov [ebp-$10],eax
   //begin
   @LoopStart :
     //Result := Result + I;
     mov eax,[ebp-$10]
     add [ebp-$0c],eax
     inc dword ptr [ebp-$10]
   //end;
   dec dword ptr [ebp-$14]
   jnz @LoopStart
  @LoopEnd :
   mov eax,[ebp-$0c]
  mov esp,ebp
  pop ebp
  //ret
end;

The first thing we do is removing the NoName variable.
ȣƳ NoName 

function ForLoopBASM2(Start, Stop : Integer) : Integer;
asm
 push ebp
 push ebx                      //New 
 mov ebp,esp
 add esp,-$14
 mov [ebp-$08],edx
 mov [ebp-$04],eax
  //Result := 0;
  xor eax,eax
  mov [ebp-$0c],eax
  //for I := Start to Stop do
  mov eax,[ebp-$04]
  mov edx,[ebp-$08]
  sub edx,eax
  jl @LoopEnd
  //inc edx                    //NoName intialize NoName ʼ
  //mov [ebp-$14],edx          //NoName intialize NoName ʼ
  mov [ebp-$10],eax
   //begin
   @LoopStart :
     //Result := Result + I;
     mov eax,[ebp-$10]
     add [ebp-$0c],eax
     inc dword ptr [ebp-$10]
   //end;
   //dec dword ptr [ebp-$14]  //NoName decrement NoName һ
   mov ebx, [ebp-$10]         //New 
   mov ecx, [ebp-$08]         //New
   cmp ebx, ecx               //New 
   //jnz @LoopStart
   jbe @LoopStart             //New 
  @LoopEnd :
   mov eax,[ebp-$0c]
  mov esp,ebp
  pop ebx                     //New 
  pop ebp
  //ret
end;

The lines marked "New" are introduced to make I the loop control variable. 
The mov ebx, [ebp-$10] line copies I into ebx. The next line copies Stop into ecx. 
Then the line cmp ebx, ecx compare them and jbe @LoopStart transfer execution to the start of the loop if 
I is below Stop or equal to it. Because we use ebx and it is not free for use we remember to push and pop it.
We expect a for loop to evaluate the loop condition at the top. 
This test is split in two by the compiler implementation.
Before entering the loop it is tested that it will execute at least once and the actual loop closing test is done
at the bottom. This is an optimization technique called loop inversion. Now we change the loop such that 
this optimization is removed. Then we see what was the benefit of the optimization.
Ϊ "New"   I ѭƱ
mov ebx, [ebp-$10]  I Ƶ ebxһн Stop Ƶ ecx
cmp ebx, ecx бȽϣ I Сڵ Stop  jbe @LoopStart ִеѭĿʼ
Ϊʹ ebxܱɵʹãҪǵápush  pop 
һѭڿʼͼѭġ
Խִй̱Ϊ
ڽѭ֮ǰԣٱִһΣȻڽβжǷѭŻѭá
ǿʼ޸ûŻѭŻĺô
  
function ForLoopBASM4(Start, Stop : Integer) : Integer;
asm
 push ebp
 push ebx
 mov ebp,esp
 add esp,-$14
 mov [ebp-$08],edx
 mov [ebp-$04],eax
  //Result := 0;
  xor eax,eax
  mov [ebp-$0c],eax
  //for I := Start to Stop do
  mov eax,[ebp-$04]
  mov edx,[ebp-$08]
  //sub edx,eax
  //jl @LoopEnd
  mov [ebp-$10],eax
   //begin
   @LoopStart :
     mov ebx, [ebp-$10]
     mov ecx, [ebp-$08]
     cmp ebx, ecx
     ja  @LoopEnd
     //Result := Result + I;
     mov eax,[ebp-$10]
     add [ebp-$0c],eax
     inc dword ptr [ebp-$10]
   //end;
   //mov ebx, [ebp-$10]
   //mov ecx, [ebp-$08]
   //cmp ebx, ecx
   //jbe @LoopStart
   jmp @LoopStart
  @LoopEnd :
   mov eax,[ebp-$0c]
  mov esp,ebp
  pop ebx
  pop ebp
end;

The loop-closing test has been moved to the top and the test has been inverted.
At the place of the test there now is an unconditional jump to the top.
This jump is what the loop inversion optimization is all about.
In the unoptimized for loop there is two jumps and only one in the optimized one.
The test at the top that tested whether Start was above Stop is now redundant and is removed.
Before making any timing to evaluate the effect of the two optimisations it is a good idea to optimise away some, 
or all if possible, of the stack to register and register to stack moves.
ѭļⱻƵѭʼһת
ԭȼλһתѭʼ
ת˵ѭŻ
ûŻѭתŻѭֻһת
ڿͷĲѾǷ StartStop ˱Ƴ
ڱȽŻЧ֮ǰܾܵŻջĴĴջƶ

This process is called register allocation and is one of the most important optimisations on all architectures, 
but it is even more important on the Intel architecture because of the low number of available registers. 
If there is not a register available to all variables it is crucial which variables get a register.
The mov instructions inside the loop body are the most important ones to get rid of.
They are executed as many times as the number of loop iterations. 
The instructions outside the loop are only executed once. 
The variables used inside the loop should be allocated to registers first. This is I, Stop and Result. 
At this point we could be smart and take a look at the use of registers as temporaries. 
If a variable were always copied into the same temp register it would be smart to allocate this register for
 the variable.
Stop is in the edx register when we enter the function and this register is also used as temporary
register for it in all but two lines. This is the two lines of the loop test that we added.
Ĵ䣬ϵṹҪŻ֮һ
ΪĴ٣ Intel ϵṹԵúҪ
ĴеıĸüĴ൱Ҫġ
ѭڵ mov ָҪһҪõһĴ
ΪѭУҪִкܶΡ
ѭִָֻһΡ
ѭڲıӦȱĴ I, Stop  Result
һϣǿԶԽ۲һ¼ĴΪʱʹá
һǱƵͬʱĴķһĴ
ǽ뺯ʱ, Stop  edx УĴҲΪʱĴʹãֻС
ΪѭӵУ
      
Let us change ǰ
mov ecx, [ebp-$08]
cmp ebx, ecx
to           Ϊ
mov edx, [ebp-$08]
cmp ebx, edx

Eax is used by Start in the top of the function and by Result in the rest of the function.
If there is no overlap in usage we can allocate eax for Result as soon as Start has finished using it.
After Start is assigned to I (mov [ebp-$10], eax) it is not used any more and eax is free to use by Result,
if it was not for those lines where eax is used as temp for I.

Eax ںʼΪ Start ںβΪؽ
һ Start ,ûи eax Ĳ,ôԽ eax Ϊؽˡ
 Start ֵΪ I (mov [ebp-$10], eax) ֮ûЩ eax Ϊ I ʱԽ eax Ϊؽˡ

mov eax,[ebp-$10]
add [ebp-$0c],eax
inc dword ptr [ebp-$10]

After ecx got out of use by the last change we can use that as temp for I instead of eax.
 ecx ʹ֮ǿΪʱ I  eax 

mov ecx,[ebp-$10]
add [ebp-$0c],ecx
inc dword ptr [ebp-$10]

The summary of the first part of the register allocation is: Result in eax, I in ecx and Stop in edx.
Lets first change the lines with Stop. [ebp-$08] is replaced by edx everywhere.
Ĵĵһֵǣ  eax, I  ecx , Stop  edx С
޸ Stop ֣ edx 滻е [ebp-$08]

function ForLoopBASM6(Start, Stop : Integer) : Integer;
asm
 push ebp
 push ebx
 mov ebp,esp
 add esp,-$14
 //mov [ebp-$08],edx
 mov edx,edx
 mov [ebp-$04],eax
  //Result := 0;
  xor eax,eax
  mov [ebp-$0c],eax
  //for I := Start to Stop do
  mov eax,[ebp-$04]
  //mov edx,[ebp-$08]
  mov edx,edx
  mov [ebp-$10],eax
   //begin
   @LoopStart :
     mov ebx, [ebp-$10]
     //mov edx, [ebp-$08]
     mov edx, edx
     cmp ebx, edx
     ja  @LoopEnd
     //Result := Result + I;
     mov ecx,[ebp-$10]
     add [ebp-$0c],ecx
     inc dword ptr [ebp-$10]
   //end;
   jmp @LoopStart
  @LoopEnd :
   mov eax,[ebp-$0c]
  mov esp,ebp
  pop ebx
  pop ebp
end;
Then allocate ecx for I by replacing [ebp-$10] by ecx.
 ecx Ϊ I 滻 [ebp-$10] 
function ForLoopBASM7(Start, Stop : Integer) : Integer;
asm
 push ebp
 push ebx
 mov ebp,esp
 add esp,-$14
 mov edx,edx
 mov [ebp-$04],eax
  //Result := 0;
  xor eax,eax
  mov [ebp-$0c],eax
  //for I := Start to Stop do
  mov eax,[ebp-$04]
  mov edx,edx
  //mov [ebp-$10],eax
  mov ecx,eax
   //begin
   @LoopStart :
     //mov ebx, [ebp-$10]
     mov ebx, ecx
     mov edx, edx
     cmp ebx, edx
     ja  @LoopEnd
     //Result := Result + I;
     //mov ecx,[ebp-$10]
     mov ecx,ecx
     add [ebp-$0c],ecx
     //inc dword ptr [ebp-$10]
     inc ecx
   //end;
   jmp @LoopStart
  @LoopEnd :
   mov eax,[ebp-$0c]
  mov esp,ebp
  pop ebx
  pop ebp
end;
 eax Ž
Ϊ eax ںĿʼ˲ StartҪʼΪ㡣
 eax ûٱĿģҪһƵ eax
function ForLoopBASM8(Start, Stop : Integer) : Integer;
asm
 push ebp
 push ebx
 mov ebp,esp
 add esp,-$14
 mov edx,edx
 mov [ebp-$04],eax
  //Result := 0;
  xor eax,eax
  mov [ebp-$0c],eax
  //for I := Start to Stop do
  mov eax,[ebp-$04]
  mov edx,edx
  mov ecx,eax
  mov eax, [ebp-$0c]                //New
   //begin
   @LoopStart :
     mov ebx, ecx
     mov edx, edx
     cmp ebx, edx
     ja  @LoopEnd
     //Result := Result + I;
     mov ecx,ecx
     //add [ebp-$0c],ecx
     add eax,ecx
     inc ecx
   //end;
   jmp @LoopStart
  @LoopEnd :
   //mov eax,[ebp-$0c]
   mov eax,eax
  mov esp,ebp
  pop ebx
  pop ebp
end;
Because we were pretty smart when we chose the registers there is a lot of lines like mov eax, eax.
It is easy to see how redundant they are ;-). Let us remove them.
Ӧѡ mov eax, eax ļĴС
׷ǷࡣƳǣ
function ForLoopBASM9(Start, Stop : Integer) : Integer;
asm
 push ebp
 push ebx
 mov ebp,esp
 add esp,-$14
 //mov edx,edx
 mov [ebp-$04],eax
  //Result := 0;
  xor eax,eax
  mov [ebp-$0c],eax
  //for I := Start to Stop do
  mov eax,[ebp-$04]
  //mov edx,edx
  mov ecx,eax
  mov eax, [ebp-$0c]                
   //begin
   @LoopStart :
     mov ebx, ecx
     //mov edx, edx
     cmp ebx, edx
     ja  @LoopEnd
     //Result := Result + I;
     //mov ecx,ecx
     add eax,ecx
     inc ecx
   //end;
   jmp @LoopStart
  @LoopEnd :
   //mov eax,eax
  mov esp,ebp
  pop ebx
  pop ebp
end;
When optimizing ASM code there is generally two lines of thinking we can choose to follow.
We can think as the human beings we are and try to be smart, use whatever information we can get from the code.
We did some of this when we selected the registers here.
Another line of thought is trying to do it systematically as an optimizer/compiler has to.
This way we develop algorithms that can be coded in a tool.
This tool can later take over the most boring of optimizations, these we do over and over again.
Removal of the most obviously redundant line of code of all, the mov eax, eax, was an example of dead code removal,
which is basic to any optimiser.
Ż ASM ֿͨǡ
ǿԷǵĴǾܵĴЩлȡŻϢüĴ
һȥʵһϵͳŻһŻ
߿ܴ󲿷ַŻһ٣ٶЩ˹Ż
һŻҪʵֵĻܣɾеԵ࣬mov eax, eax һе͵Ч롣

In the top of the function we still have some references to the stack.
To get rid of these we allocate registersfor those variables too.
This time we just pick edi and esi, which are not used elsewhere.
Allocate esi for [ebp-$04] and edi for [ebp-$0c].
Because esi and edi must be preserved by the function we must pushand pop them.
ĿʼȻһЩԶջá
ΪȥЩҲΪЩĴ
ʱǿʹ edi, esi ǻûڱĵطʹá
Ϊ [ebp-$04]  esiΪ [ebp-$0c]  edi
Ϊ esi, edi Ǻģ˱ push  pop ǡ

function ForLoopBASM10(Start, Stop : Integer) : Integer;
asm
 push ebp
 push ebx
 push esi
 push edi
 mov ebp,esp
 add esp,-$14
 //mov [ebp-$04],eax
 mov esi,eax
  //Result := 0;
  xor eax,eax
  //mov [ebp-$0c],eax
  mov edi,eax
  //for I := Start to Stop do
  //mov eax,[ebp-$04]
  mov eax,esi
  mov ecx,eax
  //mov eax, [ebp-$0c]
  mov eax, edi
   //begin
   @LoopStart :
     mov ebx, ecx
     cmp ebx, edx
     ja  @LoopEnd
     //Result := Result + I;
     add eax,ecx
     inc ecx
   //end;
   jmp @LoopStart
  @LoopEnd :
  mov esp,ebp
  pop edi
  pop esi
  pop ebx
  pop ebp
end;
The stack frame is not used anymore and there is no need to set it up.
This removes 4 instructions. 
Then we observe that the two lines
ջٱʹãҪɾ 4 ָ
ǹ۲.

mov eax,esi
mov ecx,eax

Can be replaced by one.
Աһд棺
mov ecx, esi

This is an example of a simple copy propagation followed by dead code removal.
Any other lines do not use the value in eax than the next line that copies it back to ecx.
It is in fact immediately overwritten by the line mov eax, edi.
Therefore we can replace the second line by mov ecx,esi and remove the first one, which becomes dead.
ʱһƳЧ븴Ʒֳӡ
ûκһʹ eax ֵڵڶֱ ecx
ʵϣeax Ͼͱ mov eax, edi д
ˣ mov ecx,esi 滻ڶУɾЧĵһС
function ForLoopBASM11(Start, Stop : Integer) : Integer;
asm
 //push ebp
 push ebx
 push esi
 push edi
 //mov ebp,esp
 //add esp,-$14
 mov esi,eax
  //Result := 0;
  xor eax,eax
  mov edi,eax
  //for I := Start to Stop do
  //mov eax,esi
  //mov ecx,eax
  mov ecx, esi
  mov eax, edi
   //begin
   @LoopStart :
     //mov ebx, ecx
     //cmp ebx, edx
     cmp ecx, edx
     ja  @LoopEnd
     //Result := Result + I;
     add eax,ecx
     inc ecx
   //end;
   jmp @LoopStart
  @LoopEnd :
  //mov esp,ebp
  pop edi
  pop esi
  pop ebx
  //pop ebp
end;
The line xor eax, eax that initialises Result to zero can together with the line right
after it be moved some lines down closer to the place where eax is used for the first time.
It should not be moved into the loop that would change the logic of the function,
but just before the line before loopStart.
This removes the need for copying eax into edi and back into eax again in the line just before
the comment line //for I := Start to Stop do, and in the line before the out commented begin.
xor eax, eax ʼΪ㡣Ӧڵһʹ eax ĵط
Ӧ÷ѭУΪı亯߼ loopStart ֮ǰ
ҪƳע //for I := Start to Stop do ֮ǰĸ eax  edi Ȼٸƻ eax Ĵ롣

function ForLoopBASM12(Start, Stop : Integer) : Integer;
asm
 push ebx
 push esi
 push edi
 mov esi,eax
  //for I := Start to Stop do
  mov ecx, esi
  //Result := 0;
  xor eax,eax
  //mov edi,eax
  //mov eax, edi
   //begin
   @LoopStart :
     cmp ecx, edx
     ja  @LoopEnd
     //Result := Result + I;
     add eax,ecx
     inc ecx
   //end;
   jmp @LoopStart
  @LoopEnd :
  pop edi
  pop esi
  pop ebx
end;
After having cleaned up we spot two lines of mov which together copy eax to ecx via esi.
This leaves a copy of eax in esi, which is not used. Therefore one that moves eax directly into ecx can replace
these two lines. This is also copy propagation + dead code removal.
֮Ƿ mov ָǽ eax ͨ esi Ƶ ecx
 esi һ eax Ŀûбʹá
ˣֱӽ eax  ecx ķ滻С
ҲһƷֳ + ЧƳ

function ForLoopBASM13(Start, Stop : Integer) : Integer;
asm
 push ebx
 //push esi
 push edi
 //mov esi,eax
  //for I := Start to Stop do
  //mov ecx, esi
  mov ecx, eax
  //Result := 0;
  xor eax,eax
   //begin
   @LoopStart :
     cmp ecx, edx
     ja  @LoopEnd
     //Result := Result + I;
     add eax,ecx
     inc ecx
   //end;
   jmp @LoopStart
  @LoopEnd :
  pop edi
  //pop esi
  pop ebx
end;
After having removed the only use of esi there is no need to push and pop it.
Ƴֻ֮Ҫʹ esiҪ push  pop 

function ForLoopBASM14(Start, Stop : Integer) : Integer;
asm
 //push ebx
 //push edi
  //for I := Start to Stop do
  mov ecx, eax
  //Result := 0;
  xor eax,eax
   //begin
   @LoopStart :
     cmp ecx, edx
     ja  @LoopEnd
     //Result := Result + I;
     add eax,ecx
     inc ecx
   //end;
   jmp @LoopStart
  @LoopEnd :
  //pop edi
  //pop ebx
end;
We also, a little late perhaps, observe that ebx and edi is not used either. After having cleaned up
 and relocated the comments a little, there is a nice clean function as a result.
һܣǷ ebx  edi Ҳûбʹá
룬ע֮һĺͲˣ

function ForLoopBASM15(Start, Stop : Integer) : Integer;
asm
  mov ecx, eax
  //Result := 0;
  xor eax,eax
  //for I := Start to Stop do
 @LoopStart :
  cmp ecx, edx
  ja  @LoopEnd
  //Result := Result + I;
  add eax,ecx
  inc ecx
  jmp @LoopStart
 @LoopEnd :
end;

It took a long time and a lot of optimisations to get here because we started with the non-optimised output from
the compiler. This long process illustrated the amount of work the compiler leaves to the optimizer.
Sometimes we did not use the algorithmic approach to optimization,
but we could have achieved the same result by doing it.
Instead of taking the same long road again with the function with loop inversion present,
we can cheat and compile the Pascal function with optimizations on.
The compiler might do all the optimisation we did.
˳ʱʹŻڵõȻΪڿʼʱûѡŻѡ
չʾ˱ĹŻ
ʱǲʹŻѡǿԼŻӶõͬŻ
ǿԴ򿪱ŻѡŻ룬ڲҲʹõѭôŻ衣

function ForLoopOpt(Start, Stop : Integer) : Integer;
var
 I : Integer;

begin
{
}
 Result := 0;
 {
 xor ecx,ecx
 }
 for I := Start to Stop do
 {
 sub edx,eax
 jl +$08
 inc edx
 xchg eax,edx
 }
  begin
   Result := Result + I;
   {
   add ecx,edx
   inc edx
   }
  end;
  {
  dec eax
  jnz -$06
  }
{
mov eax,ecx
}
end;
This time Delphi did a really nice job. Only two lines jump into our eyes as possibly redundant.
xchg eax, edx simply exchange the values in eax and edx and mov eax, ecx copy the Result into eax.
Both lines are outside the loop and make little harm.
 Delphi ʵǳĺáǿֻת롣
xchg 򵥵Ľ eax, edx ֵmov eax, ecx Ƶ eax С
ѭ棬һЧϵ𺦡

We now have two functions - one with no loop optimizations and one with two.
To make things complete we need two more functions, one with loop inversion only and one with
the NoName variable optimization only.
In the beginning of the lesson we saw how to remove the two optimisations and this is what
I have done to get to the last 2 functions.
In the Delphi optimized function above, I optimized away the xchg instruction by swapping the use of
the two registers it exchanged.
Because we want to se the maximum effect of the loop optimizations I have removed the loop body code doing
the operation Result := Result + I;
Here are the four final functions

ڣ - ûѭŻһѭŻ
ΪҪģǻҪһֻѭ,һֻ NoName Ż.
εĿʼѾνŻ, д.
 Delphi ŻĻ, ͨʹĴнŻ xchg ָ.
Ϊ뿴ѭ Result := Result + I ѭЧʡ
ĸյĺ

function ForLoopNoLoopInverNoNoName(Start, Stop : Integer) : Integer;
asm
  mov ecx, eax
  //Result := 0;
  xor eax,eax
  //for I := Start to Stop do
 @LoopStart :
  cmp ecx, edx
  ja  @LoopEnd
  inc ecx
  jmp @LoopStart
 @LoopEnd :
end;
ѭ 4 ָɡcmp, ja, inc  jmp
 P4 ЩָǱںǣcmp 0.5/0.5, ja X/0.5, inc 0.5/1 and jmp X/0.5 
X ʾ "ǱڲӦָ"
ۼǱڣǵõ0.5 + X + 0.5 + X =  ʱ

function ForLoopLoopInverNoNoName(Start, Stop : Integer) : Integer;
asm
  mov ecx, eax
  //Result := 0;
  xor eax,eax
  //for I := Start to Stop do
  cmp ecx, edx
  ja  @LoopEnd
 @LoopStart :
  inc ecx
  cmp ecx, edx
  jbe @LoopStart
 @LoopEnd :
end;
This loop consists of 3 instructions also with unknown sum of latency.
ѭָɣҲ֪Ǳڵĺ͡

function ForLoopNoLoopInverNoName(Start, Stop : Integer) : Integer;
asm
  //Result := 0;
  xor ecx,ecx
  //for I := Start to Stop do
  sub edx,eax
  cmp edx, 0
 @LoopStart :
  jz  @LoopEnd
  inc eax
  dec edx
  jmp @LoopStart
 @LoopEnd :
  mov eax,ecx
end;
This loop consists of 4 instructions also with unknown sum of latency. 
We observe that the two inc/dec instructions are able to execute in parallel. 
Because the dec NoName instruction is not followed by the conditional jmp it would look
like we throw away the benefit of not needing a cmp or test instruction to set the flags, 
but the jmp instruction does not change the flags and they are valid when we reach the jz instruction at the top of the loop.
Only at the first iteration a cmp edx,0 instruction is needed.
ѭ 4 ָɣҲ֪Ǳڵĺ͡
Ƿ inc/dec ִָܹС
Ϊ dec NoName ָ治һ jmp˲Ҫһ cmp  test ָñ־
Ϊ jmp ָܸı־, ˵ѭĿʼ jz ָʱǶȷġ
ֻڵĿʼҪһ cmp edx,0 ָ  

function ForLoopLoopInverNoName(Start, Stop : Integer) : Integer;
asm
  //Result := 0;
  xor ecx,ecx
  //for I := Start to Stop do
  sub edx,eax
  jl @LoopEnd
  inc edx
 @LoopStart :
  inc eax
  dec edx
  jnz @LoopStart
 @LoopEnd :
  mov eax,ecx
end;
ѭ 3 ָɣҲ֪Ǳڵĺ͡
ҲһԶ inc/dec

This is the simple benchmark I have used to find the performance of the 4 functions
ʹĻ׼Է 4 ܡ

const
 LOOPSTART : Integer = 1;
 LOOPEND : Integer = 2000000000;
 LOOPEND2 : Integer = 20;
 CLOCK : Double = 1920E6;

var
 Starttime, Endtime, Runtime : TDateTime;
 I, LoopResult : Integer;
 RunTimeSec, NoOfLoopsPerSec, NoOfLoops, ClockCount, LoopEnd2Float, LoopEndFloat, LoopStartFloat : Double;

begin
 Starttime := Time;
 for I := 1 to LOOPEND2 do
  begin
   LoopResult := ForLoopNoLoopInverNoName(LOOPSTART, LOOPEND);
  end;
 Endtime := Time;
 Runtime := Endtime - Starttime;
 CEdit.Text := IntToStr(LoopResult);
 RuntimeEdit4.Text := TimeToStr(Runtime);
 RunTimeSec := RunTime*24*60*60;
 LoopEnd2Float := LOOPEND2;
 LoopEndFloat := LOOPEND;
 LoopStartFloat := LOOPSTART;
 NoOfLoops := LoopEnd2Float * (LoopEndFloat - LoopStartFloat);
 NoOfLoopsPerSec := NoOfLoops / RunTimeSec;
 ClockCount := CLOCK / NoOfLoopsPerSec;
 ClockCountEdit4.Text := FloatToStrf(ClockCount, ffFixed, 9, 1);
end;

Results on P4 1920 are  P4 1920 ϵĽ
No Loop Inversion and No NoName variable    00:00:55  (2.7 Clock cycles)
ѭãδ
Loop Inversion but No NoName variable       00:00:39  (1.9 Clock cycles)
ѭãδ
No Loop Inversion but NoName variable       00:01:02  (3.0 Clock cycles)
ѭãδ
Loop Inversion + NoName                     00:00:41  (2.0 Clock cycles)
ѭãδ

Results on P3 1400 are  P3 1400 ϵĽ
No Loop Inversion and No NoName variable    00:01:26  (3.0 Clock cycles)
ѭãδ
Loop Inversion but No NoName variable       00:01:26  (3.0 Clock cycles)
ѭãδ
No Loop Inversion but NoName variable       00:01:55  (4.0 Clock cycles)
ѭãδ
Loop Inversion + NoName                     00:01:26  (3.0 Clock cycles)
ѭãδ

Of course the clock count numbers should be integers. 
On P4 half cycles are possible due to the double-clocked ALU. 
Our timings are not as good as we could wish, but for comparing the performance of the loops they are OK. 
Conclusion on P4. Apply loop inversion only or loop inversion with NoName variable optimization. 
Conclusion on P3. Do not apply the NoName variable optimization alone.
Conclusion on both targets. Apply both optimizations as Delphi does.
Also observe how efficient P4 is on this code.
ȻʱֵӦ
˫ʱӵ ALU,  P4 Ͽаʱڡ
ǵĲûãǱȽѭִЧ㹻ˡ
 P4 ϣֻӦѭúʹδ(NoName)ѭŻ
 P3 ϣûеӦδŻ
ƽ̨ϵġDelphi ŻҲԹ۲ P4 еЧʡ

