Lesson 4
<<BASM ѧ>>  4 

http://www.cnpack.org
QQ Group: 130970
룺SkyJacker
汾ݸ
״̬δУ
ʱ䣺2007

In this lesson we will learn about branching, illustrated by an if-else construct. 
Conditional floating point move will also be introduced.
The example function of this lesson is the Min function from the Delphi Math unit.
һ,ǽѧϰ֧䣬ͨһ if-else ṹ˵
ͬʱܸתơ
һεӺ Delphi Math Ԫе Min 

function Min1(const A, B: Single) : Single;
begin
 if A < B then
  Result := A
 else
  Result := B;
end;
The compiler generated asm for this function looks like this
£
function Min2(const A, B: Single) : Single;
begin
 {
 00452458 55               push ebp
 00452459 8BEC             mov ebp,esp
 0045245B 51               push ecx
 }
 if A < B then
 {
 0045245C D9450C           fld dword ptr [ebp+$0c]
 0045245F D85D08           fcomp dword ptr [ebp+$08]
 00452462 DFE0             fstsw ax
 00452464 9E               sahf
 00452465 7308             jnb +$08
 }
  Result := A
  {
  00452467 8B450C           mov eax,[ebp+$0c]
  0045246A 8945FC           mov [ebp-$04],eax
  0045246D EB06             jmp +$06
  }
 else
  Result := B;
  {
  0045246F 8B4508           mov eax,[ebp+$08]
  00452472 8945FC           mov [ebp-$04],eax
  }
  {
  00452475 D945FC           fld dword ptr [ebp-$04]
  00452478 59               pop ecx
  00452479 5D               pop ebp
  }
end;

This time I included the address and opcode columns, because we need them later.
Lets analyze it line by line, like we always do.
 һҸ˵ַͻУΪõ
֮ǰһһз
function Min3(const A, B: Single) : Single;
begin
 {
 push ebp                     // Save ebp on stack   ebp
 mov ebp,esp                  // New basepointer is the old stackpointer ջָ븳ַָ  
 push ecx                     // subtract 4 from esp  esp  4 
 }
 if A < B then
 {
 fld dword ptr [ebp+$0c]      // Load A on FP stack  A Ĵջ
 fcomp dword ptr [ebp+$08]    // FP compare A to B and pop A from stack A  B Ƚ֮󣬽 A  FP е
 fstsw ax                     // Store FP statusword in ax  FP ״ִ̬ ax
 sahf                         // Store ah into EFlags register 洢 ah ־Ĵ
 jnb +$08                     // If not below jump 8 bytes forward Сڣǰ 8 ֽ
 }
  Result := A
  {
  mov eax,[ebp+$0c]           // Copy A into eax  A  eax 
  mov [ebp-$04],eax           // Copy A into stackframe A ջ
  jmp +$06                    // Jmp 6 bytes forward ǰ 6 ֽ
  }
 else
  Result := B;
  {
  mov eax,[ebp+$08]           // Copy B into eax B  eax
  mov [ebp-$04],eax           // Copy B into stackframe  B ջ
  }
  {
  fld dword ptr [ebp-$04]     // Load A or B from stackframe onto FP stack װ A  B  FP ջ
  pop ecx                     // Add 4 to espesp  4   
  pop ebp                     // Restore ebp ָ ebp
  }
end;

This time I commented every line of BASM. Read the details there.
The first new instruction introduced by this example is fcomp. F says floating-point instruction as always.
Com says compare and p says pop FP stack. Fcom compares two floating-point values and sets the condition code
flags of the floating-point unit, named C0, C1, C2 and C3.
һעÿһ BASMԿϸ˵
еһָܵ fcomp
F ʾָCom ʾȽϣp ʾӸĴջ
FCom ȽñΪ C0, C1, C2, C3ĸԪ־ 

These flags are the equivalents of the EFlags register of the CPU.
The flags are checked by conditional jump instructions, which jump or not depending on the type of jump and
the status of the flags.
Щ־ͬ CPU ı־Ĵ
תָЩ־תתͺЩ־״̬

Conditional jump instructions check the CPU flags and not the FPU flags and therefore it is
necessary to copy the flags from the FPU to the CPU.
תָ CPU ı־ FPU ı־Ҫ־ FPU  CPU
  
This is done by the next two instructions.
fstsw stores the FP flags into the ax register and sahf copy the 8 bits from ah to the EFlags register.
This is a long way for the flags to travel before they can be used by the jnb instruction.
Jnb is short for jump not below. In the Intel SW Developers Manual Vol 2 at page 394 there is a table with all
the different jump instructions and a description of the flags they test.
ָʵܡ
fstsw  FP ־ ax Ĵsahf  ah  8 λ״̬־Ĵ
Ǳ jnb ָʹ֮ǰһȽϳĲ衣
Jnb һתת Intel ֲ 2 ĵ 394 ҳһű
תָĲͬԼı־Ĵ

Here it is seen that jnb jumps if CF=1 and ZF=1.
Try tracing through the code with the FPU view and the CPU view open.
Watch how the FPU flags are set, their values are copied into ah and then copied to the CPU EFlags register.
If the jnb jump is not taken execution continues in the line after it.
This is the if part of the if-else construct.
If the jump is taken execution is continued at the line 8 bytes ahead.
This is at the start of the else part. The if and the else part is very much the same.
 CF =1  ZF =1 ʱ jnb ת
ͨ FPU ں CPU ڸٴ롣
۲ FPU ״̬ãǵֵο ahٿ CPU ״̬Ĵ
 jnb תûִУôִָ֮ if-else ṹ if ֡
 jnb תִˣôһǰ 8 ֽڴִС
 else ֵĿʼif  else ַǳơ

As seen in the Pascal code A or B is copied to the Result variable depending on the if condition.
Instead of copying A or B directly on the top of the FP stack, which is the place of a FP Result from a function
due to the register calling convention, the Delphi compiler chose to use the stack as a temporary location.
The fld dword ptr [ebp-$04] instruction copies the result in the right place.
 Pascal пԿA  B  if 
Ǵ FP ջĶֱӸ A  BջɼĴԤĽDelphi ѡʹöջΪʱر
fld dword ptr [ebp-$04] װָĽ

Observe that an unconditional jump is needed at the end of the if block such that execution does not continue in the else block. 
This way there is a jump no matter with branch is taken. A short word on branch prediction. 
Branch prediction comes in two flavours, static and dynamic. The first time a branch is executed the CPU can have no
 knowledge about the probability it will be taken or not. In this situation it uses the static predictor, 
 which says that forward jumps are not taken and backward branches are taken.
Ȼִ else 飬 if ĽβҪһתָ
Ƿ֧һתǷ֧ԤеĶ֡
֧Ԥַʽ̬ԤͶ̬Ԥ⡣
֧һʼִУCPU ֪֧ǷִܹС
£ʹþ̬Ԥ⣬ʾǰתִУתִ֧С

Our example jump no 1 is a forward branch and it will be predicted not taken the first time through.
If we have knowledge about the values A and B, we can use it to code the if-else such that
the if part is the most often executed and static prediction is optimized.
An unconditional jump needs no prediction - it is always taken ;-)
A backward jump is often part of a loop and most loops will run more than once (why else have a loop?).
This way it makes sense to statically predict backward jumps taken.
Dynamic branch prediction tries to accumulate knowledge about the probability that a certain jump is taken
or not and then predict it correctly as often as possible.
еת 1 ǰ֧һβ
֪ A  B ֵǿʹ if-else 룬if ַִеĴ룬̬ԤⱻŻ 
һתҪԤ - Ǳִ :)
һתѭеһ֣󲿷ѭж(Ϊʲô else һѭ)
ַʽھ̬ԤתȽ塣
̬֧ԤۼĳתǷִУȻ󾡿ܵȷԤ

Now it is time to transform the function into a pure asm one.
ǰѺתɴ asm ʱˡ
function Min4(const A, B: Single) : Single;
asm
 //push  ebp
 //mov   ebp,esp
 push  ecx
 //if A < B then
 fld   dword ptr [ebp+$0c]
 fcomp dword ptr [ebp+$08]
 fstsw ax
 sahf
 jnb   @ElseBegin
 //Result := A
 mov   eax,[ebp+$0c]
 mov   [ebp-$04],eax
 jmp   @ElseEnd
 //else
 //Result := B;
 @ElseBegin :
 mov   eax,[ebp+$08]
 mov   [ebp-$04],eax
 @ElseEnd :
 fld   dword ptr [ebp-$04]
 pop   ecx
 //pop   ebp
end;
The new thing this time is that we need some labels.
Our analyse of the function made it clear where we jumped to when,I hope.
In general when placing labels it is a good thing to use the knowledge we have about the structure of the code.
You can open the FPU view and just trace through the code and see where you go when jumps are taken.
If you want to place labels without having to think, then use some math. Example follows
We have this jump
һҪһЩǩ
ϣѾطʲôʱת
ͨñǩʱʹ֪ĴṹǸϰߡ
Դ FPU ڣٴ룬鿴תִеĹ̡
ûκıǩʹѧķ
תӣ

00452465 7308             jnb +$08
This is the line after it һ֮
00452467 8B450C           mov eax,[ebp+$0c]
This is the line 8 byte after that line ж֮ 8 ֽڴ
0045246F 8B4508           mov eax,[ebp+$08]
Take the address of the line after the jump and add the jump offset to it and you have the jump target address.
ת֮ĵַתƫҪתĿַ

This is the math: 00452467 + 8 = 0045246F.   
ʽΪ 00452467 + 8 = 0045246F.
Why do we add the jump offset to the address of the line after the jump and not to the address of the line with the
jump?
Ϊʲôתָ֮ĵַƫתָһеĵַأ

It's time to optimise. ڿʼŻ
function Min5(const A, B: Single) : Single;
asm
 push  ecx
 //if A < B then
 fld   dword ptr [ebp+$0c]
 fcomp dword ptr [ebp+$08]
 fstsw ax
 sahf
 jnb   @ElseBegin
 //Result := A
 mov   eax,[ebp+$0c]
 mov   [ebp-$04],eax
 jmp   @ElseEnd
 //else
 //Result := B;
 @ElseBegin :
 mov   eax,[ebp+$08]
 mov   [ebp-$04],eax
 @ElseEnd :
 fld   dword ptr [ebp-$04]
 pop   ecx
end;
This is just a cleaned up version of the function. Lets change the push ecx, 
pop ecx instructions with instructions that manipulate the esp register directly and do not move data between 
ecx and the stack.
ֻ֮Ĵ롣
Ż push ecx, pop ecx ЩֱӲ esp Ĵָ ecx Ͷջ֮ÿдݡ
function Min6(const A, B: Single) : Single;
asm
 //push  ecx
 sub   esp, 4
 //if A < B then
 fld   dword ptr [ebp+$0c]
 fcomp dword ptr [ebp+$08]
 fstsw ax
 sahf
 jnb   @ElseBegin
 //Result := A
 mov   eax,[ebp+$0c]
 mov   [ebp-$04],eax
 jmp   @ElseEnd
 //else
 //Result := B;
 @ElseBegin :
 mov   eax,[ebp+$08]
 mov   [ebp-$04],eax
 @ElseEnd :
 fld   dword ptr [ebp-$04]
 //pop   ecx
 add esp, 4
end;
When the code was analyzed we saw that the flags were traveling a long way and 2 instructions were needed for this.
What about a floating point compare instruction that sets the EFlags directly?
This is the fcomi instruction, which was introduced in the P6 architecture.
 Lets use it and abandon those CPUs older then the Pro. It would be fair to believe that these lines
Ƿ־ָʱָʵֵġ
һֱָ CPU ı־Ĵ?
ͨ fcomi ָ,  P6 ṹб.

fcomp dword ptr [ebp+$08]
fstsw ax
sahf

could be substituted by this ԱĴ
fcomip dword ptr [ebp+$08]

The fcomi instruction does however not accept a memory reference as operand. 
Therefore it is necessary to load data before issuing it.
fcomi ָڴʹ֮ǰҪװݣ

fld    dword ptr [ebp+$0c]
fcomip st(0), st(1)

Then because we loaded data we have to remove them again with the ffree instruction.
An fcomipp instruction would have been nice to have.
֮Ϊװݣ˱ ffree ָƳ
fcomipp ָܸʺ

fld    dword ptr [ebp+$0c]
fcomip st(0), st(1)
ffree  st(0)

This is a hell of an optimization, substituting three lines with 3 other.
It is necessary to time the two versions to know whether it was an optimization.
Now the function looks like this.
һŻ 3 дԭ 3 С
Ҫ汾ĸǱȽЧʵġ
˺£

function Min7(const A, B: Single) : Single;
asm
   sub    esp, 4
   //if A < B then
   fld    dword ptr [ebp+$08]
   fld    dword ptr [ebp+$0c]
   fcomip st(0), st(1)
   ffree  st(0)
   //fstsw ax
   //sahf
   jnb    @ElseBegin
   //Result := A
   mov    eax,[ebp+$0c]
   mov    [ebp-$04],eax
   jmp    @ElseEnd
   //else
   //Result := B;
 @ElseBegin :
   mov    eax,[ebp+$08]
   mov    [ebp-$04],eax
 @ElseEnd :
   fld    dword ptr [ebp-$04]
   add    esp, 4
end;
Now we apply some logic thinking. Why copy the result around? Both A & B is on the stack for use by the comparison
by fcom and the result should be delivered on the stack. The only thing needed is to delete either 
A or B and leave the smallest one of them on the stack
ǽһЩ߼ϵ˼ΪʲôҪƽأ
A  B ͨջݽ fcom ȽϣҲڶջϴ䡣
ֻҪɾ A  BСڶջϡ

function Min8(const A, B: Single) : Single;
asm
   sub    esp, 4
   //if A < B then
   fld    dword ptr [ebp+$08]
   fld    dword ptr [ebp+$0c]
   //fcomip  st(0), st(1)
   fcomi  st(0), st(1)
   //ffree  st(0)
   jnb    @ElseBegin
   //Result := A
   //mov    eax,[ebp+$0c]
   //mov    [ebp-$04],eax
   ffree st(1)
   jmp    @ElseEnd
   //else
   //Result := B;
 @ElseBegin :
   //mov    eax,[ebp+$08]
   //mov    [ebp-$04],eax
   fxch
   ffree st(1)
 @ElseEnd :
   //fld    dword ptr [ebp-$04]
   add    esp, 4
end;
Fcomip is replaced with fcomi because we do not want to remove B from the stack at this point.
Ffree is deleted because it deleted A. Then all the lines that copied the result around are removed. 
In the if block A is the result and B should be deleted. B is in st(1) and ffree st(1) does the job.
In the else block we want to delete A and leave B in st(0). Swapping A and B do this with the fxch
instruction and then delete A in st(1) with ffree. Fxch is nearly free (takes 0 cycles), 
because if works by register renaming instead of actually copying data.

Fcomip  fcomi ΪǲӶջƳ BFfree ɾΪͷ A
еظĸƶɾ
 if У A ǽô B ᱻɾB  st(1) , ffree st(1) ɾ
 else Уɾ A,  B  st(0) С fxch ָ A  B ֵȻ
ffree ɾ st(1) е A
Fxch Ǻܽʡ( 0 )ΪĴǸݡ

function Min9(const A, B: Single) : Single;
asm
   //sub    esp, 4
   //if A < B then
   fld    dword ptr [ebp+$08]
   fld    dword ptr [ebp+$0c]
   fcomi  st(0), st(1)
   jnb    @ElseBegin
   //Result := A
   ffree st(1)
   jmp    @ElseEnd
   //else
   //Result := B;
 @ElseBegin :
   fxch
   ffree st(1)
 @ElseEnd :
   //add    esp, 4
end;

Now the stack frame is not needed and we delete the code that set it up.
ջĴҲҪ:

function Min10(const A, B: Single) : Single;
asm
   //if A < B then
   fld    dword ptr [ebp+$08]
   fld    dword ptr [ebp+$0c]
   fcomi  st(0), st(1)
   jnb    @ElseBegin
   //Result := A
   ffree st(1)
   jmp    @ElseEnd
   //else
   //Result := B;
 @ElseBegin :
   fxch
   ffree st(1)
 @ElseEnd :
end;
This is a pretty nice function, but somebody in the newsgroup has told us about conditional moves. 
fcmovnb is such a thing - floating point conditional move not below. It copies data from st(1)-st(7) to st(0) 
if the condition is fulfilled. The Eflags are tested for the condition. Fcmov was introduced in the P6
architecture as well fcomi was.

һǳĺˣ˸ҹƶָ
fcmovnb һ - 򸡵ƶ
㣬 st(1)-st(7) ݵ st(0)
Eflags ԡ Fcmov  fcomi һ P6 ϵṹбܡ
function Min11(const A, B: Single) : Single;
asm
   fld     dword ptr [ebp+$08]
   fld     dword ptr [ebp+$0c]
   fcomi   st(0), st(1)
   fcmovnb st(0), st(1)
   ffree   st(1)
end;

Instead of all the jumping around we copy A to the top of the stack where B is,
but only if A is the smaller of the two.Delete B and done.
This is a nice clean function with only one piece of redundancy left.
The compiler still push/ pop ebp even though it is not modified in the function.
 A СģҪ B Ƶջɾ B
һĺֻһࡣ
ΪȻ push/ pop ebpܺûиı ebp Ĵֵ