<<BASM Tutorial>> Lesson 7

http://www.cnpack.org
QQ Group: 130970
Translator: SkyJacker
Version: draft
Status: not proofread
Date: 2007

Lesson 7
Welcome to lesson number 7. Today's subject is floating point BASM.
This was also the subject of an earlier lesson, but this lesson will add new information.
We will look at how to code scalar SSE2 code and how instructions are scheduled in the fp pipelines.
 
Today's example function evaluates a third order polynomial.
function ArcSinApprox1a(X, A, B, C, D : Double) : Double;
begin
 Result := A*X*X*X + B*X*X + C*X + D;
end;

Instead of just analyzing and optimizing this function we will see an actual use of it.
A third order polynomial can approximate the ArcSin function on its entire interval, [-1;1], 
with a maximal absolute error of 0.086.
This is not impressive but what we develop in this lesson will extend to higher order polynomials
in a straightforward manner, yielding higher precision.

The parameters A, B, C and D define the shape of the curve, and the values that fit the
ArcSin approximation with minimal error must be found.
For this purpose we develop an optimizer, which is also used as the benchmark.
Because ArcSin(0) = 0 we immediately see that D=0, and D can be left out of the optimization.

We also know that ArcSin is an odd function and therefore the second order term B*X*X is of no use
in the approximation.
This is because a second order term is even and has symmetry around the Y-axis.
Odd order functions are antisymmetric, with f(X) = -f(-X).
All this means that our function could be reduced to Result := A*X*X*X + C*X;

We do however not do that, because it is more illustrative to use the full function.
ArcSin is a special case and we want to be as general as possible.

Function number 1a has 6 multiplications and 3 additions. Writing it in Horner form
Result := ((A*X + B)*X + C)*X + D;

reduces this to 3 multiplications and 3 additions.

Another form is
Result := (A*X + B)*(X*X)+(C*X + D);
which has 4 multiplications and 3 additions.

On modern processors, how much parallelism can be extracted from the formula is as important
as how many multiplications and additions it has.
Modern processors such as the AMD Athlon, Intel P4 and P3 have pipelines.
Pipelines are a necessity on processors running at high frequencies, because the amount of work for an addition,
subtraction, multiplication or division cannot be done in a single clock cycle.

On P4 there is a pipeline called FP_ADD that handles addition and subtraction.
This pipeline has 5 stages, which means that the process of performing an addition or subtraction
is broken down into 5 sub tasks.
Therefore addition and subtraction is done in 5 clock cycles.

The advantage of having a pipeline is that even though it takes 5 cycles to complete an addition a new one
can be started in every clock cycle.
This is because the first add in a series leaves the first stage of the pipeline in the second clock cycle
and then this stage can accept add number 2.
If we have this series of adds the first one will leave the pipeline in cycle 5, number 2 will leave in cycle 6 etc.
Throughput is one add per cycle.

Parallelism is achieved by having up to 5 additions/subtractions in flight in the pipeline at the same time.
The drawback is that if the second of two consecutive additions needs the output from the first addition it has to wait
for the first one to complete. We say that there is a data dependency between the two instructions and we see the
full latency of the additions, which is 2 times 5 cycles.
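A quick worked example with the numbers above: five independent additions issued back to back enter the FP_ADD pipeline in cycles 1 to 5, and the last result appears in cycle 9, roughly latency plus 4 cycles in total. Two dependent additions, on the other hand, take the full 2 * 5 = 10 cycles, because the second one cannot enter the pipeline before the first one has left it.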

Let's use our function as an example to see how pipelines work.
Result := A*X*X*X + B*X*X + C*X + D;

It is seen that the 4 terms can be evaluated in parallel and then be added as the final act. Of course term number 4
is not "evaluated". A*X is the first operation ready to be scheduled to run in the F_MUL pipeline. The latency for
FMUL on P4 is 7 cycles and A*X will be ready in cycle 7.

FMUL has a throughput of 2 cycles. From this we see that FMUL is not fully pipelined.
The pipeline will accept a new instruction in cycle 3 and not in cycle 2.
B*X is the second instruction ready to execute and it will start in cycle 3.
In cycle 5 the pipeline is again ready to receive a new instruction and this is C*X.
In cycle 7 A*X is complete and (A*X)*X can start in cycle 8.
In cycle 10 B*X is finished and (B*X)*X will start.

In cycle 12 C*X is ready to go to the F_ADD pipeline to be added to D.
In cycle 15 (A*X)*X is finished and (A*X*X)*X can start.
In cycle 17 (B*X)*X and (C*X) + D are complete and they can start in the F_ADD pipeline.
This addition completes in cycle 21, where (A*X*X)*X is also ready. Then the last addition can start in cycle 22.
Now there is only one operation in flight and we must lean back and wait for the full latency of FADD,
which is 5 cycles. In clock 27 the last addition is finished and the job is done.
       
These tables give the details. The left column symbolizes the F_MUL pipeline with 7 stages and the right column 
symbolizes the F_ADD pipeline with 5 stages.

[For these tables, see the DelphiBeginners..doc document.]
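A rough reconstruction of the schedule from the walkthrough above (not the original tables; the cycles are the ones named in the text):

Cycle  F_MUL pipeline starts   F_ADD pipeline starts
  1    A*X
  3    B*X
  5    C*X
  8    (A*X)*X
 10    (B*X)*X
 12                            C*X + D
 15    (A*X*X)*X
 17                            (B*X*X) + (C*X + D)
 22                            final addition
 27    result ready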

An out of order core schedules instructions to run as soon as data and resources are ready.
Resources are registers and execution pipelines. I do not know for sure, but I think that instructions are scheduled 
to run in program order except when an instruction stalls.
In this situation the next ready instruction in program order is scheduled.
The stalled instruction will be started as soon as the stall reason is removed.
It can stall because of missing resources or not ready data.
   
After having seen how instructions are scheduled to run in the pipelines of a P4 we establish the benchmark.
The benchmark is an optimizer that searches for the best possible fit of our polynomial to ArcSin. It is based on the
simplest of all optimization algorithms, the exhaustive search. We simply try a lot of combinations of
parameter values and remember the set of parameters which gives the best fit.
A and C are run in the intervals [StartA;EndA] and [StartC;EndC], with the step sizes AStepSize and CStepSize.
This is done in two nested while loops.

 StartA    := 0;
 StartC    := -1;
 EndA      := 1;
 EndC      := 1;
 AStepSize := 1E-2;
 CStepSize := 1E-3;
 OptA      := 9999;
 OptC      := 9999;
 MinMaxAbsError := 9999; //initialize before the first comparison in the loop
 A         := StartA;
 while A <= EndA do
  begin
   C := StartC;
   while C <= EndC do
    begin
     Inc(NoOfIterations);
     MaxAbsError := CalculateMaxAbsError(A,C, ArcSinArray);
     if MaxAbsError <= MinMaxAbsError then
      begin
       MinMaxAbsError := MaxAbsError;
       OptA := A;
       OptC := C;
      end;
     C := C + CStepSize;
    end;
   A := A + AStepSize;
  end;
  
The CalculateMaxAbsError function evaluates a number of points on the X interval [-1;1], which is the definition
interval of the ArcSin function.

function TMainForm.CalculateMaxAbsError(A, C : Double; ArcSinArray : TArcSinArray) : Double;
var
 X, Y, D, B, Yref, Error, AbsError, MaxAbsError : Double;

begin
 B := 0;
 D := 0;
 MaxAbsError := 0;
 X := -1;
 repeat
  Yref := ArcSin (X);
  Y := ArcSinApproxFunction(X, A, B, C, D);
  Error := Yref-Y;
  AbsError := Abs(Error);
  MaxAbsError := Max(MaxAbsError, AbsError);
  X := X + XSTEPSIZE;
 until(X > 1);
 Result := MaxAbsError;
end;

At each point we calculate the error by subtracting the Y value from our approximation function from the reference
Y value obtained from a call to the Delphi RTL ArcSin function.

The error can be positive or negative, but we are interested in the absolute value.
We remember the biggest absolute error by taking the maximum of the two values MaxAbsError and
AbsError and assigning it to MaxAbsError.
MaxAbsError is initialized to zero and in the first evaluation it will get the value of the first error
(if it is bigger than zero).
MaxAbsError is returned as the result of the function after a full sweep has been completed.
In the optimizer function the two values of A and C that gave the smallest maximum error are remembered
together with the actual MinMaxAbsError.

All that matters in an optimizer of this kind is to be able to evaluate as many parameter combinations as possible.
For this purpose we must optimize the optimizer ;-) and the functions we evaluate.
In this lesson the purpose is slightly different because all we want is some valid benchmarks
for the functions we optimize.
The means are however the same: the code of the optimizer must take as few cycles as possible, so that
the cycles the functions use are the biggest part of the total number of cycles used.

The first optimizer optimization is to realize that there is no need to evaluate the reference
function over and over again.
It returns, of course, the same values no matter which values A and C have.
Sweeping the reference function once and storing the Yref values in an array does this, as sketched below.
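A minimal sketch of that precomputation (the procedure name is made up here; ArcSinArray and XSTEPSIZE are assumed to be declared as in the listings in this lesson, and ArcSin comes from the Math unit):

procedure TMainForm.FillArcSinArray;
var
 I : Integer;
 X : Double;
begin
 //Fill the reference array once before the optimizer loops start.
 X := -1;
 I := 0;
 repeat
  ArcSinArray[I] := ArcSin(X);
  X := X + XSTEPSIZE;
  Inc(I);
 until X > 1;
end;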
The next optimization is to compact the lines that evaluate MaxAbsError.

Long version:

Yref := ArcSinArray[I];
Error := Yref-Y;
AbsError := Abs(Error);

Short version:

AbsError := Abs(ArcSinArray[I]-Y);

This helps because Delphi creates a lot of redundant code when compiling FP code.
The long version compiles into this:

Yref := ArcSinArray[I];

mov eax,[ebp-$14]
mov edx,[eax+ebx*8]
mov [ebp-$48],edx
mov edx,[eax+ebx*8+$04]
mov [ebp-$44],edx

Error := Yref-Y;

fld   qword ptr [ebp-$48]
fsub qword ptr [ebp-$30]
fstp  qword ptr [ebp-$50]
wait

AbsError := Abs(Error);

fld qword ptr [ebp-$50]
fabs
fstp qword ptr [ebp-$10]
wait

There are a lot of redundancies in this code and we must conclude that Delphi is doing a bad job on optimizing 
floating point code.
Let us add some explanations to the code.

The first line of Pascal is an assignment of one double variable to another.
This is done by two pairs of mov, one for the lower 4 bytes of the variable and one for the upper 4 bytes.
The first line of asm loads the address of the array into eax, which is used as base for addressing
into the array. Ebx is I and it is multiplied by 8 because each entry in the array is 8 bytes.
The 4 byte offset in the last two lines (in the last line it is hidden!) moves
the reference into the middle of element I.

Yref is located in the stack frame at [ebp-$48] and is loaded by the first line of FP code.
Y is located at [ebp-$30] and is subtracted from Yref by fsub.
The result is Error and this is stored at [ebp-$50].

The last line of Pascal compiles into four lines of asm of which the first starts loading Error.
Saving and then loading Error is redundant and the optimizer should have removed it.
Fabs is the Abs function and is probably one of the shortest function implementations seen ;-)
The Delphi compiler does not have the inlining optimization, but it applies compiler magic to a few functions,
one of which is Abs. The last line stores AbsError on the stack.
The short version compiles into this

mov eax,[ebp-$14]
fld qword ptr [eax+ebx*8]
fsub qword ptr [ebp-$30]
fabs
fstp qword ptr [ebp-$10]
wait

Here there is no redundancy and the compiler should have emitted this code as result of the long Pascal version as well.
All lines of code are present in the long version, but all redundant lines have been removed.

The first line loads the base address of the array into eax. The second line loads element I (I is in ebx)
onto the fp stack. The third line subtracts Y from Yref. The fourth line is the Abs function.
The fifth line stores the result in the AbsError variable.
  
There is a peculiarity with this benchmark that I have no explanation of.
The benchmark values are heavily influenced by the way it is activated.

If the keyboard is used to hit the button we get a different score from the one
we get by clicking the button with the mouse!
The one who can explain this will probably get the Nobel Prize in Delphi;-)

Another irritating thing is that Delphi does not align double variables properly.
They should be 8 byte aligned, but Delphi only 4 byte aligns them.

The penalty we can get from this is a level 1 cache line split (and L2 cache line splits as well).
Loading a variable that splits over two cache lines takes double the time of loading one that does not.

Because double variables are 8 byte and the L1 cache line of P4 is 64 byte at most 1 of 8 variables can have a split. 
On P3 that has 32 byte L1 cache lines it is 1 of 4.

  
The ideal situation is that variables are aligned at 4 bytes if they are 4 bytes big, at 8 bytes if they are 8 bytes big, etc.
To make things simple let us imagine that the first line in the level one cache is where our variables are loaded.
The first line starts at address 0, that is, memory from address 0 is loaded into it.
Our first variable is aligned and occupies the first 8 bytes of line 1.

Variable number two occupies bytes 9-16. Variable number 8 occupies bytes 57-64 and does not cross the line boundary.
If variables are only 4 byte aligned the first one could start at byte 4 and number 8 could start at byte 61.
The first 4 bytes of it will be in line 1 and the next 4 bytes will be in line 2.
The processor loads it by first loading the lower part and then loading the higher part, instead of loading it
all in one go.
    
Because of this misalignment of double variables in Delphi our benchmark will not be as stable as we could wish.
Alignment can change when we recompile especially if code is changed.
I have chosen (a bad choice) to not include code to align variables in the benchmark,
but will give an example of it in a later lesson.
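For orientation only, here is a minimal sketch (hypothetical, 32-bit Delphi assumed, not the code promised for the later lesson) of how a double can be forced onto an 8 byte boundary by over-allocating and rounding the address up:

procedure UseAlignedDouble;
var
 Raw : array[0..SizeOf(Double) + 7] of Byte; //8 bytes of slack for alignment
 AlignedX : PDouble;
begin
 //Round the address of Raw up to the next multiple of 8.
 AlignedX := PDouble((Cardinal(@Raw[0]) + 7) and not Cardinal(7));
 AlignedX^ := 1.0; //this double cannot split over two cache lines
end;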

Let us dive into the first function optimization.
We start with the function that uses the naive formula in textbook format.
function ArcSinApprox1a(X, A, B, C, D : Double) : Double;
begin
 Result := A*X*X*X + B*X*X + C*X + D;
end;

This function implementation scores the benchmark 43243 on my P4 1600 MHz clocked at 1920 MHz.
Delphi compiled it like this:

function ArcSinApprox1b(X, A, B, C, D : Double) : Double;
begin
 {
 push  ebp
 mov   ebp,esp
 add   esp,-$08
 }
 Result := A*X*X*X + B*X*X + C*X + D;
 {
 fld   qword ptr [ebp+$20]
 fmul  qword ptr [ebp+$28]
 fmul  qword ptr [ebp+$28]
 fmul  qword ptr [ebp+$28]
 fld   qword ptr [ebp+$18]
 fmul  qword ptr [ebp+$28]
 fmul  qword ptr [ebp+$28]
 faddp st(1)
 fld   qword ptr [ebp+$10]
 fmul  qword ptr [ebp+$28]
 faddp st(1)
 fadd  qword ptr [ebp+$08]
 fstp  qword ptr [ebp-$08]
 wait
 fld   qword ptr [ebp-$08]
 }
 {
 pop   ecx
 pop   ecx
 pop   ebp
 }
end;
The code from the CPU view will not compile because of the instruction faddp st(1), so we remove st(1).
By default the faddp instruction operates on st(0), st(1) and there is no need to write it out.

function ArcSinApprox1c(X, A, B, C, D : Double) : Double;
asm
 //push  ebp  //Added by compiler
 //mov   ebp,esp   //Added by compiler
 add   esp,-$08
 //Result := A*X*X*X + B*X*X + C*X + D;
 fld   qword ptr [ebp+$20]
 fmul  qword ptr [ebp+$28]
 fmul  qword ptr [ebp+$28]
 fmul  qword ptr [ebp+$28]
 fld   qword ptr [ebp+$18]
 fmul  qword ptr [ebp+$28]
 fmul  qword ptr [ebp+$28]
 faddp //st(1)
 fld   qword ptr [ebp+$10]
 fmul  qword ptr [ebp+$28]
 faddp //st(1)
 fadd  qword ptr [ebp+$08]
 fstp  qword ptr [ebp-$08]
 wait
 fld   qword ptr [ebp-$08]
 pop   ecx
 pop   ecx
 //pop   ebp //Added by compiler
end;

First we observe that there is no need to set up a stack frame.
The stack is actually used for storing the result temporarily and reloading it again in the lines

fstp  qword ptr [ebp-$08]
wait
fld   qword ptr [ebp-$08]

but the base pointer and not the stack pointer is used for this.
The lines that use ebp plus a value are accessing the parameters, which are located above the base pointer,
in the calling function's stack frame.
The stack pointer is not used at all in the function and changing its value is meaningless.

The mov ebp, esp instruction added by the compiler, together with the line add esp, -$08, creates an 8-byte stack frame.
Because these lines change the ebp register it is necessary to back it up by pushing it to the stack.
Unfortunately we can only remove the add esp, -$08 line, which subtracts 8 bytes from the stack pointer esp,
and the two pop ecx lines, which add them back.

function ArcSinApprox1d(X, A, B, C, D : Double) : Double;
asm
 //add   esp,-$08
 //Result := A*X*X*X + B*X*X + C*X + D;
 fld   qword ptr [ebp+$20]
 fmul  qword ptr [ebp+$28]
 fmul  qword ptr [ebp+$28]
 fmul  qword ptr [ebp+$28]
 fld   qword ptr [ebp+$18]
 fmul  qword ptr [ebp+$28]
 fmul  qword ptr [ebp+$28]
 faddp
 fld   qword ptr [ebp+$10]
 fmul  qword ptr [ebp+$28]
 faddp
 fadd  qword ptr [ebp+$08]
 fstp  qword ptr [ebp-$08]
 wait
 fld   qword ptr [ebp-$08]
 //pop   ecx
 //pop   ecx
end;
This function implementation scores the benchmark 42391 and performance actually dipped a little.
The compiler inserts the line mov ebp, esp and we can make it redundant by using esp instead of ebp.

function ArcSinApprox1e(X, A, B, C, D : Double) : Double;
asm
 //Result := A*X*X*X + B*X*X + C*X + D;
 //fld   qword ptr [ebp+$20]
 fld   qword ptr [esp+$20]
 //fmul  qword ptr [ebp+$28]
 fmul  qword ptr [esp+$28]
 //fmul  qword ptr [ebp+$28]
 fmul  qword ptr [esp+$28]
 //fmul  qword ptr [ebp+$28]
 fmul  qword ptr [esp+$28]
 //fld   qword ptr [ebp+$18]
 fld   qword ptr [esp+$18]
 //fmul  qword ptr [ebp+$28]
 fmul  qword ptr [esp+$28]
 //fmul  qword ptr [ebp+$28]
 fmul  qword ptr [esp+$28]
 faddp
 //fld   qword ptr [ebp+$10]
 fld   qword ptr [esp+$10]
 //fmul  qword ptr [ebp+$28]
 fmul  qword ptr [esp+$28]
 faddp
 //fadd  qword ptr [ebp+$08]
 fadd  qword ptr [esp+$08]
 //fstp  qword ptr [ebp-$08]
 fstp  qword ptr [esp-$08]
 wait
 //fld   qword ptr [ebp-$08]
 fld   qword ptr [esp-$08]
end;
Unfortunately the compiler still inserts the mov instruction, so we performed a copy propagation that
gave no optimization because it is not followed by a dead code removal.
Therefore performance is almost the same, 43094.

Without investigating whether the result stored on the stack
is used, we can optimize the lines copying it there and reloading it.
The effect of those lines is that a copy of Result is left on the stack.
They redundantly pop the result off the FP stack and reload Result from the stack.

This single line has the same effect, but the redundancy is removed:
fst  qword ptr [ebp-$08]
This optimization is very often possible on Delphi generated code and is important to remember.

function ArcSinApprox1f(X, A, B, C, D : Double) : Double;
asm
 //Result := A*X*X*X + B*X*X + C*X + D;
 fld   qword ptr [esp+$20]
 fmul  qword ptr [esp+$28]
 fmul  qword ptr [esp+$28]
 fmul  qword ptr [esp+$28]
 fld   qword ptr [esp+$18]
 fmul  qword ptr [esp+$28]
 fmul  qword ptr [esp+$28]
 faddp
 fld   qword ptr [esp+$10]
 fmul  qword ptr [esp+$28]
 faddp
 fadd  qword ptr [esp+$08]
 //fstp  qword ptr [esp-$08]
 fst  qword ptr [esp-$08]
 wait
 //fld   qword ptr [esp-$08]
end;
This function implementation scores the benchmark 47939 and the improvement is 11 %.
The next question to ask is: is the copy of the Result on the stack ever used?
To answer it we must inspect the code at the location of the call to the function.

Y := ArcSinApproxFunction(X, A, B, C, D);

call dword ptr [ArcSinApproxFunction]
fstp qword ptr [ebp-$30]
wait

The first line after the call stores the result in Y and pops the stack.
Seeing this we assume that the result on the stack is not used,
but to be sure we must scan through the rest of the code too.
If the rule for the Register calling convention is that FP results are transferred on the FP stack
it is weird that a copy is also placed on the stack.

We conclude that it is redundant to copy the Result to the stack and remove the line doing it.
function ArcSinApprox1g(X, A, B, C, D : Double) : Double;
asm
 //Result := A*X*X*X + B*X*X + C*X + D;
 fld   qword ptr [esp+$20]
 fmul  qword ptr [esp+$28]
 fmul  qword ptr [esp+$28]
 fmul  qword ptr [esp+$28]
 fld   qword ptr [esp+$18]
 fmul  qword ptr [esp+$28]
 fmul  qword ptr [esp+$28]
 faddp
 fld   qword ptr [esp+$10]
 fmul  qword ptr [esp+$28]
 faddp
 fadd  qword ptr [esp+$08]
 //fst  qword ptr [esp-$08]
 wait
end;
This function implementation scores the benchmark 47405.
Instead of writing all the qword ptr [esp+$xx] lines we can write the names of the variables and let the compiler
translate them into addresses.
This actually makes the code more robust.
If the location of the variables should change then the code breaks if we use hard coded addressing.
This will however only happen if the calling convention is changed, and that is not likely to happen very often.

function ArcSinApprox1g_2(X, A, B, C, D : Double) : Double;
asm
 //Result := A*X*X*X + B*X*X + C*X + D;
 //fld   qword ptr [esp+$20]
 fld   A
 //fmul  qword ptr [esp+$28]
 fmul  X
 //fmul  qword ptr [esp+$28]
 fmul  X
 //fmul  qword ptr [esp+$28]
 fmul  X
 //fld   qword ptr [esp+$18]
 fld   B
 //fmul  qword ptr [esp+$28]
 fmul  X
 //fmul  qword ptr [esp+$28]
 fmul  X 
 faddp
 //fld   qword ptr [esp+$10]
 fld   C
 //fmul  qword ptr [esp+$28]
 fmul  X 
 faddp
 //fadd  qword ptr [esp+$08]
 fadd  D
 wait
end;
Try having both types of lines enabled

fld   qword ptr [esp+$20]
fld   A
and see in the CPU view how the compiler generated exactly the same code for both versions.

X is used in a lot of lines and it is referenced on the stack.
Therefore it is loaded from the stack into the internal FP registers every time.
It should be faster to load it once into the FP stack and let all uses reference the FP stack.

function ArcSinApprox1h(X, A, B, C, D : Double) : Double;
asm
 //Result := A*X*X*X + B*X*X + C*X + D;
 fld   qword ptr [esp+$20]
 fld   qword ptr [esp+$28] //New
 fxch
 //fmul qword ptr [esp+$28]
 fmul  st(0),st(1)
 //fmul qword ptr [esp+$28]
 fmul  st(0),st(1)
 //fmul qword ptr [esp+$28]
 fmul  st(0),st(1)
 fld   qword ptr [esp+$18]
 //fmul qword ptr [esp+$28]
 fmul  st(0),st(2)
 //fmul qword ptr [esp+$28]
 fmul  st(0),st(2)
 faddp
 fld   qword ptr [esp+$10]
 //fmul qword ptr [esp+$28]
 fmul  st(0),st(2)
 ffree st(2)
 faddp
 fadd  qword ptr [esp+$08]
 fst   qword ptr [esp-$08]
 wait
end;
The second line is one we added and it loads X once and for all.
Because it places X on the top of the stack in st(0),
and this position is needed as temp for further operations, we exchange st(0) with st(1) with the fxch instruction.
We could of course have changed the position of lines 1 and 2 and obtained the same effect.
All the lines multiplying

st(0) with X
fmul qword ptr [esp+$28]
are changed to
fmul  st(0),st(1)
After the FP copy of X has been used for the last time we remove it with the ffree instruction.
This function implementation scores the benchmark 46882 and performance is decreased by 1 %.
This was a surprise. Fxch is claimed by Intel to be nearly free, because it works by renaming the internal registers.
Let us check that by removing it.

function ArcSinApprox1i(X, A, B, C, D : Double) : Double;
asm
 //Result := A*X*X*X + B*X*X + C*X + D;
 fld   qword ptr [esp+$28]
 fld   qword ptr [esp+$20]
 //fld   qword ptr [esp+$28]
 //fxch
 fmul  st(0),st(1)
 fmul  st(0),st(1)
 fmul  st(0),st(1)
 fld   qword ptr [esp+$18]
 fmul  st(0),st(2)
 fmul  st(0),st(2)
 faddp
 fld   qword ptr [esp+$10]
 fmul  st(0),st(2)
 ffree st(2)
 faddp
 fadd  qword ptr [esp+$08]
 wait
end;
This function implementation scores the benchmark 45393 and performance is decreased by 3 %.
Fxch is surely not to blame because performance once more went down. What is going on?
The wait instruction was discussed in an earlier lesson and this time we just remove it.

function ArcSinApprox1j(X, A, B, C, D : Double) : Double;
asm
 //Result := A*X*X*X + B*X*X + C*X + D;
 fld   qword ptr [esp+$28]
 fld   qword ptr [esp+$20]
 fmul  st(0),st(1)
 fmul  st(0),st(1)
 fmul  st(0),st(1)
 fld   qword ptr [esp+$18]
 fmul  st(0),st(2)
 fmul  st(0),st(2)
 faddp
 fld   qword ptr [esp+$10]
 fmul  st(0),st(2)
 ffree st(2)
 faddp
 fadd  qword ptr [esp+$08]
 //wait
end;
Performance went down to 44140.
Let us crosscheck these surprising results by running the functions on a P3.

ArcSinApprox1a	63613
ArcSinApprox1b	64412
ArcSinApprox1c	64433
ArcSinApprox1d	65062
ArcSinApprox1e	64830
ArcSinApprox1f	62598
ArcSinApprox1g	79586
ArcSinApprox1h	85361
ArcSinApprox1i	80515
ArcSinApprox1j	80192
First of all we see that ArcSinApprox1h is the fastest function on P3.
Thereby it is seen that loading data from the level 1 data cache is more expensive on P3 than on P4,
because changing the code such that X is loaded only once improved performance on P3 and not on P4.
On the other hand we could also say that it must always be slower to get data from the cache than
from internal registers if the architecture is sound, and this is only true for the P6 architecture here.
P4 has a fast L1 data cache, which can be read in only 2 clock cycles, but an internal register
read should have a latency of only one cycle. It however looks like reads from registers take 2 clock cycles.
Then we see that a P3 at 1400 is nearly 80 % faster than a P4 at 1920 on this code. We know that latencies on P3 are
shorter, but this is not enough to explain the huge difference.

The latencies and throughput of the used instructions are, on P3:

Fadd latency is 3 clock cycles and throughput is 1
Fmul latency is 5 clock cycles and throughput is 1

On P4:

Fadd latency is 5 clock cycles and throughput is 1
Fmul latency is 7 clock cycles and throughput is 2

I could not find numbers for fld. The explanation for the very bad performance of P4 on this code
must be the 2-cycle throughput of fmul together with the slow FP register access.
The fmul pipeline only accepts a new instruction every second cycle, where the P3 pipeline accepts one every cycle.
 
Scaling the results by clock frequency

47939 / 1920 = 25
85361 / 1400 = 61

reveals that, clock by clock, on the fastest function version for each processor, P3 is nearly 2.5 times faster than P4.
This is truly astonishing. If P4 should have a slight chance against a P3 we must remove some of
those multiplications.
This leads us to the Horner version of our function.

function ArcSinApprox3a(X, A, B, C, D : Double) : Double;
begin
 Result := ((A*X + B)*X + C)*X + D;
end;
Which is compiled into  

function ArcSinApprox3b(X, A, B, C, D : Double) : Double;
begin
{
push ebp
mov  ebp,esp
add  esp,-$08
}
 Result := ((A*X + B)*X + C)*X + D;
 {
 fld  qword ptr [ebp+$20]
 fmul qword ptr [ebp+$28]
 fadd qword ptr [ebp+$18]
 fmul qword ptr [ebp+$28]
 fadd qword ptr [ebp+$10]
 fmul qword ptr [ebp+$28]
 fadd qword ptr [ebp+$08]
 fstp qword ptr [ebp-$08]
 wait
 fld  qword ptr [ebp-$08]
 }
{
pop  ecx
pop  ecx
pop  ebp
}
end;
The first three versions of this function are identical and they surprisingly score the same benchmark.
Our benchmark is not perfect but it was precise this time ;-)
ArcSinApprox3a	45076
ArcSinApprox3b	45076
ArcSinApprox3c	45076

Here is the first BASM version with no optimizations.
The code the compiler supplies is commented out.
Optimization follows the same pattern as for the first function.

function ArcSinApprox3c(X, A, B, C, D : Double) : Double;
asm
 //push ebp
 //mov ebp,esp
 add  esp,-$08
 //Result := ((A*X + B)*X + C)*X + D;
 fld  qword ptr [ebp+$20]
 fmul qword ptr [ebp+$28]
 fadd qword ptr [ebp+$18]
 fmul qword ptr [ebp+$28]
 fadd qword ptr [ebp+$10]
 fmul qword ptr [ebp+$28]
 fadd qword ptr [ebp+$08]
 fstp qword ptr [ebp-$08]
 wait
 fld  qword ptr [ebp-$08]
 pop  ecx
 pop  ecx
 //pop ebp
end;
The first thing to do is to remove the add esp, -$08 line and the two pop ecx.
They set up a stack frame and do nothing but manipulate the stack pointer, which is not used at all.

function ArcSinApprox3d(X, A, B, C, D : Double) : Double;
asm
 //add  esp,-$08
 //Result := ((A*X + B)*X + C)*X + D;
 fld  qword ptr [ebp+$20]
 fmul qword ptr [ebp+$28]
 fadd qword ptr [ebp+$18]
 fmul qword ptr [ebp+$28]
 fadd qword ptr [ebp+$10]
 fmul qword ptr [ebp+$28]
 fadd qword ptr [ebp+$08]
 fstp qword ptr [ebp-$08]
 wait
 fld  qword ptr [ebp-$08]
 //pop  ecx
 //pop  ecx
end;
This function implementation scores the benchmark 43535.
Both of the redundant lines copying the result to stack and back are removed at the same time.

function ArcSinApprox3e(X, A, B, C, D : Double) : Double;
asm
 //Result := ((A*X + B)*X + C)*X + D;
 fld  qword ptr [ebp+$20]
 fmul qword ptr [ebp+$28]
 fadd qword ptr [ebp+$18]
 fmul qword ptr [ebp+$28]
 fadd qword ptr [ebp+$10]
 fmul qword ptr [ebp+$28]
 fadd qword ptr [ebp+$08]
 //fstp qword ptr [ebp-$08]
 wait
 //fld  qword ptr [ebp-$08]
end;
This function implementation scores the benchmark 47237 and the improvement is 8.5 %.
Then we change the code such that X is loaded only once.

function ArcSinApprox3f(X, A, B, C, D : Double) : Double;
asm
 //Result := ((A*X + B)*X + C)*X + D;
 fld   qword ptr [ebp+$20]
 fld   qword ptr [ebp+$28]
 fxch
 //fmul qword ptr [ebp+$28]
 fmul  st(0),st(1)
 fadd  qword ptr [ebp+$18]
 //fmul qword ptr [ebp+$28]
 fmul  st(0),st(1)
 fadd  qword ptr [ebp+$10]
 //fmul qword ptr [ebp+$28]
 fmul  st(0),st(1)
 ffree st(1)
 fadd qword ptr [ebp+$08]
 wait
end;
This function implementation scores the benchmark 47226 and performance is unchanged.
The ffree instruction can be removed by using fmulp instead of fmul, but to do this we must interchange the
two registers used. Only these two registers are in use and A*B = B*A so there is no problem doing that.
 We are not removing any redundancy by this and the two ways of coding the same thing should perform identically.

function ArcSinApprox3g(X, A, B, C, D : Double) : Double;
asm
 //Result := ((A*X + B)*X + C)*X + D;
 fld   qword ptr [ebp+$20]
 fld   qword ptr [ebp+$28]
 fxch  st(1)
 fmul  st(0),st(1)
 fadd  qword ptr [ebp+$18]
 fmul  st(0),st(1)
 fadd  qword ptr [ebp+$10]
 //fmul  st(0),st(1)
 fmulp st(1),st(0)
 //ffree st(1)
 fadd qword ptr [ebp+$08]
 wait
end;
This function implementation scores the benchmark 47416.
Then we remove the wait instruction.

function ArcSinApprox3h(X, A, B, C, D : Double) : Double;
asm
 //Result := ((A*X + B)*X + C)*X + D;
 fld   qword ptr [ebp+$20]
 fld   qword ptr [ebp+$28]
 fxch  st(1)
 fmul  st(0),st(1)
 fadd  qword ptr [ebp+$18]
 fmul  st(0),st(1)
 fadd  qword ptr [ebp+$10]
 fmulp st(1),st(0)
 fadd qword ptr [ebp+$08]
 //wait
end;
This function implementation scores the benchmark 47059.
The last thing to do is to interchange the lines that load X and A, and remove the fxch instruction.

function ArcSinApprox3i(X, A, B, C, D : Double) : Double;
asm
 //Result := ((A*X + B)*X + C)*X + D;
 fld   qword ptr [ebp+$28]
 fld   qword ptr [ebp+$20]
 //fld   qword ptr [ebp+$28]
 //fxch  st(1)
 fmul  st(0),st(1)
 fadd  qword ptr [ebp+$18]
 fmul  st(0),st(1)
 fadd  qword ptr [ebp+$10]
 fmulp st(1),st(0)
 fadd qword ptr [ebp+$08]
end;
This function implementation scores the benchmark 46544 and performance went down!

Let us compare the performance of the Horner style function with the naive one by picking the fastest
implementations of both on P4:
ArcSinApprox1f	47939
ArcSinApprox3g	47416

On P3
ArcSinApprox1h	85361
ArcSinApprox3h	87604
The difference is not big, but the naive approach is a little faster on P4 and slower on P3.
The naive approach has more calculations, but parallelism makes up for it.
The Horner way has very little parallelism and latencies are fully exposed.
This is especially bad on P4.
Having this in mind we continue to the third possible approach, which looks like this.

function ArcSinApprox4b(X, A, B, C, D : Double) : Double;
begin
{
push  ebp
mov  ebp,esp
add   esp,-$08
}
 Result := (A*X + B)*(X*X)+(C*X + D);
 {
 fld     qword ptr [ebp+$20]
 fmul  qword ptr [ebp+$28]
 fadd  qword ptr [ebp+$18]
 fld     qword ptr [ebp+$28]
 fmul  qword ptr [ebp+$28]
 fmulp st(1)
 fld     qword ptr [ebp+$10]
 fmul  qword ptr [ebp+$28]
 fadd   qword ptr [ebp+$08]
 faddp st(1)
 fstp   qword ptr [ebp-$08]
 wait
 fld   qword ptr [ebp-$08]
 }
{
pop ecx
pop ecx
pop ebp
}
end;
Experienced as we are now, optimizing this function is going to be easy and fast ;-)
This version is as Delphi made it:

function ArcSinApprox4c(X, A, B, C, D : Double) : Double;
asm
 //push ebp
 //mov ebp,esp
 add   esp,-$08
 //Result := (A*X + B)*(X*X)+(C*X + D);
 fld     qword ptr [ebp+$20]
 fmul  qword ptr [ebp+$28]
 fadd  qword ptr [ebp+$18]
 fld     qword ptr [ebp+$28]
 fmul  qword ptr [ebp+$28]
 fmulp //st(1)
 fld     qword ptr [ebp+$10]
 fmul  qword ptr [ebp+$28]
 fadd  qword ptr [ebp+$08]
 faddp //st(1)
 fstp   qword ptr [ebp-$08]
 wait
 fld   qword ptr [ebp-$08]
 pop   ecx
 pop   ecx
//pop  ebp
end;
Removing the stack frame and the two lines that store the result in the stack frame

function ArcSinApprox4d(X, A, B, C, D : Double) : Double;
asm
 //add  esp,-$08
 //Result := (A*X + B)*(X*X)+(C*X + D);
 fld  qword ptr [ebp+$20]
 fmul qword ptr [ebp+$28]
 fadd qword ptr [ebp+$18]
 fld  qword ptr [ebp+$28]
 fmul qword ptr [ebp+$28]
 fmulp //st(1)
 fld  qword ptr [ebp+$10]
 fmul qword ptr [ebp+$28]
 fadd qword ptr [ebp+$08]
 faddp //st(1)
 //fstp qword ptr [ebp-$08]
 wait
 //fld  qword ptr [ebp-$08]
 //pop  ecx
 //pop  ecx
end;
Load X once
function ArcSinApprox4e(X, A, B, C, D : Double) : Double;
asm
 //Result := (A*X + B)*(X*X)+(C*X + D);
 fld   qword ptr [ebp+$20]
 fld   qword ptr [ebp+$28]
 //fmul qword ptr [ebp+$28]
 fxch
 fmul  st(0),st(1)
 fadd  qword ptr [ebp+$18]
 //fld  qword ptr [ebp+$28]
 fld   st(1)
 //fmul  qword ptr [ebp+$28]
 fmul  st(0),st(2)
 fmulp
 fld   qword ptr [ebp+$10]
 //fmul  qword ptr [ebp+$28]
 fmul  st(0),st(2)
 fadd  qword ptr [ebp+$08]
 faddp
 ffree st(1)
 wait
end;
Remove fxch and wait.
function ArcSinApprox4f(X, A, B, C, D : Double) : Double;
asm
 //Result := (A*X + B)*(X*X)+(C*X + D);
 fld   qword ptr [ebp+$28]
 fld   qword ptr [ebp+$20]
 //fxch
 fmul  st(0),st(1)
 fadd  qword ptr [ebp+$18]
 fld   st(1)
 fmul  st(0),st(2)
 fmulp
 fld   qword ptr [ebp+$10]
 fmul  st(0),st(2)
 fadd  qword ptr [ebp+$08]
 faddp
 ffree st(1)
 //wait
end;
Reschedule ffree st(1)
function ArcSinApprox4g(X, A, B, C, D : Double) : Double;
asm
 //Result := (A*X + B)*(X*X)+(C*X + D);
 fld   qword ptr [ebp+$28]
 fld   qword ptr [ebp+$20]
 fmul  st(0),st(1)
 fadd  qword ptr [ebp+$18]
 fld   st(1)
 fmul  st(0),st(2)
 fmulp
 fld   qword ptr [ebp+$10]
 fmul  st(0),st(2)
 ffree st(2)
 fadd  qword ptr [ebp+$08]
 faddp
 //ffree st(1)
end;
Replace fmul/ffree by fmulp
function ArcSinApprox4h(X, A, B, C, D : Double) : Double;
asm
 //Result := (A*X + B)*(X*X)+(C*X + D);
 fld   qword ptr [ebp+$28]
 fld   qword ptr [ebp+$20]
 fmul  st(0),st(1)
 fadd  qword ptr [ebp+$18]
 fld   st(1)
 fmul  st(0),st(2)
 fmulp
 fld   qword ptr [ebp+$10]
 //fmul  st(0),st(2)
 fmulp st(2),st(0)
 //ffree st(2)
 fadd  qword ptr [ebp+$08]
 faddp
end;
Cleaning up and observing that the compiler still backs up ebp and modifies esp redundantly.
function ArcSinApprox4i(X, A, B, C, D : Double) : Double;
asm
 //Result := (A*X + B)*(X*X)+(C*X + D);
 fld   qword ptr [ebp+$28]
 fld   qword ptr [ebp+$20]
 fmul  st(0),st(1)
 fadd  qword ptr [ebp+$18]
 fld   st(1)
 fmul  st(0),st(2)
 fmulp
 fld   qword ptr [ebp+$10]
 fmulp st(2),st(0)
 fadd  qword ptr [ebp+$08]
 faddp
end;
The big question is now how well this function version performs.

ArcSinApprox4a	45228
ArcSinApprox4b	45239
ArcSinApprox4c	45228
ArcSinApprox4d	51813
ArcSinApprox4e	49044
ArcSinApprox4f	48674
ArcSinApprox4g	48852
ArcSinApprox4h	44914
ArcSinApprox4i	44914
We see that optimizations from function d to i are deoptimizations on P4 except for g.

On P3
ArcSinApprox4a	68871
ArcSinApprox4b	68871
ArcSinApprox4c	68634
ArcSinApprox4d	86806
ArcSinApprox4e	85727
ArcSinApprox4f	83542
ArcSinApprox4g	80548
ArcSinApprox4h	88378
ArcSinApprox4i	85324
We see that optimizations d and h are very good and optimizations e, f, g and i are bad.
It is quite possible that the optimal function implementation is none of the ones we have made.
We could pick version h and remove the bad optimizations, or simply make some more variants,
and this way get a faster implementation.

Which function approach is the winner? To find out we pick the fastest implementation of each approach. On P4:
ArcSinApprox1f	47939
ArcSinApprox3g	47416
ArcSinApprox4d	51813
The last version is the fastest.
Parallelisation is very important on a modern processor and version 4 beats the others by 9 %.

On P3
ArcSinApprox1h	85361
ArcSinApprox3h	87604
ArcSinApprox4h	88378
The function version 4 is a winner on P3 as well, but with a smaller margin.
The P4 has the SSE2 instruction set, which contains instructions for double precision floating-point calculations.
The main idea with this instruction set is SIMD calculations.
SIMD is an acronym for Single Instruction Multiple Data.

Multiple data is here two FP double precision variables (64 bit) and two sets of these data can be added, 
subtracted, multiplied or divided with one instruction.
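As a small hedged illustration (this fragment is not from the lesson, and whether the packed mnemonics assemble depends on the BASM support in your Delphi version), the packed instruction addpd adds both pairs of doubles with one instruction, while the scalar addsd only touches the low doubles:

movupd xmm0, [eax]   //load two doubles starting at the address in eax
movupd xmm1, [edx]   //load two doubles starting at the address in edx
addpd  xmm0, xmm1    //packed: both pairs are added with one instruction
addsd  xmm2, xmm3    //scalar: only the low double of xmm3 is added to xmm2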

SSE2 also has some instructions for scalar calculations, which are calculations on
one pair of data, just like ordinary FP math in the FPU.
The biggest difference between ordinary FP math and SSE2 scalar math is that FP math is performed in extended
precision and results are rounded to double precision when copied to double precision variables in RAM/cache.
SSE2 math is double precision calculation in double precision registers.

The code examples of this lesson have relatively few calculations and precision on the FPU will be double.
If we load data, perform all calculations and store the result, the result will only be a little less than
extended precision while still on the FPU stack, and will be rounded to exact double precision
when copied to a memory location.

SSE2 calculations on the other hand are a little less than double precision and the result
in a register is a little less than double precision too.
If there is only one calculation the result will be in double precision, but when performing additional
calculations the error from each will sum up.
Because the FPU does all calculations in extended precision and can hold temporary results in registers,
many calculations can be done before precision declines below double.
We have seen that the drawback of SSE2 is that precision is double or less, versus the double precision of IA32 FP.

What is the advantage? There are two advantages.
Firstly, the registers are not ordered as a stack, which makes it easier to arrange code, and secondly, calculations
in double precision are faster than calculations in extended precision.
We must expect scalar SSE2 instructions to have shorter latencies than their IA32 FP counterparts.

Fadd latency is 5
Fsub latency is 5
Fmul latency is 7
Fdiv latency is 38

Addsd latency is 4
Subsd latency is 4
Mulsd latency is not listed
Divsd latency is 35

The P4 optimizations reference manual has no latency and throughput information for the Mulsd instruction!
We see that latencies are one cycle less for scalar SSE2 in general, but 3 cycles less for division.
 P4 Żοֲû Mulsd ָǱϢ

Throughput is

Fadd throughput is 1
Fsub throughput is 1
Fmul throughput is 2
Fdiv throughput is 38

Addsd throughput is 2
Subsd throughput is 2
Mulsd throughput is not listed
Divsd throughput is 35

We see that the throughput for addsd and subsd is surprisingly double that of fadd and fsub.
All who think SSE2 has dedicated hardware and that SIMD is calculation on two sets of data in parallel,
raise your hands!
From the latency and throughput tables at page C-1 of the manual "Optimizations for Intel P4 and Intel Xeon" it is seen
that all SSE2 FP instructions are executed in the same pipelines as the old time FP instructions.

This means that a SIMD addition, as an example, generates two microinstructions that execute in the F_ADD pipeline.
At clock cycle one the first one enters the pipeline, at the second cycle number 2 enters the pipeline.
Because the latency of the instruction is 4 cycles, the first microinstruction leaves the pipeline at clock cycle 3 and
the second one leaves at cycle 4.
This leads us to expect that a scalar SSE2 add should generate one microinstruction of the same type and
have a latency of 3 cycles and a throughput of 1.

From the tables it is seen that the SIMD version of add, addpd, has the same latency and throughput
as the scalar version, addsd.
Either there is an error in the tables, or the scalar instruction also generates two microinstructions, of
which one is blind, that is, has no effect.

Come on Intel!
To verify the numbers from the table we create some dedicated code and time the instructions.

procedure TMainForm.BenchmarkADDSDLatency;
var
 RunNo, ClockFrequency : Cardinal;
 StartTime, EndTime, RunTime : TDateTime;
 NoOfClocksPerRun, RunTimeSec : Double;
const
 ONE : Double = 1;
 NOOFINSTRUCTIONS : Cardinal = 895;

begin
 ADDSDThroughputEdit.Text := 'Running';
 ADDSDThroughputEdit.Color := clBlue;
 Update;
 StartTime := Time;
 for RunNo := 1 to MAXNOOFRUNS do
  begin
   asm
    movsd xmm0, ONE
    movsd xmm1, xmm0
    movsd xmm2, xmm0
    movsd xmm3, xmm0
    movsd xmm4, xmm0
    movsd xmm5, xmm0
    movsd xmm6, xmm0
    movsd xmm7, xmm0

    addsd xmm0, xmm1
    addsd xmm0, xmm1
    addsd xmm0, xmm1
    addsd xmm0, xmm1
    addsd xmm0, xmm1
    addsd xmm0, xmm1
    addsd xmm0, xmm1

    //Repeat the addsd block of code such that there are 128 blocks
   end;
  end;
 EndTime := Time;
 RunTime := EndTime - StartTime;
 RunTimeSec := (24 * 60 *60 * RunTime);
 ClockFrequency := StrToInt(ClockFrequencyEdit.Text);
 NoOfClocksPerRun := (RunTimeSec / MaxNoOfRuns) * ClockFrequency * 1000000 / NOOFINSTRUCTIONS;
 ADDSDThroughputEdit.Text := FloatToStrF(NoOfClocksPerRun, ffFixed, 9, 1);
 ADDSDThroughputEdit.Color := clLime;
 Update;
end;
The addsd instructions all operate on the same two registers and therefore they cannot execute in parallel.
 The second instruction has to wait for the first to finish and the full latency of the instruction is exposed.

For measuring throughput insert this block 128 times
addsd xmm1, xmm0
addsd xmm2, xmm0
addsd xmm3, xmm0
addsd xmm4, xmm0
addsd xmm5, xmm0
addsd xmm6, xmm0
addsd xmm7, xmm0

Here there are no data dependencies between instructions and they can execute in parallel.
Xmm0 is used as source in every line but this does not create a data dependency.
Results from running the code show us that latency is 4 cycles and throughput is 2 cycles.
This is consistent with the table numbers.

Let us code the three functions in scalar SSE2 and perform some benchmarks.
The 8 SSE2 registers are called xmm0-xmm7 and Delphi has no register view for them.
So we must create our own, by creating a global (or local) variable for each register,
putting a watch on them and adding code in the function to copy the register contents to the variables.

It is somewhat cumbersome to do all this and I am looking forward to Borland creating an xmm register view.
This code shows how I do it.

var
 XMM0reg, XMM1reg, XMM2reg, XMM3reg, XMM4reg : Double;

function ArcSinApprox3i(X, A, B, C, D : Double) : Double;
asm
 //Result := ((A*X + B)*X + C)*X + D;
 
 fld   qword ptr [ebp+$20]
 movsd xmm0,qword ptr [ebp+$20]

 movsd XMM0reg,xmm0
 movsd XMM1reg,xmm1
 movsd XMM2reg,xmm2
 movsd XMM3reg,xmm3

 fld       qword ptr [ebp+$28]
 movsd xmm1,qword ptr [ebp+$28]

 movsd XMM0reg,xmm0
 movsd XMM1reg,xmm1
 movsd XMM2reg,xmm2
 movsd XMM3reg,xmm3

 fxch    st(1)
 fmul    st(0),st(1)
 mulsd xmm0,xmm1

 movsd XMM0reg,xmm0
 movsd XMM1reg,xmm1
 movsd XMM2reg,xmm2
 movsd XMM3reg,xmm3

 fadd   qword ptr [ebp+$18]
 addsd xmm0,qword ptr [ebp+$18]

 movsd XMM0reg,xmm0
 movsd XMM1reg,xmm1
 movsd XMM2reg,xmm2
 movsd XMM3reg,xmm3

 fmul    st(0),st(1)
 mulsd xmm0,xmm1

 movsd XMM0reg,xmm0
 movsd XMM1reg,xmm1
 movsd XMM2reg,xmm2
 movsd XMM3reg,xmm3

 fadd    qword ptr [ebp+$10]
 addsd xmm0,qword ptr [ebp+$10]

 movsd XMM0reg,xmm0
 movsd XMM1reg,xmm1
 movsd XMM2reg,xmm2
 movsd XMM3reg,xmm3

 fmulp st(1),st(0)
 mulsd xmm0,xmm1

 movsd XMM0reg,xmm0
 movsd XMM1reg,xmm1
 movsd XMM2reg,xmm2
 movsd XMM3reg,xmm3

 fadd    qword ptr [ebp+$08]
 addsd xmm0,qword ptr [ebp+$08]

 movsd XMM0reg,xmm0
 movsd XMM1reg,xmm1
 movsd XMM2reg,xmm2
 movsd XMM3reg,xmm3

 movsd [esp-8],xmm0
 fld       qword ptr [esp-8]

 movsd XMM0reg,xmm0
 movsd XMM1reg,xmm1
 movsd XMM2reg,xmm2
 movsd XMM3reg,xmm3

 wait
end;
The code is not using xmm4-xmm7 and there was no need to create a register view for them.
There is xmm view code added after each line of SSE2 code.
All lines but the last two are the FP code with the SSE2 code added, such that every operation is done in FP
as well as in SSE2.
This way it is possible to trace through the code and check that the SSE2 version is doing the same as
the classic version.

Open the FPU view and see how the FP stack is updated, and check that the xmm registers are updated in the same way.
I developed the SSE2 code simply by adding an SSE2 instruction after each line of FP code.

fld       qword ptr [ebp+$20]
movsd xmm0,qword ptr [ebp+$20]

movsd copies one double precision variable from the memory location at [ebp+$20] into an xmm register.
qword ptr is not needed, but I kept it to emphasise the pattern between SSE2 code and FP code.
 
A big difference between FP code and scalar SSE2 code is that the FP registers are organized as a stack and 
SSE2 registers are not.
At first while coding the SSE2 code I just ignored this and then after having made all the SSE2 lines I went back
and traced through the lines one by one and corrected them to work on the correct variable/register.

Activate the function with some variable values that are easy to follow in the two views
(e.g. X=2, A=3, B=4, C=5, D=6), and see that first 2 is loaded, then 3, then 2 is multiplied by 3 
 and 2 is overwritten by 6 etc.
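With those values the intermediate results are easy to check: A*X = 3*2 = 6, plus B gives 10, times X gives 20, plus C gives 25, times X gives 50, and plus D gives the final result 56, which should appear both in st(0) and in xmm0.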

The scalar SSE2 counterpart for fmul is mulsd. The sd suffix means Scalar, Double.

fxch  st(1)
fmul  st(0),st(1)
mulsd xmm0,xmm1

The SSE2 counterpart for fadd is addsd.

fadd  qword ptr [ebp+$18]
addsd xmm0,qword ptr [ebp+$18]

Continue this way line by line.

The FP code leaves the result in st(0), but the SSE2 code leaves the result in an xmm register.
Then the result has to be copied from xmm to st(0) via a memory location on the stack.

movsd [esp-8],xmm0
fld   qword ptr [esp-8]

These two lines do this.

At esp-8, 8 bytes above the top of the stack, there is some place we can use as the temporary location for the result.
The first line copies xmm0 to temp and then the last line loads temp on the FP stack.
These two lines are overhead that will make small SSE2 functions less effective than their FP cousins.
After having double-checked the SSE2 code we can remove the instrumentation code as well as the old FP code,
leaving a nice scalar SSE2 function with only a little extra overhead.

function ArcSinApprox3j(X, A, B, C, D : Double) : Double;
asm
 //Result := ((A*X + B)*X + C)*X + D;
 movsd xmm0,qword ptr [ebp+$20]
 movsd xmm1,qword ptr [ebp+$28]
 mulsd xmm0,xmm1
 addsd xmm0,qword ptr [ebp+$18]
 mulsd xmm0,xmm1
 addsd xmm0,qword ptr [ebp+$10]
 mulsd xmm0,xmm1
 addsd xmm0,qword ptr [ebp+$08]
 movsd [esp-8],xmm0
 fld   qword ptr [esp-8]
end;

It can be even nicer if we remove the unneeded qword ptr text.

function ArcSinApprox3j(X, A, B, C, D : Double) : Double;
asm
 //Result := ((A*X + B)*X + C)*X + D;
 movsd xmm0, [ebp+$20]
 movsd xmm1, [ebp+$28]
 mulsd xmm0,xmm1
 addsd xmm0, [ebp+$18]
 mulsd xmm0,xmm1
 addsd xmm0, [ebp+$10]
 mulsd xmm0,xmm1
 addsd xmm0, [ebp+$08]
 movsd [esp-8],xmm0
 fld   qword ptr [esp-8]
end;

Replace the pointers with the parameter names:
function ArcSinApprox3j(X, A, B, C, D : Double) : Double;
asm
 //Result := ((A*X + B)*X + C)*X + D;
 movsd xmm0, A
 movsd xmm1, X
 mulsd xmm0,xmm1
 addsd xmm0, B
 mulsd xmm0,xmm1
 addsd xmm0, C
 mulsd xmm0,xmm1
 addsd xmm0, D
 movsd [esp-8],xmm0
 fld   qword ptr [esp-8]
end;

Well, how does this version perform?
The benchmark score is 45882.
This version is somewhat slower than the FP version, which scored 48292.

We need to investigate what the reason for this is.
Is it the overhead of the last two lines, or is it due to the 2-cycle throughput of addsd and mulsd?
The overhead can be removed by transferring the result as an out parameter, or we can inline the function.
It would be interesting for us to see how big an advantage it is to inline this relatively small function.
After all there is a good deal of overhead in copying 5 double precision parameters, each 8 bytes big.
Let us see how much code is actually needed for this:

push dword ptr [ebp+$14]
push dword ptr [ebp+$10]
push dword ptr [ebp+$34]
push dword ptr [ebp+$30]
push dword ptr [ebp+$2c]
push dword ptr [ebp+$28]
push dword ptr [ebp+$24]
push dword ptr [ebp+$20]
push dword ptr [ebp+$1c]
push dword ptr [ebp+$18]
call dword ptr [ArcSinApproxFunction]
fstp qword ptr [ebp+$08]

No less than 10 push instructions each pushing a 4 byte half of each parameter onto the stack.
Imagine that the register calling convention took its name seriously and transferred the parameters on the FP stack
instead.
Then we would have 5 fld, which would also remove the need to load parameters from the stack in the function.

That is, 5 fld in the function would be replaced by 5 fld at the call place of the function,
and 10 push instructions would go away.
This would lead to a dramatic increase in performance.
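Purely as an illustration of that idea (this is not code Delphi generates, and the variable references are made up), a call site could then look like this:

fld  X      //5 fld instead of 10 push
fld  A
fld  B
fld  C
fld  D
call dword ptr [ArcSinApproxFunction]
fstp Y      //the result still comes back in st(0)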

Inlining the function would also remove the overhead of the call/ret pair,
which is a lot less than the overhead of the mega push, and this would give us
a clue about the performance of an updated register2 calling convention ;-)

Inlined ArcSinApprox3i 156006
Inlined ArcSinApprox3j 160000
The improvement is an astonishing 400 %.

I truly wish Borland would introduce a true register calling convention for floating point parameters in the near future.
The SSE2 version is only 3 % faster than the IA32 version. This could be more on a decent SSE2 implementation.

Lesson 7 has hereby come to an end. 
You now know nearly all about floating point programming ;-)