CnPack Forum » 技术板块灌水区 » <<BASM 初学者入门>> 第 7 课


2007-6-12 15:39 skyjacker
<<BASM 初学者入门>> 第 7 课

http://www.cnpack.org
QQ Group: 130970
翻译:SkyJacker
版本:草稿版
状态:未校对
时间:2007

Lesson 7
Welcome to lesson number 7. Today’s subject is floating point BASM.
This was also the subject of an earlier lesson, but this lesson will add new information.
We will look at how to code scalar SSE2 code and how instructions are scheduled in the fp pipelines.
欢迎来到第 7 课。今天的主题是 BASM 中的浮点数。
这也是我们之前课程的一个主题,但是这一课将增加一些新内容。
我们将看看如何编写标量 SSE2 代码,以及指令在 fp 管道线中是如何被调度的。

Today’s example function is evaluating a third order polynomial.
今天的示例函数是计算一个三次多项式。
function ArcSinApprox1a(X, A, B, C, D : Double) : Double;
begin
Result := A*X*X*X + B*X*X + C*X + D;
end;

Instead of just analyzing and optimizing this function we will see an actual use of it.
A third order polynomial can approximate the ArcSin function on its entire interval, [-1;1],
with a maximal absolute error of 0.086.
This is not impressive but what we develop in this lesson will extend to higher order polynomials
in a straightforward manner, yielding higher precision.
我们不只是分析和优化这个函数,还会看看它的一个实际用途。
一个三次多项式可以在整个区间 [-1,1] 上近似 ArcSin 函数,最大绝对误差为 0.086。
这并不算出色,但本课开发的方法可以直接推广到更高次的多项式,从而获得更高的精度。

The parameters A, B, C and D define the shape of the curve for the function and the values for a fit to the
ArcSin approximation with minimal error must be found.
For this purpose we develop an optimizer, which is also used as the benchmark.
Because ArcSin(0) = 0 we immediately see that D=0 and D can be left out of optimization.
参数 A、B、C 和 D 决定了函数曲线的形状,必须找到使 ArcSin 近似误差最小的那组值。
为了这个目的我们开发一个优化器,它也可以作为基准测试。
因为 ArcSin(0) = 0,我们立即可知 D=0,因此 D 可以不参与优化。

We also know that ArcSin is an odd function and therefore the second order term B*X*X is of no use
in the approximation.
This is because a second order term is even and has symmetry around the Y-axis.
Odd order functions have antisymmetry around the Y-axis with f(X) = -f(-X).
All this means that our function could be reduced to Result := A*X*X*X + C*X;
我们也知道 ArcSin 是一个奇函数,因此二次项 B*X*X 在近似中没有作用。
这是因为二次项是偶的,关于 Y 轴对称。
奇次函数关于 Y 轴反对称,满足 f(X) = -f(-X)。
根据这些特性,函数可以简化为 Result := A*X*X*X + C*X;

We do however not do that because it is more illustrative to use the full function.
ArcSin is a special case and we want to be as general as possible.
但我们并不这样做,因为使用完整的函数更具说明性。
ArcSin 只是一个特例,我们希望尽可能保持通用。

Function number 1a has 6 multiplications and 3 additions. Writing it on Horner form
这个函数有 6 个乘法和 3 个加法。把它写成 Horner 形式
Result := ((A*X + B)*X + C)*X + D;

reduces this to 3 multiplications and 3 additions.
精简为 3 个乘和 3 个加。

Another form is 另一种形式是
Result := (A*X + B)*(X*X)+(C*X + D);
which has 4 multiplications and 3 additions.
这一个有 4 个乘和 3 个加。

On modern processors it is as important how much parallelism can be extracted from the formula
as it is how many multiplications and additions it has.
Modern processors such as AMD Athlon, Intel P4 and P3 have pipelines.
Pipelines are a necessity on processors running at high frequencies, because the amount of work for an addition,
subtraction, multiplication or division cannot be done in a single clock cycle.
在现代处理器上,能从公式中提取多少并行性,与公式包含多少次乘法和加法同等重要。
现代处理器如 AMD Athlon、Intel P4 和 P3 都有管道线。
对高频运行的处理器来说,管道线是必需的,因为一次加、减、乘、除运算的工作量无法在单个时钟周期内完成。

On P4 there is a pipeline called FP_ADD that handles addition and subtraction.
This pipeline has 5 stages, which means that the process of performing an addition or subtraction
is broken down into 5 sub tasks.
Therefore addition and subtraction is done in 5 clock cycles.
在 P4 有一个叫做 FP_ADD 的管道线,它操作加法和减法。
这个管道线有 5 级,意味着执行一次加法或减法的过程被分解成 5 个子任务。
因此,一次加法或减法需要 5 个时钟周期才能完成。

The advantage of having a pipeline is that even though it takes 5 cycles to complete an addition a new one
can be started in every clock cycle.
This is because the first add in a series leaves the first stage of the pipeline in the second clock cycle
and then this stage can accept add number 2.
If we have this series of adds the first one will leave the pipeline in cycle 5, number 2 will leave in cycle 6 etc.
Throughput is one add per cycle.
管道线的优点是,尽管完成一个加法花费 5 个周期,但是在每个时钟周期都可以开始一个新的加法。
这是因为一串加法中的第一个在第 2 个时钟周期就离开了管道线的第一级,于是这一级可以接收第 2 个加法。
如果我们有这一系列的加法,第一个加法将在第 5 个时钟周期结束,第二个加法将在第 6 个时钟周期结束,依次类推。
吞吐量是每个周期一个 add。

Parallelism is achieved by having up to 5 additions/subtractions in flight in the pipeline at the same time.
The drawback is that if a second of two following additions need the output from the first addition it has to wait
for the first one to complete. We say that there is a data dependency between the two instructions and we see the
full latency of the additions, which is 2 times 5 cycles.
并行性是通过让最多 5 个加法/减法同时在管道线中执行而获得的。
缺点是,如果相邻两个加法中的第二个需要第一个的输出结果,它就必须等待第一个完成。
我们称这两条指令之间存在数据依赖,此时我们会看到加法的完整潜伏期,即 2 乘 5 个周期。

Let’s use our function as an example to see how pipelines work.
用我们的函数来看看管道线是如何工作的
Result := A*X*X*X + B*X*X + C*X + D;

It is seen that the 4 terms can be evaluated in parallel and then be added as the final act. Of course term number 4
is not "evaluated". A*X is the first operation ready to be scheduled to run in the F_MUL pipeline. The latency for
FMUL on P4 is 7 cycles and A*X will be ready in cycle 7.
这 4 项可以被并行的计算,然后将结果相加,当然第 4 项不用 "计算" 了。
A*X 是第一个准备好被调度进 F_MUL 管道线的操作。在 P4 上 FMUL 的潜伏期是 7 个时钟周期,A*X 将在第 7 个周期就绪。

FMUL has a throughput of 2 cycles. From this we see that FMUL is not fully pipelined.
The pipeline will accept a new instruction in cycle 3 and not in cycle 2.
B*X is the second instruction ready to execute and it will start in cycle 3.
In cycle 5 the pipeline is again ready to receive a new instruction and this is C*X.
In cycle 7 A*X is complete and (A*X)*X can start in cycle 8.
In cycle 10 B*X is finished and (B*X)*X will start.
FMUL 的吞吐量是每 2 个周期一条指令。由此可见 FMUL 没有完全管道线化。
管道线将在第 3 个周期接收一个新的指令,而不是在第 2 个周期。
B*X 是要执行的第二个指令,它将在第 3 个周期开始。
在第 5 个周期,管道线将再接收一个新指令,它是 C*X。
在第 7 个周期, A*X 完成了,(A*X)*X 在第 8 个周期开始。
在第 10 个周期 B*X 完成,并且 (B*X)*X 将开始执行。

In cycle 12 C*X is ready to go to the F_ADD pipeline to be added to D.
In cycle 15 (A*X)*X is finished and (A*X*X)*X can start.
In cycle 17 (B*X)*X and (C*X) + D are complete and they can start in the F_ADD pipeline.
This addition completes in cycle 21, where (A*X*X)*X is also ready. Then the last addition can start in cycle 22.
Now there is only one operation in flight and we must lean back and wait for the full latency of FADD,
which is 5 cycles. In clock 27 the last addition is finished and the job is done.
在第 12 周期 C*X 准备进入 F_ADD 管道线与 D 相加。
在第 15 周期 (A*X)*X 完成,(A*X*X)*X 开始。
在第 17 周期 (B*X)*X 和 (C*X) + D 完成,然后开始进入 F_ADD 管道线。
这个加法在第 21 周期完成,此时 (A*X*X)*X 也已就绪,于是最后一个加法可以在第 22 周期开始。
现在只剩一个操作在执行,我们必须等待 FADD 的完整潜伏期,即 5 个周期。
在第 27 个周期最后的加法结束,整个工作结束了。
      
These tables give the details. The left column symbolizes the F_MUL pipeline with 7 stages and the right column
symbolizes the F_ADD pipeline with 5 stages.
下面的表格详细的描述了这个过程。
左边一列表示有 7 级的 F_MUL 管道线,右边一列表示有 5 级的 F_ADD 管道线。

[此表请见 DelphiBeginners..doc]

An out of order core schedules instructions to run as soon as data and resources are ready.
Resources are registers and execution pipelines. I do not know for sure, but I think that instructions are scheduled
to run in program order except when an instruction stalls.
In this situation the next ready instruction in program order is scheduled.
The stalled instruction will be started as soon as the stall reason is removed.
It can stall because of missing resources or not ready data.
乱序执行的内核在数据和资源就绪后立刻调度指令运行。
资源指的是寄存器和执行管道线。
我不能完全确定,但我认为指令是按程序顺序调度运行的,除非某条指令停顿。
这种情况下,将调度程序顺序中下一条就绪的指令。停顿的指令在停顿原因消除后会立即开始执行。
停顿的原因可能是缺少资源,或是数据未就绪。
   
After having seen how instructions are scheduled to run in the pipelines of a P4 we establish the benchmark.
The benchmark is an optimizer that search for the best possible fit of our polynomial to ArcSin. It is based on the
most simple of all optimization algorithms, which is the exhaustive search. We simply try a lot of combinations of
parameter values and remember the set of parameters, which give the best fit.
A and C are run in the intervals [AStart;AEnd] and [CStart; CEnd], with the step sizes AStepSize and CStepsize.
This is done in two nested while loops.
我们已经学习了 P4 中的管道线是如何调度指令运行的, 接下来我们建立一个基准测试。
基准测试是一个优化器,它搜索我们的多项式对 ArcSin 的最佳拟合。
它基于最简单的优化算法,就是穷举搜索。我们简单的尝试参数值的多种组合,并且记录最满足条件的参数值。
A 和 C 分别在闭区间 [AStart;AEnd] 和 [CStart; CEnd]中,步进大小是 AStepSize 和 CStepsize。
用两个 while 嵌套循环实现。

StartA    := 0;
StartC    := -1;
EndA      := 1;
EndC      := 1;
AStepSize := 1E-2;
CStepSize := 1E-3;
OptA      := 9999;
OptC      := 9999;
A         := StartA;
while A <= EndA do
  begin
   C := StartC;
   while C <= EndC do
    begin
     Inc(NoOfIterations);
     MaxAbsError := CalculateMaxAbsError(A,C, ArcSinArray);
     if MaxAbsError <= MinMaxAbsError then
      begin
       MinMaxAbsError := MaxAbsError;
       OptA := A;
       OptC := C;
      end;
     C := C + CStepSize;
    end;
   A := A + AStepSize;
  end;
  
The CalculateMaxAbsError function calculates a number of points on the X interval [-1;1], which is the definition
interval of the ArcSin function
CalculateMaxAbsError 函数在 X 的区间 [-1;1] 上计算若干个点,这个区间是 ArcSin 函数的定义域。

function TMainForm.CalculateMaxAbsError(A, C : Double; ArcSinArray : TArcSinArray) : Double;
var
X, Y, D, B, Yref, Error, AbsError, MaxAbsError : Double;

begin
B := 0;
D := 0;
MaxAbsError := 0;
X := -1;
repeat
  Yref := ArcSin (X);
  Y := ArcSinApproxFunction(X, A, B, C, D);
  Error := Yref-Y;
  AbsError := Abs(Error);
  MaxAbsError := Max(MaxAbsError, AbsError);
  X := X + XSTEPSIZE;
until(X > 1);
Result := MaxAbsError;
end;

At each point we calculate the error by subtracting the Y value from our approximation function from the reference
Y value obtained from a call to the Delphi RTL ArcSin function.
我们用自己的近似函数求出一个 Y 值,再调用 Delphi RTL 的 ArcSin 函数求出参考值 Yref,
然后用参考值减去 Y 值,得到每个点上的误差。

The error can be positive or negative, but we are interested in the absolute value.
We remember the biggest absolute error by taking the maximum of the two values MaxAbsError and
AbsError assigning it to MaxAbsError.
MaxAbsError is initialized to zero and in the first evaluation it will get the value of the first error
(if it is bigger than zero).
The MaxAbsError is returned as the result from the function after a full sweep has been completed.
In the optimizer function the two values of A and C that gave the smallest maximum error are remembered
together with the actual MinMaxAbsError.
误差可能是正数或负数,但我们只对绝对值感兴趣。
我们通过将  MaxAbsError 和 AbsError 两个值的最大值赋给 MaxAbsError 来记录最大的绝对值误差。
MaxAbsError 被初始化为零,第一次比较时它将获得第一个误差值(如果该误差大于零)。
完整扫描一遍之后,MaxAbsError 作为函数结果返回。
在优化器函数中,记录下给出最小最大误差的 A、C 值,连同实际的 MinMaxAbsError。

All that matters in an optimizer of this kind is to be able to evaluate as many parameter combinations as possible.
For this purpose we must optimize the optimizer ;-) and the functions we evaluate.
In this lesson the purpose is slightly different because all we want is some valid benchmarks
for the functions we optimize.
The means are however the same, the code of the optimizer must take as few cycles as possible such that
the cycles the functions use is the biggest part of the total number of cycles used.
对这种优化器来说,关键在于能够计算尽可能多的参数组合。
为此我们必须优化这个优化器 ;-) 以及被它计算的函数。
本课的目的略有不同,因为我们想要的只是为被优化的函数取得有效的基准值。
但手段是一样的:优化器代码必须花费尽可能少的周期,使函数本身占用的周期在总周期数中占最大比重。

The first optimizer optimization that is done is to realize that there is no need to evaluate the reference
function over and over again.
It returns, of course, the same values no matter which values A and C have.
Sweeping the reference function once and storing the Yref values in an array do this.
The next optimization is to compact the lines that evaluate the MaxAbsError. Long version:
第一个优化器优化是:认识到无需一遍又一遍地重新计算参考函数。
无论 A 和 C 取什么值,它当然都返回相同的结果。
只需对参考函数扫描一次,并把 Yref 值存入数组即可。
接下来的优化是压缩计算 MaxAbsError 的几行代码。冗长版本:

Yref := ArcSinArray[I];
Error := Yref-Y;
AbsError := Abs(Error);

Short version 短版本

AbsError := Abs(ArcSinArray[I]-Y);

This helps because Delphi creates a lot of redundant code, when compiling FP code.
The long version compiles into this
这是有益的, 因为当编译 FP 代码时, Delphi 会产生许多冗余代码,
长版本编译为以下

Yref := ArcSinArray[I];

mov eax,[ebp-$14]
mov edx,[eax+ebx*8]
mov [ebp-$48],edx
mov edx,[eax+ebx*8+$04]
mov [ebp-$44],edx

Error := Yref-Y;

fld   qword ptr [ebp-$48]
fsub qword ptr [ebp-$30]
fstp  qword ptr [ebp-$50]
wait

AbsError := Abs(Error);

fld qword ptr [ebp-$50]
fabs
fstp qword ptr [ebp-$10]
wait

There are a lot of redundancies in this code and we must conclude that Delphi is doing a bad job on optimizing
floating point code.
Let us add some explanations to the code.
这段代码里有大量冗余,我们只能断定 Delphi 对浮点代码的优化做得很差。
下面给代码加一些解释。

The first line of Pascal is an assignment of one double variable to another.
This is done by two pairs of mov instructions, one pair for the lower 4 bytes of the variable and one for the upper 4 bytes.
The first line of asm loads the address of the array into eax, which is used as base for addressing
into the array. Ebx is I and it is multiplied by 8 because each entry in the array is 8 byte.
The 4 byte offset in the last two lines (in the last line it is hidden!) is moving
the reference into the middle of element I.
第一行 Pascal 代码是将一个 Double 变量赋给另一个.
这由两对 mov 指令完成,一对用于变量的低 4 字节,另一对用于高 4 字节。
第一行 asm 代码 将数组的地址装入 eax,它指向数据的基地址。
Ebx 是 I,它乘以 8 是因为每个数组数据是 8 个字节。
最后两行中的 4 字节偏移(在最后一行里它是隐含的)把引用移到元素 I 的中部。

Yref is located in the stack frame at [ebp-$48] and is loaded by the first line of FP code.
Y is located at [ebp-$30] and is subtracted from Yref by fsub.
The result is Error and this is stored at [ebp-$50].
Yref 在堆栈 [ebp-$48] 中,由第一行 FP 代码装入。
Y 在 [ebp-$30] 中,fsub 用 Yref 减去 Y。
结果为误差,其存储在  [ebp-$50]。

The last line of Pascal compiles into four lines of asm of which the first starts loading Error.
Saving and then loading Error is redundant and the optimizer should have removed it.
Fabs is the Abs function and is probably one of the shortest function implementations seen ;-)
The Delphi compiler does not have the inlining optimization, but it applies “compiler magic” to a few functions,
one of which is Abs. The last line stores AbsError on the stack.
The short version compiles into this
最后一行 Pascal 代码编译成四行 asm,其中第一行开始装入 Error。
先保存再装入 Error 是冗余的,优化器本应删除这一步。
fabs 就是 Abs 函数,这可能是见过的最短的函数实现之一 ;-)
Delphi 编译器没有内联优化,但它对少数函数施加了"编译器魔法",Abs 就是其中之一。
最后一行将 AbsError 存入堆栈。

简短版本被编译成如下

mov eax,[ebp-$14]
fld qword ptr [eax+ebx*8]
fsub qword ptr [ebp-$30]
fabs
fstp qword ptr [ebp-$10]
wait

Here there is no redundancy and the compiler should have emitted this code as result of the long Pascal version as well.
All lines of code are present in the long version, but all redundant lines have been removed.
这里没有冗余;对于长版本的 Pascal 代码,编译器本也应该生成同样的代码。
长版本中包含这里的所有代码行,只是所有冗余行都被删除了。

The first line loads the base address of the array into eax. The second line loads element I, I is in ebx,
unto the fp stack. The third line subtracts Y from Yref. The fourth line is the Abs function.
The fifth line stores the result in the AbsError variable.
第一行将数组的基址装入 eax。第二行装入元素 I 到 fp 堆栈,I 在 ebx 中。
第三行从 Yref 减去 Y。第四行是 Abs 函数。
第五行将结果存入 AbsError 变量。
  
There is a peculiarity with this benchmark for which I have no explanation.
The benchmark values are heavily influenced by the way it is activated.
这个基准测试有一个我无法解释的奇怪现象。
基准值受其激活方式的严重影响。

If the keyboard is used to hit the button we get a different score from the one
we get by clicking the button with the mouse!
The one who can explain this will probably get the Nobel Prize in Delphi;-)
用键盘点击按钮得到的值与用鼠标点击按钮得到的值是不相同的!
谁能解释它将获得 Delphi 诺贝尔奖 ;-)

Another irritating thing is that Delphi does not align double variables properly.
They shall be 8 byte aligned but Delphi only 4 byte aligns them.
另一个恼人的事情是 Delphi 没有正确对齐 double 变量。
它们应该按 8 字节对齐,但 Delphi 只按 4 字节对齐。

The penalty we can get from this is a level 1 cache line split (and L2 cache line splits as well).
Loading a variable that splits over two cache lines takes the double time of loading one that does not.
由此可能带来的代价是一级 cache 行分裂(二级 cache 行同样会分裂)。
装入一个跨越两个 cache 行的变量,所花时间是不跨行时的两倍。

Because double variables are 8 byte and the L1 cache line of P4 is 64 byte at most 1 of 8 variables can have a split.
On P3 that has 32 byte L1 cache lines it is 1 of 4.
因为 double 变量是 8 个字节,P4 的 L1 cache 是 64 个字节,最多 8 个变量中有 1 个被分开。
P3 的 L1 cache 行是 32 字节,则最多每 4 个变量中有 1 个跨行。

  
The ideal situation is that variables are aligned at 4 byte if they are 4 byte big, 8 if 8 byte etc.
To make things simple let us imagine that the first line in the level one cache is where our variables are loaded.
The first line starts at address 0, that is - memory from address 0 is loaded into it.
Our first variable is aligned and occupies the first 8 bytes at line 1.
理想的情况是变量大小是 4 个字节则按 4 个字节对齐,是 8 个字节则按 8 个字节对齐。
为简单起见,假设我们的变量被装入一级 cache 的第一行。
第一行起始于地址 0,也就是说,从地址 0 开始的内存被装入其中。
我们的第一个变量是对齐的,占据第 1 行的前 8 个字节。

Variable number two occupies byte 9-16 ….Variable number 8 occupies byte 57-64 and does not cross the line boundary.
If variables are only 4 byte aligned the first one could start at byte 4 and number 8 could start at byte 61.
The first 4 byte of it will be in line 1 and the next 4 bytes will be in line 2.
The processor loads it by first loading the lower part and then loading the higher part instead of loading it
all in one go.
第 2 个变量占用第 9-16 字节……第 8 个变量占用第 57-64 字节,没有跨越行边界。
如果变量只按 4 字节对齐,第一个变量可能起始于第 4 字节,那么第 8 个变量将起始于第 61 字节。
它的前 4 个字节在第 1 行,后 4 个字节在第 2 行。
处理器装入它时要先装入低半部分,再装入高半部分,而不是一次装完。
   
Because of this misalignment of double variables in Delphi our benchmark will not be as stable as we could wish.
Alignment can change when we recompile especially if code is changed.
I have chosen (a bad choice) to not include code to align variables in the benchmark,
but will give an example of it in a later lesson.
因为 Delphi 中 double 变量没有正确对齐,我们的基准测试不会像希望的那样稳定。
重新编译时对齐可能发生变化,尤其是在代码改动之后。
我做了一个选择(一个不好的选择):不在基准测试中加入对齐变量的代码,而是在后面的课程中给出相应例子。

Let us dive into the first function optimization.
We start with the function that uses the naive formula in textbook format.
让我们深入第一个函数的优化。
我们从采用教科书形式朴素公式的那个函数开始。
function ArcSinApprox1a(X, A, B, C, D : Double) : Double;
begin
Result := A*X*X*X + B*X*X + C*X + D;
end;

This function implementation scores the benchmark 43243 on my P4 1600 MHz clocked at 1920 MHz
Delphi compiled it like this
这个函数实现在我的 P4 1600 MHz(实际运行于 1920 MHz)上的基准测试分数是 43243。
编译后的代码如下:

function ArcSinApprox1b(X, A, B, C, D : Double) : Double;
begin
{
push  ebp
mov   ebp,esp
add   esp,-$08
}
Result := A*X*X*X + B*X*X + C*X + D;
{
fld   qword ptr [ebp+$20]
fmul  qword ptr [ebp+$28]
fmul  qword ptr [ebp+$28]
fmul  qword ptr [ebp+$28]
fld   qword ptr [ebp+$18]
fmul  qword ptr [ebp+$28]
fmul  qword ptr [ebp+$28]
faddp st(1)
fld   qword ptr [ebp+$10]
fmul  qword ptr [ebp+$28]
faddp st(1)
fadd  qword ptr [ebp+$08]
fstp  qword ptr [ebp-$08]
wait
fld   qword ptr [ebp-$08]
}
{
pop   ecx
pop   ecx
pop   ebp
}
end;
The code from the CPU view will not compile because of the instruction faddp st(1) and we remove st(1).
As default the faddp instruction operates on st(0), st(1) and there is no need to write it out.
来自 CPU 窗口中的代码不能被编译,因为指令 faddp st(1),我们移除 st(1)。
faddp 指令默认是操作 st(0), st(1),不需要将它们写上。

function ArcSinApprox1c(X, A, B, C, D : Double) : Double;
asm
//push  ebp  //Added by compiler 由编译器添加
//mov   ebp,esp   //Added by compiler 由编译器添加
add   esp,-$08
//Result := A*X*X*X + B*X*X + C*X + D;
fld   qword ptr [ebp+$20]
fmul  qword ptr [ebp+$28]
fmul  qword ptr [ebp+$28]
fmul  qword ptr [ebp+$28]
fld   qword ptr [ebp+$18]
fmul  qword ptr [ebp+$28]
fmul  qword ptr [ebp+$28]
faddp //st(1)
fld   qword ptr [ebp+$10]
fmul  qword ptr [ebp+$28]
faddp //st(1)
fadd  qword ptr [ebp+$08]
fstp  qword ptr [ebp-$08]
wait
fld   qword ptr [ebp-$08]
pop   ecx
pop   ecx
//pop   ebp //Added by compiler  由编译器添加
end;

First we observe that there is no need to set up a stack frame.
The stack is actually used for storing the result temporarily and reloading it again in the lines
首先我们注意到,这里并不需要建立堆栈帧。
堆栈实际上被用来临时保存结果,然后在下面几行中重新装入:

fstp  qword ptr [ebp-$08]
wait
fld   qword ptr [ebp-$08]

but the base pointer and not the stack pointer are used for this.
The lines that use ebp plus a value are accessing the parameters, which are located above the base pointer,
which is in the calling functions stack frame.
The stack pointer is not used at all in the function and changing its value is meaningless.
但这里用于此目的的是基址指针而不是堆栈指针。
那些 ebp 加偏移量的行访问的是参数,参数位于基址指针之上,在调用者的堆栈帧中。
这个函数里根本没有用到堆栈指针,改变它的值毫无意义。

The mov ebp, esp instruction added by the compiler together with the line add esp, -$08 creates an 8-byte stack frame.
Because these lines change the ebp register it is necessary to back it up by pushing it to the stack.
Unfortunately we can only remove the add esp, 8 line and the two pop ecx lines that has the purpose of
subtracting 8 bytes from the stack pointer, esp.
编译器添加的 mov ebp, esp 指令与 add esp, -$08 一起建立了一个 8 字节的堆栈帧。
因为这些行改变了 ebp 寄存器,因此需要通过压入堆栈来备份它。
遗憾的是,我们只能移除 add esp, 8 ,还有两行 pop ecx,它们是为了恢复前面已经将堆栈指针 esp 减 8 的操作。

function ArcSinApprox1d(X, A, B, C, D : Double) : Double;
asm
//add   esp,-$08
//Result := A*X*X*X + B*X*X + C*X + D;
fld   qword ptr [ebp+$20]
fmul  qword ptr [ebp+$28]
fmul  qword ptr [ebp+$28]
fmul  qword ptr [ebp+$28]
fld   qword ptr [ebp+$18]
fmul  qword ptr [ebp+$28]
fmul  qword ptr [ebp+$28]
faddp
fld   qword ptr [ebp+$10]
fmul  qword ptr [ebp+$28]
faddp
fadd  qword ptr [ebp+$08]
fstp  qword ptr [ebp-$08]
wait
fld   qword ptr [ebp-$08]
//pop   ecx
//pop   ecx
end;
This function implementation scores the benchmark 42391 and performance actually dipped a little.
The compiler inserts the line mov ebp, esp and we can make it redundant by using esp instead of ebp.
这个函数实现的基准测试分数是 42391,性能实际上还下降了一点。
编译器会插入 mov ebp, esp 这一行,我们可以改用 esp 代替 ebp,使这条指令成为冗余。

function ArcSinApprox1e(X, A, B, C, D : Double) : Double;
asm
//Result := A*X*X*X + B*X*X + C*X + D;
//fld   qword ptr [ebp+$20]
fld   qword ptr [esp+$20]
//fmul  qword ptr [ebp+$28]
fmul  qword ptr [esp+$28]
//fmul  qword ptr [ebp+$28]
fmul  qword ptr [esp+$28]
//fmul  qword ptr [ebp+$28]
fmul  qword ptr [esp+$28]
//fld   qword ptr [ebp+$18]
fld   qword ptr [esp+$18]
//fmul  qword ptr [ebp+$28]
fmul  qword ptr [esp+$28]
//fmul  qword ptr [ebp+$28]
fmul  qword ptr [esp+$28]
faddp
//fld   qword ptr [ebp+$10]
fld   qword ptr [esp+$10]
//fmul  qword ptr [ebp+$28]
fmul  qword ptr [esp+$28]
faddp
//fadd  qword ptr [ebp+$08]
fadd  qword ptr [esp+$08]
//fstp  qword ptr [ebp-$08]
fstp  qword ptr [esp-$08]
wait
//fld   qword ptr [ebp-$08]
fld   qword ptr [esp-$08]
end;
Unfortunately the compiler still inserts the mov instruction and we performed a copy propagation that
gave no optimization because it is not followed by a dead code removal.
Therefore performance is almost the same – 43094.
不幸的是,编译器仍然插入了那条 mov 指令;我们所做的复制传播没有带来优化,因为其后没有进行死代码删除。
因此性能几乎没变 - 43094。

Without investigating whether the result stored on the stack
is used we can optimize the lines copying it there and reloading it.
The result of them is that there is a copy of Result left on the stack.
They redundantly pop the result of the FP stack and reload Result from the stack.
无需调查存放在堆栈上的结果是否被使用,我们就可以优化把结果复制到堆栈又重新装入的那两行。
它们的效果只是在堆栈上留下一份 Result 的拷贝:
先冗余地把结果弹出 FP 堆栈,再从堆栈重新装入 Result。

This single line has the same effect, but redundancy is removed.
这一行有相同的效果,但是冗余被删除了。
fst  qword ptr [ebp-$08]
This optimization is very often possible on Delphi generated code and is important to remember.
这种优化在 Delphi 生成的代码中经常可行,值得记住。

function ArcSinApprox1f(X, A, B, C, D : Double) : Double;
asm
//Result := A*X*X*X + B*X*X + C*X + D;
fld   qword ptr [esp+$20]
fmul  qword ptr [esp+$28]
fmul  qword ptr [esp+$28]
fmul  qword ptr [esp+$28]
fld   qword ptr [esp+$18]
fmul  qword ptr [esp+$28]
fmul  qword ptr [esp+$28]
faddp
fld   qword ptr [esp+$10]
fmul  qword ptr [esp+$28]
faddp
fadd  qword ptr [esp+$08]
//fstp  qword ptr [esp-$08]
fst  qword ptr [esp-$08]
wait
//fld   qword ptr [esp-$08]
end;
This function implementation scores the benchmark 47939 and the improvement is 11 %
The next question to ask is: Is the copy of the Result on the stack ever used?
To answer it we must inspect the code at the location of the call to the function.
这个函数实现的基准测试分数是 47939,提高了 11%。
下一个要问的问题是:堆栈上那份 Result 的拷贝究竟会被用到吗?
要回答它,我们必须检查调用这个函数之处的代码:

Y := ArcSinApproxFunction(X, A, B, C, D);

call dword ptr [ArcSinApproxFunction]
fstp qword ptr [ebp-$30]
wait

The first line after the call stores the result in Y and pops the stack.
Seeing this we assume that the result on the stack is not used,
but to be sure we must scan through the rest of the code too.
If the rule for the Register calling convention is that FP results are transferred on the FP stack
it is weird that a copy is also placed on the stack.
第一行将函数结果存入 Y,弹出堆栈。
看到这里,我们推测堆栈上的那份结果不会被使用,但为了确定,还必须把其余代码也扫描一遍。
如果 Register 调用约定规定 FP 结果通过 FP 堆栈传递,那么还要在堆栈上放一份拷贝就很奇怪了。

We conclude that it is redundant to copy the Result to the stack and remove the line doing it.
我们断定将结果复制到堆栈是冗余的,删除它。
function ArcSinApprox1g(X, A, B, C, D : Double) : Double;
asm
//Result := A*X*X*X + B*X*X + C*X + D;
fld   qword ptr [esp+$20]
fmul  qword ptr [esp+$28]
fmul  qword ptr [esp+$28]
fmul  qword ptr [esp+$28]
fld   qword ptr [esp+$18]
fmul  qword ptr [esp+$28]
fmul  qword ptr [esp+$28]
faddp
fld   qword ptr [esp+$10]
fmul  qword ptr [esp+$28]
faddp
fadd  qword ptr [esp+$08]
//fst  qword ptr [esp-$08]
wait
end;
This function implementation scores the benchmark 47405
Instead of writing all the qword ptr [esp+$xx] lines we can write the names of the variables and let the compiler
translate them into addresses.
This actually makes the code more robust.
If the location of the variables should change then the code breaks if we use hard coded addressing.
This will however only happen if the calling convention is changed and this is not likely to happen very often.
这个函数的基准测试分数是 47405。我们可以不写那些 qword ptr [esp+$xx],而是写出变量名,让编译器把它们翻译成地址。
这实际上使代码更健壮:如果变量的位置发生变化,使用硬编码地址的代码就会出错。
不过这只会在调用约定改变时发生,而这种情况不会经常出现。

function ArcSinApprox1g_2(X, A, B, C, D : Double) : Double;
asm
//Result := A*X*X*X + B*X*X + C*X + D;
//fld   qword ptr [esp+$20]
fld   A
//fmul  qword ptr [esp+$28]
fmul  X
//fmul  qword ptr [esp+$28]
fmul  X
//fmul  qword ptr [esp+$28]
fmul  X
//fld   qword ptr [esp+$18]
fld   B
//fmul  qword ptr [esp+$28]
fmul  X
//fmul  qword ptr [esp+$28]
fmul  X
faddp
//fld   qword ptr [esp+$10]
fld   C
//fmul  qword ptr [esp+$28]
fmul  X
faddp
//fadd  qword ptr [esp+$08]
fadd  D
wait
end;
Try having both types of lines enabled
试着把这两种写法的行都启用
fld   qword ptr [esp+$20]
fld   A
and see in the CPU view how the compiler generated exactly the same code for both versions.
通过观察 CPU 窗口发现,编译器为这两个版本产生了相同的代码。

X is used in a lot of lines and it is referenced on the stack.
Therefore it is loaded from the stack into the internal FP registers every time.
It should be faster to load it once into the FP stack and let all uses reference the FP stack.
X 在很多行中被使用,而且每次都是通过堆栈引用。
因此它每次都要从堆栈装入内部 FP 寄存器。
把它一次性装入 FP 堆栈,让所有使用处都引用 FP 堆栈,应该会更快。

function ArcSinApprox1h(X, A, B, C, D : Double) : Double;
asm
//Result := A*X*X*X + B*X*X + C*X + D;
fld   qword ptr [esp+$20]
fld   qword ptr [esp+$28] //New
fxch
//fmul qword ptr [esp+$28]
fmul  st(0),st(1)
//fmul qword ptr [esp+$28]
fmul  st(0),st(1)
//fmul qword ptr [esp+$28]
fmul  st(0),st(1)
fld   qword ptr [esp+$18]
//fmul qword ptr [esp+$28]
fmul  st(0),st(2)
//fmul qword ptr [esp+$28]
fmul  st(0),st(2)
faddp
fld   qword ptr [esp+$10]
//fmul qword ptr [esp+$28]
fmul  st(0),st(2)
ffree st(2)
faddp
fadd  qword ptr [esp+$08]
fst   qword ptr [esp-$08]
wait
end;
The second line is one we added and it loads X once and for all.
Because it places X on the top of the stack in st(0)
and this position is needed as temp for further operations we exchange st(0) with st(1) with the fxch instruction.
We could of course have changed the position of line 1 and 2 and obtained the same effect.
All the lines multiplying
我们增加第二行来一次性装入 X。
由于 X 在栈顶 st(0),这个位置需要作为其他操作的临时位置,所以我们用 fxch 指令交换 st(0), st(1)。
我们当然可以通过改变第 1,2 行的位置来获得同样的效果。
所有的乘法都用

st(0) with X  st(0) 乘 X
fmul qword ptr [esp+$28]
are changed to 改为
fmul  st(0),st(1)
After the FP copy of X has been used for the last time we remove it with the instruction ffree.
This function implementation scores the benchmark 46882 and performance is decreased by 1 %.
This was a surprise. Fxch is claimed by Intel to be nearly free, because it works by renaming the internal registers.
Let us check that by removing it
在 X 的 FP 拷贝最后一次被使用之后,我们用 ffree 指令将其释放。
这个函数实现的基准测试分数是 46882,性能下降了 1%。
真令人惊讶。Intel 声称 fxch 几乎是免费的,因为它通过重命名内部寄存器来实现。
我们把它移除来验证一下。

function ArcSinApprox1i(X, A, B, C, D : Double) : Double;
asm
//Result := A*X*X*X + B*X*X + C*X + D;
fld   qword ptr [esp+$28]
fld   qword ptr [esp+$20]
//fld   qword ptr [esp+$28]
//fxch
fmul  st(0),st(1)
fmul  st(0),st(1)
fmul  st(0),st(1)
fld   qword ptr [esp+$18]
fmul  st(0),st(2)
fmul  st(0),st(2)
faddp
fld   qword ptr [esp+$10]
fmul  st(0),st(2)
ffree st(2)
faddp
fadd  qword ptr [esp+$08]
wait
end;
This function implementation scores the benchmark 45393 and performance is decreased by 3 %.
Fxch is surely not to blame because performance once more went down. What is going on?
The wait instruction was discussed in an earlier lesson and this time we just remove it.
这个函数实现的基准测试分数是 45393,性能下降了 3%。
这肯定不能怪 fxch,因为性能又一次下降了。这是怎么回事?
wait 指令在前面的课程中讨论过了,这次我们直接移除它。

function ArcSinApprox1j(X, A, B, C, D : Double) : Double;
asm
//Result := A*X*X*X + B*X*X + C*X + D;
fld   qword ptr [esp+$28]
fld   qword ptr [esp+$20]
fmul  st(0),st(1)
fmul  st(0),st(1)
fmul  st(0),st(1)
fld   qword ptr [esp+$18]
fmul  st(0),st(2)
fmul  st(0),st(2)
faddp
fld   qword ptr [esp+$10]
fmul  st(0),st(2)
ffree st(2)
faddp
fadd  qword ptr [esp+$08]
//wait
end;
Performance went down to 44140.
Let us crosscheck these surprising results by running the functions on a P3.
性能下降到 44140。
我们在一台 P3 上运行这些函数,交叉验证这些令人惊讶的结果。

ArcSinApprox1a    63613
ArcSinApprox1b    64412
ArcSinApprox1c    64433
ArcSinApprox1d    65062
ArcSinApprox1e    64830
ArcSinApprox1f    62598
ArcSinApprox1g    79586
ArcSinApprox1h    85361
ArcSinApprox1i    80515
ArcSinApprox1j    80192
First of all we see that ArcSinApprox1h is the fastest function on P3.
Thereby it is seen that loading data from the level 1 data cache is more expensive on P3 than on P4,
because changing the code such that X is loaded only once improved performance on P3 and not on P4.
On the other hand we could also say that it must always be slower to get data from the cache than
from internal registers if the architecture is sound and this is only true for the P6 architecture here.
P4 has a fast L1 data cache, which can be read in only 2 clock cycles, but an internal register
read should have a latency of only one cycle. It however looks like reads from registers are 2 clock cycles.
Then we see that a P3 at 1400 is nearly 80 % faster than a P4 at 1920 on this code. We know that latencies on P3 are
shorter, but this is not enough to explain the huge difference.
首先我们看到,在 P3 上 ArcSinApprox1h 是最快的函数。
由此可见,从一级数据 cache 装载数据在 P3 上比在 P4 上开销更大,因为把代码改成只装入一次 X 在 P3 上提高了性能,在 P4 上却没有。
另一方面也可以说,如果体系结构设计合理,从 cache 取数据总应该比从内部寄存器取慢,而这里只有 P6 体系结构符合这一点。
P4 有一个很快的 L1 数据 cache,只需 2 个时钟周期即可读取,但读内部寄存器的潜伏期本应只有 1 个周期。然而看起来寄存器读取也要 2 个时钟周期。
接着我们看到,在这段代码上,1400 MHz 的 P3 比 1920 MHz 的 P4 快了将近 80%。我们知道 P3 上的潜伏期更短,但这不足以解释如此巨大的差别。

The latencies and throughput of the used instructions are on P3
这些指令在 P3 上的潜伏期和吞吐量是

Fadd latency is 3 clock cycles and throughput is 1
Fmul latency is 5 clock cycles and throughput is 1
Fadd 潜伏期是 3 个时钟周期,吞吐量是 1
Fmul 潜伏期是 5 个时钟周期,吞吐量是 1
On P4 在 P4 上
Fadd 潜伏期是 5 个时钟周期,吞吐量是 1
Fmul 潜伏期是 7 个时钟周期,吞吐量是 2

I could not find numbers for fld. The explanation for the very bad performance of P4 on this code
must be the 2-cycle throughput on fmul together with the slow FP register access.
The fmul pipeline only accepts a new instruction on every second cycle where the P3 pipeline accepts one every cycle.
我没有找到 fld 的数据。P4 在这段代码上表现很差的原因,一定是 fmul 的 2 周期吞吐量加上较慢的 FP 寄存器访问。
P4 的 fmul 管道线每两个周期才接收一条新指令,而 P3 的管道线每个周期都能接收一条。

Scaling the results by clock frequency
结果与主频的比率是
47939 / 1920 = 25
85361 / 1400 = 61
reveals that clock by clock on the fastest function version for each processor P3 is nearly 2.5 times faster than P4.
This is truly astonishing. If P4 should have a slight chance against a P3 we must remove some of
those multiplications.
这说明逐周期相比,用各自处理器上最快的函数版本,P3 几乎比 P4 快 2.5 倍。
这真是令人惊讶。如果要让 P4 在 P3 面前有一点机会,我们必须去掉一些乘法操作。
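The clock-for-clock arithmetic above can be checked with a tiny C helper; `scaled_score` is a hypothetical name, not part of the lesson's Delphi code:

```c
#include <assert.h>

/* Clock-for-clock comparison: benchmark score divided by clock frequency
   in MHz. A higher value means more work done per clock cycle. */
double scaled_score(double benchmark, double mhz)
{
    return benchmark / mhz;
}
```

With the lesson's numbers, 47939/1920 is about 25 for P4 and 85361/1400 about 61 for P3, a ratio of roughly 2.4.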
This leads us to the Horner version of our function.
现在进入我们函数的 Horner 版本。

function ArcSinApprox3a(X, A, B, C, D : Double) : Double;
begin
Result := ((A*X + B)*X + C)*X + D;
end;
Which is compiled into  被编译成

function ArcSinApprox3b(X, A, B, C, D : Double) : Double;
begin
{
push ebp
mov  ebp,esp
add  esp,-$08
}
Result := ((A*X + B)*X + C)*X + D;
{
fld  qword ptr [ebp+$20]
fmul qword ptr [ebp+$28]
fadd qword ptr [ebp+$18]
fmul qword ptr [ebp+$28]
fadd qword ptr [ebp+$10]
fmul qword ptr [ebp+$28]
fadd qword ptr [ebp+$08]
fstp qword ptr [ebp-$08]
wait
fld  qword ptr [ebp-$08]
}
{
pop  ecx
pop  ecx
pop  ebp
}
end;
The first three versions of this function are identical and they surprisingly score the same benchmark.
Our benchmark is not perfect but it was precise this time ;-)
这个函数的前三个版本是完全相同的,它们的基准测试分数也惊人地一致。
我们的基准测试并不完美,但这一次它很精确 ;-)
ArcSinApprox3a    45076
ArcSinApprox3b    45076
ArcSinApprox3c    45076
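As a sanity check, a small C sketch (hypothetical helper names) shows that the Horner form computes the same value as the naive form while using three multiplications instead of six:

```c
#include <assert.h>
#include <math.h>

/* Naive evaluation: A*X*X*X + B*X*X + C*X + D -- six multiplications. */
double poly_naive(double x, double a, double b, double c, double d)
{
    return a*x*x*x + b*x*x + c*x + d;
}

/* Horner evaluation: ((A*X + B)*X + C)*X + D -- three multiplications,
   but every operation depends on the result of the previous one. */
double poly_horner(double x, double a, double b, double c, double d)
{
    return ((a*x + b)*x + c)*x + d;
}
```

The long dependency chain in the Horner form is exactly what the benchmarks below expose.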

Here is the first BASM version with no optimizations.
The compiler-supplied code is commented out.
Optimization follows the same pattern as on the first function.
这是没有任何优化的第一个 BASM 版本,编译器产生的代码被注释掉了。
优化按照与第一个函数相同的模式进行。

function ArcSinApprox3c(X, A, B, C, D : Double) : Double;
asm
//push ebp
//mov ebp,esp
add  esp,-$08
//Result := ((A*X + B)*X + C)*X + D;
fld  qword ptr [ebp+$20]
fmul qword ptr [ebp+$28]
fadd qword ptr [ebp+$18]
fmul qword ptr [ebp+$28]
fadd qword ptr [ebp+$10]
fmul qword ptr [ebp+$28]
fadd qword ptr [ebp+$08]
fstp qword ptr [ebp-$08]
wait
fld  qword ptr [ebp-$08]
pop  ecx
pop  ecx
//pop ebp
end;
First thing is to remove the add esp, -$08 line and the two pop ecx.
They are setting up a stack frame and do nothing but manipulate the stack pointer, which is not used at all.
首先移除 add esp, -$08 这一行和两行 pop ecx。
它们建立一个堆栈框架,除了操作堆栈指针之外什么也没做,而这个堆栈框架根本没有被用到。

function ArcSinApprox3d(X, A, B, C, D : Double) : Double;
asm
//add  esp,-$08
//Result := ((A*X + B)*X + C)*X + D;
fld  qword ptr [ebp+$20]
fmul qword ptr [ebp+$28]
fadd qword ptr [ebp+$18]
fmul qword ptr [ebp+$28]
fadd qword ptr [ebp+$10]
fmul qword ptr [ebp+$28]
fadd qword ptr [ebp+$08]
fstp qword ptr [ebp-$08]
wait
fld  qword ptr [ebp-$08]
//pop  ecx
//pop  ecx
end;
This function implementation scores the benchmark 43535.
Both of the redundant lines copying the result to stack and back are removed at the same time.
这个函数的基准测试分数是 43535。
接着同时删除把结果复制到堆栈、再从堆栈复制回来的两行冗余代码。

function ArcSinApprox3e(X, A, B, C, D : Double) : Double;
asm
//Result := ((A*X + B)*X + C)*X + D;
fld  qword ptr [ebp+$20]
fmul qword ptr [ebp+$28]
fadd qword ptr [ebp+$18]
fmul qword ptr [ebp+$28]
fadd qword ptr [ebp+$10]
fmul qword ptr [ebp+$28]
fadd qword ptr [ebp+$08]
//fstp qword ptr [ebp-$08]
wait
//fld  qword ptr [ebp-$08]
end;
This function implementation scores the benchmark 47237 and the improvement is 8.5 %.
Then we change the code such that X is loaded only once.
这个函数的基准测试分数是 47237,提高了 8.5%。
然后我们把代码改成只装载一次 X。

function ArcSinApprox3f(X, A, B, C, D : Double) : Double;
asm
//Result := ((A*X + B)*X + C)*X + D;
fld   qword ptr [ebp+$20]
fld   qword ptr [ebp+$28]
fxch
//fmul qword ptr [ebp+$28]
fmul  st(0),st(1)
fadd  qword ptr [ebp+$18]
//fmul qword ptr [ebp+$28]
fmul  st(0),st(1)
fadd  qword ptr [ebp+$10]
//fmul qword ptr [ebp+$28]
fmul  st(0),st(1)
ffree st(1)
fadd qword ptr [ebp+$08]
wait
end;
This function implementation scores the benchmark 47226 and performance is unchanged.
The ffree instruction can be removed by using fmulp instead of fmul, but to do this we must interchange the
two registers used. Only these two registers are in use and A*B = B*A so there is no problem doing that.
We are not removing any redundancy by this and the two ways of coding the same thing should perform identically.
这个函数的基准测试分数是 47226,性能没有改变。
用 fmulp 代替 fmul 可以删除 ffree 指令,但这样做必须交换所用的两个寄存器。
只有这两个寄存器在使用,而且 A*B = B*A,所以这样交换没有任何问题。
这样做并没有消除任何冗余,这两种写法应该有完全相同的性能。

function ArcSinApprox3g(X, A, B, C, D : Double) : Double;
asm
//Result := ((A*X + B)*X + C)*X + D;
fld   qword ptr [ebp+$20]
fld   qword ptr [ebp+$28]
fxch  st(1)
fmul  st(0),st(1)
fadd  qword ptr [ebp+$18]
fmul  st(0),st(1)
fadd  qword ptr [ebp+$10]
//fmul  st(0),st(1)
fmulp st(1),st(0)
//ffree st(1)
fadd qword ptr [ebp+$08]
wait
end;
This function implementation scores the benchmark 47416.
Then we remove the wait instruction.
这个函数的基准测试分数是 47416。
我们移除 wait 指令。

function ArcSinApprox3h(X, A, B, C, D : Double) : Double;
asm
//Result := ((A*X + B)*X + C)*X + D;
fld   qword ptr [ebp+$20]
fld   qword ptr [ebp+$28]
fxch  st(1)
fmul  st(0),st(1)
fadd  qword ptr [ebp+$18]
fmul  st(0),st(1)
fadd  qword ptr [ebp+$10]
fmulp st(1),st(0)
fadd qword ptr [ebp+$08]
//wait
end;
This function implementation scores the benchmark 47059.
The last thing to do is interchanging the lines that load X and A, and remove the fxch instruction.
这个函数的基准测试分数是 47059。
最后一件事是交换装载 X 和 A 的两行,并删除 fxch 指令。

function ArcSinApprox3i(X, A, B, C, D : Double) : Double;
asm
//Result := ((A*X + B)*X + C)*X + D;
fld   qword ptr [ebp+$28]
fld   qword ptr [ebp+$20]
//fld   qword ptr [ebp+$28]
//fxch  st(1)
fmul  st(0),st(1)
fadd  qword ptr [ebp+$18]
fmul  st(0),st(1)
fadd  qword ptr [ebp+$10]
fmulp st(1),st(0)
fadd qword ptr [ebp+$08]
end;
This function implementation scores the benchmark 46544 and performance went down!
这个函数的基准测试分数是 46544,性能下降了。

Let us compare the performance of the Horner style function with the naive one by picking the fastest
implementations of both on P4.
我们选取两种方式各自最快的实现,在 P4 上比较 Horner 版本和朴素版本的性能:
ArcSinApprox1g    47939
ArcSinApprox3g    47416

On P3 在 P3 上
ArcSinApprox1h    85361
ArcSinApprox3h    87604
The difference is not big, but the naive approach is a little faster on P4 and slower on P3.
The naive approach has more calculations, but parallelism makes up for it.
The Horner way has very little parallelism and latencies are fully exposed.
This is especially bad on P4.
Having this in mind we continue to the third possible approach, which looks like this.
差别不大,但朴素版本在 P4 上稍快,在 P3 上稍慢。
朴素版本的计算量更多,但并行性弥补了这一点。
Horner 版本几乎没有并行性,潜伏期被完全暴露出来。
这在 P4 上尤其糟糕。
考虑到这一点,我们继续第三种可能的方法,它像这样:

function ArcSinApprox4b(X, A, B, C, D : Double) : Double;
begin
{
push  ebp
mov  ebp,esp
add   esp,-$08
}
Result := (A*X + B)*(X*X)+(C*X + D);
{
fld     qword ptr [ebp+$20]
fmul  qword ptr [ebp+$28]
fadd  qword ptr [ebp+$18]
fld     qword ptr [ebp+$28]
fmul  qword ptr [ebp+$28]
fmulp st(1)
fld     qword ptr [ebp+$10]
fmul  qword ptr [ebp+$28]
fadd   qword ptr [ebp+$08]
faddp st(1)
fstp   qword ptr [ebp-$08]
wait
fld   qword ptr [ebp-$08]
}
{
pop ecx
pop ecx
pop ebp
}
end;
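The idea behind this variant is easier to see in plain C (illustrative names): the two halves and the X*X product have no mutual dependencies, so a superscalar FPU can work on them in parallel. A minimal sketch:

```c
#include <assert.h>

/* Split evaluation: (A*X + B)*(X*X) + (C*X + D).
   left, xx and right depend only on the inputs, so they can all be
   in flight at once before the final multiply and add. */
double poly_split(double x, double a, double b, double c, double d)
{
    double left  = a*x + b;   /* independent ... */
    double xx    = x*x;       /* ... of this ... */
    double right = c*x + d;   /* ... and of this */
    return left*xx + right;
}
```

It computes the same polynomial as the Horner form, just grouped for parallelism.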
Experienced as we are now, optimizing this function is going to be easy and fast ;-)
This version is as Delphi made it.
凭我们现在的经验,优化这个函数将又快又容易 ;-)
这是 Delphi 生成的版本:

function ArcSinApprox4c(X, A, B, C, D : Double) : Double;
asm
//push ebp
//mov ebp,esp
add   esp,-$08
//Result := (A*X + B)*(X*X)+(C*X + D);
fld     qword ptr [ebp+$20]
fmul  qword ptr [ebp+$28]
fadd  qword ptr [ebp+$18]
fld     qword ptr [ebp+$28]
fmul  qword ptr [ebp+$28]
fmulp //st(1)
fld     qword ptr [ebp+$10]
fmul  qword ptr [ebp+$28]
fadd  qword ptr [ebp+$08]
faddp //st(1)
fstp   qword ptr [ebp-$08]
wait
fld   qword ptr [ebp-$08]
pop   ecx
pop   ecx
//pop  ebp
end;
Removing the stack frame and the two lines that store the result in the stack frame.
删除堆栈框架以及把结果存入堆栈框架的两行代码:

function ArcSinApprox4d(X, A, B, C, D : Double) : Double;
asm
//add  esp,-$08
//Result := (A*X + B)*(X*X)+(C*X + D);
fld  qword ptr [ebp+$20]
fmul qword ptr [ebp+$28]
fadd qword ptr [ebp+$18]
fld  qword ptr [ebp+$28]
fmul qword ptr [ebp+$28]
fmulp //st(1)
fld  qword ptr [ebp+$10]
fmul qword ptr [ebp+$28]
fadd qword ptr [ebp+$08]
faddp //st(1)
//fstp qword ptr [ebp-$08]
wait
//fld  qword ptr [ebp-$08]
//pop  ecx
//pop  ecx
end;
Load X once
只装载一次 X
function ArcSinApprox4e(X, A, B, C, D : Double) : Double;
asm
//Result := (A*X + B)*(X*X)+(C*X + D);
fld   qword ptr [ebp+$20]
fld   qword ptr [ebp+$28]
//fmul qword ptr [ebp+$28]
fxch
fmul  st(0),st(1)
fadd  qword ptr [ebp+$18]
//fld  qword ptr [ebp+$28]
fld   st(1)
//fmul  qword ptr [ebp+$28]
fmul  st(0),st(2)
fmulp
fld   qword ptr [ebp+$10]
//fmul  qword ptr [ebp+$28]
fmul  st(0),st(2)
fadd  qword ptr [ebp+$08]
faddp
ffree st(1)
wait
end;
Remove fxch and wait.
移除 fxch 和 wait
function ArcSinApprox4f(X, A, B, C, D : Double) : Double;
asm
//Result := (A*X + B)*(X*X)+(C*X + D);
fld   qword ptr [ebp+$28]
fld   qword ptr [ebp+$20]
//fxch
fmul  st(0),st(1)
fadd  qword ptr [ebp+$18]
fld   st(1)
fmul  st(0),st(2)
fmulp
fld   qword ptr [ebp+$10]
fmul  st(0),st(2)
fadd  qword ptr [ebp+$08]
faddp
ffree st(1)
//wait
end;
Reschedule ffree st(1)
重新调度 ffree st(1)
function ArcSinApprox4g(X, A, B, C, D : Double) : Double;
asm
//Result := (A*X + B)*(X*X)+(C*X + D);
fld   qword ptr [ebp+$28]
fld   qword ptr [ebp+$20]
fmul  st(0),st(1)
fadd  qword ptr [ebp+$18]
fld   st(1)
fmul  st(0),st(2)
fmulp
fld   qword ptr [ebp+$10]
fmul  st(0),st(2)
ffree st(2)
fadd  qword ptr [ebp+$08]
faddp
//ffree st(1)
end;
Replace fmul/ffree by fmulp
用 fmulp 代替 fmul/ffree
function ArcSinApprox4h(X, A, B, C, D : Double) : Double;
asm
//Result := (A*X + B)*(X*X)+(C*X + D);
fld   qword ptr [ebp+$28]
fld   qword ptr [ebp+$20]
fmul  st(0),st(1)
fadd  qword ptr [ebp+$18]
fld   st(1)
fmul  st(0),st(2)
fmulp
fld   qword ptr [ebp+$10]
//fmul  st(0),st(2)
fmulp st(2),st(0)
//ffree st(2)
fadd  qword ptr [ebp+$08]
faddp
end;
Cleaning up, and observing that the compiler still backs up ebp and modifies esp redundantly.
整理代码,并注意到编译器仍然多余地备份 ebp 和修改 esp。
function ArcSinApprox4i(X, A, B, C, D : Double) : Double;
asm
//Result := (A*X + B)*(X*X)+(C*X + D);
fld   qword ptr [ebp+$28]
fld   qword ptr [ebp+$20]
fmul  st(0),st(1)
fadd  qword ptr [ebp+$18]
fld   st(1)
fmul  st(0),st(2)
fmulp
fld   qword ptr [ebp+$10]
fmulp st(2),st(0)
fadd  qword ptr [ebp+$08]
faddp
end;
The big question is now how well this function version performs.
现在最大的问题是这个版本的函数性能如何:

ArcSinApprox4a    45228
ArcSinApprox4b    45239
ArcSinApprox4c    45228
ArcSinApprox4d    51813
ArcSinApprox4e    49044
ArcSinApprox4f    48674
ArcSinApprox4g    48852
ArcSinApprox4h    44914
ArcSinApprox4i    44914
We see that “optimizations” from function d to i are “deoptimizations” on P4 except for g.
我们看到从 d 到 i 的“优化”在 P4 上除 g 之外都是“反优化”。

On P3 在 P3 上
ArcSinApprox4a    68871
ArcSinApprox4b    68871
ArcSinApprox4c    68634
ArcSinApprox4d    86806
ArcSinApprox4e    85727
ArcSinApprox4f    83542
ArcSinApprox4g    80548
ArcSinApprox4h    88378
ArcSinApprox4i    85324
We see that optimizations d and h are very good and optimizations e, f, g and i are bad.
It is quite possible that the optimal function implementation is none of the ones we have made.
We could pick version h and remove the bad optimizations or simply make some more variants
and this way get a faster implementation.
我们看到优化 d 和 h 非常好,优化 e、f、g 和 i 效果很差。
很有可能最优的函数实现并不在我们已经做出的这些版本之中。
我们可以选取版本 h 并去掉那些差的优化,或者简单地再做一些变体,
以此得到一个更快的实现。

Which function approach is the winner? To find out we pick the fastest implementation of each approach On P4
哪种函数写法是胜利者呢?
为了找出答案,我们选取每种写法在 P4 上最快的实现:
ArcSinApprox1f    47939
ArcSinApprox3g    47416
ArcSinApprox4d    51813
The last version is the fastest.
最后的版本最快。
Parallelisation is very important on a modern processor and version 4 beats the others by 9 %.
并行处理在现代处理器上非常重要,版本 4 比其他版本快 9%。

On P3 在 P3 上
ArcSinApprox1h    85361
ArcSinApprox3h    87604
ArcSinApprox4h    88378
The function version 4 is a winner on P3 as well, but by a smaller margin.
The P4 has an SSE2 instruction set, which contains instructions for double precision floating-point calculations.
The main idea with this instruction set is SIMD calculations.
SIMD is an acronym for Single Instruction Multiple Data.
版本 4 在 P3 上也是冠军,但优势要小一些。
P4 有 SSE2 指令集,其中包含用于双精度浮点计算的指令。
这个指令集的主要思想是 SIMD 计算。
SIMD 是单指令多数据 (Single Instruction Multiple Data) 的缩写。

Multiple data is here two FP double precision variables (64 bit) and two sets of these data can be added,
subtracted, multiplied or divided with one instruction.
这里的多数据是指两个双精度浮点变量(各 64 位),两组这样的数据可以用一条指令完成加、减、乘、除。
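A portable C sketch of what one packed SSE2 instruction such as addpd does; the struct and function names are illustrative models, not real intrinsics:

```c
#include <assert.h>

/* Two doubles processed by one instruction: addpd adds both lanes at once.
   This portable sketch only models the data flow, not the hardware. */
typedef struct { double lo, hi; } packed2d;

packed2d addpd_model(packed2d a, packed2d b)
{
    packed2d r = { a.lo + b.lo, a.hi + b.hi };
    return r;
}
```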

SSE2 also have some instructions for scalar calculations, which are calculations on
one pair of data just like ordinary FP math in the FPU.
The biggest difference between ordinary FP math and SSE2 scalar math is that FP math is performed on extended
precision and results are rounded to double precision when copied to double precision variables in RAM/cache.
SSE2 math is double precision calculation and double precision registers.
SSE2 也有一些用于标量计算的指令,它们像 FPU 中普通的 FP 运算一样,一次只对一对数据进行计算。
普通 FP 运算与标量 SSE2 运算的最大不同是:FP 运算以扩展精度执行,结果在复制到 RAM/cache 中的双精度变量时
才被舍入为双精度;
而 SSE2 运算是双精度计算,使用双精度寄存器。

The code examples of this lesson have relatively few calculations and precision on the FPU will be double.
If we load data, perform all calculations and store the result, the result will only bee a little less than
extended precision when still on the FPU stack, and will be rounded to exact double precision
when copied to a memory location.
本课的代码示例计算相对较少,FPU 上的精度可以保持在双精度以上。
如果装载数据、执行所有计算、再存储结果,那么结果还在 FPU 堆栈上时只比扩展精度差一点,
复制到内存位置时才被舍入为精确的双精度。

SSE2 calculations on the other hand are a little less than double precision and the result
in a register is a little less than double precision too.
If there is only one calculation the result will be in double precision, but when performing additional
calculations the error from each will sum up.
Because the FPU does all calculations in extended precision and can hold temporary results in registers,
there can be done many calculations before precision declines below double.
We have seen that the drawback of SSE2 is that precision is double or less versus the double precision of IA32 FP.
另一方面,SSE2 的计算精度比双精度略低,寄存器中的结果也比双精度略差。
如果只做一次计算,结果将是双精度的,但继续执行更多计算时,每一步的误差会累积起来。
因为 FPU 以扩展精度执行所有计算,并且可以把临时结果保存在寄存器里,所以在精度降到双精度以下之前可以完成很多次计算。
我们已经看到,SSE2 的缺点是:相对于 IA32 FP 的双精度,它的精度只有双精度或更低。
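The precision difference can be demonstrated in C, assuming `long double` maps to the x87 80-bit extended format (true for most compilers on x86, but not all, which is why the second check below is hedged):

```c
#include <assert.h>

/* At 1e16 the spacing between adjacent doubles is 2, so adding 1.0
   is lost in pure double arithmetic ... */
double add_one_in_double(void)
{
    volatile double big = 1e16;
    volatile double sum = big + 1.0;
    return sum - big;          /* 0.0: the +1 was rounded away */
}

/* ... while an 80-bit extended mantissa (64 bits) still holds it. */
long double add_one_in_extended(void)
{
    volatile long double big = 1e16L;
    volatile long double sum = big + 1.0L;
    return sum - big;          /* 1.0 where long double is 80-bit */
}
```

This mirrors the point above: x87 intermediates carry extra bits, SSE2 intermediates do not.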

What is the advantage? There are two advantages.
Registers are not ordered as a stack, which makes it easier to arrange code and secondly calculations
in double precision are faster than calculations in extended precision.
We must expect scalar SSE2 instructions to have shorter latencies than their IA32 FP counterparts.
SSE2 的优点是什么呢?有两个优点。
第一,寄存器不是按堆栈组织的,这使得安排代码更容易;第二,双精度计算比扩展精度计算更快。
我们有理由期望标量 SSE2 指令的潜伏期比对应的 IA32 FP 指令更短。

Fadd latency is 5
Fsub latency is 5
Fmul latency is 7
Fdiv latency is 38

Addsd latency is 4
Subsd latency is 4

Mulsd latency is not listed
Divsd latency is 35

Fadd 的潜伏期是 5
Fsub 的潜伏期是 5
Fmul 的潜伏期是 7
Fdiv 的潜伏期是 38

Addsd 的潜伏期是 4
Subsd 的潜伏期是 4

Mulsd 的潜伏期没有列出,Divsd 的潜伏期是 35。

The P4 optimizations reference manual has no latency and throughput information for the Mulsd instruction!
We see that latencies are one cycle less for scalar SSE2 in general, but 3 cycles less for division.
P4 优化参考手册中没有 Mulsd 指令的潜伏期和吞吐量信息!
可以看到,标量 SSE2 的潜伏期通常少 1 个周期,而除法少 3 个周期。

Throughput is 吞吐量是
Fadd throughput is 1
Fsub throughput is 1
Fmul throughput is 2
Fdiv throughput is 38

Addsd throughput is 2
Subsd throughput is 2
Mulsd throughput is not listed
Divsd throughput is 35

Fadd 的吞吐量是 1
Fsub 的吞吐量是 1
Fmul 的吞吐量是 2
Fdiv 的吞吐量是 38

Addsd 的吞吐量是 2
Subsd 的吞吐量是 2

Mulsd 的吞吐量没有列出,Divsd 的吞吐量是 35。

We see that throughput for addsd and subsd surprisingly are the double of fadd and fsub.
All that think SSE2 has dedicated hardware and that SIMD is calculation on two sets of data in parallel
raise your hands!
From the manual “Optimizations for Intel P4 and Intel Xeon” latency and throughput tables at page C-1 it is seen
that all SSE2 FP instructions are executed in the same pipelines as old time FP instructions.
我们惊奇地看到 addsd 和 subsd 的吞吐量竟是 fadd 和 fsub 的两倍。
所有认为 SSE2 有专用硬件、认为 SIMD 是在两组数据上并行计算的人,请举手!
从《Intel P4 和 Intel Xeon 优化》手册 C-1 页的潜伏期和吞吐量表中可以看到,
所有 SSE2 FP 指令都和老的 FP 指令在相同的管道线中执行。

This means that a SIMD addition, as an example, generates two microinstructions that execute in the F_ADD pipeline.
At clock cycle one the first one enters the pipeline, at the second cycle number 2 enters the pipeline.
Because latency is 4 cycles the first one leaves the pipeline at clock cycle 3 and
the second one leaves at cycle four.
This leads us to expect that a scalar SSE2 add should generate one microinstruction of the same type and
have a latency of 3 cycles and a throughput of 1.
这表示,例如一条 SIMD 加法会产生两条在 F_ADD 管道线中执行的微指令。
第一条在第一个时钟周期进入管道线,第二条在第二个周期进入管道线。
因为潜伏期是 4 个周期,第一条在时钟周期 3 离开管道线,第二条在周期 4 离开。
这使我们期望一条标量 SSE2 加法应该只产生一条同类型的微指令,潜伏期为 3 个周期,吞吐量为 1。

From the tables it is seen that the SIMD version of add, addpd, has the same latency and throughput
as the scalar version, addsd.
Either there is an error in the tables or the scalar instruction also generates two microinstructions of
which one is “blind”, that is have no effect.
从表中可以看到,SIMD 版本的加法 addpd 与标量版本 addsd 有相同的潜伏期和吞吐量。
要么是表中有错误,要么是标量指令也产生两条微指令,其中一条是“空转”的,即没有任何效果。

Come on Intel!
To verify the numbers from the table we create some dedicated code and time the instructions.
拜托,Intel!
为了检验表中的数值,我们编写一些专门的代码来测量这些指令的时间。

procedure TMainForm.BenchmarkADDSDLatency;
var
RunNo, ClockFrequency : Cardinal;
StartTime, EndTime, RunTime : TDateTime;
NoOfClocksPerRun, RunTimeSec : Double;
const
ONE : Double = 1;
NOOFINSTRUCTIONS : Cardinal = 895;

begin
ADDSDThroughputEdit.Text := 'Running';
ADDSDThroughputEdit.Color := clBlue;
Update;
StartTime := Time;
for RunNo := 1 to MAXNOOFRUNS do
  begin
   asm
    movsd xmm0, ONE
    movsd xmm1, xmm0
    movsd xmm2, xmm0
    movsd xmm3, xmm0
    movsd xmm4, xmm0
    movsd xmm5, xmm0
    movsd xmm6, xmm0
    movsd xmm7, xmm0

    addsd xmm0, xmm1
    addsd xmm0, xmm1
    addsd xmm0, xmm1
    addsd xmm0, xmm1
    addsd xmm0, xmm1
    addsd xmm0, xmm1
    addsd xmm0, xmm1

    //Repeat the addsd block of code such that there are 128 blocks
    // 重复 addsd 代码块,使总共有 128 个这样的块
   end;
  end;
EndTime := Time;
RunTime := EndTime - StartTime;
RunTimeSec := (24 * 60 * 60 * RunTime);
ClockFrequency := StrToInt(ClockFrequencyEdit.Text);
NoOfClocksPerRun := (RunTimeSec / MaxNoOfRuns) * ClockFrequency * 1000000 / NOOFINSTRUCTIONS;
ADDSDThroughputEdit.Text := FloatToStrF(NoOfClocksPerRun, ffFixed, 9, 1);
ADDSDThroughputEdit.Color := clLime;
Update;
end;
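The clocks-per-instruction arithmetic at the end of the procedure can be checked with a small C helper (`clocks_per_instruction` is a hypothetical name mirroring the Delphi expression above):

```c
#include <assert.h>

/* Convert total runtime into clock cycles per instruction:
   (seconds per run) * clock frequency in Hz / instructions per run. */
double clocks_per_instruction(double total_seconds, double runs,
                              double clock_hz, double instructions)
{
    return (total_seconds / runs) * clock_hz / instructions;
}
```

For example, 1 second for a million runs on a 1.92 GHz CPU with 895 instructions per run gives about 2.1 clocks per instruction.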
The addsd instructions all operate on the same two registers and therefore they cannot execute in parallel.
The second instruction has to wait for the first to finish and the full latency of the instruction is exposed.
这些 addsd 指令全部操作同样的两个寄存器,因此不能并行执行。
第二条指令必须等待第一条结束,指令的全部潜伏期都暴露出来。

For measuring throughput insert this block 128 times
为了测量吞吐量,把下面这个代码块插入 128 次
addsd xmm1, xmm0
addsd xmm2, xmm0
addsd xmm3, xmm0
addsd xmm4, xmm0
addsd xmm5, xmm0
addsd xmm6, xmm0
addsd xmm7, xmm0

Here there are no data dependencies between instructions and they can execute in parallel.
Xmm0 is used as source in every line but this does not create a data dependency.
Results from runs of the code show us that latency is 4 cycles and throughput is 2 cycles.
This is consistent with the table numbers.
这里指令之间没有数据依赖,因此它们可以并行执行。
每一行都把 xmm0 用作源操作数,但这不会产生数据依赖。
代码的运行结果告诉我们:潜伏期是 4 个周期,吞吐量是 2 个周期。
这与表中的数值一致。
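The measurement idea, that a dependent chain exposes latency while independent accumulators expose throughput, can be mimicked in portable C (illustrative names; a real measurement would wrap these loops in timing code like the Delphi procedure above):

```c
#include <assert.h>

/* Every addition needs the previous result: run time is dominated by
   the instruction's latency. */
double dependent_chain(double x, int n)
{
    double acc = 0.0;
    for (int i = 0; i < n; ++i)
        acc = acc + x;         /* serial dependency on acc */
    return acc;
}

/* Four independent accumulators can be in flight at once: run time is
   dominated by the instruction's throughput instead. */
double independent_adds(double x, int n)
{
    double a0 = 0.0, a1 = 0.0, a2 = 0.0, a3 = 0.0;
    for (int i = 0; i < n; ++i) {
        a0 += x; a1 += x; a2 += x; a3 += x;
    }
    return a0 + a1 + a2 + a3;
}
```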

Let us code the three functions in scalar SSE2 and perform some benchmarks.
The 8 SSE2 registers are called xmm0-xmm7 and Delphi has no register view for them.
So we must create our own, by creating a global (or local) variable for each register,
put a watch on them and add code in the function to copy register contents to variables.
我们用标量 SSE2 指令来编写这三个函数,并做一些基准测试。
SSE2 的 8 个寄存器叫做 xmm0-xmm7,Delphi 没有针对它们的寄存器查看窗口。
因此我们必须自己创建:为每个寄存器建立一个全局(或局部)变量,对它们设置监视,并在函数中加入把寄存器内容复制到变量的代码。

It is somewhat cumbersome to do all this and I am looking forward to Borland creating an xmm register view.
This code shows how I do it.
这样做稍微有点麻烦,我期待 Borland 提供一个 xmm 寄存器查看窗口。
下面的代码显示了我的做法。

var
XMM0reg, XMM1reg, XMM2reg, XMM3reg, XMM4reg : Double;

function ArcSinApprox3i(X, A, B, C, D : Double) : Double;
asm
//Result := ((A*X + B)*X + C)*X + D;

fld   qword ptr [ebp+$20]
movsd xmm0,qword ptr [ebp+$20]

movsd XMM0reg,xmm0
movsd XMM1reg,xmm1
movsd XMM2reg,xmm2
movsd XMM3reg,xmm3

fld       qword ptr [ebp+$28]
movsd xmm1,qword ptr [ebp+$28]

movsd XMM0reg,xmm0
movsd XMM1reg,xmm1
movsd XMM2reg,xmm2
movsd XMM3reg,xmm3

fxch    st(1)
fmul    st(0),st(1)
mulsd xmm0,xmm1

movsd XMM0reg,xmm0
movsd XMM1reg,xmm1
movsd XMM2reg,xmm2
movsd XMM3reg,xmm3

fadd   qword ptr [ebp+$18]
addsd xmm0,qword ptr [ebp+$18]

movsd XMM0reg,xmm0
movsd XMM1reg,xmm1
movsd XMM2reg,xmm2
movsd XMM3reg,xmm3

fmul    st(0),st(1)
mulsd xmm0,xmm1

movsd XMM0reg,xmm0
movsd XMM1reg,xmm1
movsd XMM2reg,xmm2
movsd XMM3reg,xmm3

fadd    qword ptr [ebp+$10]
addsd xmm0,qword ptr [ebp+$10]

movsd XMM0reg,xmm0
movsd XMM1reg,xmm1
movsd XMM2reg,xmm2
movsd XMM3reg,xmm3

fmulp st(1),st(0)
mulsd xmm0,xmm1

movsd XMM0reg,xmm0
movsd XMM1reg,xmm1
movsd XMM2reg,xmm2
movsd XMM3reg,xmm3

fadd    qword ptr [ebp+$08]
addsd xmm0,qword ptr [ebp+$08]

movsd XMM0reg,xmm0
movsd XMM1reg,xmm1
movsd XMM2reg,xmm2
movsd XMM3reg,xmm3

movsd [esp-8],xmm0
fld       qword ptr [esp-8]

movsd XMM0reg,xmm0
movsd XMM1reg,xmm1
movsd XMM2reg,xmm2
movsd XMM3reg,xmm3

wait
end;
The code is not using xmm4-xmm7 and there was no need to create a register view for them.
There is added xmm view code after each line of SSE2 code.
All lines but the last two are the FP code with the SSE2 code added such that every operation is done in FP
as well as in SSE2.
This way it is possible to trace through the code and control that the SSE2 version is doing the same as
the classic version.
代码没有使用 xmm4-xmm7,因此不需要为它们创建寄存器显示。
每一行 SSE2 代码之后都加入了 xmm 查看代码。
除最后两行外,其余各行都是 FP 代码加上对应的 SSE2 代码,使得每个操作既在 FP 中执行,也在 SSE2 中执行。
这样就可以单步跟踪代码,检验 SSE2 版本与经典版本做的是同样的事情。

Open the FPU view and see how the FP stack is updated and control that xmm registers are updated in the same way.
I developed the SSE2 code simply by adding an SSE2 instruction after each line of FP code.
打开 FPU 窗口,观察 FP 堆栈如何更新,并检验 xmm 寄存器是否以同样的方式更新。
我开发这段 SSE2 代码的方法很简单:在每一行 FP 代码之后加上一条 SSE2 指令。

fld       qword ptr [ebp+$20]
movsd xmm0,qword ptr [ebp+$20]

movsd copies one double precision variable from the memory location at [ebp+$20] into an xmm register.
“qword ptr” is not needed but I kept it to emphasise the pattern between SSE2 code and FP code.
movsd 从内存 [ebp+$20] 处复制一个双精度变量到一个 xmm 寄存器。
"qword ptr" 不是必需的,但我保留它,以强调 SSE2 代码与 FP 代码之间的对应关系。

A big difference between FP code and scalar SSE2 code is that the FP registers are organized as a stack and
SSE2 registers are not.
At first while coding the SSE2 code I just ignored this and then after having made all the SSE2 lines I went back
and traced through the lines one by one and corrected them to work on the correct variable/register.
FP 代码与标量 SSE2 代码的最大不同是:FP 寄存器组织成一个堆栈,而 SSE2 寄存器不是。
编写 SSE2 代码时我起初忽略了这一点,写完所有 SSE2 行之后再回头逐行单步跟踪,修正它们,使之工作在正确的变量/寄存器上。

Activate the function with some variable values that are easy to follow in the two views
(e.g. X=2, A=3, B=4, C=5, D=6), and see that first “2” is loaded, then “3”, then 2 is multiplied by “3”
and “2” is overwritten by “6” etc.
用一些在两个窗口中都容易跟踪的变量值调用函数(例如 X=2, A=3, B=4, C=5, D=6),可以看到先装入“2”,再装入“3”,然后 2 乘以“3”,接着“2”被“6”覆盖,等等。

The scalar SSE2 counterpart for fmul is mulsd. The sd suffix means Scalar – Double.
fmul 对应的标量 SSE2 指令是 mulsd。sd 后缀表示标量-双精度 (Scalar – Double)。

fxch  st(1)
fmul  st(0),st(1)
mulsd xmm0,xmm1

The SSE2 counterpart for fadd is addsd. fadd 对应的 SSE2 指令是 addsd。

fadd  qword ptr [ebp+$18]
addsd xmm0,qword ptr [ebp+$18]

Continue this way line by line. 继续这样一行一行地进行。

The FP code leaves the result in st(0), but the SSE2 code leaves the result in an xmm register.
Then the result has to be copied from xmm to st(0) via a memory location on the stack.
FP 代码把结果留在 st(0) 中,而 SSE2 代码把结果留在一个 xmm 寄存器中。
因此必须经由堆栈上的一个内存位置,把结果从 xmm 复制到 st(0)。

movsd [esp-8],xmm0
fld   qword ptr [esp-8]

These two lines do this.
这两行完成这个操作。

At esp-8, 8 bytes above the top of the stack, there is some space we can use as a temporary location for the result.
The first line copies xmm0 to temp and then the last line loads temp on the FP stack.
These two lines are overhead that will make small SSE2 functions less effective than their FP cousins.
After having double-checked the SSE2 code we can remove the instrumentation code as well as the old FP code,
leaving a nice scalar SSE2 function with only a little extra overhead.
在 esp-8 处,即栈顶之上 8 字节的位置,有一块可以用作结果临时存放点的空间。
第一行把 xmm0 复制到临时位置,最后一行再把临时值装入 FP 堆栈。
这两行是额外开销,会使小的 SSE2 函数不如其 FP 姊妹版高效。
仔细检查过 SSE2 代码之后,我们可以删除观察代码和旧的 FP 代码,
留下一个干净的标量 SSE2 函数,只带一点额外开销。

function ArcSinApprox3j(X, A, B, C, D : Double) : Double;
asm
//Result := ((A*X + B)*X + C)*X + D;
movsd xmm0,qword ptr [ebp+$20]
movsd xmm1,qword ptr [ebp+$28]
mulsd xmm0,xmm1
addsd xmm0,qword ptr [ebp+$18]
mulsd xmm0,xmm1
addsd xmm0,qword ptr [ebp+$10]
mulsd xmm0,xmm1
addsd xmm0,qword ptr [ebp+$08]
movsd [esp-8],xmm0
fld   qword ptr [esp-8]
end;

It reads even nicer if we remove the unneeded “qword ptr” text.
移除不需要的 “qword ptr” 之后,函数会更简洁。

function ArcSinApprox3j(X, A, B, C, D : Double) : Double;
asm
//Result := ((A*X + B)*X + C)*X + D;
movsd xmm0, [ebp+$20]
movsd xmm1, [ebp+$28]
mulsd xmm0,xmm1
addsd xmm0, [ebp+$18]
mulsd xmm0,xmm1
addsd xmm0, [ebp+$10]
mulsd xmm0,xmm1
addsd xmm0, [ebp+$08]
movsd [esp-8],xmm0
fld   qword ptr [esp-8]
end;

Replace the pointer expressions with the parameter names.
用参数名替换指针表达式
function ArcSinApprox3j(X, A, B, C, D : Double) : Double;
asm
//Result := ((A*X + B)*X + C)*X + D;
movsd xmm0, A
movsd xmm1, X
mulsd xmm0,xmm1
addsd xmm0, B
mulsd xmm0,xmm1
addsd xmm0, C
mulsd xmm0,xmm1
addsd xmm0, D
movsd [esp-8],xmm0
fld   qword ptr [esp-8]
end;

Well how does this version perform?
The benchmark is 45882.
This version is somewhat slower than the FP version, which scored 48292.
这个版本的表现如何?
基准测试分数是 45882。
这个版本比分数为 48292 的 FP 版本稍慢。

We need to investigate what the reason for this is.
Is it the overhead of the last two lines or is it due to the 2-cycle throughput of addsd and mulsd?
The overhead can be removed by transferring the result as an out parameter or we can inline the function.
It would be interesting for us to see how big an advantage it is to inline this relatively small function.
After all there is a good deal of overhead in copying 5 double precision parameters each 8 byte big.
Let us see how much code is actually needed for this.
我们需要研究其中的原因。
是最后两行的开销造成的,还是 addsd 和 mulsd 的 2 周期吞吐量造成的?
把结果作为输出参数传出,或者内联这个函数,都可以消除这个开销。
看看内联这个相对较小的函数能带来多大的好处,会很有意思。
毕竟,复制 5 个各 8 字节的双精度参数有相当大的开销。
我们看看这实际上需要多少代码:

push dword ptr [ebp+$14]
push dword ptr [ebp+$10]
push dword ptr [ebp+$34]
push dword ptr [ebp+$30]
push dword ptr [ebp+$2c]
push dword ptr [ebp+$28]
push dword ptr [ebp+$24]
push dword ptr [ebp+$20]
push dword ptr [ebp+$1c]
push dword ptr [ebp+$18]
call dword ptr [ArcSinApproxFunction]
fstp qword ptr [ebp+$08]

No less than 10 push instructions, each pushing a 4 byte half of a parameter onto the stack.
Imagine that the register calling convention took its name seriously and transferred the parameters on the FP stack
instead.
Then we would have 5 fld, which would also remove the need to load parameters from the stack in the function.
足足 10 条 push 指令,每条把一个参数的 4 字节的一半压入堆栈。
设想 register 调用约定名副其实,改为在 FP 堆栈上传递参数。
那样只需要 5 条 fld,同时也免去了在函数中从堆栈装载参数的需要。

That is – 5 fld in the function would be replaced by 5 fld at the call place of the function
and 10 push instructions would go away.
This would lead to a dramatic increase in performance.
也就是说,函数内的 5 条 fld 将被调用处的 5 条 fld 取代,10 条 push 指令则完全消失。
这将带来性能上戏剧性的提升。

Inlining the function would also remove the overhead of the call/ret pair,
which is a lot less than the overhead of the mega push, and this would give us
a clue about the performance of the updated register2 calling convention ;-).
内联函数还会去掉 call/ret 这一对的开销(它远小于大量 push 的开销),
同时也让我们对改进后的 register2 调用约定的性能有个概念 ;-)

Inlined ArcSinApprox3i 156006
Inlined ArcSinApprox3j 160000
The improvement is an astonishing 400 %.
内联的 ArcSinApprox3i 156006
内联的 ArcSinApprox3j 160000
令人惊奇地提升了 400%。

I truly wish Borland introduced a true register calling convention for floating point parameters in the near future.
The SSE2 version is only 3 % faster than the IA32 version. This could be more on a decent SSE2 implementation.
我真心希望 Borland 在不久的将来为浮点参数引入真正的寄存器调用约定。
SSE2 版本只比 IA32 版本快 3%。在一个更出色的 SSE2 实现上,这个差距可能会更大。

Lesson 7 has hereby come to an end.
You now know nearly all about floating point programming ;-)
第 7 课到此结束。
现在你几乎了解了关于浮点编程的一切 ;-)
