

2007-4-23 22:36 skyjacker
<<BASM for Beginners>> Lesson 3

http://www.cnpack.org
QQ Group: 130970
Translated by: SkyJacker
Version: draft
Status: not proofread
Date: 2007

In this third lesson, topics such as MMX and SSE2 are introduced together with Int64 arithmetic.
This is the first time we will see processor-dependent optimisations.

The example looks like this:
function AddInt64_1(A, B : Int64) : Int64;
begin
Result := A + B;
end;

Let us jump straight into the asm code.
function AddInt64_2(A, B : Int64) : Int64;
begin
{
push ebp
mov ebp,esp
add esp,-$08 // reserve temporary space
}
Result := A + B;
{
mov eax,[ebp+$10]
mov edx,[ebp+$14]
add eax,[ebp+$08] // add the low 32 bits
adc edx,[ebp+$0c] // add the high 32 bits with carry
mov [ebp-$08],eax // low 32 bits
mov [ebp-$04],edx // high 32 bits
mov eax,[ebp-$08]
mov edx,[ebp-$04]
}
{
pop ecx
pop ecx
pop ebp
//ret
}
end;
The first three lines of code are recognized as setting up a stack frame, as in the previous lessons.
This time we know that the compiler might add the first two for us.
The last three lines are also a well-known pattern. Again the compiler might add pop ebp for us.
This brings us into the meat, which is these 8 lines for Result := A + B;
{
mov eax,[ebp+$10]
mov edx,[ebp+$14]
add eax,[ebp+$08]
adc edx,[ebp+$0c]
mov [ebp-$08],eax
mov [ebp-$04],edx
mov eax,[ebp-$08]
mov edx,[ebp-$04]
}

They can be analysed in pairs because they work in tandem, doing 64-bit math by splitting
the problem up into 32-bit pieces. The first two lines load A into the register pair eax:edx
(eax holds the low half, edx the high half).
They load a contiguous 64-bit block of data from the previous stack frame,
showing us that A was transferred on the stack.

The two load addresses are separated by 4 bytes.
One of them points to the beginning of A and the other one points into the middle of A.
Then come two add instructions. The first is a normal add and the second one is add with carry (adc).
The addresses in these two lines point to B in the same fashion as the two previous ones pointed at A.
The first add adds the lower 32 bits of B to the lower 32 bits of A.
This might lead to a carry if the sum is too big to fit into 32 bits.
This carry is included in the addition of the higher 32 bits.
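
A minimal sketch of what the add/adc pair does, written in plain Pascal. The name AddInt64_Halves and its local variables are made up for illustration, and the sketch assumes overflow and range checking are switched off so the 32-bit sums wrap:

```pascal
{$Q-}{$R-} // assumption: checks off, so 32-bit sums wrap like the CPU's add
function AddInt64_Halves(A, B : Int64) : Int64;
var
  ALo, AHi, BLo, BHi, Lo, Hi, Carry : Cardinal;
begin
  // Split both operands into low and high 32-bit halves
  ALo := Cardinal(A and $FFFFFFFF);
  AHi := Cardinal((A shr 32) and $FFFFFFFF);
  BLo := Cardinal(B and $FFFFFFFF);
  BHi := Cardinal((B shr 32) and $FFFFFFFF);

  Lo := ALo + BLo;              // plays the role of: add eax,[ebp+$08]
  if Lo < ALo then              // the wrapped sum is smaller than an operand
    Carry := 1                  // exactly when a carry left bit 31
  else
    Carry := 0;
  Hi := AHi + BHi + Carry;      // plays the role of: adc edx,[ebp+$0c]

  // Glue the halves back together
  Result := (Int64(Hi) shl 32) or Int64(Lo);
end;
```

For instance, AddInt64_Halves($FFFFFFFF, 1) carries out of the low half and returns $100000000.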

To make things totally clear let's do a simple example on decimal numbers.
We have the addition 1+2 = 3.
Our imaginary data types for this in our brain CPU are two digits wide.
This means that the addition actually looks like this: 01+02=03.
There is no carry from the addition of the lower digits into the higher ones, which are zero.

Decimal example two. 13+38=? First we add 3+8=11.
This results in a carry and a 1 in the lower half of the result.
Then we add Carry+1+3=1+1+3=5.
The result is 51.

In the third example we provoke an overflow. 50+51=101.
101 is too big to fit in two digits and our brain CPU cannot perform the calculation.
There was a carry on the addition of the two higher digits. Back to code.
Two things can happen now.
If we have compiled without range/overflow checking the result wraps around. With checking enabled an exception will be thrown.
We see that there is no checking code in our listing, so wraparound will occur.
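
The three decimal examples can be replayed with a small sketch of the two-digit "brain CPU". AddTwoDigits and TwoDigitOverflow are hypothetical helper names, and the inputs are assumed to lie in 0..99:

```pascal
// Adds two numbers in 0..99 digit by digit, keeping only two digits,
// the same way add/adc keep only 64 bits.
function AddTwoDigits(A, B : Integer) : Integer;
var
  LowSum, Carry, HighSum : Integer;
begin
  LowSum := (A mod 10) + (B mod 10);           // low digits, like add
  Carry := LowSum div 10;                      // 1 if the low digits overflowed
  HighSum := (A div 10) + (B div 10) + Carry;  // high digits plus carry, like adc
  Result := (HighSum mod 10) * 10 + (LowSum mod 10); // the hundreds digit is lost
end;

// True when the real sum needs a third digit, i.e. the high digits carried out.
function TwoDigitOverflow(A, B : Integer) : Boolean;
begin
  Result := A + B > 99;
end;
```

AddTwoDigits(13, 38) gives 51, while AddTwoDigits(50, 51) wraps around to 1 and TwoDigitOverflow(50, 51) is True.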

The next two lines save the result into the current stack frame.
The last two lines load the result from the stack frame back into eax and edx, where it already was.
These 4 lines are redundant. They can be removed, and this also removes the need to reserve space in the stack frame.
It is so easy to be an optimizer ;-)

function AddInt64_6(A, B : Int64) : Int64;
asm
mov eax,[ebp+$10]
mov edx,[ebp+$14]
add eax,[ebp+$08]
adc edx,[ebp+$0c]
end;

This is a nice small function.
The compiler-generated code consisted of 16 lines and we came down to 4 with only a little effort.
Today Delphi was really sleepy.
Now we think like this: if we had 64-bit registers the addition could be done with two lines of code.
But the MMX registers are 64 bits wide and this might be worth taking advantage of.
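
Before trimming further it can be reassuring to compare a hand-written routine with the compiler's version. A rough sketch of such a check follows. TAddInt64, AddInt64_Ref and MatchesReference are hypothetical names that do not appear in the lesson, and overflow checking is assumed to be off so that the reference wraps around like the asm does:

```pascal
{$Q-} // assumption: overflow checking off, so A + B wraps instead of raising
type
  TAddInt64 = function(A, B : Int64) : Int64;

// Plain Pascal reference, same body as AddInt64_1
function AddInt64_Ref(A, B : Int64) : Int64;
begin
  Result := A + B;
end;

// Returns True if Candidate agrees with the reference on a few edge cases
function MatchesReference(Candidate : TAddInt64) : Boolean;
const
  Samples : array[0..4] of Int64 = (0, 1, -1, $FFFFFFFF, High(Int64));
var
  I, J : Integer;
begin
  Result := True;
  for I := Low(Samples) to High(Samples) do
    for J := Low(Samples) to High(Samples) do
      if Candidate(Samples[I], Samples[J]) <>
         AddInt64_Ref(Samples[I], Samples[J]) then
      begin
        Result := False;  // first mismatch is enough
        Exit;
      end;
end;
```

In a 32-bit Delphi project, MatchesReference(AddInt64_6) would be expected to return True. The samples include $FFFFFFFF so that the carry from the low half into the high half is exercised.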

In the Intel SW Developers Manual instructions are not marked as belonging to IA32, MMX, SSE or SSE2.
This information would be nice to have, but we have to look elsewhere for it.
I normally use three small programs from Intel, the so-called computer based tutorials on MMX, SSE & SSE2.
I do not know where to find them on the Intel website now, but mail me if you want them.
They are simple and nice - very illustrative.
In these I find that a mov for 64 bits from memory into an MMX register is movq.
Q stands for quad word. The MMX registers are named mm0, mm1 ... mm7.
They are not arranged as a stack, as the FP registers are, and we can pick whichever one we like.

Let's pick mm0. The first instruction looks like this
movq    mm0, [ebp+$10]
There are two ways to go now. We can load B into a register too.
This makes it easy to see what is going on by using the FPU window.
The MMX registers are aliased onto the FP registers and the FPU view can show both sets.
Switch between FP and MMX view by selecting "Display as words/Display as extendeds" in the shortcut menu.
The second way to go is to use the pattern from the IA32 implementation and perform the addition with
the memory location of B as source.

The two solutions are expected to perform identically because the CPU needs to load B into a register
before doing the addition, and whether that is done explicitly with mov or implicitly by the add instruction,
the number of micro instructions will be the same. We use the more illustrative first way.

The next line is then a movq again
movq    mm1, [ebp+$08]

Then we have to go look for an add instruction, which would be something like this: paddq.
P for packed (the prefix of the MMX integer instructions), add for addition and q for quad word.
Now we get disappointed because there is no such MMX instruction. What about SSE?
One more disappointment.

Finally SSE2 has it and we are happy, or are we? If we use it the code will be targeting P4 and will not run on P3 or Athlon.
Like the P4 lovers we are, we proceed anyway.
paddq   mm0, mm1
This line is very intuitive. It adds mm1 to mm0.
The only thing left is to copy the result from mm0 into eax:edx.
To do this we need a double word mov instruction that can take 32 bits from an MMX register as source and
an IA32 register as destination.
movd    eax, mm0
This MMX instruction does the job. It copies the lower 32 bits of mm0 to eax.

Then we need to copy the upper 32 bits of the result to edx. I could not find an instruction for that, so instead
I shift the upper 32 bits down into the lower 32 bits using a 64-bit MMX right shift instruction.

psrlq   mm0, 32
Then we copy
movd    edx, mm0
Then we are done? Unfortunately not: we have to issue the emms instruction because we have used MMX instructions.

It cleans up the FP stack and leaves it in a well-defined empty state. Emms burns 23 cycles on a P4.
Together with the shift, which is also inefficient on P4 (2 cycles throughput and latency), our solution is not
especially fast, and it will only run on P4 and this AMD thing nobody has yet :-(
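
Put together, the instructions discussed above would form a routine along these lines. This is only a sketch: the name AddInt64_MMX is made up here, and it assumes a 32-bit x86 Delphi whose built-in assembler accepts MMX/SSE2 mnemonics, a CPU with SSE2, and the same stack offsets as AddInt64_6.

```pascal
function AddInt64_MMX(A, B : Int64) : Int64;
asm
  movq    mm0, [ebp+$10]   // mm0 := A
  movq    mm1, [ebp+$08]   // mm1 := B
  paddq   mm0, mm1         // mm0 := A + B (SSE2 instruction on MMX registers)
  movd    eax, mm0         // eax := lower 32 bits of the sum
  psrlq   mm0, 32          // shift the upper 32 bits down
  movd    edx, mm0         // edx := upper 32 bits of the sum
  emms                     // leave the FP stack in an empty state
end;
```

The second way mentioned earlier would drop the movq into mm1 and use the memory location of B as the source, paddq mm0, [ebp+$08], under the same assumptions.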

This ends the third lesson. We left the ball hanging in the air. Can we come up with a more efficient solution?
Moving data between MMX registers and IA32 registers is expensive. The calling convention is no good, because the data
was transferred on the stack and not in registers. eax->mm0 is 2 cycles.
The other way (mm0->eax) is 5 cycles. emms is 23 cycles. The addition is only 2 cycles. Overhead is plenty.

2007-4-23 22:46 skyjacker
   
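Translator's notes (on English idioms):

1. This brings us into the meat, which are these 8 lines Result := A + B;
Translation attempts:
A. Literally, "what brings us into the meat is these 8 lines";
B. "What puts us in touch with the meat is..."
C. "Let us get into the substance..."
What does "the meat" mean? The substance, the good part.

The line was finally rendered as:
"Let us analyse the 8 lines of assembly generated by Result := A + B:"

2. We left the ball hanging in the air.
A. Literally, "we left the ball suspended in the air" or "we threw the ball into the air".
B. "We leave a cliffhanger."

This sentence is fun; I just could not find a fitting Chinese equivalent.

The line was finally rendered as:
"We left the ball in the air."

3. Today Delphi was really sleepy.
A. "Today Delphi is really drowsy."
B. "Now Delphi is really tired."

The line was finally rendered as:
"Delphi really is a bit lazy."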
