Introduction to BASM for Beginners
BASM 入门教程简介
The “BASM for Beginners” series currently consists of seven articles; numbers 8 and 9 are in progress.
Each article, published or forthcoming, explains some BASM issues by means of an example function.
Most often the function is first implemented in Pascal; the compiler-generated assembler code is then copied from the CPU view in Delphi, analyzed, and optimized.
Sometimes optimization involves the use of MMX, SSE or SSE2 instructions.
“BASM入门教程”系列文章共7课。第8课和第9课正在完成中。
这些文章通过函数实例来介绍BASM。通常,这些函数使用 pascal 编写,然后从Delphi 编译器的 Cpu 窗口中复制汇编代码,再进行分析和优化。同时,也介绍了 MMX, SSE 或 SSE2 指令的优化。
By starting from the code the compiler generates for a Pascal function, the beginner first meets the most commonly used instructions of the big 32-bit Intel Architecture instruction set. Seeing which code the compiler generates gives valuable insight into the efficiency of compiler-generated code in general and of the Delphi compiler in particular.
通过借用编译器编译 Pascal 函数产生的汇编代码,为初学者介绍了大32位 Intel 体系结构指令集。
研究编译器产生的代码将使我们更清楚编译器产生的代码的效率,同时了解 Delphi 编译器的特点。
As specific assembly code optimizations are introduced, they will be generalized where suitable. These general optimizations are suitable for implementation in compilers, and most compilers, including Delphi, have them. At some point in the future a tool that automatically optimizes assembler code will be developed.
本教程将介绍一些通用的汇编代码优化技术。这些通用的优化技术已经被大部分编译器使用,包括Delphi。在某些方面,一些汇编代码自动优化工具也将被开发。
Knowledge about the target processor is often needed when optimizing code, so many CPU details, such as pipelines, are explained in the series too.
As far as I know, very little literature explains all these issues at a level beginners can follow. I hope this series will help fill this void.
优化代码经常会用到相关的处理器特性等资料,比如管道线技术,也将在此系列教程中介绍。
据我所知,只有很少的文献资料是针对初学者来解释这些问题的,我希望这些文章能够填补这方面的空白。
Best regards
Dennis Kjaer Christensen.
Lesson 1
第 1 课
The first little example gets us started. It is a simple Pascal function which multiplies an integer by the constant 2.
用一个小例子开始我们的BASM之旅。
这是一个用常量 2 来乘以一个整数的小函数。
function MulInt2(I : Integer) : Integer;
begin
Result := I * 2;
end;
Let's steal the BASM from the CPU view. I compiled with optimizations turned on.
从 CPU VIEW 中获取汇编代码。编译时打开优化选项。
function MulInt2_BASM(I : Integer) : Integer;
begin
Result := I * 2;
{
add eax,eax
ret
}
end;
From this we see that I is passed to the function in eax and that the result is returned to the caller in eax too. This is how the register calling convention, the default in Delphi, works. The actual code is very simple: the multiplication by 2 is obtained by adding I to itself, I+I = 2I. The ret instruction returns execution to the instruction after the call that invoked the function.
从上面可以看出参数值从 eax 传入,并从 eax 返回。这是Delphi 默认的调用约定。
实际代码非常简单,就是某数乘以 2 等于其自身相加,I + I = 2I。
Ret 指令返回到调用此函数处的下一条指令位置。
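To make the register calling convention a little more concrete, here is a minimal sketch with two parameters. AddInts is a hypothetical function that does not appear in the original article, and the sketch assumes the 32-bit Delphi compiler, where the first parameter arrives in eax and the second in edx.

```pascal
// Hypothetical example, not from the original article.
// Assumes the 32-bit Delphi compiler and the default register calling convention.
function AddInts(A, B : Integer) : Integer;
asm
  // A arrives in eax, B arrives in edx
  add eax,edx      // eax := A + B, and eax is also where the result is returned
end;
```

A call such as AddInts(2, 3) should return 5 in eax, with no stack access needed for the parameters.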
Let's code the function as a pure asm function.
其纯汇编函数如下:
function MulInt2_BASM2(I : Integer) : Integer;
asm
//Result := I * 2;
add eax,eax
//ret
end;
Observe that the ret instruction is supplied by the inline assembler.
显然,Ret 由内嵌汇编器本身提供。
Let us take a look at the calling code.
让我们看看调用代码。
This is the Pascal code
Pascal 代码如下:
procedure TForm1.Button1Click(Sender: TObject);
var
I, J : Integer;
begin
I := StrToInt(IEdit.Text);
J := MulInt2_BASM2(I);
JEdit.Text := IntToStr(J);
end;
The important line is 重要的行是:
J := MulInt2_BASM2(I);
From the cpu view 在 CpuView 中可以看到
call StrToInt
call MulInt2_BASM2
mov esi,eax
The call to StrToInt belongs to the line before the one that calls our function; after it, I is in eax (StrToInt also follows the register calling convention). MulInt2_BASM2 is then called and returns its result in eax, which is copied to esi in the next line.
在调用函数 MulInt2_BASM2 之前首先调用了 StrToInt,StrToInt 返回结果 I 存放在 eax 中(StrToInt 同样也遵循寄存存器调用约定)。MulInt2_BASM2 调用后,结果存入 eax,并且被下一行代码复制到 esi 中。
Optimization issues: Multiplication by 2 can be done in two more ways: use the mul instruction, or shift left by one. mul is described in the Intel IA32 SW developers manual 2, page 536. It multiplies the value in eax by another register and returns the result in the register pair edx:eax. A register pair is needed because multiplying two 32-bit numbers gives a result of up to 64 bits, just like 9*9=81: two one-digit numbers can give a two-digit result.
优化方式:被2乘有多种方法。使用 mul 乘法指令或者左移一位都可以完成此功能。
在 IA32 软件开发手册 2 的第 536 页介绍了 mul。
被乘数存放在 Eax,乘数存放在其他的寄存器,结果存入寄存器对 edx: eax 中。
使用寄存器对的原因是两个32 位数相乘结果是 64 位,就像 9*9=81,两个一位数相乘得到了一个两位数。
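The edx:eax register pair can be seen at work in a small sketch that keeps the full 64-bit product. MulWide is a hypothetical helper, not part of the original article; it assumes the 32-bit Delphi compiler, where an Int64 function result is returned in edx:eax.

```pascal
// Hypothetical helper, not from the original article.
// Assumes the 32-bit Delphi compiler and the register calling convention:
// A arrives in eax, B in edx, and an Int64 result is returned in edx:eax.
function MulWide(A, B : Cardinal) : Int64;
asm
  mul edx          // edx:eax := eax * edx (unsigned 32 x 32 -> 64 bit)
end;
```

For example, MulWide($FFFFFFFF, 2) leaves 1 in edx and $FFFFFFFE in eax, which together form $1FFFFFFFE = 8589934590, a value that does not fit in 32 bits.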
This raises the issue of which registers must be preserved by a function and which can be used freely. This is explained in the Delphi help.
在这种情况下,必须保存寄存器的原始值后才能使用。在Delphi Help 中有解释:
"An asm statement must preserve the EDI, ESI, ESP, EBP, and EBX registers, but can freely modify the EAX, ECX, and EDX registers."
“在汇编语法中,必须保存 EDI, ESI, ESP, EBP, EBX 寄存器 ,但是 EAX, ECX, EDX 可以随意使用。“
We can conclude that it is no problem that edx is modified by the mul instruction and our function can also be implemented like this.
因此,我们可以确定乘法指令 mul 修改 edx 不会有任何问题。我们的函数也可以修改为如下:
function MulInt2_BASM3(I : Integer) : Integer;
asm
//Result := I * 2;
mov ecx, 2
mul ecx
end;
ecx is also used, but this is ok too. As long as the result is within the range of Integer, it is returned correctly in eax. If I is bigger than half the range of Integer, overflow will occur and the result is incorrect.
Ecx 也可以被任意使用。只要结果值在 Integer 范围内,eax 存放的就是正确的结果值。如果 I 大于整数的一半,那么相乘后结果会溢出,eax 值也就不准确了。
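The overflow case can be tried out with a small console program. This is only a sketch: the program name MulOverflowDemo is made up for illustration, the function is a copy of MulInt2_BASM3, and the 32-bit Delphi compiler is assumed.

```pascal
program MulOverflowDemo;
{$APPTYPE CONSOLE}

// Copy of MulInt2_BASM3; assumes the 32-bit Delphi compiler.
function MulInt2_BASM3(I : Integer) : Integer;
asm
  mov ecx, 2
  mul ecx          // edx:eax := eax * 2, only eax is returned
end;

begin
  // MaxInt div 2 = 1073741823: the result 2147483646 still fits in an Integer
  Writeln(MulInt2_BASM3(1073741823));
  // One more: the low 32 bits are $80000000, which reads as -2147483648
  Writeln(MulInt2_BASM3(1073741824));
end.
```

The second call shows the incorrect result: the true product 2147483648 needs a 33rd bit, so what comes back in eax is a negative number.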
Implementation with shift 使用移位指令
function MulInt2_BASM4(I : Integer) : Integer;
asm
//Result := I * 2;
shl eax,1
end;
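The shift version generalizes to other powers of two, since shifting left by n bits multiplies by 2 to the power n. A minimal sketch follows; MulInt8 is a hypothetical function, not from the original article, and the 32-bit Delphi compiler is assumed.

```pascal
// Hypothetical example, not from the original article.
// Assumes the 32-bit Delphi compiler; I arrives in eax.
function MulInt8(I : Integer) : Integer;
asm
  //Result := I * 8;
  shl eax,3        // shifting left by 3 bits multiplies by 2*2*2 = 8
end;
```

MulInt8(5) should return 40. As with the other versions, the result is only correct while it stays within the range of Integer.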
Timing can reveal which implementation is fastest. We can also consult the Intel or AMD documents with latency and throughput tables. add and mov have 0.5 cycles latency and throughput, mul has 14-18 cycles latency and 5 cycles throughput, and shl has 4 cycles latency and 1 cycle throughput. The version chosen by Delphi is the most efficient on P4, and this will probably also be the case on Athlon and P3.
这个执行在时间上是高效的。我们也可以参考关于介绍潜伏期和吞吐量的 Intel 或 AMD 文档。
Add 或 Mov 指令是 0.5 个时钟周期和吞吐量,
Mul是 14~18 个周期和 5 个吞吐量,
Shl 是 4 个时钟周期和 1 个吞吐量。
这个版本的 Delphi 代码在 P4 上是最有效的,在 Athlon 和 P3 也可能是。
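As a rough illustration of the timing approach, the sketch below calls two of the versions in a loop and reports the elapsed time with GetTickCount from the Windows unit. The program name, the RunTimings procedure and the loop count are assumptions made for this example, and the 32-bit Delphi compiler is assumed. The loop and the call + ret overhead are included in the measurement, so the numbers are only a crude comparison and not cycle counts.

```pascal
program TimeMulInt2;
{$APPTYPE CONSOLE}

uses
  Windows;         // GetTickCount

function MulInt2_BASM2(I : Integer) : Integer;
asm
  add eax,eax
end;

function MulInt2_BASM4(I : Integer) : Integer;
asm
  shl eax,1
end;

// Hypothetical timing harness, not from the original article.
procedure RunTimings;
const
  Count = 100000000;   // arbitrary number of calls
var
  N, Sum : Integer;
  Start : Cardinal;
begin
  Sum := 0;

  Start := GetTickCount;
  for N := 1 to Count do
    Sum := Sum xor MulInt2_BASM2(N);
  Writeln('add: ', GetTickCount - Start, ' ms');

  Start := GetTickCount;
  for N := 1 to Count do
    Sum := Sum xor MulInt2_BASM4(N);
  Writeln('shl: ', GetTickCount - Start, ' ms');

  Writeln(Sum);        // print Sum so the results of the calls are used
end;

begin
  RunTimings;
end.
```

Running it several times and comparing the smallest times gives a more stable picture than a single run.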
Issues not covered: mul versus imul and range checking, other calling conventions, benchmarking, clock count on other processors, clock count for call + ret, location of the return address for ret, etc.
这个版本没有考虑以下情况:mul 与 imul 的不同,范围检查,其他调用约定,基准,时钟周期,调用/返回指令的时钟周期,返回本地地址 ret 等等。