CnPack Forum


 
Subject: <<BASM for Beginners>> Lesson 1 A
Posted by skyjacker (moderator) at 2007-4-21 00:31

http://www.cnpack.org
QQ Group: 130970
Translation: SkyJacker
Version: draft
Proofreading: 天天吃好
Date: 2007

Introduction to BASM for Beginners
The “BASM for Beginners” series currently consists of 7 articles; numbers 8 and 9 are in progress.
What the articles have in common is that each explains some BASM issues by means of an example function.
Most often the function is first implemented in Pascal; the compiler-generated assembler code is then copied from the CPU view in Delphi, analyzed, and optimized.
Sometimes the optimization involves MMX, SSE, or SSE2 instructions.

By starting from the code the compiler generates for a Pascal function, the beginner first meets the most commonly used instructions of the large 32-bit Intel Architecture instruction set. Seeing which code the compiler generates also gives valuable insight into the efficiency of compiler-generated code in general and of the Delphi compiler in particular.

As specific assembly code optimizations are introduced, they are generalized where suitable. These general optimizations are suitable for implementation in compilers, and most compilers, including Delphi, have them. At some point in the future, a tool that automatically optimizes assembler code will be developed.

Knowledge about the target processor is often needed when optimizing code, so many CPU details, such as pipelines, are explained in the series too.
As far as I know, there is little literature available that explains all these issues at a level beginners can follow. I hope this series will help fill this void.

Best regards
Dennis Kjaer Christensen.
Lesson 1
The first little example gets us started. It is a simple Pascal function that multiplies an integer by the constant 2.
function MulInt2(I : Integer) : Integer;
begin
  Result := I * 2;
end;
Let's steal the BASM from the CPU view. I compiled with optimizations turned on.
function MulInt2_BASM(I : Integer) : Integer;
begin
Result := I * 2;
{
add eax,eax
ret
}
end;
From this we see that I is passed to the function in eax and that the result is passed back to the caller in eax too. This is how the register calling convention, the default in Delphi, works. The actual code is very simple: the multiplication by 2 is obtained by adding I to itself, I + I = 2I. The ret instruction returns execution to the instruction after the one that called the function.
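To make the register convention a little more concrete, here is a minimal sketch with two parameters. It assumes 32-bit Delphi, where the register convention passes the first suitable parameters in EAX, EDX and ECX, in that order. The name AddInts_BASM is hypothetical and not part of the original lesson.

```pascal
program RegisterConventionDemo;
{$APPTYPE CONSOLE}

// Hypothetical helper for illustration, 32-bit Delphi only.
// Register convention: A arrives in EAX, B arrives in EDX,
// and the Integer result is returned in EAX.
function AddInts_BASM(A, B : Integer) : Integer;
asm
  add eax, edx   // EAX := A + B
end;

begin
  WriteLn(AddInts_BASM(20, 22));   // 20 + 22, should print 42
end.
```

As in the lesson's own example, no explicit ret is written; the inline assembler supplies it.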

Let's code the function as a pure asm function.
function MulInt2_BASM2(I : Integer) : Integer;
asm
//Result := I * 2;
add eax,eax
//ret
end;
Observe that the ret instruction is supplied by the inline assembler.
Let us take a look at the calling code.
This is the Pascal code:
procedure TForm1.Button1Click(Sender: TObject);
var
  I, J : Integer;
begin
  I := StrToInt(IEdit.Text);
  J := MulInt2_BASM2(I);
  JEdit.Text := IntToStr(J);
end;
The important line is:
J := MulInt2_BASM2(I);
From the CPU view:
call StrToInt
call MulInt2_BASM2
mov esi,eax

The line before the one that calls our function calls StrToInt, after which I is in eax (StrToInt also follows the register calling convention). MulInt2_BASM2 is then called and returns its result in eax, which the next line copies to esi.

Optimization issues: Multiplication by 2 can be done in two more ways: with the mul instruction or by shifting left by one. mul is described on page 536 of the Intel IA32 SW developer's manual, volume 2. It multiplies the value in eax by another register and returns the result in the register pair edx:eax. A register pair is needed because multiplying two 32-bit numbers can give a 64-bit result, just as 9*9=81 shows that two one-digit numbers can give a two-digit result.
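The need for a register pair can be checked in plain Pascal. The sketch below is only an illustration and uses no assembler: it computes a product in an Int64 and prints the two 32-bit halves, which correspond to what mul would leave in edx and eax. The program name and the chosen operands are assumptions made for the example.

```pascal
program WideProductDemo;
{$APPTYPE CONSOLE}

uses
  SysUtils;

var
  A, B : Cardinal;
  P : Int64;
begin
  A := $FFFFFFFF;              // largest unsigned 32-bit value
  B := 2;
  P := Int64(A) * Int64(B);    // full product $1FFFFFFFE needs 33 bits
  // The upper half corresponds to edx, the lower half to eax.
  WriteLn('high (edx): ', IntToHex(P shr 32, 8));          // 00000001
  WriteLn('low  (eax): ', IntToHex(P and $FFFFFFFF, 8));   // FFFFFFFE
end.
```

Here the product no longer fits in 32 bits, so the upper half is nonzero.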

This raises the issue of which registers must be preserved by a function and which can be used freely. This is explained in the Delphi help:
"An asm statement must preserve the EDI, ESI, ESP, EBP, and EBX registers, but can freely modify the EAX, ECX, and EDX registers."
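As an illustration of the rule quoted from the Delphi help, here is a minimal sketch of a function that does use one of the registers that must be preserved. It assumes 32-bit Delphi; the name MulInt2_UsingEBX is hypothetical, and using EBX here is purely for demonstration since the task does not need it.

```pascal
program PreserveRegisterDemo;
{$APPTYPE CONSOLE}

// Hypothetical variant of MulInt2 for illustration, 32-bit Delphi only.
function MulInt2_UsingEBX(I : Integer) : Integer;
asm
  push ebx        // save EBX: the caller expects it unchanged
  mov  ebx, eax   // EBX := I
  add  eax, ebx   // EAX := I + I
  pop  ebx        // restore EBX before returning
end;

begin
  WriteLn(MulInt2_UsingEBX(21));   // 21 + 21, should print 42
end.
```

The push/pop pair is the price of touching EBX; EAX, ECX and EDX need no such bracketing.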
We can conclude that it is no problem that the mul instruction modifies edx, so our function can also be implemented like this:
function MulInt2_BASM3(I : Integer) : Integer;
asm
//Result := I * 2;
mov ecx, 2
mul ecx
end;
ecx is also used, but this is fine as well. As long as the result is within the range of Integer, it is returned correctly in eax. If I is larger than half the range of Integer, overflow occurs and the result is incorrect.
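The overflow limit can be seen with the plain Pascal version of the function. This is a sketch that assumes overflow and range checking are switched off, so that the wrapped value is returned instead of a runtime error being raised.

```pascal
program OverflowDemo;
{$APPTYPE CONSOLE}
{$Q-,R-}   // overflow and range checks off, so the result wraps silently

function MulInt2(I : Integer) : Integer;
begin
  Result := I * 2;
end;

var
  I : Integer;
begin
  I := MaxInt div 2;        // 1073741823: the largest I whose double still fits
  WriteLn(MulInt2(I));      // 2147483646
  I := MaxInt div 2 + 1;    // 1073741824: one past the limit
  WriteLn(MulInt2(I));      // wraps around to -2147483648
end.
```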
Implementation with shift:
function MulInt2_BASM4(I : Integer) : Integer;
asm
//Result := I * 2;
shl eax,1
end;
Timing can reveal which implementation is fastest. We can also consult the latency and throughput tables in the Intel or AMD documents. add and mov have a latency and throughput of 0.5 cycles; mul has a latency of 14-18 cycles and a throughput of 5 cycles; shl has a latency of 4 cycles and a throughput of 1 cycle. The version chosen by Delphi is the most efficient on P4, and this will probably also be the case on Athlon and P3.
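A rough way to time the three variants is sketched below. It is not a proper benchmark: it assumes 32-bit Delphi on Windows, uses GetTickCount (which has coarse resolution), and the function names, the TimeIt helper and the iteration count are all made up for this example. The call and loop overhead will probably hide much of the difference between the instructions.

```pascal
program TimeMulInt2;
{$APPTYPE CONSOLE}

uses
  Windows;

const
  Iterations = 100000000;   // assumed loop count, tune for your machine

type
  TMulFunc = function(I : Integer) : Integer;

function MulInt2_Add(I : Integer) : Integer;
asm
  add eax, eax
end;

function MulInt2_Mul(I : Integer) : Integer;
asm
  mov ecx, 2
  mul ecx
end;

function MulInt2_Shl(I : Integer) : Integer;
asm
  shl eax, 1
end;

// Calls F Iterations times and returns the elapsed time in milliseconds.
function TimeIt(F : TMulFunc) : Cardinal;
var
  N, Acc : Integer;
  Start : Cardinal;
begin
  Acc := 0;
  Start := GetTickCount;
  for N := 1 to Iterations do
    Acc := Acc xor F(N);     // combine the results so they are actually used
  Result := GetTickCount - Start;
  if Acc = MaxInt then       // never true (Acc is always even); keeps Acc in use
    WriteLn(Acc);
end;

begin
  WriteLn('add: ', TimeIt(MulInt2_Add), ' ms');
  WriteLn('mul: ', TimeIt(MulInt2_Mul), ' ms');
  WriteLn('shl: ', TimeIt(MulInt2_Shl), ' ms');
end.
```

Running each measurement several times and comparing the lowest values gives a somewhat steadier picture than a single run.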

Issues not covered: mul versus imul and range checking, other calling conventions, benchmarking, clock counts on other processors, clock count for call + ret, location of the return address for ret, etc.

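Translator's notes:
1. Full names of the CPU extension instruction sets
MMX (Multi Media eXtension),
SSE (Streaming SIMD Extensions),
SSE2 (Streaming SIMD Extensions 2).

2. Latency and throughput
http://www.yesky.com/120/1645120.shtml
Latency: the meaning is hard to grasp from the word alone. In practice it is the number of clock cycles needed to execute an instruction completely; the lower, the better.
Strictly speaking, latency covers the whole path of an instruction from being received to being sent out. Most current x86 instructions need about 5 clock cycles, but some of these cycles overlap with other instructions (parallel processing), so the latency advertised by CPU manufacturers is longer than the actual time.
Throughput: this has two meanings.
First: the minimum number of clock cycles needed to execute one instruction; the lower, the better. The faster an instruction executes, the smaller the chance that the next instruction competes with it for resources.
Second: the maximum number of instructions that can be executed in a given time; the higher, the better.
3. Lesson 1 actually consists of 1A, 1B, and 1C. This post is 1A.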

[ Last edited by skyjacker on 2007-6-2 23:22 ]


Attachment: [Proofread] l1.rar (2007-6-2 23:22, 4.52 K)
Posted by skyjacker (moderator) at 2007-4-22 20:33
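A note on local variables:
1. Local variables are allocated on the stack. When Delphi generates the assembly code for a function, it simply adjusts the stack pointer at the function entry.

2. Reference-counted resources such as string and interface are, of course, initialized, and a structure similar to try...finally...end is generated automatically to make sure they are released when the function finishes.

From QQGroup: 130970 - 云深不知处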




Posted by kendling (小冬) (senior moderator) at 2007-4-23 10:17
Ah, it is finally out.




 



