<<BASM 初学者入门>> 第 6 课 - 技术板块灌水区 - CnPack Forum

skyjacker

版主

茶农

UID 2239
Digest Posts 9
Credits 617
Posts 269
点点分 617
Reading Access 100
Registered 2006-6-8
Status Offline

Post at 2007-6-1 16:45 Profile | Blog | P.M. |

<<BASM 初学者入门>> 第 6 课

Lesson 6

http://www.cnpack.org
QQ Group: 130970
翻译：SkyJacker
版本：草稿版
状态：未校对
时间：2007

This is lesson number 6 and the topic is CharPos.
第 6 课的主题是 CharPos。

This function search for the first occurrence of a character in a string,and returns the position of it when found.
If nothing is found it returns zero. The Delphi library function does basically the same thing,
but with the difference that it search for a substring in a string.
Passing a character to Pos as the substring is possible and this way use Pos as a CharPos.
In this lesson we will develop a CharPos that is nearly 4 times faster than Delphi's Pos.
As usual we start with a Pascal implementation of the algorithm.
这个函数搜索一个字符串中的字符，并且返回发现的位置，如果没有发现则返回零。
Delphi 库也有类似的函数，但是不同的是，Delphi 是在一个串中搜索一个子串。
给 Pos 传输一个字符作为子串也可以像 CharPos 一样使用 Pos。
在这一课中我们将编写一个 CharPos, 它比 Delphi 的 Pos 快将近 4 倍。
照常，我们从一个 pascal 算法讲起。
function CharPos2(Chr : Char; const Str : AnsiString) : Cardinal;
var
I : Integer;
begin
if (Str <> '') then
  begin
I := 0;
repeat
Inc(I);
until((Str[I] = Chr) or (Str[I] = #0));
if (Str[I] <> #0) then
Result := I
else
Result := 0;
  end
else
  Result := 0;
end;

Input to the function is a Char and a string. The string is passed as const. The functions Result are a Cardinal.
First the function checks that the string is not empty and if is empty it returns zero.
If there is a string we iterate over it with a repeat until loop searching for a match to the input char.
If an occurrence of the char is not found before we reach the zero terminator this will also terminate the loop.
Because the loop can terminate on either of the two conditions it is necessary to check
what terminated the search before we know what to return as result.
If the loop was ended by a successful search we return the value of the loop counter, and if the search was
unsuccessful the result is zero.
It is also possible to use the length of the string as a condition for terminating the loop on an unsuccessful search.
Then the code looks like this.
函数的输入是一个 Char 和 string，string 是常量，函数返回一个 Cardinal。
首先，函数检查一个字符串是否为空，如果空，则返回零。
使用一个 repeat until 循环在字符串中搜索匹配的输入字符，直到迭代结束。
在到达零终止字符之前，如果没有发现这个字符，那么循环也将被终止。
因为循环可以被两个条件中的任何一个终止，所以在返回结果之前需要检查是谁终止了搜索。
如果成功搜索到此字符，将返回循环计数，如果没找到，则返回零。

也可以使用字符串的长度来结束一个没有成功搜索到字符的循环，代码如下：
function CharPos1(Chr : Char; const Str : AnsiString) : Cardinal;
var
StrLenght, I : Integer;
begin
StrLenght := Length(Str);
if StrLenght > 0 then
  begin
I := 0;
repeat
Inc(I);
until((Str[I] = Chr) or (I > StrLenght));
if I <= StrLenght then
Result := I
else
Result := 0;
  end
else
  Result := 0;
end;
Before we start jumping into ASM it is a good idea to see which of the Pascal implementations is the fastest.
For doing this benchmark must be established.
在我们开始进入 asm 之前，最好先看看哪个 pascal 代码执行的最快。
建立如下测试基准：

const
NOOFLOOPS : Cardinal = 200000;
SCALE : Cardinal = 1000;

procedure Benchmark;
var
lpPerformanceCount, StartCount, EndCount : TLargeInteger;
Succes : Boolean;
Str, Str1, FunctionName : AnsiString;
Chr1, Chr2 : Char;
I, CharPos, J, K, Bench, SumBench : Cardinal;
StringArray : array[1..255] of AnsiString;

begin
Series1.Clear;
Str1 := 'T';
for J := 1 to 255 do
  begin
StringArray[J] := Str1;
Str1 := 'A' + Str1;
  end;
SumBench := 0;
Chr1 := 'T';
Chr2 := 'X';
for K := 1 to 255 do
  begin
Str := StringArray[K];
Succes := QueryPerformanceCounter(lpPerformanceCount);
if Succes then
StartCount := lpPerformanceCount
else
raise Exception.Create('QueryPerformanceCounter failed');
for I := 1 to NOOFLOOPS do
begin
   CharPos := CharPosFunction(Chr1, Str);
end;
for I := 1 to NOOFLOOPS do
begin
   CharPos := CharPosFunction(Chr2, Str);
end;
Succes := QueryPerformanceCounter(lpPerformanceCount);
if Succes then
EndCount := lpPerformanceCount
else
raise Exception.Create('QueryPerformanceCounter failed');
Bench := Round((EndCount - StartCount) / SCALE);
Series1.AddXY(K, Bench, '', clBlue);
Bench := Round(Bench / K);
SumBench := SumBench + Bench;
Update;
  end;
FunctionName := FunctionSelectRadioGroup.Items[FunctionSelectRadioGroup.ItemIndex];
ReportRichEdit.Lines.Add(FunctionName + #9 + IntToStr(SumBench));
end;

The benchmark builds an array of 255 AnsiStrings. The first string is 'T'. 'T' is also the character we use as the
search character. String number 1 is therefore the shortest possible successful match.
The next strings are 'AT','AAT' and 'AAAT'. I guess the pattern is clear.
It is equally important to measure the performance on unsuccessful search.
For this purpose we pass 'X' as search character. The benchmark does NOOFLOOPS searches on each string and
measures the time used on each string.
Because we want results from each string length to contribute approximately the same to
the benchmark each timing is divided by the length of the string.
On this benchmark CharPos1 obtains the score 767 on a P4 1600A clocked at 1920 and CharPos2 obtains the score 791.
For comparison Delphi Pos scores 2637.
Because CharPos1 is slightly better than CharPos2 we select this as basis for optimisation.
This is the ASM code Delphi 6 compiled it into with optimizations turned on.
这个基准测试创建了一个大小为 255 的字符串数组。
第一个串是 'T', 它是我们要搜索的字符, 也是第一个最短的匹配串。
接下来的串是 'AT', 'AAT' 和 'AAAT'。
我想这个形式已经很清楚了。
测试未找到字符时的性能也很重要。为了这个目的，我们传入 'X'，作为要搜索的串。
基准循环搜索每个串，并测量在每个串上使用的时间。
这是因为我们用每次的基准测试值除以字符串的长度来近似的表示。
在 P4 1600A 主频是 1920 上进行基准测试，CharPos1 获得了 767 分，CharPos2  获得 791分,
Delphi 中的 Pos 是 2637 分。
因为 CharPos1 比 CharPos2 性能稍微高些，因此我们选择 Charpos1 进行优化。
打开 Delphi6 的优化编译选项，生成 ASM 代码如下：

function CharPos14(Chr : Char; const Str : AnsiString) : Cardinal;
var
StrLenght, I : Integer;

begin
{
push ebx
push esi
mov  esi,edx
mov  ebx,eax
}
StrLenght := Length(Str);
{
mov  eax,esi
call @LStrLen
mov  edx,eax
}
if StrLenght > 0 then
{
test edx,edx
jle  @Else1Begin
}
  begin
I := 0;
{
xor eax,eax
}
repeat
{
@RepeatBegin :
}
Inc(I);
{
inc eax
}
until((Str[I] = Chr) or (I > StrLenght));
{
cmp bl,[esi+eax-$01]
jz  @If2
cmp edx,eax
jnl @RepeatBegin :
}
if I <= StrLenght then
{
@If2 :
cmp edx,eax
jnl @Exit
}
Result := I
{
}
else
Result := 0;
{
xor eax,eax
pop esi
pop ebx
ret
}
  end
else
  Result := 0;
  {
  @Else1Begin :
  xor eax,eax
  }
{
@Exit :
pop esi
pop ebx
}
end;
This time there is no stack frame. Ebx and esi is used and needs to be backed up and restored by pushing and
popping them on the stack.
Because the function does not have its own stack frame they are simply pushed on the stack on the top of the
calling functions frame.
这次没有分配堆栈。
因为用到了 Ebx 和 esi ，因此最好需要通过 push 和 pop 来保存和恢复它们。
这个函数没有分配堆栈，在此函数的开始，它们只是被简单的压入栈中。

The input parameters are received in eax an edx and they are as first thing copied into esi and ebx.
The function Length has a secret internal name, which is LStrLen. This function is called on Str,
which is passed in eax. We see that LStrLen is following the register calling convention.
输入参数在 eax 和 edx 中，它们首先被复制到 esi 和 ebx。
Length 函数有个秘密的内部名字叫 LStrLen。这个函数用于 Str，通过 eax 传递。
我们看到 LStrLen 遵循了如下寄存器调用约定:

Str was received in edx copied to esi and then to eax. LStrLen delivers its result, StrLenght, in eax.
This result is copied into edx and compared to 0. Test edx, edx is the same as cmp edx,0 and it sets the flags.
The jle instruction checks the flags and pass execution to the else part of the if-else code if StrLenght is
lower than or equal to zero.
Str 通过 edx 接收后复制到 esi，然后再到 eax 中。LStrLen 通过 EAX 返回它的结果 StrLenght。
结果复制到 edx 中与 0 比较。Test edx, edx 与 cmp edx,0 类似都设置状态寄存器标志。
jle 检测标志，如果 StrLenght 小于等于零, 则执行 if-else 的 else 部分。

In this else code we have one line of Pascal, which is Result := 0;.
Because our function must return its result in eax we create a zero here by x-oring eax by itself.
If the string is longer than zero execution is continued in the if code.
在 else 代码中有一行 Pascal 代码 Result := 0。
因为我们的函数必须在 eax 中返回结果，因此我们通过异或 eax 自己来清零。
如果字符串长度大于零，则继续执行 if 部分的代码。

The first line here initialises the loop counter I to zero.
Again a xor instruction is used for this.
The loop body has one line only and this is easy to understand Inc(I); = inc eax.
Pascal and ASM is nearly the same thing ;-)
The implementation of the loop-closing test is where the real meat of this function is.
It is made up of these four lines of ASM.
第一行初始化循环计数为 I 为零。又使用了一个 xor 指令。
循环体只有一行，很容易明白 Inc(I); = inc eax。
Pascal 和 ASM 代码相当类似。
循环结束时测试函数是在哪里结束的。
它由 4 行 ASM 代码组成:

cmp bl,[esi+eax-$01]
jz  @If2
cmp edx,eax
jnl @RepeatBegin :

We see that there is two jump instructions.
The last one jumps to the start of the loop and the first one jmps out of the loop.
There are two cmp instructions to set the flags. Number two is the easiest to understand.
It compares eax with edx. A fast look at the Pascal code tells us that
StrLenght and I must be in these two registers.
In fact we have just seen that eax is I and we also saw in the top of the function that StrLenght was in edx.
我们看到有两个跳转指令。
最后一个跳到循环的开始，第一个跳出循环。
通过两个 cmp 指令设置标志，这两个很容易理解。
比较 eax 和 edx。Pascal 代码告诉我们 StrLenght 和 I 在这些寄存器中。
事实上，我们已经知道 eax 是 I，在函数的开始我们能看到 StrLenght 在 edx 中。

In line number 4 Chr was copied into ebx, but a char is only one byte.
Therefore the first cmp instruction compares something to bl, which is the lowest byte of the four bytes in ebx.
We expect that the search character - Chr - be compared to character number 1, 2, 3.. of the input string.
[esi+eax-$01] must therefore be a pointer into this string.
eax is the loop counter I, which is 1 at the first iteration. esi must be the address of the Str parameter that
was received in edx and immediately copied into esi.
-$01 is a constant offset, which is needed because the first character in an AnsiString is located at position 0.
The position where Str points to.
在第 4 行 Char 被复制到 ebx，但是一个字符只有一个字节。
因此，第一个比较指令与 bl 比较，它是 ebx 4 个字节中最低的字节。
我们希望通过比较输入字符串的第 1,2,3 ...个字符来搜索 Chr。
[esi+eax-$01] 指向这个字符串。
eax 是循环计数 I，在迭代的开始置为 1。esi 是字符串参数的地址，它是通过 edx 接收后立即复制到了 esi。
-$01 是一个常量偏移，它是必需的，因为字符串的第一个字符的偏移量是 0。
Str 已经指向了这个位置。

Where has the or from the Pascal code gone?
To understand this we need to talk about the optimisation called incomplete Boolean evaluation.
This optimisation can be applied to the Boolean operator and, and to the Boolean operator or.
Let us take a look at the truth table for and first.
Pascal 代码将执行到哪里呢？
为了弄明白这个，我们来说明一下不完全布尔计算优化。
这个优化可以被应用到布尔操作符 and 或 or。
首先，我们来看一下 and 的真值表
false and false is false
false and true is false
true and false is false
true and true is true

The and operator is only returning true if both operands are true.
This is taken advantage of by the incomplete Boolean evaluation optimisation.
If one operand is found false there is no need to check the second one, because the result of
and will be false no matter what it is.
只有两个操作数都是 true, And 操作符才会返回 true。
不完全布尔计算的优点是，如果有一个操作数为 false，则不需要检测第二个，因为无论它是什么结果都是 false。

The truth table for or is
or 的真值表如下：
false or false is false
false or true is true
true or false is true
true or true is true
The result of an or is true if one or both of the operands are true.
If one is seen to be true there is no need to check the second operand.
如果有一个或者两个都为 true，or 的结果就是 true。
如果有一个是 true，不需要检测第二个操作数。

Our loop-closing test takes advantage of this by jumping out of the loop if the first compare is successful.
This is the case if we found a match for the search character in the string.
If it was a match it does not make sense to see if it was the zero terminator too!
The last compare is only executed on unsuccessful matches.
If we knew something about the strings and characters our function was called upon most often we could take advantage
of it by changing the order of the tests such that the most often true one is located first.
在循环结尾我们可以利用这个特点检测，如果第一个比较是成功的，那么就跳出循环。
这是在字符串中找到了匹配字符的情况。
如果它也是零终止的，那么去匹配是没有意义的。
最后一个比较只在没有找到匹配字符时执行。
如果我们知道调用函数时经常传入的字符串和字符，我可以利用这个特点改变比较的顺序，将经常是 true 的比较放在前面。

Try switching on the compiler switch "complete Boolean evaluation", in project options,
and see what code is created then.
The rest of the code resample what we have seen in earlier lessons and
I will skip the explanation and go get a cup of coffee instead ;-)
Now it is time to do some optimizations. At first the function is made pure BASM.
The labels were introduced in the previous code listing.
Here it is seen that they follow the Pascal code in an intuitive way.
在工程选项中打开 “完全布尔检测”编译开关，看看代码发生了什么变化。
这部分代码我们已经在前面的课程中见到了，因此我不再解释它，我要去拿一杯咖啡了 ;-)
现在，让我们来做一些优化。首先将函数整理为纯 BASM。
在前面的代码列表中已经介绍了标签。
我们将这些书签定义为直观的形式。

function CharPos15(Chr : Char; const Str : AnsiString) : Cardinal;
//var
//StrLenght, I : Integer;
asm
push ebx
push esi
mov  esi,edx
mov  ebx,eax
//StrLenght := Length(Str);
mov  eax,esi
call System.@LStrLen
mov  edx,eax
//if StrLenght > 0 then
test edx,edx
jle  @Else1Begin
//I := 0;
xor  eax,eax
//repeat
@RepeatBegin :
//Inc(I);
inc  eax
//until((Str[I] = Chr) or (I > StrLenght));
cmp  bl,[esi+eax-$01]
jz @If2
cmp  edx,eax
jnl  @RepeatBegin
//if I <= StrLenght then
@If2 :
cmp  edx,eax
jnl  @Exit
//Result := I
//else
//Result := 0;
xor  eax,eax
pop  esi
pop  ebx
ret
//else
//Result := 0;
@Else1Begin :
xor  eax,eax
@Exit :
pop  esi
pop  ebx
end;
The call to LStrLen has been prefixed with System, otherwise the compiler will not recognize it.
LStrLen is implemented in the System unit.
The var section is removed because we do not reference any variables by name.
函数 LStrLen 加上了 System 前缀，否则编译器不认识它。
LStrLen 在 System 单元中实现。
var 区被删除了，因为我们没有引用任何变量。

function CharPos16(Chr : Char; const Str : AnsiString) : Cardinal;
asm
push ebx
push esi
mov  esi,edx
mov  ebx,eax
//StrLenght := Length(Str);
mov  eax,esi
//call System.@LStrLen
//*************
test eax,eax
jz  @LStrLenExit
mov eax,[eax-$04]
@LStrLenExit :
//*************
mov  edx,eax
//if StrLenght > 0 then
test edx,edx
jle @Else1Begin
//I := 0;
xor  eax,eax
//repeat
@RepeatBegin :
//Inc(I);
inc  eax
//until((Str[I] = Chr) or (I > StrLenght));
cmp  bl,[esi+eax-$01]
jz @If2
cmp  edx,eax
jnl  @RepeatBegin
//if I <= StrLenght then
@If2 :
cmp  edx,eax
jnl  @Exit
//Result := I
//else
//Result := 0;
xor  eax,eax
pop  esi
pop  ebx
ret
//else
//Result := 0;
@Else1Begin :
xor  eax,eax
@Exit :
pop  esi
pop  ebx
end;
The first thing we do is to inline LStrLen. This is done by tracing into it and copying the body of the function
from the cpu view. It is made up of these four lines.
首先我们来介绍内联函数 LStrLen。
我们在 cpu 窗口中跟进此函数，复制这个函数的内部实现。它是由下面四行组成：

test eax,eax
jz +$03
mov eax,[eax-$04]
ret

If a nil pointer is passed to LStrLen in eax nothing is done but returning.
If the pointer is valid the length of the string is found at the 4 bytes preceding the start of the string.
These 4 bytes are returned in eax. To inline it we replace the call with the 4 lines.
如果给 LStrLen 的 eax 传入了一个 nil 指针，则什么也不做，直接返回。
如果指针是合法的，在这个字符串起始的前面 4 个字节中有这个字符串的长度。
这 4 个字节用 eax 返回。使用它来代替这 4 行代码。

Jz is transferring execution to the ret instruction.
Instead of this ret instruction we place a label called LStrLenExit.
The ret returned execution to the line after the call of the function.
This ret has to be removed, otherwise it would pass execution to the function that called CharPos,
not exactly what we want.
Jz 跳到 ret 指令。我们用一个叫做 LStrLenExit 的标签来代替 ret。
ret 返回到调用本函数的函数的下一行。
ret 必须被删除，否则它将使函数 CharPos 返回，这不是我们想要的。

This is what the inlined LStrLen ended up looking like
内联 LStrLen 修改为如下：
test eax,eax
jz  @LStrLenExit
mov eax,[eax-$04]
@LStrLenExit :

function CharPos17(Chr : Char; const Str : AnsiString) : Cardinal;
asm
push ebx
push esi
mov  esi,edx
mov  ebx,eax
//StrLenght := Length(Str);
mov  eax,esi
//*************
test eax,eax
//jz  @LStrLenExit
jz  @Else1Begin
mov eax,[eax-$04]
//@LStrLenExit :
//*************
mov  edx,eax
//if StrLenght > 0 then
//test edx,edx
//jle @Else1Begin
//I := 0;
xor  eax,eax
//repeat
@RepeatBegin :
//Inc(I);
inc  eax
//until((Str[I] = Chr) or (I > StrLenght));
cmp  bl,[esi+eax-$01]
jz @If2
cmp  edx,eax
jnl  @RepeatBegin
//if I <= StrLenght then
@If2 :
cmp  edx,eax
jnl  @Exit
//Result := I
//else
//Result := 0;
xor  eax,eax
pop  esi
pop  ebx
ret
//else
//Result := 0;
@Else1Begin :
xor  eax,eax
@Exit :
pop  esi
pop  ebx
end;
现在我们来看 pascal 代码 if Strlenght>0，它和内联 LStrLen 第一行检测相同的东西。
必须检测是一次 Str 是 nil 的情况。
删除第二行，将第一行修改为调到 @Else1Begin，来替换 StrLen 函数中如果 Str 是 nil 则跳出。
LStrLenExit 标签不需要了。

function CharPos18(Chr : Char; const Str : AnsiString) : Cardinal;
asm
push ebx
push esi
mov  esi,edx
mov  ebx,eax
//StrLenght := Length(Str);
//mov  eax,esi
//if StrLenght > 0 then
//test eax,eax
test esi,esi
jz  @Else1Begin
//mov eax,[eax-$04]
mov eax,[esi-$04]
mov  edx,eax
//I := 0;
xor  eax,eax
//repeat
@RepeatBegin :
//Inc(I);
inc  eax
//until((Str[I] = Chr) or (I > StrLenght));
cmp  bl,[esi+eax-$01]
jz @If2
cmp  edx,eax
jnl  @RepeatBegin
//if I <= StrLenght then
@If2 :
cmp  edx,eax
jnl  @Exit
//Result := I
//else
//Result := 0;
xor  eax,eax
pop  esi
pop  ebx
ret
//else
//Result := 0;
@Else1Begin :
xor  eax,eax
@Exit :
pop  esi
pop  ebx
end;
We moved the test of StrLenght = 0 and the comment //if StrLenght > 0 then must be moved too.
After the inlining it becomes possible to copy propagate esi in these lines.
我们移动了 StrLenght = 0 的测试，注释 // if StrLenght > 0 也必须被移动。
在内联之后，下面这些行变成了重复复制 esi。

mov  eax,esi
//*************
test eax,eax
jz  @Else1Begin
mov eax,[eax-$04]
The last of the lines overwrites eax and is the last use of the value in eax that was copied from esi.
代码的最后重写了 eax，这是从 esi 复制之后最后一次使用 eax 的值。

mov  eax,esi
//*************
//test eax,eax
test esi,esi
jz  @Else1Begin
//mov eax,[eax-$04]
mov eax,[esi-$04]

Actually we must also follow the possible branch to Else1Begin and see whether the value in eax is used here.
We see that eax is zeroed out in the line immediately after the label and therefore is not used.
This way the first line is seen to be dead and can be removed.
实际上，我们也必须看看分支 Else1Begin 处是否使用了 eax 的值。
我们看到 eax 在这个标签之后被清零，因此没有再被用。
第一行是无用的，可以删除掉
//mov  eax,esi
test esi,esi
jz  @Else1Begin
mov eax,[esi-$04]

function CharPos19(Chr : Char; const Str : AnsiString) : Cardinal;
asm
push ebx
push esi
mov  esi,edx
mov  ebx,eax
//if StrLenght > 0 then
test esi,esi
jz  @Else1Begin
//StrLenght := Length(Str);
//mov eax,[esi-$04]
mov edx,[esi-$04]
//mov  edx,eax
//I := 0;
xor  eax,eax
//repeat
@RepeatBegin :
//Inc(I);
inc  eax
//until((Str[I] = Chr) or (I > StrLenght));
cmp  bl,[esi+eax-$01]
jz @If2
cmp  edx,eax
jnl  @RepeatBegin
//if I <= StrLenght then
@If2 :
cmp  edx,eax
jnl  @Exit
//Result := I
//else
//Result := 0;
xor  eax,eax
pop  esi
pop  ebx
ret
//else
//Result := 0;
@Else1Begin :
xor  eax,eax
@Exit :
pop  esi
pop  ebx
end;
Also as a result of inlining LStrLen we can remove one more mov.
LStrLen returned its result in eax and then it was copied into edx.
mov eax, [esi-$04] can then be changed to mov edx, [esi-$04] and the mov edx, eax can be removed.
After this we change focus to the loop. This is far more important because it is executed many more times,
depending on the length of the string or the position of a match.
内联了 LStrLen 之后我们可以删除更多的 mov 指令。
LStrLen 用 eax 返回结果，然后又被复制到 edx。
mov eax, [esi-$04] 可以被改为 mov edx, [esi-$04]，
mov edx, eax 可以被删除了。
之后我们将焦点转向循环。这是更重要的，因为循环会执行会很多次，它依赖于字符串的长度
和那个匹配字符的位置。

function CharPos20(Chr : Char; const Str : AnsiString) : Cardinal;
asm
push ebx
push esi
mov  esi,edx
mov  ebx,eax
//if StrLenght > 0 then
test esi,esi
jz  @Else1Begin
//StrLenght := Length(Str);
mov edx,[esi-$04]
//I := 0;
xor  eax,eax
dec  esi
@RepeatBegin :
//Inc(I);
inc  eax
//until((Str[I] = Chr) or (I > StrLenght));
//cmp  bl,[esi+eax-$01]
cmp  bl,[esi+eax]
jz @If2
cmp  edx,eax
jnl  @RepeatBegin
//if I <= StrLenght then
@If2 :
cmp  edx,eax
jnl  @Exit
//Result := 0;
xor  eax,eax
pop  esi
pop  ebx
ret
//Result := 0;
@Else1Begin :
xor  eax,eax
@Exit :
pop  esi
pop  ebx
end;
When we analyzed the code we saw that there was an offset of -1 in the addressing into the string.
There is no need to subtract this offset at each iteration. It is a better idea to decrement the pointer to
Str in esi once before entering the loop. We could also have decremented the loop counter in eax, but then we
would have to add 1 again before returning the result.
At the very top of the function the two input parameters are copied to new registers.
This is redundant and we would like to copy propagate both, but eax is used as loop counter and we must first
find another register for this purpose.
我们分析代码发现, 在索引字符串的地址中有个 -1 的偏移量。
没有必要在每次迭代中都减去这个偏移。
一个比较好的办法是，在进入循环之前将 str 的指针减 1。
我们也可以在循环中将 eax 减 1，但是在返回结果之前又不得不再加 1。
在函数的最上面，两个形参被复制到新的寄存器。
这是冗余的，我们想复制传播这两个，但是 eax 被作为循环计数，我们必须首先寻找另一个寄存器。

function CharPos22(Chr : Char; const Str : AnsiString) : Cardinal;
asm
push ebx
push esi
mov  esi,edx
mov  ebx,eax
//if StrLenght > 0 then
test esi,esi
jz  @Else1Begin
//StrLenght := Length(Str);
mov edx,[esi-$04]
//I := 0;
//xor  eax,eax
xor  ecx,ecx
dec  esi
@RepeatBegin :
//Inc(I);
//inc  eax
inc  ecx
//until((Str[I] = Chr) or (I > StrLenght));
//cmp  bl,[esi+eax]
cmp  bl,[esi+ecx]
jz @If2
//cmp  edx,eax
cmp  edx,ecx
jnl  @RepeatBegin
//if I <= StrLenght then
@If2 :
//cmp  edx,eax
cmp  edx,ecx
jnl  @Exit
//Result := 0;
xor  eax,eax
pop  esi
pop  ebx
ret
//Result := 0;
@Else1Begin :
xor  eax,eax
pop  esi    //New
pop  ebx    //New
ret          //New
@Exit :
mov  eax, ecx
pop  esi
pop  ebx
end;
All lines where eax is in use for I, eax is changed to ecx.
Because I is the return value of the function on a match and the return value is in eax,
we have to copy ecx into eax just beneath the @Exit label.
This introduces a little problem, because a jump to Else1Begin also brings us here,
and in this situation we copy a value from ecx into eax, which we have just cleared.
The fix is to add the lines marked new.
所有使用 eax 都是为了 I，将 eax 改为 ecx。
因为当有匹配字符时，I 是函数的返回值，并且用 eax 返回，因此我们需要在 @Exit 标签下面复制 ecx 到 eax。
说明一个小问题，因为 Else1Begin 也会到这里，在这种情况写刚刚被清空的 eax 会被 ecx 重写。
增加的行用 new 标识

Then we are ready to copy propagate eax.
Only one line uses ebx. This is cmp bl, [esi+ecx], which is changed to cmp al, [esi+ecx].
Then the copy mov ebx, eax becomes dead and is removed.
This was copy propagation followed by dead code removal once more and
we begin realising how important this optimization is.
接下来我们要实现复制传播 eax。
只有 cmp bl, [esi+ecx] 这一行使用 ebx, 将它改为 cmp al, [esi+ecx]。
mov ebx, eax 这一行没用了，可以删除。
复制传播可以移除更多的无效代码，现在我们明白了优化有多么重要了吧。

function CharPos23(Chr : Char; const Str : AnsiString) : Cardinal;
asm
push ebx
push esi
mov  esi,edx
//mov  ebx,eax
//if StrLenght > 0 then
test esi,esi
jz  @Else1Begin
//StrLenght := Length(Str);
mov edx,[esi-$04]
//I := 0;
xor  ecx,ecx
dec  esi
@RepeatBegin :
//Inc(I);
inc  ecx
//until((Str[I] = Chr) or (I > StrLenght));
//cmp  bl,[esi+ecx]
cmp  al,[esi+ecx]
jz @If2
cmp  edx,ecx
jnl  @RepeatBegin
//if I <= StrLenght then
@If2 :
cmp  edx,ecx
jnl  @Exit
//Result := 0;
xor  eax,eax
pop  esi
pop  ebx
ret
//Result := 0;
@Else1Begin :
xor  eax,eax
pop  esi
pop  ebx
ret
@Exit :
mov  eax, ecx
pop  esi
pop  ebx
end;

Before we can copy propagate edx (holding the pointer to Str), we must free edx from other uses.
It is used for StrLenght and we allocate ebx for this instead of edx.
在我们复制传播 edx (指向 str) 之前，我们必须将 edx 从其他使用它的地方解放出来。
它被用来存放 StrLenght，我们为它分配 ebx 来代替 edx。

function CharPos24(Chr : Char; const Str : AnsiString) : Cardinal;
asm
push ebx
push esi
mov  esi,edx
//if StrLenght > 0 then
test esi,esi
jz  @Else1Begin
//StrLenght := Length(Str);
//mov edx,[esi-$04]
mov ebx,[esi-$04]
//I := 0;
xor  ecx,ecx
dec  esi
@RepeatBegin :
//Inc(I);
inc  ecx
//until((Str[I] = Chr) or (I > StrLenght));
cmp  al,[esi+ecx]
jz @If2
//cmp  edx,ecx
cmp  ebx,ecx
jnl  @RepeatBegin
//if I <= StrLenght then
@If2 :
//cmp  edx,ecx
cmp  ebx,ecx
jnl  @Exit
//Result := 0;
xor  eax,eax
pop  esi
pop  ebx
ret
//Result := 0;
@Else1Begin :
xor  eax,eax
pop  esi
pop  ebx
ret
@Exit :
mov  eax, ecx
pop  esi
pop  ebx
end;
Then edx is copy propagated and the mov esi, edx becomes dead.
edx 被复制传播，mov esi, edx 无效了。

function CharPos25(Chr : Char; const Str : AnsiString) : Cardinal;
asm
push ebx
push esi
//mov  esi,edx
//if StrLenght > 0 then
//test esi,esi
test edx,edx
jz  @Else1Begin
//StrLenght := Length(Str);
//mov ebx,[esi-$04]
mov ebx,[edx-$04]
//I := 0;
xor  ecx,ecx
//dec  esi
dec  edx
@RepeatBegin :
//Inc(I);
inc  ecx
//until((Str[I] = Chr) or (I > StrLenght));
//cmp  al,[esi+ecx]
cmp  al,[edx+ecx]
jz @If2
cmp  ebx,ecx
jnl  @RepeatBegin
//if I <= StrLenght then
@If2 :
cmp  ebx,ecx
jnl  @Exit
//Result := 0;
xor  eax,eax
pop  esi
pop  ebx
ret
//Result := 0;
@Else1Begin :
xor  eax,eax
pop  esi
pop  ebx
ret
@Exit :
mov  eax, ecx
pop  esi
pop  ebx
end;
This removed the use of esi and it does not need to be saved and restored any more. When removing the pop esi,
we remember that there are 3 exit paths all with each own pop esi.
移除 esi 的使用，它不再需要保存和恢复了。
当我们移除 pop esi 时，要记着有 3 个退出路径都是自己 pop esi。
function CharPos26(Chr : Char; const Str : AnsiString) : Cardinal;
asm
push ebx
//push esi
//if StrLenght > 0 then
test edx,edx
jz  @Else1Begin
//StrLenght := Length(Str);
mov ebx,[edx-$04]
//I := 0;
xor  ecx,ecx
dec  edx
@RepeatBegin :
//Inc(I);
inc  ecx
//until((Str[I] = Chr) or (I > StrLenght));
cmp  al,[edx+ecx]
jz @If2
cmp  ebx,ecx
jnl  @RepeatBegin
//if I <= StrLenght then
@If2 :
cmp  ebx,ecx
jnl  @Exit
//Result := 0;
xor  eax,eax
//pop  esi
pop  ebx
ret
//Result := 0;
@Else1Begin :
xor  eax,eax
//pop  esi
pop  ebx
ret
@Exit :
mov  eax, ecx
//pop  esi
pop  ebx
end;
In the line after the If2 label there is a line, which is identical to the second compare in the loop-closing test.
In Pascal it was necessary to have the line if I <= StrLenght after the loop because we could not know which
condition the loop terminated on.
This line generated the extra cmp ebx, ecx instruction, which looks a little redundant now.
It is in fact not redundant because two paths of execution lead to the If2 Label and only one of them has the test.
If we split the two exit paths such that only one of them goes to If2 we can remove the extra check.
Instead of jumping to If2 on a match we can jump directly to Exit.
在标签 If2 之后有一行是是否循环结束的第二次检测。
在 Pascal 代码的循环之后有 if I <= StrLenght, 因为我们不知道是被哪个条件终止的。
这一行产生额外的指令 cmp ebx, ecx，看起来有一点冗余。
实际上它不冗余，虽然有两路会执行到 If2 标签，但是只会测试一个。
如果我们分成两个退出路径，只让一个到达 If2，我们可以移除额外的检查。
将匹配时跳到 If2 改为直接跳到 Exit。

function CharPos27(Chr : Char; const Str : AnsiString) : Cardinal;
asm
push ebx
//if StrLenght > 0 then
test edx,edx
jz  @Else1Begin
//StrLenght := Length(Str);
mov ebx,[edx-$04]
//I := 0;
xor  ecx,ecx
dec  edx
@RepeatBegin :
//Inc(I);
inc  ecx
//until((Str[I] = Chr) or (I > StrLenght));
cmp  al,[edx+ecx]
//jz @If2
jz @Exit
cmp  ebx,ecx
jnl  @RepeatBegin
//if I <= StrLenght then
//@If2 :
//cmp  ebx,ecx
//jnl  @Exit
//Result := 0;
xor  eax,eax
pop  ebx
ret
//Result := 0;
@Else1Begin :
xor  eax,eax
pop  ebx
ret
@Exit :
mov  eax,ecx
pop  ebx
end;
Then the If2 label goes out of use and when we reach the code at this position we know that the zero terminator was
reached and there is no need to test for it again.
There are two sections of identical code just before the Else1Begin label and just after it.
The upper section can be deleted.
If2 标签没用了，当代码到达这个位置，我们已经知道到了零终止符，因此不需要再次检测了。
在 Else1Begin 的前后有两块相同的代码，上面的那块可以删除掉。
function CharPos28(Chr : Char; const Str : AnsiString) : Cardinal;
asm
push ebx
//if StrLenght > 0 then
test edx,edx
jz @Else1Begin
//StrLenght := Length(Str);
mov  ebx,[edx-$04]
//I := 0;
xor  ecx,ecx
dec  edx
@RepeatBegin :
//Inc(I);
inc  ecx
//until((Str[I] = Chr) or (I > StrLenght));
cmp  al,[edx+ecx]
jz @Exit
cmp  ebx,ecx
jnl  @RepeatBegin
//Result := 0;
//xor  eax,eax
//pop  ebx
//ret
//Result := 0;
@Else1Begin :
xor  eax,eax
pop  ebx
ret
@Exit :
mov  eax,ecx
pop  ebx
end;
This ended our search for redundant code to remove. The cleaned up function looks like this.
最终我们删除了发现的冗余代码，整理为如下函数：
function CharPos29(Chr : Char; const Str : AnsiString) : Cardinal;
asm
push ebx
//if StrLenght > 0 then
test edx,edx
jz @Else1Begin
//StrLenght := Length(Str);
mov  ebx,[edx-$04]
//I := 0;
xor  ecx,ecx
dec  edx
@RepeatBegin :
//Inc(I);
inc  ecx
//until((Str[I] = Chr) or (I > StrLenght));
cmp  al,[edx+ecx]
jz @Exit
cmp  ebx,ecx
jnl  @RepeatBegin
@Else1Begin :
//Result := 0;
xor  eax,eax
pop  ebx
ret
@Exit :
mov  eax,ecx
pop  ebx
end;
When the loop is iterating in search for a match or the end of the string,
these lines of code are executed over and over again
当循环搜索匹配字符或者直到字符串的结束时，这些代码被一次又一次的执行：
inc  ecx
cmp  al,[edx+ecx]
jz @Exit
cmp  ebx,ecx
jnl  @RepeatBegin

Let us make some variants of them and see how they perform.
让我们来进行一些变化，看看他们怎样执行
The most exiting line is
最重要的行是
cmp  al,[edx+ecx]
It generates two microinstructions.
他产生了两个微指令。
One for loading a byte from the address [edx+ecx] and one to compare it against al.
一个是从地址 [edx+ecx] 装入一个字节，一个是再和 al 比较。
This line could also be coded as
这行也可以这样写
mov ah, byte ptr [edx+ecx]
cmp al, ah

This variant also generates two microinstructions, but it needs one more register (ah).
这个变化也产生了两条微指令，但它还需要一个寄存器 ah。

If we allocate an extra register it could also be coded as
如果我们分配额外的寄存器，也可以这样编码
movzx efx, byte ptr [edx+ecx]
cmp al, fh
movzx is mov with zero extension.
It loads one byte into the lowest of efx and fills the three remaining bytes with zeroes.
Of course there is no such thing as an efx register, but the two unused registers esi & edi cannot be accessed byte wise.
Therefore it is necessary to free eax, ebx, ecx or edx, by substituting it by edi or esi and then use erg.
ebx instead of "efx".
This function demonstrates the first variant.
movzx 是 mov 的扩展版本。
它将一个字节装入 efx 的最低字节，将剩余的 3 个字节用零填充。
当然，没有 efx 这样的寄存器，寄存器 esi&edi 虽然没被使用，但是不能被作为字节访问。
因此，可以用 edi 或 esi 来取代 eax, ebx, ecx 或 edx，让它们自由。我们用 ebx 来替代 "efx"。
function CharPos30(Chr : Char; const Str : AnsiString) : Cardinal;
asm
push ebx
//if StrLenght > 0 then
test edx,edx
jz @Else1Begin
//StrLenght := Length(Str);
mov  ebx,[edx-$04]
//I := 0;
xor  ecx,ecx
dec  edx
@RepeatBegin :
//Inc(I);
inc  ecx
//until((Str[I] = Chr) or (I > StrLenght));
mov  ah, [edx+ecx]
//cmp  al,[edx+ecx]
cmp  al,ah
jz @Exit
cmp  ebx,ecx
jnl  @RepeatBegin
@Else1Begin :
//Result := 0;
xor  eax,eax
pop  ebx
ret
@Exit :
mov  eax,ecx
pop  ebx
end;

This function demonstrates the second variant.
这个函数说明了第二个变化。

function CharPos31(Chr : Char; const Str : AnsiString) : Cardinal;
asm
push ebx
push edi
//if StrLenght > 0 then
test edx,edx
jz @Else1Begin
//StrLenght := Length(Str);
mov  edi,[edx-$04]
//I := 0;
xor  ecx,ecx
dec  edx
@RepeatBegin :
//Inc(I);
inc  ecx
//until((Str[I] = Chr) or (I > StrLenght));
movzx ebx, byte ptr [edx+ecx]
//cmp  al,[edx+ecx]
cmp  al, bl
jz @Exit
cmp  edi,ecx
jnl  @RepeatBegin
@Else1Begin :
//Result := 0;
xor  eax,eax
pop  edi
pop  ebx
ret
@Exit :
mov  eax,ecx
pop  edi
pop  ebx
end;
Instead of adding edx and ecx in the address calculation at every iteration,
we could add them prior to entering the loop.
Then it is necessary to subtract them from each other again to extract the loop counter value for the result.
This is done with the sub instruction in line two after the Exit label.
为了替代在迭代中累加 edx 和 ecx 计算地址，我们可以在进入循环之前累加它们，
然后需要再分别减去他们来得到准确的循环计数结果。
在 Exit 标签后有两行代码来实现这个运算。

function CharPos32(Chr : Char; const Str : AnsiString) : Cardinal;
asm
push  ebx
push  edi
//if StrLenght > 0 then
test  edx,edx
jz @Else1Begin
//StrLenght := Length(Str);
mov edi,[edx-$04]
//I := 0;
xor ecx,ecx
//dec  edx
add ecx,edx
@RepeatBegin :
//Inc(I);
//until((Str[I] = Chr) or (I > StrLenght));
movzx ebx, byte ptr [ecx]
inc ecx
cmp al, bl
jz @Exit
//cmp  edi,ecx
test  bl, bl
jnz @RepeatBegin
@Else1Begin :
//Result := 0;
xor eax,eax
pop edi
pop ebx
ret
@Exit :
mov eax,ecx
sub eax,edx
pop edi
pop ebx
end;

Then there are 4 functions to compare the performance of; CharPos29, CharPos30, CharPos31 and CharPos32.
比较 CharPos29, CharPos30, CharPos31 和 CharPos32 这四个函数的性能。
Results on P4 1920 are 在 P4 1920 上
CharPos29 716
CharPos30 973
CharPos31 710
CharPos32 702
Winner is CharPos32 CharPos32 赢了
Results on P3 1400 are 在 P4 1400 上
CharPos29  949
CharPos30  921
CharPos31  950
CharPos32 1403
Winner is CharPos30 CharPos30 赢了
Summed time 总计时间
CharPos29 716 + 949  = 1665
CharPos30 973 + 921  = 1894
CharPos31 710 + 950  = 1660
CharPos32 702 + 1403 = 2105
Winner is CharPos31  CharPos31 赢了
On P4 the winning loop is 在 P4 上胜利的循环是
@RepeatBegin :
movzx ebx, byte ptr [ecx]
inc ecx
cmp al, bl
jz @Exit
test  bl, bl
jnz @RepeatBegin
On P3 the winning loop is 在 P3 上胜利的循环是
@RepeatBegin :
inc  ecx
mov  ah, [edx+ecx]
cmp  al,ah
jz @Exit
cmp  ebx,ecx
jnl  @RepeatBegin
on a blend of targets the winning loop is 在混合平台上胜利的循环是
@RepeatBegin :
inc ecx
movzx ebx, byte ptr [edx+ecx]
cmp al, bl
jz @Exit
cmp edi,ecx
jnl @RepeatBegin

The P4 winner performs very bad on P3 and could not be used in a library targeting more targets than P4,
such as the Delphi RTL.
The winner on P3 is performing quite bad on P4 and this one should not be used in a blended target library either.
The winner on a blend of the two targets is CharPos31, which performs near to optimal on P4 and
also performs optimal within a few percent on P3.
This function would make a perfect choice for the Delphi RTL, where I have missed a CharPos function.
It is a relief to see that it is possible to optimise for both processors at the same time
without sacrificing more than a few percent of performance.

在 P4 上的赢家在 P3 上没效率，不能用在除了 P4 以外的库中。就像 Delphi RTL。
P3 上的赢家在 P4 上执行效率非常的低，这个也不能用在混合平台的库里。
在两个处理器上，综合效率的赢家是 CharPos31 , 在 P4 上接近最佳，在 P3 上也没差多少。
这个函数是 Delphi RTL 最好的选择，在 RTL 中我已经失去了一个 CharPos 函数。
我们欣慰的看到，它可以同时作为两个处理器的优化版本，并且几乎没有牺牲一点性能。

Comparing performance of P3 and P4 on a clock-by-clock basis is always a hit.
There is broad tendency to think at the P4 as having an inferior design,
but this is not proved by this code.
Taking the blended winner's
performance and scaling it to a 1400 MHz processor it is 950 on P3 and 710 * (1920/1400) = 973 on P4.
The processors are performing very much the same at the same clock.

在 P3 和 P4 上比较时钟周期的性能也是一个打击。
有一个大胆的猜测就是 P4 也有一个低级的设计，但是不能被这个代码证明。
在两种平台上比较综合赢家的性能，放在 P3 1400HZ 的处理器上是 950，在 P4 上是 710 * (1920/1400) = 973。
这两种处理器几乎花费了同样的时钟周期。

[ 本帖最后由 skyjacker 于 2007-6-2 23:15 编辑 ]

Attachment: L6o.rar (2007-6-1 16:45, 12.39 K)
Download count 420

一壶清茶煮青春.

skyjacker

版主

茶农

UID 2239
Digest Posts 9
Credits 617
Posts 269
点点分 617
Reading Access 100
Registered 2006-6-8
Status Offline

Post at 2007-6-1 16:46 Profile | Blog | P.M. |

这一课很精彩。
作者提供的不仅是思路，更将优化的思维过程展现给了我们。

一壶清茶煮青春.