<<BASM ѧ>>  6 

Lesson 6
<<BASM ѧ>>  6 

http://www.cnpack.org
QQ Group: 130970
룺SkyJacker
汾ݸ
״̬δУ
ʱ䣺2007

This is lesson number 6 and the topic is CharPos.
 6 ε CharPos

This function search for the first occurrence of a character in a string,and returns the position of it when found.
If nothing is found it returns zero. The Delphi library function does basically the same thing,
but with the difference that it search for a substring in a string.
Passing a character to Pos as the substring is possible and this way use Pos as a CharPos.
In this lesson we will develop a CharPos that is nearly 4 times faster than Delphi's Pos.
As usual we start with a Pascal implementation of the algorithm.
һַеַҷطֵλãûз򷵻㡣
Delphi ҲƵĺǲͬǣDelphi һһӴ
 Pos һַΪӴҲ CharPos һʹ Pos
һǽдһ CharPos,  Delphi  Pos 콫 4 
ճǴһ pascal 㷨
function CharPos2(Chr : Char; const Str : AnsiString) : Cardinal;
var
 I : Integer;
begin
 if (Str <> '') then
  begin
   I := 0;
   repeat
    Inc(I);
   until((Str[I] = Chr) or (Str[I] = #0));
   if (Str[I] <> #0) then
    Result := I
   else
    Result := 0;
  end
 else
  Result := 0;
end;

Input to the function is a Char and a string. The string is passed as const. The functions Result are a Cardinal.
First the function checks that the string is not empty and if is empty it returns zero.
If there is a string we iterate over it with a repeat until loop searching for a match to the input char.
If an occurrence of the char is not found before we reach the zero terminator this will also terminate the loop.
Because the loop can terminate on either of the two conditions it is necessary to check
what terminated the search before we know what to return as result.
If the loop was ended by a successful search we return the value of the loop counter, and if the search was
unsuccessful the result is zero.
It is also possible to use the length of the string as a condition for terminating the loop on an unsuccessful search.
Then the code looks like this.
һ Char  stringstring ǳһ Cardinal
ȣһַǷΪգգ򷵻㡣
ʹһ repeat until ѭַƥֱַ
ڵַֹ֮ǰûзַôѭҲֹ
ΪѭԱеκһֹڷؽ֮ǰҪ˭ֹ
ɹַѭûҵ򷵻㡣

Ҳʹַĳһûгɹַѭ£
function CharPos1(Chr : Char; const Str : AnsiString) : Cardinal;
var
 StrLenght, I : Integer;
begin
 StrLenght := Length(Str);
 if StrLenght > 0 then
  begin
   I := 0;
   repeat
    Inc(I);
   until((Str[I] = Chr) or (I > StrLenght));
   if I <= StrLenght then
    Result := I
   else
    Result := 0;
  end
 else
  Result := 0;
end;
Before we start jumping into ASM it is a good idea to see which of the Pascal implementations is the fastest. 
For doing this benchmark must be established.
ǿʼ asm ֮ǰȿĸ pascal ִе졣
²Ի׼

const
 NOOFLOOPS : Cardinal = 200000;
 SCALE : Cardinal = 1000;

procedure Benchmark;
var
 lpPerformanceCount, StartCount, EndCount : TLargeInteger;
 Succes : Boolean;
 Str, Str1, FunctionName : AnsiString;
 Chr1, Chr2 : Char;
 I, CharPos, J, K, Bench, SumBench : Cardinal;
 StringArray : array[1..255] of AnsiString;

begin
 Series1.Clear;
 Str1 := 'T';
 for J := 1 to 255 do
  begin
    StringArray[J] := Str1;
    Str1 := 'A' + Str1;
  end;
 SumBench := 0;
 Chr1 := 'T';
 Chr2 := 'X';
 for K := 1 to 255 do
  begin
   Str := StringArray[K];
   Succes := QueryPerformanceCounter(lpPerformanceCount);
   if Succes then
    StartCount := lpPerformanceCount
   else
    raise Exception.Create('QueryPerformanceCounter failed');
   for I := 1 to NOOFLOOPS do
    begin
     CharPos := CharPosFunction(Chr1, Str);
    end;
   for I := 1 to NOOFLOOPS do
    begin
     CharPos := CharPosFunction(Chr2, Str);
    end;
   Succes := QueryPerformanceCounter(lpPerformanceCount);
   if Succes then
    EndCount := lpPerformanceCount
   else
    raise Exception.Create('QueryPerformanceCounter failed');
   Bench := Round((EndCount - StartCount) / SCALE);
   Series1.AddXY(K, Bench, '', clBlue);
   Bench := Round(Bench / K);
   SumBench := SumBench + Bench;
   Update;
  end;
 FunctionName := FunctionSelectRadioGroup.Items[FunctionSelectRadioGroup.ItemIndex];
 ReportRichEdit.Lines.Add(FunctionName + #9 + IntToStr(SumBench));
end;

The benchmark builds an array of 255 AnsiStrings. The first string is 'T'. 'T' is also the character we use as the 
search character. String number 1 is therefore the shortest possible successful match.
The next strings are 'AT','AAT' and 'AAAT'. I guess the pattern is clear.
It is equally important to measure the performance on unsuccessful search.
For this purpose we pass 'X' as search character. The benchmark does NOOFLOOPS searches on each string and
measures the time used on each string.
Because we want results from each string length to contribute approximately the same to
the benchmark each timing is divided by the length of the string.
On this benchmark CharPos1 obtains the score 767 on a P4 1600A clocked at 1920 and CharPos2 obtains the score 791.
 For comparison Delphi Pos scores 2637.
Because CharPos1 is slightly better than CharPos2 we select this as basis for optimisation.
This is the ASM code Delphi 6 compiled it into with optimizations turned on.
׼ԴһСΪ 255 ַ顣
һ 'T', Ҫַ, Ҳǵһ̵ƥ䴮
Ĵ 'AT', 'AAT'  'AAAT'
ʽѾˡ
δҵַʱҲҪΪĿģǴ 'X'ΪҪĴ
׼ѭÿÿʹõʱ䡣
ΪÿεĻ׼ֵַĳƵıʾ
 P4 1600A Ƶ 1920 Ͻл׼ԣCharPos1  767 ֣CharPos2   791,
Delphi е Pos  2637 ֡
Ϊ CharPos1  CharPos2 ΢Щѡ Charpos1 Ż
 Delphi6 Żѡ ASM £

function CharPos14(Chr : Char; const Str : AnsiString) : Cardinal;
var
 StrLenght, I : Integer;

begin
 {
 push ebx
 push esi
 mov  esi,edx
 mov  ebx,eax
 }
 StrLenght := Length(Str);
 {
 mov  eax,esi
 call @LStrLen
 mov  edx,eax
 }
 if StrLenght > 0 then
 {
 test edx,edx
 jle  @Else1Begin
 }
  begin
   I := 0;
   {
   xor eax,eax
   }
   repeat
    {
    @RepeatBegin :
    }
    Inc(I);
    {
    inc eax
    }
   until((Str[I] = Chr) or (I > StrLenght));
   {
   cmp bl,[esi+eax-$01]
   jz  @If2
   cmp edx,eax
   jnl @RepeatBegin :
   }
   if I <= StrLenght then
   {
   @If2 :
   cmp edx,eax
   jnl @Exit
   }
    Result := I
    {
    }
   else
    Result := 0;
    {
    xor eax,eax
    pop esi
    pop ebx
    ret
    }
  end
 else
  Result := 0;
  {
  @Else1Begin :
  xor eax,eax
  }
 {
 @Exit :
 pop esi
 pop ebx
 }
end;
This time there is no stack frame. Ebx and esi is used and needs to be backed up and restored by pushing and
popping them on the stack.
Because the function does not have its own stack frame they are simply pushed on the stack on the top of the
calling functions frame.
ûзջ
Ϊõ Ebx  esi Ҫͨ push  pop ͻָǡ
ûзջڴ˺ĿʼֻǱ򵥵ѹջС

The input parameters are received in eax an edx and they are as first thing copied into esi and ebx.
The function Length has a secret internal name, which is LStrLen. This function is called on Str,
which is passed in eax. We see that LStrLen is following the register calling convention. 
 eax  edx УȱƵ esi  ebx
Length иܵڲֽ LStrLen Strͨ eax ݡ
ǿ LStrLen ѭ¼ĴԼ:

Str was received in edx copied to esi and then to eax. LStrLen delivers its result, StrLenght, in eax.
This result is copied into edx and compared to 0. Test edx, edx is the same as cmp edx,0 and it sets the flags.
The jle instruction checks the flags and pass execution to the else part of the if-else code if StrLenght is
lower than or equal to zero.
Str ͨ edx պƵ esiȻٵ eax СLStrLen ͨ EAX Ľ StrLenght
Ƶ edx  0 ȽϡTest edx, edx  cmp edx,0 ƶ״̬Ĵ־
jle ־ StrLenght Сڵ, ִ if-else  else ֡

In this else code we have one line of Pascal, which is Result := 0;.
Because our function must return its result in eax we create a zero here by x-oring eax by itself.
If the string is longer than zero execution is continued in the if code.
 else һ Pascal  Result := 0
Ϊǵĺ eax зؽͨ eax Լ㡣
ַȴ㣬ִ if ֵĴ롣

The first line here initialises the loop counter I to zero.
Again a xor instruction is used for this.
The loop body has one line only and this is easy to understand Inc(I); = inc eax.
Pascal and ASM is nearly the same thing ;-)
The implementation of the loop-closing test is where the real meat of this function is.
It is made up of these four lines of ASM.
һгʼѭΪ I Ϊ㡣ʹһ xor ָ
ѭֻһУ Inc(I); = inc eax
Pascal  ASM ൱ơ
ѭʱԺġ
 4  ASM :

cmp bl,[esi+eax-$01]
jz  @If2
cmp edx,eax
jnl @RepeatBegin :

We see that there is two jump instructions.
The last one jumps to the start of the loop and the first one jmps out of the loop.
There are two cmp instructions to set the flags. Number two is the easiest to understand. 
It compares eax with edx. A fast look at the Pascal code tells us that 
StrLenght and I must be in these two registers.
In fact we have just seen that eax is I and we also saw in the top of the function that StrLenght was in edx.
ǿתָ
һѭĿʼһѭ
ͨ cmp ָñ־ ⡣
Ƚ eax  edxPascal  StrLenght  I ЩĴС
ʵϣѾ֪ eax  IںĿʼܿ StrLenght  edx С

In line number 4 Chr was copied into ebx, but a char is only one byte.
Therefore the first cmp instruction compares something to bl, which is the lowest byte of the four bytes in ebx.
We expect that the search character - Chr - be compared to character number 1, 2, 3.. of the input string.
[esi+eax-$01] must therefore be a pointer into this string.
eax is the loop counter I, which is 1 at the first iteration. esi must be the address of the Str parameter that
was received in edx and immediately copied into esi.
-$01 is a constant offset, which is needed because the first character in an AnsiString is located at position 0.
The position where Str points to.
ڵ 4  Char Ƶ ebxһַֻһֽڡ
ˣһȽָ bl Ƚϣ ebx 4 ֽ͵ֽڡ
ϣͨȽַĵ 1,2,3 ...ַ Chr
[esi+eax-$01] ַָ
eax ѭ IڵĿʼΪ 1esi ַĵַͨ edx պƵ esi
-$01 һƫƣǱģΪַĵһַƫ 0
Str Ѿָλá

Where has the or from the Pascal code gone?
To understand this we need to talk about the optimisation called incomplete Boolean evaluation.
This optimisation can be applied to the Boolean operator and, and to the Boolean operator or. 
Let us take a look at the truth table for and first.
Pascal 뽫ִеأ
ΪŪ˵һ²ȫŻ
ŻԱӦõ and  or
ȣһ and ֵ
false and false is false
false and true is false
true and false is false
true and true is true

The and operator is only returning true if both operands are true.
This is taken advantage of by the incomplete Boolean evaluation optimisation.
If one operand is found false there is no need to check the second one, because the result of
and will be false no matter what it is.
ֻ true, And Ż᷵ true
ȫŵǣһΪ falseҪڶΪʲô false

The truth table for or is
or ֵ£
false or false is false
false or true is true
true or false is true
true or true is true
The result of an or is true if one or both of the operands are true.
If one is seen to be true there is no need to check the second operand.
һΪ trueor Ľ true
һ trueҪڶ

Our loop-closing test takes advantage of this by jumping out of the loop if the first compare is successful. 
This is the case if we found a match for the search character in the string.
If it was a match it does not make sense to see if it was the zero terminator too!
The last compare is only executed on unsuccessful matches.
If we knew something about the strings and characters our function was called upon most often we could take advantage
of it by changing the order of the tests such that the most often true one is located first.
ѭβǿص⣬һȽǳɹģôѭ
ַҵƥַ
Ҳֹģôȥƥûġ
һȽֻûҵƥַʱִС
֪úʱַַҿصıȽϵ˳򣬽 true ıȽϷǰ档

Try switching on the compiler switch "complete Boolean evaluation", in project options,
and see what code is created then.
The rest of the code resample what we have seen in earlier lessons and
I will skip the explanation and go get a cup of coffee instead ;-)
Now it is time to do some optimizations. At first the function is made pure BASM.
The labels were introduced in the previous code listing.
Here it is seen that they follow the Pascal code in an intuitive way.
ڹѡд ȫ⡱뿪أ뷢ʲô仯
ⲿִѾǰĿγмˣҲٽҪȥһ ;-)
ڣһЩŻȽΪ BASM
ǰĴбѾ˱ǩ
ǽЩǩΪֱ۵ʽ

function CharPos15(Chr : Char; const Str : AnsiString) : Cardinal;
//var
 //StrLenght, I : Integer;
asm
 push ebx
 push esi
 mov  esi,edx
 mov  ebx,eax
 //StrLenght := Length(Str);
 mov  eax,esi
 call System.@LStrLen
 mov  edx,eax
 //if StrLenght > 0 then
 test edx,edx
 jle  @Else1Begin
 //I := 0;
 xor  eax,eax
 //repeat
 @RepeatBegin :
 //Inc(I);
 inc  eax
 //until((Str[I] = Chr) or (I > StrLenght));
 cmp  bl,[esi+eax-$01]
 jz   @If2
 cmp  edx,eax
 jnl  @RepeatBegin
 //if I <= StrLenght then
 @If2 :
 cmp  edx,eax
 jnl  @Exit
 //Result := I
 //else
 //Result := 0;
 xor  eax,eax
 pop  esi
 pop  ebx
 ret
 //else
 //Result := 0;
 @Else1Begin :
 xor  eax,eax
 @Exit :
 pop  esi
 pop  ebx
end;
The call to LStrLen has been prefixed with System, otherwise the compiler will not recognize it.
LStrLen is implemented in the System unit.
The var section is removed because we do not reference any variables by name.
 LStrLen  System ǰ׺ʶ
LStrLen  System Ԫʵ֡
var ɾˣΪûκα

function CharPos16(Chr : Char; const Str : AnsiString) : Cardinal;
asm
 push ebx
 push esi
 mov  esi,edx
 mov  ebx,eax
 //StrLenght := Length(Str);
 mov  eax,esi
 //call System.@LStrLen
 //*************
 test eax,eax
 jz  @LStrLenExit
 mov eax,[eax-$04]
 @LStrLenExit :
 //*************
 mov  edx,eax
 //if StrLenght > 0 then
 test edx,edx
 jle @Else1Begin
 //I := 0;
 xor  eax,eax
 //repeat
 @RepeatBegin :
 //Inc(I);
 inc  eax
 //until((Str[I] = Chr) or (I > StrLenght));
 cmp  bl,[esi+eax-$01]
 jz   @If2
 cmp  edx,eax
 jnl  @RepeatBegin
 //if I <= StrLenght then
 @If2 :
 cmp  edx,eax
 jnl  @Exit
 //Result := I
 //else
 //Result := 0;
 xor  eax,eax
 pop  esi
 pop  ebx
 ret
 //else
 //Result := 0;
 @Else1Begin :
 xor  eax,eax
 @Exit :
 pop  esi
 pop  ebx
end;
The first thing we do is to inline LStrLen. This is done by tracing into it and copying the body of the function
from the cpu view. It is made up of these four lines.
 LStrLen
 cpu и˺ڲʵ֡ɣ

test eax,eax
jz +$03
mov eax,[eax-$04]
ret

If a nil pointer is passed to LStrLen in eax nothing is done but returning.
If the pointer is valid the length of the string is found at the 4 bytes preceding the start of the string.
These 4 bytes are returned in eax. To inline it we replace the call with the 4 lines.
 LStrLen  eax һ nil ָ룬ʲôҲֱӷء
ָǺϷģַʼǰ 4 ַֽĳȡ
 4 ֽ eax ءʹ 4 д롣

Jz is transferring execution to the ret instruction.
Instead of this ret instruction we place a label called LStrLenExit.
The ret returned execution to the line after the call of the function.
This ret has to be removed, otherwise it would pass execution to the function that called CharPos,
not exactly what we want.
Jz  ret ָһ LStrLenExit ıǩ ret
ret صñĺһС
ret 뱻ɾʹ CharPos أⲻҪġ

This is what the inlined LStrLen ended up looking like
 LStrLen ޸Ϊ£
test eax,eax
jz  @LStrLenExit
mov eax,[eax-$04]
@LStrLenExit :

function CharPos17(Chr : Char; const Str : AnsiString) : Cardinal;
asm
 push ebx
 push esi
 mov  esi,edx
 mov  ebx,eax
 //StrLenght := Length(Str);
 mov  eax,esi
 //*************
 test eax,eax
 //jz  @LStrLenExit
 jz  @Else1Begin
 mov eax,[eax-$04]
 //@LStrLenExit :
 //*************
 mov  edx,eax
 //if StrLenght > 0 then
 //test edx,edx
 //jle @Else1Begin
 //I := 0;
 xor  eax,eax
 //repeat
 @RepeatBegin :
 //Inc(I);
 inc  eax
 //until((Str[I] = Chr) or (I > StrLenght));
 cmp  bl,[esi+eax-$01]
 jz   @If2
 cmp  edx,eax
 jnl  @RepeatBegin
 //if I <= StrLenght then
 @If2 :
 cmp  edx,eax
 jnl  @Exit
 //Result := I
 //else
 //Result := 0;
 xor  eax,eax
 pop  esi
 pop  ebx
 ret
 //else
 //Result := 0;
 @Else1Begin :
 xor  eax,eax
 @Exit :
 pop  esi
 pop  ebx
end;
 pascal  if Strlenght>0 LStrLen һмͬĶ
һ Str  nil 
ɾڶУһ޸Ϊ @Else1Begin滻 StrLen  Str  nil 
LStrLenExit ǩҪˡ

function CharPos18(Chr : Char; const Str : AnsiString) : Cardinal;
asm
 push ebx
 push esi
 mov  esi,edx
 mov  ebx,eax
 //StrLenght := Length(Str);
 //mov  eax,esi
 //if StrLenght > 0 then
 //test eax,eax
 test esi,esi
 jz  @Else1Begin
 //mov eax,[eax-$04]
 mov eax,[esi-$04]
 mov  edx,eax
 //I := 0;
 xor  eax,eax
 //repeat
 @RepeatBegin :
 //Inc(I);
 inc  eax
 //until((Str[I] = Chr) or (I > StrLenght));
 cmp  bl,[esi+eax-$01]
 jz   @If2
 cmp  edx,eax
 jnl  @RepeatBegin
 //if I <= StrLenght then
 @If2 :
 cmp  edx,eax
 jnl  @Exit
 //Result := I
 //else
 //Result := 0;
 xor  eax,eax
 pop  esi
 pop  ebx
 ret
 //else
 //Result := 0;
 @Else1Begin :
 xor  eax,eax
 @Exit :
 pop  esi
 pop  ebx
end;
We moved the test of StrLenght = 0 and the comment //if StrLenght > 0 then must be moved too.
After the inlining it becomes possible to copy propagate esi in these lines.
ƶ StrLenght = 0 Ĳԣע // if StrLenght > 0 Ҳ뱻ƶ
֮Щбظ esi

mov  eax,esi
//*************
test eax,eax
jz  @Else1Begin
mov eax,[eax-$04]
The last of the lines overwrites eax and is the last use of the value in eax that was copied from esi.
д eaxǴ esi ֮һʹ eax ֵ

mov  eax,esi
//*************
//test eax,eax
test esi,esi
jz  @Else1Begin
//mov eax,[eax-$04]
mov eax,[esi-$04]

Actually we must also follow the possible branch to Else1Begin and see whether the value in eax is used here.
We see that eax is zeroed out in the line immediately after the label and therefore is not used. 
This way the first line is seen to be dead and can be removed.
ʵϣҲ뿴֧ Else1Begin Ƿʹ eax ֵ 
ǿ eax ǩ֮㣬ûٱá
һõģɾ
//mov  eax,esi
test esi,esi
jz  @Else1Begin
mov eax,[esi-$04]

function CharPos19(Chr : Char; const Str : AnsiString) : Cardinal;
asm
 push ebx
 push esi
 mov  esi,edx
 mov  ebx,eax
 //if StrLenght > 0 then
 test esi,esi
 jz  @Else1Begin
 //StrLenght := Length(Str);
 //mov eax,[esi-$04]
 mov edx,[esi-$04]
 //mov  edx,eax
 //I := 0;
 xor  eax,eax
 //repeat
 @RepeatBegin :
 //Inc(I);
 inc  eax
 //until((Str[I] = Chr) or (I > StrLenght));
 cmp  bl,[esi+eax-$01]
 jz   @If2
 cmp  edx,eax
 jnl  @RepeatBegin
 //if I <= StrLenght then
 @If2 :
 cmp  edx,eax
 jnl  @Exit
 //Result := I
 //else
 //Result := 0;
 xor  eax,eax
 pop  esi
 pop  ebx
 ret
 //else
 //Result := 0;
 @Else1Begin :
 xor  eax,eax
 @Exit :
 pop  esi
 pop  ebx
end;
Also as a result of inlining LStrLen we can remove one more mov.
LStrLen returned its result in eax and then it was copied into edx.
mov eax, [esi-$04] can then be changed to mov edx, [esi-$04] and the mov edx, eax can be removed.
After this we change focus to the loop. This is far more important because it is executed many more times,
depending on the length of the string or the position of a match.
 LStrLen ֮ǿɾ mov ָ
LStrLen  eax ؽȻֱƵ edx
mov eax, [esi-$04] ԱΪ mov edx, [esi-$04]
mov edx, eax Աɾˡ
֮ǽתѭǸҪģΪѭִлܶΣַĳ
Ǹƥַλá

function CharPos20(Chr : Char; const Str : AnsiString) : Cardinal;
asm
 push ebx
 push esi
 mov  esi,edx
 mov  ebx,eax
 //if StrLenght > 0 then
 test esi,esi
 jz  @Else1Begin
 //StrLenght := Length(Str);
 mov edx,[esi-$04]
 //I := 0;
 xor  eax,eax
 dec  esi
 @RepeatBegin :
 //Inc(I);
 inc  eax
 //until((Str[I] = Chr) or (I > StrLenght));
 //cmp  bl,[esi+eax-$01]
 cmp  bl,[esi+eax]
 jz   @If2
 cmp  edx,eax
 jnl  @RepeatBegin
 //if I <= StrLenght then
 @If2 :
 cmp  edx,eax
 jnl  @Exit
 //Result := 0;
 xor  eax,eax
 pop  esi
 pop  ebx
 ret
 //Result := 0;
 @Else1Begin :
 xor  eax,eax
 @Exit :
 pop  esi
 pop  ebx
end;
When we analyzed the code we saw that there was an offset of -1 in the addressing into the string. 
There is no need to subtract this offset at each iteration. It is a better idea to decrement the pointer to 
Str in esi once before entering the loop. We could also have decremented the loop counter in eax, but then we 
would have to add 1 again before returning the result.
At the very top of the function the two input parameters are copied to new registers.
This is redundant and we would like to copy propagate both, but eax is used as loop counter and we must first
find another register for this purpose.
Ƿ뷢, ַĵַи -1 ƫ
ûбҪÿεжȥƫơ
һȽϺõİ취ǣڽѭ֮ǰ str ָ 1
Ҳѭн eax  1ڷؽ֮ǰֲòټ 1
ں棬βαƵµļĴ
ģ븴ƴ eax ΪѭǱѰһĴ

function CharPos22(Chr : Char; const Str : AnsiString) : Cardinal;
asm
 push ebx
 push esi
 mov  esi,edx
 mov  ebx,eax
 //if StrLenght > 0 then
 test esi,esi
 jz  @Else1Begin
 //StrLenght := Length(Str);
 mov edx,[esi-$04]
 //I := 0;
 //xor  eax,eax
 xor  ecx,ecx
 dec  esi
 @RepeatBegin :
 //Inc(I);
 //inc  eax
 inc  ecx
 //until((Str[I] = Chr) or (I > StrLenght));
 //cmp  bl,[esi+eax]
 cmp  bl,[esi+ecx]
 jz   @If2
 //cmp  edx,eax
 cmp  edx,ecx
 jnl  @RepeatBegin
 //if I <= StrLenght then
 @If2 :
 //cmp  edx,eax
 cmp  edx,ecx
 jnl  @Exit
 //Result := 0;
 xor  eax,eax
 pop  esi
 pop  ebx
 ret
 //Result := 0;
 @Else1Begin :
 xor  eax,eax
 pop  esi      //New
 pop  ebx      //New
 ret           //New
 @Exit :
 mov  eax, ecx
 pop  esi
 pop  ebx
end;
All lines where eax is in use for I, eax is changed to ecx.
Because I is the return value of the function on a match and the return value is in eax,
we have to copy ecx into eax just beneath the @Exit label.
This introduces a little problem, because a jump to Else1Begin also brings us here,
and in this situation we copy a value from ecx into eax, which we have just cleared.
The fix is to add the lines marked new.
ʹ eax Ϊ I eax Ϊ ecx
ΪƥַʱI Ǻķֵ eax أҪ @Exit ǩ渴 ecx  eax
˵һС⣬Ϊ Else1Begin Ҳᵽдոձյ eax ᱻ ecx д
ӵ new ʶ

Then we are ready to copy propagate eax.
Only one line uses ebx. This is cmp bl, [esi+ecx], which is changed to cmp al, [esi+ecx].
Then the copy mov ebx, eax becomes dead and is removed.
This was copy propagation followed by dead code removal once more and
we begin realising how important this optimization is.
Ҫʵָƴ eax
ֻ cmp bl, [esi+ecx] һʹ ebx, Ϊ cmp al, [esi+ecx]
mov ebx, eax һûˣɾ
ƴƳЧ룬ŻжôҪ˰ɡ

function CharPos23(Chr : Char; const Str : AnsiString) : Cardinal;
asm
 push ebx
 push esi
 mov  esi,edx
 //mov  ebx,eax
 //if StrLenght > 0 then
 test esi,esi
 jz  @Else1Begin
 //StrLenght := Length(Str);
 mov edx,[esi-$04]
 //I := 0;
 xor  ecx,ecx
 dec  esi
 @RepeatBegin :
 //Inc(I);
 inc  ecx
 //until((Str[I] = Chr) or (I > StrLenght));
 //cmp  bl,[esi+ecx]
 cmp  al,[esi+ecx]
 jz   @If2
 cmp  edx,ecx
 jnl  @RepeatBegin
 //if I <= StrLenght then
 @If2 :
 cmp  edx,ecx
 jnl  @Exit
 //Result := 0;
 xor  eax,eax
 pop  esi
 pop  ebx
 ret
 //Result := 0;
 @Else1Begin :
 xor  eax,eax
 pop  esi      
 pop  ebx      
 ret
 @Exit :
 mov  eax, ecx
 pop  esi
 pop  ebx
end;

Before we can copy propagate edx (holding the pointer to Str), we must free edx from other uses. 
It is used for StrLenght and we allocate ebx for this instead of edx.
Ǹƴ edx (ָ str) ֮ǰǱ뽫 edx ʹĵطų
 StrLenghtΪ ebx  edx

function CharPos24(Chr : Char; const Str : AnsiString) : Cardinal;
asm
 push ebx
 push esi
 mov  esi,edx
 //if StrLenght > 0 then
 test esi,esi
 jz  @Else1Begin
 //StrLenght := Length(Str);
 //mov edx,[esi-$04]
 mov ebx,[esi-$04]
 //I := 0;
 xor  ecx,ecx
 dec  esi
 @RepeatBegin :
 //Inc(I);
 inc  ecx
 //until((Str[I] = Chr) or (I > StrLenght));
 cmp  al,[esi+ecx]
 jz   @If2
 //cmp  edx,ecx
 cmp  ebx,ecx
 jnl  @RepeatBegin
 //if I <= StrLenght then
 @If2 :
 //cmp  edx,ecx
 cmp  ebx,ecx
 jnl  @Exit
 //Result := 0;
 xor  eax,eax
 pop  esi
 pop  ebx
 ret
 //Result := 0;
 @Else1Begin :
 xor  eax,eax
 pop  esi      
 pop  ebx      
 ret
 @Exit :
 mov  eax, ecx
 pop  esi
 pop  ebx
end;
Then edx is copy propagated and the mov esi, edx becomes dead.
edx ƴmov esi, edx Чˡ

function CharPos25(Chr : Char; const Str : AnsiString) : Cardinal;
asm
 push ebx
 push esi
 //mov  esi,edx
 //if StrLenght > 0 then
 //test esi,esi
 test edx,edx
 jz  @Else1Begin
 //StrLenght := Length(Str);
 //mov ebx,[esi-$04]
 mov ebx,[edx-$04]
 //I := 0;
 xor  ecx,ecx
 //dec  esi
 dec  edx
 @RepeatBegin :
 //Inc(I);
 inc  ecx
 //until((Str[I] = Chr) or (I > StrLenght));
 //cmp  al,[esi+ecx]
 cmp  al,[edx+ecx]
 jz   @If2
 cmp  ebx,ecx
 jnl  @RepeatBegin
 //if I <= StrLenght then
 @If2 :
 cmp  ebx,ecx
 jnl  @Exit
 //Result := 0;
 xor  eax,eax
 pop  esi
 pop  ebx
 ret
 //Result := 0;
 @Else1Begin :
 xor  eax,eax
 pop  esi      
 pop  ebx      
 ret
 @Exit :
 mov  eax, ecx
 pop  esi
 pop  ebx
end;
This removed the use of esi and it does not need to be saved and restored any more. When removing the pop esi,
 we remember that there are 3 exit paths all with each own pop esi.
Ƴ esi ʹãҪͻָˡ
Ƴ pop esi ʱҪ 3 ˳·Լ pop esi
function CharPos26(Chr : Char; const Str : AnsiString) : Cardinal;
asm
 push ebx
 //push esi
 //if StrLenght > 0 then
 test edx,edx
 jz  @Else1Begin
 //StrLenght := Length(Str);
 mov ebx,[edx-$04]
 //I := 0;
 xor  ecx,ecx
 dec  edx
 @RepeatBegin :
 //Inc(I);
 inc  ecx
 //until((Str[I] = Chr) or (I > StrLenght));
 cmp  al,[edx+ecx]
 jz   @If2
 cmp  ebx,ecx
 jnl  @RepeatBegin
 //if I <= StrLenght then
 @If2 :
 cmp  ebx,ecx
 jnl  @Exit
 //Result := 0;
 xor  eax,eax
 //pop  esi
 pop  ebx
 ret
 //Result := 0;
 @Else1Begin :
 xor  eax,eax
 //pop  esi      
 pop  ebx      
 ret
 @Exit :
 mov  eax, ecx
 //pop  esi
 pop  ebx
end;
In the line after the If2 label there is a line, which is identical to the second compare in the loop-closing test. 
In Pascal it was necessary to have the line if I <= StrLenght after the loop because we could not know which 
condition the loop terminated on.
This line generated the extra cmp ebx, ecx instruction, which looks a little redundant now.
It is in fact not redundant because two paths of execution lead to the If2 Label and only one of them has the test.
If we split the two exit paths such that only one of them goes to If2 we can remove the extra check.
Instead of jumping to If2 on a match we can jump directly to Exit.
ڱǩ If2 ֮һǷѭĵڶμ⡣
 Pascal ѭ֮ if I <= StrLenght, Ϊǲ֪Ǳĸֹġ
һвָ cmp ebx, ecxһࡣ
ʵ࣬Ȼ·ִе If2 ǩֻһ
Ƿֳ˳·ֻһ If2ǿƳļ顣
ƥʱ If2 Ϊֱ Exit

function CharPos27(Chr : Char; const Str : AnsiString) : Cardinal;
asm
 push ebx
 //if StrLenght > 0 then
 test edx,edx
 jz  @Else1Begin
 //StrLenght := Length(Str);
 mov ebx,[edx-$04]
 //I := 0;
 xor  ecx,ecx
 dec  edx
 @RepeatBegin :
 //Inc(I);
 inc  ecx
 //until((Str[I] = Chr) or (I > StrLenght));
 cmp  al,[edx+ecx]
 //jz   @If2
 jz   @Exit
 cmp  ebx,ecx
 jnl  @RepeatBegin
 //if I <= StrLenght then
 //@If2 :
 //cmp  ebx,ecx
 //jnl  @Exit
 //Result := 0;
 xor  eax,eax
 pop  ebx
 ret
 //Result := 0;
 @Else1Begin :
 xor  eax,eax
 pop  ebx
 ret
 @Exit :
 mov  eax,ecx
 pop  ebx
end;
Then the If2 label goes out of use and when we reach the code at this position we know that the zero terminator was
 reached and there is no need to test for it again.
There are two sections of identical code just before the Else1Begin label and just after it.
The upper section can be deleted.
If2 ǩûˣ뵽λãѾֹ֪˲Ҫٴμˡ 
 Else1Begin ǰͬĴ룬ǿɾ
function CharPos28(Chr : Char; const Str : AnsiString) : Cardinal;
asm
 push ebx
 //if StrLenght > 0 then
 test edx,edx
 jz   @Else1Begin
 //StrLenght := Length(Str);
 mov  ebx,[edx-$04]
 //I := 0;
 xor  ecx,ecx
 dec  edx
 @RepeatBegin :
 //Inc(I);
 inc  ecx
 //until((Str[I] = Chr) or (I > StrLenght));
 cmp  al,[edx+ecx]
 jz   @Exit
 cmp  ebx,ecx
 jnl  @RepeatBegin
 //Result := 0;
 //xor  eax,eax
 //pop  ebx
 //ret
 //Result := 0;
 @Else1Begin :
 xor  eax,eax
 pop  ebx
 ret
 @Exit :
 mov  eax,ecx
 pop  ebx
end;
This ended our search for redundant code to remove. The cleaned up function looks like this.
ɾ˷ֵ룬Ϊº
function CharPos29(Chr : Char; const Str : AnsiString) : Cardinal;
asm
 push ebx
 //if StrLenght > 0 then
 test edx,edx
 jz   @Else1Begin
 //StrLenght := Length(Str);
 mov  ebx,[edx-$04]
 //I := 0;
 xor  ecx,ecx
 dec  edx
 @RepeatBegin :
 //Inc(I);
 inc  ecx
 //until((Str[I] = Chr) or (I > StrLenght));
 cmp  al,[edx+ecx]
 jz   @Exit
 cmp  ebx,ecx
 jnl  @RepeatBegin
 @Else1Begin :
 //Result := 0;
 xor  eax,eax
 pop  ebx
 ret
 @Exit :
 mov  eax,ecx
 pop  ebx
end;
When the loop is iterating in search for a match or the end of the string,
these lines of code are executed over and over again
ѭƥֱַַĽʱЩ뱻һһεִУ
inc  ecx
cmp  al,[edx+ecx]
jz   @Exit
cmp  ebx,ecx
jnl  @RepeatBegin

Let us make some variants of them and see how they perform.
һЩ仯ִ
The most exiting line is
Ҫ
cmp  al,[edx+ecx]
It generates two microinstructions.
΢ָ
One for loading a byte from the address [edx+ecx] and one to compare it against al.
һǴӵַ [edx+ecx] װһֽڣһٺ al Ƚϡ
 This line could also be coded as
Ҳд
mov ah, byte ptr [edx+ecx]
cmp al, ah

This variant also generates two microinstructions, but it needs one more register (ah).
仯Ҳ΢ָҪһĴ ah

If we allocate an extra register it could also be coded as
ǷļĴҲ
movzx efx, byte ptr [edx+ecx]
cmp   al, fh
movzx is mov with zero extension.
It loads one byte into the lowest of efx and fills the three remaining bytes with zeroes.
Of course there is no such thing as an efx register, but the two unused registers esi & edi cannot be accessed byte wise.
Therefore it is necessary to free eax, ebx, ecx or edx, by substituting it by edi or esi and then use erg.
ebx instead of "efx".
This function demonstrates the first variant.
movzx  mov չ汾
һֽװ efx ֽڣʣ 3 ֽ䡣
Ȼû efx ļĴĴ esi&edi ȻûʹãǲܱΪֽڷʡ
ˣ edi  esi ȡ eax, ebx, ecx  edxɡ ebx  "efx"
function CharPos30(Chr : Char; const Str : AnsiString) : Cardinal;
asm
 push ebx
 //if StrLenght > 0 then
 test edx,edx
 jz   @Else1Begin
 //StrLenght := Length(Str);
 mov  ebx,[edx-$04]
 //I := 0;
 xor  ecx,ecx
 dec  edx
 @RepeatBegin :
 //Inc(I);
 inc  ecx
 //until((Str[I] = Chr) or (I > StrLenght));
 mov  ah, [edx+ecx]
 //cmp  al,[edx+ecx]
 cmp  al,ah
 jz   @Exit
 cmp  ebx,ecx
 jnl  @RepeatBegin
 @Else1Begin :
 //Result := 0;
 xor  eax,eax
 pop  ebx
 ret
 @Exit :
 mov  eax,ecx
 pop  ebx
end;

This function demonstrates the second variant.
˵˵ڶ仯

function CharPos31(Chr : Char; const Str : AnsiString) : Cardinal;
asm
 push ebx
 push edi
 //if StrLenght > 0 then
 test edx,edx
 jz   @Else1Begin
 //StrLenght := Length(Str);
 mov  edi,[edx-$04]
 //I := 0;
 xor  ecx,ecx
 dec  edx
 @RepeatBegin :
 //Inc(I);
 inc  ecx
 //until((Str[I] = Chr) or (I > StrLenght));
 movzx ebx, byte ptr [edx+ecx]
 //cmp  al,[edx+ecx]
 cmp  al, bl
 jz   @Exit
 cmp  edi,ecx
 jnl  @RepeatBegin
 @Else1Begin :
 //Result := 0;
 xor  eax,eax
 pop  edi
 pop  ebx
 ret
 @Exit :
 mov  eax,ecx
 pop  edi
 pop  ebx
end;
Instead of adding edx and ecx in the address calculation at every iteration,
we could add them prior to entering the loop.
Then it is necessary to subtract them from each other again to extract the loop counter value for the result.
This is done with the sub instruction in line two after the Exit label.
Ϊڵۼ edx  ecx ַǿڽѭ֮ǰۼǣ
ȻҪٷֱȥõ׼ȷѭ
 Exit ǩдʵ㡣

function CharPos32(Chr : Char; const Str : AnsiString) : Cardinal;
asm
 push  ebx
 push  edi
 //if StrLenght > 0 then
 test  edx,edx
 jz    @Else1Begin
 //StrLenght := Length(Str);
 mov   edi,[edx-$04]
 //I := 0;
 xor   ecx,ecx
 //dec  edx
 add   ecx,edx
 @RepeatBegin :
 //Inc(I);
 //until((Str[I] = Chr) or (I > StrLenght));
 movzx ebx, byte ptr [ecx]
 inc   ecx
 cmp   al, bl
 jz    @Exit
 //cmp  edi,ecx
 test  bl, bl
 jnz   @RepeatBegin
 @Else1Begin :
 //Result := 0;
 xor   eax,eax
 pop   edi
 pop   ebx
 ret
 @Exit :
 mov   eax,ecx
 sub   eax,edx
 pop   edi
 pop   ebx
end;

Then there are 4 functions to compare the performance of; CharPos29, CharPos30, CharPos31 and CharPos32.
Ƚ CharPos29, CharPos30, CharPos31  CharPos32 ĸܡ
Results on P4 1920 are  P4 1920 
CharPos29 716
CharPos30 973
CharPos31 710 
CharPos32 702
Winner is CharPos32    CharPos32 Ӯ
Results on P3 1400 are  P4 1400 
CharPos29  949
CharPos30  921
CharPos31  950
CharPos32 1403
Winner is CharPos30 CharPos30 Ӯ
Summed time   ܼʱ
CharPos29 716 + 949  = 1665
CharPos30 973 + 921  = 1894
CharPos31 710 + 950  = 1660
CharPos32 702 + 1403 = 2105
Winner is CharPos31  CharPos31 Ӯ
On P4 the winning loop is  P4 ʤѭ
@RepeatBegin :
movzx ebx, byte ptr [ecx]
inc   ecx
cmp   al, bl
jz    @Exit
test  bl, bl
jnz   @RepeatBegin
On P3 the winning loop is  P3 ʤѭ
@RepeatBegin :
inc  ecx
mov  ah, [edx+ecx]
cmp  al,ah
jz   @Exit
cmp  ebx,ecx
jnl  @RepeatBegin
on a blend of targets the winning loop is ڻƽ̨ʤѭ 
@RepeatBegin :
inc   ecx
movzx ebx, byte ptr [edx+ecx]
cmp   al, bl
jz    @Exit
cmp   edi,ecx
jnl   @RepeatBegin

The P4 winner performs very bad on P3 and could not be used in a library targeting more targets than P4,
such as the Delphi RTL.
The winner on P3 is performing quite bad on P4 and this one should not be used in a blended target library either.
The winner on a blend of the two targets is CharPos31, which performs near to optimal on P4 and
also performs optimal within a few percent on P3.
This function would make a perfect choice for the Delphi RTL, where I have missed a CharPos function.
It is a relief to see that it is possible to optimise for both processors at the same time
without sacrificing more than a few percent of performance.

 P4 ϵӮ P3 ûЧʣڳ P4 ĿС Delphi RTL
P3 ϵӮ P4 ִЧʷǳĵͣҲڻƽ̨Ŀ
ϣۺЧʵӮ CharPos31 ,  P4 Ͻӽѣ P3 Ҳû١
 Delphi RTL õѡ RTL Ѿʧȥһ CharPos 
οĿͬʱΪŻ汾Ҽûһܡ

Comparing performance of P3 and P4 on a clock-by-clock basis is always a hit.
There is broad tendency to think at the P4 as having an inferior design,
but this is not proved by this code.
Taking the blended winner's
performance and scaling it to a 1400 MHz processor it is 950 on P3 and 710 * (1920/1400) = 973 on P4.
The processors are performing very much the same at the same clock.

 P3  P4 ϱȽʱڵҲһ
һ󵨵Ĳ² P4 Ҳһͼƣǲܱ֤
ƽ̨ϱȽۺӮҵܣ P3 1400HZ Ĵ 950 P4  710 * (1920/1400) = 973
ִͬʱڡ
