arLabeling2 optimization

Hi,

I am glad someone is looking into optimizations! Also note that sqrt() is used extensively in the variants of arDetectMarker(), and it is a very costly function. It can (and should) be replaced by a table-lookup square root, such as the one available in Graphics Gems III.

-Randy
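
A minimal sketch of the table-lookup idea follows. It is not the Graphics Gems routine: it assumes the arguments are small non-negative integers (for example squared pixel distances), and the names sqrt_table_init, sqrt_lookup and the bound SQRT_TABLE_SIZE are hypothetical, not part of ARToolKit.

```c
#include <assert.h>

#define SQRT_TABLE_SIZE 4096u   /* hypothetical upper bound on the argument */

static float sqrt_table[SQRT_TABLE_SIZE];
static int   sqrt_table_ready = 0;

/* Fill the table once at startup. Newton's iteration is used only to keep
   the sketch free of library dependencies; sqrt() from <math.h> would do
   equally well here, since this code runs outside the per-frame path. */
static void sqrt_table_init(void)
{
    unsigned n;
    int k;

    sqrt_table[0] = 0.0f;
    for (n = 1; n < SQRT_TABLE_SIZE; n++) {
        double x = (double)n;               /* initial guess >= sqrt(n) */
        for (k = 0; k < 32; k++)
            x = 0.5 * (x + (double)n / x);  /* Newton step for x*x = n */
        sqrt_table[n] = (float)x;
    }
    sqrt_table_ready = 1;
}

/* Square root of an integer in [0, SQRT_TABLE_SIZE) by a single array read. */
static float sqrt_lookup(unsigned n)
{
    assert(sqrt_table_ready && n < SQRT_TABLE_SIZE);
    return sqrt_table[n];
}
```

Whether a table actually beats the hardware square root depends on the target CPU and on cache behaviour, so it would be worth profiling before and after.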

-----Original Message-----
From: Thomas Pintaric [mailto:thomas@i ...............]
Sent: Friday, May 23, 2003 5:52 AM
To: ARToolKit Mailing List
Subject: arLabeling2 optimization

I did some ARToolKit profiling the other day and found a performance issue caused
by redundant index computations within the inner (per-pixel) loop of
ARToolKit's labeling2() function.

The original code contains the following statements:

work2[((*pnt2)-1)*7+0] ++;
work2[((*pnt2)-1)*7+1] += i;
work2[((*pnt2)-1)*7+2] += j;
work2[((*pnt2)-1)*7+6] = j;

A compiler* will generally compute the index four separate times (see below):
each store into work2 might modify (*pnt2), so it must not assume that (*pnt2)
remains constant across the entire instruction sequence:

...
work2[((*pnt2)-1)*7+0] ++; 

0040B646  add         ecx,ecx 
0040B648  add         ecx,ecx 
0040B64A  lea         esi,[ecx+ecx] 
0040B64D  add         esi,esi 
0040B64F  add         esi,esi 
0040B651  sub         esi,ecx 
0040B653  mov         ecx,dword ptr [esp+0Ch] 
0040B657  add         dword ptr [esi+ecx-1Ch],1 

work2[((*pnt2)-1)*7+1] += i; 

0040B65C  movsx       edi,word ptr [ebp] 
0040B660  add         edi,edi 
0040B662  add         edi,edi 
0040B664  lea         esi,[edi+edi] 
0040B667  add         esi,esi 
0040B669  add         esi,esi 
0040B66B  sub         esi,edi 
0040B66D  add         dword ptr [esi+ecx-18h],eax 

work2[((*pnt2)-1)*7+2] += j; 

0040B671  movsx       eax,word ptr [ebp] 
0040B675  add         eax,eax 
0040B677  add         eax,eax 
0040B679  lea         esi,[eax+eax] 
0040B67C  add         esi,esi 
0040B67E  add         esi,esi 
0040B680  sub         esi,eax 
0040B682  mov         eax,dword ptr [esp+40h] 
0040B686  add         dword ptr [esi+ecx-14h],eax 

work2[((*pnt2)-1)*7+6] = j; 

0040B68A  movsx       edi,word ptr [ebp] 
0040B68E  add         edi,edi 
0040B690  add         edi,edi 
0040B692  lea         esi,[edi+edi] 
0040B695  add         esi,esi 
0040B697  add         esi,esi 
0040B699  sub         esi,edi 
0040B69B  mov         dword ptr [esi+ecx-4],eax 
0040B69F  mov         eax,dword ptr [esp+34h] 
0040B6A3  jmp         0040B6AB 
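
Why the reload is needed in general can be shown with a small sketch in which the array and the pointer do overlap. The functions bump_reload and bump_hoisted are hypothetical illustrations, not ARToolKit code, and they use int throughout so that the overlap is well defined.

```c
/* Unhoisted form: *p is re-read for every statement. */
static void bump_reload(int *a, const int *p)
{
    a[(*p - 1) * 2 + 0]++;
    a[(*p - 1) * 2 + 1] += 10;
}

/* Hoisted form: *p is read once, before any store. */
static void bump_hoisted(int *a, const int *p)
{
    int idx = (*p - 1) * 2;
    a[idx + 0]++;
    a[idx + 1] += 10;
}
```

Called with a = {1, 0, 0, 0} and p = &a[0], the first form leaves {2, 0, 0, 10}, because the increment changes *p before the second statement reads it, while the second form leaves {2, 10, 0, 0}. Since the two forms can differ, a compiler may only merge the loads when it can rule out such an overlap. A compiler using type-based alias analysis might manage that for an int array and a 16-bit label pointer, but the listing above shows that this one did not.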

However, the stores to work2 never modify (*pnt2) here, so it's perfectly safe to rewrite the above code as:

pnt2_index = ((*pnt2)-1) * 7;
work2[pnt2_index+0]++;
work2[pnt2_index+1]+= i;
work2[pnt2_index+2]+= j;
work2[pnt2_index+6] = j;

... resulting in:

...
pnt2_index = ((*pnt2)-1) * 7; 

0040B4B9  lea         eax,[edx+edx] 
0040B4BC  add         eax,eax 
0040B4BE  add         eax,eax 
0040B4C0  sub         eax,edx 

work2[pnt2_index+0]++; 

0040B4C2  mov         edx,dword ptr [esp+2Ch] 
0040B4C6  add         dword ptr [edx+eax*4-1Ch],1 

work2[pnt2_index+6] = j; 

0040B4CB  mov         dword ptr [edx+eax*4-4],edi 

work2[pnt2_index+1]+= i; 

0040B4CF  add         dword ptr [edx+eax*4-18h],ebp 

work2[pnt2_index+2]+= j; 

0040B4D3  add         dword ptr [edx+eax*4-14h],edi 
0040B4D7  jmp         labeling2+28Ah (40B4DEh) 

Overall, this will result in a considerable speedup of arLabeling2().
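
For reference, the rewritten statements can be wrapped into a self-contained sketch. The helper name accumulate_pixel is hypothetical, and the parameter types are inferred from the listings (a 16-bit signed load for *pnt2, dword stores into work2) rather than taken from the ARToolKit source.

```c
#include <assert.h>

/* Accumulate one pixel (i, j) into the record of the label stored at *pnt2.
   Each label owns 7 consecutive ints in work2; labels are 1-based, hence
   the "- 1". Only slots 0, 1, 2 and 6 are touched by these statements. */
static void accumulate_pixel(int *work2, const short *pnt2, int i, int j)
{
    int pnt2_index;

    assert(*pnt2 >= 1);               /* label 0 would index before work2 */
    pnt2_index = (*pnt2 - 1) * 7;     /* *pnt2 is read exactly once */
    work2[pnt2_index + 0]++;          /* pixel count */
    work2[pnt2_index + 1] += i;       /* running sum of i */
    work2[pnt2_index + 2] += j;       /* running sum of j */
    work2[pnt2_index + 6]  = j;       /* most recent j */
}
```

With a zeroed work2 of 14 ints and a label value of 2, calling it for (3, 5) and then (4, 6) leaves work2[7] = 2, work2[8] = 7, work2[9] = 11 and work2[13] = 6, while the record of label 1 stays untouched.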

Regards,
--Thomas

*) The assembly code above was generated by the Intel C++ 7.1 compiler (with optimizations enabled).

_________________________________________

Thomas Pintaric
Interactive Media Systems Group
Vienna University of Technology

<pintaric@i ...............>
http://www.ims.tuwien.ac.at/~thomas
_________________________________________ 

From:	"Stiles, Randy" <randy.stiles@l .......>	Received:	May 23, 2003
To	"'Thomas Pintaric'" <thomas@i ...............>, ARToolKit Mailing List <artoolkit@h ..................>
Subject:	RE: arLabeling2 optimization

From:	Thomas Pintaric <thomas@i ...............>	Received:	May 23, 2003
To	"ARToolKit Mailing List" <artoolkit@h ..................>
Subject:	arLabeling2 optimization