Hi,
I am glad someone is looking into optimizations! Also note that sqrt() is used extensively in the variants of arDetectMarker(), and it is a very costly call. It can (and should) be replaced by a table-lookup sqrt function, such as the one available in Graphics Gems III.
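For illustration only, a table-lookup square root could look like the sketch below. It is not the Graphics Gems routine. The names isqrt_table, isqrt_table_init and isqrt_lookup are hypothetical, and the sketch assumes integer inputs below 65536 for which a truncated integer result is good enough, which may or may not hold for the call sites in arDetectMarker().

```c
/* Hypothetical sketch: table of floor(sqrt(n)) for n in 0..65535.
   Not taken from ARToolKit or Graphics Gems. */
#define ISQRT_TABLE_SIZE 65536

static unsigned char isqrt_table[ISQRT_TABLE_SIZE];

/* Fill the table once at startup; uses only integer arithmetic. */
static void isqrt_table_init(void)
{
    unsigned int r, n;

    for (r = 0; r < 256; r++) {
        /* every n in [r*r, (r+1)*(r+1) - 1] has floor(sqrt(n)) == r */
        for (n = r * r; n < (r + 1) * (r + 1); n++)
            isqrt_table[n] = (unsigned char)r;
    }
}

/* Look up floor(sqrt(n)); the caller must pass n < ISQRT_TABLE_SIZE. */
static unsigned int isqrt_lookup(unsigned int n)
{
    return isqrt_table[n];
}
```

The table costs 64 KB and one pass at startup; after that each square root is a single array read. Code that needs fractional results would need a different table layout.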
-Randy
-----Original Message-----
From: Thomas Pintaric [mailto:thomas@i ...............]
Sent: Friday, May 23, 2003 5:52 AM
To: ARToolKit Mailing List
Subject: arLabeling2 optimization
I did some ARToolKit profiling the other day and found a performance issue caused
by redundant array-index computations within the inner (per-pixel) loop of
ARToolKit's labeling2() function.
The original code contains the following statements:
work2[((*pnt2)-1)*7+0] ++;
work2[((*pnt2)-1)*7+1] += i;
work2[((*pnt2)-1)*7+2] += j;
work2[((*pnt2)-1)*7+6] = j;
A compiler* that cannot rule out a store to work2 changing (*pnt2) will recompute
the index for every statement (see below), since it must not assume that (*pnt2)
remains constant across the entire instruction sequence:
...
work2[((*pnt2)-1)*7+0] ++;
0040B646 add ecx,ecx
0040B648 add ecx,ecx
0040B64A lea esi,[ecx+ecx]
0040B64D add esi,esi
0040B64F add esi,esi
0040B651 sub esi,ecx
0040B653 mov ecx,dword ptr [esp+0Ch]
0040B657 add dword ptr [esi+ecx-1Ch],1
work2[((*pnt2)-1)*7+1] += i;
0040B65C movsx edi,word ptr [ebp]
0040B660 add edi,edi
0040B662 add edi,edi
0040B664 lea esi,[edi+edi]
0040B667 add esi,esi
0040B669 add esi,esi
0040B66B sub esi,edi
0040B66D add dword ptr [esi+ecx-18h],eax
work2[((*pnt2)-1)*7+2] += j;
0040B671 movsx eax,word ptr [ebp]
0040B675 add eax,eax
0040B677 add eax,eax
0040B679 lea esi,[eax+eax]
0040B67C add esi,esi
0040B67E add esi,esi
0040B680 sub esi,eax
0040B682 mov eax,dword ptr [esp+40h]
0040B686 add dword ptr [esi+ecx-14h],eax
work2[((*pnt2)-1)*7+6] = j;
0040B68A movsx edi,word ptr [ebp]
0040B68E add edi,edi
0040B690 add edi,edi
0040B692 lea esi,[edi+edi]
0040B695 add esi,esi
0040B697 add esi,esi
0040B699 sub esi,edi
0040B69B mov dword ptr [esi+ecx-4],eax
0040B69F mov eax,dword ptr [esp+34h]
0040B6A3 jmp 0040B6AB
However, since the stores to work2 never modify (*pnt2), it's perfectly safe to compute the index once and rewrite the above code as:
pnt2_index = ((*pnt2)-1) * 7;
work2[pnt2_index+0]++;
work2[pnt2_index+1]+= i;
work2[pnt2_index+2]+= j;
work2[pnt2_index+6] = j;
... resulting in:
...
pnt2_index = ((*pnt2)-1) * 7;
0040B4B9 lea eax,[edx+edx]
0040B4BC add eax,eax
0040B4BE add eax,eax
0040B4C0 sub eax,edx
work2[pnt2_index+0]++;
0040B4C2 mov edx,dword ptr [esp+2Ch]
0040B4C6 add dword ptr [edx+eax*4-1Ch],1
work2[pnt2_index+6] = j;
0040B4CB mov dword ptr [edx+eax*4-4],edi
work2[pnt2_index+1]+= i;
0040B4CF add dword ptr [edx+eax*4-18h],ebp
work2[pnt2_index+2]+= j;
0040B4D3 add dword ptr [edx+eax*4-14h],edi
0040B4D7 jmp labeling2+28Ah (40B4DEh)
Overall, this will result in a considerable speedup of arLabeling2().
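As a rough, stand-alone sketch of the same idea, the record base can also be kept in a local pointer so that (*pnt2) is read exactly once. The helper name update_label_stats and the macro STATS_PER_LABEL are hypothetical, and declaring pnt2 as a pointer to short is an assumption based on the 16-bit movsx loads in the listing above; ARToolKit's real declarations may differ.

```c
/* Hypothetical helper, not ARToolKit's actual code. */
#define STATS_PER_LABEL 7   /* record stride taken from the "*7" in the indexing */

static void update_label_stats(int *work2, const short *pnt2, int i, int j)
{
    /* Read *pnt2 once and compute the record base once. */
    int *rec = &work2[(*pnt2 - 1) * STATS_PER_LABEL];

    rec[0]++;      /* same as work2[pnt2_index+0]++     */
    rec[1] += i;   /* same as work2[pnt2_index+1] += i  */
    rec[2] += j;   /* same as work2[pnt2_index+2] += j  */
    rec[6]  = j;   /* same as work2[pnt2_index+6]  = j  */
}
```

Because the label is loaded into a local value before any store, the compiler no longer has to worry about a write to work2 changing it, which is the same effect the pnt2_index variable achieves.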
Regards,
--Thomas
*) The assembly code above was generated by the Intel C++ 7.1 compiler (with optimizations enabled).
_________________________________________
Thomas Pintaric
Interactive Media Systems Group
Vienna University of Technology
<pintaric@i ...............>
http://www.ims.tuwien.ac.at/~thomas
_________________________________________