Board index » delphi » Bit counting problem

Bit counting problem


2006-12-09 07:58:55 AM
delphi127
I would like to count the differences between two "image thumbnails".
These thumbnails are 16 rows x 16 columns x 1 bit.
The fastest algorithm that I have been able to come up with is in Pascal.
I can not seem to leverage any of the {*word*118} MMX/SSE/whatever instructions
to make it run faster.
I tried an SSE implementation using 128 bit XOR which worked but extracting
the results and adding up the set bits actually took longer than the
pascal implementation.
So, I thought I would run the problem by the readers of this newsgroup.
Any improvements would be most welcome.
My pascal implementation follows:
// DEFINITIONS //
type
PImageData16x16x1 = ^TImageData16x16x1;
TImageData16x16x1 = Array[0..8-1] of UInt32;
const
ImageAddrExtent = 4000000;
type
TImageAddrEntry = packed record
case Byte of
1: (ImageData16x16x1: PImageData16x16x1);
2: (ImageData16x16x8: PImageData16x16x8);
end;
TImageAddrVector = Array[0..ImageAddrExtent] of TImageAddrEntry;
var
G_ImageAddrVector: TImageAddrVector;
// COMPARE //
procedure PerformCompare;
var
Outer: Integer;
OuterP: PImageData16x16x1;
Inner: Integer;
InnerP: PImageData16x16x1;
Sum: Integer;
CompareVal: Longword;
begin
while CompareManager.AllocWorkPacket(Outer) do
begin
OuterP := G_ImageAddrVector[Outer].ImageData16x16x1;
for Inner := Outer+1 to ImageCount do
begin
InnerP := G_ImageAddrVector[Inner].ImageData16x16x1;
Sum := 0;
CompareVal := OuterP^[0] XOR InnerP^[0];
Inc(Sum, G_BitCountWL[CompareVal AND $FFFF]);
Inc(Sum, G_BitCountWL[CompareVal SHR 16]);
CompareVal := OuterP^[1] XOR InnerP^[1];
Inc(Sum, G_BitCountWL[CompareVal AND $FFFF]);
Inc(Sum, G_BitCountWL[CompareVal SHR 16]);
CompareVal := OuterP^[2] XOR InnerP^[2];
Inc(Sum, G_BitCountWL[CompareVal AND $FFFF]);
Inc(Sum, G_BitCountWL[CompareVal SHR 16]);
CompareVal := OuterP^[3] XOR InnerP^[3];
Inc(Sum, G_BitCountWL[CompareVal AND $FFFF]);
Inc(Sum, G_BitCountWL[CompareVal SHR 16]);
CompareVal := OuterP^[4] XOR InnerP^[4];
Inc(Sum, G_BitCountWL[CompareVal AND $FFFF]);
Inc(Sum, G_BitCountWL[CompareVal SHR 16]);
CompareVal := OuterP^[5] XOR InnerP^[5];
Inc(Sum, G_BitCountWL[CompareVal AND $FFFF]);
Inc(Sum, G_BitCountWL[CompareVal SHR 16]);
CompareVal := OuterP^[6] XOR InnerP^[6];
Inc(Sum, G_BitCountWL[CompareVal AND $FFFF]);
Inc(Sum, G_BitCountWL[CompareVal SHR 16]);
CompareVal := OuterP^[7] XOR InnerP^[7];
Inc(Sum, G_BitCountWL[CompareVal AND $FFFF]);
Inc(Sum, G_BitCountWL[CompareVal SHR 16]);
if (Sum>Threshold) then
AddResult(Outer, Inner);
end;
end;
end;
 
 

Re:Bit counting problem

Since you're using a table of $FFFF length (65536 entries) you may suffer
from memory cache hits.
I'd advise to go to a smaller table of 256 entries, and split up the algo so
it does 4 table lookups per compare.
There are also fast algorithms (instead of table lookups) to count bits.
This one comes to mind (in C):
unsigned int
ones32(register unsigned int x)
{
/* 32-bit recursive reduction using SWAR...
but first step is mapping 2-bit values
into sum of 2 1-bit values in sneaky way
*/
x -= ((x>>1) & 0x55555555);
x = (((x>>2) & 0x33333333) + (x & 0x33333333));
x = (((x>>4) + x) & 0x0f0f0f0f);
x += (x>>8);
x += (x>>16);
return(x & 0x0000003f);
}
I translated it for you (hope I did it correcty):
function Ones32(x: dword): dword;
begin
x := x - (x shr 1) and $55555555;
x := (x shr 2) and $33333333 + x and $33333333;
x := (x shr 4 + x) and $0F0F0F0F;
x := x + (x shr 8);
x := x + (x shr 16);
Result := x and $3F;
end;
It's an interesting algorithm. I once digged into it and fully understood
it, but now that I look at it again, I think I will have to re-do that :)
See this website:
aggregate.org/MAGIC/
Of course you can ASM optimize the code, or even better: use MMX, and inline
it.
Nils
 

Re:Bit counting problem

The magic behind Nils solution and a 64bit MMX version are explicited in
the Athlon Optimization guide (Integer Optimization/Population Count,
free download from AMD's website).
I copy-pasted the raw MMX version below.
Eric
__declspec ({*word*192}) unsigned int __stdcall popcount64_1
(unsigned __int64 v)
{
static const __int64 C55 = 0x5555555555555555;
static const __int64 C33 = 0x3333333333333333;
static const __int64 C0F = 0x0F0F0F0F0F0F0F0F;
__asm {
MOVD MM0, [ESP+4] ;v_low
PUNPCKLDQ MM0, [ESP+8] ;v
MOVQ MM1, MM0 ;v
PSRLD MM0, 1 ;v>>1
PAND MM0, [C55] ;(v>>1) & 0x55555555
PSUBD MM1, MM0 ;w = v - ((v>>1) & 0x55555555)
MOVQ MM0, MM1 ;w
PSRLD MM1, 2 ;w>>2
PAND MM0, [C33] ;w & 0x33333333
PAND MM1, [C33] ;(w>>2) & 0x33333333
PADDD MM0, MM1 ;x = (w & 0x33333333) + ((w>>2) & 0x33333333)
MOVQ MM1, MM0 ;x
PSRLD MM0, 4 ;x>>4
PADDD MM0, MM1 ;x + (x>>4)
PAND MM0, [C0F] ;y = (x + (x>>4) & 0x0F0F0F0F)
PXOR MM1, MM1 ;0
PSADBW (MM0, MM1) ;sum across all 8 bytes
MOVD EAX, MM0 ;result in EAX per calling convention
EMMS ;clear MMX state
RET 8 ;pop 8-byte argument off stack and return
}
}