hashinit versus phashinit

Max Laier max at love2party.net
Tue May 6 13:05:04 UTC 2008


On Tuesday 06 May 2008 09:48:30 Roman Divacky wrote:
> On Tue, May 06, 2008 at 02:25:56AM -0400, David Schultz wrote:
> > On Mon, May 05, 2008, Roman Divacky wrote:
> > > hi
> > >
> > > when we want to use a hash table in the kernel we call "hashinit",
> > > which initializes a hash table with a power-of-2 size. There is
> > > also "phashinit", which creates a hash table whose size is a prime
> > > number. It was added in 1995 by davidg@ but is not used anywhere
> > > in the kernel.
> > >
> > > phk@ committed rev. 1.30 of vfs_cache.c, replacing phashinit with
> > > hashinit and stating that it's better because it replaces a
> > > division with a logical AND. Is this reason still valid today?
> > > (The commit was made almost 11 years ago.)
> > >
> > > is there still any reason not to use phashinit instead of
> > > hashinit? I believe using a prime-sized hash table might have a
> > > positive performance impact...
> >
> > There's a tradeoff.
> >
> > The argument for using powers of 2 is that division takes many
> > times longer than a logical AND.
> >
> > The argument for using primes is that if your hash function isn't
> > as good as you thought it was, or if the data has some regularity
> > you weren't expecting, you can get screwed a lot more easily with
> > a power of 2 hash table.  With a prime-sized hash table, you only
> > get screwed if lots of your data is congruent modulo the prime,
> > which is very rare.
> >
> > Most general-purpose hash implementations I've used (e.g., GNU
> > libstdc++, Sun JDK, Microsoft .NET) use prime table sizes,
> > probably to make it less likely that programmers will shoot
> > themselves in the foot with pathological data or bad hash functions.
>
> yes... a division takes roughly 40 cycles on modern i386 hardware,
> while an AND is just one, but does that matter compared to memory
> access times these days?
>
> the ratio between CPU speed and memory speed has changed a lot. I am
> not sure whether it's still true that avoiding the division helps
> performance these days...

net cost per request
  = div_overhead - avoided_collisions/request * collision_overhead
 ~= -(avoided_collisions/request * collision_overhead)

assuming the collision_overhead (which involves memory operations)
greatly dominates the div_overhead.

So if there is a high collision rate, and you can reasonably expect to
lower it significantly by using a prime-sized hash table, the
div_overhead doesn't matter.

At least that's what I've come up with off the top of my head now.

-- 
/"\  Best regards,                      | mlaier at freebsd.org
\ /  Max Laier                          | ICQ #67774661
 X   http://pf4freebsd.love2party.net/  | mlaier at EFnet
/ \  ASCII Ribbon Campaign              | Against HTML Mail and News

