From: Marc LeBrun

Subject: Re: large size hash-tables and CPU time

Date: 2002-12-5 17:19

I don't know the details of ACL's hash tables specifically, but I've 
implemented a few of my own, so perhaps I can suggest something.

First, to avoid the rehash costs you should make the table big to start 
with, so it basically never has to rehash.  I would suggest making it 
twice the largest expected number of elements, i.e. 120K.  That is not really 
a very big number on today's systems.  A reasonable hash table implementation 
(which I assume ACL's is) probably only uses a small number of words per 
entry: one for the pointer to the object, one for the key, and at most a few 
more for caching or bookkeeping.

More generally I would suggest using a threshold of 0.5 (or less).  This 
means the table rehashes when half full, which keeps the expected number of 
collisions very low.  When you crowd the table with a threshold like 0.9, 
then, as it fills up, a greater percentage of the probes will collide and 
have to reprobe multiple times.  This can have a big, unpredictable impact 
on performance.  For similar reasons I would never use a growth factor of 
less than 2; otherwise, in a big run, you will just wind up rehashing 
multiple times before the table gets big enough and sparse enough.
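 
Putting the sizing, threshold, and growth suggestions together, a sketch 
using only the standard keywords (ACL may offer additional 
implementation-specific options I'm not covering here):

  ;; Pre-sized, rehashes at half full, and doubles in size on the
  ;; (unlikely) occasions it does have to grow.
  (make-hash-table :test #'eql
                   :size 120000
                   :rehash-threshold 0.5
                   :rehash-size 2.0)    ; a float > 1 means a growth ratio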

Beyond that, system effects (such as inter-process locking to make the hash 
table thread-safe) can have a big performance impact.  Look for, and try to 
reduce, extraneous sources of degradation like these.

Lastly, instead of using the system-supplied general-purpose hash tables, 
consider implementing your own (using arrays, etc.), specifically tuned to 
your application (e.g. with a fast special-purpose hash function).  Recently 
I did this in a large commercial system, and found that my hash tables 
metered over 100 times faster than the Microsoft system map classes we had 
been using!
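 
To give a feel for what a special-purpose table can look like, here is a 
minimal sketch assuming fixnum keys, a power-of-two capacity fixed up 
front, and open addressing with linear probing.  All the names and the 
hash constant are mine, not from any particular system:

  ;; Sketch only: fixed capacity, no deletion, no resizing, and it will
  ;; loop forever if you overfill it -- keep it at most half full.
  (defparameter *capacity* (expt 2 18))        ; power of two, ~260K slots
  (defparameter *empty* '#:empty)              ; marker for unused slots

  (defstruct fixnum-table
    (keys (make-array *capacity* :initial-element *empty*))
    (vals (make-array *capacity*)))

  (defun hash-fixnum (k)
    ;; Cheap multiplicative hash (Knuth's constant); the power-of-two
    ;; capacity lets us mask instead of taking an expensive mod.
    (logand (* k 2654435761) (1- *capacity*)))

  (defun ft-get (table key)
    ;; Linear probing: scan from the hashed slot until we find the key
    ;; or hit an empty slot (which means the key is absent).
    (loop with keys = (fixnum-table-keys table)
          for i = (hash-fixnum key) then (logand (1+ i) (1- *capacity*))
          for k = (svref keys i)
          until (eq k *empty*)
          when (eql k key)
            do (return (svref (fixnum-table-vals table) i))))

  (defun ft-put (table key value)
    ;; Probe until we hit the key (update) or an empty slot (insert).
    (loop with keys = (fixnum-table-keys table)
          for i = (hash-fixnum key) then (logand (1+ i) (1- *capacity*))
          for k = (svref keys i)
          until (or (eq k *empty*) (eql k key))
          finally (setf (svref keys i) key
                        (svref (fixnum-table-vals table) i) value)))

  ;; Usage:
  ;;   (defvar *tbl* (make-fixnum-table))
  ;;   (ft-put *tbl* 42 "forty-two")
  ;;   (ft-get *tbl* 42)   => "forty-two"

Most of the win over a general-purpose table usually comes from skipping 
the generic hashing and test-function dispatch; adding fixnum and 
simple-vector declarations and compiling with speed optimization can help 
further.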

Hope this helps!