Re: Problems with EQUALP hash tables and Arrays

Subject: Re: Problems with EQUALP hash tables and Arrays
From: Erik Naggum <erik@naggum.no>
Date: 1999/05/27
Newsgroups: comp.lang.lisp
Message-ID: <3136759246543862@naggum.no>

* Hrvoje Niksic <hniksic@srce.hr>
| Why do they do that?  It's practically *guaranteed* to degrade hash
| searches to O(n) with equal-length keys.

  I don't know _why_ they do it, but it's silly to take a special case and
  argue that it's the world's most important issue.  if hash tables don't
  give you the performance you need, you engage your programmer brain and
  write something that does.  if you are one of those people who think that
  everything should be free and easy, programming is the wrong trade.

  let me give you an example.  I have a pretty decent MD5 algorithm that I
  use in various protocols to ensure that lines don't kill data.  MD5 is a
  fairly complex algorithm, and until I can interface the assembly version
  I wrote directly with Allegro CL, I use one written in Common Lisp, and
  heavily bummed (if I ever start to lecture on optimization, I can boast a
  real life example of 13 000 times factor difference between the naive
  code and the optimized-to-death version).  however, I sometimes need to
  compute the checksum of strings that happen to be STRING= several times.
  finding the appropriate string in a cache is not as trivial as it sounds:
  it needs to be significantly faster than computing the value, since the
  cache will be tried before computing the value.  also, a cache would tend
  to grow without bounds in the absence of weak pointers and the like.  in
  the end, I decided to let the caller signal "cachability" of a string,
  reorder strings in the cache when a hit is made to make it more likely to
  win the next time, flush the cache when it gets full and upon global GC.
  all of this means it doesn't pay to cache strings 119 characters or less
  in length if I am to retain up to 256 strings in the cache.  a lot of CPU
  time went into finding the point where it made sense to cache.  why all
  this work?  because the naïve caching strategy caused a very serious
  performance degradation in actual use, with a usage pattern that had gone
  completely unnoticed during testing.

  if there's a morale to this story it is that optimization holds surprises
  by its very nature: if we knew how to write it as fast as possible,
  that's precisely what we'd do from the start.  we start to optimize
  because the system is unexpectedly slow, right?  so _of_course_ we'll
  find some special special case where performance sucks real hard.

| This calculation is of course slower than just returning the length of
| the vector, but that's what you ask for when you use vectors as hash keys.

  it isn't clear to me what you "ask for" when you use vectors as hash
  keys, but it is nonetheless obvious that user expectations have not been
  met.  sometimes, user expectations are just plain wrong.  in the field of
  optimal performance, user expectations are normally wrong: users are like
  politicians that way, they want something very badly, but have absolutely
  no idea how to get it.  bad performance is a computer's way of showing
  its approval rating of the programmers.

#:Erik
-- 
@1999-07-22T00:37:33Z -- pi billion seconds since the turn of the century