From: Erik Naggum
Subject: Re: Problems with EQUALP hash tables and Arrays
Date: 1999/05/27
Message-ID: <3136759246543862@naggum.no>
X-Deja-AN: 482552040
References: <374AF040.F0978582@mindspring.com> <87r9o3d3qa.fsf@pc-hrvoje.srce.hr>
Mail-Copies-To: never
Organization: Naggum Software; +47 8800 8879; http://www.naggum.no
Newsgroups: comp.lang.lisp

* Hrvoje Niksic
| Why do they do that?  It's practically *guaranteed* to degrade hash
| searches to O(n) with equal-length keys.

  I don't know _why_ they do it, but it's silly to take a special case
  and argue that it's the world's most important issue.  if hash tables
  don't give you the performance you need, you engage your programmer
  brain and write something that does.  if you are one of those people
  who think that everything should be free and easy, programming is the
  wrong trade.

  let me give you an example.  I have a pretty decent MD5 implementation
  that I use in various protocols to ensure that lines don't kill data.
  MD5 is a fairly complex algorithm, and until I can interface the
  assembly version I wrote directly with Allegro CL, I use one written in
  Common Lisp and heavily bummed (if I ever start to lecture on
  optimization, I can boast a real-life example of a 13 000-fold
  difference between the naive code and the optimized-to-death version).
  however, I sometimes need to compute the checksum of strings that
  happen to be STRING= several times.

  finding the appropriate string in a cache is not as trivial as it
  sounds: the lookup needs to be significantly faster than computing the
  value, since the cache will be tried before computing the value.  also,
  a cache would tend to grow without bounds in the absence of weak
  pointers and the like.  in the end, I decided to let the caller signal
  "cachability" of a string, to reorder strings in the cache when a hit
  is made so they are more likely to win the next time, and to flush the
  cache when it gets full and upon global GC.  all of this means it
  doesn't pay to cache strings 119 characters or less in length if I am
  to retain up to 256 strings in the cache.  a lot of CPU time went into
  finding the point where it made sense to cache.  why all this work?
  because the naive caching strategy caused a very serious performance
  degradation in actual use, with a usage pattern that had gone
  completely unnoticed during testing.

  if there's a moral to this story, it is that optimization holds
  surprises by its very nature: if we knew how to write it as fast as
  possible, that's precisely what we'd do from the start.  we start to
  optimize because the system is unexpectedly slow, right?  so _of_course_
  we'll find some special case where performance sucks real hard.

| This calculation is of course slower than just returning the length of
| the vector, but that's what you ask for when you use vectors as hash
| keys.

  it isn't clear to me what you "ask for" when you use vectors as hash
  keys, but it is nonetheless obvious that user expectations have not
  been met.  sometimes, user expectations are just plain wrong.  in the
  field of optimal performance, user expectations are normally wrong:
  users are like politicians that way, they want something very badly,
  but have absolutely no idea how to get it.  bad performance is a
  computer's way of showing its approval rating of the programmers.

#:Erik
-- 
@1999-07-22T00:37:33Z -- pi billion seconds since the turn of the century
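
[The post shows no code for the EQUALP-hashing complaint, but the degradation Hrvoje describes is easy to reproduce with a toy model: if an array's hash depends only on its length, every same-length key lands in the same bucket and each lookup falls back to a linear EQUALP scan.  The sketch below is purely illustrative; the names (LENGTH-KEYED-TABLE, LKT-GET, LKT-PUT) and the alist-bucket layout are assumptions, not anything from a real implementation.]

    ;;; Illustrative sketch only: a toy table whose "hash" is just the
    ;;; key's length, mimicking an implementation that hashes arrays by
    ;;; LENGTH alone.
    (defstruct (length-keyed-table (:conc-name lkt-))
      (buckets (make-hash-table :test #'eql)))  ; key length -> alist of (key . value)

    (defun lkt-get (key table)
      ;; every key of the same length shares one bucket, so this is a
      ;; linear EQUALP scan over all of them -- the O(n) degradation
      (cdr (assoc key (gethash (length key) (lkt-buckets table)) :test #'equalp)))

    (defun lkt-put (key value table)
      (push (cons key value) (gethash (length key) (lkt-buckets table)))
      value)

[Hashing the contents instead -- the slower "calculation" from the quoted follow-up -- is what spreads equal-length keys across buckets.]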
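
[The caching code itself is not in the post either; the following is a minimal sketch of the strategy as described, under stated assumptions: MD5-STRING stands in for the real, heavily optimized digest routine, "reorder strings in the cache when a hit is made" is read here as simple move-to-front, and the flush-on-global-GC part would need an implementation-specific hook (omitted).  The 256-string limit and the 119-character cut-off are the figures from the text.]

    ;;; Hypothetical sketch of the move-to-front digest cache described above.
    (defparameter *md5-cache-limit* 256)      ; retain up to 256 strings (from the post)
    (defparameter *md5-cache-threshold* 120)  ; don't cache strings of 119 characters or less
    (defvar *md5-cache* '())                  ; list of (string . digest), hottest entries first

    (defun flush-md5-cache ()
      ;; also meant to be called from an implementation-specific global-GC hook (not shown)
      (setf *md5-cache* '()))

    (defun cached-md5 (string &key cachable)
      "Return the MD5 digest of STRING, trying the cache before computing it.
    The caller signals cachability; strings below the cut-off are never cached."
      (let ((hit (assoc string *md5-cache* :test #'string=)))
        (cond (hit
               ;; move-to-front: make this entry more likely to win next time
               (setf *md5-cache* (cons hit (delete hit *md5-cache* :test #'eq)))
               (cdr hit))
              (t
               (let ((digest (md5-string string)))  ; MD5-STRING is assumed, not shown
                 (when (and cachable (>= (length string) *md5-cache-threshold*))
                   (when (>= (length *md5-cache*) *md5-cache-limit*)
                     (flush-md5-cache))             ; flush when the cache gets full
                   (push (cons string digest) *md5-cache*))
                 digest)))))

[With an alist, the cache lookup is itself a linear STRING= scan, which only pays off because the digest computation is far more expensive than a string comparison -- which is presumably why the cut-off had to be measured rather than guessed.]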