From ... From: Erik Naggum Subject: Re: Q: on hashes and counting Date: 2000/10/19 Message-ID: <3180944138438226@naggum.net> X-Deja-AN: 683360556 References: <8sl58e$ivq$1@nnrp1.deja.com> mail-copies-to: never Content-Type: text/plain; charset=us-ascii X-Complaints-To: newsmaster@eunet.no X-Trace: oslo-nntp.eunet.no 971956484 26907 195.0.192.66 (19 Oct 2000 11:54:44 GMT) Organization: Naggum Software; vox: +47 800 35477; gsm: +47 93 256 360; fax: +47 93 270 868; http://naggum.no; http://naggum.net User-Agent: Gnus/5.0803 (Gnus v5.8.3) Emacs/20.7 Mime-Version: 1.0 NNTP-Posting-Date: 19 Oct 2000 11:54:44 GMT Newsgroups: comp.lang.lisp * The Glauber | Suppose i have a text file and i want to count strings in it. | Specifically, the first 5 characters of each line in this file are a | vendor code. Then i want to ouptut all vendor codes and how many | times they show up in the input. Here's my cut at that task: (let ((hash (make-hash-table :test #'equal))) (with-open-file (stream ...file-args...) (handler-case (loop (let ((vendor (read-sequence stream (make-string 5)))) ;; skip to end of line and consume (peek-char #\newline stream) (read-char stream) ;; real body of loop (incf (gethash vendor hash 0)))) ;; use conditions to simplify error handling (end-of-file ()))) (dolist (key (sort (hash-table-keys hash) #'string<)) (format t "~A,~A~%" key (gethash key hash)))) | This is all straightforward, except for these lines: | (if (gethash curv vhash) | (incf (gethash curv vhash)) | (setf (gethash curv vhash) 1)) | | What they do is, if there is already an entry in the hash, increment | the count. If there isn't, create one (with count set to 1). You have already been informed about the default value to gethash. | Is this the best way to use a hash? In other words, do i really have to | do the (gethash key hash) twice? It feels wasteful. incf does one gethash to get the value and gethash's setf method to set the value. If this is noticeably slow, you _may_ do better with something like this: (let ((entry (gethash vendor hash nil))) (if entry (incf (car entry)) (setf (gethash vendor hash) (list 1)))) | In Perl, you do $hash{key}++ and the entry will get "magically" | created if it isn't there. (Of course, the Perl compiler may be | doing the double lookup behind my back). It is not unlikely that Perl, being a joke of a language around over-powerful regexps and hash tables, has optimized this to death. | The Perl version of this program runs a lot faster than the Lisp | version (6 times faster) (I'm using Clisp and AIX), but i'm assuming | that the main problem with the Lisp version is the fact that each | line of text is consed and thrown away to the garbage collector. | I'm pretty sure Perl would reuse the memory in this case. Did you compile the Lisp code? Perl _always_ runs slower than Lisp for me. Of course, I don't use a (good) toy Lisp like CLISP, but Allegro CL, and I generally compile with high optimization settings. | Anyway, comparing Perl and Lisp is not the point here, i just want | to understand if i'm using the hashes right. Why don't you ask whether hashes are right for the problem? If you have decided to use Perl, regexps and hashes are the right solution because you don't have anything else, but regexps are the wrong solution to almost every problem, and hashes may be similar special optimizations of particular cases. I have a situation not dissimilar to yours, where I have ticker codes for various financial instruments, and I found that it paid off handsomely to use symbols. Some "modern" Lisp books tell us not to use symbols, for the same reasons you should not use regexps and hash tables in Perl, apparently: they are so easy to use that misuse will be hard to combat, but nobody who uses Perl warns about Perl's insane focus on bad tools -- those who take the warnings seriously don't use Perl. So although symbold are expensive structures, they do have enormous benefits language-wise in Lisp, and they should be used if the alternative is more cumbersome. With symbols, the above suggested speed-up becomes (let ((symbol (intern string package))) (if (boundp symbol) (incf (symbol-value symbol)) (setf (symbol-value symbol) 1))) | P.S.: the Perl version is a "one liner": | $c{substr($_,0,5)}++; END { foreach $x (sort keys %c) {print "$x,$c{$x}\n"}} | (very obvious, no? :-)) Perl looks the way it does because 110-baud Teletypes were once upon the time the best people had on Unix, and with lousy typers who had to correct their mistakes all the time, it made sense to abbreviate everything down to one or two characters. Lisp has a different heritage, to put it mildly: Better typers, better use of brainpower than to remember thousands of subtly similar abbreviations, better terminals. So we aren't as interested in one-liners with compact syntax, but for people who still live their lives as if all they can hope for is a 300-baud Teletype, the value of one-liners cannot be underestimated. On the other hand, loop constructs that walk through a file line by line have been posted, as this seems a reasonable thing to do to some people. People have regexp libraries for string hacking in Lisp, too, even though the much better solution is to build a parser that actually returns the stuctured information in the string. It turns out that it doesn't take longer, in programmer-time, to write and test a real parser than it does to write and test regexp-based string hacking at the same quality level, it may in fact take a lot shorter time, because the quality level in a parser is so readily determinable, while people stop worrying about the regexp hacks until it breaks at sufficiently infrequent random times because they data never was designed to be hacked by regexps. Of all the languages I have worked with over the past 20 years, there's only one that stands out as the Wrong Solution to everything it does, especially what it does well, and that's Perl. Quotably: If Perl is the solution, you're solving the wrong problem. #:Erik -- I agree with everything you say, but I would attack to death your right to say it. -- Tom Stoppard