Subject: Re: Q: on hashes and counting
From: Erik Naggum <erik@naggum.net>
Date: 2000/10/19
Newsgroups: comp.lang.lisp
Message-ID: <3180944138438226@naggum.net>

* The Glauber <theglauber@my-deja.com>
| Suppose i have a text file and i want to count strings in it.
| Specifically, the first 5 characters of each line in this file are a
| vendor code.  Then i want to ouptut all vendor codes and how many
| times they show up in the input.

  Here's my cut at that task:

(let ((hash (make-hash-table :test #'equal)))
  (with-open-file (stream ...file-args...)
    (handler-case
	(loop
	  (let ((vendor (read-sequence stream (make-string 5))))
	    ;; skip to end of line and consume
	    (peek-char #\newline stream)
	    (read-char stream)
	    ;; real body of loop
	    (incf (gethash vendor hash 0))))
      ;; use conditions to simplify error handling
      (end-of-file ())))
  (dolist (key (sort (hash-table-keys hash) #'string<))
    (format t "~A,~A~%" key (gethash key hash))))

| This is all straightforward, except for these lines:
|       (if (gethash curv vhash)
|         (incf (gethash curv vhash))
|         (setf (gethash curv vhash) 1))
| 
| What they do is, if there is already an entry in the hash, increment
| the count. If there isn't, create one (with count set to 1).

  You have already been informed about the default value to gethash.

| Is this the best way to use a hash? In other words, do i really have to
| do the (gethash key hash) twice? It feels wasteful.

  incf does one gethash to get the value and gethash's setf method to
  set the value.  If this is noticeably slow, you _may_ do better with
  something like this:

(let ((entry (gethash vendor hash nil)))
  (if entry
      (incf (car entry))
      (setf (gethash vendor hash) (list 1))))

| In Perl, you do $hash{key}++ and the entry will get "magically"
| created if it isn't there. (Of course, the Perl compiler may be
| doing the double lookup behind my back).

  It is not unlikely that Perl, being a joke of a language around
  over-powerful regexps and hash tables, has optimized this to death.

| The Perl version of this program runs a lot faster than the Lisp
| version (6 times faster) (I'm using Clisp and AIX), but i'm assuming
| that the main problem with the Lisp version is the fact that each
| line of text is consed and thrown away to the garbage collector.
| I'm pretty sure Perl would reuse the memory in this case.

  Did you compile the Lisp code?  Perl _always_ runs slower than Lisp
  for me.  Of course, I don't use a (good) toy Lisp like CLISP, but
  Allegro CL, and I generally compile with high optimization settings.

| Anyway, comparing Perl and Lisp is not the point here, i just want
| to understand if i'm using the hashes right.

  Why don't you ask whether hashes are right for the problem?  If you
  have decided to use Perl, regexps and hashes are the right solution
  because you don't have anything else, but regexps are the wrong
  solution to almost every problem, and hashes may be similar special
  optimizations of particular cases.

  I have a situation not dissimilar to yours, where I have ticker
  codes for various financial instruments, and I found that it paid
  off handsomely to use symbols.  Some "modern" Lisp books tell us not
  to use symbols, for the same reasons you should not use regexps and
  hash tables in Perl, apparently: they are so easy to use that misuse
  will be hard to combat, but nobody who uses Perl warns about Perl's
  insane focus on bad tools -- those who take the warnings seriously
  don't use Perl.  So although symbold are expensive structures, they
  do have enormous benefits language-wise in Lisp, and they should be
  used if the alternative is more cumbersome.  With symbols, the above
  suggested speed-up becomes

(let ((symbol (intern string package)))
  (if (boundp symbol)
      (incf (symbol-value symbol))
      (setf (symbol-value symbol) 1)))
  
| P.S.: the Perl version is a "one liner":
| $c{substr($_,0,5)}++; END { foreach $x (sort keys %c) {print "$x,$c{$x}\n"}}
| (very obvious, no? :-))

  Perl looks the way it does because 110-baud Teletypes were once upon
  the time the best people had on Unix, and with lousy typers who had
  to correct their mistakes all the time, it made sense to abbreviate
  everything down to one or two characters.  Lisp has a different
  heritage, to put it mildly: Better typers, better use of brainpower
  than to remember thousands of subtly similar abbreviations, better
  terminals.  So we aren't as interested in one-liners with compact
  syntax, but for people who still live their lives as if all they can
  hope for is a 300-baud Teletype, the value of one-liners cannot be
  underestimated.

  On the other hand, loop constructs that walk through a file line by
  line have been posted, as this seems a reasonable thing to do to
  some people.  People have regexp libraries for string hacking in
  Lisp, too, even though the much better solution is to build a parser
  that actually returns the stuctured information in the string.  It
  turns out that it doesn't take longer, in programmer-time, to write
  and test a real parser than it does to write and test regexp-based
  string hacking at the same quality level, it may in fact take a lot
  shorter time, because the quality level in a parser is so readily
  determinable, while people stop worrying about the regexp hacks
  until it breaks at sufficiently infrequent random times because they
  data never was designed to be hacked by regexps.

  Of all the languages I have worked with over the past 20 years,
  there's only one that stands out as the Wrong Solution to everything
  it does, especially what it does well, and that's Perl.  Quotably:

  If Perl is the solution, you're solving the wrong problem.

#:Erik
-- 
  I agree with everything you say, but I would
  attack to death your right to say it.
				-- Tom Stoppard