Subject: Re: Theory #51 (superior(?) programming languages)
From: Erik Naggum <erik@naggum.no>
Date: 1997/01/26
Newsgroups: comp.arch,comp.lang.lisp,comp.lang.scheme
Message-ID: <3063227553270898@naggum.no>


* D. J. Bernstein
| Expand the tail recursion in C, like this---
:
| ---and you get a 5x speedup. That's more than twice as fast as your
| proud example of fast LISP code.

just to be entirely clear on this, the code in the "proud example" was
posted by a fool who posted both the C and the Lisp code to debunk Lisp's
speed.  he was indeed very proud of his code, and he made no attempt to
speed up the C code once he was shown that CMUCL outdid it with a factor of
2.  you're not answering the question as asked if you change the premises
they way you have done.  but to show his state of mind, let me include a
sentence from his message (translated):

    Execution time under CLISP on my old Pentium/Linux: 140 seconds.
    After compiling the stuff, however, I got it down to 11 seconds.

he was comparing _interpreted_ code to his compiled C code!  you can judge
his brilliance from this alone.  his refusal to acknowledge that CLISP ran
byte-code was the reason I wanted to beat up his example.

however, your challenge was much tougher.

| Of course, it's bad to use the stack for temporary space. Allocate what
| you need at the beginning, handling out-of-memory explicitly rather than
| crashing, and use some obvious dynamic programming---
:
| ---for another 500x speedup. How would you translate that to LISP? How
| fast is the result?

since both the Lisp and the C versions are very fast, I ran a million
passes over each version to get accurate timing.  (SunOS 4 has some odd
notions of internal clock speed, measuring in 100Hz and reporting i 60Hz,
thereby losing a lot of precision.)  your C code used 78 s, but allocated
36M of memory and needed 7 s of system time to do so.  a call to free(u)
and a slight rewrite of the value-returning form increased the time to 82
s, but reduced the system time to less than .3 seconds.  using `alloca' to
allocate on the stack, instead, it's down to 66 s.  the Lisp version uses
104 s, plus 1.1 s of garbage collection and .3 s of system time.  it had
consed a total of 48M.  (experiments showed that zeroing a pre-allocated
vector was _much_ slower than allocating fresh memory.)

here's the code I used for the comparisons:

/* binomial.c -*- mode: c -*- 35453.668341 */

int binomial(int n, int k)
{
  int i, j, *u;
  if (k > n)
    return 0;
  i = n - k;
  if (k > i)
    k = i;
  if (!k)
    return 1;
  u = (int *) alloca((k + 1) * sizeof (int));
  if (!u)
    return -1;
  u[0] = 1;
  for (i = 1;  i <= n; ++i)
    for (j = k; j > 0; --j)
      u[j] += u[j - 1];
  return u[k];
}

/* binomial.c ends */

#| binomial.cl -*- mode: common-lisp -*- 35453.668456 |#

(in-package :user)

(defun binomial (n k)
  (declare (fixnum n k))
  (locally (declare (optimize (speed 3) (safety 0) (debug 0)))
    (unless (and (plusp k) (plusp n))
      (error "No possible result"))
    (when (> k n)
      (cerror "Evaluate (binomial ~*~D ~0@*~D) instead"
	      "Reversed arguments in (binomial ~D ~D)?"
	      n k)
      (rotatef n k))
    (setq k (min k (- n k)))    
    (when (zerop k)
      (return-from binomial 1))
    (let* ((j 0) (u (make-array (1+ k) :initial-element 0)))
      (declare (fixnum j) (type (vector fixnum) u))
      (setf (svref u 0) 1)
      (tagbody
       outer   (setq j k)
       inner   (incf (svref u j) (svref u (1- j)))
	(unless (zerop (decf j)) (go inner))
	(unless (zerop (decf n)) (go outer)))
      (svref u k))))

#| binomial.cl ends |#

there may be ways to improve this further; I don't have time to sit down
and _really_ tune this code (the above was based on a couple hunches and a
little experimentation, and took me about 40 minutes to tweak this into
place, plus about half an hour CPU time for tests running so long at a time
I could get some real work done in between.)  clearly, this is an immense
improvement over the previous code, a factor of 43,100!  (I have been told
that this is because the SPARC CPU (or my version of it) handles recursion
badly due to limited register windows.  I'm seeking more information on
this.)

#\Erik
-- 
1,3,7-trimethylxanthine -- a basic ingredient in quality software.