Subject: Re: Please help!!
From: Erik Naggum <erik@naggum.no>
Date: 1999/02/09
Newsgroups: comp.lang.lisp
Message-ID: <3127541349478598@naggum.no>

* Erik Naggum <erik@naggum.no>
| it is generally unwise to assume that you can use an existing stack frame
| when entering a function anew, even the _same_ function.  to do that
| reliably, you use a looping construct that has control over the variable
| assignment and works _inside_ the stack frame and the known function body.

* pk@kuovi.cs.tut.fi (Kellom{ki Pertti)
| Could you elaborate on this?  I can believe that there may be some
| complications in more complicated situations, but I fail to see the
| problems in compiling e.g.
| 
| 	(define (last lst)
|     	  (if (null? (cdr lst))
| 	      (car lst)
| 	      (last (cdr lst))))
| 
| Or are you referring to the possibility of a redefinition?

  no, redefinition is an additional reason to avoid optimizing away the
  stack frame release.

  let's consider a function call in all its glory.  the arguments are
  evaluated before anything interesting happens, but we'll skip that part
  and just assume that all the values to pass are available.  there are
  several ways to pass arguments to functions, but let's consider the
  simplest case: a stack-allocated vector.  in languages with variable
  numbers of arguments, you need to pass the length of this vector.  that's
  typically passed in a register, so not to complicate things, let's assume
  it is not part of the vector.  (the stack-allocated vector is typically
  set up with PUSH instructions, but it helps to view it as a vector.)

  the actual function call is made, and this is typically very fast, like
  keeping the next instruction address somewhere and transfering control to
  the start of the new function, but in Lisp there typically needs to be
  some lookup to find the actual address.  this is usefully handled by a
  trampoline -- a function that does something important but does not
  disturb the stack frame other than in the expected ways.  (it's called a
  trampoline because you jump on/to it and it helps you jump elsewhere.)

  we have arrived at the start of the our code, where there's an prologue
  waiting for us.  the prologue may check the argument count, do call
  counting statistics, check for interrupts, check for space on the stack,
  set up a stack frame, process the argument list (&optional with default
  values, &key arguments), initialize &aux variables, etc, and the order of
  these things is determined by a lot of factors.  the interesting things
  to us are: error handling, the location of the argument vector, the space
  requirements, the initialization of lambda-bound variables.

  the body of the function is not relevant to us, but after the body (in
  time) is an epilogue that needs to clean up from the prologue, unwind the
  stack, and return to wherever we were called from.  a function is allowed
  to maintain an inconsistent stack state as long as it unwinds correctly.
  also note that CPU's suffering from register starvation (Intel), use the
  stack for temporary memory in some cases, and that local variables are
  allocated as part of this procedure.

  now, to make a tail-call that does not need to unwind the stack, and
  which jumps to the location after the stack frame has been set up, we
  need to know the prologue very intimately.  of course, a compiler writer
  does that, but we also need to make sure that everything that happens
  before the stack frame is set up is handled by the tail-call jump.
  suppose that this is a non-trivial amount of work in the general, such as
  computing the size of a stack-allocated &rest list and such.  is it worth
  the trouble to recognize simple functions rather than go through the
  epilogue and the prologue again?

  and since you cannot actually optimize away the self-tail call completely
  in the general case, it may be a _lot_ of work to recognize the simple
  case in Common Lisp.  now, Scheme has decided to make this decision pay
  off, but I do wonder whether they actually manage to make tail calls as
  efficient as loops.  it is not possible in the general case.

  further complications, by the way, occur when you make a tail call from
  inside a let binding, and that's what we normally do.  if the values are
  stack-allocated, the semantics of tail-call changes from a jump before
  the epilogue to a jump after the epilogue.  a special binding prohibits
  the epilogue from running before the callee has returned.  etc, etc.
  these further complications are, however, fairly easy to spot and argue
  with.  the harder problems occur when even their absence makes it unsafe
  to assume you can just jump directly to the point after the prologue.

  now for an illustrative example.  when compiled with (optimize (speed 3)
  (safety 0) (debug 0)), the simple function

(defun foo (x y z)
  (list x y z))

  should cause a very simple jump to the LIST function, or whatever takes
  care of it with three arguments.  here's the SPARC disassembly with
  Allegro CL 4.1.3:

	ld	[%g4 + 95], %g2	; qlist3
	jmp	%g2 + 0
	xor	%g0, #x3, %g3

  (the SPARC is explicitly pipelined, so the XOR is actually executed
  before the first instruction at the jump target.  %G0 is a constant 0, so
  this sets %G3 to 3.  that's the argument count register.  where the other
  argument values are is immaterial since list is given the exact same
  arguments as foo got.)

  the Intel disassembly with ACL 5.0 is somewhat different.

	pushl	ebp
	movl	ebp,esp
	pushl	esi
	subl	esp,$44
	addl	esp,$12
	pushl	[ebp+16]        ; (argument 2)
	pushl	edx
	pushl	eax
	xorl	ecx,ecx
	movb	cl,$3
	call	*[edi+95]       ; qlist3
	leave
	movl	esi,[ebp-4]
	ret

  note that the tail-call did in fact not get merged, but consider the
  slightly simpler function

(defun bar (x y)
  (list x y))

  the SPARC assembly is identical to the previous function, except that
  QLIST2 is called instead of QLIST3.  the interesting thing happens to the
  Intel code:

	xorl	ecx,ecx
	movb	cl,$2
	jmp	*[edi+47]       ; qlist2

  in other words, the number of arguments can itself be an impediment to
  certain optimizations, and the number may vary from architecture to
  architecture.

  your particular function compiles to a self-tail-call merge, but still
  goes through the symbol-function slot of the symbol naming the function.

#:Erik
-- 
  Y2K conversion simplified: Januark, Februark, March, April, Mak, June,
  Julk, August, September, October, November, December.