Subject: Re: How much tuning does regular lisp compilers do?
From: rpw3@rpw3.org (Rob Warnock)
Date: Sat, 30 Aug 2008 23:04:59 -0500
Newsgroups: comp.lang.lisp
Message-ID: <P9KdnRf8PsH2hSfVnZ2dnUVZ_vOdnZ2d@speakeasy.net>
Bakul Shah  <bakul+usenet@bitblocks.com> wrote:
+---------------
| Rob Warnock wrote:
| > CMUCL-19c always allocates objects on 8-byte boundaries, as you can see.
...
| > So for the "fast" cases the inner loop code never crosses a 16-byte
| > boundary, and for the "slow" cases it does. But why is a 16-byte
| > boundary important, and not the 64-byte cache line size? A possible
| > answer for that I dug out of an old copy of Microprocessor Report
| > on the AMD K7, which notes that the K7 pipeline fetches 16 bytes of
| > instructions from the primary I-cache per cycle. *AHA!* All becomes clear!
...
| > ...for any single function
| > *its* alignment (and thus speed) may well vary from one GC to the next.
| 
| Looks like you are saying
| a) loops should be aligned on a 16 byte boundary
| b) A copying GC should maintain function alignment modulo 16
+---------------

Exactly.
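To make the two points concrete, here is a minimal Python sketch. It assumes a 16-byte fetch block (the K7 figure quoted above) and an 8-byte allocation grain (the CMUCL-19c behavior quoted above). The function names are hypothetical, and this is not how CMUCL's GC is actually written. The first part checks whether a loop body straddles a fetch-block boundary. The second shows how a copying GC could pick a to-space address that keeps a function's address congruent modulo 16.

```python
FETCH_BLOCK = 16  # assumed bytes fetched from the I-cache per cycle (K7 figure)
ALLOC_GRAIN = 8   # assumed allocation granularity (CMUCL-19c: 8-byte boundaries)


def align_up(addr, align):
    """Round addr up to a multiple of align (align must be a power of two)."""
    return (addr + align - 1) & ~(align - 1)


def fetch_blocks_touched(start, length, block=FETCH_BLOCK):
    """Count the distinct fetch blocks covered by bytes [start, start+length)."""
    first = start // block
    last = (start + length - 1) // block
    return last - first + 1


def crosses_fetch_boundary(start, length, block=FETCH_BLOCK):
    """True if the loop body spills into a second fetch block (the 'slow' case)."""
    return fetch_blocks_touched(start, length, block) > 1


def copy_preserving_phase(free_ptr, old_addr, modulus=FETCH_BLOCK, grain=ALLOC_GRAIN):
    """Hypothetical GC helper: choose a to-space address for a copied function
    so that new_addr % modulus == old_addr % modulus, padding with whole grains."""
    assert old_addr % grain == 0 and modulus % grain == 0
    new_addr = align_up(free_ptr, grain)
    pad = (old_addr - new_addr) % modulus  # Python's % is non-negative here
    return new_addr + pad
```

With 8-byte alignment, a 12-byte loop can land at either phase: starting at `0x1000` it fits in one fetch block, while starting at `0x1008` it crosses `0x1010`. A phase-preserving copy keeps whichever phase the function had before the GC. The padding this costs is at most `modulus - grain` bytes per copied function.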

+---------------
| That shouldn't be too hard to fix, right?!
+---------------

Well... In theory, yes, but "theory vs practice", you know.  ;-}

You don't want to have to force conses to 16-byte boundaries,
since on a 32-bit machine that would *double* the space taken
by them [though on 64-bit, where a cons is already 16 bytes,
it's a no-brainer], so you'd have to do some sort of BiBoP-like
("Big Bag of Pages", or at least segregated) allocation
in the GC. And then there's adding algorithms to the compiler
to know *which* loops are worth forcing alignment on (inner
two levels only, probably), and then adding data structures to
the compiler to keep track of that where it's needed, and to
glue that information together with the part of the compiler
that stitches generated code together [since the current CMUCL
compiler is mostly template-based, viz. "VOPs"]. And then modify
the FASL loader & format to preserve that alignment on loaded
compiled code. Etc.
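The space argument and the segregated-allocation idea can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not CMUCL's allocator. It assumes a cons is two machine words (CAR and CDR) and 4- or 8-byte words for 32- and 64-bit builds. It also models "BiBoP-like" allocation as one bump-pointer region per object kind, each with its own alignment. Real BiBoP schemes track object type per page. The `SegregatedHeap` class and its region addresses are hypothetical.

```python
WORD_BYTES = {32: 4, 64: 8}  # assumed word sizes for 32- and 64-bit builds


def align_up(addr, align):
    """Round addr up to a multiple of align (align must be a power of two)."""
    return (addr + align - 1) & ~(align - 1)


def cons_bytes(word_bits, align):
    """Space one cons occupies when rounded up to the given alignment."""
    raw = 2 * WORD_BYTES[word_bits]  # CAR + CDR
    return align_up(raw, align)


class SegregatedHeap:
    """Hypothetical segregated heap: each object kind bump-allocates in its own
    region with its own alignment, so conses can stay 8-byte aligned while
    code objects are placed on 16-byte boundaries."""

    def __init__(self, regions):
        # regions maps kind -> (base address, alignment)
        self.free = {kind: base for kind, (base, _) in regions.items()}
        self.align = {kind: a for kind, (_, a) in regions.items()}

    def allocate(self, kind, nbytes):
        addr = align_up(self.free[kind], self.align[kind])
        self.free[kind] = addr + nbytes
        return addr


heap = SegregatedHeap({"cons": (0x10000, 8), "code": (0x80000, 16)})
```

On 32-bit, `cons_bytes(32, 16)` is twice `cons_bytes(32, 8)`, which is the doubling mentioned above. On 64-bit both are 16, so forcing 16-byte alignment costs nothing there. Keeping code in its own 16-byte-aligned region avoids paying that price for every cons.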

Certainly "not too hard" for a funded project by a small team
of experienced compiler writers, but I suspect that resource
isn't currently available for CMUCL.

Perhaps someone from one of the commercial implementations
could comment on how they address(ed) this issue... [Duane?]


-Rob

-----
Rob Warnock			<rpw3@rpw3.org>
627 26th Avenue			<URL:http://rpw3.org/>
San Mateo, CA 94403		(650)572-2607