Subject: Re: How much tuning does regular lisp compilers do?
From: rpw3@rpw3.org (Rob Warnock)
Date: Sat, 30 Aug 2008 23:26:02 -0500
Newsgroups: comp.lang.lisp
Message-ID: <WuCdnf7zmaLHgCfVnZ2dnUVZ_qTinZ2d@speakeasy.net>
verec  <verec@mac.com> wrote:
+---------------
| rpw3@rpw3.org (Rob Warnock) wrote:
| [a very detailed explanation of his Lisp on Athlon cache alignment
| experiments]
| 
| That's certainly more than what I expected, but then begs the
| question of how realistic such improved "cached aligned" loops
| might be in the real world:
| 
| Given the byzantine number of bytes per opcode requirement of
| the x86 ISA, it is very likely that most loops will need more
| than 16 bytes of code between the branch-back and the top of
| the loop, thus incurring a systematic "16 bytes fetch cycle"
| miss each and every time through the loop.
+---------------

In my conversations with people who *are* experienced compiler
writers [I'm not, just an amateur there], I've been told that
for algorithmic kernels keeping loops 16-byte-aligned is an
easy way to get a small but measurable improvement. I would
expect that on modern x86 (and x86_64) machines the penalty
for misalignment would be only one or two cycles. Whether that
is important enough to you (generic "you") to affect your buying
decision about which compiler to use is something you'd have
to answer for yourself. But compiler writers *do* worry about
such things.
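
To put a rough number on the concern quoted above, here is a small
Python sketch that counts how many aligned 16-byte fetch blocks a loop
body touches. The 16-byte fetch width, the function name
`fetch_blocks`, and the loop sizes are assumptions for illustration
only; real instruction-fetch hardware varies by CPU.

```python
FETCH_BYTES = 16  # assumed instruction-fetch block size, not a measured value


def fetch_blocks(start, length, block=FETCH_BYTES):
    """Count the aligned `block`-byte windows touched by code that
    occupies addresses [start, start + length)."""
    if length <= 0:
        return 0
    first = start // block                 # window holding the first byte
    last = (start + length - 1) // block   # window holding the last byte
    return last - first + 1


# Hypothetical 16-byte loop body, aligned vs. starting 8 bytes into a block.
print(fetch_blocks(0x48B68C20, 16))  # 1
print(fetch_blocks(0x48B68C28, 16))  # 2
```

Under those assumptions a misaligned start costs at most one extra
fetch block, and a loop longer than 16 bytes spans several blocks
however it is aligned.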

+---------------
| As for the VM mapping bit, I was just saying that virtual
| address 48B68C28 (labelled FAST in your example) might be
| mapped to a physical address, say xxxx20, the one the CPU
| actually fetches from, and thus your 16 bytes alignment that
| holds in VM land might just break after translation.
+---------------

Oh, got it. No, actually, that *CAN'T* happen!!!  ;-}  ;-}
The page size for x86 machines is typically 4 KiB, so the
low three hex digits of any virtual address *will* remain
unchanged across the vir2phys translation. That is, 48B68C28
will always go to xxxxxC28. And since cache lines are aligned
within pages, whatever alignment the code has relative to a
16-byte boundary in virtual space, it has the same alignment
in physical space.
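
The page-offset argument above can be checked with a toy translation
in Python. This is only a sketch: `translate` and the frame number
0x12345 are made up for illustration, and a real MMU walks page
tables rather than taking the frame as an argument. The point is just
that the low 12 bits pass through untouched.

```python
PAGE_SIZE = 4096  # 4 KiB pages, as in the text


def translate(vaddr, frame_number, page_size=PAGE_SIZE):
    """Toy vir2phys: swap the virtual page number for a physical
    frame number and keep the in-page offset unchanged."""
    offset = vaddr & (page_size - 1)   # low 12 bits for 4 KiB pages
    return frame_number * page_size + offset


vaddr = 0x48B68C28
paddr = translate(vaddr, frame_number=0x12345)  # hypothetical frame

print(hex(paddr))                 # 0x12345c28
print(vaddr % 16 == paddr % 16)   # True: same position within a 16-byte block
```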

But to answer the unstated related question: yes, there have
been cases historically where the compiler had to pay attention
to page crossings as well as cache line crossings. [E.g., the
infamous early MIPS R4000 bug that showed up if a jump was the
last instruction on a page, "fixed" with a change to the linker!!
The SGI compiler included a "-no_jump_at_eop" option in "ld"
for a while.]


-Rob

-----
Rob Warnock			<rpw3@rpw3.org>
627 26th Avenue			<URL:http://rpw3.org/>
San Mateo, CA 94403		(650)572-2607