Re: How much tuning does regular lisp compilers do? — Rob Warnock Lisp usenet archive

Andreas Davour  <anteRUN@updateLIKE.uu.HELLse> wrote:
+---------------
| I've just spent three days in an application tuning seminar, learning
| everything about cache lines, cache hits, inlining functions, unrolling
| loops, openMP and more of that kind of techniques. ...
+---------------

The only thing I can contribute to that is:

1. Compiled CLs are not always careful about placing functions
   on optimal cache line boundaries.

2. Item #1 can make a *significant* difference in runtime.

For example, in CMUCL, consider these successive trials:

    cmu> (time (dotimes (i 2000000000)))
    ; Compiling LAMBDA NIL: 
    ; Compiling Top-Level Form: 

    ; Evaluation took:
    ;   4.45f0 seconds of real time
    ;   4.374417f0 seconds of user run time
    ;   4.04f-4 seconds of system run time
    ;   8,243,655,669 CPU cycles
    ;   0 page faults and
    ;   0 bytes consed.
    ; 
    NIL
    cmu> (time (dotimes (i 2000000000)))
    ; Compiling LAMBDA NIL: 
    ; Compiling Top-Level Form: 

    ; Evaluation took:
    ;   4.42f0 seconds of real time
    ;   4.373149f0 seconds of user run time
    ;   5.6f-5 seconds of system run time
    ;   8,200,990,023 CPU cycles
    ;   0 page faults and
    ;   0 bytes consed.
    ; 
    NIL
    cmu> (time (dotimes (i 2000000000)))
    ; Compiling LAMBDA NIL: 
    ; Compiling Top-Level Form: 

    ; Evaluation took:
    ;   2.95f0 seconds of real time
    ;   2.919542f0 seconds of user run time
    ;   0.002418f0 seconds of system run time
    ;   5,459,768,540 CPU cycles
    ;   0 page faults and
    ;   0 bytes consed.
    ; 
    NIL
    cmu> (time (dotimes (i 2000000000)))
    ; Compiling LAMBDA NIL: 
    ; Compiling Top-Level Form: 

    ; Evaluation took:
    ;   4.5f0 seconds of real time
    ;   4.365597f0 seconds of user run time
    ;   0.007846f0 seconds of system run time
    ;   8,362,612,228 CPU cycles
    ;   0 page faults and
    ;   0 bytes consed.
    ; 
    NIL
    cmu>

Note that the same loop took ~4.1, ~4.1, ~2.7, & ~4.2 CPU cycles/iteration.
Why? Because the third compilation landed in a spot better suited to
the CPU's branch prediction (Athlon-32, as it happens).
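
The per-iteration figures come from dividing each reported cycle count
by the 2,000,000,000 iterations. A minimal sketch of that arithmetic
follows; the names *ITERATIONS*, *CYCLE-COUNTS*, CYCLES-PER-ITERATION,
and REPORT-CYCLES-PER-ITERATION are made up for illustration and are not
part of CMUCL:

```lisp
;; Cycle counts copied from the four TIME reports above.
;; All names here are hypothetical helpers, not CMUCL functions.
(defparameter *iterations* 2000000000)

(defparameter *cycle-counts*
  '(8243655669 8200990023 5459768540 8362612228))

(defun cycles-per-iteration (cycles iterations)
  "Return CYCLES / ITERATIONS as a double-float."
  (/ (float cycles 1d0) iterations))

(defun report-cycles-per-iteration ()
  "Print one line per trial with its cycles/iteration, to two decimals."
  (loop for cycles in *cycle-counts*
        for trial from 1
        do (format t "trial ~D: ~,2F cycles/iteration~%"
                   trial
                   (cycles-per-iteration cycles *iterations*))))
```

Calling (report-cycles-per-iteration) gives roughly 4.12, 4.10, 2.73,
and 4.18, so the third trial spent about two thirds of the cycles per
iteration that the others did.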

I know of no way to tell CMUCL how to align generated code.
YMMV with other implementations.
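
One way to check whether placement (rather than timing noise) is the
cause would be to compare repeated runs of a single compiled function
against runs that recompile each time, as the REPL did above. This is
only a sketch: *LOOP-FORM*, TIME-SAME-CODE, and TIME-FRESH-CODE are
hypothetical names, and it assumes that each call to COMPILE yields a
new code object that may end up at a different address, which is
implementation-dependent.

```lisp
;; Hypothetical helpers; only COMPILE, FUNCALL, DOTIMES, and TIME are
;; standard Common Lisp. The loop is the same one timed above.
(defparameter *loop-form*
  '(lambda () (dotimes (i 2000000000))))

(defun time-same-code (trials)
  "Compile the loop once, then time that one code object TRIALS times."
  (let ((fn (compile nil *loop-form*)))
    (dotimes (trial trials)
      (time (funcall fn)))))

(defun time-fresh-code (trials)
  "Recompile the loop before every timing, so each trial runs new code."
  (dotimes (trial trials)
    (let ((fn (compile nil *loop-form*)))
      (time (funcall fn)))))
```

If (time-same-code 4) reports steady times while (time-fresh-code 4)
jumps around the way the transcript does, that would point at code
placement rather than at the machine being busy. It would not by itself
say whether cache lines or branch prediction is the mechanism.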


-Rob

-----
Rob Warnock			<rpw3@rpw3.org>
627 26th Avenue			<URL:http://rpw3.org/>
San Mateo, CA 94403		(650)572-2607