Subject: Re: How to avoid GC in tight numeric test loop? (CMUCL)
From: rpw3@rpw3.org (Rob Warnock)
Date: Sun, 22 Jun 2008 21:31:39 -0500
Newsgroups: comp.lang.lisp
Message-ID: <rvqdnXIX8IaWlsLVnZ2dnUVZ_v7inZ2d@speakeasy.net>
{I think we've probably beaten this one to death, but I'll
reply once final time, trying not to open any new issues!  ;-}  ]

Madhu <enometh@meer.net> wrote:
+---------------
| In <fI6dnZtOUvVvY8HVnZ2dnUVZ_t_inZ2d@speakeasy.net>,
| Rob Warnock <rpw3@rpw3.org> wrote:
| | Madhu <enometh@meer.net> wrote:
| | +---------------
| | | I think The no-consing behaviour you observe [after patching
| | | compiler/float-tran.lisp] is not because of 32bit inlining but
| | | because your values lie in the fixnum range.
| | +---------------
| 
| I thought the presence of the bignum multiply indicated there
| was the potential to cons
+---------------

But as I noted before, BIGNUM::%MULTIPLY is a VOP [a compiler-inlined
code segment, roughly] specifically designed *not* to cons when doing
a 32b x 32b -> 64b multiply [usually entirely in the registers]. You
just have to tease the compiler into generating the case that uses it.
Unfortunately, the currently-released x86 code blocks that in *exactly*
the case when you want it -- (<= (1+ RANDOM-FIXNUM-MAX) ARG (EXPT 2 32)) --
which is what my (DEFTRANSFORM RANDOM) patch fixes.

+---------------
| | But interestingly [the "no, not entirely" case], if you pre-decrement
| | the argument, type the intermediate variable with (UNSIGNED-BYTE 32),
| | and then pass it incremented to RANDOM, the consing goes away again,
| | regardless of the value of *FOO* [provided, of course, that you
| | don't attempt values outside the range (INTEGER 1 4294967296)]:
| 
| Are you sure its not just the compiler playing tricks?
+---------------

Well, yes, of course it's "playing tricks", but it's the well-known
trick of type-propagation.  ;-}  But it does break down if the body of
the loop gets too complex, so one has to use alternatives in that event
[see below].

+---------------
| Could you replace the body of your loop which says:
| 	(random BAR)
| with:
| 	(setq $foo (random BAR))
| after doing a (defvar $foo) ...
| Does it change the TIME results?
+---------------

If you do a DEFVAR per se to hold the result, sure, of *course* it
will then definitely cons, since DEFVAR'd variables are global and
therefore must be fully boxed, and therefore any 29- to 32-bit result
will cons up a bignum. But if you define a *local* variable of type
(UNSIGNED-BYTE 32) [big enough to hold a (RANDOM 4294967296)] and
assign to *it*, then there's still no consing. Starting the example
from scratch again:

    cmu> (load "float-tran-patch")

    ; Loading #P"/u/rpw3/float-tran-patch.x86f".
    T
    cmu> (defvar *foo* (floor (- (expt 2 32) most-positive-fixnum) 2))

    *FOO*
    cmu> *foo*

    1879048192
    cmu> (> *foo* kernel:random-fixnum-max)

    T 
    cmu> (fixnump *foo*)

    NIL
    cmu> (time (let ((bar (1- *foo*)) ; so as not to violate the U32
		     (quux 0))
                 (declare (type (unsigned-byte 32) bar quux)) 
                 (dotimes (i 1000000)
                   (setf quux (random (the (unsigned-byte 32) (1+ bar)))))))
    ; Compiling LAMBDA NIL: 
    ; Compiling Top-Level Form: 

    ; Evaluation took:
    ;   0.03f0 seconds of real time
    ;   0.029242f0 seconds of user run time
    ;   0.0f0 seconds of system run time
    ;   54,873,082 CPU cycles
    ;   0 page faults and
    ;   0 bytes consed.
    ; 
    NIL
    cmu> 

And it's not just that the compiler detects that the value of QUUX is
effectively unused. [I could hear that coming a kilometer away!  ;-}  ]
The same stays true even if the value of RANDOM gets stored outside
the loop, as long as you store it in a type that can preserve the
unboxed (UNSIGNED-BYTE 32) nature, such as in a specialized array --
which, if you'll remember, was the question that Robert Maas was
originally asking that *started* this thread!  ;-}  ;-}

Unfortunately, it seems that the type-propagation gets a little
too hairy for the compiler for the trick I used above -- the
(LET ((BAR (1- *FOO*))) ... (RANDOM (1+ BAR))) -- to work here,
so you have to revert to the ugler alternative version I showed
previously.

    cmu> (defvar *arr* (make-array 1000000 :element-type '(unsigned-byte 32)))

    *ARR*
    cmu> (type-of *arr*)

    (SIMPLE-ARRAY (UNSIGNED-BYTE 32) (1000000))
    cmu> (setf *foo* (expt 2 32)) ; Test the first branch

    4294967296
    cmu> (time (if (= *foo* 4294967296)
		 (dotimes (i 1000000)
		   (declare (type (simple-array (unsigned-byte 32)) *arr*))
		   (setf (aref *arr* i)
			 (random 4294967296))) ; Special-cased in DEFTRANSFORM
		 (let ((bar *foo*))            ; Make u32 in-register copy
		   (declare (type (unsigned-byte 32) bar) 
			    (type (simple-array (unsigned-byte 32)) *arr*))
		   (assert (plusp bar))
		   (dotimes (i 1000000)
		     (setf (aref *arr* i)
			   (random (the (integer 1 4294967295) bar)))))))
    ; Compiling LAMBDA NIL: 
    ; Compiling Top-Level Form: 

    ; Evaluation took:
    ;   0.04f0 seconds of real time
    ;   0.037973f0 seconds of user run time
    ;   0.0f0 seconds of system run time
    ;   78,157,422 CPU cycles
    ;   0 page faults and
    ;   0 bytes consed.
    ; 
    NIL
    cmu> (decf *foo*)   ; And now the second branch: 0 < *foo* < 2^32

    4294967295
    cmu> (time (if (= *foo* 4294967296)
		 (dotimes (i 1000000)
		   (declare (type (simple-array (unsigned-byte 32)) *arr*))
		   (setf (aref *arr* i)
			 (random 4294967296))) ; Special-cased in DEFTRANSFORM
		 (let ((bar *foo*))            ; Make u32 in-register copy
		   (declare (type (unsigned-byte 32) bar) 
			    (type (simple-array (unsigned-byte 32)) *arr*))
		   (assert (plusp bar))
		   (dotimes (i 1000000)
		     (setf (aref *arr* i)
			   (random (the (integer 1 4294967295) bar)))))))
    ; Compiling LAMBDA NIL: 
    ; Compiling Top-Level Form: 

    ; Evaluation took:
    ;   0.05f0 seconds of real time
    ;   0.045365f0 seconds of user run time
    ;   0.0f0 seconds of system run time
    ;   85,205,259 CPU cycles
    ;   0 page faults and
    ;   0 bytes consed.
    ; 
    NIL
    cmu> (subseq *arr* 0 20)   ; Just to prove it's storing...
    #(451431769 309298074 3071506735 1152762304 3559544580 951329956
      4115968948 2011196577 1492186446 1960542499 3327192403 3712452625
      3289349728 680218552 2710617085 988100064 4123503696 835921392
      791426944 1630415614)
    cmu> 

"Look, Ma! No consing!"

[Note: The difference in the number of CPU cycles in each case above
is meaningless. RANDOM varies quite widely in its run time from call
to call, completely swamping any small variation in the code between
the two branches of the IF.]

+---------------
| "I have seen the CONSING" :)
+---------------

I'm sure you have. It's not terribly easy to get rid of.
But you *can* get rid of it, at least in the case of filling
an array with (pseudo-)random values [RM's original question].

+---------------
| [SNIP] just addressing bias now:
| 
| | Does this bias still exist if you use the correct type for
| | RANDOM-CHUNK, which on X86 is (UNSIGNED-BYTE 32)??
| 
| I have not studied the MT code (either in CMUCL or C) closely. but I do
| not see a (REM) being called in the case you are intersted in (for
| arg<2^32) so you should be safe from this bias I was trying to illustrate.
+---------------

Oh, o.k., thanks.

+---------------
| [I believe that the bias in CMUCL's random (in the other case) is well
| bounded and small enough to be acceptable generally, but it is non-zero.
| I have not been able to convince myself that the sampling technique used
| in ranlib.c is much better, though it comes at a bignum cost (hehheh)]
+---------------

Yes, well...  ;-}  ;-}


-Rob

p.s. If you feel a need to reply once more, go ahead. But I think I'm done.

-----
Rob Warnock			<rpw3@rpw3.org>
627 26th Avenue			<URL:http://rpw3.org/>
San Mateo, CA 94403		(650)572-2607