Subject: Re: data hygiene [Re: Why is Scheme not a Lisp?]
From: Erik Naggum <>
Date: Fri, 22 Mar 2002 07:12:25 GMT
Newsgroups: comp.lang.lisp
Message-ID: <>

* Matthias Blume
| Suppose you can "spawn" as many sub-threads as you like at any point,
| but following a pointer has a fixed cost.

  Are you trying to make me pretend that spawning a sub-thread is free and
  following pointer has costs?  This is the part I simply reject.

| In the plist case, you need at least 2n time steps to traverse n
| key-value pairs because the fetch operations (mostly CDRs) are all
| data-dependend on their respective predecessors.  In the alist case, one
| thread could "run down" the spine of the alist in roughly n steps,
| spawning a new thread at every CAR.  The subthreads then independently
| check to see if their respective key is the one that we are looking for.
| This latter work can effectively overlap with the guy running down the
| spine, so the total time can be less than in the plist case.

  I see that several people have been working this out on a mythical
  computer, not extant hardware.  I am frankly completely unimpressed with
  the arguments that have been made so far.  One does not optimize
  theoretical code for theoretical computers.

| On real hardware, _if_ the key comparison is cheap (e.g., EQ), then each
| of those subthreads will live only very shortly, so the amount of
| effective parallelism that needs to be supported by the hardware is
| bounded (and rather small).

  Well, at least they have to live long enough to grab the cons cell
  pointed to by the car and the contents o the car of that cell.  Now, it
  is all well and good to have a CPU that can pipeline anything you want,
  but we are actually trying to talk to the same _memory_, right?  And that
  is not the memory in either L1 or L2 cache, right?  How many requests can
  the _memory_ pipeline from data that are going to be very close?  I mean,
  my CPU has 32-byte cache lines, meaning it will usually not do any less
  than 32-byte fetches, but the crucial thing is to prefetch _both_ car and
  cdr of all cells visited in order to obtain maximal memory performance.
  To make this work, one would have to do massive analysis of the addresses
  that are actually fetched.  This is actually possible as long as one has
  to fetch from real memory instead of L1 or L2 caches, or even L3 caches.

  In order to make this a testable condition, there is no point in scanning
  some short list.  The list has to be at least 1 million entries in length
  for the test conditions to ensure that all data cache lines are flushed.
  To make that work, there would be some savings in cdr'ing ahead to fill
  half the L1 cache so the cons cells, which I assumme will occupy 2
  adjacent words, will be in the L1 cache when the car visits it.  However,
  in total, there will be exactly as many memory references with a plist
  and an alist, because they use exactly as many cons cells.

  I maintain that on, e.g., the IA32, which I have spent some time learning
  how to optimize the living daylight out of, you will never see any
  difference in performance between an alist and a plist.  There is no
  magic in this, just the willingness not to "suppose" one can do something
  that no processor actually does.

| Obviously, as Joe's tests demonstrate, getting any real advantage out of
| this on a real system is hard.  It would require tight coding, hardware
| that can actually do something like the above, at least in some sense,
| and would probably still not matter for relatively short lists.

  None of the compiled code I have seen has used prefetch instructions, and
  the pipeline on, e.g., IA32, is not long enough to see these actualy
  memory fetch coming.  Therefore, current tests exhibit nothing of the
  kind we would like to prove in this exercise.

  In order to measure the impact of any savings, the best approach is
  probably to have a forerunner that visits both cars and cdrs in an alist
  and the cdrs in a plist.  It should probably be running at _least_ 8 cons
  cells ahead of their use, but something like 100 will probably make more
  sense, in order to give the memory time to fill up the cache lines.  This
  also indicates that there is very significant savings to be had from
  starting the storage of any object on cache line boundaries.

  But suppose your list is short enough that all cons cells can fit in L1
  cache.  The best optimization approach is simply to call length on it to
  move all the cons cells into L1 cache, and then to make sure that they
  stay there between every time you need to scan the list.  If you manage
  to expire the cache lines between calls to either assoc or getf, there is
  no point in this exercise at all, since the next time it is needed, you
  will have to go out and prefetch it, anyway, and since both assoc and
  getf must return the first occurrence, the likelihood of prefetching
  memory that you will never use is pretty high.

| By the way, this is all pretty silly anyway.  If I had to implement a
| dictionary data structure where the number of keys is large enough so
| that the above would matter, then I would *definitely* use something else
| that has better theoretical (and practical) complexity.

  Yup.  I have been trying to argue that assoc is more expensive because it
  is more way more general and may do a _lot_ more work than getf, which
  could be as simple as my plist-get.  To get a user call to assoc be as
  tight as my alist-get requires a lot of effort in both compiler and user

  I think, however, that I have demonstrated exhaustively that there is no
  performance difference between alist and plist and that performance
  should never enter the picture when deciding which to use.  For instance,
  performance is likely to be good for get-properties and destructuring
  using keyword arguments, neither of which have a parallell for alists.

  In a fight against something, the fight has value, victory has none.
  In a fight for something, the fight is a loss, victory merely relief.