Subject: Re: data hygiene [Re: Why is Scheme not a Lisp?]
From: Erik Naggum <erik@naggum.net>
Date: Fri, 22 Mar 2002 10:50:22 GMT
Newsgroups: comp.lang.lisp
Message-ID: <3225783034450486@naggum.net>

* Duane Rettig
| What is critical is that the machine has the ability to do three memory
| references at once.

  That is indeed a critical piece of information.  I have reasoned based on
  explicit prefetch instructions because the pipeline is not long enough to
  _sustain_ three outstanding requests, especially considering the
  destructive effect of taken branches on the pipeline, and considering the
  amount of time it takes to fetch from external memory compared to the
  on-chip L1 cache, which will hold the entire cons cell in the same cache
  line, unless allocation is really screwed up.  Unrolling loops has the
  significant disadvantage that execution needs more memory references.
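
  To make this concrete, here is a minimal sketch in C of the kind of
  explicit prefetch I am reasoning from, using GCC's __builtin_prefetch;
  the cons layout, the function, and the one-cell prefetch distance are
  illustrative assumptions on my part, not any actual Lisp runtime:

      /* A cons cell: car and cdr sit in the same cache line when
         allocation behaves, so touching one makes the other free.  */
      struct cons { void *car; struct cons *cdr; };

      void *
      walk (struct cons *c)
      {
          void *last = 0;
          while (c != 0) {
              __builtin_prefetch (c->cdr);  /* request the next cell early;
                                               a hint only, never faults  */
              last = c->car;                /* same cache line as the cdr */
              c = c->cdr;
          }
          return last;
      }

  A real loop would have to prefetch more than one cell ahead to hide the
  latency of external memory, which is exactly where sustaining three
  outstanding requests becomes the problem.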

  I have severe problems seeing how you can possibly write code that keeps
  three requests outstanding at all times.  Not to mention the severe
  shortage of registers on IA32.  Are there any CPUs that can _actually_
  achieve this?  If none, it is an interesting exercise in How Things
  Should Be, but not in optimizing code for real processors.  Sadly, I am
  far more interested in actual CPUs than imaginary ones, but since
  acquiring a working knowledge of a processor family is pretty much
  depth-first, I have working knowledge of only one current architecture,
  and I am probably not up to speed on what IA32 can do, either.  From what
  I can find in the documentation, however, I am unable to write code
  for this architecture that will _actually_ achieve more than two
  concurrent accesses to implement assoc (or my simplified alist-get) even
  though the CPU is supposed to be able to handle four.
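
  For reference, the shape under discussion is the classic assoc loop,
  sketched here in C; the names and the (key . value) pair layout are my
  illustrative reading of it, not Duane's code and not my simplified
  alist-get:

      struct cons { void *car; struct cons *cdr; };

      void *
      alist_get (struct cons *alist, void *key)      /* eq-style test  */
      {
          for (; alist != 0; alist = alist->cdr) {   /* load the cdr   */
              struct cons *pair = alist->car;        /* load the car   */
              if (pair != 0 && pair->car == key)     /* load the key   */
                  return pair->cdr;                  /* load the value */
          }
          return 0;
      }

  Each element costs two loads on the spine cell and at least one into
  the pair, and those are the accesses I cannot get the CPU to keep more
  than two of in flight.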

  There are two obvious reasons for this: (1) memory accesses that end up
  with an L1 cache hit are not worth optimizing further, and (2) if you
  have grabbed either the car or the cdr of a cons cell, the other access
  is free.  If you have to hit the L2 cache or, worse, request it from
  real memory, you still get a 64-bit unit back from the memory subsystem.
  In other words, if you keep moving two cdrs ahead in a plist, the car
  comes along for free, whereas the alist must access both the car and the
  cdr of the given cell.  Yes, there are savings to be had in pipelining
  this, but it has no practical edge over the plist case, because we are
  not optimizing just the processor pipeline: we have to request the
  memory and wait for something that takes real time to arrive before any
  of this matters.
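
  The corresponding plist walk, under the same illustrative layout and
  assuming a well-formed plist of even length, shows where the free car
  comes from:

      struct cons { void *car; struct cons *cdr; };

      void *
      plist_get (struct cons *plist, void *indicator)
      {
          for (; plist != 0; plist = plist->cdr->cdr) {  /* two cdrs ahead */
              if (plist->car == indicator)   /* free: shares a cache line
                                                with the cdr just loaded  */
                  return plist->cdr->car;    /* the value cell's car      */
          }
          return 0;
      }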

  I sort of regret having to be incredibly concrete and low-level, but that
  is where these things actually happen.  It is very convenient to optimize
  some abstract access model, but it has to be implemented, and it has to
  be possible to generate instructions for it, before any of it actually
  matters.  I have yet to see that demonstrated.

  I guess what I keep saying is that optimizing for a real processor is not
  some theoretical task that can ignore the actual processor.

  You alluded to measurements that indicated a material difference, Duane,
  and I would just love to see them, along with the code that produced
  them, on various processors.

///
-- 
  In a fight against something, the fight has value, victory has none.
  In a fight for something, the fight is a loss, victory merely relief.