Newsgroups: comp.lang.lisp
Subject: Re: data hygiene [Re: Why is Scheme not a Lisp?]
References: <gat-1203021200230001@eglaptop.jpl.nasa.gov> <87u1rkl068.fsf@charter.net> <87wuwg1b05.fsf@photino.sid.rice.edu> <87ofhrc3ed.fsf@charter.net> <874rjj1ve1.fsf@photino.sid.rice.edu> <b84e9a9f.0203141902.24ce7c85@posting.google.com> <87it7yz2sf.fsf@photino.sid.rice.edu> <sfwr8mlewux.fsf_-_@shell01.TheWorld.com> <87d6y5heq2.fsf@becket.becket.net> <87elilwsnx.fsf@photino.sid.rice.edu> <87u1rfn07o.fsf@becket.becket.net> <87k7sbtzp5.fsf@photino.sid.rice.edu> <871yej1v0h.fsf@becket.becket.net> <87y9grsf <3225739210766179@naggum.net> <3C9A6917.FFC1C4ED@motorola.com> <3225746969252089@naggum.net> <3C9A89E4.F6CAF696@interaccess.com> <3225770541321057@naggum.net> <3C9B17D4.E3DE87AE@interaccess.com> <3225858821399981@naggum.net> <m3g02r5uem.fsf@hanabi.research.bell-labs.com>
Mail-Copies-To: never
From: Erik Naggum <erik@naggum.net>
Message-ID: <3225880811573244@naggum.net>
Organization: Naggum Software, Oslo, Norway
Lines: 79
User-Agent: Gnus/5.09 (Gnus v5.9.0) Emacs/21.1
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Date: Sat, 23 Mar 2002 13:59:59 GMT
X-Complaints-To: newsmaster@KPNQwest.no
X-Trace: nreader2.kpnqwest.net 1016891999 193.71.199.50 (Sat, 23 Mar 2002 14:59:59 MET)
NNTP-Posting-Date: Sat, 23 Mar 2002 14:59:59 MET
Xref: archiver1.google.com comp.lang.lisp:29955

* Matthias Blume
| No, that was not his purpose.  Paul showed an algorithmic optimization
| for a machine that can have 3 loads independently in flight (so that
| they do not have to wait for each other).  Nothing about infinite
| registers and other such nonsense.  They are not needed.

  I have spent some time with the documentation for my processor, and it
  has four-way concurrent access to the level 2 cache, request-response
  access to system memory, and instantaneous level 1 cache access.  As near
  as I can tell, my simple alist-get function exercises the kinds of
  optimizations that Paul has described and the branch prediction is done
  the right way with Allegro CL's code, so there should be no pipeline
  flushes at all.  As near as I can tell, the code I have written in
  alist-get-expand, as compiled by Allegro CL, cannot be improved upon.
  Now, I have tried to optimize Paul's code similarly, but it is in dire
  need of registers in order to keep from accessing the level 1 cache to
  swap registers with memory.  Paul's code is 27% slower than my simple
  alist-get-expand.
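
  To make the comparison concrete, here is a minimal sketch of the two
  styles under discussion.  Neither function is the code actually posted
  in this thread: the names simple-alist-get and pipelined-alist-get are
  hypothetical, and the second is only one plausible reading of "software
  pipelining" for an alist lookup, in which the loads for the next entry
  are started before the current key is compared.

```lisp
;; Hypothetical illustration; not the alist-get or alist-get-expand
;; discussed above, and not Paul's code.

(defun simple-alist-get (key alist)
  "Return the value paired with KEY in ALIST, or NIL.  Plain loop."
  (loop for entry in alist
        when (eq (car entry) key)
          return (cdr entry)))

(defun pipelined-alist-get (key alist)
  "Same result as SIMPLE-ALIST-GET, but the loads for the next entry are
issued before the current key is compared, so they do not wait on the
comparison.  Relies on (CAR NIL) and (CDR NIL) being NIL."
  (let* ((tail alist)
         (entry (car tail))        ; current (key . value) pair
         (entry-key (car entry)))  ; its key, already loaded
    (loop
      (when (null tail) (return nil))
      (let* ((next-tail (cdr tail))        ; load 1: next cons
             (next-entry (car next-tail))  ; load 2: next pair
             (next-key (car next-entry)))  ; load 3: next key
        (when (eq entry-key key)
          (return (cdr entry)))
        (setq tail next-tail
              entry next-entry
              entry-key next-key)))))
```

  The pipelined variant keeps six values live in the loop where the simple
  one needs about three, which is where a register-starved processor would
  be expected to pay for it.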

| Now, there *are* machines where multiple loads (but not infinitely many)
| can be independently in flight.  There are literally thousands of people
| in the world doing research of how to efficiently program for such
| machines.

  Very nice.  If Paul is one of them, his code should have run faster than
  mine.  It does not.

| In the beginning you were, IIRC, demonstrating that you need the same
| number of CARs and CDRs in either case -- which (while certainly correct)
| showed that you didn't even understand at that time what Paul was talking
| about. (Well, at least you were hiding it well... :-)

  I have tried to argue that the same number of memory accesses is needed.
  Paul did a much better job of explaining his case after it became clear
  that he was not actually interested in optimizing his code, but instead
  optimizing the algorithm.  The result is slower execution.  I remain
  resoundingly unimpressed.

| Paul was saying that there are machines where this can be an advantage.
| You say "on my machine it doesn't".

  No, I have asked for information on which machines it does apply since I
  cannot find any such machine.  Please pay attention if you want to accuse
  people of ill will and stupidity, will you?  Otherwise, you are the
  offender.

| Fine, neither of you has contradicted the other.  A counterexample
| does not disprove an existentially quantified statement.  Had Paul said
| "this will always make a difference", he would have been wrong.  But he
| didn't.

  What he has indeed said is that his code should run faster, yet has not
  shown that it actually does on any extant hardware.  However, it appears
  that the benefits that he talks about are handled at the hardware level
  in my processor and that his software pipelining works _against_ the
  hardware because of the register shortage.  In other words, it works
  better to do this in software on CPUs that lack hardware support for
  pipelining, and it has no effect at all on machines with enough
  registers.  If there are processors that can do 3 concurrent memory
  accesses but do _not_ do sufficient pipelining entirely on their own, this
  is useful, but if modern processors all do this on their own, there is
  simply no point in doing software pipelining, and the effect should be
  noticeable regardless.

  In particular, there should be _no_ difference in execution time between
  scanning a list of a million elements (to get rid of the cache) and
  scanning the cars of a list of a million conses, i.e., heavily optimized
  member and assoc should run just as quickly on the same list.  I would
  like to see timings on at least one processor where this is the case.
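
  A rough way to collect such timings is sketched below.  The helper names
  (make-test-lists, time-call, compare-member-and-assoc) are hypothetical,
  the standard member and assoc stand in for "heavily optimized" versions,
  and the sketch makes no claim about what the numbers will be on any
  particular processor.  It searches for a key that is absent, so both
  calls scan all N elements.

```lisp
;; Hypothetical benchmark sketch using only standard Common Lisp.

(defun make-test-lists (n)
  "Return two fresh lists of length N: a flat list of integers, and a list
of conses whose cars are the same integers."
  (values (loop for i below n collect i)
          (loop for i below n collect (cons i nil))))

(defun time-call (thunk)
  "Call THUNK; return its value and the elapsed real time in seconds."
  (let* ((start (get-internal-real-time))
         (value (funcall thunk))
         (end (get-internal-real-time)))
    (values value
            (/ (- end start) internal-time-units-per-second))))

(defun compare-member-and-assoc (n)
  "Return a list (MEMBER-SECONDS ASSOC-SECONDS) for full scans of N elements."
  (multiple-value-bind (flat pairs) (make-test-lists n)
    (multiple-value-bind (m-value m-seconds)
        (time-call (lambda () (member -1 flat)))
      (multiple-value-bind (a-value a-seconds)
          (time-call (lambda () (assoc -1 pairs)))
        ;; -1 is in neither list, so both searches must come back empty.
        (assert (and (null m-value) (null a-value)))
        (list m-seconds a-seconds)))))
```

  Calling (compare-member-and-assoc 1000000) returns the two elapsed times
  as rationals.  A single run is noisy, and freshly consed lists tend to
  sit contiguously in memory, so repeated runs on lists built in different
  ways would be needed before drawing any conclusion.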

| In any case, let's talk about something more useful.  How about: Which
| implementation of bubblesort runs fastest?

  I am actually happy that you are so bemused that this beats daytime TV.

///
-- 
  In a fight against something, the fight has value, victory has none.
  In a fight for something, the fight is a loss, victory merely relief.