From ... Path: archiver1.google.com!news1.google.com!newsfeed.stanford.edu!newsfeeds.belnet.be!news.belnet.be!news2.kpn.net!news.kpn.net!nslave.kpnqwest.net!nloc.kpnqwest.net!nmaster.kpnqwest.net!nreader2.kpnqwest.net.POSTED!not-for-mail Newsgroups: comp.lang.lisp Subject: Re: data hygiene [Re: Why is Scheme not a Lisp?] References: <87u1rkl068.fsf@charter.net> <87wuwg1b05.fsf@photino.sid.rice.edu> <87ofhrc3ed.fsf@charter.net> <874rjj1ve1.fsf@photino.sid.rice.edu> <87it7yz2sf.fsf@photino.sid.rice.edu> <87d6y5heq2.fsf@becket.becket.net> <87elilwsnx.fsf@photino.sid.rice.edu> <87u1rfn07o.fsf@becket.becket.net> <87k7sbtzp5.fsf@photino.sid.rice.edu> <871yej1v0h.fsf@becket.becket.net> <87y9grsf <3225739210766179@naggum.net> <3C9A6917.FFC1C4ED@motorola.com> <3225746969252089@naggum.net> <3C9A89E4.F6CAF696@interaccess.com> <3225770541321057@naggum.net> <3C9B17D4.E3DE87AE@interaccess.com> <3225858821399981@naggum.net> Mail-Copies-To: never From: Erik Naggum Message-ID: <3225880811573244@naggum.net> Organization: Naggum Software, Oslo, Norway Lines: 79 User-Agent: Gnus/5.09 (Gnus v5.9.0) Emacs/21.1 MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Date: Sat, 23 Mar 2002 13:59:59 GMT X-Complaints-To: newsmaster@KPNQwest.no X-Trace: nreader2.kpnqwest.net 1016891999 193.71.199.50 (Sat, 23 Mar 2002 14:59:59 MET) NNTP-Posting-Date: Sat, 23 Mar 2002 14:59:59 MET Xref: archiver1.google.com comp.lang.lisp:29955 * Matthias Blume | No, that was not his purpose. Paul showed an algorithmic optimization | for a machine that can have 3 loads independently in flight (so that | they do not have to wait for each other). Nothing about infinite | registers and other such nonsense. They are not needed. I have spent some time with the documentation for my processor, and it has four-way concurrent access to the level 2 cache, request-response access to syste memory, and instantaneous level 1 cache access. As near as I can tell, my simple alist-get function exercises the kinds of optimizations that Paul has described and the branch prediction is done the right way with Allegro CL's code, so there should be no pipeline flushes at all. As near as I can tell, the code I have written in alist-get-expand, as compiled by Allegro CL cannot be improved upon. Now, I have tried to optimize Paul's code similarly, but it is in dire need of registers in order to keep from accessing the level 1 cache to swap registers with memory. Paul's code is 27% slower than my simple alist-get-expand. | Now, there *are* machines where multiple loads (but not infinitely many) | can be independently in flight. There are literally thousands of people | in the world doing research of how to efficiently program for such | machines. Very nice. If Paul is one of them, his code should have run faster than mine. It does not. | In the beginning you were, IIRC, demonstrating that you need the same | number of CARs and CDRs in either case -- which (while certainly correct) | showed that you didn't even understand at that time what Paul was talking | about. (Well, at least you were hiding it well... :-) I have tried to argue that the same number of memory accesses are needed. Paul did a much better job of explaining his case after it became clear that he was not actually interested in optimizing his code, but instead optimizing the algorithm. The result is slower execution. I remain resoundingly unimpressed. | Paul was saying that there are machines where this can be an advantage. | You say "on my machine it doesn't". No, I have asked for information on which machines it does apply since I cannot find any such machine. Please pay attention if you want to accuse people of ill will and stupidity, will you? Otherwise, you are the offender. | Fine, neither of you has contradicted the other. A counterexample is | does not disprove an existentially quantified statement. Had Paul said | "this will always make a difference", he would have been wrong. But he | didn't. What he has indeed said is that his code should run faster, yet has not shown that it actually does on any extant hardware. However, it appears that the benefits that he talks about are handled at the hardware level in my processor and that his software pipelining works _against_ the hardware because of the register shortage. In other words, it works better to do this in software on CPUs that lack hardware support for pipelining, and that it has no effect at all on machines with enough registers. If there are no processors that can do 3 concurrent memory accesses that do _not_ do sufficient pipeline entirely on their own, this is useful, but if modern processors all do this on their own, there is simply no point in doing software pipelining, and the effect should be noticeable regardless. In particular, there should be _no_ difference in execution time between scanning a list of a million elements (to get rid of the cache) and scanning the cars of a list of a million conses, i.e., heavily optimized member and assoc should run just as quickly on the same list. I would like to see timings on at least one processor where this is the case. | In any case, let's talk about something more useful. How about: Which | implementation of bubblesort runs fastest? I am actually happy that you are so bemused that this beats daytime TV. /// -- In a fight against something, the fight has value, victory has none. In a fight for something, the fight is a loss, victory merely relief.