William D Clinger <email@example.com> wrote:
| Adam Warner quoting Andy Glew:
| > Other companies such as Hewlett-Packard (HP), Intel, and Silicon
| > Graphics (SGI) have chosen to support sequential consistency in
| > their current architectures.
| They did that through a combination of hardware and software
| techniques. In particular, support for sequential consistency
| can involve compilers that emit memory barriers as needed, and are
| careful not to perform certain optimizations that would have been
| transparent on older architectures.
I can't speak for H-P & Intel, but the SGI "Origin" (MIPS/Irix) and
"Altix" (IPF/Linux) ccNUMA systems [with many hundreds of CPUs!!]
*do* provide sequential consistency in the *hardware* -- no software
support needed, and *without* any global snooping -- using
DASH-style directory-based cache coherency [where the global state
of a given cache line is held-in/owned-by the *memory* subsystem(s),
not the CPU(s)].
 After boot time, that is. At boot time the O/S is of course
required to set up the cache system and the global memory
switching fabric to "do the right thing" later, at runtime.
 There is of course "local" snooping in the front-side busses
of each CPU by the cache-directory system, to turn local cache
misses (reads) and writes into global cache-line reads and
cache-line write-invalidates, respectively. [To the CPUs the
cache-directory system appears to be "just another CPU" on
the same front-side bus.]
| You should read Boehm's paper. He used to work at SGI, now works
| at HP, and is quite familiar with their architectures and compilers.
Yes, but AFAIK the only compiler tweaks [other than for pure
performance] were to fix bugs in the CPUs [e.g., the famous
"branch from the last word of a page" brokenness in R4000].
This *has* to be the case, actually, since with "Origin" & "Altix"
all of the DMA devices also participate in the global cache coherency,
and most of them are hard-coded silicon, with no "compiler tweaks"
Though note that for the highest-performance DMA devices -- ones
with multiple connections [either physical paths or just multiple
"virtual channels"] into the memory switching fabric, special
ordering considerationsr *were* necessary in a very few cases
where one needed to coerce sequential consistency into a total
order. [And, unfortunately, for a few devices interrupts and
DMA traffic went into the memory switching fabric over different
"virtual channels", and there again...]
Ob. credentials: I worked for ~13 years at SGI designing/debugging
very high performance networking and cluster-interconnect adapters,
and had to deal with this stuff on pretty much a daily basis.
Rob Warnock <firstname.lastname@example.org>
627 26th Avenue <URL:http://rpw3.org/>
San Mateo, CA 94403 (650)572-2607