From: Erik Naggum
Subject: Re: 8-bit input (or, "Perl attacks on non-English language communities!")
Date: 1999/02/11
Message-ID: <3127703943282468@naggum.no>
X-Deja-AN: 443117437
References: <79p88l$8h2$1@news.u-bordeaux.fr> <873e4fbawi.fsf@2xtreme.net> <36C20F62.3FDB5404@elwood.com>
mail-copies-to: never
Organization: Naggum Software; +47 8800 8879; http://www.naggum.no
Newsgroups: comp.lang.lisp

* "Howard R. Stearns"
| One that is one of my pet peeves.  A while back in "IEEE Computer"
| magazine, some yahoo decided that we don't need to use 16 bits to handle
| international characters.  Instead, we usually only need 8 bits at a
| time, and that we would get better performance by using 8-bit characters
| for everything along with a locally understood "current char set".  They
| eventually printed a "letter to the editor" I sent, and the whole thing
| bugs me enough that I'm going to repeat it here.

the first ISO 10646 draft actually had this feature for character sets that only need 8 bits, complete with a "High Octet Prefix", which was intended as a stateful encoding that would _never_ be useful in memory.  this was a vastly superior coding scheme to UTF-8, which unfortunately penalizes everybody outside of the United States.

I actually think UTF-8 is one of the least intelligent solutions to this problem around: it thwarts the whole effort of the Unicode Consortium and has already proven to be a reason why Unicode is not catching on.  instead of this stupid encoding, only a few system libraries need to be updated to understand the UCS signature, #xFEFF, at the start of strings or streams.  it can even be byte-swapped without loss of information.  I don't think two bytes is a great loss, but the stateless morons in New Jersey couldn't be bothered to figure something like this out.  argh!

when the UCS signature becomes widespread, any string or stream can be viewed initially as a byte sequence, and upon first access can easily be inspected for its true nature; the object can then change class into whatever the appropriate class should be, and might even be byteswapped if appropriate.  this is not at all rocket science.  (a small sketch of this idea appears below.)

I think the UCS signature is among the smarter things in Unicode.  that #xFFFE is an invalid code and #xFEFF is a zero-width no-break space are signs of a brilliant mind at work.  I don't know who invented this, but I _do_ know that UTF-8 is a New Jersey-ism.

| One issue you bring up that is not covered in the letter is whether speed
| is affected in Lisp by simultaneously supporting BOTH ASCII and Unicode.

there is actually a lot of evidence that UTF-8 slows things down because it has to be translated, but UTF-16 can be processed faster than ISO 8859-1 on most modern computers because the memory access is simpler with 16-bit units than with 8-bit units.  odd addresses are not free.

| It is not quite correct to refer to Unicode as a 16-bit standard.
| Unicode actually uses a 32-bit space.  It is one of the more popular
| subsets of Unicode, UCS-2, that happens to fit in 16 bits.

well, Unicode 1.0 was 16 bits, but Unicode is now 16 bits + 20 bits' worth of extended space, encoded as two 16-bit units using 1024 high and 1024 low surrogate codes from the set of 16-bit codes (1024 * 1024 = 2^20 extra characters).  ISO 10646 is a 31-bit character set standard without any of this stupid hi-lo cruft.

your point about the distinction between internal and external formats is generally lost on people who have never seen the concepts provided by the READ and WRITE functions in Common Lisp.
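
to make the sniffing idea concrete, here is a minimal sketch in Common Lisp.  the names SNIFF-UCS-SIGNATURE and OCTETS-TO-STRING are made up for this message, not taken from any library, and the sketch assumes UCS-2 data without surrogates and a Lisp whose CODE-CHAR accepts 16-bit codes:

  (defun sniff-ucs-signature (octets)
    ;; look at the first two octets for the UCS signature #xFEFF in
    ;; either byte order; anything else is treated as 8-bit data.
    (cond ((< (length octets) 2) :8-bit)
          ((and (= (aref octets 0) #xFE) (= (aref octets 1) #xFF)) :ucs-2be)
          ((and (= (aref octets 0) #xFF) (= (aref octets 1) #xFE)) :ucs-2le)
          (t :8-bit)))

  (defun octets-to-string (octets)
    ;; decode OCTETS into a string, byteswapping when the signature says
    ;; the data is little-endian.  8-bit data is taken to be ISO 8859-1,
    ;; whose codes coincide with the first 256 UCS codes.
    (flet ((decode-16 (high-offset)
             (loop with string = (make-string (floor (- (length octets) 2) 2))
                   for i from 2 below (1- (length octets)) by 2
                   for j from 0
                   do (setf (char string j)
                            (code-char (+ (* 256 (aref octets (+ i high-offset)))
                                          (aref octets (+ i (- 1 high-offset))))))
                   finally (return string))))
      (ecase (sniff-ucs-signature octets)
        (:ucs-2be (decode-16 0))
        (:ucs-2le (decode-16 1))
        (:8-bit   (map 'string #'code-char octets)))))

a fancier version would make the buffer a CLOS object and CHANGE-CLASS it into the appropriate string class the first time it is accessed, as described above; the sniffing itself stays the same.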
Lispers are used to dealing with different internal and external representations, and therefore have a strong propensity to understand much more complex issues than people who are likely to argue in favor of writing raw bytes from memory out to files as a form of "interchange", and who deal with all text as _strings_ and repeatedly maul them with regexps.

my experience is that there's no point in trying to argue with people who don't understand the concepts of internal and external representation -- if you want to reach them at all, that's where you have to start, but be prepared for a paradigm shift happening in your audience's brain.  (it has been instructive to see how people suddenly grasp that a date is always read and written in ISO 8601 format although the machine actually deals with it as a large integer, the number of seconds since an epoch.  Unix folks who are used to seeing the number _or_ a hacked-up crufty version of `ctime' output are truly amazed by this.)

if you can explain how and why conflating internal and external representation is bad karma, you can usually watch people get a serious case of revelation and their coding style changes there and then.  but just criticizing their choice of an internal-friendly external coding doesn't ring a bell.

#:Erik
-- 
  Y2K conversion simplified: Januark, Februark, March, April, Mak, June,
  Julk, August, September, October, November, December.