Subject: Re: Wide character implementation
From: Erik Naggum <erik@naggum.net>
Date: Sun, 24 Mar 2002 06:51:53 GMT
Newsgroups: comp.lang.lisp,comp.lang.scheme
Message-ID: <3225941523389213@naggum.net>

* tb+usenet@becket.net (Thomas Bushnell, BSG)
| Should the Scheme/CL type "character" hold Unicode characters, or
| Unicode glyphs?  (It seems clear to me that it should hold characters,
| but I might be thinking about it poorly.)

  There are no Unicode glyphs.  This properly refers to the equivalence of
  a sequence of characters starting with a base character and optionally
  followed by combining characters, and "precomposed" characters.  This is
  the
  canonical-equivalence of character sequences.  A processor of Unicode
  text is allowed to replace any character sequence with any of its
  canonically-equivalent character sequences.  It is in this regard that an
  application may want to request a particular composite character either
  as one character or a character sequence, and may decide to examine each
  coded character element individually or as an interpreted character.
  These constitute three different levels of interpretation that it must be
  possible to specify.  Since an application is explicitly permitted to
  choose any of the canonical-equivalent character sequences for a
  character, the only reasonable approach is to normalize characters into a
  known internal form.
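
  To make the equivalence concrete, here is a minimal sketch in Python
  (chosen only because its built-in unicodedata module makes the point
  visible; any serious Unicode library offers the same operations) of a
  precomposed character, a canonically-equivalent character sequence, and
  normalization into a known internal form:

```python
import unicodedata

# "é" can be coded as one precomposed character or as a base character
# followed by a combining acute accent; the two are canonically
# equivalent, but they are different code-point sequences.
precomposed = "\u00e9"    # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"    # "e" + COMBINING ACUTE ACCENT

assert precomposed != decomposed

# Normalizing into a known internal form (NFC here; NFD works equally
# well) makes canonically-equivalent sequences compare equal.
assert unicodedata.normalize("NFC", decomposed) == precomposed
assert unicodedata.normalize("NFD", precomposed) == decomposed
```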

  There is one crucial restriction on the ability to use equivalent
  character sequences.  ISO 10646 defines implementation levels 1, 2 and 3
  that, respectively, prohibit all combining characters, allow most
  combining characters, and allow all combining characters.  This is a very
  important part of the whole Unicode effort, but Unicode has elected to
  refer to ISO 10646 for this, instead of adopting it.  From my personal
  communication with high-ranking officials in the Unicode consortium, this
  is a political decision, not a technical one, because it was feared that
  implementors that would be happy with trivial character-to-glyph mapping
  software (such as a conflation of character and glyph concepts and fonts
  that support this conflation), especially in the Latin script cultures,
  would simply drop support for the more complex usage of the Latin script
  and would fail to implement, e.g., Greek properly.  Far from acting as
  an enabling technology, the full set of equivalences, it was feared,
  would simply go unimplemented, and thus fail to enable the international
  support that was so sought after.  ISO 10646, on the other hand, has
  realized that implementors will need time to get all this right, and may
  choose to defer implementation of Unicode entirely if they are not able
  to do it stepwise.  ISO 10646 Level 1 is intended to be workable for a
  large number of uses, while Level 3 is felt to offer no advantage as a
  requirement until languages that need far more than composition and
  decomposition are to be fully supported.  I concur strongly with this.

  The character-to-glyph mapping is fraught with problems.  One possible
  way to do this is actually to use the large private use areas to build
  glyphs and then internally use only non-combining characters.  The level
  of dynamism in the character coding and character-to-glyph mapping here
  is so much more difficult to get right that the canonical-equivalent
  sequences
  of characters (which is a fairly simple table-lookup process) pales in
  comparison.  That is, _if_ you allow combining characters, actually being
  able to display them and reason about them (such as computing widths or
  dealing with character properties of the implicit base character or
  converting their case) is far more difficult than decomposing and
  composing characters.
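
  A sketch of this point (again Python's unicodedata, purely as an
  illustration): reasoning about a combining character sequence means
  looking through the combining marks to the implicit base character,
  whereas composing first reduces the problem to a single-character
  lookup:

```python
import unicodedata

# A combining character sequence: base letter + combining ring above,
# canonically equivalent to "å" (U+00E5).
seq = "a\u030a"

# Character properties of the sequence live on the implicit base
# character; the trailing elements are combining marks.
base = seq[0]
marks = [c for c in seq[1:] if unicodedata.combining(c)]
assert unicodedata.category(base) == "Ll"   # lowercase letter
assert len(marks) == 1

# Case conversion must treat the sequence as a whole; here an
# element-wise upcase happens to be correct, but that is not a given
# for every script.
assert seq.upper() == "A\u030a"

# Composing first turns all of the above into a simple table lookup
# on one character.
assert unicodedata.normalize("NFC", seq) == "\u00e5"
```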

  As for the scary effect of "variable length" -- if you do not like it,
  canonicalize the input stream.  This really is an isolatable non-problem.
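
  Canonicalizing at the boundary amounts to one call where text enters
  the system; a sketch in Python, with read_canonical a hypothetical
  helper name:

```python
import io
import unicodedata

def read_canonical(stream):
    # Canonicalize text as it enters the system; everything downstream
    # then sees exactly one representation per equivalence class.
    return unicodedata.normalize("NFC", stream.read())

text = read_canonical(io.StringIO("Ange\u0301lique"))
assert text == "Ang\u00e9lique"   # "e" + combining acute became U+00E9
assert len(text) == 9             # one element per composed character
```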
  
///
-- 
  In a fight against something, the fight has value, victory has none.
  In a fight for something, the fight is a loss, victory merely relief.