Newsgroups: comp.lang.lisp,comp.lang.scheme
Subject: Re: Back to character set implementation thinking
From: Erik Naggum
Message-ID: <3226095271716329@naggum.net>
Organization: Naggum Software, Oslo, Norway
Date: Tue, 26 Mar 2002 01:34:19 GMT

* Thomas Bushnell, BSG
| The GNU/Linux world is rapidly converging on using UTF-8 to hold 31-bit
| Unicode values.  Part of the reason it does this is so that existing byte
| streams of Latin-1 characters can (pretty much) be used without
| modification, and it allows "soft conversion" of existing code, which is
| quite easy and thus helps everybody switch.

  UTF-8 is in fact extremely hostile to applications that would otherwise
  have dealt with ISO 8859-1.  The addition of a prefix byte has some very
  serious implications.  UTF-8 is an inefficient and stupid format that
  should never have been proposed.
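The prefix-byte point can be seen directly: every ISO 8859-1 character above U+007F occupies one byte in Latin-1 but two in UTF-8, because UTF-8 introduces a lead (prefix) byte for non-ASCII code points.  A quick illustration in Python (not part of the original post):

```python
# 'é' (U+00E9) is one byte in ISO 8859-1, but the two-byte
# sequence C3 A9 in UTF-8, due to the added lead byte.
text = "héllo"

latin1 = text.encode("latin-1")   # one byte per character
utf8   = text.encode("utf-8")     # 'é' grows to the pair C3 A9

print(len(latin1))                # 5
print(len(utf8))                  # 6
print(utf8.hex(" "))              # 68 c3 a9 6c 6c 6f
```

So a byte stream that was valid Latin-1 is not, in general, valid UTF-8, and vice versa; only the pure-ASCII subset passes through unchanged.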
  However, it has computational elegance in that it is a stateless
  encoding.  I maintain that encoding is stateful regardless of whether it
  is made explicit or not.  I therefore strongly suggest that serious
  users of Unicode employ the compression scheme that has been described
  in Unicode Technical Report #6.  I recommend reading this technical
  report.

  Incidentally, if I could design things all over again, I would most
  probably have used a pure 16-bit character set from the get-go.  None of
  this annoying 7- or 8-bit stuff.  Well, actually, I would have opted for
  more than 16-bit units -- it is way too small.  I think I would have
  wanted the smallest storage unit of a computer to be 20 bits wide.  That
  would have allowed addressing of 4G of today's bytes with only 20 bits.
  But I digress...

| So even if strings are "compressed" this way, they are not UTF-8.
| That's Right Out.  They are just direct UCS values.  Procedures like
| string-set! therefore might have to inflate (and thus copy) the entire
| string if a value outside the range is stored.  But that's ok with me;
| I don't think it's a serious lose.

  There is some value to the C/Unix concept of a string as a small stream.
  Most parsing of strings needs to proceed from start to end, so there is
  no point in optimizing them for direct access.  However, a string would
  then be different from a vector of characters.  It would, conceptually,
  be more like a list of characters, but with a more compact encoding, of
  course.  Emacs MULE, with all its horrible faults, has taken a stream
  approach to character sequences and then added direct access into it,
  which has become amazingly expensive.  I believe that trying to make
  "string" both a stream and a vector at the same time is futile and only
  leads to very serious problems.
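The cost of bolting direct access onto a stream encoding can be made concrete: indexing into a variable-width byte sequence means scanning every character before the one you want, so "vector-style" access is O(n) rather than O(1).  A sketch in Python with a hypothetical helper (not from the original post), using UTF-8 as the variable-width example:

```python
# Why direct (vector-style) access into a variable-width stream
# encoding is expensive: reaching character n in UTF-8 bytes means
# inspecting the lead byte of every character before it -- O(n).
# utf8_char_at is a hypothetical helper for illustration only.
def utf8_char_at(buf: bytes, n: int) -> str:
    i = 0
    for _ in range(n):               # skip n whole characters
        lead = buf[i]
        if lead < 0x80:   i += 1     # 1-byte sequence (ASCII)
        elif lead < 0xE0: i += 2     # 2-byte sequence
        elif lead < 0xF0: i += 3     # 3-byte sequence
        else:             i += 4     # 4-byte sequence
    # collect the continuation bytes of the character at position n
    end = i + 1
    while end < len(buf) and (buf[end] & 0xC0) == 0x80:
        end += 1
    return buf[i:end].decode("utf-8")

buf = "naïve".encode("utf-8")        # 5 characters, 6 bytes
print(utf8_char_at(buf, 2))          # ï
```

A fixed-width vector of characters gives O(1) indexing but wastes space; a list-like stream is compact but sequential.  Asking one representation to be both at once is exactly the tension described above.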
  The default representation of a string should be a stream, not a vector,
  and accessors should use the stream, such as with
  make-string-{input,output}-stream, with new operators like dostring,
  instead of trying to use the string as a vector when it clearly is not.
  The character concept needs to be able to accommodate this, too.  Such
  pervasive changes are of course not free.

| Ok, then the second question is about combining characters.  Level 1
| support is really not appropriate here.  It would be nice to support
| Level 3.  But perhaps Level 2 with Hangul Jamo characters [are those
| required for Level 2?] would be good enough.

  Level 2 requires every other combining character except Hangul Jamo.

| It seems to me that it's most appropriate to use Normalization Form D.

  I agree for the streams approach.  I think it is important to make sure
  that there is a single code for all character sequences in the stream
  when it is converted to a vector.  The private use space should be used
  for these things, and a mapping to and from character sequences should
  be maintained such that if a private use character is queried for its
  properties, those of the character sequence would be returned.

| Or is that crazy?  It has the advantage of holding all the Level 3
| values in a consistent way.  (Since precombined characters do not exist
| for all possibilities, Normalization Form C results in some characters
| precombined and some not, right?)

  Correct.

| And finally, should the Lisp/Scheme "character" data type refer to a
| single UCS code point, or should it refer to a base character together
| with all the combining characters that are attached to it?

  Primarily the code point, but both, effectively, by using the private
  use space as outlined above.

///
-- 
  In a fight against something, the fight has value, victory has none.
  In a fight for something, the fight is a loss, victory merely relief.
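The point confirmed above -- that Normalization Form C leaves some sequences uncombined because precomposed characters do not exist for every base+combining pair -- is easy to check.  A quick illustration in Python using the standard unicodedata module (not part of the original exchange):

```python
import unicodedata

# e + combining acute: a precomposed character (U+00E9) exists,
# so NFC collapses the pair to a single code point.
print(len(unicodedata.normalize("NFC", "e\u0301")))   # 1

# q + combining acute: no precomposed character exists, so NFC
# must leave the two-code-point sequence as it is.
print(len(unicodedata.normalize("NFC", "q\u0301")))   # 2

# NFD goes the other way and fully decomposes precomposed forms.
print(len(unicodedata.normalize("NFD", "\u00e9")))    # 2
```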