Subject: Re: Wide character implementation
From: Erik Naggum <>
Date: Tue, 19 Mar 2002 10:53:48 GMT
Newsgroups: comp.lang.lisp,comp.lang.scheme
Message-ID: <>

* Thomas Bushnell, BSG
| If one uses tagged pointers, then it's easy to implement fixnums as
| ASCII characters efficiently.

  Huh?  No sense this makes.

| But suppose one wants to have the character datatype be 32-bit Unicode
| characters?  Or worse yet, 35-bit Unicode characters?

  Unicode is a 31-bit character set.  The base multilingual plane is 16
  bits wide, and then there is the possibility of 20 more bits encoded in
  two 16-bit values that each carry a value from 0 to 1023, effectively
  (+ (expt 2 20) (- (expt 2 16) 1024 1024)) => 1112064 possible codes in
  this coding scheme, but one does not have to understand the lo- and
  hi-word codes that make up the 20-bit character space.  In effect, you
  need 16 bits.  Therefore, you could represent characters with the
  following bit pattern, with b for bits and c for code.  Fonts are a
  mistake, so they are removed.


  This is useful when the fixnum type tag is 000 for even fixnums and 100
  for odd fixnums, effectively 00 for all fixnums.  This makes
  char-code and code-char a single shift operation.  Of course, char-bits
  and char-font are not supported in this scheme, but if you _really_ have
  to, the upper 4 bits may be used for char-bits.
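
  A minimal sketch of the shift idea, under loudly stated assumptions:
  the layout used here (a 32-bit word, the 16-bit code in the upper half,
  a character tag of #b110) is invented for illustration, as are all the
  function names.  Only the 00 fixnum tag comes from the text.  Words are
  modelled as plain integers.  In this assumed layout char-code is one
  right shift, while code-char needs the tag ORed in after its shift.

```lisp
;; Hypothetical word layout, assumed for illustration only:
;;   character word = (code << 16) | +char-tag+
;;   fixnum word    = (n << 2), so the two low bits are 00
(defconstant +char-tag+ #b110)   ; assumed tag value, not from the post
(defconstant +code-shift+ 16)    ; assumed position of the 16-bit code
(defconstant +fixnum-shift+ 2)   ; two low tag bits, 00 for fixnums

(defun box-fixnum (n)
  "Integer -> tagged fixnum word."
  (ash n +fixnum-shift+))

(defun unbox-fixnum (word)
  "Tagged fixnum word -> integer."
  (ash word (- +fixnum-shift+)))

(defun box-char (code)
  "Character code -> tagged character word."
  (logior (ash code +code-shift+) +char-tag+))

(defun word-char-code (word)
  "Tagged character word -> tagged fixnum word, using one right shift.
The tag bits fall off the low end and leave 00 behind."
  (ash word (- (- +code-shift+ +fixnum-shift+))))

(defun word-code-char (word)
  "Tagged fixnum word -> tagged character word: left shift, then OR the tag."
  (logior (ash word (- +code-shift+ +fixnum-shift+)) +char-tag+))
```

  With these definitions, (unbox-fixnum (word-char-code (box-char 65)))
  evaluates to 65.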

| At the same time, most characters in the system will of course not be
| wide.  What are the sane implementation strategies for this?

  I would (again) recommend actually reading the specification.  The
  character type can handle everything, but base-char could handle the
  8-bit things that reasonable people use.  The normal string type has
  character elements while base-string has base-char elements.  It would
  seem fairly reasonable to implement a *read-default-string-type* that
  would take string or base-string as value if you choose to implement both
  string types.
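
  A sketch of how such a switch could look.  *read-default-string-type*
  is not part of ANSI Common Lisp, and make-read-string is a made-up
  helper standing in for whatever the reader does when it has collected
  the characters of a string literal.

```lisp
(defvar *read-default-string-type* 'string
  "Hypothetical variable, not in the standard: STRING or BASE-STRING.")

(defun make-read-string (chars)
  "Build the string a reader might return for CHARS, a list of characters.
The element type follows *READ-DEFAULT-STRING-TYPE*."
  (let ((element-type (ecase *read-default-string-type*
                        (string 'character)
                        (base-string 'base-char))))
    ;; MAKE-ARRAY signals an error if a character does not fit the
    ;; requested element type, e.g. an extended character in a base-string.
    (make-array (length chars)
                :element-type element-type
                :initial-contents chars)))
```

  Binding the variable to base-string around a call, as in
  (let ((*read-default-string-type* 'base-string)) (make-read-string ...)),
  would then yield a base-string.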

  In a fight against something, the fight has value, victory has none.
  In a fight for something, the fight is a loss, victory merely relief.