From ...
Path: archiver1.google.com!news1.google.com!newsfeed.stanford.edu!logbridge.uoregon.edu!newshunter!cosy.sbg.ac.at!newsfeed.Austria.EU.net!newsfeed.kpnqwest.at!nslave.kpnqwest.net!nloc.kpnqwest.net!nmaster.kpnqwest.net!nreader2.kpnqwest.net.POSTED!not-for-mail
Newsgroups: comp.lang.lisp,comp.lang.scheme
Subject: Re: Wide character implementation
References: <87wuw92lhc.fsf@becket.becket.net>
Mail-Copies-To: never
From: Erik Naggum <erik@naggum.net>
Message-ID: <3225524036151618@naggum.net>
Organization: Naggum Software, Oslo, Norway
Lines: 41
User-Agent: Gnus/5.09 (Gnus v5.9.0) Emacs/21.1
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Date: Tue, 19 Mar 2002 10:53:48 GMT
X-Complaints-To: newsmaster@KPNQwest.no
X-Trace: nreader2.kpnqwest.net 1016535228 193.71.199.50 (Tue, 19 Mar 2002 11:53:48 MET)
NNTP-Posting-Date: Tue, 19 Mar 2002 11:53:48 MET
Xref: archiver1.google.com comp.lang.lisp:29543 comp.lang.scheme:9483

* Thomas Bushnell, BSG
| If one uses tagged pointers, then it's easy to implement fixnums as
| ASCII characters efficiently.

  Huh?  No sense this makes.

| But suppose one wants to have the character datatype be 32-bit Unicode
| characters?  Or worse yet, 35-bit Unicode characters?

  Unicode is a 31-bit character set.  The base multilingual plane is 16
  bits wide, and then there is the possibility of 20 bits encoded in two
  16-bit values, each carrying 10 bits (values from 0 to 1023), effectively
  (+ (expt 2 20) (- (expt 2 16) 1024 1024)) => 1112064 possible codes in
  this coding scheme, but one does not have to understand the lo- and
  hi-word codes that make up the 20-bit character space.  In effect, you
  need 21 bits.  Therefore, you could represent characters with the
  following bit pattern, with c for code and zeros where char-bits would
  go.  Fonts are a mistake, so char-font is removed.

000000ccccccccccccccccccccc00110
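
  The count quoted above can be checked directly.  The following is a
  minimal sketch, not anything from the specification: the function names
  are made up for illustration, and the offsets #xD800, #xDC00 and #x10000
  are taken from the UTF-16 definition rather than from this article.

```lisp
;; Number of codes reachable with 16-bit units plus surrogate pairs.
(defun utf16-code-count ()
  (+ (expt 2 20)                   ; pairs: 1024 high halves x 1024 low halves
     (- (expt 2 16) 1024 1024)))   ; 16-bit range minus the two surrogate ranges

;; Combine a high and a low 16-bit surrogate into one code.
;; Each half contributes 10 bits (a value from 0 to 1023).
(defun combine-surrogates (hi lo)
  (+ #x10000
     (ash (- hi #xD800) 10)
     (- lo #xDC00)))

;; Bits needed to hold the largest code the scheme can produce.
(defun code-width ()
  (integer-length (combine-surrogates #xDBFF #xDFFF)))
```

  Under these assumptions the largest combined code is #x10FFFF, which is
  why the c field in the pattern is 21 bits wide.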

  This layout is useful when the fixnum type tag is 000 for even fixnums
  and 100 for odd fixnums, effectively 00 for fixnums.  This makes
  char-code and code-char a single shift operation.  Of course, char-bits
  and char-font are not supported in this scheme, but if you _really_ have
  to, the upper 4 bits may be used for char-bits.
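
  A minimal sketch of this tagging scheme, modelling raw machine words as
  ordinary integers.  All names below are hypothetical, and a real
  implementation would do this on untyped words in the compiler or
  runtime, not in portable Lisp.  In this model char-code is one right
  shift by 3; the reverse direction is the matching left shift, after
  which the low tag bits still have to be set.

```lisp
(defconstant +char-tag+ #b00110)     ; low five bits of a character word
(defconstant +char-tag-bits+ 5)
(defconstant +fixnum-tag-bits+ 2)    ; fixnum words end in 00

;; Build the raw word for a fixnum N: value shifted past the 00 tag.
(defun fixnum-word (n)
  (ash n +fixnum-tag-bits+))

;; Build the raw word for a character with the given code.
(defun char-word (code)
  (logior (ash code +char-tag-bits+) +char-tag+))

;; char-code on raw words: shifting ...ccc00110 right by 3 leaves ...ccc00,
;; which is already the fixnum word for the code.
(defun word-char-code (word)
  (ash word (- +fixnum-tag-bits+ +char-tag-bits+)))

;; code-char on raw words: shift the fixnum word left by 3, then set the tag.
(defun word-code-char (word)
  (logior (ash word (- +char-tag-bits+ +fixnum-tag-bits+)) +char-tag+))
```

  For example, (word-char-code (char-word 65)) yields the same word as
  (fixnum-word 65), and word-code-char maps it back.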

| At the same time, most characters in the system will of course not be
| wide.  What are the sane implementation strategies for this?

  I would (again) recommend actually reading the specification.  The
  character type can handle everything, but base-char could handle the
  8-bit things that reasonable people use.  The normal string type has
  character elements while base-string has base-char elements.  It would
  seem fairly reasonable to implement a *read-default-string-type* that
  would take string or base-string as value if you choose to implement both
  string types.
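
  One way such a variable might look, as a sketch only.  Neither
  *read-default-string-type* nor make-default-string exists in standard
  Common Lisp; both are assumptions for illustration, and a real reader
  would consult the variable while building string literals.

```lisp
;; Hypothetical variable suggested above; its value is STRING or BASE-STRING.
(defvar *read-default-string-type* 'string)

;; Build a fresh string holding CONTENTS, with the element type chosen by
;; *read-default-string-type*.  Storing a character that is not a base-char
;; into a base-string is an error.
(defun make-default-string (contents)
  (let ((element-type (ecase *read-default-string-type*
                        (string 'character)
                        (base-string 'base-char))))
    (make-array (length contents)
                :element-type element-type
                :initial-contents contents)))
```

  Binding the variable around a call, as in
  (let ((*read-default-string-type* 'base-string)) (make-default-string "abc")),
  would then yield a base-string where the implementation distinguishes the
  two types.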

///
-- 
  In a fight against something, the fight has value, victory has none.
  In a fight for something, the fight is a loss, victory merely relief.