Newsgroups: comp.lang.lisp,comp.lang.scheme
Subject: Re: Wide character implementation
References: <87wuw92lhc.fsf@becket.becket.net>
From: Erik Naggum
Message-ID: <3225524036151618@naggum.net>
Organization: Naggum Software, Oslo, Norway
Date: Tue, 19 Mar 2002 10:53:48 GMT

* Thomas Bushnell, BSG
| If one uses tagged pointers, then its easy to implement fixnums as
| ASCII characters efficiently.

  Huh?  No sense this makes.

| But suppose one wants to have the character datatype be 32-bit Unicode
| characters?  Or worse yet, 35-bit Unicode characters?

  Unicode is a 31-bit character set.  The Basic Multilingual Plane is 16
  bits wide, and then there is the possibility of 20 bits encoded in two
  16-bit values, each carrying a value from 0 to 1023, effectively
  (+ (expt 2 20) (- (expt 2 16) 1024 1024)) => 1112064 possible codes in
  this coding scheme, but one does not have to understand the lo- and
  hi-word codes that make up the 20-bit character space.  In effect, you
  need 21 bits.  Therefore, you could represent characters with the
  following bit pattern, with b for bits and c for code.  Fonts are a
  mistake, so they are removed.

  000000ccccccccccccccccccccc00110

  This is useful when the fixnum type tag is 000 for even fixnums and 100
  for odd fixnums, effectively 00 for fixnums.  This makes char-code and
  code-char a single shift operation.
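  The arithmetic and the tagging can be checked with a small sketch.  It
  is only a model under stated assumptions: machine words are plain
  integers, the pattern is read as 21 code bits sitting above the 5-bit
  tag 00110, and a fixnum is its value shifted left past a 2-bit tag 00.
  Every function name below is hypothetical and comes from no
  implementation; the #x10000 offset is the usual surrogate-pair
  arithmetic, not something stated in this post.

```lisp
;;; Hypothetical model of the tagging scheme; words are plain integers.

(defconstant +char-tag+ #b00110)  ; low five bits of a character word
(defconstant +char-shift+ 5)      ; code bits start above the 5-bit tag
(defconstant +fixnum-shift+ 2)    ; fixnums carry the 2-bit tag 00

(defun count-codes ()
  "Number of codes: 2^20 paired codes plus the 16-bit plane minus the
two blocks of 1024 values reserved for the pairs."
  (+ (expt 2 20) (- (expt 2 16) 1024 1024)))

(defun pair->code (hi lo)
  "Combine two 10-bit values (each 0..1023) into one code above #xFFFF."
  (+ #x10000 (ash hi 10) lo))

(defun code->char-word (code)
  "Build a tagged character word: code shifted past the tag, tag added."
  (logior (ash code +char-shift+) +char-tag+))

(defun fixnum-word (n)
  "Build a tagged fixnum word: value shifted past the 00 tag."
  (ash n +fixnum-shift+))

(defun char-word->fixnum-word (word)
  "Model of char-code: one right shift by 3 drops the low tag bits 110
and leaves the code as a fixnum word with tag 00."
  (ash word -3))

(defun fixnum-word->char-word (word)
  "Model of code-char: shift left by 3; in this model the tag bits
still have to be put back with LOGIOR."
  (logior (ash word 3) +char-tag+))
```

  For example, (char-word->fixnum-word (code->char-word #x41)) gives the
  same word as (fixnum-word #x41), and the largest code the pairs can
  reach, (pair->code 1023 1023), is #x10FFFF, which fits in 21 bits.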
  Of course, char-bits and char-font are not supported in this scheme,
  but if you _really_ have to, the upper 4 bits may be used for char-bits.

| At the same time, most characters in the system will of course not be
| wide.  What are the sane implementation strategies for this?

  I would (again) recommend actually reading the specification.  The
  character type can handle everything, but base-char could handle the
  8-bit things that reasonable people use.  The normal string type has
  character elements while base-string has base-char elements.  It would
  seem fairly reasonable to implement a *read-default-string-type* that
  would take string or base-string as value if you choose to implement
  both string types.

///
-- 
  In a fight against something, the fight has value, victory has none.
  In a fight for something, the fight is a loss, victory merely relief.