Subject: Re: Encoding bytes into UTF-8 string
From: rpw3@rpw3.org (Rob Warnock)
Date: Fri, 01 Dec 2006 00:42:49 -0600
Newsgroups: comp.lang.lisp
Message-ID: <m8OdnbZqnPd0U_LYnZ2dnUVZ_oOdnZ2d@speakeasy.net>

Harald Hanche-Olsen <hanche@math.ntnu.no> wrote:
+---------------
| + dixkey@gmail.com:
| | Leaving aside an amusing theory of how the world started (I believe
| | the classic goes something like "In the beginning there was the
| | Word, and Word had two Bytes and there was nothing else." :) ...
| 
| Bytes?  I am not at all sure when bytes entered the picture.
| Certainly, when I was first introduced to computing, the Word was 24
| bits long.  We never heard about bytes, but maybe they were too
| esoteric for mere undergraduates.
+---------------

Maybe. The DEC PDP-10 <http://en.wikipedia.org/wiki/PDP-10> had
hardware support for *variable-width* bytes!! [In fact, the ANSI
Common Lisp LDB & DPB functions are named after the corresponding
PDP-10 instructions.] The base machine word was 36 bits, and a byte
could be anything between 0 & 36 bits wide. Byte operations had to
indirect through a byte-pointer word which contained the byte's
position-within-word (P) [the count of bits to the *right* of the
byte within the word], width-in-bits (S), and word address (Y), with
the usual PDP-10 indirection (I) & indexing (X) allowed on the word
address.
[A zero-width byte always gave you a zero result for LDB, and was
a no-op for DPB. I've used that trick to good effect on occasion!]
A byte pointer looked like this:

+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|     P     |     S     | |I|   X   |                Y                  |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 0         5 6        11  13 14   17 18                               35
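
[For those more at home in CL than in MACRO-10, here is a rough sketch
of how P and S line up with the standard BYTE specifier, which takes
the size first and the position second. The names PDP10-LDB and
PDP10-DPB are made up for this example, and I'm ignoring I, X & Y
entirely by passing the 36-bit word in as an integer.]

```lisp
;; Hypothetical helpers: they model only the P and S fields of a byte
;; pointer and take the word itself instead of fetching it from address Y.
(defun pdp10-ldb (p s word)
  "Extract the S-bit byte that has P bits to its right in WORD."
  (ldb (byte s p) word))

(defun pdp10-dpb (new p s word)
  "Return WORD with that same byte replaced by the low S bits of NEW."
  ;; The outer LDB just keeps the result within 36 bits.
  (ldb (byte 36 0) (dpb new (byte s p) word)))

;; The leftmost 7-bit byte of a word has P=29, S=7:
;;   (pdp10-ldb 29 7 (ash 65 29))  =>  65
;; A zero-width byte loads as zero and deposits as a no-op:
;;   (pdp10-ldb 0 0 #o777777777777)  =>  0
;;   (pdp10-dpb 5 0 0 123)           =>  123
```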

The ILDB and IDPB instructions incremented (destructively modified)
the byte-pointer word before doing the LDB or DPB function, so
successive ILDB/IDPB instructions would step through successive bytes
in memory of whatever size was specified in the byte pointer.

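    ;; Assumes a struct like this one; :PREDICATE NIL avoids a name clash
    ;; between the default predicate and the accessor for slot P.
    (defstruct (byte-pointer (:predicate nil)) p s y)
    (defun increment-byte-pointer (bp)
      ;; Step P down one byte; on underflow, go to the next word's first byte.
      (when (minusp (decf (byte-pointer-p bp) (byte-pointer-s bp)))
        (setf (byte-pointer-p bp) (- 36 (byte-pointer-s bp)))
        (incf (byte-pointer-y bp)))
      bp)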

Bytes could not be split across words. Normal character strings were
7-bit ASCII (upper/lower/etc.), which meant you got 5 characters per
36-bit word, with one bit wasted.[1] It was common to initialize byte
pointers for text strings to P=36 S=7 Y=<string_addr>, that is, pointing
to the non-existent byte to the left of the first word of the string,
so that the first ILDB would step into the first actual character of
the string and successive ILDBs would continue onwards from there.
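
[To make the P=36 trick concrete, here is a minimal sketch that packs a
string five 7-bit characters to the word and then reads it back with an
ILDB-style loop. Everything here is an assumption for illustration:
"memory" is just a vector of integers, and BPTR, PACK-ASCII7,
ILDB-SKETCH & UNPACK-ASCII7 are invented names.]

```lisp
;; Hypothetical byte-pointer struct. :PREDICATE NIL is needed because the
;; default predicate would also be called BPTR-P, same as the P accessor.
(defstruct (bptr (:predicate nil)) (p 36) (s 7) (y 0))

(defun pack-ascii7 (string)
  "Pack STRING five 7-bit characters per 36-bit word, left-justified."
  (let* ((nwords (ceiling (length string) 5))
         (memory (make-array nwords :initial-element 0)))
    (loop for ch across string
          for i from 0
          do (multiple-value-bind (word slot) (floor i 5)
               ;; Slot 0 sits at P=29 and slot 4 at P=1, so the rightmost
               ;; bit of each word (bit 35) is never written.
               (setf (ldb (byte 7 (- 29 (* 7 slot))) (aref memory word))
                     (char-code ch))))
    memory))

(defun ildb-sketch (bp memory)
  "Advance BP to the next byte, then load that byte from MEMORY."
  (when (minusp (decf (bptr-p bp) (bptr-s bp)))
    (setf (bptr-p bp) (- 36 (bptr-s bp)))
    (incf (bptr-y bp)))
  (ldb (byte (bptr-s bp) (bptr-p bp)) (aref memory (bptr-y bp))))

(defun unpack-ascii7 (memory count)
  "Read COUNT characters, starting from a P=36 S=7 Y=0 pointer."
  (let ((bp (make-bptr :p 36 :s 7 :y 0)))
    (coerce (loop repeat count
                  collect (code-char (ildb-sketch bp memory)))
            'string)))

;; (unpack-ascii7 (pack-ascii7 "Hello, world") 12)  =>  "Hello, world"
```

The first ILDB-SKETCH call takes P from 36 to 29 without touching Y,
which is exactly why the initial pointer can "point" one byte to the
left of the string.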

[Also see <http://pdp10.nocrew.org/docs/instruction-set/Byte.html>.]

System calls & filenames, however, used the SIXBIT character set,
with six 6-bit characters per word. SIXBIT had no lower-case letters
(it covered only ASCII codes 32 through 95), which is why PDP-10
filenames (and later MS-DOS, which inherited the convention by way
of CP/M) were only uppercase.

    > (defun ascii-char-sixbit (x)
	(let ((c (char-code x)))
	  (if (and (>= c 32) (<= c 95))
	    (logxor 32 (logand 63 c))
	    (error "Character '~c' (~d) cannot be converted to SIXBIT." x c))))

    ASCII-CHAR-SIXBIT
    > (map 'list 'ascii-char-sixbit "FOOBAR.BAZ")

    (38 47 47 34 33 50 14 34 33 58)
    > (map 'list 'ascii-char-sixbit "hello")
    Error: 
    Error in function ASCII-CHAR-SIXBIT:
       Character 'h' (104) cannot be converted to SIXBIT.
    ...
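
[A sketch of the "six characters per word" part, under the assumption
that a short name is left-justified and padded on the right with SIXBIT
spaces (code 0). SIXBIT-WORD and SIXBIT-STRING are invented names, and
the conversion is written as a plain subtraction of 32, which gives the
same values as the LOGXOR form for codes 32 through 95.]

```lisp
(defun sixbit-word (name)
  "Pack up to six characters of NAME into one 36-bit integer."
  (assert (<= (length name) 6))
  (let ((word 0))
    (loop for ch across name
          for p downfrom 30 by 6          ; P = 30, 24, 18, 12, 6, 0
          for code = (- (char-code ch) 32)
          do (assert (<= 0 code 63))      ; reject anything outside SIXBIT
             (setf word (dpb code (byte 6 p) word)))
    word))

(defun sixbit-string (word)
  "Unpack all six 6-bit bytes of WORD into a six-character string."
  (coerce (loop for p downfrom 30 to 0 by 6
                collect (code-char (+ 32 (ldb (byte 6 p) word))))
          'string))

;; Each SIXBIT character is exactly two octal digits:
;;   (format nil "~12,'0o" (sixbit-word "FOOBAR"))  =>  "465757424162"
;;   (sixbit-string (sixbit-word "BAZ"))            =>  "BAZ   "
```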

And ISTR that somebody built a C compiler for the PDP-10 that
used 9-bit bytes for characters, so you packed four to a word,
but I'm not entirely sure about that.

Anyway, 6 & 7 were the most common byte sizes used on the PDP-10,
though other sizes were routinely used for various specialized
functions [e.g., lexical parsing tables could make very good
use of the indexing in a byte pointer to store "byte-strips"
of various sizes indexed by character (Google for me & FOCAL)].


-Rob

[1] Well, some text editors used a hack of tagging magic "line
    number" words (containing 5 decimal digits in ASCII) with a
    "1" in bit 35, to make it easier(?) to find the next line.
    But except for that, bit 35 was usually zero (wasted) in
    ASCII strings.

-----
Rob Warnock			<rpw3@rpw3.org>
627 26th Avenue			<URL:http://rpw3.org/>
San Mateo, CA 94403		(650)572-2607