Subject: Re: Questions about Symbolics lisp machines
From: Erik Naggum <erik@naggum.net>
Date: Mon, 25 Mar 2002 09:26:15 GMT
Newsgroups: comp.lang.lisp
Message-ID: <3226037188416051@naggum.net>

* Thomas Bushnell, BSG
| Consider the hair and pain involved in making Unicode work on GNU/Linux
| systems with UTF-8.  This is the easiest way to go, and even so it's lots
| of work converting a jillion applications to work right.  And this is
| because "character stream" is *not* a well defined concept; Unix
| historically only has ASCII character streams, and from this comes a
| giant problem.

  No, this is an important mistake.  Unix has a well-defined concept of an
  "octet stream".  This is _never_ what you really want.  On top of this
  "octet stream" Unix has, via C's lack of a real character type, given its
  users the notion that a _character_ is just the same as a small integer
  that happens to fit in an octet.  All of this is unfortunately wrong.

  A character and its encoding are different concepts.  An encoding and its
  (external) representation are different concepts.  An external
  representation and the numeric values of whatever unit it is made up of
  are different concepts.  By conflating all four concepts into one, Unix
  has held text processing and computing in general back several decades.
  This is fairly ironic, since Unix started out as a text processing
  vehicle.
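  The four concepts can be teased apart in a few lines (Python used as a
  neutral illustration; any language with a real character type would do):

```python
ch = "æ"                   # 1. the character, an abstract entity
code = ord(ch)             # 2. its encoding: the code point 230
rep = ch.encode("utf-8")   # 3. an external representation of that encoding
units = list(rep)          # 4. the numeric values of the units it is made of
print(code)                # 230
print(units)               # [195, 166]
```

  Unix and C collapse all four into "a small integer in an octet", which
  works only as long as steps 2 through 4 happen to coincide.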

  One result of this character = small number = octet confusion is that
  "variable length" encodings are seriously frightening to Unixoid coders.
  All the code that deals with octet-stream -> anything-else interpretation
  has had such problems with stable and well-defined standards such as ISO
  2022 that the IETF was utterly unable to use any existing standards for
  the representation of multi-character-set "documents" and "streams", and
  so had to invent both MIME (extremely crude structured objects in mail)
  and a charset property at an extremely high level, such that mixing
  charsets became extremely verbose and difficult.  This is also why some
  people think Unicode sucks: it may force programmers to deal with
  characters differently than "just assume 16 bits" instead of the old
  "just assume 8 bits".

| Care to guess how many times different argument parser routines there are
| in an average GNU/Linux system?  I pick that one because argument parsing
| is actually, a *total waste*, forced by the use of a "shell"--another
| wasted concept unnecessary in a Real System (like the various lispms had,
| and like Sky [should it ever happen] will have).

  I think you overreact now.  The biggest problem here is that _everything_
  in Unix is an octet stream, even strings, and program arguments are just
  strings.  (The fact that you need to parse the "string" from beginning to
  end to find the in-band terminator (which cannot even be escaped) makes
  it a stream, and the "pointer" you keep into it to track the current
  position is just like the position in a stream.)
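  The in-band terminator is easy to demonstrate (a Python simulation of
  C's strlen, for illustration): the only way to learn the length is to
  scan, and any NUL in the data silently truncates it.

```python
def c_strlen(octets: bytes) -> int:
    # Scan for the in-band terminator, exactly as C's strlen must.
    n = 0
    while octets[n] != 0:
        n += 1
    return n

data = b"hello\x00world\x00"
print(c_strlen(data))   # 5 -- everything after the first NUL is lost
```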

  Unix is in fact so streams-based that it is nearly _impossible_ to work
  with structured objects.  Everywhere an object wants to go, it has to be
  marshalled into and out of an octet-stream--based external format, both
  in arguments and in pipelines.  It is as if you had to call a function
  foo like (eval (format nil "(foo~@{ ~A~})" <arguments>)).  Hey, I just
  reinvented Tcl.

  Of course, every object must have an external representation of _some_
  sort to communicate it with external programs, but marshalling to and
  from octet stream should preserve the object-ness.  Lisp and things like
  ASN.1 enable the preservation of objectness in marshalling, and many
  other attempts have been made.  But treating everything like a string
  without further adornment or syntax (which Unix shells and, ironically,
  SGML and XML do) is just plain wrong.
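  The difference between marshalling that preserves object-ness and the
  flatten-to-a-string approach can be sketched in a few lines (Python's
  repr/literal_eval standing in for Lisp's print/read, as an illustration):

```python
import ast

obj = {"name": "foo", "args": [1, 2, "three"]}
text = repr(obj)                # marshal: an external representation
back = ast.literal_eval(text)   # unmarshal: the same object again
assert back == obj              # object-ness survives the round trip

# The shell's way: flatten everything to one undifferentiated string
# and hope the reader can guess where the structure used to be.
flat = "foo 1 2 three"
```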

  On the other hand, there _are_ times when you want to just copy a file or
  ship across a network bit by bit, in which case the octet stream might
  seem the only alternative.  This is not really a situation that the user
  or even (application) programmer needs to be exposed to.

///
-- 
  In a fight against something, the fight has value, victory has none.
  In a fight for something, the fight is a loss, victory merely relief.