Subject: Re: Name for the set of characters legal in identifiers
From: Erik Naggum <erik@naggum.no>
Date: 14 Jan 2004 05:39:34 +0000
Newsgroups: comp.lang.lisp
Message-ID: <3283047574505462KL2065E@naggum.no>

* Russell Wallace
| A trivial little question, but one that's been bugging me: Is there
| a name for that set of characters legal in Lisp identifiers?  For
| most languages this would be "alphanumeric" (perhaps with a footnote
| that _ is regarded as a letter in this context), but Lisp includes
| characters like + and - that most languages regard as punctuation.

  The type STANDARD-CHAR covers the set of characters from which all
  symbols in the standard packages are made.  This simple fact may
  give rise to the invalid assumption that there must be a particular
  character set from which all symbols must be made.
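
  The claim about the standard packages can be checked with a short
  sketch like the following.  The function name is hypothetical and
  not part of the standard; it simply walks the external symbols of
  the COMMON-LISP package and collects any whose name contains a
  character that is not a STANDARD-CHAR.

```lisp
;; Sketch only: NON-STANDARD-CL-SYMBOLS is an illustrative name, not a
;; standard function.
(defun non-standard-cl-symbols ()
  "Return the external symbols of COMMON-LISP whose names contain a
character that is not of type STANDARD-CHAR."
  (let ((offenders '()))
    ;; The third element of the binding form is the result form.
    (do-external-symbols (symbol "COMMON-LISP" offenders)
      (unless (every #'standard-char-p (symbol-name symbol))
        (push symbol offenders)))))
```

  In a conforming implementation, (non-standard-cl-symbols) should
  return an empty list.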

  However, the functions INTERN and MAKE-SYMBOL take a STRING as the
  name of the symbol to be created, and there is no requirement that
  this /string/ be of type BASE-STRING.  Likewise, the value of
  SYMBOL-NAME is only specified to be of type STRING, with no mention
  of the common observation that it may be a SIMPLE-STRING regardless
  of whether the corresponding argument to INTERN or MAKE-SYMBOL was.
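
  As a minimal sketch of this point, the functions below hand both
  MAKE-SYMBOL and INTERN a name that no ordinary token could spell:
  it contains spaces, parentheses, and a newline.  The function names
  and the scratch package name are assumptions made for illustration.
  The name is kept to standard characters only so that the sketch is
  portable; nothing in the standard requires that.

```lisp
;; Sketch only: ODD-SYMBOL-NAME, MAKE-ODD-SYMBOLS and the package name
;; "SCRATCH-SYMBOLS" are hypothetical names for this example.
(defun odd-symbol-name ()
  "Return a fresh string containing spaces, parentheses and a newline."
  (format nil "two words (and)~Ca second line" #\Newline))

(defun make-odd-symbols ()
  "Return two values: an uninterned symbol and an interned symbol,
both named by the string from ODD-SYMBOL-NAME."
  (let ((package (or (find-package "SCRATCH-SYMBOLS")
                     (make-package "SCRATCH-SYMBOLS" :use '()))))
    (values (make-symbol (odd-symbol-name))
            (intern (odd-symbol-name) package))))
```

  Both symbols come back with SYMBOL-NAME equal (under STRING=) to
  the string that was passed in, which is all the standard promises.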

  Since the symbols are normally created by the Common Lisp reader,
  your question is therefore really which characters the reader is
  able to build into a string that it will pass to INTERN.  There is
  no upper bound on this character set in the standard, but an actual
  implementation will necessarily place restrictions on this set.  In
  the worst case, the Common Lisp reader does not understand which
  character it has just read the encoding of, and may produce symbols
  with garbage bytes that nevertheless reproduce the character in your
  editor or other character display equipment.

  Pessimistically, therefore, your question is whether you will find
  any mention in the standard of any invalid characters in symbols,
  but you find quite the opposite: After a single-escape character,
  normally \, any following character will be a constituent character
  in the symbol name being read, and between the multiple-escape
  characters, normally |, all characters will be constituent.  The
  best you can hope for is thus that whatever reads the byte stream
  that is your source file will reject unacceptable encodings.  As
  long as you use an encoded character set that includes the standard
  characters, there is no restriction on what you can do, and if you
  use an encoding that does not confuse standard characters and one of
  your other characters even in the least capable decoders, you will
  find that there is not even any useful restriction on the /length/
  of Common Lisp symbol names.
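
  The effect of the escape characters can be observed directly with
  a sketch like the one below.  It assumes the standard readtable
  (established here by WITH-STANDARD-IO-SYNTAX), and the function
  name and scratch package name are hypothetical.

```lisp
;; Sketch only: READ-SYMBOL-NAME and the package name "SCRATCH-READER"
;; are illustrative names, not part of the standard.
(defun read-symbol-name (text)
  "Read one object from TEXT using standard syntax and return its
name, assuming that the object read is a symbol."
  (with-standard-io-syntax
    ;; Intern into a scratch package so COMMON-LISP-USER stays clean.
    (let ((*package* (or (find-package "SCRATCH-READER")
                         (make-package "SCRATCH-READER" :use '()))))
      (symbol-name (read-from-string text)))))

;; Note that the backslashes are doubled below only because they
;; appear inside Lisp string literals.
;; (read-symbol-name "|hello (world)|")  ; multiple escape keeps case
;; (read-symbol-name "a\\ b")            ; single escape before a space
;; (read-symbol-name "\\(\\)")           ; escaped parentheses
```

  With the standard readtable, the three calls in the comments should
  yield the names "hello (world)", "A B", and "()" respectively: the
  unescaped letters are upcased, and the escaped characters are taken
  as they are.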

  Optimistically, however, the answer to your question is that the set
  of characters that are legal in identifiers is the system class
  CHARACTER, but you may not be able to produce all of them in any
  given source file.

  I am particularly fond of using the non-breaking space in symbol
  names, just as I use it in filenames under operating systems that
  believe that ordinary spaces are separators regardless of how much
  effort one puts into convincing their various programs otherwise.  I
  know people who think there ought to be laws against this practice,
  but sadly, the Common Lisp standard does not come to their aid.
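
  A hedged sketch of that practice follows.  It assumes that the
  implementation maps character code 160 to the no-break space, which
  the standard does not guarantee, so the function (whose name is
  hypothetical) returns NIL when CODE-CHAR yields no character.

```lisp
;; Sketch only: NBSP-SYMBOL is an illustrative name.  The mapping of
;; code 160 to NO-BREAK SPACE is an assumption about the implementation.
(defun nbsp-symbol (first second)
  "Return an uninterned symbol named FIRST, a no-break space, SECOND.
Return NIL if the implementation has no character with code 160."
  (let ((nbsp (code-char 160)))
    (when nbsp
      (make-symbol (concatenate 'string first (string nbsp) second)))))
```

  Whether the reader would accept the same character unescaped in a
  source file depends on the implementation's readtable and on how
  the file is decoded, as discussed above.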

-- 
Erik Naggum | Oslo, Norway                      Yes, I survived 2003.

Act from reason, and failure makes you rethink and study harder.
Act from faith, and failure makes you blame someone and push harder.