Subject: Revisiting split-sequence and patterns
From: Erik Naggum <erik@naggum.net>
Date: Wed, 28 Nov 2001 04:03:04 GMT
Newsgroups: comp.lang.lisp
Message-ID: <3215908980293864@naggum.net>

  This problem occurred to me while trying to explain why comma-delimited
  data formats should not use something as simple as split-sequence (or a
  similar inline loop), so bear with me while I rehash an old issue and
  walk through some background.

  If there be patterns of syntax design, one of them would be escaping or
  quoting a character so it reverts from (potentially) special to normal
  interpretation using one special character solely for escaping purposes,
  backslash being the canonical choice for this.  (Another pattern of
  syntax design would be to use the backslash to _make_ characters special
  and thus mean something else entirely.  It is thus important to recognize
  which pattern the backslash is used to implement in a particular syntax.)
  The escaping character is discarded from the extracted token.  Call this
  the single-escape mechanism.
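
  As a small illustration of the single-escape mechanism, the Common Lisp
  reader can be asked to read the text A\,B.  This is only a sketch of the
  pattern at work in the standard readtable, not a tokenizer; binding
  *readtable* to a copy of the standard readtable is a precaution added
  for the example.

```lisp
;; The comma is normally a terminating macro character.  Preceded by a
;; backslash it becomes an ordinary constituent of the symbol name, and
;; the backslash itself is discarded.
(let ((*readtable* (copy-readtable nil)))
  (symbol-name (read-from-string "A\\,B")))   ; the text read is A\,B
;; => "A,B"
```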

  Since Common Lisp adheres to this pattern of syntax design, I would have
  expected that a function that split a string into a list of tokens that
  were delimited by a special character would allow a token to contain that
  delimiter if it were escaped.

  If there be more patterns of syntax design, another one would involve
  escaping a whole sequence of characters so they revert from (potentially)
  special to normal interpretation, using a single special character at
  both ends of the sequence, and escaping that character and the escaping
  character inside the sequence with the same character that does this for
  a single character.  All the escaping characters are discarded from the
  extracted token.  Call this the multiple-escape mechanism.

  Since Common Lisp adheres to this pattern of syntax design, too, with
  both its string syntax and the multiple-escape syntax for symbols, I
  would also have expected that a function that split a string into a list
  of tokens that had been delimited on either end by a special character
  would allow a token to be so delimited and thus to avoid being split in
  the middle of such a token if it contained the delimiter on which to
  split the string.
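
  The two multiple-escape syntaxes mentioned above can be observed
  directly in the reader.  The following is a sketch under the assumption
  of the standard readtable (bound explicitly here as a precaution); it
  shows that the comma survives inside the delimited token and that the
  escaping characters do not.

```lisp
(let ((*readtable* (copy-readtable nil)))
  (list
   ;; String syntax; the text read is "a,\"b\"" including the outer quotes.
   (read-from-string "\"a,\\\"b\\\"\"")
   ;; Multiple-escape syntax for symbols; the text read is |a,b c|.
   (symbol-name (read-from-string "|a,b c|"))))
;; => ("a,\"b\"" "a,b c")
```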

  There are a whole lot of other patterns of syntax design: whether it be
  context-free or context-sensitive; how many characters (or tokens) of
  read-ahead it require; whether the start and end markers of individual
  elements be explicit or deduced.  Common Lisp has a context-sensitive
  grammar (strictly speaking, since every change to the readtable and even
  the package changes the syntax, one cannot read a Common Lisp source file
  correctly without evaluating certain top-level forms and performing any
  evaluation required by the #. reader macro), described by the readtable,
  which implements support for recursive-descent parsers that may change
  the readtable while processing "their" portion of the input stream.  (It
  is easy to describe the (static) syntax of the standard readtable that
  implements the standard language, but that is not sufficient for a real
  Common Lisp system.)  Common Lisp's syntax is in this particular way very
  peculiar and breaks the patterns that have later been established for
  context-free grammars and LALR parsers.  (Most interesting programming
  languages fail to adhere to these formal little things, anyway, but they
  are sometimes helpful in classifying things.)  Common Lisp's standard
  function for reading delimited lists, read-delimited-list, expects to
  terminate on a character and cannot terminate on the end of input.
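
  For reference, a minimal use of read-delimited-list, assuming the
  standard readtable and a decimal *read-base*.  The input string is made
  up for the example.

```lisp
;; Reading stops at the closing parenthesis, which is consumed; the rest
;; of the input is left unread.  Reaching the end of input before the
;; delimiter is an error according to the standard, so the delimiter has
;; to be present.
(with-input-from-string (in "1 2 3) and more")
  (read-delimited-list #\) in))
;; => (1 2 3)
```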

  Yet another possible pattern is that of considering whitespace completely
  ignorable, but also necessary to keep tokens apart, because almost no
  other character breaks tokens apart.  This means that whitespace differs
  from other delimiters in that repeated instances should be collapsed, but
  this is not really a feature of the delimiter, but of whitespace.

  Since Common Lisp requires explicit start and end markers for almost all
  tokens (except symbols), and the end marker is already an argument to
  read-delimited-list, I think we have precedent for this argument to a
  function that parses a string, too.

  Briefly, a "split-sequence" that adheres to these patterns would accept:

1 a single-escape character, which defaults to #\\, but the argument may be
  nil to prevent this functionality

2 a designator for a bag, i.e., character or sequence of characters, that
  are multiple-escape characters, which defaults to (#\"), but which can
  usefully be (#\" #\|) or (#\" #\').  If the bag is empty, no characters
  will be treated as multiple-escape characters.  If two multiple-escape
  characters are adjacent and thus complete a "hard empty" field, it is
  returned as an empty string.

3 a designator for a bag of characters that are to be considered whitespace
  and are thus to be ignored apart from their effect on terminating a token
  unless escaped.  E.g., (#\space #\tab).

4 an internal delimiter character that separates tokens, where tokens are
  considered "soft empty" if delimiters are adjacent (ignoring whitespace,
  if any) and if so, are returned as nil.

5 a designator for a bag of terminating delimiters that cause the parsing
  to stop and return the tokens collected.  If nil as a whole or the symbol
  :eof in a sequence, only the end of stream or string will terminate the
  parsed list of tokens, provided that it is not escaped.

  I think the name parse-delimited-list would be descriptive and fitting.
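
  What follows is a rough sketch of how such a function might look, not a
  finished design.  Several things in it are assumptions made only for
  illustration: the keyword names, the helper BAG that resolves bag
  designators, the default whitespace bag and the default delimiter #\,
  (the description above specifies defaults only for the two escape
  arguments), and the restriction to strings rather than streams.  The
  end of the string always terminates the list, so :eof in the
  terminators is accepted but has no further effect, and an unterminated
  multiple escape is silently closed at the end of the string.

```lisp
(defun bag (designator)
  "Coerce a bag designator (NIL, one element, or a sequence) to a list."
  (typecase designator
    (null '())
    (sequence (coerce designator 'list))
    (t (list designator))))

(defun parse-delimited-list (string &key (single-escape #\\)
                                         (multiple-escape #\")
                                         (whitespace '(#\Space #\Tab))
                                         (delimiter #\,)
                                         terminators)
  "Split STRING into tokens; soft empty fields are NIL, hard empty ones \"\"."
  (let ((quotes (bag multiple-escape))
        (blanks (bag whitespace))
        (stops (bag terminators))
        (buffer (make-array 0 :element-type 'character
                              :adjustable t :fill-pointer 0))
        (solid 0)          ; buffer length up to the last significant character
        (hard nil)         ; true once the field has used a multiple escape
        (open-quote nil)   ; the multiple-escape character now open, if any
        (tokens '())
        (i 0)
        (n (length string)))
    (flet ((keep (ch)
             (vector-push-extend ch buffer)
             (setf solid (fill-pointer buffer)))
           (finish ()
             (setf (fill-pointer buffer) solid) ; drop trailing whitespace
             (push (if (or hard (plusp solid)) (copy-seq buffer) nil) tokens)
             (setf (fill-pointer buffer) 0
                   solid 0
                   hard nil)))
      (loop
        (when (>= i n) (return))
        (let ((ch (char string i)))
          (incf i)
          (cond ((and single-escape (char= ch single-escape))
                 ;; Keep the next character literally; drop the escape.
                 (when (< i n)
                   (keep (char string i))
                   (incf i)))
                (open-quote
                 (if (char= ch open-quote)
                     (setf open-quote nil)
                     (keep ch)))
                ((member ch quotes)
                 (setf open-quote ch
                       hard t))
                ((member ch stops) (return))
                ((char= ch delimiter) (finish))
                ((member ch blanks)
                 ;; Leading whitespace is skipped; interior whitespace is
                 ;; buffered but only counts if a significant character follows.
                 (when (plusp solid)
                   (vector-push-extend ch buffer)))
                (t (keep ch)))))
      (finish)
      (nreverse tokens))))

;; The text parsed here is:  a, "b,c" ,, "", d\,e
(parse-delimited-list "a, \"b,c\" ,, \"\", d\\,e")
;; => ("a" "b,c" NIL "" "d,e")

(parse-delimited-list "a,b;c" :terminators #\;)
;; => ("a" "b")
```

  The third field in the first call is "soft empty" and comes back as
  nil, while the fourth is "hard empty" and comes back as an empty
  string.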

  A note on whitespace.  If some character that is usually a whitespace
  character is to be a delimiter in its own right, such as #\tab, the bag
  of whitespace characters should be empty and it be the internal delimiter
  -- or it will prevent adjacent delimiters from creating an empty field.

  A note on bags.  The function string-trim and friends are the only
  functions I can find that accept a bag of anything in Common Lisp.  A
  bag can be any sequence type, but if non-characters are required, such as
  :eof, it must be a list or a vector.
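
  To illustrate the point about sequence types, string-trim accepts its
  character bag as a list, a string, or a vector alike.  The input string
  is made up for the example.

```lisp
(list (string-trim '(#\Space #\Tab) "  a b  ")   ; bag as a list
      (string-trim " " "  a b  ")                ; bag as a string
      (string-trim #(#\Space) "  a b  "))        ; bag as a general vector
;; => ("a b" "a b" "a b")
```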

  A note on designators for bags of characters.  Like make-array, I think
  it is useful to supply either a character or :eof without having to put
  them in a sequence just for the sake of this argument, so they designate
  a singleton sequence containing itself.  Note that both nil and a string
  of one character are already bags.

  A note on "hard" and "soft" empty fields.  There may already be good
  reason to supply a "default list" for fields that are empty, modeled on
  make-pathname's defaults argument, but if so, we need a way to distinguish
  a field that is supplied as empty from one that is defaulted.  In the
  presence of multiple-escape characters, it is possible to create the
  empty string explicitly ("hard"), as opposed to implicitly by omitting
  characters between delimiters ("soft").  Only a "soft empty" field would
  be eligible for defaulting from the defaults list.  Note that supplying a
  short defaults list (which includes not supplying one) will default
  elements to nil, which is what a "soft empty" field would be returned as
  in the absence of a default list, so it need not be a special case.
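
  The defaulting rule can be sketched as a separate step over an already
  parsed list of tokens.  The function name MERGE-FIELD-DEFAULTS and the
  representation (nil for "soft empty", "" for "hard empty") are
  assumptions for the sake of the example.

```lisp
(defun merge-field-defaults (tokens defaults)
  "Replace each NIL (soft empty) token by the corresponding default, if any."
  (loop for token in tokens
        for tail = defaults then (cdr tail)
        ;; (car nil) is NIL, so a short defaults list needs no special case.
        collect (or token (car tail))))

(merge-field-defaults '("a" nil "" nil) '("x" "y" "z"))
;; => ("a" "y" "" NIL)
```

  The hard empty third field is left alone, and the fourth field stays
  nil because the defaults list is too short to cover it.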

  I hope it is evident how I have picked elements from several functions
  already in Common Lisp.  I hope the result has a "Common Lisp feel",
  which is also what patterns are about to me.  I have been a little queasy
  about the split-sequence "feel", which to me has a more C- or Perl-like
  feel to it.  (I know, I posted an early version and I am partly to blame.
  But then we age.  Or something.)

///
-- 
  The past is not more important than the future, despite what your culture
  has taught you.  Your future observations, conclusions, and beliefs are
  more important to you than those in your past ever will be.  The world is
  changing so fast the balance between the past and the future has shifted.