Subject: Revisiting split-sequence and patterns
From: Erik Naggum <firstname.lastname@example.org>
Date: Wed, 28 Nov 2001 04:03:04 GMT
Newsgroups: comp.lang.lisp
Message-ID: <email@example.com>

This problem occurred to me while trying to explain why comma-delimited data formats should not use something as simple as split-sequence (or a similar inline loop), so bear with me while I rehash an old issue and walk through some background.

If there be patterns of syntax design, one of them would be escaping or quoting a character so it reverts from (potentially) special to normal interpretation using one special character solely for escaping purposes, backslash being the canonical choice for this. (Another pattern of syntax design would be to use the backslash to _make_ characters special and thus mean something else entirely. It is thus important to recognize which pattern the backslash is used to implement in a particular syntax.) The escaping character is discarded from the extracted token. Call this the single-escape mechanism. Since Common Lisp adheres to this pattern of syntax design, I would have expected that a function that split a string into a list of tokens that were delimited by a special character would allow a token to contain that delimiter if it were escaped.

If there be more patterns of syntax design, another one would involve escaping a whole sequence of characters so they revert from (potentially) special to normal interpretation, using a single special character at both ends of the sequence, and escaping that character and the escaping character inside the sequence with the same character that does this for a single character. All the escaping characters are discarded from the extracted token. Call this the multiple-escape mechanism.
Since Common Lisp adheres to this pattern of syntax design, too, with both its string syntax and the multiple-escape syntax for symbols, I would also have expected that a function that split a string into a list of tokens would allow a token to be delimited on either end by such a special character, and thus would avoid splitting in the middle of a token that contained the delimiter on which to split the string.

There are a whole lot of other patterns of syntax design: whether it be context-free or context-sensitive; how many characters (or tokens) of read-ahead it require; whether the start and end markers of individual elements be explicit or deduced. Common Lisp has a context-sensitive grammar (strictly speaking, since every change to the readtable and even the package changes the syntax, one cannot read a Common Lisp source file correctly without evaluating certain top-level forms and performing any evaluation required by the #. reader macro), described by the readtable, which implements support for recursive-descent parsers that may change the readtable while processing "their" portion of the input stream. (It is easy to describe the (static) syntax of the standard readtable that implements the standard language, but that is not sufficient for a real Common Lisp system.) Common Lisp's syntax is in this particular way very peculiar and breaks the patterns that have later been established for context-free grammars and LALR parsers. (Most interesting programming languages fail to adhere to these formal little things, anyway, but they are sometimes helpful in classifying things.)

Common Lisp's standard function to read delimited lists, read-delimited-list, expects to terminate on a character and cannot terminate on the end of input.

Yet another possible pattern is that of considering whitespace completely ignorable, but also necessary to keep tokens apart, because almost no other characters break tokens apart.
This means that whitespace differs from other delimiters in that repeated instances should be collapsed, but this is not really a feature of the delimiter, but of whitespace.

Since Common Lisp requires explicit start and end markers for almost all tokens (except symbols), and the end marker is already an argument to read-delimited-list, I think we have precedent for this argument to a function that parses a string, too.

Briefly, a "split-sequence" that adheres to these patterns would accept:

  1 a single-escape character, which defaults to #\\, but the argument may be nil to prevent this functionality

  2 a designator for a bag, i.e., character or sequence of characters, that are multiple-escape characters, which defaults to (#\"), but which can usefully be (#\" #\|) or (#\" #\'). If the bag is empty, no characters will be treated as multiple-escape characters. If two multiple-escape characters are adjacent and thus complete a "hard empty" field, it is returned as an empty string.

  3 a designator for a bag of characters that are to be considered whitespace and are thus to be ignored apart from their effect on terminating a token unless escaped. E.g., (#\space #\tab).

  4 an internal delimiter character that separates tokens, where tokens are considered "soft empty" if delimiters are adjacent (ignoring whitespace, if any) and if so, are returned as nil.

  5 a designator for a bag of terminating delimiters that cause the parsing to stop and return the tokens collected. If nil as a whole or the symbol :eof in a sequence, only the end of stream or string will terminate the parsed list of tokens, provided that it is not escaped.

I think the name parse-delimited-list would be descriptive and fitting.

A note on whitespace.
If some character that is usually a whitespace character is to be a delimiter in its own right, such as #\tab, the bag of whitespace characters should be empty and it be the internal delimiter -- or it will prevent adjacent delimiters from creating an empty field.

A note on bags. The function string-trim and its friends are the only functions I can find in Common Lisp that accept a bag of anything. A bag can be any sequence type, but if non-characters are required, such as :eof, it must be a list or a vector.

A note on designators for bags of characters. As with make-array, I think it is useful to be able to supply either a character or :eof without having to put it in a sequence just for the sake of this argument, so each designates a singleton sequence containing itself. Note that both nil and a string of one character are already bags.

A note on "hard" and "soft" empty fields. There may already be good reason to supply a "default list" for fields that are empty, modeled on make-pathname's defaults argument, but if so, we need a way to distinguish a field that is supplied as empty from one that is defaulted. In the presence of multiple-escape characters, it is possible to create the empty string explicitly ("hard"), as opposed to implicitly by omitting characters between delimiters ("soft"). Only a "soft empty" field would be eligible for defaulting from the defaults list. Note that supplying a short defaults list (which includes not supplying one) will default elements to nil, which is what a "soft empty" field would be returned as in the absence of a default list, so it need not be a special case.

I hope it is evident how I have picked elements from several functions already in Common Lisp. I hope the result has a "Common Lisp feel", which is also what patterns are about to me. I have been a little queasy about the split-sequence "feel", which to me has a more C- or Perl-like feel to it. (I know, I posted an early version and I am partly to blame. But then we age. Or something.)
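
For concreteness, here is a minimal sketch of how the five arguments and the defaults list described above might fit together, restricted to strings. It is not a settled interface: the lambda list, the keyword names, and the helper bag are assumptions made for illustration. It also assumes one possible reading of the whitespace rule (unescaped whitespace is dropped at either end of a token but kept inside it), and that the end of the string always terminates the list.

```lisp
;; Sketch only: BAG, the lambda list, and the keyword names are hypothetical.
(defun bag (designator)
  "Turn a bag designator (nil, a character, :eof, or a sequence) into a list."
  (typecase designator
    (null '())
    ((or character symbol) (list designator))
    (t (coerce designator 'list))))

(defun parse-delimited-list (string delimiter
                             &key (single-escape #\\)
                                  (multiple-escape #\")
                                  (whitespace '(#\Space #\Tab))
                                  (terminators nil)
                                  (defaults '()))
  "Split STRING on the character DELIMITER, honoring escapes and whitespace."
  (let ((multi (bag multiple-escape))
        (white (bag whitespace))
        (stops (bag terminators))
        (tokens '())
        (chars '())    ; characters of the current token, most recent first
        (gap '())      ; unescaped whitespace seen since the last token character
        (hard nil)     ; true once the current token has used any escape
        (closer nil)   ; the multiple-escape character currently open, if any
        (i 0)
        (n (length string)))
    (flet ((finish ()
             ;; A token with no characters and no escapes is "soft empty":
             ;; it takes its value from DEFAULTS, or nil when that runs out.
             (push (if (or hard chars)
                       (coerce (reverse chars) 'string)
                       (first defaults))
                   tokens)
             (pop defaults)
             (setf chars '() gap '() hard nil))
           (take (ch)
             ;; Interior whitespace is kept only once more token text follows.
             (setf chars (cons ch (append gap chars)) gap '())))
      (loop while (< i n)
            do (let ((ch (char string i)))
                 (incf i)
                 (cond ((and single-escape (char= ch single-escape))
                        ;; Single escape: take the next character literally.
                        (when (< i n)
                          (take (char string i))
                          (incf i))
                        (setf hard t))
                       (closer
                        ;; Inside a multiple-escape pair everything is literal.
                        (if (char= ch closer)
                            (setf closer nil)
                            (take ch)))
                       ((member ch multi) (setf closer ch hard t))
                       ((member ch white) (when (or chars hard) (push ch gap)))
                       ((char= ch delimiter) (finish))
                       ((member ch stops) (return))
                       (t (take ch)))))
      (finish)
      (nreverse tokens))))
```

With these assumptions, a soft empty field comes back as nil (or as the corresponding element of the defaults list) and a hard empty field as the empty string. Making #\Tab the delimiter requires passing an empty whitespace bag, since whitespace is checked before the delimiter.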

///
--
The past is not more important than the future, despite what your culture has taught you. Your future observations, conclusions, and beliefs are more important to you than those in your past ever will be. The world is changing so fast the balance between the past and the future has shifted.