Subject: Re: Splitting a string on a character...
From: Erik Naggum <erik@naggum.net>
Date: Tue, 07 May 2002 02:40:41 GMT
Newsgroups: comp.lang.lisp
Message-ID: <3229728040064576@naggum.net>

* Cory Spencer
| Just a quickie question - is there already a Common Lisp function that
| will split a string on a given character?

  Most often, when people ask quickie questions, they have been working
  themselves through what one would think of as a labyrinth where they make
  brief excursions in the wrong direction and self-correct when they hit
  the wall, so to speak.  When they hit the wall and do not self-correct,
  they post a quickie question, but there is an arbitrary amount of
  backtracking involved in providing the right answer.  Just moving the person
  into a new labyrinth without the particular wall they have run into is
  seldom the best answer, as the wrong choice they have made will lead them
  right into another wall shortly thereafter.  Therefore, a "quickie" is a
  strong signal to experienced problem-solvers that something is wrong: The
  requestor is stuck, but does not think he should have been.  However, if
  his thinking were correct, he would not be stuck.  Yet he is, and that is
  a hint that the amount of backtracking required will be significant and
  that is just the opposite of a "quickie".
  
| ie) will perform a similar function as this:

  Generally speaking, a reader or parser of some sort.

  It is quite important to realize that you will never, ever have a case
  where you can entirely get rid of the "splitting" character.  If you
  think you can legitimately expect this, you are just too inexperienced at
  what you are doing and will run into a problem sooner or later.  Let me
  give you a few examples.  Under Unix, you cannot have a colon in your
  login name, in your home directory name, in your real name, or in your
  shell, because the colon separates these fields in a system password
  file.  (Not to mention null bytes and newlines.)  This is just too dumb
  to be believable on the face of it, but it is actually the case.  Unix
  freaks do not think this is a problem because they internalize the rules
  and do not _want_ a colon in those places.  However, software that
  updates the password file has to do sanity checks in order not to expose
  the system to serious security risks because there is no way to escape a
  payload colon from the delimiting colon.  In the standard Unix shells,
  whitespace separates arguments, but you have several escaping forms to
  allow whitespace to exist in arguments.  All in all, the mechanisms that
  are used in the shell are quite arcane and difficult to predict from a
  program, but a user can usually deal with it, in the standard Unix idea
  of "usually".  Then there is HTML and URL's and all that crap.  To make
  sure that a character is always a payload character, it must be written
  as &#nnn;, where nnn is the ISO 10646 code for the character, or you have to
  engage in table lookups, context-sensitive parsing rules, and all sorts
  of random weirdness.  Likewise, in URL's, it is incredibly hard to get
  all you want through to the other side.  Recently, I subscribed to the
  Unabridged Merriam-Webster dictionary, and they need the e-mail address
  as the username.  It turned out to be very hard to write a URL that had a
  payload @ in the username and a syntax @ before the hostname.  I actually
  find such things absolutely incredible -- to be so thoughtless must have
  been _really_ hard.
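
  To make the URL case concrete: the usual way out is to percent-encode
  the payload characters in the userinfo part, so that the only literal @
  left is the syntactic one.  A minimal sketch in Python, assuming the
  server decodes the userinfo again on its side; the function name and the
  example.org/example.com names are made up for illustration:

```python
from urllib.parse import quote, unquote, urlsplit

def url_with_userinfo(scheme, username, password, host, path="/"):
    """Build scheme://user:password@host/path with payload characters escaped.

    Hypothetical helper, not part of any library.
    """
    # safe="" makes quote() escape every reserved character,
    # including the payload '@' and ':' that would otherwise
    # be mistaken for syntax.
    user = quote(username, safe="")
    secret = quote(password, safe="")
    return f"{scheme}://{user}:{secret}@{host}{path}"

url = url_with_userinfo("http", "reader@example.org", "s3cret:pw",
                        "dictionary.example.com")
parts = urlsplit(url)
```

  Here url contains exactly one literal @, the one before the hostname,
  and unquote(parts.username) gives back the original address.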

  This is why you should not use position to find a character to split on;
  you should use a state machine that traverses the string and finds only
  those (matching) characters that are syntactically relevant, not those
  (matching) characters that are (or should be) payload characters.  A
  regular expression is _not_ sufficient for this task.
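
  A minimal sketch of such a state machine, in Python for brevity, and
  under the assumption of one particular convention: a backslash turns the
  character after it into payload.  The function name, the default
  delimiter, and the escape character are illustrative choices; use
  whatever your format actually defines:

```python
def split_escaped(text, delimiter=":", escape="\\"):
    """Split text on syntactic delimiters only; escaped ones stay payload."""
    fields = []
    current = []
    escaped = False  # state: was the previous character the escape?
    for ch in text:
        if escaped:
            # Whatever follows the escape is payload, delimiter or not.
            current.append(ch)
            escaped = False
        elif ch == escape:
            escaped = True
        elif ch == delimiter:
            # A syntactically relevant delimiter: close the current field.
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    if escaped:
        raise ValueError("dangling escape character at end of input")
    fields.append("".join(current))
    return fields
```

  With these defaults, the input a\:b:c yields the two fields "a:b" and
  "c", whereas a search for the first colon would have cut the first
  field in half.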
-- 
  In a fight against something, the fight has value, victory has none.
  In a fight for something, the fight is a loss, victory merely relief.

  70 percent of American adults do not understand the scientific process.