From: Steve Haflich

Subject: Re: XML parser and line feeds between tags

Date: 2003-12-16 12:04

   From: Laurent Eschenauer <pepite.be at laurent>
   
   I have an issue with the xml parser in ACL 6.2 (pxml) when using line
   feeds. Looking at the XML specs, I understand that the XML parser should 
   ignore line feeds and extra whitespace. However when I parse the following 
   file with ACL 6.2 :
   
   <team>
   <person id="b001" name="laurent eschenauer"/>
   <person id="b002" name="cedric gauthy"/>
   </team>
   
   Using the command: (parse-xml stream :content-only t)
   
   I receive:
   
   ((team " 
   " ((person id "b001" name "laurent eschenauer")) "
   " ((person id "b002" name "cedric gauthy")) "
   "))
   
   As you can see, all line feeds are handled by the parser as tokens
   while they should not be visible (according to the XML specs at
   http://www.xml.com/axml/testaxml.htm).
   
I think you are misreading the XML standard.  (The annotated one you
cite is the 1998 version -- the 2000 revision is available at w3c.org,
but I don't think it is any different in this regard.)

Section 2.10 of the standard states:

  An XML processor must always pass all characters in a document that
  are not markup through to the application.

That's unequivocal.  It is also supported by the annotated text in the
document you cite -- click on the second circle-T annotation in
Section 2.10.

  A validating XML processor must also inform the application which
  of these characters constitute white space appearing in element
  content.

The ACL pxml is _not_ a validating parser, but even if it were, a
validating parser is still _required_ to pass back to the client
application all whitespace that is not markup.  (Very informally, that
includes all whitespace not inside angle brackets.)  A
validating parser should differentiate whitespace that appears in
places where regular character data cannot appear (e.g. between an
<ol> and an <li> tag) because that whitespace cannot be part of the
significant document text.  However, it must still be presented to the
application.  The third circle-T annotation in 2.10 verifies this.

At first it is hard to see the sense of this requirement.  But
consider an application such as an XML editor or XSL transformer.
There is no reason it shouldn't make use of the same parser as any
other application, but (since XML documents are supposed to be human
readable as a fallback) such an editor might want as much as possible
to preserve whitespace that serves only to make the document format
more readable.  This is also the reason parsers are _permitted_ (but
not required) to inform applications about comments, as does the SAX
interface.

But it is still up to the application whether to ignore whitespace
that appears in element content (i.e. where character data _may_ _not_
otherwise appear).  The processor (parser) must still present it to
the application.  Of course, whitespace that appears where character
data _may_ appear must also be preserved.  Our phtml used to screw
this up on input such as

     <i>foo</i> <i>bar</i>

and run the two words together.  (HTML has _very_ different rules for
whitespace preservation, but the issue is similar.)
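
For a data-oriented document like the <team> example, the application
can discard the whitespace itself after parsing.  What follows is only
a minimal sketch, assuming the list structure shown in the parser
output above: a node is either a string or a list whose first item is
the tag (a symbol, or a list of the tag followed by attribute names and
values) and whose remaining items are the children.  The names
whitespace-string-p and strip-whitespace are hypothetical helpers, not
part of pxml.

```lisp
;; Hypothetical helpers; not part of pxml.  They assume the list
;; structure shown in the parser output quoted earlier.

(defun whitespace-string-p (x)
  "True if X is a string made up only of space, tab, CR, and LF."
  (and (stringp x)
       (every (lambda (ch)
                (member ch '(#\Space #\Tab #\Newline #\Return)))
              x)))

(defun strip-whitespace (node)
  "Return a copy of NODE with whitespace-only child strings removed."
  (if (consp node)
      ;; The car is the tag (possibly with attributes); keep it as is
      ;; so that attribute values are never touched.
      (cons (car node)
            (mapcar #'strip-whitespace
                    (remove-if #'whitespace-string-p (cdr node))))
      node))

;; The parser returns a list of top-level nodes, so map over it:
;;   (mapcar #'strip-whitespace (parse-xml stream :content-only t))
```

This is appropriate only when the application knows that no element
mixes text with child elements.  Applied to the <i>foo</i> <i>bar</i>
case it would drop the separating space and run the two words
together, which is exactly the mistake described above.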

  A special attribute named xml:space may be attached to an element to
  signal an intention that in that element, white space should be
  preserved by applications.

But even this attribute does not allow the parser to suppress
whitespace.  It is merely a suggestion to the client application where
whitespace should be considered significant, for example, to
differentiate sections of prose and poetry.