Subject: Re: XML and lisp
From: rpw3@rigden.engr.sgi.com (Rob Warnock)
Date: 27 Aug 2001 10:39:11 GMT
Newsgroups: comp.lang.lisp
Message-ID: <9md80f$ei0bi$1@fido.engr.sgi.com>
Erik Naggum  <erik@naggum.net> wrote:
+---------------
| rpw3@rigden.engr.sgi.com (Rob Warnock)
| > While not repeatable, attributes *are* omissible if the DTD for those
| > attribute contains either default values or the "#IMPLIED" status keyword,
| > are they not?
| 
|   That depends on whether you represent the parsed or pre-parsed structure.
|   In a Common Lisp setting, we are dealing with parsed structure.  If the
|   attribute value is "implied" in the source, it still needs to be there
|   in the parsed structure.
+---------------

*Doh!* I think I finally get what you were trying to say, thanks!

+---------------
| > So if the DTD said:
| > 	<!ELEMENT foo (bar | PCDATA)*>
| > 	<!ATTLIST foo bar NUMBER #IMPLIED>
| > that is, the "foo" element has an optional "bar" attribute *and* also
| > allows an arbitrary number of "bar" sub-elements, then (foo (bar 1) (bar
| > 2)) *would* be ambiguous.
| 
|   If you choose to represent a pre-parsed SGML instance in Common Lisp...
+---------------

Or a half-parsed (i.e., half-assed)?  ;-}

+---------------
|   I would argue strongly against that before I would even attempt to
|   answer anything else.
| 
|   I _really_ mean it when I say that the attribute list has a fixed length.
+---------------

Got it. Now let's see if I can explain it to others who may not have:

My understanding of what Erik is suggesting [very strongly!] is that one
should *NOT* try to invent any kind of direct "Lispified" or S-expr
restatement of XML/HTML/SGML *syntax* per se, but instead to *parse*
the XML document and choose convenient (potentially element-specific)
CL representations for the parsed elements. This parsing process will
involve filling in default values for omitted attributes, including those
whose default is "#IMPLIED". Once you have done this parsing, there is
nothing "optional" at all about any of the attributes -- you now have
*all* of their values. [Whether you choose to explicitly store defaulted
ones or not is a separate decision -- in any event you know their values.]

Now, having parsed the element and filled in the defaults, how you
choose to represent it in CL data is pretty much up to you. One way
might be as an instance of a CLOS class, with the attributes as slots
[plus a slot for the sub-elements, if it's not an empty element]. This
would allow you to use a generic function (print-element elem style)
that specialized on both the element type and the desired output style
to output completely different texts from the same parsed document.
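
A minimal sketch of that CLOS approach, for the running "foo"/"bar"
example. All names here (ELEMENT, FOO-ELEMENT, BAR-ELEMENT, PRINT-CONTENT,
the :XML and :TEXT styles) are my own illustrative assumptions, as is
the choice to store character data as plain strings and to use 0 as the
default for the "bar" attribute:

```lisp
;; Hypothetical class and function names; a sketch, not a real XML library.
(defclass element ()
  ((content :initarg :content :initform '() :accessor element-content)))

(defclass foo-element (element)
  ;; After parsing, the BAR attribute always has a value; 0 stands in
  ;; for the application-implied default.
  ((bar :initarg :bar :initform 0 :accessor foo-bar)))

(defclass bar-element (element) ())

(defgeneric print-element (elem style)
  (:documentation "Return ELEM rendered as a string in STYLE."))

;; Character data is assumed to be stored as plain strings.
(defmethod print-element ((elem string) style)
  (declare (ignore style))
  elem)

(defun print-content (elem style)
  "Concatenate the renderings of ELEM's sub-elements."
  (format nil "~{~A~}"
          (mapcar (lambda (child) (print-element child style))
                  (element-content elem))))

;; Methods specialize on both the element class and the output style.
(defmethod print-element ((elem foo-element) (style (eql :xml)))
  (format nil "<foo bar=\"~D\">~A</foo>"
          (foo-bar elem) (print-content elem style)))

(defmethod print-element ((elem bar-element) (style (eql :xml)))
  (format nil "<bar>~A</bar>" (print-content elem style)))

(defmethod print-element ((elem foo-element) (style (eql :text)))
  (format nil "foo[bar=~D]: ~A"
          (foo-bar elem) (print-content elem style)))

(defmethod print-element ((elem bar-element) (style (eql :text)))
  (format nil "(~A)" (print-content elem style)))
```

The same parsed FOO-ELEMENT instance can then be handed to PRINT-ELEMENT
with either :XML or :TEXT to get two unrelated renderings.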

Another way is a simple list of the element name[*] followed by the
values of the attributes (with or without attendant "keywords" to
make them readable to humans debugging the program) followed by the
rest of the contained elements (if any). Without any attribute markers
at all, this might have a form similar in appearance (only!) to a
function call with positional parameters, that is:

	<foo bar="1"><bar>2</bar></foo>

after parsing might be internally represented as:

	(foo 1 (bar 2))

Or if you choose to add some element-like structure to the attributes,
you can do that, too. [You might choose to do that if (*ugh!* *shudder!*)
some attributes contain further internal structure, and you'd like to
represent the *parsed* version of that structure in a pleasing way.]
That gets us to:

	(foo (bar 1) (bar 2))

But again, since all of the application routines that have to deal with
a "foo" element *know* that "foo" has a "bar" attribute, all of the code
[that cares about attributes] knows that the CADADR is the attribute value
and the CDDR is the content.
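
To make that concrete, here is a small sketch of the two accessors for
the element-like form. The function names FOO-BAR-ATTRIBUTE and
FOO-CONTENT are hypothetical, chosen only for this illustration:

```lisp
;; Accessors for the element-like representation (foo (bar V) . CONTENT).
;; Both names are hypothetical.
(defun foo-bar-attribute (foo)
  "Value of the BAR attribute of a parsed FOO element."
  (cadadr foo))        ; (car (cdr (car (cdr foo))))

(defun foo-content (foo)
  "List of sub-elements of a parsed FOO element."
  (cddr foo))

;; (foo-bar-attribute '(foo (bar 1) (bar 2)))  => 1
;; (foo-content       '(foo (bar 1) (bar 2)))  => ((BAR 2))
```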

Now suppose that the application-implied value for the attribute "bar"
is zero, and we are given this to parse:

	<foo><bar>2</bar><bar>17</bar></foo>

What I (finally) heard Erik say is that the only reasonable internal
representation for that (depending on whether you chose the "positional"
or "element-like" representation for foo's attributes) would be one of
these forms:

	(foo 0 (bar 2) (bar 17))
or:
	(foo (bar 0) (bar 2) (bar 17))

That is, the structure of the CL representation *must* be invariant
w.r.t. inclusion or omission of attributes in the source text. So in
the second form, the CADADR is still the attribute value and the CDDR
is still the content, even though the attribute was omitted in the
source text.
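
A sketch of the parse-time step that guarantees this invariance follows.
The names MAKE-FOO and *FOO-BAR-DEFAULT* are hypothetical, and so is the
assumption that the parser hands over the attributes actually present
in the source as an alist:

```lisp
;; Hypothetical constructor: fills in the default at parse time, so the
;; attribute slot is always present in the resulting list.
(defparameter *foo-bar-default* 0
  "Application-implied value of FOO's BAR attribute (zero, as in the text).")

(defun make-foo (source-attributes content)
  "Build the element-like representation of a FOO element.
SOURCE-ATTRIBUTES is an alist of the attributes written in the source
text; CONTENT is the list of already-parsed sub-elements."
  (let ((entry (assoc 'bar source-attributes)))
    (list* 'foo
           (list 'bar (if entry (cdr entry) *foo-bar-default*))
           content)))

;; Attribute omitted in the source:
;;   (make-foo '() '((bar 2) (bar 17)))   => (FOO (BAR 0) (BAR 2) (BAR 17))
;; Attribute given in the source:
;;   (make-foo '((bar . 1)) '((bar 2)))   => (FOO (BAR 1) (BAR 2))
```

Either way the result has the same shape, so code that takes the CADADR
and the CDDR never needs to ask whether the attribute was written out.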

+---------------
|   I also indicated that for pragmatic reasons, I sometimes use a marker to
|   separate the attributes from the contents in the cdr of the element, such
|   as when the task at hand would be wastefully slow if I were to deal with
|   a fully parsed structure.  Dirty hacks should be within reach because the
|   world is sometimes not clean.
+---------------

I now understand & agree.

+---------------
|   I am probably not going to get used to the habit of some people who
|   see a problem in one part of a proposal and ignore the fact that there
|   is a solution in another part of the same proposal (like the next
|   paragraph), and I am certainly not patient enough with all the rampant
|   idiocy in the SGML/XML world to explain this over and over, but please
|   go back and read the whole message.
+---------------

I did, and that's when the light finally dawned, but I have to say
that until one *does* finally understand, it's not at all obvious.
No, I don't know how you could have said it any more clearly. I can
only say (from personal experience now!) that if one *ever* falls into
the trap of trying to "Lispify" the *syntax* of XML instead of representing
the *parsed* structure, it can be very hard to let go of that fixation.

Hmmm... Perhaps it's some sort of "figure/ground" thing, as in that
classic picture <URL:http://www.lcsc.edu/ss150/u5s1p6.htm> used in
gestalt psychology. If you see the young woman first, it's sometimes
hard to then see the old hag (or vice versa). And one's history or
prejudices may strongly affect which one you see first, e.g., young
men tend to see the young woman first.

[Of course, once you've seen *both*, then it's much, much easier
to flip your perception back and forth at will between them.]


-Rob

[*] That is, as I mentioned in my parallel reply to Kent, a CL symbol
    chosen to *represent* the XML element name, not necessarily or even
    desirably any automatic conversion of the XML element name to a CL
    symbol.

-----
Rob Warnock, 30-3-510		<rpw3@sgi.com>
SGI Network Engineering		<http://reality.sgi.com/rpw3/>
1600 Amphitheatre Pkwy.		Phone: 650-933-1673
Mountain View, CA  94043	PP-ASEL-IA

[Note: aaanalyst@sgi.com and zedwatch@sgi.com aren't for humans ]