From: Erik Naggum
Subject: Re: data structure for markup text
Date: 1999/06/18
Message-ID: <3138714312189865@naggum.no>
X-Deja-AN: 491129297
References: <375fa0ac.8976326@news.nova.es> <7jt1q3$jv4$1@nclient3-gui.server.virgin.net> <29ha3.14962$_m4.301127@news2.giganews.com>
mail-copies-to: never
Organization: Naggum Software; +47 8800 8879; http://www.naggum.no
Newsgroups: comp.lang.scheme,comp.text.sgml,comp.text.xml,comp.lang.lisp

* cbbrowne@news.hex.net (Christopher Browne)
| I have not yet come up with the "perfect" solution for handling
| attributes. (e.g. - how to represent ID="FINANCES" in the SGML ... )

in proper SGML terms the Generic Identifier is an attribute of the element the same way any other attributes are, but it has a special role. that is, <FOO BAR=1>zot</FOO> should be regarded as a structure with the attributes GI=FOO and BAR=1, and the contents "zot". note that this property of the generic identifier is used productively in the HyTime and SMDL standards, in that it is moved to a different attribute in order to use "architectural forms". this is very clever, but made into a horribly obscure technical point because people think of the Generic Identifier as somehow the _primary_ name of an element. in reality, it's _only_ a means of making the structure-verification support in SGML work, and some other attribute may well be much more important.

so if we allow the same kind of pun on name spaces as Common Lisp already uses in (foo bar) where FOO is in the function namespace and BAR is in the value namespace, my suggestion is ((foo :bar 1) "zot"), which I think is a clear winner over the runner-up (foo (:bar 1) "zot") for the simple reason that you can view (foo :bar 1) as a function call that returns a function that can deal with the contents. experienced Lisp or Scheme programmers may think of this as "currying".

the best way to deal with SGML is to specify an execution model along the lines of the functional form in Lisp.
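
a minimal sketch of the ((foo :bar 1) "zot") reading, written in Python only so that it runs as-is. the names `element`, `with_attributes` and `with_contents`, and the dictionary layout of the result, are assumptions of this sketch and not part of SGML or of any library. calling foo(bar=1) returns a function that deals with the contents, and the generic identifier is stored as just one more attribute, GI.

```python
def element(gi):
    """Return a constructor for elements whose generic identifier is `gi`."""
    def with_attributes(**attributes):
        # The generic identifier is stored as one more attribute, GI.
        attrs = {"GI": gi.upper()}
        attrs.update({name.upper(): value for name, value in attributes.items()})

        def with_contents(*contents):
            # The curried step: attributes are fixed, now take the contents.
            return {"attributes": attrs, "contents": list(contents)}

        return with_contents
    return with_attributes


foo = element("foo")

# Python spelling of ((foo :bar 1) "zot")
zot = foo(bar=1)("zot")
print(zot)
```

the two-step call is the point of the notation: foo(bar=1) can be computed once and then applied to any number of different contents.
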
the form ((foo :bar 1) "zot") may look like a Scheme form where (foo :bar 1) might be called in the standard Scheme execution model, but this is not so. we have to view that form as returning a function that should take the subforms literally, like a macro (which Scheme doesn't have in the way we need it). instead of a lambda list to take ordinary arguments, envision a function definition that takes an SGML content model and which may or may not check against this model when called depending on the processing mode (like validate, safe, unsafe). that is, we call processing functions from the outside and in, not from the inside and out as in Scheme's and Lisp's execution model for function calls.

| The issue here is that there are components that are sequential, and
| others that are not. For instance, a title is associated with the
| section, whilst the list of paragraphs are sequential:

very good point. this is one of the many serious problems with SGML. in essence, the contents of an element and the values of attributes are semantic "equals", but just as one attribute is elevated to Generic Identifier to help define SGML's concept of structure, contents is more powerful than attributes, in that it has substructure; the difference is that SGML does not have a "list" concept for attributes (except a string with one-level structure). Lisp programmers may think of the contents the same way they think of "implicit PROGN", and think of HANDLER-CASE and HANDLER-BIND as having body attributes with and without implicit PROGN.

in many cases, SGML forces you to use sub-elements because attributes cannot hold the information, although this is an artificial separation and a sub-element-cum-attribute could be as influential on the way an element is processed as other attributes, and attributes that don't actually have such a role might as well be sub-elements if it weren't for the cluttered namespace and the inability of elements in SGML to have different contents depending on their contexts.
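
the outside-in model can be sketched the same way, again in Python and again under loud assumptions: `define_element`, `process`, the tuple layout ((gi, attributes), contents...) and the set-of-allowed-children "content model" are all hypothetical simplifications. a real SGML content model is a regular expression over element names, and only two of the processing modes mentioned above are modelled here. each handler receives its contents unprocessed and decides for itself whether and how to process them.

```python
HANDLERS = {}


def define_element(gi, allowed_children=None):
    """Register a handler for `gi`; `allowed_children` is a toy content model."""
    def register(handler):
        def checked(attributes, contents, mode):
            # Check the contents against the model only in "validate" mode.
            if mode == "validate" and allowed_children is not None:
                for child in contents:
                    child_gi = "#PCDATA" if isinstance(child, str) else child[0][0]
                    if child_gi not in allowed_children:
                        raise ValueError(f"{child_gi} not allowed in {gi}")
            return handler(attributes, contents, mode)
        HANDLERS[gi] = checked
        return handler
    return register


def process(form, mode="validate"):
    """Process a form from the outside in: the outer handler runs first."""
    if isinstance(form, str):
        return form
    (gi, attributes), *contents = form
    return HANDLERS[gi](attributes, contents, mode)


@define_element("section", allowed_children={"para"})
def section(attributes, contents, mode):
    # The title is an attribute; the paragraphs are sequential contents.
    body = [process(child, mode) for child in contents]
    return "\n".join([attributes["title"].upper()] + body)


@define_element("para", allowed_children={"#PCDATA"})
def para(attributes, contents, mode):
    return "  " + "".join(process(child, mode) for child in contents)


@define_element("verbatim")
def verbatim(attributes, contents, mode):
    # Like a macro, this handler takes its subforms literally.
    return repr(contents)


doc = (("section", {"title": "Finances"}),
       (("para", {}), "first paragraph"),
       (("para", {}), "second paragraph"))

print(process(doc))
```

note that `section` never sees already-processed paragraphs, as it would with inside-out evaluation: it is handed the raw subforms and calls `process` on them itself, which is what lets `verbatim` leave its subforms alone.
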
| I'm quite certain that Erik Naggum has thought this through further than
| I have; I'd welcome his thoughts...

well, I hope it helped.

| "DTDs are not common knowledge because programming students are not
| taught markup. A markup language is not a programming language."
| -- Peter Flynn

I'll venture a different explanation: DTDs aren't common knowledge because they are extremely badly designed: they don't have any semantics apart from internal consistency of the "document", and there's no execution model for them that people can understand, but several that try to conflate different points of view onto them, like trying to insist on only one way to look at a Picasso, or like insisting that a Lisp form is _either_ code _or_ data in spite of the obvious need for a _viewpoint_ to make meaning emerge. programmers from the Algol family naturally have a hard time regarding their data as programs and probably find it mildly insane to talk about an execution model for the data in a document, despite the obvious fact that other markup languages _do_ come with an execution model -- they _are_ procedural.

this leads to another problem in SGML that might be addressed if we were to think in Common Lisp terms. the main problem with SGML is the lack of Lisp-style macros, and this is most visible in HTML. Cascading Style Sheets were invented to address this point. a better solution would simply have been to define macros that massaged the structure contained within them and returned something more specific. this, however, would probably mean that users would like to deal with macros and application-level functions the same way, but as is now apparent, SGML has set up a lot of conceptual barriers between things that are no more than different aspects of the same core principle:

1 attributes and generic identifiers. one might want to use some attribute to define processing and another to define the SGML-defined structure.
or one might want to modify the processing according to any number of attributes.

2 attributes and contents. the artificial prohibition against attributes with useful structure means some attributes need to be content. the flat namespace means some contents would be turned into attributes to avoid collisions.

3 element name and processing. the application is forced to deal with every issue in processing elements, _except_ validating the structure. why stop there? what is the big deal in prohibiting user-defined macros that could produce the same elements the application used to see? it's not like they needed to be invented -- all other markup languages at the time SGML was defined had them, but SGML viewed them as a weakness. too much of SGML's history, including HTML and XML, consists of attempts at solving problems that were introduced, rather than of results that fell out of simple and correct design.

if these artificial barriers had not been set up, DTDs would not _need_ to be "common knowledge", because they would have fit in with other language definitions, and instead of insisting that other people do something wrong when they don't understand you, it's much better to realize that there's something you can do differently that might help them understand.

insisting that programmers are the enemies of text processing, which goes back to an early rationale for SGML, that the users needed to take back control over the data formats from the programmers, has caused people who could program to ignore SGML, and people who wanted SGML to work to do a whole lot of amazingly stupid things that have alienated programmers even more. but people who ignore programmers while they want their services live in a world of unhealthy delusions, one of which is that programmers cannot understand user needs, which is an attitude that made sense when programmers were separate from users. every author of a LaTeX document knows that the line between user and programmer is very thin, indeed.
the same applies to anyone who has written any Visual Basic to make Word behave as they want, or indeed anyone who has defined his own functions in Excel. a computer user _is_ a programmer in the late 1990's, but we still see remnants of the 1960's attitude that users had to take whatever the programmers in white coats gave them. the sad thing is that both users and programmers lose big time from this silly schism: users can't _program_ HTML to do useful stuff, so have to invent or accept tools that destroy what SGML was intended to further: independence of data from applications, and writing HTML is performed by at least 50 different "languages" with unbelievably bad design, orders of magnitude worse than what the purists say would happen to SGML if it had user-defined macros, but then they got XML and were apparently not too alarmed by that.

how do you get people to understand markup languages? just make them into programming languages, again, and people will understand them, too. but insist that markup is not like programming, and you break the concept that input to some processor that changes its behavior is not affecting it in useful ways. insisting that the processor be maintained by someone else is very counter-productive. giving people too much control is also counter-productive, as the failed WYSIWYG experiment has proven.

I think something has to transcend SGML and XML and all the HTML cruft to get at the issues _people_ want, which is no more than an understandable conceptual route from input to output. SGML sets up road blocks on that route and insists that you do not want to go all the way. XML removed one roadblock among several and people could go further, but they still don't "get it": they are prohibited by politics, not technology, from going all the way. most of that politics lies in the insistence on reinventing all the programming language concepts needed to process SGML, like groves, or in being anal-retentive about the textual representation.

DTDs are not common knowledge among programming students because markup languages are artificially different from programming languages.

me, I believe in the division of labor and see no purpose in spending any time on formatting my own documents. people who know how to do that well should be allowed to do their job as best as they can, but anything that actually helps _both_ the author and the formatter (for lack of a better term) should be welcome. practice shows that very little formalism is required to make this work right, and once you have reinstalled division of labor concepts, the languages and notations used in the resulting output are quite immaterial, the same way HTML has become immaterial to users of FrontPage and other cruft that generates HTML that is tied very closely to the expected application environment.

I came from a Lisp background to SGML and saw a huge potential. I also saw that without a Lisp background, most of SGML was a complete mystery to people, and since Lisp was viewed as an abstract, academic programming language, and SGML people were allowed to dislike programmers, neither would they _acquire_ the results of a Lisp background. what I can bring to Lisp or to programming in general because of my SGML work is a better design of data formats and communications protocols, but I regard SGML as a major detour that had value the same way any costly mistake has value if you are determined to learn from it. what's more, I don't think SGML people will ever realize that they deal with artificial complications of rather simple ideas and have no reason to be smug about it towards people who prefer to keep simple ideas simple and move on to really complex ideas.

I suggest that people who want to understand how to process SGML should spend the time on something else entirely, like trying to figure out which things SGML does well and to build their own environment from that; my suggestion in that regard is to think about processing documents from the outside and in through a functional execution model, contextualizing the processing through dynamic binding of functions, etc. experienced SGML users will think that this is a conflation of the data format with the processing language, and that's precisely the idea: SGML has made a mistake in dividing them _too_ much. they don't have to be the same _or_ entirely separate.

but once you grasp this idea, why do you need SGML in your documents? just save the Lisp forms -- after all, you already have a Lisp pretty printer at your disposal and the syntax is sane and simple, compared to the contorted mess that is SGML, plus you get a lot more flexibility when you do this. Lisp offers the same structuring means as SGML does, only without artificial restrictions, and the only difference is that you will have to quote your strings and tighten up the overly relaxed entity model in SGML, but it is still a necessary component. Lisp programmers should think of the entity model as the #. reader macro or backquoted forms. in the end, you will find that you don't need anything from SGML if you have a programming language that can treat code as data and vice versa.

it seems I could go on for a long time, so I'll stop now.

#:Erik
--
@1999-07-22T00:37:33Z -- pi billion seconds since the turn of the century