Subject: Re: data structure for markup text
From: Erik Naggum <erik@naggum.no>
Date: 1999/06/18
Newsgroups: comp.lang.scheme,comp.text.sgml,comp.text.xml,comp.lang.lisp
Message-ID: <3138714312189865@naggum.no>

* cbbrowne@news.hex.net (Christopher Browne)
| I have not yet come up with the "perfect" solution for handling
| attributes.  (e.g. - how to represent ID="FINANCES" in the SGML <SECT1
| ID="FINANCES"> ... </SECT1>)

  in proper SGML terms the Generic Identifier is an attribute of the
  element the same way any other attributes are, but it has a special role.
  that is, <foo bar=1>zot</foo> should be regarded as a structure with the
  attributes GI=FOO and BAR=1, and the contents "zot".  note that this
  property of the generic identifier is used productively in the HyTime and
  SMDL standards, in that it is moved to a different attribute in order to
  use "architectural forms".  this is very clever, but made into a horribly
  obscure technical point because people think of the Generic Identifier as
  somehow the _primary_ name of an element.  in reality, it's _only_ a
  means of making the structure-verification support in SGML work, and some
  other attribute may well be much more important.
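
  a minimal sketch of that view, in Python only to keep it runnable: an
  element is nothing but a set of attributes (one of them named GI) plus
  contents, and a hypothetical helper, view_through, lets another attribute
  play the GI's role, which is the spirit of architectural forms.  the
  attribute name HYTIME and the value clink are example data, and none of
  the names below come from any standard or library.

```python
# sketch only: an element is a dict of attributes plus its contents.
# the key "GI" and the helper names are illustrative assumptions.

def element(gi, contents=(), **attrs):
    """build <gi attrs...>contents</gi> with the GI stored as an attribute."""
    named = {k.upper(): v for k, v in attrs.items()}
    return {"attrs": {"GI": gi.upper(), **named}, "contents": list(contents)}

def view_through(elem, form_attr):
    """re-read an element with another attribute playing the GI role.

    the attribute named by form_attr supplies the name a processor would
    dispatch on.  elements lacking that attribute yield None.
    """
    attrs = dict(elem["attrs"])          # copy, the original is untouched
    if form_attr not in attrs:
        return None
    attrs["GI"] = attrs.pop(form_attr)
    return {"attrs": attrs, "contents": elem["contents"]}

# <foo bar=1>zot</foo>
foo = element("foo", ["zot"], bar=1)

# <xref hytime=clink linkend=FINANCES>see finances</xref>
link = element("xref", ["see finances"], hytime="clink", linkend="FINANCES")
```

  here foo["attrs"] holds GI=FOO and BAR=1 side by side, and
  view_through(link, "HYTIME") yields the same element with clink as its
  GI, while the document-level name XREF drops out of sight.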

  so if we allow the same kind of pun on name spaces as Common Lisp already
  uses in (foo bar) where FOO is in the function namespace and BAR is in
  the value namespace, my suggestion is ((foo :bar 1) "zot"), which I think
  is a clear winner over the runner-up (foo (:bar 1) "zot") for the simple
  reason that you can view (foo :bar 1) as a function call that returns a
  function that can deal with the contents.  experienced Lisp or Scheme
  programmers may think of this as "currying".
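
  the curried reading of ((foo :bar 1) "zot") can be sketched as follows,
  in Python for the sake of something runnable.  make_element and the
  layout of the result are assumptions made for the illustration, not a
  proposal for an actual API.

```python
# sketch only: ((foo :bar 1) "zot") read as a curried call.
# make_element and the result layout are illustrative assumptions.

def make_element(gi):
    """return the function that plays the role of the head (gi :attr value)."""
    def with_attributes(**attrs):
        def with_contents(*contents):
            return {"gi": gi, "attrs": attrs, "contents": list(contents)}
        return with_contents
    return with_attributes

foo = make_element("foo")

# ((foo :bar 1) "zot")
element = foo(bar=1)("zot")

# the head (foo :bar 1) is a value in its own right and can be reused
head = foo(bar=1)
siblings = [head("zot"), head("quux")]
```

  the runner-up (foo (:bar 1) "zot") would need one function taking both
  attributes and contents at once, so the head could not be passed around
  on its own.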

  the best way to deal with SGML is to specify an execution model along the
  lines of the functional form in Lisp.  the above may look like a Scheme
  form where (foo :bar 1) might be called in the standard Scheme execution
  model, but this is not so.  we have to view that form as returning a
  function that should take the subforms literally, like a macro (which
  Scheme doesn't have in the way we need it).  instead of a lambda list to
  take ordinary arguments, envision a function definition that takes an
  SGML content model and which may or may not check against this model when
  called depending on the processing mode (like validate, safe, unsafe).
  that is, we call processing functions from the outside and in, not from
  the inside and out, which is Scheme's and Lisp's execution model.
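
  a rough sketch of such an outside-in model, in Python: the handler for
  the outermost form is called first and receives its subforms unprocessed,
  and the content model is checked only in a validating mode.  the registry
  HANDLERS, the decorator define_element, the mode names, and the use of a
  regular expression over child names as a stand-in for a real content
  model are all assumptions made for this illustration.

```python
import re

# sketch only: forms are ((gi, attrs), *contents), like ((foo :bar 1) "zot").
# HANDLERS, define_element and the mode names are illustrative assumptions.
HANDLERS = {}

def define_element(gi, content_model):
    """register a handler with a regular expression over child names."""
    def register(fn):
        HANDLERS[gi] = (re.compile(content_model), fn)
        return fn
    return register

def child_names(contents):
    # strings count as #PCDATA; sub-forms count by their generic identifier
    return " ".join("#PCDATA" if isinstance(c, str) else c[0][0]
                    for c in contents)

def process(form, mode="validate"):
    """call the handler of the outermost form; it gets its subforms raw."""
    (gi, attrs), *contents = form
    model, handler = HANDLERS[gi]
    if mode == "validate" and not model.fullmatch(child_names(contents)):
        raise ValueError(f"content of {gi} does not match its model")
    return handler(attrs, contents, mode)

@define_element("sect1", r"title( para)*")
def sect1(attrs, contents, mode):
    head, *paras = contents
    # the handler decides when and how its children are processed
    return {"id": attrs.get("id"),
            "title": process(head, mode),
            "body": [process(p, mode) for p in paras]}

@define_element("title", r"#PCDATA")
def title(attrs, contents, mode):
    return contents[0].upper()

@define_element("para", r"(#PCDATA)?")
def para(attrs, contents, mode):
    return "".join(contents)

doc = (("sect1", {"id": "FINANCES"}),
       (("title", {}), "Finances"),
       (("para", {}), "first paragraph"),
       (("para", {}), "second paragraph"))
result = process(doc)
```

  note that nothing is evaluated before sect1's handler runs; the children
  are processed only when and if that handler asks for it.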

| The issue here is that there are components that are sequential, and
| others that are not.  For instance, a title is associated with the
| section, whilst the list of paragraphs are sequential:

  very good point.  this is one of the many serious problems with SGML.  in
  essence, the contents of an element and the values of attributes are
  semantic "equals", but just as one attribute is elevated to Generic
  Identifier to help define SGML's concept of structure, contents is more
  powerful than attributes, in that it has substructure; the difference is
  that SGML does not have a "list" concept for attributes (except a string
  with one-level structure).  Lisp programmers may think of the contents
  the same way they think of "implicit PROGN", and think of HANDLER-CASE
  and HANDLER-BIND as having body attributes with and without implicit
  PROGN.  in many cases, SGML forces you to use sub-elements because
  attributes cannot hold the information, although this is an artificial
  separation and a sub-element-cum-attribute could be as influential on the
  way an element is processed as other attributes, and attributes that
  don't actually have such a role might as well be sub-elements if it
  weren't for the cluttered namespace and the inability of elements in SGML
  to have different contents depending on their contexts.
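
  to illustrate that the separation is artificial, here is a small Python
  sketch in which an attribute value may itself carry structure, so a
  title can live as a sub-element or as an attribute and mean the same
  thing.  the helper lift_to_attributes and the form layout are
  assumptions of the sketch.

```python
# sketch only: forms are ((gi, attrs), *contents); an attribute value may
# be a list with substructure.  lift_to_attributes is a hypothetical helper.

def lift_to_attributes(form, names):
    """move sub-elements whose GI is in names into the attribute dict."""
    (gi, attrs), *contents = form
    attrs = dict(attrs)
    rest = []
    for child in contents:
        is_form = not isinstance(child, str)
        if is_form and child[0][0] in names and child[0][0] not in attrs:
            # keep the sub-element's contents as a list, structure intact;
            # its own attributes are ignored in this sketch
            attrs[child[0][0]] = list(child[1:])
        else:
            rest.append(child)
    return ((gi, attrs), *rest)

# the title as a sub-element, because SGML attributes cannot hold structure
as_subelement = (("sect1", {"id": "FINANCES"}),
                 (("title", {}), "Finances ", (("emph", {}), "1999")),
                 (("para", {}), "one"),
                 (("para", {}), "two"))

# the same title as an attribute whose value has substructure
as_attribute = (("sect1", {"id": "FINANCES",
                           "title": ["Finances ", (("emph", {}), "1999")]}),
                (("para", {}), "one"),
                (("para", {}), "two"))

same = lift_to_attributes(as_subelement, {"title"}) == as_attribute
```

  what remains as contents after the lift is exactly the sequential part,
  the paragraphs.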

| I'm quite certain that Erik Naggum has thought this through further than
| I have; I'd welcome his thoughts...

  well, I hope it helped.

| "DTDs are not common knowledge because programming students are not
| taught markup.  A markup language is not a programming language."
| -- Peter Flynn <silmaril@m-net.arbornet.org>

  I'll venture a different explanation: DTDs aren't common knowledge
  because they are extremely badly designed: they don't have any semantics
  apart from internal consistency of the "document", and there's no
  execution model for them that people can understand, but several that try
  to conflate different points of view onto them, like trying to insist on
  only one way to look at a Picasso, or like insisting that a Lisp form is
  _either_ code _or_ data in spite of the obvious need for a _viewpoint_ to
  make meaning emerge.  programmers from the Algol family naturally have a
  hard time regarding their data as programs and probably find it mildly
  insane to talk about an execution model for the data in a document,
  despite the obvious fact that other markup languages _do_ come with an
  execution model -- they _are_ procedural.  this leads to another problem
  in SGML that might be addressed if we were to think in Common Lisp terms.

  the main problem with SGML is the lack of Lisp-style macros, and this is
  most visible in HTML.  Cascading Style Sheets were invented to address
  this point.  a better solution would simply have been to define macros
  that massaged the structure contained within them and returned something
  more specific.  this, however, would probably mean that users would like
  to deal with macros and application-level functions the same way, but as
  is now apparent, SGML has set up a lot of conceptual barriers between
  things that are no more than different aspects of the same core principle:

1 attributes and generic identifiers.  one might want to use some attribute
  to define processing and another to define the SGML-defined structure.
  or one might want to modify the processing according to any number of
  attributes.

2 attributes and contents.  the artificial prohibition against attributes
  with useful structure means some attributes need to be content.  the
  flat namespace means some contents would be turned into attributes to
  avoid collisions.

3 element name and processing.  the application is forced to deal with
  every issue in processing elements, _except_ validating the structure.
  why stop there?  what is the big deal in prohibiting user-defined macros
  that could produce the same elements the application used to see?  it's
  not like they needed to be invented -- all other markup languages at the
  time SGML was defined had them, but SGML viewed them as a weakness.
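
  what user-defined macros over markup could look like is sketched below
  in Python: a macro receives its attributes and unexpanded contents and
  returns a new form built from elements the application already knows.
  the table MACROS, the decorator defmacro, and the element names warning
  and sgml are hypothetical and serve only as illustration.

```python
# sketch only: user-defined macros over forms shaped ((gi, attrs), *contents).
# MACROS, defmacro and the macro names are illustrative assumptions.
MACROS = {}

def defmacro(gi):
    def register(fn):
        MACROS[gi] = fn
        return fn
    return register

def expand(form):
    """rewrite macro elements until only ordinary elements remain."""
    if isinstance(form, str):
        return form
    (gi, attrs), *contents = form
    if gi in MACROS:
        # the macro sees its contents unexpanded and returns a new form
        return expand(MACROS[gi](attrs, contents))
    return ((gi, attrs), *(expand(c) for c in contents))

@defmacro("warning")
def warning(attrs, contents):
    return (("p", {"class": "warning"}), (("b", {}), "Warning: "), *contents)

@defmacro("sgml")
def sgml(attrs, contents):
    title = "Standard Generalized Markup Language"
    return (("acronym", {"title": title}), "SGML")

page = (("body", {}),
        (("warning", {}), "do not parse ", (("sgml", {}),), " by hand."))
expanded = expand(page)
```

  the application sees only body, p, b, and acronym after expansion, the
  same elements it would have seen had the author typed them out.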

  too much of SGML's history, including HTML and XML, consists of attempts
  at solving problems that were introduced, rather than which fell out of
  simple and correct design.  if these artificial barriers had not been set
  up, DTDs would not _need_ to be "common knowledge", because they would
  have fit in with other language definitions, and instead of insisting
  that other people do something wrong when they don't understand you, it's
  much better to realize that there's something you can do differently that
  might help them understand.

  insisting that programmers are the enemies of text processing, which goes
  back to an early rationale for SGML, that the users needed to take back
  control over the data formats from the programmers, has caused people who
  could program to ignore SGML, and people who wanted SGML to work to do a
  whole lot of amazingly stupid things that have alienated programmers even
  more.  but people who ignore programmers while they want their services
  live in a world of unhealthy delusions, one of which is that programmers
  cannot understand user needs, which is an attitude that made sense when
  programmers were separate from users.  every author of a LaTeX document
  knows that the line between user and programmer is very thin, indeed.
  the same applies to anyone who has written any Visual Basic to make Word
  behave as they want, or indeed anyone who has defined his own functions
  in Excel.  a computer user _is_ a programmer in the late 1990's, but we
  still see remnants of the 1960's attitude that users had to take whatever
  the programmers in white coats gave them.  the sad thing is that both
  users and programmers lose big time from this silly schism: users can't
  _program_ HTML to do useful stuff, so have to invent or accept tools that
  destroy what SGML was intended to further: independence of data from
  applications, and writing HTML is performed by at least 50 different
  "languages" with unbelievably bad design, orders of magnitude worse than
  what the purists say would happen to SGML if it had user-defined macros,
  but then they got XML and were apparently not too alarmed by that.

  how do you get people to understand markup languages?  just make them
  into programming languages, again, and people will understand them, too.
  but insist that markup is not like programming, and you break the concept
  that input to some processor that changes its behavior is not affecting
  it in useful ways.  insisting that the processor be maintained by someone
  else is very counter-productive.  giving people too much control is also
  counter-productive, as the failed WYSIWYG experiment has proven.

  I think something has to transcend SGML and XML and all the HTML cruft to
  get at the issues _people_ want, which is no more than an understandable
  conceptual route from input to output.  SGML sets up road blocks on that
  route and insists that you do not want to go all the way.  XML removed one
  roadblock among several and people could go further, but they still don't
  "get it": they are prohibited by politics, not technology, from going all
  the way.  most of that politics lies in the insistence on reinventing all
  the programming language concepts needed to process SGML, like groves, or
  in being anal-retentive about the textual representation.

  DTDs are not common knowledge among programming students because markup
  languages are artificially different from programming languages.

  me, I believe in the division of labor and see no purpose in spending any
  time on formatting my own documents.  people who know how to do that well
  should be allowed to do their job as best as they can, but anything that
  actually helps _both_ the author and the formatter (for lack of a better
  term) should be welcome.  practice shows that very little formalism is
  required to make this work right, and once you have reinstalled division
  of labor concepts, the languages and notations used in the resulting
  output are quite immaterial, the same way HTML has become immaterial to
  users of FrontPage and other cruft that generates HTML that is tied very
  closely to the expected application environment.

  I came from a Lisp background to SGML and saw a huge potential.  I also
  saw that without a Lisp background, most of SGML was a complete mystery
  to people, and since Lisp was viewed as an abstract, academic programming
  language, and SGML people were allowed to dislike programmers, neither
  would they _acquire_ the results of a Lisp background.  what I can bring
  to Lisp or to programming in general because of my SGML work is a better
  design of data formats and communications protocols, but I regard SGML as
  a major detour that had value the same way any costly mistake has value
  if you are determined to learn from it.  what's more, I don't think SGML
  people will ever realize that they deal with artificial complications of
  rather simple ideas and have no reason to be smug about it towards people
  who prefer to keep simple ideas simple and move on to really complex ideas.

  I suggest that people who want to understand how to process SGML should
  spend the time on something else entirely, like trying to figure out
  which things SGML does well and to build their own environment from that;
  my suggestion in that regard is to think about processing documents from
  the outside and in through a functional execution model, contextualizing
  the processing through dynamic binding of functions, etc.  experienced
  SGML users will think that this is a conflation of the data format with
  the processing language, and that's precisely the idea: SGML has made a
  mistake in dividing them _too_ much.  they don't have to be the same _or_
  entirely separate.  but once you grasp this idea, why do you need SGML in
  your documents?  just save the Lisp forms -- after all, you already have
  a Lisp pretty printer at your disposal and the syntax is sane and simple,
  compared to the contorted mess that is SGML, plus you get a lot more
  flexibility when you do this.  Lisp offers the same structuring means as
  SGML does, only without artificial restrictions, and the only difference
  is that you will have to quote your strings and tighten up the overly
  relaxed entity model in SGML, but it is still a necessary component.
  Lisp programmers should think of the entity model as the #. reader macro
  or backquoted forms.
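
  the idea of contextualizing the processing through dynamic binding of
  functions can be sketched in Python, which lacks special variables, by
  rebinding entries in a handler table for the duration of one element's
  processing.  every name below (HANDLERS, rebinding, the element names,
  and the output conventions) is an assumption made for the illustration.

```python
from contextlib import contextmanager

# sketch only: handlers live in a table that can be rebound for the
# dynamic extent of processing one element.  all names are assumptions.
HANDLERS = {}

@contextmanager
def rebinding(**new_handlers):
    """temporarily bind handlers; the old bindings return on exit."""
    saved = {gi: HANDLERS.get(gi) for gi in new_handlers}
    HANDLERS.update(new_handlers)
    try:
        yield
    finally:
        for gi, old in saved.items():
            if old is None:
                HANDLERS.pop(gi, None)
            else:
                HANDLERS[gi] = old

def process(form):
    if isinstance(form, str):
        return form
    (gi, attrs), *contents = form
    return HANDLERS[gi](attrs, contents)

def process_all(contents):
    return "".join(process(c) for c in contents)

def emph(attrs, contents):
    return "*" + process_all(contents) + "*"

def emph_in_title(attrs, contents):
    return process_all(contents).upper()

def title(attrs, contents):
    # inside a title, emph means something else
    with rebinding(emph=emph_in_title):
        return "== " + process_all(contents) + " ==\n"

def para(attrs, contents):
    return process_all(contents) + "\n"

def sect1(attrs, contents):
    return process_all(contents)

HANDLERS.update(emph=emph, title=title, para=para, sect1=sect1)

doc = (("sect1", {"id": "FINANCES"}),
       (("title", {}), "the ", (("emph", {}), "real"), " finances"),
       (("para", {}), "money is ", (("emph", {}), "tight"), "."))
text = process(doc)
```

  the same emph element comes out in capitals inside the title and between
  asterisks inside the paragraph, without either handler knowing about the
  other context.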

  in the end, you will find that you don't need anything from SGML if you
  have a programming language that can treat code as data and vice versa.

  it seems I could go on for a long time, so I'll stop now.

#:Erik
-- 
@1999-07-22T00:37:33Z -- pi billion seconds since the turn of the century