From ... Path: archiver1.google.com!news2.google.com!newsfeed2.dallas1.level3.net!news.level3.com!news-out.visi.com!petbe.visi.com!eusc.inter.net!feed.news.tiscali.de!uio.no!nntp.uio.no!not-for-mail From: Erik Naggum Newsgroups: comp.lang.lisp Subject: Re: XML->sexpr ideas Date: 19 Jan 2004 12:24:42 +0000 Organization: Naggum Software, Oslo, Norway Lines: 77 Message-ID: <3283503882585065KL2065E_-_@naggum.no> References: Mime-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Transfer-Encoding: 8bit X-Trace: readme.uio.no 1074515083 12866 129.240.65.210 (19 Jan 2004 12:24:43 GMT) X-Complaints-To: abuse@uio.no NNTP-Posting-Date: Mon, 19 Jan 2004 12:24:43 +0000 (UTC) Mail-Copies-To: never User-Agent: Gnus/5.09 (Gnus v5.9.0) Emacs/21.3 Xref: archiver1.google.com comp.lang.lisp:10282 * Kenny Tilton | Of course it is easy enough for me to come up with a sexpr format off | the top of my head, but I seem to recall someone (Erik? Tim? Other?) | saying they had done some work on a formal approach to an alternative | to XML/HTML/whatever. | | True that? If so, I am all ears. Really? You are? Maybe I didn't survive 2003 and this is some Hell where people have to do eternal penance, and now I get to do SGML all over again. Much processing of SGML-like data appears to be stream-like and will therefore appear to be equivalent to an in-order traversal of a tree, which can therefore be represented with cons cells while the traverser maintains its own backward links elsewhere, but this is misleading. The amount of work and memory required to maintain the proper backward links and to make the right decisions is found in real applications to balloon and to cause random hacks; the query languages reflect this complexity. Ease of access to the parent element is crucial to the decision-making process, so if one wants to use a simple list to keep track of this, the most natural thing is to create a list of the element type, the parent, and the contents, such that each element has the form (type parent . contents), but this has the annoying property that moving from a particular element to the next can only be done by remembering the position of the current element in a list, just as one cannot move to the next element in a list unless you keep the cons cell around. However, the whole point of this exercise is to be able to keep only one pointer around. So the contents of an element must have the form (type parent contents . tail) if it has element contents or simply a list of objects, or just the object if simple enough. Example: 123 would thus be represented by (foo nil "123"), 123456 by (foo nil "123" bar nil "456"), and 123456 by #1=(zot nil (foo #1# "123" bar #1# "456")). Navigation inside this kind of structure is easy: When the contents in CADDR is exhausted, the CDDDR is the next element, or if NIL, we have exhausted the contents of the parent and move up to the CADR and look for its next element, etc. All the important edges of the containers that make up the *ML document are easily detectible and the operations that are usually found at the edges are normally tied to the element type (or as modified by its parents), are easily computable. However, using a list for this is cumbersome, so I cooked up the «quad». The «quad» is devoid of any intrinsic meaning because it is intended to be a general data structure, so I looked for the best meaningless names for the slots/accessors, and decided on QAR, QBR, QCR, and QDR. The quad points to the element type (like the operator in a sexpr) in the QAR, the parent (or back) quad in the QBR, the contents of the element in the QCR, and the usual pointer to the next quad in the QDR. Since the intent with this model is to «load» SGML/XML/SALT documents into memory, one important issue is how to represent long stretches of character content or binary content. The quad can easily be used to represent a (sequence of) entity fragments, with the source in QAR, the start position in QBR, and the end position in QCR, thereby using a minimum of memory for the contents. Since very large documents are intended to be loaded into memory, this property is central to the ability to search only selected elements for their contents -- most searching processors today parse the entire entity structure and do very little to maintain the parsed element structure. Speaking of memory, one simple and efficient way to implement the quad on systems that lack the ability to add native types without overhead, is to use a two-dimensional array with a second dimension of 4 and let quad pointers be integers, which is friendly to garbage collection and is unambiguous when the quad is used in the way explained above. Maybe I'll talk about SALT some other day. -- Erik Naggum | Oslo, Norway Act from reason, and failure makes you rethink and study harder. Act from faith, and failure makes you blame someone and push harder.