Notes on XML DTD Design

This document describes some of the design decisions underlying the XML processing tools for this web site, and their influence on DTD design.

Design Goals

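There are two design goals for DTDs: the document structure should be as explicit as possible for the processing application, and documents should be easy for authors to write.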

With SGML DTDs, both goals can be achieved simultaneously in many applications, because the DTD can declare tags as optional. The author can omit them, but the parser will infer them for the application.

However, XML has no implied tags, so the two goals now conflict. For example, text processing applications have two major content models: vertical material and horizontal material. Horizontal material consists of words, hyperlinks, images floating in the text, and so on. Paragraphs of horizontal material, enumerations, displayed equations, etc. form the vertical material. Note that it is possible to turn horizontal material into vertical material (by creating a paragraph), but not vice versa, at least in our simple text processing system.

Document Examples

This asymmetry requires that we permit the more general vertical material in many places. For example, the contents of a section are obviously vertical material, but so are the contents of an individual list item (otherwise it could not contain more than one paragraph, or a nested list). If we mirror this as exactly as possible in the document structure, an author would have to write:

<SECTION TITLE="Some Section">
 <VBOX>
  <HBOX>This is some horizontal material.</HBOX>
  <LIST>
   <ITEM><VBOX><HBOX>This is the first list item.</HBOX></VBOX></ITEM>
   <ITEM>
    <VBOX>
     <HBOX>A nested list follows in this second list item.</HBOX>
     <LIST>
      <ITEM><VBOX><HBOX>first item in nested list</HBOX></VBOX></ITEM>
      <ITEM><VBOX><HBOX>second item in nested list</HBOX></VBOX></ITEM>
     </LIST>
    </VBOX>
   </ITEM>
  </LIST>
 </VBOX>
</SECTION>

Compare this to the equivalent HTML snippet:

<H1>Some Section</H1>
This is some horizontal material.
<UL>
 <LI>This is the first list item.
 <LI>A nested list follows in this second list item.
  <UL>
   <LI>first item in nested list
   <LI>second item in nested list
  </UL>
</UL>

The difference is remarkable, and few people would consider HTML as an authoring language if it were as verbose as the first example.

Design Centered on Structure

However, the first version in the previous section has an advantage which is hard to dismiss in many applications (especially if all documents are automatically generated): the XML document explicitly encodes the complete document structure. This is reflected by the DTD, which could look like this:

<!ELEMENT SECTION (VBOX)>
<!ATTLIST SECTION TITLE CDATA #REQUIRED>
<!ELEMENT VBOX (HBOX | LIST)+>
<!ELEMENT HBOX (#PCDATA)>
<!ELEMENT LIST (ITEM+)>
<!ELEMENT ITEM (VBOX)>

An important property of this DTD is its lack of redundancy. In all places in which vertical material can appear, it is represented by a VBOX element. (The same applies to horizontal material and HBOX elements.) You can easily add more kinds of vertical material, for example, images:

<!ELEMENT SECTION (VBOX)>
<!ATTLIST SECTION TITLE CDATA #REQUIRED>
<!ELEMENT VBOX (HBOX | LIST | IMAGE)+>
<!ELEMENT HBOX (#PCDATA)>
<!ELEMENT LIST (ITEM+)>
<!ELEMENT ITEM (VBOX)>
<!ELEMENT IMAGE EMPTY>
<!ATTLIST IMAGE FILE CDATA #REQUIRED>

Clearly, we have introduced a new element in the least intrusive way possible. Still, this simplicity on the DTD side burdens the document author with a lot of typing.
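
To illustrate why this DTD is convenient on the processing side, here is a minimal sketch in Python. It is not the tooling used for this site; the handler names, the HANDLERS table, and the indented plain-text output are assumptions made for the example. Because every content model has its own element, one handler per element type is enough, and supporting IMAGE means adding a single table entry.

```python
import xml.dom.minidom

def element_children(node):
    """Return the element children of a node, skipping whitespace text."""
    return [c for c in node.childNodes if c.nodeType == c.ELEMENT_NODE]

def render_children(node, depth):
    """Render all element children at the given indentation depth."""
    lines = []
    for child in element_children(node):
        lines.extend(render(child, depth))
    return lines

def render_section(node, depth):
    return [node.getAttribute("TITLE")] + render_children(node, depth)

def render_hbox(node, depth):
    text = "".join(c.data for c in node.childNodes
                   if c.nodeType == c.TEXT_NODE)
    return ["  " * depth + text.strip()]

def render_list(node, depth):
    # List items are indented one level deeper than the list itself.
    return render_children(node, depth + 1)

def render_image(node, depth):
    return ["  " * depth + "[image: " + node.getAttribute("FILE") + "]"]

# One entry per element type (hypothetical dispatch table).
HANDLERS = {
    "SECTION": render_section,
    "VBOX": render_children,
    "HBOX": render_hbox,
    "LIST": render_list,
    "ITEM": render_children,
    "IMAGE": render_image,  # the only change needed to support images
}

def render(node, depth=0):
    return HANDLERS[node.tagName](node, depth)

DOC = """<SECTION TITLE="Some Section">
 <VBOX>
  <HBOX>Some text.</HBOX>
  <LIST>
   <ITEM><VBOX><HBOX>first item</HBOX><IMAGE FILE="a.png"/></VBOX></ITEM>
  </LIST>
 </VBOX>
</SECTION>"""

root = xml.dom.minidom.parseString(DOC).documentElement
print("\n".join(render(root)))
```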

Author-Oriented Design

Let's look at an XML equivalent of the short HTML form above.

<BODY>
 <H1>Some Section</H1>
 This is some horizontal material.
 <UL>
  <LI>This is the first list item.</LI>
  <LI>A nested list follows in this second list item.
   <UL>
    <LI>first item in nested list</LI>
    <LI>second item in nested list</LI>
   </UL></LI>
 </UL>
</BODY>

We had to introduce a new BODY element to hold things together. The DTD could look like this (this is not the actual XHTML DTD, though):

<!ELEMENT BODY (#PCDATA | H1 | UL)*>
<!ELEMENT H1 (#PCDATA)>
<!ELEMENT UL (LI+)>
<!ELEMENT LI (#PCDATA | H1 | UL)*>

You can clearly see the redundancy. From a DTD maintainer's perspective, it can be eliminated using parameter entities:

<!ENTITY % vmaterial "(#PCDATA | H1 | UL)*">
<!ELEMENT BODY %vmaterial;>
<!ELEMENT H1 (#PCDATA)>
<!ELEMENT UL (LI+)>
<!ELEMENT LI %vmaterial;>

However, the redundancy is still there if we look at the document structure. Entities are just a form of preprocessing; they do not introduce additional structure. This becomes clear if we try to add an image element:

<!ENTITY % vmaterial "(#PCDATA | H1 | UL | IMG)*">
<!ELEMENT BODY %vmaterial;>
<!ELEMENT H1 (#PCDATA)>
<!ELEMENT UL (LI+)>
<!ELEMENT LI %vmaterial;>
<!ELEMENT IMG EMPTY>
<!ATTLIST IMG SRC CDATA #REQUIRED>

In this case, the code that processes BODY elements and the code that processes LI elements would both have to be updated, even though the DTD changed in only one place.

The Best of Both Worlds

For our own application, we decided to aim for the best of both worlds. We employ an author-oriented DTD design, but follow a few conventions in our processing tools to avoid problems. Basically, we ensure that the problematic %vmaterial; entity (and likewise the %hmaterial; entity) appears in the transforming code as if it were an implicit element. So we have not only subroutines that analyze BODY and LI elements, but also an analyzer for (#PCDATA | H1 | UL | IMG)* (that is, for %vmaterial;).
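
This convention can be sketched as follows. The sketch is a hypothetical Python illustration, not the actual transformer used for this site; the function names and the (kind, payload) result format are assumptions made for the example. The point is that analyze_vmaterial is the only function that knows which alternatives the content model allows, so adding IMG touches one place.

```python
import xml.dom.minidom

def text_content(node):
    """Concatenate the character data directly inside a node."""
    return "".join(c.data for c in node.childNodes
                   if c.nodeType == c.TEXT_NODE)

def analyze_vmaterial(node):
    """Analyze the content of any element declared as %vmaterial;.

    Returns a list of (kind, payload) pairs (hypothetical format).
    """
    result = []
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            text = " ".join(child.data.split())  # normalize whitespace
            if text:
                result.append(("text", text))
        elif child.nodeType == child.ELEMENT_NODE:
            if child.tagName == "H1":
                result.append(("heading", text_content(child)))
            elif child.tagName == "UL":
                result.append(("list", analyze_ul(child)))
            elif child.tagName == "IMG":
                result.append(("image", child.getAttribute("SRC")))
            else:
                raise ValueError("unexpected element: " + child.tagName)
    return result

# BODY and LI share the content model, so both delegate to one analyzer.
def analyze_body(node):
    return analyze_vmaterial(node)

def analyze_li(node):
    return analyze_vmaterial(node)

def analyze_ul(node):
    return [analyze_li(c) for c in node.childNodes
            if c.nodeType == c.ELEMENT_NODE]

DOC = """<BODY>
 <H1>Some Section</H1>
 This is some horizontal material.
 <UL>
  <LI>first item <IMG SRC="a.png"/></LI>
 </UL>
</BODY>"""

tree = analyze_body(xml.dom.minidom.parseString(DOC).documentElement)
print(tree)
```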

If more standard processing tools were used, we could generate intermediate XML documents from the original source documents which describe the document structure as explicitly as possible (finally arriving at a compiler/assembler/linker processing pipeline). However, with our current toolset and the hand-written XML transformer (see XML Processing with DOM and Perl), this is not necessary.
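
For illustration, such an intermediate step could look roughly like the following Python sketch, which rewrites an author-oriented BODY document into the explicit SECTION/VBOX/HBOX form shown earlier. It is not part of our toolset. The function name make_explicit and the restriction to exactly one H1 per BODY (used as the section title) are assumptions made to keep the example short, and IMG is left out.

```python
import xml.dom.minidom

def make_explicit(body):
    """Translate a BODY element into an explicit SECTION document."""
    out = xml.dom.minidom.Document()

    def vbox(src, skip=None):
        # Make the implicit vertical-material content model explicit.
        box = out.createElement("VBOX")
        for child in src.childNodes:
            if child is skip:
                continue
            if child.nodeType == child.TEXT_NODE:
                text = " ".join(child.data.split())
                if text:
                    hbox = out.createElement("HBOX")
                    hbox.appendChild(out.createTextNode(text))
                    box.appendChild(hbox)
            elif child.nodeType == child.ELEMENT_NODE:
                if child.tagName != "UL":
                    raise ValueError("no explicit form for " + child.tagName)
                lst = out.createElement("LIST")
                for li in child.childNodes:
                    if li.nodeType == li.ELEMENT_NODE:
                        item = out.createElement("ITEM")
                        item.appendChild(vbox(li))
                        lst.appendChild(item)
                box.appendChild(lst)
        return box

    headings = [c for c in body.childNodes
                if c.nodeType == c.ELEMENT_NODE and c.tagName == "H1"]
    if len(headings) != 1:
        raise ValueError("this sketch expects exactly one H1 in BODY")
    title = "".join(c.data for c in headings[0].childNodes
                    if c.nodeType == c.TEXT_NODE)
    section = out.createElement("SECTION")
    section.setAttribute("TITLE", title)
    section.appendChild(vbox(body, skip=headings[0]))
    out.appendChild(section)
    return out

SOURCE = """<BODY>
 <H1>Some Section</H1>
 This is some horizontal material.
 <UL>
  <LI>This is the first list item.</LI>
  <LI>A nested list follows.
   <UL><LI>first item in nested list</LI></UL></LI>
 </UL>
</BODY>"""

body = xml.dom.minidom.parseString(SOURCE).documentElement
explicit = make_explicit(body).documentElement.toxml()
print(explicit)
```

Later pipeline stages would then only ever see the explicit form, at the cost of an extra document generation step.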

Leaving the XML Domain (Almost) Completely

Sometimes, it is preferable to sacrifice even more of the XML structure. For example, consider dates. Would you rather write

2003-07-31

or

<DATE YEAR="2003" MONTH="7" DAY="31"/>

or even the following?

<DATE>
  <YEAR>2003</YEAR>
  <MONTH>7</MONTH>
  <DAY>31</DAY>
</DATE>

Several technologies in the XML environment (for example, XML Schema <http://www.w3.org/XML/Schema>) have started to promote the simple 2003-07-31 variant. Although this means that you again have to write lexical analyzers and parsers (at least temporarily), this is hard to avoid if you want readable source documents.
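
For a format this simple, the additional parser is small. The following Python sketch shows one way to do it; the name parse_date and the decision to accept only the exact YYYY-MM-DD form are assumptions made for the example, not a description of any particular tool.

```python
import datetime
import re

# Exactly four, two, and two ASCII digits, separated by hyphens.
DATE_RE = re.compile(r"([0-9]{4})-([0-9]{2})-([0-9]{2})")

def parse_date(text):
    """Parse the character data of a date element written as YYYY-MM-DD."""
    match = DATE_RE.fullmatch(text.strip())
    if match is None:
        raise ValueError("not a YYYY-MM-DD date: %r" % text)
    year, month, day = (int(group) for group in match.groups())
    # datetime.date raises ValueError for impossible dates such as 2003-02-30.
    return datetime.date(year, month, day)

print(parse_date("2003-07-31"))
```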


Florian Weimer