An Approach to DTDs and Namespaces
Introduction
JUMBO now implements a simple but powerful approach to DTDs and namespaces,
intended to follow both the spirit and letter of XML. This has been implemented
in an imminent new snapshot of JUMBO (i.e. not vapourware) and feedback
is welcomed, including areas where my knowledge of SGML is shallow.
In textual applications (rendering of 'human-readable' documents on
paper or screen as in current (1997) browsers) stylesheets are the preferred
approach to 'display'. In this spirit JUMBO is tracking the XSL spec and
provides partial support at present. However, many applications are 'non-textual'
either because of the nature of their material (molecules, semantic maths,
structured graphics, etc.) or because of their structure (general graphs,
tables, etc.). In these cases a per-element approach is often valuable
and JUMBO currently provides support by linking element display to Java
classes. This leads to a simple model for namespaces which may also be
useful for some textual applications.
The merits and limitations of the XML DTD
Traditional management of XML documents is through the DTD which can provide
the following:
-
(a) A validatable structure for the children of an element (the 'content')
-
(b) A list of validatable typed attributes for an element, with optional
default values.
-
(c) Parameter entities for the maintenance of the DTD
-
(d) Character entities for inclusion of text strings in the document
-
(e) External text entities for inclusion in the document.
-
(f) NOTATIONs for non-XML objects
-
(g) An INCLUDE mechanism
Of these only (a) and (b) impinge on Namespaces and DTDs. JUMBO never
sees (c) (d) and (g), may never see (e) and its author has publicly demonstrated
that he does not understand (f) fully.
Content models
(a) allows a powerful definition of the potential structures for an element's
content. It is useful for validating static XML documents, for creating
new documents, and for editing or merging existing ones. JUMBO thinks it
is a Good Thing and will use it whenever possible (it awaits a publicly
available Java algorithm for content validation).
It has limitations in the following areas:
-
Included character data (#PCDATA) has only one allowed type (STRING).
-
Some desirable content models are not easy/possible to construct
The first is serious since almost all non-textual applications of XML (databases,
technical subjects, commerce and many more) use datatyping such as INTEGER,
DATE, FLOAT, etc. An example of the second is where occurrence counts of
children are substantial: 'FAMILYs with more than five and fewer than
eight CHILDren' have ugly content models. [By contrast XLL with NodeSets
could provide an elegant runtime validation: (ALL,FAMILY)CHILD(5,CHILD).NOT.(ALL,FAMILY)CHILD(9,CHILD)
but this is not followed further here.]
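To make the 'ugly content model' point concrete, here is a minimal Python sketch that writes out the DTD content model needed to bound an occurrence count. The function name bounded_content_model is hypothetical and is not part of JUMBO; the optional tail is nested so that the model stays deterministic.

```python
def bounded_content_model(child: str, minimum: int, maximum: int) -> str:
    """Build a DTD sequence allowing between minimum and maximum children."""
    if not (0 <= minimum <= maximum and maximum >= 1):
        raise ValueError("need 0 <= minimum <= maximum and maximum >= 1")
    # Nest the optional occurrences, e.g. (A,(A)?)?, so that each CHILD
    # start tag can only match one position in the model.
    tail = ""
    for _ in range(maximum - minimum):
        tail = f"({child},{tail})?" if tail else f"({child})?"
    parts = [child] * minimum + ([tail] if tail else [])
    return "(" + ",".join(parts) + ")"


# 'More than five and fewer than eight' CHILDren means six or seven.
family_model = bounded_content_model("CHILD", 6, 7)
print(f"<!ELEMENT FAMILY {family_model}>")
```

Even for this small range the declaration repeats CHILD seven times, and the model grows linearly with the upper bound.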
Therefore datatyping must be addressed at an early stage in authoring,
editing and processing XML documents, and a DTD-compatible solution is
discussed below.
Attribute validation
Attributes can be validated with respect to
-
their occurrence (#REQUIRED)
-
their type (ID, CDATA, ENTITY, NMTOKEN, NOTATION)
-
their value (if part of an enumeration)
Typing suffers from the same problems as content (above). I suspect
very few newcomers to XML will use anything other than CDATA for attributes
as they won't understand the point of the other XML types (except possibly
ID). [Personally I can see no reason for having IDREF since XLL is more
powerful. IDREF is a pain to implement and unless anyone convinces me otherwise
I shall not put it in JUMBO. I might transform IDREF="foo" to HREF="#ID(foo)"
which would do the same thing.]
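The IDREF-to-HREF transformation mentioned above could be sketched as follows in Python. This is only an illustration, not JUMBO code: it assumes the attribute is literally named IDREF (without a DTD we cannot know which attributes are of type IDREF), and the function name idref_to_href is hypothetical.

```python
import xml.etree.ElementTree as ET


def idref_to_href(root, attr="IDREF"):
    """Rewrite attr="foo" as HREF="#ID(foo)" on every element under root."""
    for element in root.iter():
        target = element.attrib.pop(attr, None)
        if target is not None:
            element.set("HREF", f"#ID({target})")
    return root


# Hypothetical sample document; the element names are illustrative only.
doc = ET.fromstring('<MOL ID="m1"><BOND IDREF="foo"/><ATOM ID="foo"/></MOL>')
idref_to_href(doc)
print(ET.tostring(doc, encoding="unicode"))
```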
Enumerations suffer from:
-
The restriction that the same tokens, such as (YES|NO), cannot be reused in
more than one enumeration on an element (the rationale being completely opaque
to newcomers to XML). I predict this will cause great confusion and
effectively mean that enumerations are little used.
-
The hardcoded nature of the enumerations, though good for validation, is
poorly suited to authoring support where other options may need to be added
in flexible DTDs.
It is perhaps worth noting that AFAICS no current Java XML parser provides
full support for extracting DTD-based information, and I suspect that this
is likely to be common among lightweight parsers. Without good support
the DTD may be in danger of atrophy, perhaps limited to a few ATTLISTs
in the internal subset. This paper (hopefully) adds new life.
The per-Element approach
In the existing DTD there are two types of information - document-wide,
and per-element. The latter covers (a) and (b) above and is the subject
of this document. I use 'per-element' to mean that an element can be completely
described and processed without knowledge of its position in a document.
IMO contextual information is best provided by XLL, with (hopefully) widely
agreed semantics (not discussed here).
The structure of an element as presently defined therefore breaks down
into:
<!ELEMENT element (contentspec, attlist*)>
<!ELEMENT contentspec (#PCDATA)>
<!ELEMENT attlist (#PCDATA)>
Since #PCDATA is a poor descriptor of structure, the first is better
expressed [3.2.1] as:
<!ELEMENT contentspec (#PCDATA|children|Mixed)> <!-- #PCDATA
is 'EMPTY' or 'ANY'-->
<!ELEMENT children (choice | seq)>
<!ATTLIST children
repeatable (YES|NO) #REQUIRED
optional (YUP|NOPE) #REQUIRED> <!-- I can't
use YES|NO again :-( -->
<!ELEMENT choice (cp)+>
<!ATTLIST choice
repeatable (YES|NO) #REQUIRED
optional (YUP|NOPE) #REQUIRED>
<!ELEMENT seq (cp)+>
<!ATTLIST seq
repeatable (YES|NO) #REQUIRED
optional (YUP|NOPE) #REQUIRED>
<!ELEMENT cp (Name|choice|seq)>
<!ATTLIST cp
repeatable (YES|NO) #REQUIRED
optional (YUP|NOPE) #REQUIRED>
<!ELEMENT Name (#PCDATA)>
<!ATTLIST Name
type (STRING|INTEGER|FLOAT|DATE|URL|HTML|OTHER)
"STRING">
<!ELEMENT Mixed (#PCDATA,Name*)>
JUMBO essentially implements this and can display DTD contentspecs as
trees. (It was written before the 971208 spec, so details differ. BTW the
spec is much clearer in this area, thanks.)
Similarly the ATTLIST structure [3.3] (again simpler) breaks down into:
<!ELEMENT attlist (AttDef)*> <!-- I would have expected (AttDef)+
-->
<!ELEMENT AttDef (Name, AttType, Default)> <!-- Name as above
-->
<!ELEMENT AttType (#PCDATA | EnumeratedType)> <!-- PCDATA
is 'CDATA', 'ID' etc. as from [55] and [56] -->
<!ELEMENT EnumeratedType (NotationType | Enumeration)>
<!ELEMENT NotationType (#PCDATA, Name*)> <!-- PCDATA is 'NOTATION'
-->
<!ELEMENT Enumeration (#PCDATA)> <!-- PCDATA is '(A|B|C|D)'
-->
This means that the ELEMENT and ATTLIST components of the DTD can be
isomorphically expressed by an XML document.
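As an illustration of the content-spec half of this mapping, the following Python sketch turns a conventional contentspec string into an XML tree using the element and attribute names from the listing above. It is a simplification under stated assumptions: the cp wrapper is omitted (its repeatable/optional attributes are placed directly on seq, choice and Name), and #PCDATA, EMPTY, ANY and Mixed content are not handled. All function names are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

TOKEN = re.compile(r"\s*([A-Za-z_:][\w.:\-]*|[(),|?*+])")
PUNCTUATION = ("(", ")", ",", "|", "?", "*", "+")


def tokenize(spec):
    """Split a contentspec such as '(ATOMS,BONDS?)' into tokens."""
    spec, tokens, pos = spec.strip(), [], 0
    while pos < len(spec):
        match = TOKEN.match(spec, pos)
        if match is None:
            raise ValueError(f"unexpected character {spec[pos]!r} at {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def parse_particle(tokens):
    """Consume one content particle from the front of the token list."""
    if not tokens:
        raise ValueError("unexpected end of contentspec")
    token = tokens.pop(0)
    if token == "(":
        members, separator = [parse_particle(tokens)], None
        while tokens and tokens[0] in (",", "|"):
            sep = tokens.pop(0)
            if separator not in (None, sep):
                raise ValueError("cannot mix ',' and '|' in one group")
            separator = sep
            members.append(parse_particle(tokens))
        if not tokens or tokens.pop(0) != ")":
            raise ValueError("missing ')'")
        node = ET.Element("choice" if separator == "|" else "seq")
        node.extend(members)
    elif token in PUNCTUATION:
        raise ValueError(f"unexpected token {token!r}")
    else:
        node = ET.Element("Name")
        node.text = token
    # An optional occurrence indicator follows the particle.
    indicator = tokens.pop(0) if tokens and tokens[0] in ("?", "*", "+") else ""
    node.set("repeatable", "YES" if indicator in ("*", "+") else "NO")
    node.set("optional", "YUP" if indicator in ("?", "*") else "NOPE")
    return node


def contentspec_to_xml(spec):
    """Wrap the parsed particle in a <contentspec> element."""
    tokens = tokenize(spec)
    root = ET.Element("contentspec")
    root.append(parse_particle(tokens))
    if tokens:
        raise ValueError(f"trailing tokens: {tokens}")
    return root


print(ET.tostring(contentspec_to_xml("(ATOMS,BONDS?)"), encoding="unicode"))
```

Once the contentspec is a tree like this, ordinary XML tools can browse and edit it, which is the point made in the list of advantages that follows.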
This is not a world-shattering discovery, and it has been made by many
people. The key point, however, is that it is far more powerful than
the conventional BNF-like DTD. There are several advantages, and the only
disadvantage is that it is not formally supported and encouraged by the
XML spec, which requires the DTD to use a different language from XML (but
even this hurdle can be overcome - see later). The advantages of the XML-DTD
are:
-
Additional ELEMENTS can be added to the XML-DTD, particularly documentation
and Typing (serious problems with current DTDs). For example, JUMBO adds
optional HELP and TYPE elements in the content of ELEMENT.
-
Current XML technology can manage XML-DTDs without additional coding. Through
this approach JUMBO is now a DTD-editor and DTD-browser as well as a document
browser.
-
Though nothing has been decided, it would not be a major surprise if future
Namespace mechanisms used XML as their syntax. JUMBO anticipates this below.
-
XML provides powerful mechanisms for distributed documents (entities, XLL,
etc.) so that complex or flexible distributed DTDs can easily be created.
Moreover, DTDs can be assembled on a just-in-time basis for multi-namespace
documents (see below).
-
Newcomers to XML will understand and appreciate this in a way that they
will not with current DTDs.
-
Additional validating semantics can be added through several XML-based
mechanisms (XLL, XSL or hardcoded applications).
-
The application can be directly referenced in the XML-DTD. Thus JUMBO includes
optional <JAVA> elements for ELEMENTs so they can be linked to class
libraries for display, processing and powerful validation (e.g. DATE.class).
Namespaces
In their simplest form, namespaces are simply a restatement of DTDs - each
DTD describes a namespace. JUMBO honours this. In full SGML the SUBDOC
facility allowed multiple DTDs (= namespaces) but XML does not have this.
It is confidently assumed that XML will use namespaces instead and at least
one proposal (XML-data) has already been published. That proposal had many
useful features, but the underlying relational and inheritance structure
would be too complex for JUMBO to implement at present. This current document
proposes a much simpler approach which hopefully is very close to the core
of any namespace proposal. It does not quote or rely on any non-public
material or discussion.
All that is so far publicly given and used is:
-
The colon ':' is reserved for namespace experiments with Names [2.3 Note]
-
Namespaces are identified by syntax something like: <?xml:namespace
href="someUrl" as="foo" ?> (cf. RDF proposal).
-
Names *might* be of the form: foo:bar. (RDF proposal, and XML PT&T)
-
'foo' provides a handle to identify that foo:bar belongs to the namespace
'foo'
-
'someUrl' has no agreed meaning. It *might* be the address of something
containing useful information. RDF-971002 calls this a *schema* and I shall
use this term. RDF ideas for the schema are touched upon in [2.1] of http://www.w3.org/TR/WD-rdf-syntax-970012.
The syntax and logic are yet to be decided.
-
There is no agreed mechanism whereby multiple DTDs or namespaces can be
used within a single XML document.
The JUMBO Approach
(Almost everything described here is implemented in a working prototype,
so should be seen as generally feasible. Some details (e.g. the syntax
of 'xml:namespace', capitalisation of HREF) are tentative (but work). Some
terminology may also be obsoleted by
All documents are in XML. This makes parsing, processing, display, editing,
and everything a lot easier than having multiple syntaxes.
An XML document instance may have zero, one, or many PIs of the form:
<?xml:namespace href="some/where.xml" as="foo" ?>
<?xml:namespace href="else/where.xml" as="bar" ?>
JUMBO does NOT regard the address of the URL as important since it could
be relative or absolute. Only the contents matter.
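A minimal Python sketch of collecting these PIs is given below. It uses a regular expression rather than a full XML parser, which is an assumption made for brevity and not how JUMBO does it; the function name namespace_declarations is hypothetical, and the PI syntax is the tentative one shown above.

```python
import re

PI_PATTERN = re.compile(r"<\?xml:namespace\s+(.*?)\?>", re.DOTALL)
ATTR_PATTERN = re.compile(r"""([\w:.-]+)\s*=\s*(["'])(.*?)\2""")


def namespace_declarations(text):
    """Return {prefix: href} for every xml:namespace PI found in text."""
    declarations = {}
    for body in PI_PATTERN.findall(text):
        # Each match is (name, quote character, value).
        attrs = {name: value for name, _, value in ATTR_PATTERN.findall(body)}
        if "href" in attrs and "as" in attrs:
            declarations[attrs["as"]] = attrs["href"]
    return declarations


document = """<?xml:namespace href="some/where.xml" as="foo" ?>
<?xml:namespace href="else/where.xml" as="bar" ?>
<foo:DOC><bar:ITEM/></foo:DOC>"""
print(namespace_declarations(document))
```

A document with no such PIs simply yields an empty mapping, matching the 'zero, one, or many' wording above.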
The Namespace schema
The namespace schema identifies the elementTypes in that schema. The proposed
format is something like:
<NAMESPACE FPI="-//CML//DTD Version 1.2 EN//" PREFIX="CML">
<ELEMENT TYPE="MOL">
<SCHEMA>jumbo/cml/MOLNode</SCHEMA>
</ELEMENT>
<ELEMENT TYPE="ATOMS">
<SCHEMA>jumbo/cml/ATOMSNode.xml</SCHEMA>
</ELEMENT>
</NAMESPACE>
Minor details could be that TYPE was a child ELEMENT, etc.
(Because of the power of Xpointers to abstract subcomponents, much other
information can be added to the schema without causing problems. Essentially
it is transparent to the current process. Examples would be metadata, display
characteristics of namespaces, etc. For example, JUMBO can make buttons
different colours :-)
The schema points to per-element schemas for each element in the namespace
(only two are shown above). The addresses are URLs and can be relative
or absolute. MOL belongs to the CML namespace and would normally appear
in a document as <CML:MOL>. However, the implementation also allows
for a single namespace with no prefix ('MOL' with as=""). If two competing
namespaces collide (e.g. each uses the prefix "CML"), a namespace file
can easily be constructed with a different PREFIX.
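The lookup from a prefixed element name to its per-element schema could be sketched like this in Python. The function name element_schemas and the prefix override parameter are assumptions for illustration; the schema text is copied from the listing above.

```python
import xml.etree.ElementTree as ET

NAMESPACE_SCHEMA = """<NAMESPACE FPI="-//CML//DTD Version 1.2 EN//" PREFIX="CML">
  <ELEMENT TYPE="MOL"><SCHEMA>jumbo/cml/MOLNode</SCHEMA></ELEMENT>
  <ELEMENT TYPE="ATOMS"><SCHEMA>jumbo/cml/ATOMSNode.xml</SCHEMA></ELEMENT>
</NAMESPACE>"""


def element_schemas(namespace_xml, prefix=None):
    """Map qualified element names to per-element schema addresses.

    prefix overrides the PREFIX attribute; an empty string gives
    unprefixed names for the single-namespace case.
    """
    root = ET.fromstring(namespace_xml)
    if prefix is None:
        prefix = root.get("PREFIX", "")
    table = {}
    for element in root.findall("ELEMENT"):
        name = element.get("TYPE")
        qualified = f"{prefix}:{name}" if prefix else name
        table[qualified] = element.findtext("SCHEMA", default="").strip()
    return table


print(element_schemas(NAMESPACE_SCHEMA))
# Resolving a prefix clash by choosing a different prefix at load time:
print(element_schemas(NAMESPACE_SCHEMA, prefix="CML2"))
```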
Per-element Schemas
Each element is described by WF XML of the form below. This could be a
file-per-element (which is what JUMBO does) or could use Xpointers into
a single file. Note that the ELEMENTs do NOT have a hardcoded PREFIX. The
ELEMENTs could have the XML structure outlined above or they could contain
PCDATA representations of conventional DTDs. JUMBO supports the latter
(and will support the former) so that the result is like:
<ELEMENT TYPE="MOL">
<NAMESPACE FPI="-//CML//DTD Version 1.2 EN//"/>
<CONTENTSPEC>(ATOMS,BONDS?)</CONTENTSPEC>
<ATTLIST>BUILTIN CDATA #IMPLIED</ATTLIST>
<ATTLIST>ID ID #REQUIRED</ATTLIST>
</ELEMENT>
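Because the CONTENTSPEC and ATTLIST children hold PCDATA in conventional DTD syntax, the ordinary declarations can be regenerated mechanically. A minimal Python sketch, with the hypothetical function name to_dtd and a default of ANY when no CONTENTSPEC is present (my assumption, not something stated above):

```python
import xml.etree.ElementTree as ET

MOL_SCHEMA = """<ELEMENT TYPE="MOL">
  <NAMESPACE FPI="-//CML//DTD Version 1.2 EN//"/>
  <CONTENTSPEC>(ATOMS,BONDS?)</CONTENTSPEC>
  <ATTLIST>BUILTIN CDATA #IMPLIED</ATTLIST>
  <ATTLIST>ID ID #REQUIRED</ATTLIST>
</ELEMENT>"""


def to_dtd(element_xml):
    """Turn a per-element schema into conventional DTD declarations."""
    element = ET.fromstring(element_xml)
    name = element.get("TYPE")
    content = element.findtext("CONTENTSPEC", default="ANY").strip()
    lines = [f"<!ELEMENT {name} {content}>"]
    for attlist in element.findall("ATTLIST"):
        definition = (attlist.text or "").strip()
        lines.append(f"<!ATTLIST {name} {definition}>")
    return "\n".join(lines)


print(to_dtd(MOL_SCHEMA))
```

Since the per-element schema carries no hardcoded prefix, a prefixed variant would only need the name to be qualified before the declarations are written out.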
A great advantage of this is that many other non-DTD element-related
materials can be inserted as in:
<ELEMENT TYPE="MOL">
<!-- ...content and attributes ... -->
<HELPURL>MOLNodeHelp.xml</HELPURL>
<JAVA>jumbo.cml.MOLNode.class</JAVA>
<ICONURL MIME="image/gif">../icons/mol.gif</ICONURL>
<STYLESHEET>http://www.chem.soc/stylesheet.xsl</STYLESHEET>
</ELEMENT>
One obvious attraction is that the elements can easily be used in more
than one application. JUMBO is starting to do this for simple elements
like jumbo.tecml.INTEGERNode. JUMBO implements a set of fundamental
ELEMENTs for basic data types:
jumbo.tecml.INTEGERNode;
jumbo.tecml.FLOATNode;
jumbo.tecml.DATENode;
jumbo.tecml.STRINGNode;
jumbo.tecml.URLNode;
jumbo.tecml.HTMLNode;
These are only validated as PCDATA at parser level, but JUMBO checks
formats and semantic validity (using java.* classes where possible). These
classes also implement functions like display(), edit(), isValid() for
use in renderers and authoring tools.
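The kind of format checking described here could look roughly like the following Python sketch. It is not a port of the jumbo.tecml classes: the accepted formats (plain decimal integers, simple decimal or exponent floats, dates as YYYY-MM-DD) and all function names are assumptions chosen for illustration.

```python
import re
from datetime import datetime

INTEGER_RE = re.compile(r"[+-]?[0-9]+")
FLOAT_RE = re.compile(r"[+-]?([0-9]+(\.[0-9]*)?|\.[0-9]+)([eE][+-]?[0-9]+)?")


def is_integer(text):
    return INTEGER_RE.fullmatch(text) is not None


def is_float(text):
    return FLOAT_RE.fullmatch(text) is not None


def is_date(text):
    # Assumed date format: year-month-day; impossible dates are rejected.
    try:
        datetime.strptime(text, "%Y-%m-%d")
    except ValueError:
        return False
    return True


VALIDATORS = {
    "INTEGER": is_integer,
    "FLOAT": is_float,
    "DATE": is_date,
    "STRING": lambda text: True,  # any PCDATA is a valid STRING
}


def is_valid(type_name, text):
    """Check the character content of an element against its data type."""
    validator = VALIDATORS.get(type_name)
    if validator is None:
        raise KeyError(f"no validator for type {type_name!r}")
    return validator(text.strip())


print(is_valid("DATE", "1997-12-08"), is_valid("DATE", "1997-02-30"))
```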
Mixing namespaces
In real applications it is highly likely that ELEMENTs can have content
from another DTD. For example, a MOL could contain an RDF:* identifying
the metadata for that molecule. The current validation procedure with a
fixed contentspec will therefore fail. The following are possible mechanisms:
-
do not validate at parser level (unfortunately for SGML adherents I expect
this to be almost universal among newcomers)
-
validate at application, or pre-application level with slightly relaxed
conditions.
-
compute the contentspec just-in-time from knowledge of which DTDs are involved.
As an example of the second we might include a reserved elementType (e.g.
XDEV:OTHER) which signified that an element from another namespace was
allowable here. (I'd actually prefer the spec to address this.). The validation
procedure could be modified to allow this to match any element NOT in the
namespace related to the content. The third is the most powerful and could
include constructions of the sort:
<ELEMENT>
<NAME>&namespace;:&name;</NAME>
<CONTENT>(&foo;|&bar;)</CONTENT>
</ELEMENT>
which effectively act as PEs and seem to have at least as much power,
if not more. By reasonable use of entities, any desired DTD can be created
just before parsing.
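The second mechanism (a reserved wildcard elementType) can be sketched in Python as follows. To keep it short the content model is reduced to a flat list of required particles, which is my simplification; the function names and the RDF:Description child are hypothetical, and XDEV:OTHER is the example name used above.

```python
WILDCARD = "XDEV:OTHER"


def prefix_of(name):
    """Return the namespace prefix of a name, or '' if it has none."""
    return name.split(":", 1)[0] if ":" in name else ""


def particle_matches(particle, child, home_prefix):
    """The wildcard matches any child NOT in the parent's own namespace."""
    if particle == WILDCARD:
        return prefix_of(child) != home_prefix
    return particle == child


def validate_sequence(particles, children, home_prefix):
    """Check children against a flat sequence of required particles."""
    return len(particles) == len(children) and all(
        particle_matches(particle, child, home_prefix)
        for particle, child in zip(particles, children)
    )


model = ["CML:ATOMS", "CML:BONDS", WILDCARD]
print(validate_sequence(model, ["CML:ATOMS", "CML:BONDS", "RDF:Description"], "CML"))
print(validate_sequence(model, ["CML:ATOMS", "CML:BONDS", "CML:MOL"], "CML"))
```

The first call accepts the foreign RDF child; the second rejects a CML element in the wildcard slot, since the wildcard is only meant for other namespaces.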
Summary
This mechanism has been tested on documents with 3 namespaces, all linked
to per-element schema files. It has greatly aided the creation of authoring
tools, which can now use the full contentspec and ATTLISTs. It will be
present in the next snapshot of JUMBO and comments before that time will
be valued and implemented if possible.
Peter Murray-Rust
peter@ursus.demon.co.uk