Copyright 1999 Sun Microsystems, Inc. All rights reserved.
This document is a Sun Microsystems draft document intended for public review and comment. It does not represent a product of, nor has it been approved by the W3C HTML Working Group or any other body and may be updated, replaced, or rendered obsolete by other documents at any time. It is inappropriate to use this document as reference material or to cite it as other than a "work in progress".
[Temporary text in this specification is offset in light blue like this passage. This represents an early working draft, and is still incomplete in many areas of the prose text. See the section on Availability for specific information on the DTD and associated files. Also see Appendix E. for revision information.]
Errors or omissions in this document should be reported to the author.
This document describes a modularized XML document type based on HTML 4.0, explains the modularization model used, provides guidelines for extending HTML using this model, and presents an example extension using the MathML DTD. It includes XML DTDs and all necessary support files.
Because concepts of DTD modularization are often misunderstood, many terms are defined in detail, and an explanatory section on modularization and a discussion of the HTML 4.0 document model are provided.
This document describes a modularization model for HTML 4.0 modified for compatibility with XML 1.0, including:
In this document XHTML (ie., "extensible HTML") is used to refer to this XML reformulation of HTML 4.0 rather than the W3C HTML Working Group code name Voyager, in order to reduce confusion with the products of the Working Group.
This document is organized as follows:
This document is the result of a requirements analysis based on document interoperability and application compatibility needs, experience with SGML and XML conversion issues, and much thought and discussion about the role of a modularized HTML DTD as a reference XML document type, a well-understood foundation for extension into other knowledge domains.
Following is the set of design goals used in preparation of this document:
This document is also informed by or related to other specifications and documents. Among these are some of note:
This document describes a compliant application of the W3C Recommendation Extensible Markup Language (XML) 1.0 [XML].
This document describes document types based on the W3C HTML 4.0 Recommendation [HTML40]. This includes XML versions of the HTML 4.0 Strict, Transitional, and Frameset DTDs. Where significant changes are necessary in the transformation to XML, these are noted in this document.
This document describes a document type extension incorporating the W3C MathML 1.0 Recommendation [MATHML]. Where changes are necessary, these are noted in this document. In particular, two element type names conflict between the two markup languages, and these conflicts must be resolved. Decisions regarding this matter will be made in cooperation with the W3C HTML Working Group and the authors of the MathML specification.
While informed by the ongoing work of the W3C HTML Working Group (of which the author is an active member), this document does not represent a W3C product, nor does it follow all decisions made by the W3C in its XML version of HTML 4.0, code-named "Voyager", as described in the Working Draft Reformulating HTML in XML [HTMLXML].
In particular, the element content models of this specification match those of the HTML 4.0 SGML DTD at a child level, which is not the case with the W3C product. The modularization model also differs greatly, as is described in detail below. Where a choice is made between matching the document model of HTML 4.0 vs. Voyager, this specification favors HTML 4.0. When this specification varies from both HTML 4.0 and Voyager, it will be noted.
Both this specification and Voyager adopt a minimalist approach in implementing XML markup, such as supplementing HTML with XML linking or other syntax. Both specifications add xml:lang and xml:space where appropriate; this specification fleshes out all XLink attributes on the HTML anchor element, whereas Voyager only adds the xml:link attribute itself.
This specification will track HTML 4.0 errata as they are addressed by the W3C HTML Working Group.
This specification implements all appropriate XLink attributes on the HTML anchor ("A") element in order to promote experimentation. Since the XLink specification [XLINK] is still a W3C Working Draft, users should follow the same precautions as in working with any work-in-progress. Because XLink is expected to be the de facto standard for linking in XML, this specification will track changes that affect its experimental usage.
This specification and its author owe a debt of gratitude to many industry experts. They have published a number of excellent online and printed sources of information on SGML, HTML and XML, but few specifically on document type design. A primary resource used in development of this document is the text by Eve Maler and Jeanne El Andaloussi, Developing SGML DTDs: From Text to Model to Markup [DEVDTD]. Eve Maler and Terry Allen are the principal editors and maintainers of the DocBook 3.0 DTD, which provides an industry benchmark in the development and use of a modularized DTD. Helpful in this regard is the Customizer's Guide to the DocBook DTD V2.4.1 (see [DOCBOOK]), by Eve Maler and Terry Allen.
While some terms are defined in place, the following definitions are used throughout this document. Readers may not fully understand these definitions without also reading through the specification. Familiarity with the W3C XML 1.0 Recommendation [XML] is highly recommended.
This specification and the formal XML declarations for XHTML 1.0 described herein are protected by copyrights held by the IETF, the W3C, and Sun Microsystems.
Permission to use, copy, modify and distribute the XHTML 1.0 DTD and its accompanying documentation for any purpose and without fee is hereby granted in perpetuity, provided that all copyright notices found in the original documents appear in all copies. The copyright holders make no representation about the suitability of the DTD for any purpose. It is provided "as is" without expressed or implied warranty.
Please also note the Status of this Document section above.
The XHTML 1.0 Strict DTD is complete and valid, and mirrors the element content models of HTML 4.0 Strict at a child level. The XHTML 1.0 Transitional DTD is complete and valid, and apart from desired differences, matches HTML 4.0 Transitional suitably for an XML transformation. The XHTML 1.0 Frameset DTD is basically complete, although prose text describing it does not exist. The MathML DTD awaits resolution of naming conflicts on one attribute and two elements.
The gamut of element types and attributes found in a document type are often described in object-oriented terms, such as "classes" and "subclasses", "global" and "local" scoping, "inheritance", etc. Despite the common use of such buzzwords, SGML and XML describe markup languages, and as such, these terms are somewhat misplaced.
What is recognized is that document models usually contain informational constructs that may be grouped into common categories. These are categories of what Eve Maler and Jeanne El Andaloussi, in Developing SGML DTDs: From Text to Model to Markup [DEVDTD], refer to as semantic components. These distinct "units of specification" represent containers for distinct types or classes of information (ie., data or human knowledge). So while the markup itself is not object-oriented, the classifications of content it describes may be. Use of the terms "class" and "subclass" in this document therefore refers more to "classification" and "subclassification" respectively. By modeling the document type definition on these common categories of semantic components, the commonalities and divisions of the document type allow a modularization model to be created.
Maler and El Andaloussi further describe a division of semantic components into three categories. The following descriptions borrow heavily from Section 18.104.22.168 Recognizing Content, Structure, and Presentation [DEVDTD]:
When a document model is fairly simple, design and delivery constraints usually don't warrant modularizing the markup model. But when the document model is complex, when network constraints warrant, when a customization is desired without modification of the reference DTD, or perhaps even when components of the DTD are delivered from different locations, breaking the DTD into fragments or modules is a good solution. Perhaps the best reason is to help implement the inherent structures of the document model in the markup model (since neither SGML nor XML has any such features), which helps in design, maintenance and documentation.
Also, software applications are often designed based on common modules. For example, programming code already exists for rendering CALS or HTML tables, so use of an existing module may lead to a general improvement in interoperability, documentation, and understanding within a user community.
At first glance the added overhead and syntax complexity associated with modularization may seem daunting, but many years of industry experience would suggest that its benefits usually outweigh the costs.
While it is convenient to categorize information, in practice such categorization must be considered carefully, as the same information can often be marked up in different ways, depending on the intention and processing expectations associated with the content. For example, since HTML is strongly presentational, whether an element type is content-based or presentational is often less important than whether it is a block or inline element. In some markup languages, presentation issues are minimal, left entirely to stylesheets, or even completely absent.
So if one were to divide a document model into parts, the dividing lines would occur at the semantic component level, and be implemented in the markup model by creating DTD fragments, or modules.
Modules are often used to encompass the markup declarations of a specific semantic component or "feature", from higher-level components like tables and forms to lower-level ones like specific elements or element groups. Modules can contain modules, creating a hierarchical structure mirroring the document model. Modules are abstract structures, so they can be implemented in various ways, such as a simple designation using comment delimiters, using marked sections, or using entities (see Use of Marked Sections, Files vs. Modules below). Many such methods have been employed by the markup community over the years, particularly where there is an expectation that the DTD may be commonly modified or used as a source of DTD fragments (such as the TEI DTD [TEI]), and studying existing DTDs often yields many ways of solving complex problems.
While the idea of "plug and play" with DTD modules is very attractive, in practice this isn't quite so simple. Because complex document models often resort to classification of semantic components to facilitate understanding, markup reuse, extensibility and maintenance (through use of parameter entities), seldom are DTD modules completely self-contained, so there is usually a fair amount of "rewiring" involved in adding or removing a DTD module. A compromise must be made between ease of maintenance or extensibility and complexity of the DTD, and this is where good design of the modularization model (and good documentation) can make all the difference.
Some of the expressive power of SGML DTDs useful for this classification (eg., name groups) is unavailable in the simplified syntax of XML, but many other markup features (and well-written documentation) can go a long way in creating a straightforward and effective modularization.
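As a sketch of what the loss of name groups means in practice: an SGML DTD can declare several element types at once by placing a name group, such as (H1|H2|H3|H4|H5|H6), in a single element declaration. XML permits only one element type name per declaration, so the declaration must be repeated. In the fragment below, the parameter entity name Heading.content is hypothetical, and %Inline.mix; is assumed to have been declared earlier in an external DTD entity. Holding the shared content model in one parameter entity is one way to keep the repeated declarations in step.

```xml
<!-- Hypothetical: one shared content model for all six headings.
     Assumes %Inline.mix; has already been declared. -->
<!ENTITY % Heading.content "( #PCDATA | %Inline.mix; )*" >

<!-- XML allows only one element type name per declaration -->
<!ELEMENT H1 %Heading.content; >
<!ELEMENT H2 %Heading.content; >
<!ELEMENT H3 %Heading.content; >
<!ELEMENT H4 %Heading.content; >
<!ELEMENT H5 %Heading.content; >
<!ELEMENT H6 %Heading.content; >
```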
DTDs are written for humans as much as for machines, and in fact act as an interface between structures of human information and a machine representation. Some representations are more explicit than others, some rely merely on human understanding and do not impact processing directly. If a DTD makes sense to its intended audience and represents an appropriate modelling of its document type (being as loose or restrictive as necessary for its intended application), then it is a success.
To this end, parameter entities are often used to represent various structures. These are described below in more detail.
[creation of PEs for use as common element classes; eg., %flow;, %block; and %inline; in HTML; use of various naming conventions for parameter entities.]
When a semantic component represents a specific data or content type, its representation within the DTD may be handled by creating a parameter entity as a content-type label. The HTML 4.0 DTD uses such parameter entities extensively. This can be seen clearly in the attribute list declaration for the anchor element below:
<!ATTLIST A
  %attrs;                              -- %coreattrs, %i18n, %events --
  charset     %Charset;      #IMPLIED  -- char encoding of linked resource --
  type        %ContentType;  #IMPLIED  -- advisory content type --
  name        CDATA          #IMPLIED  -- named link end --
  href        %URI;          #IMPLIED  -- URI for linked resource --
  hreflang    %LanguageCode; #IMPLIED  -- language code --
  rel         %LinkTypes;    #IMPLIED  -- forward link types --
  rev         %LinkTypes;    #IMPLIED  -- reverse link types --
  accesskey   %Character;    #IMPLIED  -- accessibility key character --
  shape       %Shape;        rect      -- for use with client-side image maps --
  coords      %Coords;       #IMPLIED  -- for use with client-side image maps --
  tabindex    NUMBER         #IMPLIED  -- position in tabbing order --
  onfocus     %Script;       #IMPLIED  -- the element got the focus --
  onblur      %Script;       #IMPLIED  -- the element lost the focus --
  >
All of the above attribute types (%Charset;, %ContentType;, etc.) resolve to "CDATA", "NMTOKEN", or "NMTOKENS", all essentially unparsed string containers.
As a specific example, within the DTD itself a URI container is just another string container, but it obviously has a specific meaning to humans and applications. In HTML 4.0, URIs are represented within the DTD by the parameter entity %URI;, and declared within the DTD as:
<!ENTITY % URI "CDATA" -- a Uniform Resource Identifier, see [URI] -->
Because the syntax of comments has been simplified in XML to those found only in comment declarations, the above comment must be rewritten. Since the comment is no longer contained within the declaration, it is common practice for comments to precede their declarations, such as:
<!-- a Uniform Resource Identifier, see [URI] -->
<!ENTITY % URI "CDATA" >
While use of such content-type parameter entities doesn't impact the document model, they can be valuable in making a DTD easier for both authors and application developers to understand.
[creation of PEs for use as common element classes; eg., %flow;, %block; and %inline; in HTML; use of various naming conventions for parameter entities.]
<!-- %Inline.mix; includes all inline elements -->
<!ENTITY % Inline.mix
     "%Inlpres.class; | %Inlphras.class; | %Inlspecial.mix; | %Formctrl.class;" >

<!-- %Block.mix; includes all block elements -->
<!ENTITY % Block.mix
     "%Blkpres.class; | %Blkphras.class; | %Blkspecial.mix;" >

<!-- %Flow.mix; includes all text content, block and inline -->
<!ENTITY % Flow.mix
     "%Heading.class; | %List.class; | %Block.mix; | %Inline.mix;" >
[creation of PEs for use as common attribute classes] While there is no feature in XML for "global" attributes (ie., an attribute that applies to all element types), parameter entities may be used to create classes of attribute type specifications that may be reused within the DTD.
Here's an example from the DTD:
<!ENTITY % Core.attrib
     "id          ID                #IMPLIED
      class       CDATA             #IMPLIED
      style       %StyleSheet;      #IMPLIED
      title       %Text;            #IMPLIED" >

<!ENTITY % I18n.attrib
     "lang        %LanguageCode;    #IMPLIED
      xml:lang    %LanguageCode;    #IMPLIED
      dir         (ltr|rtl)         #IMPLIED" >
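As an illustration of how such attribute classes might then be reused, the sketch below references both parameter entities from a single attribute list declaration. The choice of BLOCKQUOTE and its cite attribute follows HTML 4.0; the exact form of the declaration in this DTD is an assumption and may differ. It also assumes that %URI; and the two attribute classes have already been declared.

```xml
<!-- Assumes %Core.attrib;, %I18n.attrib; and %URI; are already declared -->
<!ATTLIST BLOCKQUOTE
      %Core.attrib;
      %I18n.attrib;
      cite        %URI;             #IMPLIED
>
```

Every element type that references the same class picks up any later change to that class automatically, which is the practical substitute for "global" attributes.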
[the renaming of attribute classes (such as %Coreattrs; to %Core.attrib;) hasn't been implemented in this version, but is planned. Changes to %Core.attrib;, %I18n.attrib;, %Common.attrib;, %Alink.attrib;, %Events.attrib;, maybe others]
[Note the precedence order of declarations vs. redeclaration of variables in a programming language and show why this makes good sense. Discuss both external and internal DS.]
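XML 1.0 specifies that when the same entity is declared more than once, the first declaration encountered is binding and later ones are ignored. This is the reverse of variable reassignment in most programming languages, and it is what makes predeclaration work as a customization technique. The fragment below is a minimal sketch, assuming both declarations are read from external DTD entities and that the referenced class entities are already declared; the reduced content of the first declaration is purely hypothetical.

```xml
<!-- 1. In a customized driver, read BEFORE the module is referenced.
        This declaration is encountered first, so it is binding.
        (Hypothetical customization: phrasal and special inlines only.) -->
<!ENTITY % Inline.mix
     "%Inlphras.class; | %Inlspecial.mix;" >

<!-- 2. Later, inside the unmodified module. Because %Inline.mix; has
        already been declared, this default declaration is ignored. -->
<!ENTITY % Inline.mix
     "%Inlpres.class; | %Inlphras.class; | %Inlspecial.mix; | %Formctrl.class;" >
```

Because the internal subset is read before the external subset, the same effect can be had from a document's internal subset, within the limits on parameter entity references that apply there.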
[modularization changes via: module replacement, predeclaration, parameter entity replacement, module amendation (eg., later, using ATTLISTs), marked sections, etc.]
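One of the techniques listed above, amending a module with a later attribute list declaration, can be sketched as follows. XML 1.0 merges all ATTLIST declarations provided for an element type; if the same attribute is defined more than once, the first definition is binding. The attribute name example-role is hypothetical and is used only for illustration.

```xml
<!-- In an extension module, read after the base DTD modules.
     Adds one attribute to the anchor element without editing
     the module that originally declared it. -->
<!ATTLIST A
      example-role   NMTOKEN        #IMPLIED
>
```

Since the first definition of an attribute wins, an amendment of this kind can add attributes but cannot override ones the base module has already declared. Overriding requires the amendment to be read first.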
[use of marked sections to create module boundaries and 'switches']
<!-- Tables Module .................................... -->
<!ENTITY % XHTML1-table.module "INCLUDE" >
<![%XHTML1-table.module;[
<!ENTITY % XHTML1-table
     PUBLIC "-//Sun Microsystems//ELEMENTS XHTML 1.0 Tables//EN"
            "XHTML1-table.mod" >
%XHTML1-table;
]]>
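A minimal sketch of how such a switch might be thrown from a document instance: because the internal subset is read before the external subset, and the first declaration of an entity is binding, declaring %XHTML1-table.module; as "IGNORE" in the internal subset causes the marked section in the DTD to be skipped. The system identifier "XHTML1-s.dtd" is a hypothetical name for the Strict driver file.

```xml
<?xml version="1.0"?>
<!-- "XHTML1-s.dtd" is a hypothetical driver file name -->
<!DOCTYPE HTML SYSTEM "XHTML1-s.dtd" [
  <!-- Predeclare the switch: the DTD's own "INCLUDE" declaration
       is encountered later and is therefore ignored -->
  <!ENTITY % XHTML1-table.module "IGNORE" >
]>
```

With the switch set this way, the tables module entity is neither declared nor referenced. Any content model that still mentions table element types would need to be adjusted as well, so this is only the first step of removing a module.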
Note that a DTD module does not necessarily imply a separate file entity. For example, the DocBook 3.0 DTD is delivered as a single driver file, comprising about three hundred lines of code (about half of which are comments). The driver declares and instantiates four file "modules", which themselves are made up of over three hundred internal modules (ie., the entity boundaries are unimportant: the same DocBook DTD could be delivered as one, four, or three hundred files. See: DTD normalization.). After the parameter entities comprising the file modules have been instantiated, the DTD is over 7,600 lines long.
When network performance is an issue, decisions over how to deliver a DTD may come into play. When network bandwidth is limited or packet delivery overhead is high, delivery of a single file is faster than numerous small network accesses, but under some conditions (such as when delivering over an unreliable connection where redelivery is common) smaller files may be preferred. In either case, if the DTD is large, delivery may be a consideration in the entity design. But because DTDs are text files (and therefore even large DTDs are smaller than most GIF images on the Web), delivery performance is usually less of an issue; convenience and utility are greater factors.
[use of public ids and catalog files vs. system ids; mention URNs?]
[parameter entities allowed only where declarations may occur in internal subset and impact]
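The restriction can be illustrated with a small, self-contained document. In the internal subset a parameter entity reference may appear only where a complete markup declaration could appear, not inside one. The element type and entity names below are hypothetical.

```xml
<?xml version="1.0"?>
<!DOCTYPE note [
  <!ENTITY % note.decl "<!ELEMENT note (#PCDATA)>" >

  <!-- Allowed: the reference stands where a declaration could occur -->
  %note.decl;

  <!-- Not allowed in the internal subset (though allowed in an
       external entity): a reference inside a declaration, such as
       <!ELEMENT note (%note.content;)>                          -->
]>
<note>A reference between declarations is well-formed.</note>
```

The impact on modularization is that the internal subset can predeclare simple entities (such as INCLUDE/IGNORE switches) or reference whole modules, but content models and attribute lists built from parameter entities must live in external entities.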
[briefly describe relationship between SGML and XML, describe this section as a general issues list for those familiar with SGML or involved in document conversion...]
The introductory description of an HTML 4.0 document found in Section 7.1 of the W3C HTML 4.0 Recommendation [HTML40] is confusing and somewhat misleading. This may be an attempt to simplify the SGML terminology elaborated upon later in the specification, or perhaps account for markup minimization which in HTML allows much of the higher-level document model to be implied when absent from a document instance. Nevertheless, this deserves remedy, particularly when HTML is transformed into XML where such types of minimization are not allowed.
The three "parts" of an HTML document as described in the HTML 4.0 Recommendation:
The first item above is of course the DOCTYPE declaration, which represents part of the SGML prolog, corresponding to Production 22 of the XML 1.0 specification [XML]. The DOCTYPE declaration is not so much a "version label" as a declaration of the document element type name ("HTML"), followed by an external reference (in this case, a Formal Public Identifier) to an HTML DTD. For more information on external identifiers, see Section 4.2.2, External Entities, [XML].
NOTE: an XML prolog also includes the XML declaration (a special processing instruction) and optional miscellaneous content (processing instructions, comments and whitespace), but for purposes of this discussion this will be ignored. Also, for better compatibility with Web usage, XML further requires the external reference to include a Uniform Resource Identifier [URI].
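For illustration, the two forms of declaration might look as follows. The first uses the Formal Public Identifier for HTML 4.0 Strict; in SGML the system identifier may be omitted. The second shows the XML form, in which the public identifier must be followed by a system identifier. Both identifiers in the second declaration are hypothetical and are not the ones defined by this specification.

```xml
<!-- HTML 4.0 (SGML): document element type name, then a public identifier -->
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN">

<!-- XML: a system identifier (a URI) is required after the public identifier.
     Both identifiers below are hypothetical. -->
<!DOCTYPE HTML PUBLIC "-//Sun Microsystems//DTD XHTML 1.0 Strict//EN"
                      "XHTML1-s.dtd">
```

The two declarations are shown together only for comparison; a document instance contains exactly one.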
In current Web practice, the significance of the DOCTYPE declaration is almost nil. Mainstream HTML browsers ignore its presence and are unable to process any portion of the document prolog. XML 1.0 requires conformant applications to at least be able to parse the declaration and any internal subset. Validating parsers are expected to be able to instantiate and parse external references in the external and internal subsets. This will be elaborated further below.
Missing from the above list is the existence of the root or document element, which represents the outermost container for all document content. Another way of stating this is that the document element contains all content between the <HTML> start tag and </HTML> end tag, which serve as delimiters. The HTML document element has two required children, the HEAD and BODY elements respectively.
NOTE: Whereas HTML 4.0's markup minimization rules allow document authors to omit the tags for the HTML, HEAD and BODY elements (curiously, the only required element in HTML 4.0 is TITLE), they are nevertheless always implied (ie., actually present in the document model).
The HTML 4.0 specification describes the HEAD element as "declarative", containing information about the document. This document metadata is typically not rendered as document content, but strictly speaking, it is of course part of the HTML document.
Within the document HEAD, HTML prescribes no particular structure, merely an unordered container for the element types TITLE, BASE, SCRIPT, STYLE, META, LINK, and OBJECT. Of these element types, the document's TITLE element must occur once, and its optional BASE element may occur only once. The rest may occur zero or more times within the HEAD element, in any order.
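HTML 4.0 can express this unordered model compactly because SGML provides the "&" connector (TITLE & BASE?) and inclusion exceptions for the repeatable element types. XML provides neither, so an equivalent content model has to enumerate the permitted orderings. The following is a sketch of one deterministic way to do so; the parameter entity name Head-opts.mix is hypothetical, and the declaration actually used in this DTD may differ.

```xml
<!-- Hypothetical: the repeatable HEAD element types, zero or more -->
<!ENTITY % Head-opts.mix "( SCRIPT | STYLE | META | LINK | OBJECT )*" >

<!-- TITLE exactly once, BASE at most once, in either order,
     with repeatable element types allowed before, between and after -->
<!ELEMENT HEAD
     ( %Head-opts.mix;,
       ( ( TITLE, %Head-opts.mix;, ( BASE, %Head-opts.mix; )? )
       | ( BASE, %Head-opts.mix;, TITLE, %Head-opts.mix; ) ) ) >
```

The cost of this approach is readability: each additional "once only" element type multiplies the number of orderings that must be spelled out.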
Please refer to Section 7.4 of the HTML 4.0 specification [HTML40] for detailed descriptions of these element types.
Following the HEAD element is the BODY element, which contains all document content typically rendered by an HTML user agent. It is within the BODY element that most of the structure of an HTML document is found.
The naming scheme for many of HTML's elements is seemingly borrowed from the ISO 8879:1986 General Document DTD, Annex E of [SGML], such as BODY, H1 to H6, P, ADDRESS, and TITLE, and all of its list element types: DL, DT, DD, UL, OL, and LI.
The BODY element contains basically no higher-level structures such as chapters or sections. An HTML document consists of a shallow stream of elements, some requiring a slightly deeper structure. Unlike many other industry DTDs that prescribe deep structural nesting (DocBook inline elements commonly begin at a seventh or eighth nested level and may reach a depth of a dozen or more), HTML rarely requires more than two or three levels.
The DIV and SPAN elements allow for recursive containership, which could be used to create a deeper structure within an HTML document. However, because they are generic elements and optional, their use in this regard is rather limited.
Unlike HTML 3.2 [HTML32], which contains relatively unstructured content anywhere within BODY, Section 7.5.3 of the HTML 4.0 specification [HTML40] makes a clear distinction between two classes of element types: block and inline:
While the parameter entities that enabled this delineation existed in HTML 3.2, HTML 4.0 is more disciplined in constraining block and inline elements, although there are plenty of holes in both document models.
The flat structure of BODY contains six numbered headings, from H1 to H6. These, however, do not serve as nested structures but rather as a hierarchy of section titles for a nonexistent section structure. There is no enforcement of order or occurrence. The ISO/IEC 15445:1998 HyperText Markup Language (HTML) DTD [ISO-HTML] attempts to remedy this by creating a nested structure of implied B1 to B6 element types. Because XML does not allow for such markup minimization, this solution is unfortunately not available in an XML-based HTML document type.
[DIV, P, BR]
[SPAN, B, EM, STRONG, etc.]
[most inlines in Strict]
[most of those relegated to Transitional only]
[tables, forms, etc.]
In looking back over the description of how semantic components may be categorized (see Section 3), we must acknowledge that HTML, unlike many (or even most) existing markup languages, is strongly presentational. This has influenced the precedence given to the groupings of element types, favoring an early branching of "block" vs. "inline" over "phrasal" vs. "presentational". Upon analysis, HTML in practice differentiates more strongly between block and inline than between element types that represent a semantic distinction and those that are merely a presentational effect: people use HTML tags to achieve desired effects.
The semantic components of HTML are classified by the delineating categories below. (Note that because H1 through H6 act as heading titles, not nested containers, they are classified as block phrasal, not structural):
Giving priority to "block" vs. "inline" (as described above) we find the following result:
The DTD uses many parameter entities to create various classes of names, attribute declarations, etc. to further the modularity, reuse and understanding of its declarations.
The "Common Names", "Common Attributes" and "Document Model Hierarchies" modules are declared near the beginning of the DTD, enabling use of parameter entities within each of the DTD modules as encountered. These are called "preliminary declarations" below, and include common names, attributes, and also the long list of ISO character entities.
[description of Common Names]
Based on the parameter entity attribute class naming scheme, the set of parameter entities for XHTML attribute classes is as follows:
[Description of classes]
[Description of mixes]
[Description of contents]
(XHTML 1.0 Transitional elements in italic. Transitional modules use the same base name as the Strict version, but add "-t" to the module name; eg., "XHTML1-attribs.mod" changes to "XHTML1-attribs-t.mod".)
[description of preliminary declarations, including common names, attributes, leave content model to next section]
[description of content model module and role in declaring classes of elements]
[description of module declarations...]
In their short history, Web browsers have promoted a model of document delivery that makes little or no effort at checking the validity of documents. While it is beyond the scope of this specification to address these issues, it is implicitly understood that the value of a DTD is in its ability to check the structure of a document instance against a specific document model. Validation may occur during authoring, delivery or reception.
[introductory paragraph to following sections...]
Before beginning to discuss the specifics of HTML, an explanation of how objects may be referenced and instantiated is in order.
Strictly speaking, the means of associating a document with a DTD has been part of HTML since the beginning, but is rarely used.
[describe external identifiers...]
[Reference [CATALOG] as specification for catalog files.]
[describe packaging issues...]
[describe what a driver is and does...]
<!-- DTD for HTML 4.0 Strict --> ... <!-- end of strict DTD -->
Rather than deliver a modular DTD as separate files, in certain environments where multiple file accesses may be a burden on network resources, the entire DTD may be consolidated into one file through a normalization process, which expands all external parameter entity references. James Clark's freeware SGML toolkit [SP] includes the application spam that may serve as a normalization tool. Certain changes must be made in order to work correctly with XML files. These are described below.
% spam -p -p -c XHTML1.cat test-s.xml
The -c parameter is followed by a reference to the SGML catalog file for XHTML 1.0, which includes an SGMLDECL statement providing the parser with the correct SGML declaration for XML. The -p -p parameters direct the parser to expand all parameter entities.
Included with the distribution are several Unix ksh scripts. Each automatically creates a normalized version of a DTD upon invocation. These are named "_flat-s" (Strict), "_flat-t" (Transitional), and "_flat-f" (Frameset). Availability of SP on the host machine is required.
Note that this is unrelated to the proposed products of the W3C XML Fragment Working Group, or to SGML Open Technical Resolution 9601:1996 Fragment Interchange, both of which are concerned with interchange of fragments of document content, not the reuse of DTD fragments in composing variant or compound document types.
[discussion of subsets, extensions, etc.]
[describe the changes to XHTML1-model.mod, the catalog file, and the driver]
[more complex example of changes required to add MathML]
The optional XHTML1-arch.mod module includes declarations that enable XHTML to be used as a base architecture according to the Architectural Forms Definition Requirements (Annex A.3, ISO/IEC 10744, 2nd edition).
For more information on use of architectural forms, consult Part Four of David Megginson's Structuring XML Documents [STRUCTXML], or browse the HyTime web site at:
<!-- Architecture Base Declaration -->
<?IS10744 ArcBase html ?>

<!-- Architecture Notation Declaration -->
<!NOTATION html PUBLIC
     "-//Sun Microsystems//NOTATION AFDR ARCBASE XHTML 1.0//EN" >
To reduce the size of this document, the actual files composing the normative content of this specification have not been included inline, and are shown below as hypertext links. The entire package of files is available as either a tarred and gzipped archive or a zipped archive at:
NOTE: The DTD and associated files use file extensions such as .mod, .dtd, etc. and may not display correctly in all browsers. If you're having difficulty viewing the files, or are planning to use the DTD, it is recommended that you download the archive rather than the individual files. Please note the current status of the DTD.
[Reference [CATALOG] as specification for catalog files.]
[describe and reference section above on DTD normalization...]
Below are links to normalized, "single-file" versions of the XHTML 1.0 Strict, Transitional, and Frameset DTDs. They are identical in function to the modular DTDs but have been normalized using James Clark's spam application, part of the SP toolkit.
The following element type content model changes have been made in transforming HTML 4.0 to XML:
The following attribute changes have been made in transforming HTML 4.0 to XML:
|Element Type||HTML 3.2||HTML 4.0||XHTML 1.0||Voyager|
Additionally, the %i18n; attribute class has been augmented by the xml:lang attribute, affecting all element types that include this parameter entity in their attribute definition list declarations.
Other changes made in transforming HTML 4.0 to XML:
[remove unused references upon completion of document...]
The following have contributed to this document:
Revisions to this draft:
|1999-01-29||Restructured some block and inline element types: created new modules for 'b.1 block structural' and 'c.1 inline structural', renumbering in the draft and DTD comments accordingly (see DTD modules for specifics).|
|1999-02-01||Implemented changes required for HTML 4.0 errata. Fixed some bugs in content models and finished most testing of Strict and Transitional DTDs. Added minor notes about normalization.|