XML Modularization of HTML 4.0

Sun Microsystems Note - 2 February 1999

Copyright 1999 Sun Microsystems, Inc. All rights reserved.

Murray Altheim, Sun Microsystems


Status of this Document

This document is a Sun Microsystems draft document intended for public review and comment. It does not represent a product of, nor has it been approved by, the W3C HTML Working Group or any other body, and it may be updated, replaced, or rendered obsolete by other documents at any time. It is inappropriate to use this document as reference material or to cite it as other than a "work in progress".

[Temporary text in this specification is offset in light blue like this passage. This represents an early working draft, and is still incomplete in many areas of the prose text. See the section on Availability for specific information on the DTD and associated files. Also see Appendix E. for revision information.]

Errors or omissions in this document should be reported to the author.


This document describes a modularized XML document type based on HTML 4.0, explains the modularization model used, gives guidelines for extending HTML using this model, and provides an example extension using the MathML DTD. It includes XML DTDs and all necessary support files.

Because concepts of DTD modularization are often misunderstood, many terms are defined in detail, and an explanatory section on modularization and a discussion of the HTML 4.0 document model are provided.


1. Introduction

This document describes a modularization model for HTML 4.0, modified for compatibility with XML 1.0.

In this document XHTML (i.e., "extensible HTML") is used to refer to this XML reformulation of HTML 4.0 rather than the W3C HTML Working Group code name Voyager, in order to reduce confusion with the products of the Working Group.

1.1 Contents

This document is organized as follows:

1. Introduction
describes design goals, relations to other specifications, and defines key terms
2. Modularization Concepts
introduces basic modularization concepts, without regard to the specifics of the HTML modularization
3. Differences Between SGML and XML
describes the differences between SGML and XML that affect DTD and document model design
4. HTML 4.0 Document Model
describes the HTML document model and design rationale used in developing the modularization model
5. Modularization of HTML 4.0
describes the specific design structures used in the modularization model and introduces the XML modules
6. Packaging and Delivery
describes the DTD driver files for XML modularized versions of HTML 4.0 Strict, Transitional and Frameset DTDs and guidelines on packaging and delivery
7. Extending XHTML
describes guidelines for use of DTD fragments, for extending a modular DTD, including an example extension using MathML
8. XHTML as an Architecture
describes use of XHTML as an architecture, describing an optional architectural forms module and how it might be used
Appendix A. XML Files
Appendix B. Document Model Changes
Appendix C. References
Appendix D. Acknowledgements

1.2 Design Goals

This document is the result of a requirements analysis based on document interoperability and application compatibility needs, experience with SGML and XML conversion issues, and much thought and discussion about the role of a modularized HTML DTD as a reference XML document type, a well-understood foundation for extension into other knowledge domains.

Following is the set of design goals used in preparation of this document:

  1. This specification shall be strictly conformant with the XML 1.0 specification.
  2. This specification shall not be in conflict with the XML linking or stylesheet specifications.
  3. This specification shall endeavour to conform to the HTML 4.0 document model as closely as possible. Where conformance varies, this should be noted in this specification. (A section that highlights these differences is included.)
  4. This specification addresses what is considered the highest-priority need for an XML modularization of HTML: to act as a reference document type for the use of HTML markup in XML environments. It is designed for use as fragments, subsets, extensions and other variants.
  5. Whereas the HTML 4.0 document model is to be matched wherever possible, when a variance must be made, the goal is expressly not to create a conversion DTD for use as a target document type for conversions from HTML, but a reference DTD that preserves an assumed intention of the HTML document model. When a variance is made, a secondary goal will be the simplicity of the DTD design, with reuse in mind. For a discussion of DTD types, see Section 3.1.3 of [DEVDTD].
  6. Clarity and reusability are of utmost importance.
  7. Both public and system identifiers are to be supplied for all external entities.

1.3 How this document relates to other specifications

This document is also informed by or related to other specifications and documents. Among these are some of note:

1.3.1 W3C Recommendation Extensible Markup Language (XML) 1.0

This document describes a compliant application of the W3C Recommendation Extensible Markup Language (XML) 1.0 [XML].

1.3.2 W3C Recommendation HTML 4.0 Specification

This document describes document types based on the W3C HTML 4.0 Recommendation [HTML40]. This includes XML versions of the HTML 4.0 Strict, Transitional, and Frameset DTDs. Where significant changes are necessary in the transformation to XML, these are noted in this document.

1.3.3 W3C Recommendation Mathematical Markup Language (MathML) 1.0 Specification

This document describes a document type extension incorporating the W3C MathML 1.0 Recommendation [MATHML]. Where changes are necessary, these are noted in this document. In particular, two element type names are in conflict between the two markup languages, and this conflict must be resolved. Decisions regarding this matter will be made in cooperation with the W3C HTML Working Group and the authors of the MathML specification.

1.3.4 W3C Working Draft Reformulating HTML in XML

While informed by the ongoing work of the W3C HTML Working Group (of which the author is an active member), this document does not represent a W3C product, nor does it follow all decisions made by the W3C in its XML version of HTML 4.0, code-named "Voyager", as described in the Working Draft Reformulating HTML in XML [HTMLXML].

In particular, the element content models of this specification match those of the HTML 4.0 SGML DTD at a child level, which is not the case with the W3C product. The modularization model also differs greatly, as is described in detail below. Where a choice is made between matching the document model of HTML 4.0 vs. Voyager, this specification favors HTML 4.0. When this specification varies from both HTML 4.0 and Voyager, it will be noted.

Both this specification and Voyager adopt a minimalist approach in implementing XML markup, such as supplementing HTML with XML linking or other syntax. Both specifications add xml:lang and xml:space where appropriate; this specification fleshes out all XLink attributes on the HTML anchor element, whereas Voyager only adds the xml:link attribute itself.

This specification will track HTML 4.0 errata as they are addressed by the W3C HTML Working Group.

1.3.5 W3C Working Draft XML Linking Language (XLink)

This specification implements all appropriate XLink attributes on the HTML anchor ("A") element in order to promote experimentation. Since the XLink specification [XLINK] is still a W3C Working Draft, users should follow the same precautions as when working with any work in progress. Because XLink is expected to be the de facto standard for linking in XML, this specification will track changes that affect its experimental usage.

1.3.6 Developing SGML DTDs: From Text to Model to Markup

This specification and its author owe a debt of gratitude to many industry experts. They have published a number of excellent online and printed sources of information on SGML, HTML and XML, but few specifically on document type design. A primary resource used in development of this document is the text by Eve Maler and Jeanne El Andaloussi, Developing SGML DTDs: From Text to Model to Markup [DEVDTD]. Eve Maler and Terry Allen are the principal editors and maintainers of the DocBook 3.0 DTD, which provides an industry benchmark in the development and use of a modularized DTD. Helpful in this regard is the Customizer's Guide to the DocBook DTD V2.4.1 (see [DOCBOOK]), by Eve Maler and Terry Allen.

1.4 Definitions

While some terms are defined in place, the following definitions are used throughout this document. Readers may not fully understand these definitions without also reading through the specification. Familiarity with the W3C XML 1.0 Recommendation [XML] is highly recommended.

document type
a class of documents sharing a common abstract structure. The ISO 8879 [SGML] definition is as follows: "a class of documents having similar characteristics; for example, journal, article, technical manual, or memo. (4.102)"
document model
the effective structure and constraints of a given document type. The document model constitutes the abstract representation of the physical or semantic structures of a class of documents.
markup model
the markup vocabulary (i.e., the gamut of element and attribute names, notations, etc.) and grammar (i.e., the prescribed use of that vocabulary) as defined by a document type definition (i.e., a schema). The markup model is the concrete representation in markup syntax of the document model, and may be defined with varying levels of strict conformity. The same document model may be expressed by a variety of markup models.
document type definition (DTD)
a formal, machine-readable expression of the XML structure and syntax rules to which a document instance of a specific document type must conform; the schema type used in XML 1.0 to validate conformance of a document instance to its declared document type. The same markup model may be expressed by a variety of DTDs.
reference DTD
a DTD whose markup model represents the foundation of a complete document type. A reference DTD provides the basis for the design of a "family" of related DTDs, such as subsets, extensions and variants.
subset DTD
a DTD whose document model is a proper subset of that of a reference document type, and whose conforming document instances are still valid according to the reference DTD. A subset may place tighter restrictions on the markup than the reference, remove elements or attributes, or both.
extension DTD
a DTD whose document model extends a reference document type (usually by the addition of element types or attributes), but generally makes no profound changes to the reference document model other than required to add the extension's semantic components. An extension can also be considered a proper superset if the reference document type is a proper subset of the extension.
variant DTD
a DTD whose document model alters (through subsetting, extension, and/or substitution) the basic data model of a reference document type. It is often difficult to transform without loss between instances conforming to a variant DTD and the reference DTD.
fragment DTD
a portion of a DTD used as a component either for the creation of a compound or variant document type, or for validation of a document fragment. Neither SGML nor XML currently has standardized methods for such partial validation.
content model
the declared markup structure allowed within instances of an element type. XML 1.0 differentiates two types: elements containing only element content (no character data) and mixed content (elements that may contain character data optionally interspersed with child elements). The latter are characterized by a content specification beginning with the "#PCDATA" string (denoting character data).
semantic component
a unit of document type specification corresponding to a distinct type of content, corresponding to a markup construct reflecting this distinct type.
element type
the definition of an element, that is, a container for a distinct semantic class of document content.
element
an instance of an element type.
generic identifier
the name identifying the element type of an element. Also, element type name.
tag
descriptive markup delimiting the start and end (including its generic identifier and any attributes) of an element.
markup declaration
a syntactical construct within a DTD declaring an entity or defining a markup structure. Within XML DTDs, there are four specific types:
entity declaration
defines the binding between a mnemonic symbol and its replacement content.
element declaration
constrains which element types may occur as children within an element. See also content model.
attribute definition list declaration
defines the set of attributes for a given element type, and may also establish type constraints and default values.
notation declaration
defines the binding between a notation name and an external identifier referencing the format of an unparsed entity.
entity
a logical or physical storage unit containing document content. Entities may be composed of parseable XML markup or character data, or unparsed (i.e., non-XML, possibly non-textual) content. Entity content may be either defined entirely within the document entity ("internal entities") or external to the document entity ("external entities"). In parsed entities, the replacement text may include references to other entities.
entity reference
a mnemonic or numeric string used as a reference to the content of a declared entity (e.g., "&amp;" for "&", "&#60;" for "<", "&copyright;" for "Copyright 1999 Sun Microsystems, Inc.").
instantiate
to replace an entity reference with an instance of its declared content.
parameter entity
an entity whose scope of use is within the document prolog (i.e., the external subset/DTD or internal subset). Parameter entities are disallowed within the document instance.
module
an abstract unit within a document model expressed as a DTD fragment, used to consolidate markup declarations to increase the flexibility, modifiability, reuse and understanding of specific logical or semantic structures.
modularization
an implementation of a modularization model; the process of composing or de-composing a DTD by dividing its markup declarations into units or groups to support specific goals. Modules may or may not exist as separate file entities (i.e., the physical and logical structures of a DTD may mirror each other, but there is no such requirement).
modularization model
the abstract design of the document type definition (DTD) in support of the modularization goals, such as reuse, extensibility, expressiveness, ease of documentation, code size, consistency and intuitiveness of use. It is important to note that a modularization model is only orthogonally related to the document model it describes, so that two very different modularization models may describe the same document type.
DTD driver
a generally short file used to declare and instantiate the modules of a DTD. A good rule of thumb is that a DTD driver contains no markup declarations that comprise any part of the document model itself.
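
As an informal illustration of the four kinds of markup declaration and the two kinds of content model defined above, the following sketch may help. All names in it (demo-list, demo-item, demo-term, demo-gif, %Demo.Text;) are hypothetical and are not part of the DTD described in this document. Because a parameter entity reference appears inside a declaration, the fragment is assumed to live in an external DTD subset.

    <!-- entity declaration: a parameter entity used as a content-type label -->
    <!ENTITY % Demo.Text "CDATA" >

    <!-- notation declaration: names the format of unparsed data -->
    <!NOTATION demo-gif SYSTEM "image/gif" >

    <!-- element declarations: element content first, then mixed content;
         "demo-list" and "demo-item" are the generic identifiers -->
    <!ELEMENT demo-list (demo-item)+ >
    <!ELEMENT demo-item (#PCDATA | demo-term)* >
    <!ELEMENT demo-term (#PCDATA) >

    <!-- attribute definition list declaration for one element type -->
    <!ATTLIST demo-item
          id      ID           #IMPLIED
          label   %Demo.Text;  #IMPLIED
    >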

1.5 Availability

This specification and the formal XML declarations for XHTML 1.0 described herein are protected by copyrights held by the IETF, the W3C, and Sun Microsystems.

Permission to use, copy, modify and distribute the XHTML 1.0 DTD and its accompanying documentation for any purpose and without fee is hereby granted in perpetuity, provided that all copyright notices found in the original documents appear in all copies. The copyright holders make no representation about the suitability of the DTD for any purpose. It is provided "as is" without expressed or implied warranty.

Please also note the Status of this Document section above.

1.5.1 Current status:

The XHTML 1.0 Strict DTD is complete and valid, and mirrors the element content models of HTML 4.0 Strict at a child level. The XHTML 1.0 Transitional DTD is complete and valid, and apart from desired differences, matches HTML 4.0 Transitional suitably for an XML transformation. The XHTML 1.0 Frameset DTD is basically complete, although prose text describing it does not exist. The MathML DTD awaits resolution of naming conflicts on one attribute and two elements.



2. Modularization Concepts

2.1 What is DTD Modularization?

The gamut of element types and attributes found in a document type is often described in object-oriented terms, such as "classes" and "subclasses", "global" and "local" scoping, "inheritance", etc. Despite the common use of such buzzwords, SGML and XML describe markup languages, and as such, these terms are somewhat misplaced.

What is recognized is that document models usually contain informational constructs that may be grouped into common categories. These are categories of what are referred to by Eve Maler and Jeanne El Andaloussi in Developing SGML DTDs: From Text to Model to Markup [DEVDTD] as semantic components. These distinct "units of specification" represent containers for distinct types or classes of information (i.e., data or human knowledge). So while the markup itself is not object-oriented, the classifications of content it describes may be. Use of the terms "class" and "subclass" in this document therefore refers more to "classification" and "subclassification" respectively. By modeling the document type definition on these common categories of semantic components, the commonalities and divisions of the document type allow a modularization model to be created.

Maler and El Andaloussi further describe a division of semantic components into three categories: content, structure, and presentation (see the section Recognizing Content, Structure, and Presentation in [DEVDTD]).

When a document model is fairly simple, design and delivery constraints usually don't warrant modularizing the markup model. But when the document model is complex, when network constraints warrant, when a customization is desired without modification of the reference DTD, or perhaps even when components of the DTD are delivered from different locations, breaking the DTD into fragments or modules is a good solution. Perhaps the best reason is to help implement the inherent structures of the document model in the markup model (since neither SGML nor XML has any such features), which helps in design, maintenance and documentation.

Also, software applications are often designed based on common modules. For example, programming code already exists for rendering CALS or HTML tables, so use of an existing module may lead to a general improvement in interoperability, documentation, and understanding within a user community.

At first glance the added overhead and syntax complexity associated with modularization may seem daunting, but many years of industry experience would suggest that its benefits usually outweigh the costs.

While it is convenient to categorize information, in practice such categorization must be considered carefully, as the same information can often be marked up in different ways, depending on the intention and processing expectations associated with the content. For example, since HTML is strongly presentational, whether an element type is content-based or presentational is often less important than whether it is a block or inline element. In some markup languages, presentation issues are minimal, left entirely to stylesheets, or even completely absent.

2.2 Facilitating Customization

So if one were to divide a document model into parts, the dividing lines would occur at the semantic component level, and be implemented in the markup model by creating DTD fragments, or modules.

Modules are often used to encompass the markup declarations of a specific semantic component or "feature", from higher-level components like tables and forms to lower-level ones like specific elements or element groups. Modules can contain modules, creating a hierarchical structure mirroring the document model. Modules are abstract structures, so they can be implemented in various ways, such as a simple designation using comment delimiters, using marked sections, or using entities (see Use of Marked Sections and Files vs. Modules below). Many such methods have been employed by the markup community over the years, particularly where a DTD is expected to be commonly modified or used as a source of DTD fragments (such as the TEI DTD [TEI]), and studying existing DTDs often yields many ways of solving complex problems.

While the idea of "plug and play" with DTD modules is very attractive, in practice this isn't quite so simple. Because complex document models often resort to classification of semantic components to facilitate understanding, markup reuse, extensibility and maintenance (through use of parameter entities), seldom are DTD modules completely self-contained, so there is usually a fair amount of "rewiring" involved in adding or removing a DTD module. A compromise must be made between ease of maintenance or extensibility and complexity of the DTD, and this is where good design of the modularization model (and good documentation) can make all the difference.

Some of the expressive power of SGML DTDs useful for this classification (e.g., name groups) is unavailable in the simplified syntax of XML, but many other markup features (and well-written documentation) can go a long way in creating a straightforward and effective modularization.

DTDs are written for humans as much as for machines, and in fact act as an interface between structures of human information and a machine representation. Some representations are more explicit than others, some rely merely on human understanding and do not impact processing directly. If a DTD makes sense to its intended audience and represents an appropriate modelling of its document type (being as loose or restrictive as necessary for its intended application), then it is a success.

To this end, parameter entities are often used to represent various structures. These are described below in more detail.

[creation of PEs for use as common element classes; eg., %flow;, %block; and %inline; in HTML; use of various naming conventions for parameter entities.]

description of .module parameter entities...
description of .content parameter entities...
description of .class parameter entities...
description of .mix parameter entities...
description of .attrib parameter entities...

2.2.1 Content-Type or Common Strings

When a semantic component represents a specific data or content type, its representation within the DTD may be handled by creating a parameter entity as a content-type label. The HTML 4.0 DTD uses such parameter entities extensively. This can be seen clearly in the element declaration for the anchor element below:

    <!ELEMENT A - - (%inline;)* -(A)       -- anchor -->
    <!ATTLIST A
      %attrs;                              -- %coreattrs, %i18n, %events --
      charset     %Charset;      #IMPLIED  -- char encoding of linked resource --
      type        %ContentType;  #IMPLIED  -- advisory content type --
      name        CDATA          #IMPLIED  -- named link end --
      href        %URI;          #IMPLIED  -- URI for linked resource --
      hreflang    %LanguageCode; #IMPLIED  -- language code --
      rel         %LinkTypes;    #IMPLIED  -- forward link types --
      rev         %LinkTypes;    #IMPLIED  -- reverse link types --
      accesskey   %Character;    #IMPLIED  -- accessibility key character --
      shape       %Shape;        rect      -- for use with client-side image maps --
      coords      %Coords;       #IMPLIED  -- for use with client-side image maps --
      tabindex    NUMBER         #IMPLIED  -- position in tabbing order --
      onfocus     %Script;       #IMPLIED  -- the element got the focus --
      onblur      %Script;       #IMPLIED  -- the element lost the focus --
      >

Nearly all of the above attribute types (%Charset;, %ContentType;, etc.) resolve to "CDATA", "NMTOKEN", or "NMTOKENS", all essentially unparsed string containers; %Shape;, which resolves to an enumerated list of names, is the exception.

As a specific example, within the DTD itself a URI container is just another string container, but it obviously has a specific meaning to humans and applications. In HTML 4.0, URIs are represented within the DTD by the parameter entity %URI;, and declared within the DTD as:

    <!ENTITY % URI "CDATA"
         -- a Uniform Resource Identifier,
            see [URI]
         -->

Because the syntax of comments has been simplified in XML to those found only in comment declarations, the above comment must be rewritten. Since the comment is no longer contained within the declaration, it is common practice for comments to precede their declarations, such as:

     <!-- a Uniform Resource Identifier, see [URI] -->
     <!ENTITY % URI "CDATA" >

While use of such content-type parameter entities doesn't impact the document model, they can be valuable in making a DTD easier for both authors and application developers to understand.
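
As a minimal sketch of how such content-type labels might read once rewritten in XML syntax, consider the fragment below. The element type "demo-ref" and the value chosen for %LanguageCode; are assumptions made for illustration, not declarations taken from the DTD described here. Since the entity references occur inside an attribute list declaration, the fragment is assumed to be part of an external DTD subset.

    <!-- a Uniform Resource Identifier, see [URI] -->
    <!ENTITY % URI "CDATA" >

    <!-- a language code; the NMTOKEN value is an assumption -->
    <!ENTITY % LanguageCode "NMTOKEN" >

    <!-- hypothetical element type carrying two labelled attributes -->
    <!ELEMENT demo-ref EMPTY >
    <!ATTLIST demo-ref
          href      %URI;           #REQUIRED
          hreflang  %LanguageCode;  #IMPLIED
    >

To a parser both attributes are simply strings or name tokens; the labels exist for the benefit of the human reader and the application developer.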

2.2.2 Element Classes (.content, .class, .mix)

[creation of PEs for use as common element classes; eg., %flow;, %block; and %inline; in HTML; use of various naming conventions for parameter entities.]

    <!-- %Inline.mix; includes all inline elements -->
    <!ENTITY % Inline.mix
         "...
          | %Inlphras.class;
          | %Inlspecial.mix;
          | %Formctrl.class;" >

    <!-- %Block.mix; includes all block elements -->
    <!ENTITY % Block.mix
         "...
          | %Blkphras.class;
          | %Blkspecial.mix;" >

    <!-- %Flow.mix; includes all text content, block and inline -->
    <!ENTITY % Flow.mix
         "...
          | %List.class;
          | %Block.mix;
          | %Inline.mix;" >
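
The general pattern of building a .mix entity from .class entities and then using it in a content model can be sketched as follows. This is only an illustration: the entity names beginning with "Demo." and the element types grouped under them are hypothetical, and the fragment is assumed to be part of an external DTD subset.

    <!-- hypothetical classes: small groups of related element types -->
    <!ENTITY % Demo.phrase.class  "em | strong" >
    <!ENTITY % Demo.special.class "a | img" >

    <!-- a mix combines several classes for use in content models -->
    <!ENTITY % Demo.inline.mix
         "%Demo.phrase.class;
          | %Demo.special.class;" >

    <!-- mixed content: character data interspersed with any inline element;
         after entity replacement this reads (#PCDATA | em | strong | a | img)* -->
    <!ELEMENT p (#PCDATA | %Demo.inline.mix;)* >

With this arrangement, adding an element type to a class changes every content model that uses the class or any mix built from it, without editing the individual element declarations.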

2.2.3 Attribute Classes (.attrib)

[creation of PEs for use as common attribute classes] While there is no feature in XML for "global" attributes (i.e., an attribute that applies to all element types), parameter entities may be used to create classes of attribute type specifications that may be reused within the DTD.

Here's an example from the DTD:

    <!ENTITY % Core.attrib
       "id          ID             #IMPLIED
        class       CDATA          #IMPLIED
        style       %StyleSheet;   #IMPLIED
        title       %Text;         #IMPLIED"
    >
    <!ENTITY % I18n.attrib
       "lang        %LanguageCode; #IMPLIED
        xml:lang    %LanguageCode; #IMPLIED
        dir         (ltr|rtl)      #IMPLIED"
    >
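
A brief sketch of how such attribute classes might then be applied to an element type follows. It assumes that %Core.attrib; and %I18n.attrib; (and the content-type entities they reference) have already been declared earlier in the same external DTD subset; the simplified content model for "p" is an assumption made to keep the example short.

    <!-- content model simplified for illustration -->
    <!ELEMENT p (#PCDATA) >

    <!-- both attribute classes are applied by reference -->
    <!ATTLIST p
          %Core.attrib;
          %I18n.attrib;
    >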

[the renaming of attribute classes (such as %Coreattrs; to %Core.attrib;) hasn't been implemented in this version, but is planned. Changes to %Core.attrib;, %I18n.attrib;, %Common.attrib;, %Alink.attrib;, %Events.attrib;, maybe others]

2.3 Precedence Order of XML Declarations

[Note the precedence order of declarations vs. redeclaration of variables in a programming language and show why this makes good sense. Discuss both external and internal DS.]

[modularization changes via: module replacement, predeclaration, parameter entity replacement, module amendation (eg., later, using ATTLISTs), marked sections, etc.]
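
As a small illustration of the precedence rule in XML 1.0: when the same entity is declared more than once, the first declaration encountered is binding and later ones are ignored. This is the reverse of reassigning a variable in most programming languages. The entity name and element types below are hypothetical, and the fragment is assumed to be part of an external DTD subset.

    <!-- first declaration encountered: this one is binding -->
    <!ENTITY % Demo.inline.mix "em | strong | math" >

    <!-- a later declaration of the same entity is ignored (not an error) -->
    <!ENTITY % Demo.inline.mix "em | strong" >

    <!-- effective content model: (#PCDATA | em | strong | math)* -->
    <!ELEMENT p (#PCDATA | %Demo.inline.mix;)* >

This is why a customization is placed ahead of the declarations it is meant to replace, for instance in the internal subset, which is read before the external subset.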

2.4 Use of Marked Sections

[use of marked sections to create module boundaries and 'switches']

    <!-- Tables Module .................................... -->
    <!ENTITY % XHTML1-table.module "INCLUDE" >
    <![%XHTML1-table.module;[
    <!ENTITY % XHTML1-table
         PUBLIC "-//Sun Microsystems//ELEMENTS XHTML 1.0 Tables//EN"
                "XHTML1-table.mod" >
    %XHTML1-table;
    ]]>

2.5 Files vs. Modules

Note that a DTD module does not necessarily imply a separate file entity. For example, the DocBook 3.0 DTD is delivered as a single driver file, comprising about three hundred lines of code (about half of which are comments). The driver declares and instantiates four file "modules", which themselves are made up of over three hundred internal modules (i.e., the entity boundaries are unimportant: the same DocBook DTD could be delivered as one, four, or three hundred files. See: DTD normalization.). After the parameter entities comprising the file modules have been instantiated, the DTD is over 7,600 lines long.
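
The shape of such a driver can be sketched in a few lines. The module names, public identifiers and file names below are all hypothetical; the point is only that the driver declares each file module as an external parameter entity and then instantiates it, while contributing no declarations of its own to the document model.

    <!-- hypothetical driver file -->

    <!-- declare, then instantiate, a first file module -->
    <!ENTITY % Demo-text.mod
         PUBLIC "-//Example//ELEMENTS Hypothetical Text Module//EN"
                "demo-text.mod" >
    %Demo-text.mod;

    <!-- declare, then instantiate, a second file module -->
    <!ENTITY % Demo-list.mod
         PUBLIC "-//Example//ELEMENTS Hypothetical List Module//EN"
                "demo-list.mod" >
    %Demo-list.mod;

Merging the two module files into the driver itself would yield the same DTD delivered as one file instead of three.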

When network performance is an issue, decisions over how to deliver a DTD may come into play. When network bandwidth is limited or packet delivery overhead is high, delivery of a single file is faster than numerous small network accesses, but under some conditions (such as when delivering over an unreliable connection where redelivery is common) smaller files may be preferred. In either case, if the DTD is large, delivery may be a consideration in the entity design. But because DTDs are text files (and therefore even large DTDs are smaller than most GIF images on the Web), delivery performance is usually less of an issue; convenience and utility are greater factors.

2.6 Indirection and Public Identifiers

[use of public ids and catalog files vs. system ids; mention URNs?]

2.7 XML External and Internal Subset Differences

[parameter entities allowed only where declarations may occur in internal subset and impact]
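
The restriction can be illustrated with a small, self-contained document. All names in it are hypothetical. In the internal subset, XML 1.0 permits a parameter entity reference only where a complete markup declaration could occur, not inside a declaration, so the attribute class technique described earlier is available only in the external subset.

    <?xml version="1.0"?>
    <!DOCTYPE demo [
      <!ELEMENT demo (#PCDATA) >
      <!ENTITY % Demo.attrib "id ID #IMPLIED" >

      <!-- not permitted in the internal subset, because the reference
           would fall inside a markup declaration:
           <!ATTLIST demo %Demo.attrib; >
      -->

      <!-- permitted: the reference stands where a whole declaration
           may occur -->
      <!ENTITY % Demo.decl "<!ATTLIST demo id ID #IMPLIED >" >
      %Demo.decl;
    ]>
    <demo id="d1">text</demo>

The practical impact is that a modular DTD relying on class and attribute entities generally has to be delivered as an external subset, with the internal subset reserved for overriding whole entity declarations.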



3. Differences Between SGML and XML

[briefly describe relationship between SGML and XML, describe this section as a general issues list for those familiar with SGML or involved in document conversion...]

3.1 Restrictions imposed by XML

3.2 Relaxation of previous SGML rules

3.3 Other changes imposed by XML



4. HTML 4.0 Document Model

The introductory description of an HTML 4.0 document found in Section 7.1 of the W3C HTML 4.0 Recommendation [HTML40] is confusing and somewhat misleading. This may be an attempt to simplify the SGML terminology elaborated upon later in the specification, or perhaps account for markup minimization which in HTML allows much of the higher-level document model to be implied when absent from a document instance. Nevertheless, this deserves remedy, particularly when HTML is transformed into XML where such types of minimization are not allowed.

4.1 Document Structure Summary

The three "parts" of an HTML document as described in the HTML 4.0 Recommendation:

  1. a line containing HTML version information,
  2. a declarative header section (delimited by the HEAD element)
  3. a body, which contains the document's actual content. The body may be implemented by the BODY element or by the FRAMESET element.

4.1.1 Document Prolog

The first item above is of course the DOCTYPE declaration, which represents part of the SGML prolog, corresponding to Production 22 of the XML 1.0 specification [XML]. The DOCTYPE declaration is not so much a "version label" as a declaration of the document element type name ("HTML"), followed by an external reference (in this case, a Formal Public Identifier) to an HTML DTD. For more information on external identifiers, see Section 4.2.2, External Entities, [XML].

NOTE: an XML prolog also includes the XML declaration (a special processing instruction) and optional miscellaneous content (processing instructions, comments and whitespace), but for purposes of this discussion this will be ignored. Also, for better compatibility with Web usage, XML further requires the external reference to include a Uniform Resource Identifier [URI].
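
For comparison, the document type declaration for HTML 4.0 Strict as given in [HTML40] is:

    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN"
            "http://www.w3.org/TR/REC-html40/strict.dtd">

A prolog for an XML document might look like the sketch below. The public identifier, the URI and the case of the document element type name are hypothetical placeholders, not the identifiers assigned to the DTDs described in this document. In XML the system identifier (here a URI) may not be omitted.

    <?xml version="1.0"?>
    <!DOCTYPE html PUBLIC "-//Example//DTD Hypothetical XHTML Driver//EN"
                          "http://www.example.org/dtd/hypothetical-xhtml.dtd">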

In current Web practice, the significance of the DOCTYPE declaration is almost nil. Mainstream HTML browsers ignore its presence and are unable to process any portion of the document prolog. XML 1.0 requires conformant applications to at least be able to parse the declaration and any internal subset. Validating parsers are expected to be able to instantiate and parse external references in the external and internal subsets. This will be elaborated further below.

4.1.2 Document Element

Missing from the above list is the root or document element, which represents the outermost container for all document content. Another way of stating this is that the document element contains all content between the <HTML> start tag and </HTML> end tag, which serve as delimiters. The HTML document element has two required children, the HEAD and BODY elements.

NOTE: Whereas HTML 4.0's markup minimization rules allow document authors to omit the tags for the HTML, HEAD and BODY elements (curiously, the only required element in HTML 4.0 is TITLE), they are nevertheless always implied (i.e., actually present in the document model).
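
As an illustration, the following is a complete HTML 4.0 document in which the tags for HTML, HEAD and BODY, and the end tag for P, have all been omitted:

    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN">
    <TITLE>Minimal</TITLE>
    <P>Hello

The document model implied by that instance is the one written out in full below. Leaving aside the XML declaration and questions of element name case, this explicit form is what an XML version must spell out, since XML does not permit tag omission:

    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN">
    <HTML>
    <HEAD>
    <TITLE>Minimal</TITLE>
    </HEAD>
    <BODY>
    <P>Hello</P>
    </BODY>
    </HTML>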

4.1.3 Document HEAD Element

The HTML 4.0 specification describes the HEAD element as "declarative", containing information about the document. This document metadata is typically not rendered as document content, but strictly speaking, it is of course part of the HTML document.

Within the document HEAD, HTML prescribes no particular structure, merely an unordered container for the element types TITLE, BASE, SCRIPT, STYLE, META, LINK, and OBJECT. Of these element types, the document's TITLE element must occur exactly once, and the optional BASE element may occur at most once. The rest may occur zero or more times within the HEAD element, in any order.

Please refer to Section 7.4 of the HTML 4.0 specification [HTML40] for detailed descriptions of these element types.
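
HTML 4.0 expresses this unordered container with the SGML "&" connector and an inclusion exception, neither of which exists in XML 1.0. One way such a content model could be written for XML is sketched below: the repeatable element types are allowed between and around TITLE and BASE, and the two possible orders of TITLE and BASE are spelled out. This is an illustration of the technique only, and is not necessarily the content model used in the XHTML modules.

    <!ENTITY % head.misc "(script | style | meta | link | object)*">

    <!-- title exactly once, base at most once, the rest anywhere -->
    <!ELEMENT head
        ( %head.misc;,
          ( ( title, %head.misc;, ( base, %head.misc; )? )
          | ( base, %head.misc;, title, %head.misc; ) ) )>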

4.1.4 Document BODY Element

Following the HEAD element is the BODY element, which contains all document content typically rendered by an HTML user agent. It is within the BODY element that most of the structure of an HTML document is found.

The names of many of HTML's element types appear to be borrowed from the General Document DTD in Annex E of ISO 8879:1986 [SGML], including BODY, H1 to H6, P, ADDRESS, and TITLE, and all of HTML's list element types: DL, DT, DD, UL, OL, and LI.

The BODY element contains basically no higher-level structures such as chapters or sections. An HTML document consists of a shallow stream of elements, some requiring a slightly deeper structure. Unlike many other industry DTDs that prescribe deep structural nesting (DocBook inline elements commonly begin at a seventh or eighth nested level and may reach a depth of a dozen or more), HTML rarely requires more than two or three levels.

The DIV and SPAN elements allow for recursive containership, which could be used to create a deeper structure within an HTML document. However, because they are generic elements and optional, their use in this regard is rather limited.
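
As an illustration only, nested DIV elements could stand in for chapters and sections, and nested SPAN elements for inline containers. The class values below are hypothetical; nothing in the DTD gives them meaning or requires this nesting.

    <body>
      <div class="chapter">
        <h1>Chapter title</h1>
        <div class="section">
          <h2>Section title</h2>
          <p>Text with a <span class="term">generic
             <span>nested</span> inline</span> container.</p>
        </div>
      </div>
    </body>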

4.1.5 Element Classes

Unlike HTML 3.2 [HTML32], which contains relatively unstructured content anywhere within BODY, Section 7.5.3 of the HTML 4.0 specification [HTML40] makes a clear distinction between two classes of element types, block and inline:

inline elements
all character-level elements
block elements
all block-like elements (eg., paragraphs and lists)

While the parameter entities that enabled this delineation existed in HTML 3.2, HTML 4.0 is more disciplined in constraining block and inline elements, although there are plenty of holes in both document models.
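
The role of such parameter entities can be sketched with a handful of element types. The declarations below are a simplified illustration, not the HTML 4.0 or XHTML sets: one entity names the inline class, another the block class, and content models then admit one class, the other, or both. Note that XML mixed content requires #PCDATA to appear first in the group, which constrains how the two classes may be combined.

    <!-- Simplified sketch: a few element types only -->
    <!ENTITY % inline "em | strong | span | a">
    <!ENTITY % block  "p | ul | div | blockquote">

    <!-- inline-only content -->
    <!ELEMENT p          (#PCDATA | %inline;)* >
    <!-- block-only content -->
    <!ELEMENT blockquote (%block;)+ >
    <!-- either class; #PCDATA must still come first -->
    <!ELEMENT div        (#PCDATA | %inline; | %block;)* >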

4.2 Rendered Content Structure

4.2.1 H1-H6 Heading Elements

The flat structure of BODY contains six numbered headings, from H1 to H6. These, however, do not serve as nested structures but rather as a hierarchy of section titles for a nonexistent section structure. There is no enforcement of order or occurrence. The ISO/IEC 15445:1998 HyperText Markup Language (HTML) DTD [ISO-HTML] attempts to remedy this by creating a nested structure of implied B1 to B6 element types. Because XML does not allow for such markup minimization, this solution is unfortunately not available in an XML-based HTML document type.
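
To illustrate the lack of enforcement, the following fragment is permitted by the HTML 4.0 content model for BODY even though its headings skip levels and appear out of order.

    <body>
      <h3>A third-level heading comes first</h3>
      <p>Some text.</p>
      <h1>Then a top-level heading</h1>
      <h5>Immediately followed by a fifth-level heading</h5>
      <p>More text.</p>
    </body>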

4.2.2 Structural Elements


4.2.3 Block Elements

[DIV, P, BR]

4.2.4 Inline Elements

[SPAN, B, EM, STRONG, etc.]

4.2.5 Phrasal Elements

[most inlines in Strict]

4.2.6 Presentational Elements

[most of those relegated to Transitional only]

4.2.7 Special Case or 'Feature' Elements

[tables, forms, etc.]



5. Modularization of HTML 4.0

Looking back over the description of how semantic components may be categorized (see Section 3), we must acknowledge that HTML, unlike many (or even most) existing markup languages, is strongly presentational. This has influenced the precedence given to the groupings of element types, favoring an early branching of "block" vs. "inline" over "phrasal" vs. "presentational". In practice, HTML differentiates more strongly between block and inline than between element types that represent a semantic distinction and those that are merely a presentational effect: people use HTML tags to achieve desired effects.

The semantic components of HTML are classified by the delineating categories below. Note that because H1 through H6 act as heading titles, not nested containers, they are classified as block phrasal, not structural.

structural elements
includes all element types that create the overall structure of an HTML document.
block elements
elements that (according to the HTML 2.0, 3.2 and 4.0 specifications) should cause a line break.
inline elements
elements that (according to the HTML 2.0, 3.2 and 4.0 specifications) are displayed inline to an existing block.
phrasal (aka "content-based") elements
elements whose presence denotes a content-based distinction.
presentational elements
elements whose presence indicates a desire on the part of the author for a specific presentational effect.
special case (or "feature") elements
a general category for elements that provide HTML with special features, such as linking, forms, tables, etc.

Giving priority to "block" vs. "inline" (as described above) we find the following result:

5.1 Parameter Entity Containers

The DTD uses many parameter entities to create various classes of names, attribute declarations, etc. to further the modularity, reuse and understanding of its declarations.

The "Common Names", "Common Attributes" and "Document Model Hierarchies" modules are declared near the beginning of the DTD, enabling use of parameter entities within each of the DTD modules as encountered. These are called "preliminary declarations" below, and include common names, attributes, and also the long list of ISO character entities.

5.1.1 Common Names

[description of Common Names]

5.1.2 Attribute Classes

Based on the parameter entity attribute class naming scheme, the set of parameter entities for XHTML attribute classes is as follows:

attributes on almost all elements: id, class, style and title. (Curiously, style is considered 'core')
internationalization attributes for language and text direction
attributes for support of intrinsic events
a collection of 'Core', 'I18n' and 'Events' attributes used on many elements
additional attributes for XLink simple anchors
alignment attributes
alignment attributes for images
used in tables for horizontal cell alignment
used in tables for vertical cell alignment
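
The general technique behind attribute classes can be sketched as follows. The entity names here follow those of the HTML 4.0 DTD (%coreattrs;, %i18n;, %attrs;) and may differ from the names used in the XHTML modules; the attribute types are simplified, and the event attributes are omitted for brevity.

    <!ENTITY % coreattrs
        "id     ID     #IMPLIED
         class  CDATA  #IMPLIED
         style  CDATA  #IMPLIED
         title  CDATA  #IMPLIED" >
    <!ENTITY % i18n
        "lang   NMTOKEN      #IMPLIED
         dir    (ltr | rtl)  #IMPLIED" >

    <!-- a collection class built from the two classes above -->
    <!ENTITY % attrs "%coreattrs; %i18n;" >

    <!ELEMENT p (#PCDATA)* >
    <!ATTLIST p %attrs; >

Declaring each class once and referencing it from every attribute list keeps the element declarations short and lets an extension change a class in a single place.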

5.1.3 Element Classes

[Description of classes]

heading elements H1-H6
list elements
inline phrasal elements
inline presentational elements
block phrasal elements
block presentational elements
form control elements

5.1.4 Element Mixes

[Description of mixes]

a mix of all inline elements
a mix of all inline elements excluding anchors
a mix of inline elements ( a|img|object|script|map ) that are special-case language 'features'. [Extension elements that are considered 'inline' will probably go here...]
a mix of all block elements
a mix of all block elements excluding form and form control elements
a mix of block-level elements ( noscript | form | table | fieldset ) that are special-case language 'features'. [Extension elements that are considered 'block' will probably go here...]
a mix of all heading, list, inline and block elements
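
A mix that excludes anchors is needed because XML has no exclusion exceptions: where HTML 4.0 can forbid A within A by exclusion, an XML DTD has to give A a content model that simply does not list it. A sketch of the idea follows, with hypothetical entity names and a reduced set of element types.

    <!-- hypothetical names; reduced element set -->
    <!ENTITY % Inline.class   "em | strong | span" >
    <!ENTITY % Inline-noa.mix "#PCDATA | %Inline.class;" >
    <!ENTITY % Inline.mix     "%Inline-noa.mix; | a" >

    <!-- anchors may not directly contain anchors -->
    <!ELEMENT a    (%Inline-noa.mix;)* >
    <!-- other inline containers may -->
    <!ELEMENT span (%Inline.mix;)* >

Unlike an SGML exclusion, this only prevents direct nesting; in this sketch an anchor inside a SPAN inside an anchor would still be valid.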

5.1.5 Element Contents

[Description of contents]

the contents of the html document element (an HTML document)
the contents of the head element
the contents of the noframes element

(XHTML 1.0 Transitional elements in italic. Transitional modules use the same base name as the Strict version, but add "-t" to the module name, eg., "XHTML1-attribs.mod" changes to "XHTML1-attribs-t.mod".)

5.1.6 Preliminary Declarations

[description of preliminary declarations, including common names, attributes, leave content model to next section]


5.1.7 Content Model Module Declarations

[description of content model module and role in declaring classes of elements]


5.1.8 Module Declarations

[description of module declarations...]



6. Packaging and Delivery

In their short history, Web browsers have promoted a model of document delivery that makes little or no effort at checking the validity of documents. While it is beyond the scope of this specification to address these issues, it is implicitly understood that the value of a DTD is in its ability to check the structure of a document instance against a specific document model. Validation may occur during authoring, delivery or reception.

[introductory paragraph to following sections...]

6.1 Referencing and Instantiating Entities

Before beginning to discuss the specifics of HTML, an explanation of how objects may be referenced and instantiated is in order.

Strictly speaking, the means of associating a document with a DTD has been part of HTML since the beginning, but is rarely used.

6.1.1 External Identifiers

[describe external identifiers...]

6.1.2 SGML Catalog Files

[Reference [CATALOG] as specification for catalog files.]

6.2 DTD Packaging

[describe packaging issues...]

6.3 DTD Drivers

[describe what a driver is and does...]

<!-- DTD for HTML 4.0 Strict -->
<!-- end of strict DTD -->
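
As a rough sketch of the general technique, a driver in a modular DTD typically declares each module as an external parameter entity and then references it, in dependency order. The module file names below are those mentioned elsewhere in this document; the public identifiers are hypothetical placeholders, and the actual driver may be organized differently.

    <!-- hypothetical driver fragment -->
    <!ENTITY % XHTML1-attribs.mod
        PUBLIC "-//EXAMPLE//ENTITIES Example Common Attributes//EN"
               "XHTML1-attribs.mod" >
    %XHTML1-attribs.mod;

    <!ENTITY % XHTML1-model.mod
        PUBLIC "-//EXAMPLE//ENTITIES Example Document Model//EN"
               "XHTML1-model.mod" >
    %XHTML1-model.mod;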

6.4 DTD Normalization

Rather than deliver a modular DTD as separate files, in certain environments where multiple file accesses may be a burden on network resources, the entire DTD may be consolidated into one file through a normalization process, which expands all external parameter entity references. James Clark's freeware SGML toolkit [SP] includes the application spam that may serve as a normalization tool. Certain changes must be made in order to work correctly with XML files. These are described below.

    % spam -p -p -c test-s.xml

The -c parameter is followed by a reference to the SGML catalog file for XHTML 1.0, which includes an SGMLDECL statement providing the parser with the correct SGML declaration for XML. The -p -p parameters direct the parser to expand all parameter entities.

Included with the distribution are several Unix ksh scripts. Each automatically creates a normalized version of a DTD upon invocation. These are named "_flat-s" (Strict), "_flat-t" (Transitional), and "_flat-f" (Frameset). Availability of SP on the host machine is required.



7. Extending XHTML

7.1 Fragment Usage Guidelines


Note that this is unrelated to the proposed products of the W3C XML Fragment Working Group, or to SGML Open Technical Resolution 9601:1996 Fragment Interchange, both of which are concerned with interchange of fragments of document content, not the reuse of DTD fragments in composing variant or compound document types.

7.2 Extension Guidelines

[discussion of subsets, extensions, etc.]


7.3 An Example Extension: Adding a Single Element

[describe the changes to XHTML1-model.mod, the catalog file, and the driver]

7.3.1 Adding a New Module


7.3.2 Modifying an Existing Module


7.3.3 Modifying the Document Model Module


7.3.4 Modifying the DTD Driver


7.3.5 Changing Identifiers


7.3.6 Modifying the Catalog File (Optional)


7.4 An Example Extension: HTML + MathML

[more complex example of changes required to add MathML]



8. XHTML as an Architecture

8.1 A Brief Introduction to Architectures


The optional XHTML1-arch.mod module includes declarations that enable XHTML to be used as a base architecture according to the Architectural Forms Definition Requirements (Annex A.3, ISO/IEC 10744, 2nd edition).

For more information on use of architectural forms, consult Part Four of David Megginson's Structuring XML Documents [STRUCTXML], or browse the HyTime web site.

8.2 The XHTML Base Architecture Module

XHTML1-arch.mod ...

    <!-- Architecture Base Declaration -->
    <?IS10744 ArcBase html ?>
    <!-- Architecture Notation Declaration -->
    <!NOTATION html 
        PUBLIC "-//Sun Microsystems//NOTATION AFDR ARCBASE XHTML 1.0//EN" >
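
As a rough sketch of how a client DTD might then map its own element types onto XHTML architectural forms, the declarations below assume the default form attribute name, which is the architecture name (here "html"). The element type names "para" and "emph" are hypothetical, and further architecture support attributes that a complete client DTD would declare are not shown.

    <!-- client element types mapped to XHTML forms via a fixed attribute -->
    <!ELEMENT para (#PCDATA | emph)* >
    <!ATTLIST para html NMTOKEN #FIXED "p" >

    <!ELEMENT emph (#PCDATA)* >
    <!ATTLIST emph html NMTOKEN #FIXED "em" >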


8.3 Using The XHTML Architecture




Appendix A. XML Files (Normative)

To reduce the size of this document, the actual files composing the normative content of this specification have not been included inline, and are shown below as hypertext links. The entire package of files is available as either a tarred, gzipped archive or a zipped archive:

Tarred, gzipped archive
Zipped archive

NOTE: The DTD and associated files use file extensions such as .mod, .dtd, etc. and may not display correctly in all browsers. If you're having difficulty viewing the files, or are planning to use the DTD, it is recommended that you download the archive rather than the individual files. Please note the current status of the DTD.

A.1 Catalog File

[Reference [CATALOG] as specification for catalog files.]

SGML Catalog File

A.2 DTD Drivers


XHTML 1.0 Strict DTD Driver
XHTML 1.0 Transitional DTD Driver
XHTML 1.0 Frameset DTD Driver
XHTML 1.0 Strict + MathML Extension DTD Driver

A.3 DTD Modules



A.4 Normalized DTDs

[describe and reference section above on DTD normalization...]

Below are links to normalized, "single-file" versions of the XHTML 1.0 Strict, Transitional, and Frameset DTDs. They are identical in function to the modular DTDs but have been normalized using James Clark's spam application, part of the SP toolkit.



Appendix B. Document Model Changes (Non-Normative)

B.1 Element Type Content Models

The following element type content model changes have been made in transforming HTML 4.0 to XML:


B.2 Element Type Attributes

The following attribute changes have been made in transforming HTML 4.0 to XML:

Element Type   HTML 3.2   HTML 4.0   XHTML 1.0   Voyager
HTML           -          -          xmlns       xmlns
HEAD           -          profile    -           profile
PRE            -          -          xml:space   xml:space

Additionally, the %i18n; attribute class has been augmented by the xml:lang attribute, affecting all element types that include this parameter entity in their attribute definition list declarations.
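
In DTD terms, changes of this kind might be declared along the following lines. This is a sketch under stated assumptions: the namespace URI is a placeholder, the choice of #FIXED for xmlns and of a "preserve" default for xml:space is assumed rather than taken from the XHTML modules, and the attribute types are simplified.

    <!-- placeholder namespace URI, not the one used by the DTD -->
    <!ATTLIST html xmlns     CDATA  #FIXED "http://www.example.org/ns/placeholder" >

    <!-- XML requires xml:space to be an enumerated type -->
    <!ATTLIST pre  xml:space (default | preserve)  "preserve" >

    <!-- i18n class with xml:lang alongside the HTML 4.0 attributes -->
    <!ENTITY % i18n
        "lang      NMTOKEN      #IMPLIED
         xml:lang  NMTOKEN      #IMPLIED
         dir       (ltr | rtl)  #IMPLIED" >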

B.3 Other Changes

Other changes made in transforming HTML 4.0 to XML:




Appendix C. References

C.1 Normative References

Extensible Markup Language (XML) 1.0: W3C Recommendation, Tim Bray, Jean Paoli, C. M. Sperberg-McQueen, 10 February 1998.
HTML 4.0 Specification: W3C Recommendation, Dave Raggett, Arnaud Le Hors, Ian Jacobs, 24 April 1998.
Information Processing -- Text and Office Systems -- Standard Generalized Markup Language (SGML), ISO 8879:1986.
Please consult for information about the standard, or about SGML.
Mathematical Markup Language (MathML) 1.0 Specification: W3C Recommendation, Stephen Buswell, Stan Devitt et al, 7 April 1998.

C.2 Other References

[remove unused references upon completion of document...]

Entity Management: OASIS Technical Resolution 9401:1997 (Amendment 2 to TR 9401) Paul Grosso, Chair, Entity Management Subcommittee, SGML Open, 10 September 1997.
Developing SGML DTDs: From Text to Model to Markup, Eve Maler and Jeanne El Andaloussi.
Prentice Hall PTR, 1996, ISBN 0-13-309881-8.
Structuring XML Documents, David Megginson. Part of the Charles Goldfarb Series on Information Management.
Prentice Hall PTR, 1998, ISBN 0-13-642299-3.
Comparison of SGML and XML: W3C Note, James Clark, 15 December 1997.
Reformulating HTML in XML: W3C Working Draft, D. Raggett, F. Boumphrey, M. Altheim, T. Wugofski, 24 November 1998.
XML Linking Language (XLink): W3C Working Draft, Eve Maler and Steve DeRose, 3 March 1998.
A new XLink requirements document is expected soon, followed by a working draft update.
DocBook DTD, Eve Maler and Terry Allen.
Originally created under the auspices of the Davenport Group, DocBook is now maintained by OASIS. The Customizer's Guide for the DocBook DTD V2.4.1 is available from this site.
The Dublin Core: A Simple Content Description Model for Electronic Resources, The Dublin Core Metadata Initiative.
HTML 3.2 Reference Specification: W3C Recommendation, Dave Raggett, 14 January 1997.
ISO/IEC 15445:1998 HyperText Markup Language (HTML), David M. Abrahamson and Roger Price.
Resource Description Framework (RDF): Model and Syntax Specification, Ora Lassila and Ralph R. Swick, 19 August 1998.
Cascading Style Sheets, level 2 (CSS2) Specification, Bert Bos, Hakon Wium Lie, Chris Lilley, Ian Jacobs, 12 May 1998.
Composite Capability/Preference Profiles (CC/PP): A user side framework for content negotiation, Franklin Reynolds, Johan Hjelm, Spencer Dawkins, Sandeep Singhal.
Synchronized Multimedia Integration Language (SMIL) 1.0 Specification, Philipp Hoschka, 15 June 1998.
The Text Encoding Initiative (TEI), (TBD)
Uniform Resource Identifiers (URI): Generic Syntax, T. Berners-Lee, R. Fielding, L. Masinter, August 1998.
This RFC updates RFC 1738 [URL] and RFC 1808 [RFC1808].
IETF RFC 1738, Uniform Resource Locators (URL), T. Berners-Lee, L. Masinter, M. McCahill.
Relative Uniform Resource Locators, R. Fielding.


Appendix D. Acknowledgements (Non-Normative)

The following have contributed to this document:



Appendix E. Revisions (Temporary)

Revisions to this draft:

1999-01-29 Restructured some block and inline element types: created new modules for 'b.1 block structural' and 'c.1 inline structural', renumbering in the draft and DTD comments accordingly (see DTD modules for specifics).
1999-02-01 Implemented changes required for HTML 4.0 errata. Fixed some bugs in content models and finished most testing of Strict and Transitional DTDs. Added minor notes about normalization.