Augmented Plain Text (APT) Version 1.0

Neocortext.Net Note 17 May 2002

Neocortext.Net

Locator:
http://www.altheim.com/specs/NOTE-apt1.html
Author:
Murray Altheim  <m.altheim@open.ac.uk>
Revision:
$Id: NOTE-apt1.html,v 1.1 2002/05/17 15:31:58 altheim Exp $

Abstract

The Augmented Plain Text (APT) specification is a design for a simple set of keyword tokens that, when added to a plain text document, enable an APT processor to autogenerate valid XHTML documents. This can be used for authoring or for Web conversion of existing plain text sources.

Status of this Document

This document is intended for review and comment by interested parties. It is a “work in progress,” currently has no formal status, and its publication should not be construed as endorsement by any corporate or academic body. This document may be updated, replaced, rendered obsolete by other documents, or removed from circulation at any time. It is inappropriate to use this document as reference material, or cite it as anything other than a “work in progress.” Distribution of this document is unlimited.

Introduction to Augmented Plain Text (APT)

The 'APT' notation is a simple, augmented notation designed to simplify creation of XHTML documents from existing text sources such as email messages. Most current editing software has some facility to generate plain text output, with some claiming to generate HTML. Unfortunately, the "HTML" generated by nearly every known product is far from adhering to any known HTML specification. MS Word is the worst example: its output is so obtuse and convoluted that a slew of translators have been written to turn its "HTML" into something akin to HTML, though in many cases not without substantial loss of content. If you have some familiarity with HTML, I suggest you look at MS Word's "HTML". It is really unbelievable, especially considering that HTML is generally a pretty simple syntax.

APT was designed to fill a different niche: those wishing to author in plain text, those who have existing text sources, or those who'd like a valid XHTML document with an autogenerated table of contents and hierarchically numbered sections from what is ostensibly a plain text document (with a few simple codes added). Yes, APT is simple. It doesn't have every known Web feature, doesn't create JavaScript buttons or fry your bacon for you. It does simplify web authoring for those who think web authoring should be simple and straightforward. You can spend some time playing with the CSS stylesheet if you really want your output to look different from the default, or take the output document as input for further processing (perhaps adding your own JavaScript buttons and bacon).

An APT document looks something like this:

    #APT V1.0
    #AUTHOR Tim Bunwich
    #TITLE Not a Normal Day at the Park

    Today I went to
    http://centralpark.org/home.html #LINK Central Park 
    to feed some squirrels, a thing I do most everyday. Well, for some reason 
    the squirrels seemed agitated, a bit put off at simply accepting the nuts 
    I was handing out.
    
    I was at the point of positioning a piece of pecan in front of this big 
    black squirrel's nose when suddenly he ran up my arm and stood on the top 
    of my head, then began squealing wildly. I froze in place, not knowing 
    quite what to do. These little buggers have very sharp claws and teeth, 
    and me with no hair... well, I was afraid for my scalp.

    All around me were squirrels, all just a few feet from my Berkenstock'd
    toes. I knew my life was about to change for the worse.

Pretty simple?

Syntax

APT uses keywords that start with a hash symbol (e.g., #AUTHOR) that occur in column 1 (i.e., the beginning of a line) to denote an APT statement. APT parsers should ignore keywords occurring elsewhere, or unknown keywords (perhaps emitting a warning when this occurs), so that other instances of hash characters followed by unknown tokens have no effect (other than to be replicated in the output file).
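
The column-one rule above can be sketched in a few lines of Python. This is only an illustration, not part of the specification: the names `KNOWN_KEYWORDS`, `STATEMENT` and `classify_line` are hypothetical, and the sketch assumes keywords are written in uppercase as in the lists later in this note.

```python
import re

# Keywords taken from the Level 1-3 lists in this note.
KNOWN_KEYWORDS = {
    "APT", "AUTHOR", "EMAIL", "TITLE", "SUB", "SELF", "COM", "SETHEAD",
    "HEAD", "DFN", "LI", "HR", "PRE", "LINK", "IMG", "FIG", "LOGO", "INCLUDE",
}

# '#', an uppercase token, then optional whitespace and content.
STATEMENT = re.compile(r"^#([A-Z]+)(?:[ \t]+(.*))?$")


def classify_line(line):
    """Return (keyword, content) for a statement, or (None, line) for text."""
    match = STATEMENT.match(line)  # match() only succeeds at column 1
    if match and match.group(1) in KNOWN_KEYWORDS:
        return match.group(1), (match.group(2) or "").strip()
    # Unknown or misplaced keywords pass through unchanged as plain text.
    return None, line
```

A real processor might also emit a warning in the second branch when the line starts with a hash character.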

The APT syntax is designed according to implementation levels, to allow for varying levels of support. Level 1 is quite simple, Level 2 provides general link support, and Level 3 provides inclusions and other features.

Level 3 APT processors may optionally preserve HTML markup occurring inline, but Level 1 and 2 processors should autogenerate the document title, headings, divisions and paragraphs, as well as a table of contents using heading titles. Future optional features will include autogeneration of hierarchical section numbers. Note that if necessary, a hash character can be escaped using its XML character entity equivalent ("&#35;"). A processor should be labeled according to its implementation level.

All APT statements start with an APT keyword beginning in column one and continue to the end of the line. Lines may be continued with a backslash onto the next line.
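
As a rough sketch of the continuation rule, the following Python function joins backslash-continued lines before any keyword handling takes place. The function name `join_continued_lines` is hypothetical, and the sketch assumes that whitespace before the backslash is kept as-is, which this note does not specify.

```python
def join_continued_lines(lines):
    """Merge each line ending in a backslash with the line that follows it."""
    joined = []
    pending = ""
    for line in lines:
        if line.endswith("\\"):
            # Drop the backslash and keep accumulating.
            pending += line[:-1]
        else:
            joined.append(pending + line)
            pending = ""
    if pending:
        # A continuation on the very last line has nothing left to join.
        joined.append(pending)
    return joined
```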

#LINK is a bit of a special case. APT processors will note the beginning of http: and ftp: URLs (scanning until the first whitespace or end-of-line) and autogenerate XHTML links where they exist, using the URL itself as the link text, unless the token following the URL is #LINK, in which case the link text is the content following #LINK to the end of that line.
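
One possible reading of the link rule, sketched in Python. The names `URL`, `anchor` and `autolink` are hypothetical, and the sketch assumes the line has not yet been character-escaped, so it escapes text itself using the standard library.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# http: and ftp: URLs run until the first whitespace or end-of-line.
URL = re.compile(r"(?:http|ftp):\S+")


def anchor(url, text):
    """Build an XHTML link with an escaped, quoted href."""
    return "<a href=%s>%s</a>" % (quoteattr(url), escape(text))


def autolink(line):
    """Replace the URLs in one line of text with XHTML links."""
    out = []
    pos = 0
    while True:
        match = URL.search(line, pos)
        if match is None:
            out.append(escape(line[pos:]))
            break
        out.append(escape(line[pos:match.start()]))
        url = match.group(0)
        tokens = line[match.end():].split(None, 1)
        if tokens and tokens[0] == "#LINK":
            # The link text is everything after #LINK to the end of the line.
            text = tokens[1].strip() if len(tokens) > 1 else url
            out.append(anchor(url, text))
            break
        out.append(anchor(url, url))
        pos = match.end()
    return "".join(out)
```

Applied to the Central Park line from the introduction, this yields a single link whose text is "Central Park".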

APT Level 1

The following keywords should be supported in all APT Level 1 processors:

#APT WS version
required APT header
#AUTHOR WS content
author name
#EMAIL WS content
author email
#TITLE WS content
document title
#SUB WS content
document subtitle
#SELF WS linkURL
URL reference to this document
#COM WS content
comment (i.e., as an XML comment)
#SETHEAD WS [1-6]
sets top level heading (default is 1)
#HEAD WS content
document heading
#DFN WS term "|" definition
definition list item
#LI WS content
list item (unordered)
#HR WS [percentage]
horizontal rule [optional width]
#PRE WS content
preformatted content

("WS" stands for whitespace: space or tab characters)
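
Most Level 1 statements carry a single run of content; #DFN is the one that splits its content in two. A minimal Python sketch of that split follows. The function names are hypothetical, and mapping the item onto <dt> and <dd> elements is an assumption, since this note only says "definition list item".

```python
from xml.sax.saxutils import escape


def parse_dfn(content):
    """Split '#DFN' content into (term, definition) at the first '|'."""
    term, _, definition = content.partition("|")
    return term.strip(), definition.strip()


def dfn_to_xhtml(content):
    """Render the item as a <dt>/<dd> pair (an assumed mapping)."""
    term, definition = parse_dfn(content)
    return "<dt>%s</dt><dd>%s</dd>" % (escape(term), escape(definition))
```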

APT Level 2

The following keywords should be supported in all APT Level 2 processors:

#LINK WS linkURL WS link text
a web link
#IMG WS imageURL WS alt text
an image
#FIG WS imageURL WS alt text
an image treated as a figure
#LOGO WS imageURL WS alt text
a logo image for top-of-page

APT Level 3

The following keywords should be supported in all APT Level 3 processors:

#INCLUDE WS linkURL
a transclusion (see note)

Additionally, Level 3 processors should allow for a subset of existing inline HTML markup to be normalized and preserved in output.

Examples:

    #APT V1.0
    #AUTHOR Igor Rostropovich
    #EMAIL igor@rostropovich.org
    #SETHEAD 2
    #HEAD It Takes a Virtual Village \
    To Make a Virtual Village Idiot

Blank lines between blocks of plain text automatically create paragraph breaks. The document will use #TITLE as both the document title and its first displayed <h1> heading, with the remaining headings being <h2> headings. The TOC is autogenerated from the headings.
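
The paragraph rule can be sketched as follows in Python. The function name `split_paragraphs` is hypothetical, and joining the lines of a paragraph with a single space is an assumption made for illustration.

```python
def split_paragraphs(lines):
    """Group consecutive non-blank lines into paragraphs."""
    paragraphs = []
    current = []
    for line in lines:
        if line.strip():
            current.append(line.strip())
        elif current:
            # A blank (or whitespace-only) line closes the open paragraph.
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs
```

Each returned string would then become the content of one <p> element.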

Transclusions: Note that filenames or URLs must be in double quotes. For example, to include an external file as an answer to a question:

     #DFN My first question is how many monkeys? |
     #INCLUDE "answer1.html"

NOTE: #INCLUDE and the use of backslashes to continue lines are currently unsupported.
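
Should transclusion be implemented, the recursive loops mentioned under Processing need guarding against. The Python sketch below is one way to do it, not a description of any existing processor: `expand_includes` and its `load` callback are hypothetical, and the sketch assumes the target is always given in double quotes.

```python
import re

# '#INCLUDE' in column 1, whitespace, then a double-quoted filename or URL.
INCLUDE = re.compile(r'#INCLUDE[ \t]+"([^"]+)"')


def expand_includes(name, load, _active=()):
    """Return the lines of `name` with #INCLUDE statements expanded in place.

    `load` maps a name to the text of that document. `_active` holds the
    chain of documents currently being expanded, used to detect loops.
    """
    if name in _active:
        chain = " -> ".join(_active + (name,))
        raise ValueError("recursive transclusion: " + chain)
    expanded = []
    for line in load(name).splitlines():
        match = INCLUDE.match(line)
        if match:
            target = match.group(1)
            expanded.extend(expand_includes(target, load, _active + (name,)))
        else:
            expanded.append(line)
    return expanded
```

Passing a dictionary lookup as `load` is enough to try this out; a real processor would read files or fetch URLs there.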

A sample of an APT source and the generated output from Ceryle:

Processing

An APT source document must be a text document. This source document undergoes a number of transformations, enumerated below:

  1. Transclusions
    When Level 3 transclusions are supported and the feature is active, all transclusions are processed at this time. Transcluded documents are expected to be APT documents, so special note should be made of possible recursive loops, should transcluded documents themselves transclude documents that transclude documents that transclude documents that... (e-gad)
  2. Character Escaping
    The document is then processed by "escaping" various characters so that they are "safe" in XHTML. This includes markup characters occurring in the text such as ampersands, as well as known non-ASCII characters supported in the XHTML character entity set. Unicode characters above the ASCII set ( > 127 ) not covered by these provisions are converted to numeric character references (e.g., &#3204;).
  3. Whitespace Handling
    Because an APT processor is searching for blank lines as paragraph delimiters, all lines consisting entirely of whitespace are converted into empty lines, and trailing whitespace at the end of each line is eliminated.
  4. Keyword Handling
    The file is processed line-by-line from the beginning, and APT keywords are processed according to their individual design. This generates an XML document (actually, a DOM Document in the case of Ceryle) that is populated by APT parsing events.
  5. Post Processing
    Once the source document has been parsed and the DOM Document generated, post processing includes optional generation of section numbers, generation of the table of contents, and creation of the default header and footer content, including a link to the default stylesheet.
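
Steps 2, 3 and the section numbering in step 5 are small enough to sketch in Python. All three function names are hypothetical. As a simplification, the escaping sketch converts every non-ASCII character to a numeric reference rather than using named XHTML entities where they exist, and the numbering sketch takes a plain list of heading levels as its input.

```python
from xml.sax.saxutils import escape


def escape_text(text):
    """Escape markup characters, then replace non-ASCII characters."""
    escaped = escape(text)  # handles &, < and >
    return "".join(ch if ord(ch) < 128 else "&#%d;" % ord(ch) for ch in escaped)


def normalize_whitespace(lines):
    """Strip trailing whitespace; whitespace-only lines become empty lines."""
    return [line.rstrip() for line in lines]


def number_sections(levels):
    """Turn heading levels such as [1, 2, 2, 1] into ['1', '1.1', '1.2', '2']."""
    counters = []
    numbers = []
    for level in levels:
        while len(counters) < level:
            counters.append(0)   # entering a deeper level
        del counters[level:]     # leaving deeper levels resets them
        counters[level - 1] += 1
        numbers.append(".".join(str(n) for n in counters))
    return numbers
```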

Feedback

Feedback on the APT design can be sent to its author at <m.altheim@open.ac.uk>.
