org.ceryle.xml
Class Sniffer

java.lang.Object
  extended by org.ceryle.xml.Sniffer

public class Sniffer
extends Object

Determines the MIME type of a document by "sniffing" the beginning of the file. Rather than providing a large number of set/get methods, this class exposes its results as public member variables, which are reset prior to each sniff.

This uses the MIME types found in MIME.

Notes

Note 1: The rules followed here regarding XHTML are not strictly correct: this class does not require a DOCTYPE to classify a document as XHTML, only that it be well-formed and declare the XHTML namespace. If strict conformance is desired, set the static variable STRICT_XHTML to true.

Note 2: Because the heuristic for determining file type is somewhat prone to error, this will not always return the correct result. In cases where it doesn't, the document type may be some sort of valid or invalid hybrid. Despite the years and significant time spent arguing about these issues, the rules for this sort of thing have yet to be satisfactorily standardized. This has only gotten worse given the proliferation of XML markup languages with no reasonable architecture for interoperability and interchange.

Since:
JDK1.3
Version:
$Id: Sniffer.java,v 3.11 2007-06-20 01:28:39 altheim Exp $
Author:
Murray Altheim
See Also:
MIME

Field Summary
 boolean claimsHTML
          A boolean indicating the sniffed document claims it is HTML, by having an <html> document element.
 boolean claimsXHTML
          A boolean indicating the sniffed document claims to be some form of XHTML, by containing a combination of factors (noting STRICT_XHTML).
 boolean claimsXHTML1F
          A boolean indicating the sniffed document claims it is XHTML Frameset, by including its public identifier in its DOCTYPE declaration.
 boolean claimsXHTML1S
          A boolean indicating the sniffed document claims it is XHTML Strict, by including its public identifier in its DOCTYPE declaration.
 boolean claimsXHTML1T
          A boolean indicating the sniffed document claims it is XHTML Transitional, by including its public identifier in its DOCTYPE declaration.
 boolean claimsXHTMLns
          A boolean indicating the sniffed document claims it is XHTML, by containing a declaration for the XHTML namespace.
 boolean claimsXML
          A boolean indicating the sniffed document claims it is XML, by containing an XML declaration.
 boolean claimsXTM
          A boolean indicating the sniffed document claims it is XTM, by containing a <topicMap> document element.
 boolean claimsXTMns
          A boolean indicating the sniffed document claims it is XTM, by containing a declaration for the XTM namespace.
 boolean m_verbose
          Message verbosity: set false for no messages while sniffing.
static int sniffLength
          The number of characters to sniff, following the XML parsing.
static boolean STRICT_XHTML
          A boolean flag that when true requires XHTML documents to not only be well-formed XML and declare the XHTML namespace, but contain a recognized DOCTYPE declaration.
 boolean valid
          A boolean indicating the sniffed document is valid XML, based upon a parse of its content.
 boolean wellFormed
          A boolean indicating the sniffed document is well-formed XML, based upon a parse of its content.
 
Constructor Summary
Sniffer()
          Default constructor.
 
Method Summary
 String getDescription()
          Returns a text description of the status of the last sniff.
 String getHTMLTitle()
          If the previous sniff indicated HTML, then the Java text Document is still available.
 int getMethod()
          Returns an int indicating the serialization method of the last sniff, using the org.ceryle.xml.XMLUtils constants.
 MIME getMIMEtype()
          Returns the MIME type of the last sniff. This returns null prior to that point.
 Set getXHTMLMetadata()
          If the previous sniff indicated XHTML, then the DOM Document is still available.
 String getXHTMLTitle()
          If the previous sniff indicated XHTML, then the DOM Document is still available.
static boolean hasWikiTag(String s)
          Returns true if the String s starts with the wiki tag.
static String head(Document doc)
          Returns the first part of the Document doc, up to sniffLength characters, or less if the Document isn't that long.
static String head(String s)
          Returns the first part of the String s, up to sniffLength characters, or less if the String isn't that long.
 boolean isReset()
          Returns true if this Sniffer has been reset or has never been used (i.e., its nose is clean).
 String sniff(Document doc)
          Sniffs the media (MIME) type of the provided javax.swing.text.Document doc, setting the type and description, as well as any appropriate booleans.
 String sniff(File file)
          Sniffs the media (MIME) type of the provided File file, setting the type and description, as well as any appropriate booleans.
 String sniff(String content)
          Sniffs the media (MIME) type of the provided String content, setting the type and description, as well as any appropriate booleans.
 boolean sniffLTM(String s)
          Returns true if the percentage of left and right square brackets in the provided text passes a certain threshold combined with the presence of some known key strings.
 boolean sniffWiki(String s)
          Returns true if the provided content matches a regex for either the wiki tag or a square-bracketed wiki link of the form "[abc|abc]".
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

STRICT_XHTML

public static boolean STRICT_XHTML
A boolean flag that when true requires XHTML documents to not only be well-formed XML and declare the XHTML namespace, but contain a recognized DOCTYPE declaration. This could have been determined by validity, except that the XHTML 1.0 Recommendation does not require that, having its own nonstandard conformance definition.
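
The rule that this flag toggles can be sketched as a small predicate. The class and method names below are hypothetical and are not part of Sniffer; the sketch only restates the conditions described above and assumes the three inputs have already been determined by a sniff.

```java
// Hypothetical helper, not part of org.ceryle.xml.Sniffer.
final class XhtmlRuleSketch {

    /**
     * Lenient mode: well-formed XML that declares the XHTML namespace.
     * Strict mode: the same, plus a recognized DOCTYPE declaration.
     */
    static boolean claimsXhtml(boolean wellFormed,
                               boolean declaresXhtmlNamespace,
                               boolean hasRecognizedDoctype,
                               boolean strict) {
        if (!wellFormed || !declaresXhtmlNamespace) {
            return false;
        }
        // The DOCTYPE only matters when strict conformance is requested.
        return !strict || hasRecognizedDoctype;
    }
}
```

With strict set to false, a well-formed document that declares the namespace but has no DOCTYPE is accepted; with strict set to true, the same document is rejected.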


sniffLength

public static int sniffLength
The number of characters to sniff, following the XML parsing.


claimsXML

public boolean claimsXML
A boolean indicating the sniffed document claims it is XML, by containing an XML declaration.


claimsXTM

public boolean claimsXTM
A boolean indicating the sniffed document claims it is XTM, by containing a <topicMap> document element.


claimsXTMns

public boolean claimsXTMns
A boolean indicating the sniffed document claims it is XTM, by containing a declaration for the XTM namespace.


claimsHTML

public boolean claimsHTML
A boolean indicating the sniffed document claims it is HTML, by having an <html> document element.


claimsXHTML

public boolean claimsXHTML
A boolean indicating the sniffed document claims to be some form of XHTML, by containing a combination of factors (noting STRICT_XHTML).


claimsXHTMLns

public boolean claimsXHTMLns
A boolean indicating the sniffed document claims it is XHTML, by containing a declaration for the XHTML namespace.


claimsXHTML1S

public boolean claimsXHTML1S
A boolean indicating the sniffed document claims it is XHTML Strict, by including its public identifier in its DOCTYPE declaration.


claimsXHTML1T

public boolean claimsXHTML1T
A boolean indicating the sniffed document claims it is XHTML Transitional, by including its public identifier in its DOCTYPE declaration.


claimsXHTML1F

public boolean claimsXHTML1F
A boolean indicating the sniffed document claims it is XHTML Frameset, by including its public identifier in its DOCTYPE declaration.
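
The three DOCTYPE-based fields (claimsXHTML1S, claimsXHTML1T, claimsXHTML1F) each look for a public identifier. A minimal sketch of such a check follows, using the public identifiers defined by the XHTML 1.0 Recommendation. The class, method, and return values are hypothetical, and the simple substring search is an assumption; how Sniffer itself locates the identifier is not documented here.

```java
// Hypothetical helper, not part of org.ceryle.xml.Sniffer.
final class XhtmlDoctypeSketch {

    // Public identifiers from the XHTML 1.0 Recommendation.
    static final String STRICT = "-//W3C//DTD XHTML 1.0 Strict//EN";
    static final String TRANSITIONAL = "-//W3C//DTD XHTML 1.0 Transitional//EN";
    static final String FRAMESET = "-//W3C//DTD XHTML 1.0 Frameset//EN";

    /**
     * Returns "strict", "transitional", or "frameset" when the DOCTYPE
     * declaration in the given text carries a known public identifier,
     * or null otherwise.
     */
    static String flavor(String head) {
        int start = head.indexOf("<!DOCTYPE");
        if (start < 0) {
            return null; // no DOCTYPE declaration at all
        }
        // Only look inside the declaration itself, up to its first '>'.
        int end = head.indexOf('>', start);
        String decl = (end < 0) ? head.substring(start) : head.substring(start, end);
        if (decl.contains(STRICT)) return "strict";
        if (decl.contains(TRANSITIONAL)) return "transitional";
        if (decl.contains(FRAMESET)) return "frameset";
        return null;
    }
}
```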


wellFormed

public boolean wellFormed
A boolean indicating the sniffed document is well-formed XML, based upon a parse of its content.
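
A well-formedness check "based upon a parse" can be sketched with the JAXP parser that ships with the JDK. This is an illustration only: the class and method names are hypothetical, and which parser Sniffer actually uses is not stated in this documentation.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of org.ceryle.xml.Sniffer.
final class WellFormedSketch {

    /** Returns true if the content parses as XML without a fatal error. */
    static boolean isWellFormed(String content) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            // DefaultHandler rethrows fatal errors and ignores warnings,
            // which keeps the parser from printing to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(content)));
            return true;
        } catch (SAXException | IOException | ParserConfigurationException e) {
            return false;
        }
    }
}
```

Note that a document with a DOCTYPE pointing at an external DTD may cause the parser to fetch that DTD; the sketch does not address this.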


valid

public boolean valid
A boolean indicating the sniffed document is valid XML, based upon a parse of its content. All valid XML documents are by definition also well-formed, though the converse is not always true.


m_verbose

public boolean m_verbose
Message verbosity: set false for no messages while sniffing. Default is false.

Constructor Detail

Sniffer

public Sniffer()
Default constructor.

Method Detail

getDescription

public String getDescription()
Returns a text description of the status of the last sniff. This returns null prior to that point.


getMethod

public int getMethod()
Returns an int indicating the serialization method of the last sniff, using the org.ceryle.xml.XMLUtils constants. This returns -1 prior to that point. This is similar to an indication of document type, though from a serialization context.


getMIMEtype

public MIME getMIMEtype()
Returns the MIME type of the last sniff. This returns null prior to that point.


sniff

public String sniff(File file)
Sniffs the media (MIME) type of the provided File file, setting the type and description, as well as any appropriate booleans. Returns the sniffed type as a String, null if it cannot be determined or there is an error.


sniff

public String sniff(Document doc)
Sniffs the media (MIME) type of the provided javax.swing.text.Document doc, setting the type and description, as well as any appropriate booleans. Returns the sniffed type as a String, null if it cannot be determined or there is an error.


sniff

public String sniff(String content)
Sniffs the media (MIME) type of the provided String content, setting the type and description, as well as any appropriate booleans. Returns the sniffed type as a String, null if it cannot be determined or there is an error.


sniffWiki

public boolean sniffWiki(String s)
Returns true if the provided content matches a regex for either the wiki tag or a square-bracketed wiki link of the form "[abc|abc]". The wiki tag value is "#!wiki ".
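
A regular expression of the kind described might look like the following sketch. The pattern is an assumption written for illustration (the actual expression used by Sniffer is not given here), and the class and method names are hypothetical. It accepts the wiki tag at the start of the text, or a link anywhere in the text whose two parts are non-empty and contain no brackets or bars.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of org.ceryle.xml.Sniffer.
final class WikiSniffSketch {

    // The wiki tag value given in the documentation, including the trailing space.
    static final String WIKI_TAG = "#!wiki ";

    // Assumed pattern: tag at start of text, OR a "[label|target]" link anywhere.
    private static final Pattern WIKI = Pattern.compile(
            "^" + Pattern.quote(WIKI_TAG) + "|\\[[^\\[\\]|]+\\|[^\\[\\]|]+\\]");

    static boolean looksLikeWiki(String s) {
        return WIKI.matcher(s).find();
    }
}
```

Text such as "see [Home|HomePage]" matches through the link alternative, while "plain [text] only" does not, since its brackets contain no bar.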


hasWikiTag

public static boolean hasWikiTag(String s)
Returns true if the String s starts with the wiki tag. The wiki tag value is "#!wiki ".


sniffLTM

public boolean sniffLTM(String s)
Returns true if the percentage of left and right square brackets in the provided text passes a certain threshold combined with the presence of some known key strings.

Note: This might also indicate positive when provided with wiki text (which uses square brackets for links), but wiki text stored within Ceryle is expected to have the wiki declaration. Also, this class sniffs for wiki text prior to sniffing for LTM.
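
The heuristic described above (bracket density plus key strings) can be sketched as follows. The threshold value and the list of key strings are assumptions chosen for illustration; the values used by Sniffer are not documented here. The key strings shown are LTM directives, and the class and method names are hypothetical.

```java
// Hypothetical helper, not part of org.ceryle.xml.Sniffer.
final class LtmSniffSketch {

    // Assumption: minimum fraction of characters that must be '[' or ']'.
    static final double BRACKET_THRESHOLD = 0.02;

    // Assumption: LTM directives used as the "known key strings".
    static final String[] KEY_STRINGS = { "#TOPICMAP", "#VERSION", "#INCLUDE", "#MERGEMAP" };

    /** Fraction of characters in s that are square brackets (0.0 for empty text). */
    static double bracketRatio(String s) {
        if (s.length() == 0) {
            return 0.0;
        }
        int brackets = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '[' || c == ']') {
                brackets++;
            }
        }
        return (double) brackets / s.length();
    }

    /** True when the bracket ratio meets the threshold AND a key string is present. */
    static boolean looksLikeLtm(String s) {
        if (bracketRatio(s) < BRACKET_THRESHOLD) {
            return false;
        }
        for (String key : KEY_STRINGS) {
            if (s.contains(key)) {
                return true;
            }
        }
        return false;
    }
}
```

Requiring both conditions is what keeps bracket-heavy text without any LTM directive from being reported as LTM.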


getXHTMLMetadata

public Set getXHTMLMetadata()
If the previous sniff indicated XHTML, then the DOM Document is still available. Calling this method returns a Set of three-element String arrays containing the attribute values of the name, content, and scheme attributes (in that order), extracted from the <meta> DOM Elements. If the Document is not available or contains no metadata, this returns an empty Set, not null. If a particular <meta> element does not contain both a name and content attribute, it is ignored. If the scheme attribute is unspecified, the third array element will be null.

Note that calling sniff again while this method is active may be problematic.

Despite this operating on (at least in theory) XHTML, this method ignores case on the <meta> element as well as the attribute names, erring on the 'safe' side in trying to capture any metadata in the document.
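
The extraction rules above can be illustrated with the standard DOM API. The sketch below is not the Sniffer implementation: the class, the parse helper, and the method names are hypothetical, and it uses generics where the documented signature returns a raw Set.

```java
import java.io.StringReader;
import java.util.LinkedHashSet;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of org.ceryle.xml.Sniffer.
final class MetaSketch {

    /** Parses well-formed markup into a DOM Document (helper for trying the sketch). */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException(e);
        }
    }

    /** Case-insensitive attribute lookup; returns null when the attribute is absent. */
    static String attr(Element e, String name) {
        NamedNodeMap attrs = e.getAttributes();
        for (int i = 0; i < attrs.getLength(); i++) {
            Node a = attrs.item(i);
            if (a.getNodeName().equalsIgnoreCase(name)) {
                return a.getNodeValue();
            }
        }
        return null;
    }

    /** Collects {name, content, scheme} arrays; never returns null. */
    static Set<String[]> metadata(Document doc) {
        Set<String[]> result = new LinkedHashSet<String[]>();
        if (doc == null) {
            return result;
        }
        NodeList all = doc.getElementsByTagName("*"); // every element, any case
        for (int i = 0; i < all.getLength(); i++) {
            Element e = (Element) all.item(i);
            if (!e.getTagName().equalsIgnoreCase("meta")) {
                continue;
            }
            String name = attr(e, "name");
            String content = attr(e, "content");
            if (name == null || content == null) {
                continue; // both attributes are required
            }
            result.add(new String[] { name, content, attr(e, "scheme") });
        }
        return result;
    }
}
```

Because arrays compare by identity, two <meta> elements with identical values produce two entries in the Set in this sketch.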


getXHTMLTitle

public String getXHTMLTitle()
If the previous sniff indicated XHTML, then the DOM Document is still available. Calling this method returns the contents of the first encountered <title> element having some character data content. A warning is produced if there is more than one such element. Returns null if unable to provide a result.
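
The "first <title> element having some character data" rule can be sketched with the DOM API as below. The names are hypothetical, and two details are assumptions: whitespace-only titles are treated as empty, and the returned text is trimmed. The sketch also omits the warning for multiple titles.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of org.ceryle.xml.Sniffer.
final class TitleSketch {

    /** Parses well-formed markup into a DOM Document (helper for trying the sketch). */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException(e);
        }
    }

    /** Returns the text of the first non-empty <title>, or null if there is none. */
    static String firstTitle(Document doc) {
        if (doc == null) {
            return null;
        }
        NodeList titles = doc.getElementsByTagName("title");
        for (int i = 0; i < titles.getLength(); i++) {
            String text = titles.item(i).getTextContent();
            if (text != null && text.trim().length() > 0) {
                return text.trim(); // assumption: surrounding whitespace is dropped
            }
        }
        return null;
    }
}
```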


getHTMLTitle

public String getHTMLTitle()
If the previous sniff indicated HTML, then the Java text Document is still available. Calling this method returns the contents of the first encountered <title> element having some character data content. Returns null if unable to provide a result.


isReset

public boolean isReset()
Returns true if this Sniffer has been reset or has never been used (i.e., its nose is clean).


head

public static String head(String s)
Returns the first part of the String s, up to sniffLength characters, or less if the String isn't that long. Throws a NullPointerException if the parameter is null.
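
The documented behavior amounts to a bounded substring, as in this sketch. The class name is hypothetical, and the value 1024 is a placeholder only, since the default of sniffLength is not stated in this documentation.

```java
// Hypothetical helper, not part of org.ceryle.xml.Sniffer.
final class HeadSketch {

    // Assumption: placeholder value; the real default is not documented here.
    static int sniffLength = 1024;

    /** Returns at most sniffLength leading characters of s. */
    static String head(String s) {
        // s.length() throws NullPointerException for a null argument,
        // which matches the documented behavior.
        return (s.length() <= sniffLength) ? s : s.substring(0, sniffLength);
    }
}
```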


head

public static String head(Document doc)
                   throws BadLocationException
Returns the first part of the Document doc, up to sniffLength characters, or less if the Document isn't that long. Throws a NullPointerException if the parameter is null.

Throws:
BadLocationException - if unable to extract the Document's content
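
For a text Document, the same bounded read can be sketched with getText. This assumes the parameter type is javax.swing.text.Document, which the BadLocationException in the signature suggests. The class name, the documentOf helper, and the sniffLength value are hypothetical.

```java
import javax.swing.text.BadLocationException;
import javax.swing.text.Document;
import javax.swing.text.PlainDocument;

// Hypothetical helper, not part of org.ceryle.xml.Sniffer.
final class DocumentHeadSketch {

    // Assumption: placeholder value; the real default is not documented here.
    static int sniffLength = 1024;

    /** Returns at most sniffLength leading characters of the Document's text. */
    static String head(Document doc) throws BadLocationException {
        int length = Math.min(doc.getLength(), sniffLength);
        return doc.getText(0, length);
    }

    /** Wraps a String in a PlainDocument (helper for trying the sketch). */
    static Document documentOf(String text) {
        PlainDocument doc = new PlainDocument();
        try {
            doc.insertString(0, text, null);
        } catch (BadLocationException e) {
            // Offset 0 is always valid for a new, empty document.
            throw new IllegalStateException(e);
        }
        return doc;
    }
}
```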


The Ceryle Project. Copyright ©2001-2007 Murray Altheim, All Rights Reserved. See LICENSE included with distribution.