org.ceryle.util
Class DocumentWordTokenizer

java.lang.Object
  extended by org.ceryle.util.DocumentWordTokenizer

public class DocumentWordTokenizer
extends Object

This class tokenizes the text of a Swing Document model and keeps a running word count and sentence count. It is designed for Western natural-language documents and is otherwise not Locale-aware. It does not handle hyphenated words correctly.

This class also provides two utility methods, altWordCount(String) and altTokenCount(String,boolean), for obtaining alternative counts, and a static method, displayStatisticsFor(Document,String,String,int,int), that pops up a dialog summarizing Document statistics.

Since:
JDK1.3
Version:
$Id: DocumentWordTokenizer.java,v 3.5 2007-06-15 12:09:55 altheim Exp $
Author:
Murray Altheim

Constructor Summary
DocumentWordTokenizer(Document document)
          Constructs a tokenizer for the provided Document.
 
Method Summary
 int[] altTokenCount(String text, boolean stats)
          Uses a StringTokenizer to traverse the provided text, returning a token count.
 int altWordCount(String text)
          Uses a BreakIterator to traverse the provided text, returning a word count.
static void displayStatisticsFor(Document document, String description, String note, int revisionCount, int recordLength)
          A static utility method that displays a dialog providing statistics for the provided javax.swing.text.Document.
 int getSentenceCount()
          Returns the current number of sentences that have been processed.
 int getWordCount()
          Returns the current number of words that have been processed.
 boolean hasMoreWords()
          Returns true if there are more words that can be processed in the document text.
 boolean isNewSentence()
          Returns true if the current word is at the beginning of a sentence.
 String nextWord()
          Returns the next word in the text.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

DocumentWordTokenizer

public DocumentWordTokenizer(Document document)
Constructs a tokenizer for the provided Document.
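
The constructor takes a javax.swing.text.Document. As a minimal sketch of preparing such an argument, the following builds a PlainDocument from a String and reads its text back. The class DocumentTextSketch and its methods documentOf and textOf are hypothetical helpers written for this illustration; they are not part of the Ceryle API.

```java
import javax.swing.text.BadLocationException;
import javax.swing.text.Document;
import javax.swing.text.PlainDocument;

// Hypothetical helper, not part of org.ceryle.util.
public class DocumentTextSketch {

    // Wraps a String in a Swing PlainDocument.
    static Document documentOf(String text) {
        Document doc = new PlainDocument();
        try {
            doc.insertString(0, text, null); // null = no character attributes
        } catch (BadLocationException e) {
            // Offset 0 is always valid for an empty document.
            throw new IllegalStateException(e);
        }
        return doc;
    }

    // Reads the full text content back out of a Document.
    static String textOf(Document doc) {
        try {
            return doc.getText(0, doc.getLength());
        } catch (BadLocationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = documentOf("Hello world. This is a test.");
        System.out.println(textOf(doc));
    }
}
```

A Document built this way could then be passed to new DocumentWordTokenizer(document), assuming the Ceryle library is on the classpath.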

Method Detail

getWordCount

public int getWordCount()
Returns the current number of words that have been processed.


getSentenceCount

public int getSentenceCount()
Returns the current number of sentences that have been processed.
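
This page does not say how the class detects sentence boundaries. For comparison only, here is a sketch of one standard way to count sentences in plain text, using java.text.BreakIterator's sentence instance. The class SentenceCountSketch, the method countSentences, and the choice of Locale.ENGLISH are assumptions made for this illustration and may not match the Ceryle implementation.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical comparison sketch, not the Ceryle implementation.
public class SentenceCountSketch {

    // Counts the non-blank segments between sentence boundaries.
    static int countSentences(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);
        int count = 0;
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            if (!text.substring(start, end).trim().isEmpty()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countSentences("Hello world. How are you? Fine."));
    }
}
```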


hasMoreWords

public boolean hasMoreWords()
Returns true if there are more words that can be processed in the document text.


isNewSentence

public boolean isNewSentence()
Returns true if the current word is at the beginning of a sentence.


nextWord

public String nextWord()
Returns the next word in the text.


altTokenCount

public int[] altTokenCount(String text,
                           boolean stats)
Uses a StringTokenizer to traverse the provided text, returning a token count. The count is returned at index 0 of an int array. If stats is true, index 1 holds the statistical mean; if false, index 1 is -1.
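
The return convention described above can be sketched as follows. This is not the Ceryle source: the class TokenCountSketch and the method tokenCount are hypothetical, the default whitespace delimiters are assumed, and the "statistical mean" is assumed here to be the mean token length rounded to the nearest int, which the documentation does not confirm.

```java
import java.util.StringTokenizer;

// Hypothetical sketch of the documented return convention.
public class TokenCountSketch {

    // Returns { tokenCount, mean }, where mean is -1 when stats is false.
    // ASSUMPTION: "mean" is the average token length, rounded to an int.
    static int[] tokenCount(String text, boolean stats) {
        StringTokenizer st = new StringTokenizer(text); // splits on whitespace
        int count = 0;
        long totalLength = 0;
        while (st.hasMoreTokens()) {
            totalLength += st.nextToken().length();
            count++;
        }
        int mean = -1;
        if (stats) {
            mean = (count == 0) ? 0 : (int) Math.round((double) totalLength / count);
        }
        return new int[] { count, mean };
    }

    public static void main(String[] args) {
        int[] result = tokenCount("one two three", true);
        System.out.println(result[0] + " tokens, mean " + result[1]);
    }
}
```

Returning both values in one int array mirrors the documented signature; callers that do not need the second value can pass false and ignore index 1.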


altWordCount

public int altWordCount(String text)
Uses a BreakIterator to traverse the provided text, returning a word count.
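
A minimal sketch of counting words with java.text.BreakIterator, the approach named above. The class BreakIteratorWordCount, the method countWords, the use of Locale.ENGLISH, and the rule that a segment counts as a word only if it contains a letter or digit are assumptions made for this illustration; the actual Ceryle implementation may differ.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical sketch, not the Ceryle implementation.
public class BreakIteratorWordCount {

    // Counts boundary-delimited segments that contain a letter or digit,
    // so that whitespace and punctuation segments are skipped.
    static int countWords(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ENGLISH);
        it.setText(text);
        int count = 0;
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            if (containsLetterOrDigit(text, start, end)) {
                count++;
            }
        }
        return count;
    }

    private static boolean containsLetterOrDigit(String text, int start, int end) {
        for (int i = start; i < end; i++) {
            if (Character.isLetterOrDigit(text.charAt(i))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(countWords("The quick brown fox."));
    }
}
```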


displayStatisticsFor

public static void displayStatisticsFor(Document document,
                                        String description,
                                        String note,
                                        int revisionCount,
                                        int recordLength)
A static utility method that displays a dialog providing statistics for the provided javax.swing.text.Document. The document description and suffixed note are optional.



The Ceryle Project. Copyright ©2001-2007 Murray Altheim, All Rights Reserved. See LICENSE included with distribution.