Electric Forest

thoughts about books, digital libraries, and stuff related to expressing and keeping track of our thoughts...

Friday, April 22, 2005

Node size doesn't matter

If you look at what Michael has done with GODDAG and consider his background (with SGML, TEI, etc.) it makes perfect sense that his approach would be with things like CONCUR and Extended XPath [EXPath].

Combining ideas from GODDAG, or more properly, ideas from EXPath, provides a way of referring to points within a text, spans, and overlapping spans, by way of graphs with a modicum of document structure polyancestry. By coincidence, the paper I'm currently writing (which may never see the light of day) has an introductory section comparing the paradigms of linking vs. mapping. What Michael seems to be basically doing is mapping, and while I've not looked into what Graham had proposed, I'm guessing it was mapping a document.
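
To make "overlapping spans" and "polyancestry" a little more concrete, here's a minimal sketch in Python. It is not GODDAG or EXPath as Michael defines them; the names (Span, overlaps, ancestors_at), the sample text and the offsets are all made up for illustration. It only shows a single character of text sitting under two structures that don't nest inside each other.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """A labelled range over the text: start inclusive, end exclusive."""
    label: str
    start: int
    end: int

    def contains(self, offset):
        return self.start <= offset < self.end

    def overlaps(self, other):
        """True when two spans share text but neither nests inside the other."""
        shares_text = self.start < other.end and other.start < self.end
        nested = (self.start <= other.start and other.end <= self.end) or (
            other.start <= self.start and self.end <= other.end
        )
        return shares_text and not nested


def ancestors_at(spans, offset):
    """Labels of every span covering one character offset."""
    return [s.label for s in spans if s.contains(offset)]


text = "the quick brown fox jumps"
spans = [
    Span("line", 0, 15),     # hypothetical lineation: "the quick brown"
    Span("phrase", 10, 19),  # a phrase that runs past the line end: "brown fox"
]
```

A tree can't hold both of these spans at once, which is the whole reason a graph is needed: the character at offset 12 has both "line" and "phrase" as ancestors.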

Since Topic Maps are designed precisely for this purpose, and have their own inherent graph structure (as opposed to a tree structure), both a Topic Map document and the territory-document it is mapping can have a mirrored graph structure.

Now, NODAL might seem somewhat orthogonal to this, but only a bit. NODAL is similar to (and influenced by) the Reiser File System by Hans Reiser (now commonly used on Linux machines), which, rather than building a file system on a tree structure, uses node-level metadata to create a journaled graph system. It's very high performance, and I've been using it for three or four years now.

Consider that an implementation of NODAL (as described to me by Lee) effectively breaks down the barriers between documents within a file system, reducing them to what in the HyperText world are typically called "lexia" plus added metadata. If such a system were implemented across machine boundaries, one would effectively have an enormous grove; i.e., with the right tools the Web would become one big grove. I think that's what Lee's thinking, anyway. He might be interested in this conversation. And if one considers what ReiserFS and NODAL are doing, it's the micro level to what a digital library is doing at the macro level. If we ignore document boundaries, there's almost no difference conceptually.

Eliot Kimber (coauthor with Steve Newcomb of the ISO 10744 HyTime standard, which provides us with the concept of groves) has a simplified grove model for XML called XIndirect which, coupled with NODAL, provides almost everything:

Regarding other activities

I like what Patrick says. Back in the heyday of Doug Engelbart hanging out at SRI on Tuesday evenings following the UnRev II lectures, we got started talking about distributed documents, and Lee Iverson created something called DDom (distributed DOM). Nothing ever came of it, but GODDAG looks like it. Nice to see it out in the public domain.

You may recall that Graham Moore, back around 2000 at Extreme Markup, started a project called Grove4J, which was based on a rant I started there: it's time for a damn good grove implementation in Java. About all that's there is an API. The project seems to have died.

Lee Iverson took all that we understood about groves, authentication, permissions, etc., and started NODAL, which is under continuous development.
NODAL is designed as a general, document-oriented distributed database with a data model that allows addressing, searching and linking of content of any kind from any document. The data model defines documents as directed graphs of content nodes and provides adaptable addressing, security, privacy and version control at the granularity of these nodes. Moreover, it is built on a distributed client-server (or peer-to-peer) communication model that seamlessly shifts from synchronous, real-time interaction to asynchronous or intermittently-connected interaction. Finally, it is designed to extensibly support a wide range of input and output formats so that it will interoperate easily with systems using existing standard document formats and exchange protocols, including even applications unaware of its existence. It is hoped that this simple system will become a standard, universal component of the infrastructure of information management and exchange and thus allow for flexible, productive collaboration between willing people for any purpose, anywhere, using any tools.
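
As a rough picture of what "security and version control at the granularity of nodes" could look like, here's a small Python sketch. To be clear, this is not NODAL's API or data model; Node, Store, the field names and the user names are all invented for illustration. It just shows content nodes linked into a directed graph, with read permission and a version counter attached to each node rather than to a whole document.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One content node; every name and field here is illustrative only."""
    node_id: str
    content: str = ""
    children: list = field(default_factory=list)  # ids of the nodes this one links to
    readers: set = field(default_factory=set)     # users allowed to read this node
    version: int = 1


class Store:
    def __init__(self):
        self.nodes = {}

    def add(self, node):
        self.nodes[node.node_id] = node
        return node

    def read(self, node_id, user):
        """Return a node's content only if this user may read that node."""
        node = self.nodes[node_id]
        if user not in node.readers:
            raise PermissionError(f"{user} may not read {node_id}")
        return node.content

    def update(self, node_id, content):
        """Replace the content and bump that node's own version counter."""
        node = self.nodes[node_id]
        node.content = content
        node.version += 1
        return node.version

    def parents_of(self, node_id):
        """Ids of every node linking to this one; more than one means a graph."""
        return sorted(n.node_id for n in self.nodes.values() if node_id in n.children)


store = Store()
store.add(Node("para-1", "A shared paragraph.", readers={"alice", "bob"}))
store.add(Node("doc-a", children=["para-1"], readers={"alice"}))
store.add(Node("doc-b", children=["para-1"], readers={"bob"}))
```

Here the paragraph belongs to two documents at once, and each node carries its own access list, so bob can read the shared paragraph without being able to read doc-a.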

Thursday, April 21, 2005

Other Initiatives

Actually there are a couple of initiatives that I think would be of interest either for the existing PORT project and/or for an expanded one.

It is not public, yet, but some friends at the University of Kentucky are about to release an Eclipse plugin that allows the linking of an image of a manuscript to a transcription of the same (think image map from the early days of the WWW) and allows scholars to "encode" the text by selecting either in the image or transcription window, with menu lists of what they wish to say about the portion of text in question. No markup ever displayed to the user, simply the manuscript and its transcription.
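
For readers who don't remember image maps, the underlying idea can be sketched in a few lines of Python. This says nothing about how the Kentucky plugin actually works (I haven't seen its internals); Link, text_at, region_for and the sample coordinates are all hypothetical. Each link ties a rectangle on the page image to a range of the transcription, so a selection can be resolved from either side.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    """Ties a rectangle on the page image to a range of the transcription."""
    x: int
    y: int
    width: int
    height: int
    start: int  # transcription offset, inclusive
    end: int    # transcription offset, exclusive

    def covers_point(self, px, py):
        return (self.x <= px < self.x + self.width
                and self.y <= py < self.y + self.height)


def text_at(links, transcription, px, py):
    """Transcribed text for whichever region contains the point, or None."""
    for link in links:
        if link.covers_point(px, py):
            return transcription[link.start:link.end]
    return None


def region_for(links, offset):
    """Image rectangle for a character chosen in the transcription, or None."""
    for link in links:
        if link.start <= offset < link.end:
            return (link.x, link.y, link.width, link.height)
    return None


transcription = "first line second line"
links = [
    Link(10, 10, 200, 20, 0, 10),   # made-up rectangle for "first line"
    Link(10, 40, 200, 20, 11, 22),  # made-up rectangle for "second line"
]
```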

At present it has to be installed on your local workstation and I am not sure how well it would work with large images/files over the WWW, nor if they have dealt with the attendant security issues. Hmmm, "projects," a term meaning the images, transcriptions, DTDs, etc., could be stored remotely and I assume some login process to control access would not be all that difficult to arrange.

When I first saw the software, it was using a database to store overlapping markup in separate streams. Actually I did a lecture for the CS department on overlap and they have subsequently implemented Sperberg-McQueen/Huitfeldt's GODDAG structure, although without validation. See: http://dblab.csr.uky.edu/~eiaco0/docs/expath/ (I was hopeful they would implement JITTs, but I must admit they did a good job with GODDAG.)

Note that this is a close collaboration between CS and Humanities departments, quite unusual. I seem to recall Kevin Kiernan saying that they had gotten NSF money to fund part of the work on the Electronic Beowulf and Electronic Boethius projects. I have corresponded with Kevin and know that he is interested in collaborating with other projects. I tried to get the SBL to follow up on that idea but they were more interested in "... tell[ing] the hour without error and make a modest noise in doing so" for those of you who know your Nietzsche.

Speaking of overlap a bit more, I have recently become interested in restating the grove paradigm from HyTime without the architectural form syntax and think there is potential for a data structure quite similar to the GODDAG from UKy, but with the ability to do validation, partial parsing, etc. Not to mention that it would work with any source of data and not simply markup.

I think most of the major components for a collaboration environment exist, albeit that some glue will be required to meld them together into an interface that will interest the average scholar.

Not to mention building relationships between various projects, which, like most scholarly projects/societies, etc., all see themselves as being the center of creation, which leaves them very little to discuss with others. I suppose that is one of my strengths: I consider "being" to be an attribute that is only applicable to the Deity, and "becoming" is the lot of all else. If one views projects, etc., as "becoming" then the door is open to wide-ranging cooperation/collaboration to reach goals beyond any of them in isolation.

I think topic maps will play a major role in a collaborative environment. For those of you who have not seen it, the latest draft of the Topic Maps Reference Model (TMRM), can be found at: http://www.isotopicmaps.org/TMRM/TMRM-latest.pdf

I started to summarize the paradigm but the TMRM is only three pages and a paragraph long (sans all the boilerplate stuff) and is the result of some 3 years of labor and wordsmithing.

Looking forward to future developments!

Tuesday, April 12, 2005

If a blog drops in the forest, does anybody hear it?

This blog begins as the continuation of a brainstorming session that started in an email discussion and needed a home. Where it will lead, who knows? Here's an edited version of the original message that started that discussion:

On Digital Libraries

As Ian Witten and David Bainbridge note in the introduction to their book "How to Build a Digital Library" [1]:
"Whereas physical libraries have been around for 25 centuries, digital libraries span a dozen years. Yet in today's information society, with its Siamese twin, the knowledge economy, digital libraries will surely figure among the most important and influential institutions of this new century. The information revolution not only supplies the technological horsepower that drives digital libraries, but fuels an unprecedented demand for storing, organizing, and accessing information."
I concur. The necessary technical infrastructure is just now becoming available, both in terms of hardware and software, and we're just at the cusp of large-scale implementations. All of us may play an important part in the ongoing development, and the need to bring a high level of pragmatism to this field is very evident. While the "Semantic Web" may end up another technological flash in the pan without even a demonstrable, widespread purpose (like "push" technology but with DARPA funding), the need for digital libraries is plainly evident, and the technologies for their implementation already at hand. We need to get out the Meccano tool kit and begin to play.

On OASIS' OpenDocument "standard"

I'm planning to install Greenstone (the digital library project from New Zealand) soon to see how it stacks up. As I mentioned to Jack Park recently, I'll be looking into how this all might plug into both Ceryle and OASIS' new Open Document Text format [2], which will become a standard document format for word processors as well as other software applications (and therefore a potential target format for digital libraries worldwide).

OpenOffice 2.0 will be able to import and export ODT (I've installed a beta version and it's very cool). As with previous versions of OpenOffice, the stored documents are all XML under the covers, and done right too. But with ODT the format is now fully specified, and we can expect to see importers and exporters in commercial products, maybe even MS Word.

You can take any ODT document and unzip it (it's just a bunch of XML and text files zipped together). This is in marked contrast to MS Word's format, which is an opaque and proprietary format that changes in undocumented ways between versions of Word, and even between operating system versions. ODT will become a standard, because when governments wise up they will begin to demand open formats for their content. ODT is a quiet new thing, and will become part of the Digital Library Master Plan*.
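
If you want to see the "just a zip file" point for yourself without a desktop unzip tool, Python's standard zipfile module will do. The sketch below is only that, a sketch: list_package and read_member are names I've made up, and the tiny package it builds in memory is a stand-in so the code runs without a real file (a genuine .odt saved by OpenOffice holds more members than these two). With a real document you'd pass its path instead.

```python
import io
import zipfile


def list_package(source):
    """Names of the files zipped inside an OpenDocument package."""
    with zipfile.ZipFile(source) as package:
        return package.namelist()


def read_member(source, name="content.xml"):
    """Raw bytes of one member; content.xml holds the document body."""
    with zipfile.ZipFile(source) as package:
        return package.read(name)


# Stand-in package built in memory (an assumption for the demo, not a full ODT).
sample = io.BytesIO()
with zipfile.ZipFile(sample, "w") as package:
    package.writestr("mimetype", "application/vnd.oasis.opendocument.text")
    package.writestr("content.xml", "<?xml version='1.0'?><placeholder/>")

members = list_package(sample)
body = read_member(sample)
```

From there the body can be handed to any XML parser, which is exactly what you can't do with a binary Word file.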

Murray

[1] How to Build a Digital Library, Ian H. Witten & David Bainbridge, Morgan Kaufmann Publishers, San Francisco. ISBN: 1-55860-790-0. See also: How to Build a Digital Library (UNESCO)
[2] OASIS Releases OpenDocument 1.0 Committee Draft Specification for Public Review. XML Cover Pages, 4 January 2005.
* Yes, there is a conspiracy amongst librarians (in particular, systems librarians) to take over the world.

Sunday, April 10, 2005

References & reading list

Following is a list of recommended reading and/or references to standards, specs, papers, etc. (As time permits we'll try to transfer any entries added in blog comments up into this message as well*.)

Books & Papers

How to Build a Digital Library
Ian Witten, David Bainbridge. ISBN 1-55860-790-0; Morgan Kaufmann, 2003.
We've come a long way as a species (despite other setbacks) when you can buy a book with this title that is as accessible as Witten and Bainbridge's excellent book on digital libraries. Their approach centers around the Greenstone implementation, but the principles discussed could be used in any DL application.

XML Topic Maps: Creating and Using Topic Maps for the Web
Jack Park, ed., Sam Hunting, Technical Ed., with a foreword by Douglas Engelbart. ISBN 0-201-74960-2; Addison Wesley, 2002.
Publisher's description: "With contributed chapters written by today's leading Web experts, XML Topic Maps: Creating and Using Topic Maps for the Web is designed to be a 'living document' for managing information across the Web's interconnected resources, with a companion Web site and discussion forums."

Essential Classification
Vanda Broughton. ISBN 1-85604-514-5; Facet Publishing, 2005.
From the introduction: "Everybody can and does classify, and if we spend so much time and energy classifying the world about us, it is natural to attempt to organize our stores of information about the world. It's necessary, too, to have systems for managing stored information in a way that allow us to find it again — systems that use our human classificatory skills to organize, to match, to predict and to interpret." I met Ms. Broughton, an important UK researcher in Faceted Analytical Theory, at a JISC meeting several years back, and only recently picked up this book — which I wish I'd seen years ago. Given that it was published last year I suppose this is an impossibility, but it's already become an important addition to my canon. The journal Information Research provides a review.

Data and Reality
William Kent. ISBN 1-58500-970-9; 1st Books Library, 2000.
As one of the reviewers on Amazon wrote: "This book is a nightmare and it is not for the sqeamish. It tells you the truth about trying to model real systems and the problems you are about to tackle. Surprisingly, most data modellers have been ignoring it's contents for decades. Why? I'm not sure. Buy it and have nightmares! :-)" It should be recommended reading for those who've been drinking the "Semantic Web" kool-aid. As the blurb from Extreme Markup 2003 states, this is "perhaps the best book ever written about the concepts behind data modelling." Bill has posted some excerpts on his web site. He's one of those people I'd really love to spend some time getting to know — one of our wise men.

The Intellectual Foundation of Information Organization
Elaine Svenonius. ISBN 0-262-19433-3; MIT Press, 2000.
A description by Roger Sperberg is included in the entry References for the digital library. There's also a review of the book in New Architect by Eugene E. Kim, another review in Issues in Science and Technology Librarianship by Flora Shrode, and the ACM Citation.

Specifications

XML Topic Maps (XTM) 1.0 Specification
Steve Pepper, Graham Moore, eds.
There's a growing body of online information on Topic Maps, and a number of websites devoted to keeping track of new developments, related specifications, etc. I'll try to collect some of those sites here as time permits, but as a start, try:
  • The Topic Map articles page.
  • EasyTopicMaps. This wiki site was overrun by spam and hasn't yet been cleaned up (as of May 2005), but it's still a valuable resource.
  • There's also the ISO Topic Maps activity, found on the ISO SC34/WG3 committee home page — this is where the standards work is being done.


* feel free to add entries...

Editing conventions

Okay, I'm just starting to get this together, so bear with me.

Whitespace

The first thing you'll notice in writing entries is how truly annoying whitespace can be. There are two modes on Blogger as regards whitespace handling, one generating a lot of extra whitespace, the other not enough. With the switch flipped the way it is, you'll probably notice your posts have a lot of extra whitespace: basically any line break (even after end tags) will create vertical whitespace. It's a bit annoying, but the other setting is actually worse. If you're like me, you'll end up fiddling with your post a bit to get it right. You can always ask me for help, even to do a final copy edit. I don't mind. I'm probably pickier than you are, though I think Roger runs a close second.

Titles

Roger suggested we use "down-style" (or something I can't remember) which means that we only capitalize the first word of a title unless there are words that would naturally be capitalized. See below for subtitles.

Bibliographic format

Take a look at the References & Reading List page. I'm not a stickler, really, but some measure of consistency would be nice.

Styles

The CSS stylesheet for Electric Forest has a few classes that you can use, described below.

There's an "inline" class for inline links. Normally, links are bold (e.g., Google); if you don't want them to stand out so much within a paragraph, add a class attribute of "inline" (e.g., Google):
  <a class="inline" href="http://www.acme.com/">Acme</a>

Subtitles

If you want to add a subtitle to a post, you can use a class attribute of "subtitle", e.g.:
  <p class="subtitle">The subtitle to this entry.</p>
This will work on spans, paragraphs, or heading elements.

Notes

If you want to add notes in small text, you can use a class attribute of "note", e.g.:
  <p class="note">My note text.</p>

So here's a bit of a note.

I've also been using notes as introductory text to a post.

Footnotes

For footnotes, we're using square-bracketed numerals[1], with the link ID unique within the blog by prefixing a number scheme of your choosing with your initials or name (your "namespace"). You're responsible after that for using unique IDs (so there are no collisions within the blog). Roger seems to have been using things like "roger34", "roger35", etc. Both the <a> and the <p> it links to should have a class attribute with a value of "fn" to pick up the style. Here's some sample markup for a footnote (note that we link directly to the paragraph, not an anchor):
  The last martini<a class="fn" href="#mur003">[1]</a>.
  ...
  <hr align="left" width="33%" />
  <p class="fn" id="mur003">[1] footnote text.</p>

[1] Though if you want to use surname+year (e.g., "[Steinbeck 1954]"), I don't really have a problem with it.
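
If typing the footnote markup by hand gets tedious, a few lines of Python could generate it. Treat this as a sketch rather than part of the blog's tooling: footnote_id and footnote_markup are names I've invented, and the three-digit zero padding is an assumption that simply matches "mur003" above (Roger's "roger34" style would need a different format string).

```python
from html import escape


def footnote_id(prefix, number):
    """Build an id such as 'mur003' from a namespace prefix and a counter."""
    return f"{prefix}{number:03d}"


def footnote_markup(prefix, number, label, text):
    """Return the (reference link, footnote paragraph) pair for one footnote."""
    fid = footnote_id(prefix, number)
    ref = f'<a class="fn" href="#{fid}">[{label}]</a>'
    # Escape the note text so stray < or & characters don't break the page.
    note = f'<p class="fn" id="{fid}">[{label}] {escape(text)}</p>'
    return ref, note


ref, note = footnote_markup("mur", 3, 1, "footnote text.")
```

Keeping the counter separate from the visible label means the IDs stay unique across the whole blog even though every post restarts its labels at [1].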


If there's anything you want to add to either the stylesheet or this post, drop me a line. We don't yet have definition lists or other features, but they can always be added. If this all seems too onerous, ask. I might be willing to do the copy editing for you, and I'm always willing to answer questions.