Electric Forest

thoughts about books, digital libraries, and stuff related to expressing and keeping track of our thoughts...

Friday, May 27, 2005

Tagsonomies and digital libraries

I know something about e-books and I see the most serious problem with e-books as publishers' completely ignoring the computer chip in the reading system [1]. I know little about the development of digital libraries. But I suspect they have missed the point of the web, whose model radically shifts the boundaries between author (or publisher) and reader, and I expect they are not properly formulated for a web world.

The classical functions of the library (or librarian)[2], as enunciated in The Intellectual Foundation of Information Organization by Elaine Svenonius and rewritten here for an electronic repository, are to enable the user:
  1. To locate a specific document when certain attributes (such as the title, author or date of publication) are known in advance (the so-called finding objective)
  2. To locate a set of documents representing
    1. All the documents from the same author or organization or governmental entity or court
    2. All the documents as part of the same series or aggregation or issuance
    3. All the documents "published" or released within a specific time-frame, or before or after a specific date
    4. All the documents on a specific subject
    5. All the documents referred to or cited within a single or master or central document
    6. All the documents defined by some "other" criteria
    7. All the documents defined by some combination of the above criteria (together these constitute the collocating objective)
  3. To choose among different types of documents, which are more or less suitable to the user's needs (the choice objective)
  4. To acquire access to the document or set of documents, through electronic delivery on-screen, download, printing, faxing, or other mechanism, in real time even if not pre-arranged (the acquisition objective)
  5. To navigate the collection — that is, to find documents related to a given document or to a given subject by generalization, association and aggregation, or to travel along axes of equivalence, association or hierarchy (the navigation objective)[3]
Much of the library's role is in organizing and arranging the collection to meet these objectives; the author, title and subject catalogs are the library users' guide to that organization. I don't know if it's useful to rank these objectives, but surely the collocating objective is the hardest. It's also the one I suspect has enjoyed the least amount of development.
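
As a rough sketch of how the finding and collocating objectives might look against an electronic repository, the Python below filters a tiny in-memory catalog. The Document fields, the sample records and the function names find and collocate are all made up for illustration; none of them comes from Svenonius or from any real library system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    # Hypothetical record layout, invented for this sketch.
    title: str
    author: str
    year: int
    subjects: frozenset


# Sample records for illustration only.
CATALOG = [
    Document("The Renaissance", "Will Durant", 1953,
             frozenset({"art", "history", "Italy"})),
    Document("The Reformation", "Will Durant", 1957,
             frozenset({"history", "religion"})),
    Document("The Intellectual Foundation of Information Organization",
             "Elaine Svenonius", 2000,
             frozenset({"cataloging", "libraries"})),
]


def find(catalog, **known):
    """Finding objective: match on attributes that are known in advance."""
    return [doc for doc in catalog
            if all(getattr(doc, name) == value for name, value in known.items())]


def collocate(catalog, author=None, subject=None, after=None, before=None):
    """Collocating objective: every document meeting all the criteria given."""
    hits = []
    for doc in catalog:
        if author is not None and doc.author != author:
            continue
        if subject is not None and subject not in doc.subjects:
            continue
        if after is not None and doc.year <= after:
            continue
        if before is not None and doc.year >= before:
            continue
        hits.append(doc)
    return hits


# All documents by one author, then a combination of subject and date.
durant = collocate(CATALOG, author="Will Durant")
later_history = collocate(CATALOG, subject="history", after=1955)
```

Even in this toy form the collocating function needs far more decisions than the finding one (which criteria, how they combine, what counts as a subject), which is some hint of why it is the hardest objective.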

Earlier, I said libraries could combine the index from each of the books, electronic or otherwise, in their collection as a replacement for the subject catalog. This unified index would enable users to locate relevant material even when it was not one of the two or three most significant categories the entire book would fall into. And even if permissions allowed only one reader at a time of any segment, electronification could then allow multiple simultaneous users of a single licensed copy, each reading the segment of interest to them: In other words, you read chapter 4 on Leonardo while I read about Giotto in chapter 1.
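
To make the one-reader-per-segment idea concrete, here is a minimal sketch. The SegmentLoans class and its method names are hypothetical, invented for this illustration, and are not modeled on any real lending or DRM system.

```python
class SegmentLoans:
    """Tracks loans of one licensed copy, one reader per segment at a time.

    Hypothetical illustration only.
    """

    def __init__(self, segments):
        # Map each segment name to the reader holding it (None = available).
        self._holder = {segment: None for segment in segments}

    def check_out(self, segment, reader):
        """Lend a segment if nobody else holds it; return True on success."""
        if self._holder[segment] not in (None, reader):
            return False
        self._holder[segment] = reader
        return True

    def check_in(self, segment, reader):
        """Release a segment, but only for the reader who holds it."""
        if self._holder[segment] == reader:
            self._holder[segment] = None


loans = SegmentLoans(["chapter 1", "chapter 4"])
you_got_leonardo = loans.check_out("chapter 4", "you")   # Leonardo
i_got_giotto = loans.check_out("chapter 1", "me")        # Giotto
i_got_leonardo = loans.check_out("chapter 4", "me")      # already on loan
```

Both of the first two requests succeed against the same single copy; only the third, for a chapter someone else is reading, is refused.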

But a more significant expansion of the notion of a subject catalog rests in the collective knowledge of web users. If, as Clay Shirky persuasively argues, formal taxonomies are (often) inferior to collaborative tagsonomies, [4] why shouldn't digital libraries involve the library users as cataloguers? Yes, individual inexperienced users don't measure up to a professional — but who thinks Zagat's restaurant ratings are inferior to professional reviewers'? The collective knowledge of average restaurant-goers prevents really wrongheaded assessments and ultimately causes the overall evaluation to center around legitimate issues. (And as a counterweight to that "lack of expertise," the body of all readers knows of more obscure but wonderful places to eat than any individual professional reviewer.) If every reader of The Renaissance, volume five of Will Durant's Story of Civilization, supplied terms that apply to chapter 17, however many variants there may be on "Julius II" and "Michelangelo" and "Raphael" and "Rome" and on other aspects of popes and art, we know that "David" and "Sistine Chapel" will not be left out. And isn't it likely that the "Sleeping Cupid" will be mentioned too? So what is that chapter about? The Sistine Chapel frescoes or Pope Julius? A single category is inherently wrong, and yet we rely on single categories for entire books in the current scheme of things. If we look at how people categorize things, we learn what it is they want to find. That's why you want every reader to contribute.

And when I say "every reader," I mean every reader, no matter what collection (the NY Public Library's or the Montclair library's or my own) the book belongs to. If a hundred readers will give us better results than one, why shouldn't digital libraries pool their data with all other libraries around the world, so that there could be ten thousand readers' tags for this chapter?
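
Pooling of this kind is simple to sketch. The Python below assumes each library can hand over a plain list of the tags its readers applied to one chapter; the function names and the sample tags are invented for illustration.

```python
from collections import Counter


def normalize(tag):
    """Fold case and extra whitespace so trivial variants merge."""
    return " ".join(tag.lower().split())


def pool_tags(*library_tag_lists):
    """Count every reader-supplied tag for one chapter across libraries."""
    counts = Counter()
    for tags in library_tag_lists:
        counts.update(normalize(tag) for tag in tags)
    return counts


# Invented sample tags for chapter 17 from three collections.
nypl = ["Michelangelo", "Sistine Chapel", "Julius II"]
montclair = ["michelangelo ", "Raphael", "Sistine  chapel"]
my_own = ["Sleeping Cupid", "Michelangelo"]

pooled = pool_tags(nypl, montclair, my_own)
top_tags = pooled.most_common(2)
```

This normalization only merges differences of case and spacing. Treating "Julius II" and "Pope Julius" as the same tag would need something smarter, and that is where a librarian's expertise could meet the readers' tags.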

What I'd like to hear is that I am ignorant of just such a proposal. Do any digital libraries propose such partnership? I know the Simile project at MIT, which is related to DSpace, provides for semantic capabilities. Does it go farther?

I say that digital libraries, in order to be relevant to this age, have to be more than just digital manifestations of the content and catalogs from earlier times. If everyone is an author on the web, so is everyone a cataloguer. When the library patron is an equal partner to the librarian in meeting those five objectives, then we will have a digital library.




[1] As noted in the exchange with Rick Brannan, this applies most significantly to e-books read on booksize devices and not desktops or laptops. But then, I wouldn't wish reading a booklength work on a CRT on anyone.

[2] Despite its imposing title, Svenonius' book is intended to describe libraries in non-technical terms for the non-library professional — in other words, me. What a librarian knows about information organization is something people working with the semantic web also need to know, which is why I highly recommend this book.

[3] In these statements, document stands for any separate object delivered to the user, and should be understood to include audio, visual and multimedia entities. A piece of a larger whole, such as one article in a journal, if delivered as a standalone object is considered a document in this context.

Ideally of course, the system will deliver all of and only the documents desired by the user — that is, the documents supplied should comprise as close to 100 percent of all the relevant documents in the collection (the recall rate), and of the list of documents returned to the user, as close to 100 percent as possible should in fact be relevant (the precision rate).
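
The two rates can be written out as a small worked example. This is a sketch: the function names and document identifiers are invented, and returning 1.0 for an empty set is just a convention chosen here.

```python
def recall(returned, relevant):
    """Share of all relevant documents in the collection that were returned."""
    relevant = set(relevant)
    if not relevant:
        return 1.0  # nothing to find; convention chosen for this sketch
    return len(set(returned) & relevant) / len(relevant)


def precision(returned, relevant):
    """Share of the returned documents that are in fact relevant."""
    returned = set(returned)
    if not returned:
        return 1.0  # nothing returned; convention chosen for this sketch
    return len(returned & set(relevant)) / len(returned)


# Four relevant documents exist; the system returns three, two of them relevant.
relevant_docs = {"doc-a", "doc-b", "doc-c", "doc-d"}
returned_docs = {"doc-a", "doc-b", "doc-x"}

r = recall(returned_docs, relevant_docs)      # 2 of 4 relevant were found
p = precision(returned_docs, relevant_docs)   # 2 of 3 returned are relevant
```

The ideal system described above would score 1.0 on both.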

[4] I use the term "tagsonomy" in preference to the better established "folksonomy." Wikipedia applies the latter term to the "practice of collaborative categorization using freely chosen keywords. More colloquially, this refers to a group of people cooperating spontaneously to organize information into categories, noted because it is almost completely unlike traditional formal methods of faceted classification. This phenomenon typically only arises in non-hierarchical communities, such as public websites, as opposed to multi-level teams. Since the organizers of the information are usually its primary users, folksonomy produces results that reflect more accurately the population's conceptual model of the information." If there is a distinction between reader-applied and author-applied (eg, between tags at del.icio.us and technorati), I think that insignificant compared to the collaborative aspect.

Thursday, May 26, 2005

Grid technologies and DL : The DILIGENT Project

We cannot pay too much attention to the vanishing border between documents and data, and to the merging of their respective management technologies. More and more, we handle documents as data, we consider data as documents, we use similar metadata for both, and we try to build integrated environments that handle both seamlessly. Markup languages have certainly done a lot toward this convergence. Always remember that any XML file is both a data set and a document ...
So, look carefully at projects trying to bring together the best of both data management and library science, such as DILIGENT, which aims to combine state-of-the-art Grid and DL technologies and to test the resulting cocktail in domains as different as environmental e-science and cultural heritage. The consortium looks wonderfully set up for cross-pollination. The general framework is the European project "Enabling Grids for E-sciencE".

Wednesday, May 25, 2005

Challenging Google's digital library project

Responding to Murray's invitation, my first post here is about a hot topic, at least in Europe. Should European libraries challenge Google in the universal on-line digital library project? For those not aware of it, see the recent political declaration by European leaders. It was posted on diglet, a blog devoted to digital library issues from the UCSD Digital Library Planning Working Group.
From the same source, a recent post about another reaction: University-Press Group Raises Questions About Google's Library-Scanning Project:
Saying that Google's high-profile library project "appears to be built on a fundamental violation of the copyright act," the Association of American University Presses listed concerns and questions about the project in a six-page letter to Google's top lawyer. The complaint is one of a growing list of formal objections to Google's digital-library plans by publishing groups.

Thursday, May 19, 2005

References for the digital library

There's a page on this site with a reading list of really useful resources. I think, though, that some of the books there deserve their own post.

One of these is Elaine Svenonius' book, The Intellectual Foundation of Information Organization (MIT Press, 2000. ISBN: 0-262-19433-3).

I can't summarize its rationale better than how it's described in the first two paragraphs of the Preface:

Instant electronic access to digital information is the single most distinguishing attribute of the information age. The elaborate retrieval mechanisms that support such access are a product of technology. But technology is not enough. The effectiveness of a system for accessing information is a direct function of the intelligence put into organizing it. Just as the practical field of engineering has theoretical physics as its underlying base, the design of systems for organizing information rests on an intellectual foundation. The topic of this book is the systematized body of knowledge that constitutes this foundation.

Much of the literature that pertains to the intellectual foundation of information organization is inaccessible to those who have not devoted considerable time to the study of the disciplines of cataloging, classification, and indexing. It uses a technical language, it mires what is of theoretical interest in a bog of detailed rules, and it is widely scattered in diverse sources such as thesaurus guidelines, codes of cataloging rules, introductions to classification schedules, monographic treatises, periodical articles, and conference proceedings. This book is an attempt to synthesize this literature and to do so in a language and at a level of generality that makes it understandable to those outside the discipline of library and information science.

She succeeds admirably in her goal. The book is first-rate, especially her discussion of subject languages.

Tuesday, May 17, 2005

Moving into the next age for readers

At ricoblog, Rick Brannan responded to the post below, The first thing or two about e-books. He is an information architect at Logos Bible Software.

With his permission, I quote from his blog and our email correspondence (visit his blog to read his comments in full):

Rick Brannan:
I wonder if [you are] aware of Logos Bible Software. ... Logos strives to reproduce the printed page as much as it makes sense in an electronic environment while adding features appropriate for an electronic environment (the Libronix Digital Library System, in this case). These enhancements are primarily in the realm of hypertext referencing (so, click on a Bible ref, or a Josephus ref, or a reference to 'page 347' and go there), topic indexing, and (increasingly) in distinguishing different fields of information for searching purposes.

Some resources take this quite far. The morphologically tagged editions of the Hebrew Bible and Greek New Testament have all sorts of data stuffed in there, associated with specific words. This would never work in print, it only works electronically — much like [your] chess example only works electronically and doesn't work in print.

Other resources have a relatively high degree of interaction. One recent example is Moody's AM Bible Courseware (be sure to check the video at the bottom of the page), which is powered by the Libronix Digital Library System. The books are delivered as books, they are cross-referenced with the larger Logos Bible Software library. And yes, there are tests. ...

He adds, lest he be misunderstood as an electronifying zealot:

There are many things that could be done electronically that don't occur in Logos books. I like to describe these sorts of things as a sort of "multimedia extravaganza." It is all in accordance with Brannan's First Law of Electronic Book Design: Just because you can doesn't mean you should.

Roger Sperberg:
I should plead guilty upfront to the charge of making overgeneralizations (not merely generalizations, mind you, but OVERgeneralizations). I'm also familiar only with certain parts of publishing and tend to ignore my ignorance about other areas and their pertinence.

That said, let me qualify my position. The courseware/books that are demonstrated in the Libronix DLS video are, as you say, examples of what can be done with texts only when they're electronic and they're something being done already.

But I want to distinguish between electronic presentation of texts that can only be read on a computer, and e-books that can be read on portable reading devices. (Not that I said this in my posts.) CD-ROM publishing and web publishing (and the capabilities of Libronix demonstrated seem to fall in that spectrum) are outside this scope. On the other hand, if you were doing these same things in a clear and workable fashion and the book/product could be read on a PocketPC or Palm or Librie or Cytale or eBookwise or Gemstar device, then I would be singing your praises.

I'm very much indebted to Bill Hill at Microsoft for reading the 10,000 documents on fonts, reading comprehension, book design and so on that he did, and for distilling the essence of what makes a book a book and what permits "ludic reading." I lean heavily on his research and conclusions. If you haven't read his 80-page research paper then that may be your next step. I think you have a great research tool. I imagine I could lose myself in reading texts in it. But to me you have more of a software application than a book. And failing so many of Hill's criteria for ludic reading, I do not think if all texts were available in your system that it would move us into the next age for readers.

My point in my post is not that e-book publishers don't know that they should or could link more, bring in other texts and pictures, and so on, but that you and I, as bringers-about-of-the-future, as Prometheans of publishing, have TWO obligations to meet if we are to succeed: we must find the things (hyperlinks in your case, motion graphics for process in my example) that print books can't do AND then execute these capabilities in such a fashion that in every other aspect we humans still regard the object we are reading as a book.

Remember too that every criterion I could list as to what makes a book could almost be met by magazines and newspapers and web pages — and CD-ROM publications too — and that I claim a special role for books. Hill's title claims the magic for reading and not for book-reading, and so maybe I'm on thin ice when I argue from this position. But it's why I focused on books instead of information retrieval as the key issue for libraries going into the future. Many people won't agree with me; and perhaps you won't agree with me, but that would be their and your prerogative. But my story is we've got to keep an e-book really booklike, and I'm sticking to it.

Rick Brannan:
I think ... Roger and I use the term "ebook" just a little differently. I've got the blinders on to my particular context and I usually take it to mean "electronic implementation of a printed resource," typically as a resource in the Libronix Digital Library System, though not necessarily. Roger's definition is admittedly a bit more broad than mine is. Once I understood this, the light went off and many of Roger's comments fell into place.

[Update: The discussion continues at ricoblog with more of Rick's thoughts.]

Monday, May 16, 2005

Can our libraries be digital if the books are not?

Maybe a new genre will be necessary for e-books to become popular successes — you know, something along the lines of a multi-pronged narrative where the order of what you read and what the characters know depends on the reader making choices: "(A) Bill lets the stranger walk by. (B) Bill engages the stranger in conversation." Such books already exist — my third-grader has several Goosebumps titles that R.L. Stine has concocted that use this structure — and it is because these are all text, all narrative, that electronifying them will preserve the book essence.

But, really, if this were done right, it wouldn't be a book at all but something more akin to a CD-ROM adventure or a video game. And I can imagine applications that would take your input — say your financial circumstances and retirement goals — and then offer text-heavy guidance as to your best investing strategies. But the more you leant on the application and interaction, probably the more interesting and useful this could be, and it would probably be better off if you didn't approach it as a book at all.

Now a library isn't restricted to serving books to patrons — obviously periodicals, music and videos can be checked out in most libraries, and the rationale for offering online access seems akin to that of offering reference librarians: You need some information? Still, any Hill acolyte will say that a book — reading — has an almost unreal capacity to engage the reader (magic Hill says) and let's not ignore the formula that has been devised over these centuries as we seek to expand the delivery vehicle.

So if I focus my argument on making books for the digital library instead of discussing the broader topic of ways of delivering information, it's because I agree that books play a special role for us. Let's not re-invent everything just because we can.

But there's so much that could be done that isn't being considered, even within this narrow slice.

Take the matter of the subject index.

Today in a city the size of, well, my hometown, Montclair, NJ, which has about 35,000 residents, there are many sizeable libraries — the town library on South Fullerton Street, the university library/libraries at Montclair State, the high school and three middle school and six grade school libraries. A fair-sized library — maybe a couple thousand books — at Union Congregational Church and surely similar ones at other churches and synagogues. The library at the art museum. And there must be some individuals — I know some candidate professors — with specialized collections as large as Union Cong's.

If you were able to take the index out of any of these print books and merge it into a site-specific subject catalog, what a detailed and powerful search tool that would be. The patrons at any one of these libraries would find that incredibly useful.

Of course, the resulting subject card catalog would be enormous and making physical cards for all these index entries would be taxing, not to mention sorting them, so let's skip ahead in this gedankenexperiment and make our unified-index subject catalog from electronic files. Of course the books themselves don't need to be electronic. The subject index is simply going to point to a location in a resource — and I use that word advisedly — and that might be a print book or even a periodical, if it's been indexed. And of course Topic Maps and RDF have paved the way for this sort of thing to be done rather handily, if only the indexes were available electronically.
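
Here is a minimal sketch of that merge, with plain Python dictionaries standing in for the Topic Maps or RDF a real system would more likely use. The book titles, index terms and page numbers are invented, and merge_indexes is a hypothetical name.

```python
from collections import defaultdict

# Invented index entries and page numbers, for illustration only.
BOOK_INDEXES = {
    "Art History A": {"Giotto": [12, 15], "Leonardo": [88]},
    "Renaissance Lives B": {"Leonardo": [3, 40], "Julius II": [210]},
}


def merge_indexes(book_indexes):
    """Merge per-book indexes into one subject catalog.

    book_indexes maps a title to that book's index, {term: [page, ...]}.
    The result maps each lowercased term to a sorted list of
    (title, page) locations across the whole collection.
    """
    unified = defaultdict(list)
    for title, index in book_indexes.items():
        for term, pages in index.items():
            for page in pages:
                unified[term.lower()].append((title, page))
    return {term: sorted(locations) for term, locations in sorted(unified.items())}


UNIFIED = merge_indexes(BOOK_INDEXES)
leonardo_locations = UNIFIED["leonardo"]
```

Each entry points to a location in a resource, so nothing here requires the books themselves to be electronic. One lookup for "leonardo" names both books and the pages to turn to.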

Publishers, naturally enough, do have all their indexes in electronic form, all their content for that matter (I exaggerate I suppose, but no publishers can typeset their texts by non-electronic means without increasing their composition costs ten- or twenty-fold). No one has asked for the indexes to be released separately from the print book and in electronic form.

Yet imagine it — the full contents of a library could easily be put into a unified index (well, apart from fiction and other non-indexed titles). And Montclair's index, specific to its titles, would be individual and different from those in Verona or West Orange or Bloomfield just a couple miles away.

Perhaps there aren't enough libraries to persuade publishers to go through the extra step of making indexes available in RDF or XTM format. What about individual book collections? In my own case, and this is on the small side for someone in publishing, our home contains several hundred books, half of them children's books. I'm an e-book enthusiast, with well over 2,000 individual titles (and more than 10,000 with all my duplicate formats) but if I had just the indexes for the Egyptology or XSLT or art or Renaissance or chess books, I would be ecstatic. Even though I have fewer than a dozen in all of these subjects, these little mini-collections would be vastly more useful if I could do one lookup and then go only to the book or books that had material to answer my question.

Almost from day one the key issue in electronic publishing has been publishers' concern about lost sales because of "piracy" or digital copying. Yet if libraries — not just digital libraries — would push on something like releasing indexes to print books in electronic form, where no "piracy" or digital copying could occur, think how much farther down the road to our future we would be.

The first thing or two about e-books

The first thing about e-books, electronic books, is that you can deliver them electronically.

So this leads into all kinds of issues —
  • pricing
  • on-line distribution
  • e-book reading systems
  • design, of the books and of the reading system
  • no-quality-loss electronic copying
  • DRM
  • piracy
and so on.

The publishing industry (and by this I mean all the stakeholders: publishers per se, editors, writers, agents, booksellers) is interested in keeping things as close to the current situation as possible. The worst thing that could happen from their perspective (at least to half these people) is that something like the great video ripoff of the 1970s would occur, when the actors and directors, the creative people who made the movies, did not share in the huge money that studios as distributors raked in when the video market came into existence. The writers and agents of course identify with the actors, but the publishers and booksellers don't want to be shut out any more than the writers, and are deathly afraid that a single misstep could spell disaster.

Consequently what we've seen over the last five or six years has been a tendency to treat e-books as a fourth format to release books in, acknowledging that there are clearly circumstances when an e-book fits the bill and neither a hardcover nor a trade paperback nor a mass-market paperback does. Everyone is comfortable with this perspective, particularly because royalties can be placed in a range that alarms no one, and really there have been technical and legal challenges aplenty to deal with. Of course, the minuscule number of e-books sold and read has also made everyone feel there's no rush to settle things.

All the issues being dealt with by publishers, editors, writers, agents, booksellers — and their lawyers! — stem from being able to deliver a book to a customer electronically. Some of those people are glad the market hasn't taken off, because then they might be more in the circumstance of the record and movie industries in relation to electronic copying and sharing (so-called piracy). [1]

Most everyone you talk to believes that solving these issues (by making cheap e-book readers, getting e-books in the hands of students, digitizing vast libraries of existing books) will somehow keep the genie in the publishing bottle and prevent it going the way of the typesetting business, which in the course of about eight years after the appearance of PostScript composition found 90 percent of its practitioners closing their doors. From that experience in the late 1980s and early 1990s, we in publishing know that technology is no respecter of market position and status and reputation. Everyone is learning the lesson of the last war, as it were, and making sure that they take advantage of technology and do not ignore it.

But what all this overlooks is that e-books are being read on a device — an e-book reader, a Palm or a PocketPC, or maybe a laptop — that incorporates a computer. And no one is making books that take advantage of this.

In part this is because book publishers are firm in their decision not to become more like CD-ROM publishers. An e-book is a book; an interactive CD-ROM is not. You can't take a print book and readily (that is, cheaply) transform it into a CD-ROM adventure or interactive reference. [2]

As Bill Hill has pointed out in The Magic of Reading, we change a book's design at our peril. [3] Hill says everything on a book page from font design to margins has settled there from centuries of fine-tuning to humankind's physiological preferences. You can see what he means by comparing the design of Adobe's Acrobat Reader, which has literally dozens of controls all jammed into the default interface, with that of MS Reader, which has one visible control. [4]

So if we believe Bill Hill, then we really don't want to invent a new information vehicle, something that's not a book, but want as much as possible to keep the book experience on our electronic reading systems. There are plenty of ways to do that and still take advantage of that computer under the hood.

For instance, any print book explaining a dynamic process can't hold a candle to an electronic book that simply illustrates the process in action. Here is a page taken from Garry Kasparov's On My Great Predecessors [5].

Page 361 shows four static chess diagrams

And here, by contrast, is a chess game whose process is clearly shown (click on "play" or ">"):




This is taken from Der Alte Goniff, a chess blog written by Ed Gaillard. Note that the size of the board and of the commentary are relatively small here and that its design could be easily altered to fit the dimensions of the reading system. Where the print book is limited to showing the pieces at a few stages of the game, and must indicate the change in state by placing slightly changed images near to each other, the electronic illustration shows the board after every move.

Of course, every how-to book from re-wiring your house to learning to juggle would benefit similarly.

A textbook could record your answers to the questions at the end of the chapter and email them to the instructor.

A music theory book explaining Wagner could illustrate the themes being discussed by playing them.

A character's name in a science-fiction novel could be pronounced.

OK, maybe some of these are trivial uses. And something suggested to me recently by David Rothman — that a guidebook could connect with GPS data to display information about the sites you are passing when walking down a street in New York, or London, or Paris, or Tokyo — maybe takes you out of the realm of a basic e-book and into a specialized travel device.

But the point is to look at the limitations of a print book and, without changing the essence of the material being presented, then to release the e-book from those limits.

That's what we should be talking about when we speak about the future of e-books, or rather when we speak about e-books in our future. Answer the question "What can an e-book do that you can't do on paper?" and you've at least got your head around the real benefits that we could see.





[1] Recall that in the 19th century, people who opposed slavery were attacked as thieves wanting to steal a slave-owner's "property." The "intellectual property" scheme currently supported by our laws is, in my mind, based on flimsier rationales than that for slavery.

[2] It's not just a matter of adding in a few videos and songs — books and CD-ROMs are radically different formats. Perhaps in the future some specially-designed websites will transform easily into CD-ROM publications, if that medium survives at all. (Or perhaps the fact that the experience can be replicated online means there won't be sufficient market to support CD-ROM sales; it's a dying sector now already.)

[3] Hill is the mastermind behind Microsoft's Reader software and also its sub-pixel font hinting technology, which has migrated from the e-book reader to the browser and word-processing software. The Magic of Reading is available in MS Reader format at Slate or in Word format at the Poynter Institute.

[4] This isn't from a conservative aesthetic — "let's make it look more like a book because change is bad" — but from an exhaustively researched recognition that changing the reading interface will result in poorer results, either in entertainment or information delivery (in other words, change is bad.)

[5] Page 361. Part of the Everyman Chess Series published in the U.K. by Gloucester Publishers and in the U.S. by Globe Pequot Press.

Thursday, May 12, 2005

Random thoughts on DSpace

There is an article about Dutch universities opening their research to the web. A random sample of the listed repositories revealed that the platform used as a repository is DSpace. From the DSpace site:
A groundbreaking digital repository system, DSpace captures, stores, indexes, preserves and redistributes an organization's research material in digital formats.
The not-so-random thought is that there is a clear opportunity to layer a good topic map engine over DSpace, and, from that, provide for the support of research-oriented communities.