Friday, May 27, 2005
Tagsonomies and digital libraries
I know something about e-books and I see the most serious problem with e-books as publishers' completely ignoring the computer chip in the reading system [1]. I know little about the development of digital libraries. But I suspect they have missed the point of the web, whose model radically shifts the boundaries between author (or publisher) and reader, and I expect they are not properly formulated for a web world.
The classical functions of the library (or librarian)[2], as enunciated in The Intellectual Foundation of Information Organization by Elaine Svenonius and rewritten here for an electronic repository, are to enable the user:
- To locate a specific document when certain attributes (such as the title, author or date of publication) are known in advance (the so-called finding objective)
- To locate a set of documents representing any of the following (together these constitute the collocating objective):
  - All the documents from the same author or organization or governmental entity or court
  - All the documents as part of the same series or aggregation or issuance
  - All the documents "published" or released within a specific time-frame, or before or after a specific date
  - All the documents on a specific subject
  - All the documents referred to or cited within a single or master or central document
  - All the documents defined by some "other" criteria
  - All the documents defined by some combination of the above criteria
- To choose among different types of documents, which are more or less suitable to the user's needs (the choice objective)
- To acquire access to the document or set of documents, through electronic delivery on-screen, download, printing, faxing, or other mechanism, in real time even if not pre-arranged (the acquisition objective)
- To navigate the collection — that is, to find documents related to a given document or to a given subject by generalization, association and aggregation, or to travel along axes of equivalence, association or hierarchy (the navigation objective)[3]
Earlier, I said libraries could combine the index from each of the books, electronic or otherwise, in their collection as a replacement for the subject catalog. This unified index would enable users to locate relevant material even when it was not one of the two or three most significant categories the entire book would fall into. And even if permissions allowed only one reader at a time of any segment, electronification could then allow multiple simultaneous users of a single licensed copy, each reading the segment of interest to them: In other words, you read chapter 4 on Leonardo while I read about Giotto in chapter 1.
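To make the unified index concrete, here is a minimal sketch in Python. It assumes each book's back-of-book index is already available as a mapping from a term to the chapters where it appears; the function name, the book titles and the sample entries are all hypothetical, chosen only to mirror the Leonardo and Giotto example.

```python
from collections import defaultdict


def build_unified_index(book_indexes):
    """Merge per-book indexes into one collection-wide index.

    book_indexes maps a book title to that book's own index, which in
    turn maps a term to the list of chapters where the term appears.
    """
    unified = defaultdict(list)
    for title, index in book_indexes.items():
        for term, chapters in index.items():
            for chapter in chapters:
                # Lowercase the term so "Leonardo" and "leonardo" collocate.
                unified[term.lower()].append((title, chapter))
    return {term: sorted(hits) for term, hits in unified.items()}


# Hypothetical sample data, not taken from any real catalog.
book_indexes = {
    "Hypothetical Survey of Italian Art": {"Giotto": [1], "Leonardo": [4]},
    "Hypothetical Guide to Florence": {"Leonardo": [2]},
}

unified = build_unified_index(book_indexes)
print(unified["leonardo"])
```

A lookup on "leonardo" returns chapter-level hits from both books, even though neither book would be catalogued under Leonardo as a whole.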
But a more significant expansion of the notion of a subject catalog rests in the collective knowledge of web users. If, as Clay Shirky persuasively argues, formal taxonomies are (often) inferior to collaborative tagsonomies, [4] why shouldn't digital libraries involve the library users as cataloguers? Yes, individual inexperienced users don't measure up to a professional — but who thinks Zagat's restaurant ratings are inferior to professional reviewers'? The collective knowledge of average restaurant-goers prevents really wrongheaded assessments and ultimately causes the overall evaluation to center around legitimate issues. (And as a counterweight to that "lack of expertise," the body of all readers knows of more obscure but wonderful places to eat than any individual professional reviewer.) If every reader of The Renaissance, volume five of Will Durant's Story of Civilization, supplied terms that apply to chapter 17, however many variants there may be on "Julius II" and "Michelangelo" and "Raphael" and "Rome" and on other aspects of popes and art, we know that "David" and "Sistine Chapel" will not be left out. And isn't it likely that the "Sleeping Cupid" will be mentioned too? So what is that chapter about? The Sistine Chapel frescoes or Pope Julius? A single category is inherently wrong, and yet we rely on single categories for entire books in the current scheme of things. If we look at how people categorize things, we learn what it is they want to find. That's why you want every reader to contribute.
And when I say "every reader," I mean every reader, no matter what collection (the NY Public Library's or the Montclair library's or my own) the book belongs to. If a hundred readers will give us better results than one, why shouldn't digital libraries pool their data with all other libraries around the world, so that there could be ten thousand readers' tags for this chapter?
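A rough sketch of what pooling might look like, assuming each library can export the tags its readers applied to one chapter as a plain list of strings. The function names, the `min_count` threshold and the sample tags are my own assumptions for illustration; no existing library system is implied.

```python
from collections import Counter


def pool_tags(library_tags):
    """Combine per-library tag lists for one chapter into a single tally."""
    pooled = Counter()
    for tags in library_tags.values():
        # Normalize case and whitespace so trivial variants count together.
        pooled.update(tag.strip().lower() for tag in tags)
    return pooled


def consensus_tags(pooled, min_count=2):
    """Keep tags supplied at least min_count times, most frequent first."""
    return [tag for tag, count in pooled.most_common() if count >= min_count]


# Hypothetical reader tags for one chapter, grouped by collection.
library_tags = {
    "NY Public Library": ["Michelangelo", "Sistine Chapel", "Julius II"],
    "Montclair": ["michelangelo", "Raphael", "Sistine Chapel"],
    "personal": ["Michelangelo ", "boring"],
}

pooled = pool_tags(library_tags)
print(consensus_tags(pooled))
```

The one-off tag drops out once a minimum count is applied, which is the Zagat effect in miniature: the more readers who contribute, the less any single wrongheaded tag matters.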
What I'd like to hear is that I am ignorant of just such a proposal. Do any digital libraries propose such partnership? I know the Simile project at MIT, which is related to DSpace, provides for semantic capabilities. Does it go farther?
I say that digital libraries, in order to be relevant to this age, have to be more than just digital manifestations of the content and catalogs from earlier times. If everyone is an author on the web, so is everyone a cataloguer. When the library patron is an equal partner to the librarian in meeting those five objectives, then we will have a digital library.
[1] As noted in the exchange with Rick Brannan, this applies most significantly to e-books read on booksize devices and not desktops or laptops. But then, I wouldn't wish reading a booklength work on a CRT on anyone.
[2] Despite its imposing title, Svenonius' book is intended to describe libraries in non-technical terms for the non-library professional — in other words, me. What a librarian knows about information organization is something people working with the semantic web also need to know, which is why I highly recommend this book.
[3] In these statements, document stands for any separate object delivered to the user, and should be understood to include audio, visual and multimedia entities. A piece of a larger whole, such as one article in a journal, if delivered as a standalone object is considered a document in this context.
Ideally of course, the system will deliver all of and only the documents desired by the user — that is, the documents supplied should comprise as close to 100 percent of all the relevant documents in the collection (the recall rate), and of the list of documents returned to the user, as close to 100 percent as possible should in fact be relevant (the precision rate).
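The two rates can be computed directly from the set of documents returned and the set of documents that are actually relevant. This is a small sketch of the standard definitions; the function name and document identifiers are hypothetical, and returning 1.0 when a set is empty is my own convention rather than anything the definitions require.

```python
def recall_and_precision(returned, relevant):
    """Return (recall, precision) for one search result.

    recall    = relevant documents returned / all relevant documents
    precision = relevant documents returned / all documents returned
    """
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant
    # Assumed convention: an empty denominator counts as a perfect score.
    recall = len(hits) / len(relevant) if relevant else 1.0
    precision = len(hits) / len(returned) if returned else 1.0
    return recall, precision


# Hypothetical example: four relevant documents exist in the collection;
# the search returns three documents, two of which are relevant.
relevant = {"doc1", "doc2", "doc3", "doc4"}
returned = ["doc1", "doc2", "doc5"]

recall, precision = recall_and_precision(returned, relevant)
print(recall, precision)
```

Here the recall rate is 2 of 4 and the precision rate is 2 of 3; the ideal system would score 100 percent on both.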
[4] I use the term "tagsonomy" in preference to the better established "folksonomy." Wikipedia applies the latter term to the "practice of collaborative categorization using freely chosen keywords. More colloquially, this refers to a group of people cooperating spontaneously to organize information into categories, noted because it is almost completely unlike traditional formal methods of faceted classification. This phenomenon typically only arises in non-hierarchical communities, such as public websites, as opposed to multi-level teams. Since the organizers of the information are usually its primary users, folksonomy produces results that reflect more accurately the population's conceptual model of the information." If there is a distinction between reader-applied and author-applied tags (e.g., between tags at del.icio.us and technorati), I think it insignificant compared to the collaborative aspect.



7 Comments:
Murray wrote to me:
"I'm trying to figure out how to cull from your post what seem to be the main requirements of what constitutes a digital library, something you've already done, but fleshing out the details is a bit more difficult. For example, of the objectives, what does any one of them really mean in terms of actual implementation?"
To which I replied (and then later wrote to Murray that I would post his and my remarks here):
Svenonius' section on the objectives clearly delineates the what and the why. But I think these objectives, even if translated to electronic documents, as I re-formulated them, are so rooted in the 19th-century that what may be more important is to discover what the new objectives are.
Hence just as, in an ink-and-paper world we have an objective for enabling the user to get their hands on the right object (in other words, which copy of the book do you want? and what are the check-out privileges of users?), so should we now have an objective for allowing patrons (as opposed to librarians) to classify books, and to supply other metadata (such as saying, these two books are related).
I take it that "of the objectives, what does any one of them really mean in terms of actual implementation?" is a rhetorical question, meaning you have to work them out in your system [at the Open University].
The way I made use of these objectives was to state them, along with certain other basic principles, and then to use them to justify the specific requirements that we identified for a delivery platform for our documents [in a publishing environment]. So a requirement saying user metadata needed to be able to be searched along with content-set metadata could be connected to the choice objective (as well as pieces of the collocating objective) -- if the user had some of the documents in a search result already in hand, say, our system had to be able to offer them first before offering others that required downloading.
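As an illustration only, and not a description of the platform we actually built, the "offer what the user already has first" requirement can be sketched as a stable sort over the search results. The function name and document identifiers below are hypothetical.

```python
def order_results(results, in_hand):
    """Put documents the user already holds ahead of those needing download.

    sorted() is stable, so the original relevance order is preserved
    within each of the two groups.
    """
    in_hand = set(in_hand)
    # False sorts before True, so in-hand documents come first.
    return sorted(results, key=lambda doc_id: doc_id not in in_hand)


# Hypothetical search result, already ranked by relevance.
results = ["a", "b", "c", "d"]
in_hand = {"c", "d"}

ordered = order_results(results, in_hand)
print(ordered)
```

The point of the sketch is that user metadata (what this user already holds) has to be available at search time alongside the content-set metadata, or the ordering cannot be done.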
I'm not so sure that that's what you are thinking about when you ask your rhetorical questions, but identifying these objectives certainly brought our requirements into sharper focus and coherence.
About 'tagsonomy' and 'folksonomy': they are neither unknown nor unused at libraries. Catalogers, librarians and other 'knowledge workers' have used "free-tagging" for many years. What they haven't done is allow the commoners' imprecise opinion in on the data ... until now. At least here at the National Library of Australia (and I've heard of others) we adopt and use (at least, very soon) a bit of 'tagsonomy' and 'folksonomy' in various applications. The sausage in the pudding, I believe, is the very distinction the applications can make between a trained professional and a sausage-eating teenager wanting to screw up the metadata.
As such, nothing really is new, apart from the lack of mass adoption of this scheme. And, as such, it is something we should be a bit cautious about; I do not wish to alienate the librarian from the library on the grounds that Britney Spears has billions of lovers and equal amounts of metadata and opinion attached. I do not appreciate the popular embrace for the sake of popularity. Oh, and I think Clay Shirky's got it very wrong.
Alex, I followed the link you provided, which led to reactions to Shirky's Ontology is Overrated article, and I don't know which part of what he's written you disagree with, but it seems that you may not be considering his audience or the target of his criticism. I don't think he's criticizing library classification systems per se; he's looking at how the "Semantic Web fabulists" are trying to create ontologies without consideration for the ambiguity of the real world. He uses library classification as an example, and in those examples he may have even gotten something wrong (which seems to be the beef of those on the site you referenced), but his basic message is pretty solid. I don't know that I agree with his projections on tags and folksonomies, but that'll shake out on its own.
As he says, "Critically, the semantics here are in the users, not in the system," which should be self-evident to anyone who understands that "semantics" (meaning) is a result of the human process of interpretation and does not exist absent a human agent. The Web will never be "semantic" because it cannot contain meaning, just information that allows a human to interpret it and thereby create meaning. It's delusional science fiction.
Adding metadata doesn't change this — even if the metadata was universally-valid and context-free. As Cory Doctorow writes, a "world of exhaustive, reliable metadata would be a utopia. It's also a pipe-dream, founded on self-delusion, nerd hubris and hysterically inflated market opportunities." Shirky is basically on the same track as Doctorow, following the foxes spending DARPA's money in building systems that can perhaps compare a passport code against a database of known terrorists, but are stupid when used to build classification systems, just the wrong tool for the job. But given the track record of most intelligence systems I wouldn't hold my breath waiting for any of it to work. Garbage in, garbage out.
Murray: I agree with their sentiment, but I disagree with how they came to it. :) The attack on library classification systems is based on wrong information even if the end result is "the semantic web is bullshit."
I guess it is problematic for me to swallow Shirky's polemic when he bases his conclusion on false notions, even if I happen to agree with and accept that conclusion. I'm human. So sue me. :)
Again, I don't see what Shirky's written as intended as any substantive "attack" on library classification schemes. He's simply pointing out the problems in those systems that are well known to librarians, who seem to admit as much: ontologically, DDC is a mess, LoC is full of holes, etc. These classification systems were developed for different purposes than computer-based ontologies (being focused on physically organizing a collection within a library, not on customers' ease of finding resources, nor on being ontologically correct), and the "Semantic Web" types are falling into the same kinds of traps that librarians did a hundred years ago, such as mistaking an identifier for a concept, not having clear ideas about subject identity, or entirely missing what C.S. Peirce calls Thirdness, i.e., human, abductive (intuitive) reasoning and interpretation. These systems were developed because someone was trying to organize reality, and the problems that happen when one tries to do this are roughly identical. In a much less sophisticated way, Shirky is to a degree mirroring some of Peirce's arguments. I watched for years as several Peirceans tried to make similar arguments on the various upper ontology mailing lists.
I also don't see any glaring "false notions." You've only complained about them in general but never elucidated them specifically, so we could perhaps understand what actual problems you find.
Murray: I'm just pointing out that his use of categorising schemes in an attack on ontologies is misplaced. Also, when he *does* go into those schemes, he doesn't mention that each level has finer granularity and extensions. They are schemes, not ontologies, and his link between the two somewhat leads to a conclusion about the latter, and I just don't agree with that link.
Yes, sure, some make ontologies to fix the categorising schemes' problems. Others don't. Not even the Library of Congress would do such a misguided thing anymore, even if there have been attempts in the past. (FRBR to the rescue of some of that, I think.) We learn by making mistakes, and I feel Mr. Shirky hasn't talked to a librarian lately about these issues; he isn't quite up to date.
But *outside* the library world, he's spot on, so hooray to him and his message, ok? :)
And since we're talking about "what's wrong with ontologies", we should better accept the fact that librarians (and this piece is from 1980!) themselves have been critical of those cataloging schemes for a long time. That is what Shirky should have addressed. Anything else is wool.