Electric Forest


thoughts about books, digital libraries, and stuff related to expressing and keeping track of our thoughts...

Thursday, June 23, 2005

Metadata should be free

Here's the lede from Michael Rogers' story at MSNBC:
Several years ago journalist John Lenger told a remarkable story in the Columbia Journalism Review about teaching a journalism class at Harvard's extension school. He asked his young students to write a story about a Harvard land deal that occurred in 1732, but after a week of research, most came back with almost nothing substantial to report. The problem: They had done most of their research using the Internet, walking right past Harvard's library and archives, where the actual information could be found. When Lenger questioned their research methods, one student replied that she assumed that anything that was important in the world was already on the Internet.
This connects to the effort to transform "books into bytes" described in Rogers' story, but I think it's really about the poor job we've done of putting metadata about books, such as their indexes, online, where it would make the discovery of these resources more likely.

Years ago, people said protection was needed against "theft" of software programs, and it took years for the countervailing attitude to appear in a strong, coherent way. Today, free and open software includes operating systems like Linux and Solaris, and applications for word processing, image manipulation, spreadsheets, presentations, software development and a thousand different uses. IBM, Sun, Nokia and other companies have taken IP rights and software that cost millions of dollars and put them into the public domain. We don't say such software is not creative and not worth "protecting," but its free distribution is more valuable to our society than locking it up. Enough people see this that the culture has moved so far ahead of the law that it's kind of scary.

I'll skip over the arguments for scrapping patent laws that favor one side of this argument over the other. And there are many sites for discussing the similar issues around copyright. In the book world, people such as Cory Doctorow are beginning to see this; he has posted his third book on the web in a variety of formats, saying "Share this book! That's what it's for." [1] The battleground goes beyond the legal arena to the cultural aspects, I would say. Just what are our obligations to creators and researchers? The IP pendulum needs to swing back to the middle in our minds as much as in our laws.

Metadata, creation and copyright have a tenuous connection to each other that ought to follow the open software model, to my way of thinking. Sure, there is intellectual effort that goes into the creation of an index, say, just as there is in the writing of a software application. But our society will benefit more — from wider, deeper, more accurate searching — when such information can be readily shared.

Metadata should be free. Metadata about non-electronic resources especially.



[1] Doctorow actually goes much further than simply allowing the electronic version of the book to be read for free, saying: "What's more, if you live in the developing world -- a country not on the World Bank's list of high-income countries -- you can do much more. You can make your own editions, charge money for them, make movies, translations, plays and anything else you care to, and charge whatever you want, without sending me one cent -- you don't even need my permission. See the FAQ for more. The only restriction is that you can't export your versions to the world's high-income countries where all my paying customers are. Deal? Deal."

Thanks to TeleRead for its Doctorow and Rogers posts.

Saturday, June 04, 2005

Wikipedia and libraries

What's the most important thing about the web?
Instant access, free access, or permanent access?

At MaisonBisson, Casey Bisson suggests integrating Wikipedia's entries with the display of catalog search results, with the obvious example of biographical data: "We have three books about Nikola Tesla, but why not include the first few paragraphs from the Wikipedia entry on him?"

He connects this effort (at Plymouth State University in New Hampshire, apparently) with the "increasing tendency toward self-service" and he says this about Wikipedia:
For my part, I've come to love Wikipedia, despite having access to EB and other, more traditional sources. Why? Because it takes better advantage of the web than others, and unlike those commercial products, I don't have to sign in to use it.
At PressThink, a few months back I read an essay by Simon Waldman, the director of digital publishing at The Guardian newspaper. He discusses how so many people think of immediacy as being the point of the web and yet that the key element for participating in the conversation of the web is permanence. He says that many newspapers, despite being on the web, are not of the web because after a week or so they remove their articles from free access. And with this you slip off the search engines and out of the consciousness of the web user.

In our instant access world, you think, OK, I put my article on the web. But you have to leave it there, you have to make it accessible for perennial discovery. But if you look up subjects of interest in a hundred fields, you'll note the absence of The New York Times and the Washington Post and other authorities, who remove their stories from the freely accessible web and so remove themselves from the full impact they can have in Google, Yahoo! and other search engines.

It is the same with all information. Keep the indexes, the content and the images under heavy protection and you will find that people ignore this sheltered content in favor of the sources that embrace the web and make everything accessible there. Those sources will become the influential authorities, not because they are more trustworthy, or more authoritative, or better written, but because they are more accessible.

And this goes for libraries too, especially when I owe no more allegiance to my local library than I do to my local newspaper. After all, any accessible library may be closer in cyberspace than my local library or newspaper.

(Thanks to TeleRead for the pointer.)

Wednesday, June 01, 2005

Digitising BOB indexes

Electric Forest is fortunate to receive an email from freelance indexer Linda Sutherland, as several of our posts have dipped into the subject of back-of-book indexes. A point of interest: back in 1991, the Davenport Group (developer of DocBook) was trying to solve problems arising from the inconsistent use of terms in the master indexes of independently developed and rapidly changing technical documentation. The proposed solution was accepted by the ISO/JTC1/SC18/WG8 working group and published as the international standard ISO/IEC 13250:2000, or what we know as Topic Maps.

One or two comments in earlier articles seem to suggest that, if only librarians and publishers were willing, it would be easy to digitise the BOB (back-of-book) indexes of innumerable books, then merge them to form one single ‘mega-index’ to all of the books.

It's an attractive idea, and one which may become feasible in time. But making it so isn't simply a matter of persuading publishers. At least two practical problems will need to be overcome as well.

One of them is copyright. Roger writes of “releasing indexes to print books in electronic form, where no 'piracy' or digital copying could occur”. In fact it could occur, at least in some cases. Freelance indexers own the copyright in their work, except where the contract for an index expressly transfers rights to the publisher. If not transferred, re-use of the index without its creator's permission would certainly merit a black-patched eye.

The other problem is compatibility. A BOB index is an individualised, tailor-made product, crafted to suit one text and its target readership, and subject to any constraints on length, use of subheadings etc. that may have been specified by the client. Co-ordination with other indexes is rarely, if ever, a requirement.

Any attempt to merge indexes would have to cope with the consequences of that individualisation. The problems will include varying levels of specificity/exhaustivity/granularity, non-existent vocabulary control between indexes, and highly context-specific ‘see’ and ‘see also’ references which, if merged without editing, would almost certainly result in a jungle of misdirections.

Imagine merging together all the diaries ever written, then sorting the entries in chronological order. The result would be a history of sorts — but would you expect it to be the clearest, most readable, most reliable, or most succinct of its kind? Similarly, a ‘mega-index’ created by merging BOB indexes may not be entirely useless as a retrieval tool, but without a great deal of editing it will not be nearly as useful as you might expect. — Linda Sutherland
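
Sutherland's compatibility point can be made concrete with a rough sketch. Everything below is an assumption made for illustration: the two tiny indexes, their headings and page numbers, and the naive_merge function are all hypothetical, and no real index is stored in this shape. The sketch simply merges headings by exact string match, which is what "merging without editing" would amount to.

```python
from collections import defaultdict

# Two hypothetical back-of-book indexes (invented data). Each heading maps
# either to a list of page numbers or to a "see" cross-reference that only
# makes sense inside its own book.
index_a = {
    "Tesla, Nikola": [12, 47],
    "alternating current": {"see": "current, alternating"},
    "current, alternating": [13, 48],
}
index_b = {
    "Nikola Tesla": [5],
    "AC power": [6, 9],
    "current, alternating": {"see": "AC power"},
}


def naive_merge(indexes):
    """Merge indexes heading by heading, with no vocabulary control."""
    merged = defaultdict(lambda: {"locators": [], "see": set()})
    for book, index in indexes.items():
        for heading, entry in index.items():
            if isinstance(entry, dict):
                # A cross-reference: keep it, but its book context is lost.
                merged[heading]["see"].add(entry["see"])
            else:
                # Page numbers: record which book each one belongs to.
                merged[heading]["locators"].extend((book, page) for page in entry)
    return dict(merged)


merged = naive_merge({"A": index_a, "B": index_b})
for heading in sorted(merged):
    print(heading, merged[heading])
```

In this toy case the same person ends up under two unrelated headings ("Tesla, Nikola" and "Nikola Tesla"), and the heading "current, alternating" now carries book A's page numbers alongside book B's instruction to look elsewhere. That is a small version of the "jungle of misdirections" described above, and fixing it needs the editorial work of reconciling vocabularies, not just more code.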