More About Google Print

Prediction: we’ll talk about Google Print until they debut the beta, then we’ll talk about it more.

Copyfight posted a follow-up on Google’s announcement earlier this week. Of note was a quote from Michael Madison:

A first thought: It’s one more example, and a pretty important one, of the fading of the lines separating copyright law from communications law. Is Google Print an information conduit? A massive, rogue P2P technology? Is it a contributory infringer? A publisher? From whom, if anyone, does it need licenses, and who, if anyone, should regulate it, and how, if at all?

TeleRead started talking about how Google Print will be presented:

My understanding, which may be wrong, is that Google will OCR the page scans, but do only cursory machine cleanup of the raw unstructured text that results (this is what I call “raw digital text” or RDT), and use the still-error-laden RDT in their search system to pull up the page scans.

You can see this approach now in the way Amazon presents results of its “search inside this book” feature. The text is indexed for searching, but clicking on the results brings up the scanned, bitmapped pages. When available, the feature is incredibly useful, but I feel cheated when I try to copy and paste the text.
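
To make the idea concrete, here's a rough sketch of how a search over raw OCR text might hand back page scans instead of the text itself. This is only my guess at the general shape, not how Google or Amazon actually do it; the file names, sample text, and the build_index and search functions are all made up for illustration.

```python
import re
from collections import defaultdict


def build_index(ocr_pages):
    """Map each word found in the raw OCR text to the page scans containing it."""
    index = defaultdict(set)
    for scan_file, raw_text in ocr_pages.items():
        for word in re.findall(r"[a-z0-9]+", raw_text.lower()):
            index[word].add(scan_file)
    return index


def search(index, query):
    """Return the scan files whose OCR text contains every word in the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return []
    hits = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(hits)


# Hypothetical pages: scan file name -> uncorrected OCR output.
ocr_pages = {
    "page_001.png": "The whale rose slowIy from the sea",  # OCR misread "slowly"
    "page_002.png": "Call me Ishmael. The sea was calm",
}

index = build_index(ocr_pages)
print(search(index, "whale sea"))  # the reader is shown the scan, not the text
print(search(index, "slowly"))     # an OCR error means this page is never found
```

The second query is the catch with error-laden RDT: wherever the OCR got a word wrong, the page simply doesn't turn up for that word, even though the scan itself is perfectly readable.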

TeleRead points out that this is also how the University of Michigan’s Making of America collection works.

MoA scanned the books, placed the scanned page images online, and built a search engine to search the resulting RDT from OCR. Then, one by one they have been converting the RDT from selected books to highly-proofed SDT (structured digital text) using human proofers and TEI (I think) for structuring. So, the scans came first, and then the cleanup was (and is being) done at a later time.

The excitement here, for TeleRead, is that Google might end up contributing to efforts like Project Gutenberg and could benefit greatly from the Distributed Proofreaders volunteers.