
CAPTCHAs work—for digitizing old, damaged texts, manuscripts - nickb
http://arstechnica.com/news.ars/post/20080814-captchas-workfor-digitizing-old-damaged-texts-manuscripts.html
======
stcredzero
I'm imagining this film about the progress of human knowledge. Somehow we need
to express this continuity across the ages -- from the monks who transcribed
classical texts on parchment, to centuries later when junkies are fooled into
deciphering fragments of those same parchments for access to porn.

------
weegee
Interesting, but I would think that scanning to a PDF file would better
preserve the work for future readers and researchers. That way you're looking
at the original, not a text copy of it. Or have the text copy available as
well. It seems it would be a lot less work. And for me, digital is too
transient, too fragile. For photography, I still feel that a piece of film can
last longer
than a digital file. We have film from more than a hundred years ago, but I
highly doubt that any digital files created today will still exist in one
hundred years. I work for Corbis, and we bought the Otto Bettmann archive,
which, at the time of purchase, was the most significant archive of images in
the world. We began something called "The Keystone Project," which would have
digitized the bulk of the archive. Soon, though, the need to move the archive
from the building in Manhattan (which was not air conditioned and had a leaky
roof, among other problems) became a very high priority, and the archive now
rests deep underground at Iron Mountain in Pennsylvania. The originals are stored
at below-zero temperatures, and the original card catalogs are still there for research
purposes (though there are numerous portable versions for researchers around
the world). Hopefully this preservation will last hundreds of years.

~~~
thwarted
Once it is scanned, the work is preserved as an image, but you still cannot
search it. OCR often fails because the source is so old and worn.
Interestingly enough, in order to run the CAPTCHA approach at all, you need to
scan, photograph, or otherwise digitize the work to begin with, so the output
is both what you seek (a scanned image) AND the actual content. Everyone wins.

