
A new project untangles the handwritten texts in the Vatican's collections - fraqed
https://www.theatlantic.com/technology/archive/2018/04/vatican-secret-archives-artificial-intelligence/559205/?single_page=true
======
WalterBright
Sometimes I think archivists are so obsessed with getting perfect scans and
every-pixel-is-precious that scanning books becomes too costly and so never
happens.

A simple alternative is to just collect some volunteers with iPhones and have
one person turn pages while the other just clicks the shutter. You could
easily do 20 pages a minute, 1200/hr, 10000/day. I bet those acres of books
could be ground through in reasonably good time.
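
For anyone who wants to check the arithmetic above, here is a minimal sketch.
The 20 pages/minute rate is the estimate given above; the function names are
made up for illustration.

```python
# Back-of-the-envelope scanning throughput for one two-person team.
PAGES_PER_MINUTE = 20  # rate quoted in the estimate above


def pages_scanned(hours: float, pages_per_minute: int = PAGES_PER_MINUTE) -> int:
    """Pages photographed in the given number of hours."""
    return int(hours * 60 * pages_per_minute)


def hours_needed(pages: int, pages_per_minute: int = PAGES_PER_MINUTE) -> float:
    """Hours of continuous shooting needed for the given page count."""
    return pages / (pages_per_minute * 60)


if __name__ == "__main__":
    print(pages_scanned(1))                 # pages in one hour
    print(round(hours_needed(10_000), 1))   # hours for a 10000-page day
```

At this rate, 10000 pages a day corresponds to a bit over eight hours of
shooting.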

Of course, the images would horrify an archivist. But try it yourself with a
random book. They're quite serviceable. At the very least, one then has a
backup in the case of a catastrophe at the library.

OCRing them is an entirely separate issue.

~~~
etrevino
I think that some archivists are happy even for "dirty" copies because it
keeps people's hands off of the originals. And I mean that literally, because
some of this stuff is so old and delicate that it will easily break apart in
your hands. In the UK National Archives (at least in my room) they won't let
you look at an original document if there's microfilm of it, no matter how
bad. If it was unreadable you had to get an archivist to agree before they'd
sign off on you requesting the original. I had one archivist tell me no and it
led to an especially difficult day.

------
anon1253
Very interesting. Way back, my old university was also involved in historical
document processing:
[https://www.rug.nl/research/portal/files/40224455/Chapter_7....](https://www.rug.nl/research/portal/files/40224455/Chapter_7.pdf)
They also looked at things like writer identification and automatically
dating the documents using a wide array of hand-crafted features. I'm curious
what would happen with some of the newer deep learning models, but the project
has been dead for a while
[http://application02.target.rug.nl/cgi-bin/monkweb?db=All&cm...](http://application02.target.rug.nl/cgi-bin/monkweb?db=All&cmd=scroogle)
… as these things go.

------
pimlottc
I found this part interesting:

> In texts transcribed so far, a full one-third of the words contained one or
> more typos, places where the OCR guessed the wrong letter. [...] Still, the
> software got 96 percent of all handwritten letters correct.

96% correct sounds pretty good but that's still multiple errors per sentence!
The threshold for truly "error-free" is quite high...
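
A rough way to see how the two figures in the quote relate, assuming each
letter is misread independently with the same probability (a simplification;
real OCR errors tend to cluster). The function names and the example lengths
are illustrative, not from the article.

```python
# Relating per-letter accuracy to per-word and per-sentence error rates.
# Assumption: each letter is wrong independently with the same probability.


def word_error_probability(letter_accuracy: float, word_length: int) -> float:
    """Chance that a word of the given length has at least one wrong letter."""
    return 1.0 - letter_accuracy ** word_length


def expected_letter_errors(letter_accuracy: float, n_letters: int) -> float:
    """Expected number of wrong letters in a run of n_letters letters."""
    return (1.0 - letter_accuracy) * n_letters


if __name__ == "__main__":
    for length in (5, 10):
        p = word_error_probability(0.96, length)
        print(f"{length}-letter word: {p:.0%} chance of at least one typo")
    # An 80-letter sentence is a hypothetical example length.
    print(f"80-letter sentence: {expected_letter_errors(0.96, 80):.1f} expected typos")
```

Under that assumption, 96 percent letter accuracy gives about a one-in-three
chance of a typo in a 10-letter word, and roughly three wrong letters in an
80-letter sentence.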

~~~
acdha
I would also be curious about the kind of errors: is it gibberish which is
obvious to the reader or selecting a valid word which requires more
understanding to recognize as incorrect?

~~~
WalterBright
In my experience OCRing text, even a high error rate does not impair the
readability of the text due to the high redundancy of English. It's just
annoying (and impairs searchability, of course).

~~~
acdha
It definitely slows reading since you have to think about alternatives. I was
curious since I’ve noticed that many ML systems produce errors which are
different from those of past systems, and many places aren’t expecting that in
their quality review process.

For example, I’ve built web systems which display search results using an
image overlay so you never see the OCR gremlins but if the failure produces
valid words rather than gibberish that means an increase in the false positive
rate and can be confusing for the user.

------
SiempreViernes
Sloppy summary: some researcher has trained some NN or whatever to segment and
then OCR old handwritten text and hopes to use it on the enormous archive the
Vatican has. Apparently because if it's not scanned it's almost completely
useless to "modern scholars", which I take to mean those historians that only
read medieval Latin if it's printed on a screen...

~~~
jlarocco
I'm not sure I understand what you're getting at.

Are you saying it's better to keep these documents locked in the Vatican
instead of digitized and easily available online to everybody?

Surely it's better to have digitized versions rather than having a million
people thumbing through delicate ancient documents?

~~~
SiempreViernes
If historians haven't been putting this archive to good use, it's certainly
not because of a simple lack of scans, and the Vatican is likely to not give
easy access to the scans for mostly the same reasons they keep people out of
their archives.

Certainly buying a plane ticket to Rome isn't the end of the world, so
historians put the archive to as good use as they can. But they, the people
long trained to make use of delicate ancient documents, are dismissed as some
sort of backwards tribe that is not fit to be labelled "modern scholars".

~~~
etrevino
They keep people out of their archives because things are fragile and poorly
organized. Don't underestimate how difficult (and expensive!) it is to keep
these documents preserved. I've had pages fall apart in my hands and the
document was only from 1714. I was taking proper precautions, too. And, I
should note, the documents were adequately preserved according to the
standards of the time. The Vatican only wants to let in people who have a good
reason for being in there. Fishing expeditions-- especially by independent
researchers-- are not permitted, because they can damage the documents. That
creates a weird chicken and egg scenario, because it makes it very difficult
to find new topics to explore. Also, it means that you won't be allowed to use
the archives unless you're certain that 1. the documents are there and 2. they
are instrumental to the topic at hand.

Digitizing the documents will change all of that. It's a lot less error-prone
and destructive than creating microfiche and it yields quite a few other
benefits, like indexing. No one, least of all the Vatican archivists, has a
complete picture of the contents of the library. Once they have those scans
they'll want to release them because it will help us understand a great deal
about Christianity and post-Roman Europe. The only limitations I can see are
those that other libraries have put in, such as forcing researchers to view
the documents in the browser, as opposed to downloading them. That's what
Oxford did with their collection of ballads (whether they still do I don't
know). That's annoying, but they reasoned that it would keep people from
taxing their servers by downloading the entire collection wholesale.

And I'm not sure what you mean by referring to trained archivists as being
dismissed as some sort of backwards tribe. A great many history programs have
archival (i.e. "library science") programs and training for their PhDs.

