Hacker News
A new project untangles the handwritten texts in the Vatican's collections (theatlantic.com)
80 points by fraqed on May 2, 2018 | 16 comments



Sometimes I think archivists are so obsessed with getting perfect scans and every-pixel-is-precious that scanning books becomes too costly and so never happens.

A simple alternative is to just collect some volunteers with iPhones and have one person turn pages while the other clicks the shutter. You could easily do 20 pages a minute, 1,200/hr, 10,000/day. I bet those acres of books could be ground through in reasonably good time.

Of course, the images would horrify an archivist. But try it yourself with a random book. They're quite serviceable. At the very least, one then has a backup in the case of a catastrophe at the library.

OCRing them is an entirely separate issue.
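
The throughput figures above are easy to check with a back-of-the-envelope sketch. The function names and the one-million-page collection below are hypothetical, chosen only for illustration; the 20 pages a minute is the rate quoted above.

```python
def pages_per_hour(pages_per_minute: float) -> float:
    """Convert a per-minute photographing rate to a per-hour rate."""
    return pages_per_minute * 60


def hours_needed(total_pages: float, pages_per_minute: float) -> float:
    """Hours of continuous shooting needed to photograph total_pages."""
    return total_pages / pages_per_hour(pages_per_minute)


def days_needed(total_pages: float, pages_per_minute: float,
                hours_per_day: float) -> float:
    """Working days needed for one two-person team at the given daily hours."""
    return hours_needed(total_pages, pages_per_minute) / hours_per_day


if __name__ == "__main__":
    rate = 20  # pages per minute, as quoted above
    print(pages_per_hour(rate))        # hourly rate
    print(hours_needed(10_000, rate))  # hours behind the 10,000/day figure
    # Hypothetical collection size and working day, for illustration only.
    print(days_needed(1_000_000, rate, hours_per_day=8))
```

At that rate, 10,000 pages a day works out to a bit over eight hours of nonstop shooting, so the daily figure assumes a full working day with no breaks.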


I think that some archivists are happy even for "dirty" copies because it keeps people's hands off of the originals. And I mean that literally, because some of this stuff is so old and delicate that it will easily break apart in your hands. In the UK National Archives (at least in my room) they won't let you look at an original document if there's microfilm of it, no matter how bad. If it was unreadable you had to get an archivist to agree before they'd sign off on you requesting the original. I had one archivist tell me no and it led to an especially difficult day.


Very interesting. Way back, my old university was also involved in historical document processing: https://www.rug.nl/research/portal/files/40224455/Chapter_7.... They also looked at things like writer identification and trying to automatically date the documents using a wide array of hand-crafted features. I'm curious what would happen with some of the newer deep learning models, but the project has been dead for a while (http://application02.target.rug.nl/cgi-bin/monkweb?db=All&cm...), as these things go.


I found this part interesting:

> In texts transcribed so far, a full one-third of the words contained one or more typos, places where the OCR guessed the wrong letter. [...] Still, the software got 96 percent of all handwritten letters correct.

96% correct sounds pretty good but that's still multiple errors per sentence! The threshold for truly "error-free" is quite high...
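
A rough sanity check of how those two numbers relate, assuming (my simplification, not something the article states) that each letter is misread independently with the same probability. The function names are just illustrative.

```python
import math


def word_error_rate(char_accuracy: float, word_length: int) -> float:
    """Probability that a word of word_length letters has at least one wrong letter."""
    return 1.0 - char_accuracy ** word_length


def expected_errors(char_accuracy: float, n_chars: int) -> float:
    """Expected number of wrong letters in a run of n_chars letters."""
    return (1.0 - char_accuracy) * n_chars


def implied_word_length(char_accuracy: float, word_error: float) -> float:
    """Word length at which the per-word error rate equals word_error."""
    return math.log(1.0 - word_error) / math.log(char_accuracy)


if __name__ == "__main__":
    acc = 0.96  # per-letter accuracy quoted in the article
    print(word_error_rate(acc, 5))         # share of 5-letter words with a typo
    print(expected_errors(acc, 100))       # wrong letters in a 100-letter sentence
    print(implied_word_length(acc, 1 / 3)) # word length matching "one-third of words"
```

Under that independence assumption, a 100-letter sentence averages about four wrong letters, and one word in three having a typo corresponds to words of roughly ten letters. Real errors likely cluster on hard-to-read words, so treat this as an order-of-magnitude check only.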


I would also be curious about the kind of errors: is it gibberish which is obvious to the reader or selecting a valid word which requires more understanding to recognize as incorrect?


In my experience OCRing text, even a high error rate does not impair the readability of the text, due to the high redundancy of English. It's just annoying (and impairs searchability, of course).


It definitely slows reading since you have to think about alternatives. I was curious since I’ve noticed that many ML systems produce errors which are different than past systems and many places aren’t expecting that in their quality review process.

For example, I’ve built web systems which display search results using an image overlay so you never see the OCR gremlins, but if the failure produces valid words rather than gibberish, that means an increase in the false positive rate and can be confusing for the user.


Sloppy summary: some researcher has trained some NN or whatever to segment and then OCR old handwritten text and hopes to use it on the enormous archive the Vatican has. Apparently because if it's not scanned it's almost completely useless to "modern scholars", which I take to mean those historians that only read medieval Latin if it's printed on a screen...


I'm not sure I understand what you're getting at.

Are you saying it's better to keep these documents locked in the Vatican instead of digitized and easily available online to everybody?

Surely it's better to have digitized versions rather than having a million people thumbing through delicate ancient documents?


If historians haven't been putting this archive to good use, it's certainly not because of a simple lack of scans, and the Vatican is likely to not give easy access to the scans for mostly the same reasons they keep people out of their archives.

Certainly buying a plane ticket to Rome isn't the end of the world, so historians put the archive to as good use as they can. But they, the people long trained to make use of delicate ancient documents, are dismissed as some sort of backwards tribe that are not fit to be labelled "modern scholars".


They keep people out of their archives because things are fragile and poorly organized. Don't underestimate how difficult (and expensive!) it is to keep these documents preserved. I've had pages fall apart in my hands and the document was only from 1714. I was taking proper precautions, too. And, I should note, the documents were adequately preserved according to the standards of the time. The Vatican only wants to let in people who have a good reason for being in there. Fishing expeditions, especially by independent researchers, are not permitted, because they can damage the documents. That creates a weird chicken-and-egg scenario, because it makes it very difficult to find new topics to explore. Also, it means that you won't be allowed to use the archives unless you're certain that 1. the documents are there and 2. they are instrumental to the topic at hand.

Digitizing the documents will change all of that. It's a lot less error-prone and destructive than creating microfiche and it yields quite a few other benefits, like indexing. No one, least of all the Vatican archivists, has a complete picture of the contents of the library. Once they have those scans they'll want to release them because it will help us understand a great deal about Christianity and post-Roman Europe. The only limitations I can see are those that other libraries have put in, such as forcing researchers to view the documents in the browser, as opposed to downloading them. That's what Oxford did with their collection of ballads (whether they still do I don't know). That's annoying, but they reasoned that it would keep people from taxing their servers by downloading the entire collection wholesale.

And I'm not sure what you mean by referring to trained archivists as being dismissed as some sort of backwards tribe. A great many history programs have archival (i.e. "library science") programs and training for their PhDs.


I think I qualify as a legitimate historian in this situation (I actually did buy a plane ticket to Rome last year and spent a week looking through physical texts at the Vatican archives). I'm thrilled by any efforts to digitize and OCR these texts and so is everyone else I know who does archival work. So I don't really know where the tone of dismissal/cynicism I'm picking up in your posts is coming from.

As anyone who does archival work will tell you, one of the biggest problems is the "too much to know" issue: it's basically physically impossible to look at every relevant page of every relevant text when you only have at most a few months to a year to do your field work. So aside from the convenience aspect of being able to double check your work without flying back to Rome (or wherever), OCR also offers a qualitatively different style of research that lets you trawl for words or phrases across a very large corpus. That's super valuable for doing intellectual or cultural history, or biography.


Old texts are pretty delicate. You wouldn't want droves of scholars reading them all the time. Digital images don't wear from reading them.

Searchability is also a really nice feature.


I don't agree with any of that.

One big practical reason the archive requires special access is that 1,200-year-old paper documents are fragile and require special handling to read. If the Vatican allowed everybody to come in and finger through them, they would be easily damaged.

Furthermore, I suspect required travel to Rome has prevented a lot of people from using the archives.

> But they, the people long trained to make use of delicate ancient documents, are dismissed as some sort of backwards tribe that are not fit to be labelled "modern scholars".

I don't see anybody here making that claim, and I don't know what it has to do with the Vatican archives.


It's much easier to read an image than an old, crumbly document that you can only see if you're physically located in Rome.

That doesn't justify OCRing it instead of just photographing it, but it certainly justifies the idea that the documents are much less useful hidden than scanned.


It's useless to modern scholars because of how difficult it is to access the original texts, because they must be carefully preserved. All of that is explained in the article.



