
1.5 million pages of ancient texts to be made accessible online
http://arstechnica.com/open-source/news/2012/04/15-million-pages-of-ancient-texts-to-be-made-accessible-online.ars
======
jedberg
Soon, historians will need to become experts at processing big data so that
they can make new discoveries.

Protip for the college kids: If you're a computer science person who really
likes history, take courses that will teach you how to process large datasets.
Then you can be on the leading edge of historical research!

~~~
grepherder
The tip is of course valid and pro, and I'd recommend the same, but it's
already being done, under the banner of machine translation. Also, in this
area "big data" loses some of its meaning, since you don't really need
traditional databases; you just process raw text. There are literally
thousands of people researching how to intelligently select and process this
data.

~~~
gliese1337
It's not just machine translation; it's image processing / cleanup (to handle
huge amounts of data for multispectral imaging and figure out how to combine
it into sets of false-color images that people can read), optical character
recognition (for ancient handwriting in weird writing systems), system-level
programming to run the scanners, etc. There's a big ol' book on this, "Rome
Wasn't Digitized in a Day":
<http://www.clir.org/pubs/reports/pub150/pub150.pdf>

BYU (which I attend and which I work for) has done a huge amount of work in
this field: <http://maxwellinstitute.byu.edu/about/cpart.php>
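
At its simplest, the false-color combination step is just mapping spectral
bands onto RGB channels so that ink visible only under one wavelength stands
out as a distinct tint. A toy sketch in Python (numpy/Pillow, with invented
band filenames; the real pipelines also do registration and much smarter band
selection):

    import numpy as np
    from PIL import Image

    def load_band(path):
        # Load one spectral capture as a float array and stretch it to [0, 1].
        band = np.asarray(Image.open(path).convert("L"), dtype=np.float64)
        lo, hi = band.min(), band.max()
        return (band - lo) / (hi - lo + 1e-9)

    def false_color(ir_path, red_path, uv_path, out_path):
        # Map infrared -> R, visible red -> G, ultraviolet -> B.
        rgb = np.dstack([load_band(p) for p in (ir_path, red_path, uv_path)])
        Image.fromarray((rgb * 255).astype(np.uint8)).save(out_path)

    false_color("band_940nm.tif", "band_630nm.tif", "band_365nm.tif",
                "false_color.png")

The hard part isn't the compositing; it's registering the bands and deciding
which combinations make the faded ink legible.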

A few years ago I was writing web applications to support transcription of
images of medieval documents in Old French, sidestepping close-to-insurmountable
OCR problems by using grad students instead, though that still requires
segmenting the images properly. The LDS church does similar stuff on a very
large scale to digitize
genealogical records. It makes research a whole lot easier, but there's still
plenty of room for improvement; image maps don't always reliably match up with
the fields that you're trying to read/transcribe on images of documents, and
that's kind of a pain.
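
For concreteness, the "image map" side is basically just boxes tied to
transcription fields; something like this sketch (hypothetical field names,
not any schema I've actually used), where the pain is that the boxes drift
relative to the handwriting from scan to scan:

    from dataclasses import dataclass
    from PIL import Image

    @dataclass
    class Region:
        field: str            # semantic field, e.g. "date" or "witness_name"
        x: int                # top-left corner, in page-image pixels
        y: int
        w: int                # box size; this is what drifts between scans
        h: int
        transcription: str = ""

    def crop_for_transcriber(page_path, region):
        # Show the transcriber only the snippet tied to this field.
        page = Image.open(page_path)
        return page.crop((region.x, region.y,
                          region.x + region.w, region.y + region.h))

    r = Region(field="date", x=120, y=340, w=400, h=60)
    snippet = crop_for_transcriber("folio_12r.png", r)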

~~~
TheAmazingIdiot
What we need here are real human eyeballs to read the scripts.

I do medieval and renaissance dance reconstruction and dance performance.
Having just been to an event, I took a class on the Dances of the Gresley
Manuscript.

Well, what is this manuscript? It isn't a dance treatise, or anything of the
sort. Gresley was a law student in the 1530s-1550s (we know this from later
court cases involving a lawyer named Gresley). These dance instructions come
from the margins of his law book.

He wrote in musical notation, dance notation and other descriptive words. He
even used words that have no known meaning in the dance community. We have to
deduce what he meant through a multitude of methods, none of which we can
guarantee.

But back to the topic of OCR... How do these document scanners and OCR systems
plan to decipher this kind of source, written in the margins?

~~~
gliese1337

        But back to the topic of OCR... How do these document scanners and OCR systems plan to decipher this kind of source, written in the margins?
    

I have no idea; probably they don't, yet. Everything I've worked on uses
students' eyeballs to do the actual character recognition, so I'm not deeply
familiar with the state of the art. I do know that OCR is mainly used for
documents that have a well-defined structure where you can make an image map
identifying different semantic fields, and the contextual field information
allows for much more intelligent OCR; it's not so good for big blocks of
paragraph text.
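
Concretely, "contextual field information" means things like telling the
engine that a given box can only contain digits. With tesseract that amounts
to a per-field character whitelist; a rough sketch (pytesseract, with made-up
field names and crop boxes):

    import pytesseract
    from PIL import Image

    # Each semantic field gets a crop box and (optionally) a character
    # whitelist; the coordinates and field names here are invented.
    FIELDS = {
        "year": ((900, 40, 1040, 90), "0123456789"),
        "name": ((120, 40, 700, 90), None),   # unconstrained free text
    }

    def read_fields(page_path):
        page = Image.open(page_path)
        out = {}
        for field, (box, allowed) in FIELDS.items():
            config = "--psm 7"   # tell tesseract the crop is one text line
            if allowed:
                config += f" -c tessedit_char_whitelist={allowed}"
            out[field] = pytesseract.image_to_string(page.crop(box),
                                                     config=config).strip()
        return out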

When you get to figuring out stuff scrawled in the margins, there are image
pre-processing techniques that can identify regions of handwriting and then
normalize it by rotation and scaling, but I'm pretty sure a complete solution
is still in the realm of stuff considered AI (because, of course, once you
know how to do it reliably, it becomes machine learning or pattern matching or
something like that and no one calls it AI anymore).
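
The rotation-and-scaling normalization is the well-understood piece. Roughly,
with OpenCV (a sketch only, not production code; note that minAreaRect's angle
convention varies between OpenCV versions):

    import cv2
    import numpy as np

    def normalize_scribble(crop_bgr, target_height=64):
        gray = cv2.cvtColor(crop_bgr, cv2.COLOR_BGR2GRAY)
        # Ink is dark on light parchment; invert so the foreground is nonzero.
        _, ink = cv2.threshold(gray, 0, 255,
                               cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)
        coords = np.column_stack(np.where(ink > 0)).astype(np.float32)
        angle = cv2.minAreaRect(coords)[-1]
        # Fold the reported angle into the smallest rotation that brings
        # the writing upright.
        if angle > 45:
            angle -= 90
        elif angle < -45:
            angle += 90
        h, w = gray.shape
        rot = cv2.getRotationMatrix2D((w / 2, h / 2), -angle, 1.0)
        upright = cv2.warpAffine(gray, rot, (w, h), flags=cv2.INTER_CUBIC,
                                 borderMode=cv2.BORDER_REPLICATE)
        # Rescale to a fixed height so every snippet has a comparable size.
        scale = target_height / h
        return cv2.resize(upright, (max(1, int(w * scale)), target_height))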

------
dbuxton
Does anyone have a lead on a framework or service that can handle the storage,
display and metadata management for assets of this type online? On a low (non-
institutional) budget?

My family has a large number (thousands to tens of thousands) of photographs,
sketchbooks and other historical documents that we're in the process of
digitising - partly so that they are not lost to posterity, but also so that
we can share them with other branches of the family.

At the moment we have thousands of scans in a massive Dropbox account, but it's
becoming unmanageable very quickly and only allows minimal metadata storage.
(We have some concerns about the quality of the scans, but for the moment it's
good enough.)

Apologies for the slightly unrelated post but I've been mulling this for a
while and may see if I can hack something together if nothing is out there
already.

~~~
zrail
I'm working on a personal photo album project (personal as in "for my use
only") that kind of matches this. It took me a little under five hours to put
together a simple Rails site that lives on Heroku, handles the uploads and
thumbnailing, sticks the whole thing on S3, and handles display. Right now the
only metadata is "name" and "description", but it's just a Postgres database,
so you can add whatever you want.

Here's the basic app if you want to poke around. There's almost literally
nothing to it:

<https://github.com/peterkeen/phytos>
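
If Rails isn't your thing, the core of the pipeline is tiny in any language.
Here's roughly the same thumbnail-and-push-to-S3 step as a Python sketch
(Pillow + boto3, placeholder bucket name; this isn't the phytos code):

    import io
    import boto3
    from PIL import Image

    s3 = boto3.client("s3")
    BUCKET = "family-archive-scans"   # placeholder bucket name

    def upload_scan(path, name, description, thumb_size=(400, 400)):
        # Push the original up with its metadata attached...
        s3.upload_file(path, BUCKET, f"originals/{name}",
                       ExtraArgs={"Metadata": {"description": description}})
        # ...then generate and push a thumbnail for the gallery view.
        img = Image.open(path).convert("RGB")
        img.thumbnail(thumb_size)          # resizes in place, keeps aspect
        buf = io.BytesIO()
        img.save(buf, format="JPEG")
        buf.seek(0)
        s3.upload_fileobj(buf, BUCKET, f"thumbs/{name}.jpg")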

------
iRobot
My first Pascal program should be in there somewhere.

