
Digitising books as objects: The invisible made visible - diodorus
http://blogs.bl.uk/collectioncare/2018/02/digitising-books-as-objects-the-invisible-made-visible.html
======
peterburkimsher
The article talks about lighting, and the different features visible with
normal light (the text), raking light (folds and holes in the paper), and
transmitted light (watermarks).

I have a tangentially related question about the books I've digitised
myself with a phone camera and natural daylight. Are there tools that can
detect the spine in the middle, de-skew the image, correct discolouration,
detect text, and run OCR? I think it would be a good task for machine learning.

I'd also like to run those image manipulation routines separately, because the
text is mixed Chinese-English, so the OCR step is non-trivial.
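
To make the "detect the spine" step concrete, here is a minimal sketch in plain Python. It assumes the photo is a two-page spread already converted to greyscale (a list of rows of 0-255 values) and that the gutter shows up as the darkest vertical band near the middle, which will not hold for every photo. The names `find_spine_column`, `split_spread` and `central_fraction` are made up for illustration and do not come from any library.

```python
def find_spine_column(gray, central_fraction=0.3):
    """Return the index of the darkest column in the central band of a spread.

    gray: list of rows, each a list of 0-255 brightness values.
    central_fraction: share of the width, centred on the middle, to search.
    Restricting the search keeps dark page edges from being mistaken
    for the gutter.
    """
    width = len(gray[0])
    margin = int(width * (1 - central_fraction) / 2)
    lo, hi = margin, width - margin
    # Sum brightness down each candidate column; the gutter shadow is lowest.
    col_sums = [sum(row[x] for row in gray) for x in range(lo, hi)]
    return lo + col_sums.index(min(col_sums))


def split_spread(gray, spine_x):
    """Cut the spread into left and right pages at the spine column."""
    left = [row[:spine_x] for row in gray]
    right = [row[spine_x:] for row in gray]
    return left, right
```

Each page can then go through de-skewing, colour correction and OCR as separate steps, so a different OCR engine or language model can be swapped in for the mixed Chinese-English text.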

~~~
_ph_
Are there good open source OCR engines around? One would be a starting point
for such a program; a good automated way of scanning books seems to be
missing.

~~~
peterburkimsher
Tesseract doesn't meet the "good" criterion you listed, in my experience with
Chinese and skewed English text.

OCR is the most important part of turning JPG into TXT, but it's the image
processing beforehand that I'm less familiar with (deskew, text detection).
Robotically automated scanning isn't practical in my opinion; it could easily
get thrown off by thicker pages for photos in the middle of a book, for
example. Digitisation will always be a partly manual process; I just hope the
existing images can be processed better.
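
For the deskew step, one common approach is a projection-profile search: try a range of small angles, shear the ink pixels by each one, and keep the angle that packs the ink into the fewest rows. A rough sketch in plain Python follows, assuming the page has already been thresholded to a 0/1 grid (1 = ink). The function name and the `max_angle` and `step` parameters are illustrative, and the shear only approximates a true rotation, which is reasonable for a few degrees of tilt.

```python
import math


def estimate_skew_degrees(binary, max_angle=5.0, step=0.5):
    """Estimate the tilt of text lines in a binarised page.

    binary: list of rows of 0/1 values, 1 = ink.
    Returns the tilt in degrees in image coordinates (y grows downward),
    so a positive value means the lines drop toward the right.
    """
    ink = [(x, y) for y, row in enumerate(binary)
           for x, v in enumerate(row) if v]
    steps = int(round(max_angle / step))
    # Try small angles first so ties resolve to the smallest correction.
    candidates = sorted((i * step for i in range(-steps, steps + 1)), key=abs)
    best_angle, best_score = 0.0, -1
    for angle in candidates:
        slope = math.tan(math.radians(angle))
        profile = {}
        for x, y in ink:
            # Undo the candidate tilt, then count ink per resulting row.
            r = round(y - x * slope)
            profile[r] = profile.get(r, 0) + 1
        # Sum of squares is largest when ink concentrates in few rows.
        score = sum(c * c for c in profile.values())
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle
```

Rotating the image by the opposite of the returned angle should level the lines. On a narrow crop, small tilts can be indistinguishable from zero, so this works better on a full page width.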

If you're not in Germany and are interested in digitisation, I think Project
Gutenberg are worth looking into.

