Digitising books as objects: The invisible made visible (blogs.bl.uk)
43 points by diodorus 5 months ago | 5 comments

The article talks about lighting, and the different features visible with normal light (the text), raking light (folds and holes in the paper), and transmitted light (watermarks).

I have a tangentially-related question, regarding the books I've digitised myself with a phone camera and natural daylight. Are there tools that can detect the spine in the middle, de-skew the image, adjust discolouration, detect text, and OCR? I think it would be a good task for machine learning.

I'd also like to run those image manipulation routines separately, because the text is mixed Chinese-English, so the OCR step is non-trivial.

> Are there tools that can detect the spine in the middle, de-skew the image, adjust discolouration, detect text, and OCR?

unpaper (https://www.flameeyes.eu/projects/unpaper) provides a partial solution. See https://mzucker.github.io/2016/09/20/noteshrink.html for some more ideas.

These steps are generally split between the image processing ones, the OCR one, and a final step of combining/compressing pages into a single "bound" pdf/djvu file, at least if you are looking to use FOSS software.
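As a rough illustration of that three-way split, here is a minimal sketch. The names `run_pipeline`, `clean_page`, `ocr_page` and `bind_pages` are hypothetical and not part of any tool mentioned in this thread, and the `*.jpg` pattern is only an assumption about how the scans are stored. Each stage is passed in as a function, so the image clean-up can be run and checked separately from the OCR.

```python
from pathlib import Path


def run_pipeline(scan_dir, clean_page, ocr_page, bind_pages):
    """Run per-page clean-up, per-page OCR, then one binding step."""
    # Sort by file name so the bound file keeps the page order.
    scans = sorted(Path(scan_dir).glob("*.jpg"))
    # Stage 1: image processing, one page at a time.
    cleaned = [clean_page(scan) for scan in scans]
    # Stage 2: OCR on the cleaned images, one page at a time.
    texts = [ocr_page(image) for image in cleaned]
    # Stage 3: combine all pages and their text into a single output.
    return bind_pages(cleaned, texts)
```

Because the stages only meet in this one function, swapping the OCR engine or skipping the binding step does not touch the image processing.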

For image processing, take a look at Scantailor (https://github.com/scantailor/scantailor/wiki), which will handle all the image processing steps for you and output images that are ideal for OCR.

I have not done OCR on mixed-language text, but I will say that tesseract has been under active development for years and continues to improve.
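For what it's worth, the tesseract command line accepts several languages joined with `+` in its `-l` option, so a mixed Chinese-English page can at least be attempted in one pass. A minimal sketch, assuming the `tesseract` binary and the `chi_sim` (Simplified Chinese) and `eng` language data are installed. The helper names are my own, and this says nothing about how accurate the output will be.

```python
import subprocess
from pathlib import Path


def tesseract_command(image, output_base, languages):
    """Build a tesseract CLI call; languages are joined with '+'."""
    return ["tesseract", str(image), str(output_base), "-l", "+".join(languages)]


def ocr_page(image, languages=("chi_sim", "eng")):
    """Run tesseract on one page image and return the recognised text."""
    image = Path(image)
    output_base = image.with_suffix("")  # tesseract appends ".txt" itself
    subprocess.run(tesseract_command(image, output_base, languages), check=True)
    # Build the .txt path by hand so names like "page.001" keep their dots.
    return Path(str(output_base) + ".txt").read_text(encoding="utf-8")
```

Use `chi_tra` instead of `chi_sim` for Traditional Chinese text.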

The best FOSS options for binding all the processed images and OCR output into single files are djvubind (https://github.com/strider1551/djvubind) for djvu output, and pdfbeads (https://github.com/ifad/pdfbeads) for pdf output. I tried to write up an outline of the whole process and how to use each of the tools here: https://github.com/wikey/bookscan

A lot of those tools have received little development in the past couple of years. They tend to do what they do well and reliably, so don't let that put you off, though anyone interested in adding to the developer pool would certainly be welcome.

For more general information and especially background discussion, take a look through the DIY Book Scanner forum: https://forum.diybookscanner.org/

Are there good open source OCR engines around? Would be a starting point for such a program - a good automated way of scanning books seems to be kind of missing.

Tesseract doesn't meet the "good" criterion you listed, in my experience with Chinese and skewed English text.

OCR is the most important part of turning JPG into TXT, but it's the image processing beforehand that I'm less familiar with (deskew, text detection). Robotically automated scanning isn't practical in my opinion; it could easily get thrown off by thicker pages for photos in the middle of a book, for example. Digitisation will always be a partly manual process, I just hope the existing images can be processed better.

If you're not in Germany and are interested in digitisation, I think Project Gutenberg are worth looking into.
