
Tesseract doesn't meet the "good" criterion you listed, in my experience with Chinese and skewed English text.

OCR is the most important part of turning JPG into TXT, but it's the image processing beforehand (deskew, text detection) that I'm less familiar with. Robotically automated scanning isn't practical in my opinion; it could easily be thrown off by, for example, the thicker photo pages in the middle of a book. Digitisation will always be a partly manual process; I just hope the existing images can be processed better.

If you're not in Germany and are interested in digitisation, I think Project Gutenberg are worth looking into.



