
Ask HN: Does software exist to digitize scanned books and articles? - chmaynard
I've noticed that when I view a PDF of an old book or article, often I can't select and copy text. I assume this is because (1) text selection is disabled somehow, or (2) the document is essentially just a collection of images. Does software exist that can convert a printed page with a lot of math notation into a truly digital document? I'm looking for the same level of quality as TeX. Thanks!
======
espeed
Yes, PDFMiner in Python
[https://github.com/euske/pdfminer](https://github.com/euske/pdfminer)

Apache PDFBox in Java [https://pdfbox.apache.org](https://pdfbox.apache.org)

Previous discussion
[https://news.ycombinator.com/item?id=11327493](https://news.ycombinator.com/item?id=11327493)

For a list of others, see [http://okfnlabs.org/blog/2016/04/19/pdf-tools-extract-text-a...](http://okfnlabs.org/blog/2016/04/19/pdf-tools-extract-text-and-data-from-pdfs.html)
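
To tell the two cases in the question apart (selection disabled vs. pages that are just images), one rough approach is to run a text extractor and see whether anything comes back. Below is a minimal sketch, assuming PDFMiner's `pdf2txt.py` script is installed and on your PATH. The helper names and the `min_chars` threshold are made up for illustration and are not part of PDFMiner.

```python
import shutil
import subprocess
import sys


def looks_image_only(extracted_text: str, min_chars: int = 20) -> bool:
    """Heuristic (assumption): almost no visible characters suggests scanned images.

    min_chars is an arbitrary illustrative threshold, not a PDFMiner setting.
    """
    visible = "".join(extracted_text.split())  # drop all whitespace, incl. form feeds
    return len(visible) < min_chars


def extract_text_layer(pdf_path: str) -> str:
    """Run pdf2txt.py on the file and return whatever text it writes to stdout."""
    tool = shutil.which("pdf2txt.py")
    if tool is None:
        raise RuntimeError("pdf2txt.py not found on PATH (install PDFMiner first)")
    result = subprocess.run(
        [tool, pdf_path], capture_output=True, text=True, check=True
    )
    return result.stdout


if __name__ == "__main__" and len(sys.argv) == 2:
    text = extract_text_layer(sys.argv[1])
    if looks_image_only(text):
        print("Little or no text layer found; this PDF probably needs OCR.")
    else:
        print("Text layer found; plain extraction may be enough.")
```

If the extractor returns next to nothing, the file is most likely a stack of page images and you need OCR rather than a PDF text extractor.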

~~~
chmaynard
According to the PDFMiner site, pdf2txt.py cannot recognize text drawn as
images, since that would require optical character recognition (OCR). I'm
interested in software that combines OCR with some sort of math notation
rendering engine.

~~~
espeed
For handwritten character recognition, see:

[https://www.tensorflow.org/tutorials/mnist/beginners/](https://www.tensorflow.org/tutorials/mnist/beginners/)
(also google "tensorflow ocr")

[http://yann.lecun.com/exdb/mnist/](http://yann.lecun.com/exdb/mnist/)

CROHME: Competition on Recognition of Online Handwritten Mathematical
Expressions
[http://www.isical.ac.in/~crohme/](http://www.isical.ac.in/~crohme/)

Closed-source API: [http://mathpix.com](http://mathpix.com)
[https://photomath.net/en/](https://photomath.net/en/)

Best off-the-shelf OCR (originally developed by HP, now Google):

[https://github.com/tesseract-ocr/tesseract](https://github.com/tesseract-ocr/tesseract)

[https://github.com/tesseract-ocr/tesseract/wiki](https://github.com/tesseract-ocr/tesseract/wiki)
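
For plain scanned text, a minimal sketch of driving the Tesseract command line from Python might look like this. It assumes the `tesseract` binary is installed and that each page has already been exported as an image. The function names and file names are hypothetical. Note that this only adds an ordinary text layer; it does not produce TeX, and formulas will likely come out garbled.

```python
import shutil
import subprocess
from typing import List


def tesseract_command(image_path: str, output_base: str, lang: str = "eng") -> List[str]:
    """Build the CLI call: tesseract <image> <outputbase> -l <lang> pdf

    The trailing "pdf" config asks Tesseract to write <outputbase>.pdf,
    a copy of the page image with a selectable text layer on top.
    """
    return ["tesseract", image_path, output_base, "-l", lang, "pdf"]


def ocr_to_searchable_pdf(image_path: str, output_base: str, lang: str = "eng") -> str:
    """Run Tesseract on one page image and return the path of the PDF it writes."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract not found on PATH")
    subprocess.run(tesseract_command(image_path, output_base, lang), check=True)
    return output_base + ".pdf"


if __name__ == "__main__":
    # "page-001.png" is a placeholder name for a scanned page image.
    print(" ".join(tesseract_command("page-001.png", "page-001")))
```

Running the script as-is only prints the command it would execute; call `ocr_to_searchable_pdf` on a real page image to actually run the OCR.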

Two Clojure talks...

Machine Learning Live - Mike Anderson
[https://www.youtube.com/watch?v=QJ1qgCr09j8](https://www.youtube.com/watch?v=QJ1qgCr09j8)

Adventures in Understanding Documents - Scott Tuddenham
[https://www.youtube.com/watch?v=94NjRg8zoCA](https://www.youtube.com/watch?v=94NjRg8zoCA)

