
Doc2text – Detect text blocks and OCR poorly scanned PDFs in bulk - jlsutherland
https://github.com/jlsutherland/doc2text
======
onetwotree
I worked on a PDF text extraction project once, with scientific articles as
the primary target.

That stuff is _really hard_ , even when the text is ostensibly present in the
PDF (as opposed to the PDF being an image of text). The thing is, it's all just
"draw text" commands in the content streams (basically PostScript-like programs).
The text commands appear in no particular order and you generally have to
compute the layout to see even where the spaces are (which is still a guessing
game, because the PDF generator will vary the space width to achieve a
visually pleasing format).
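To make the point concrete, here's a toy sketch (not real PDF parsing - a real content stream is compressed and far messier) of why reading order has to be reconstructed: the `Tj` text-show operators can appear in any order, so you sort by the `Td` positions to recover it.

```python
import re

# A toy, uncompressed PDF content stream fragment. Each line positions
# the cursor (Td) and shows text (Tj). Note the draw order is the
# reverse of reading order.
content_stream = b"""
BT /F1 12 Tf 72 100 Td (world) Tj ET
BT /F1 12 Tf 72 700 Td (Hello) Tj ET
"""

# Pull out (x, y, text) triples from "x y Td (text) Tj" sequences.
ops = re.findall(rb"(\d+) (\d+) Td \((.*?)\) Tj", content_stream)
spans = [(int(x), int(y), t.decode()) for x, y, t in ops]

# PDF's y axis points up, so larger y is nearer the top of the page:
# sort descending by y to get reading order.
spans.sort(key=lambda s: -s[1])

print(" ".join(t for _, _, t in spans))  # Hello world
```

A real extractor additionally has to track the text matrix, font metrics, and glyph widths to decide where word breaks fall, which is the guessing game described above.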

OCR is an approach that wasn't quite ready for prime time back then, so it's
cool to see people working on it!

------
ldenoue
Also look at Ocropy from Tom Breuel who has a page segmenter that identifies
columns. [https://github.com/tmbdev/ocropy](https://github.com/tmbdev/ocropy)

~~~
jlsutherland
Ocropus is an incredible tool. I highly recommend it!

------
zaphar
This looks awesome. I've got a ghetto full-text search indexer I've written
that uses OCR as a fallback if it can't extract text from a PDF, but as you
say, many times the quality is so bad it's a lost cause. I wonder if I can
leverage this to improve the indexing.

~~~
malux85
Tesseract is, sadly, quite out of date. If you would like help implementing
Deep Learning models for OCR, let me know.

~~~
marai2
Would you mind commenting why Tesseract is out of date? I see developers are
still active on it:

[https://github.com/tesseract-ocr/tesseract/commits/master](https://github.com/tesseract-ocr/tesseract/commits/master)

~~~
malux85
There are people still working on VAX systems - are VAX systems not out of
date by the same logic?

Spend 1 afternoon with tesseract, and 1 afternoon with Google's text
recognition API. The quality of the results is night and day.

I would love there to be an open source one that can compete, which is why I
said "sadly". But if you're interested in quality of results, Deep Learning is
the way to go.

~~~
acdha
Google has been one of the biggest contributors to Tesseract – has that
changed? My understanding – which could be years out of date – was that Google
Books used Tesseract but that most of their effort had gone into either
advanced image preprocessing or large-scale training.

~~~
malux85
Yes, you're right - they were one of the biggest contributors until 12-18
months ago (roughly).

Now it's deep learning; it's at the point where there's no point in spending
ages manually 'feature engineering'. Just throw some GPU (and soon-to-be TPU)
processing power at it.

~~~
acdha
Thanks for the update – has any of that been described in public?

------
redwards510
Can you please explain what makes this utility different than other OCR
solutions? I've seen quite a few coming out recently. What is the secret sauce
that makes this more than just a frontend for tesseract?

~~~
jlsutherland
The quick and dirty: OCR solutions exist, but to work well they generally need
a little hand-holding. You have to give your OCR software a clean image if you
want clean results (this goes for tesseract, ocropus, etc.). The problem is
that scans are rarely so clean...they are crooked, there is a hand in them,
there is half of another page in them, etc., and common OCR software doesn't
correct for this well out of the box.

doc2text bridges the gap between the initial scan and the scan you should pass
to your OCR engine, which greatly improves recognition accuracy. It takes that
dirty scan, identifies the text region, fixes skew, performs a few
pre-processing operations that help with common OCR binarization, and
BOOM...data that was inaccessible is now accessible.

Try running tesseract or ocropus on a bad document scan before and after using
doc2text...you'll see what I mean!

P.S. I should add...the end-user is also a little different from strict OCR
packages/wrappers. Users might be admin staff or academics (or kids like my
RAs) who want a simple, straightforward API to extract the text they need from
poorly scanned documents. doc2text is built with this need in mind.

~~~
ashkulz
Do you have a comparison with unpaper, which seems to do almost the same
thing?

------
placeybordeaux
Some examples would be really informative as to how well it works.

------
cpr
Pretty impressive leverage here: only a few dozen lines of code in total,
using other OSS libraries.

~~~
jlsutherland
Thanks!

