
Ask HN: OCR framework for extracting formatted text? - crocodiletears
I'm a serial information hoarder, and often use screenshots to store comments, passages, and fragments of conversations I find useful or insightful. This works well if I want to reference something recent, but obviously doesn't scale. I'd like to integrate these into my personal archive, but I don't know of any frameworks (preferably for Go, Node, or Python) which could automatically extract the text from the images while retaining its formatting. I'm not against doing some image preprocessing myself, but I don't feel comfortable passing the images to a 3rd-party service, since a portion of the images contain private or sensitive information that I can't readily sort out of my collection.
======
undebuggable
To extract text from photos and non-OCR-ed PDFs, Tesseract[1] with a language-
specific model[2] never fails me.

I use my shell utility[3] to automate the workflow with ImageMagick and
Tesseract, with an intermediate step using monochrome TIFFs. Extracting each
page into a separate text file lets me ag/grep a phrase and then easily find
it back in the original PDF.

Having greppable libraries of books on various domains, without having to
crawl through web search each time, is very useful and time-saving.

[1] [https://tesseract-ocr.github.io/](https://tesseract-ocr.github.io/)

[2] [https://github.com/tesseract-ocr/tessdata](https://github.com/tesseract-ocr/tessdata)

[3]
[https://github.com/undebuggable/pdf2txt](https://github.com/undebuggable/pdf2txt)
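For anyone who wants to reproduce this without the shell utility, here is a rough Python sketch of the same pipeline. The function and file names are mine, not from pdf2txt, and it assumes ImageMagick's `convert` and `tesseract` are on the PATH:

```python
import subprocess
from pathlib import Path

def convert_cmd(pdf_path, out_dir, density=300):
    # ImageMagick: rasterize every PDF page to a monochrome TIFF
    return ["convert", "-density", str(density), str(pdf_path),
            "-monochrome", f"{out_dir}/page-%04d.tif"]

def tesseract_cmd(tiff_path, lang="eng"):
    # Tesseract writes <base>.txt next to the image it is given
    base = str(tiff_path).rsplit(".", 1)[0]
    return ["tesseract", str(tiff_path), base, "-l", lang]

def pdf_to_page_texts(pdf_path, out_dir, lang="eng"):
    """One text file per page, so a grep hit maps back to a page number."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(convert_cmd(pdf_path, out_dir), check=True)
    for tiff in sorted(Path(out_dir).glob("page-*.tif")):
        subprocess.run(tesseract_cmd(tiff, lang), check=True)
```

After a run, `ag "some phrase" out_dir/` points you at the page file, and the file name points back into the original PDF.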

~~~
Jaruzel
I struggled to get Tesseract to OCR my image-based PDFs directly, so I
resorted to using Ghostscript to extract the pages to PNGs, which I then put
through Tesseract. As an added bonus, I gained the ability to have a thumbnail
PNG for the search front end.

~~~
undebuggable
> so resorted to using GhostScript to extract the pages to pngs which I then
> put through tesseract

I understand you use this to extract text from non-OCR-ed PDFs, especially
ones consisting of low-quality scans or photos (e.g. low resolution, JPEG
artifacts).

Occasionally, passing a higher resolution to ImageMagick when converting a
page to TIFF helped, but this sounds like a reasonable fallback as well.

------
ryanfox
I built an application for exactly this. It's called A Personal Search Engine,
APSE for short.[0]

It OCRs screenshots and stores the text in a search index, so you can query by
keyword, date, boolean operators, the whole shebang.

It's all local. It is _really_ useful for me - yesterday it saved me after
Firefox wigged out and lost all my tabs. It's in a great place to try out, and
I am actively developing it.

[0] [https://apse.io](https://apse.io)

~~~
pezo1919
That's cool. How can I make sure it does not send my data somewhere? :)

~~~
Jaruzel
Ahh the paranoia of HN strikes again.

Maybe stop and think before you ask this. Someone offers up an example of
their hard work, and you instantly accuse them of being a malware author that
steals your data. Nice.

~~~
stragies
I would rather call it the new-found paranoia most legal departments now have
when the IT dept. mentions rolling out new, unknown, non-auditable software in
the company from a new vendor that wasn't grandfathered in (= the legacy
Windows/Cisco deployments). I'm happy about it. I'm just waiting for them to
forbid closed-source firmware/hardware.

------
asguy
ocrmypdf and friends.

I've built an archival system based around Tesseract and PostgreSQL. It takes
images/PDFs, either scanned or generated, and rebuilds them as searchable
PDFs, whose text is then extracted and inserted into Postgres's full-text
search. I keep all of the original media because disk is cheap.

Originally I used Tesseract directly. But I found that ocrmypdf did a better
job than my home-grown pipeline, so I switched.
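Not the code from this comment, but the shape of such a pipeline might look roughly like this in Python. The table and column names are illustrative, and (assuming I have the flags right) `--skip-text` tells ocrmypdf to leave pages that already carry a text layer alone:

```python
import subprocess

# Illustrative Postgres full-text-search insert; the schema is made up.
FTS_INSERT = """
INSERT INTO documents (path, body, body_tsv)
VALUES (%s, %s, to_tsvector('english', %s));
"""

def ocrmypdf_cmd(src, dst, lang="eng"):
    # Rebuild the PDF with a hidden, searchable text layer.
    return ["ocrmypdf", "--language", lang, "--skip-text", str(src), str(dst)]

def make_searchable(src, dst):
    subprocess.run(ocrmypdf_cmd(src, dst), check=True)
```

The original media stays on disk untouched; only the rebuilt, searchable copy feeds the index.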

~~~
tuddman
I also built a system that extracted structured and unstructured text from
images/PDFs. For the generated PDFs, I found pdftotext could pull with 100%
fidelity, and so that was 'option #1'. For scanned-images-saved-as-PDFs,
Tesseract could sometimes extract with 90+% accuracy, but never 100%.
Combining pdftotext (with the right flags set) with some of the other
associated PDF tools, we were able to achieve what we were after: building a
searchable DB and auto-informing corpus of information derived entirely from
various PDF sources. All in-house. No sending off to 3rd parties.
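The routing described here (pdftotext first, OCR only when the PDF is really a bag of scans) can be sketched like this. The 100-character threshold and function names are my own assumptions, not the poster's pipeline:

```python
import subprocess

def needs_ocr(text, min_chars=100):
    # A generated PDF yields plenty of embedded text; a pure scan yields
    # little or none, so a simple length check routes between the two paths.
    return len(text.strip()) < min_chars

def extract_text(pdf_path):
    # "-layout" keeps the physical layout; "-" writes the text to stdout
    result = subprocess.run(["pdftotext", "-layout", pdf_path, "-"],
                            capture_output=True, text=True, check=True)
    if needs_ocr(result.stdout):
        # Scanned PDF: hand off to an OCR pipeline (e.g. Tesseract) instead
        raise NotImplementedError("plug in your OCR fallback here")
    return result.stdout
```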

~~~
undebuggable
> For the generated pdfs, I found pdftotext could pull with 100% fidelity, and
> so that was 'option #1'. for scanned-images-saved-as-pdfs, then tesseract
> could sometimes extract with 90+% accuracy.

I arrived at a similar conclusion, although I never bothered with a DB or any
web interface running locally. Simply grepping the text files works
flawlessly for me.

------
tyingq
Here's a blog post showing self-hosted PyTesseract finding text in an image
and preserving the format: [https://stackabuse.com/pytesseract-simple-python-optical-character-recognition/](https://stackabuse.com/pytesseract-simple-python-optical-character-recognition/)

There's a reason why the external services are popular, though: lots of
training data and tweaks make them much more accurate. Try the Google demo
here, for example:
[https://cloud.google.com/vision/docs/ocr](https://cloud.google.com/vision/docs/ocr)

------
flicken
Although [https://www.willus.com/k2pdfopt/](https://www.willus.com/k2pdfopt/)
is meant for reformatting PDFs to view on e-readers, it does a reasonable
job of extracting text via OCR and storing it as a PDF layer. The underlying
engine can be either [https://github.com/tesseract-ocr/tesseract](https://github.com/tesseract-ocr/tesseract) or
[http://jocr.sourceforge.net/](http://jocr.sourceforge.net/)

------
faustomorales
If you are (a) willing to take the word bounding boxes and convert them to
paragraphs yourself, and (b) okay with a deep learning approach, you may want
to give keras-ocr [0] a try.

Full disclosure: I'm the primary package developer. Shameless plug. :)

[0] [https://github.com/faustomorales/keras-ocr](https://github.com/faustomorales/keras-ocr)
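For step (a), one simple heuristic is to bucket the word boxes into lines by vertical position and sort each line left to right. A sketch, assuming predictions arrive as `(word, box)` pairs with four `(x, y)` corner points (the shape keras-ocr's recognizer returns); the pixel tolerance is arbitrary:

```python
def boxes_to_lines(predictions, y_tol=10):
    """Group (word, box) predictions into reading-order lines of text."""
    words = []
    for text, box in predictions:
        xs = [p[0] for p in box]
        ys = [p[1] for p in box]
        # Keep left edge and vertical center of each word box
        words.append((min(xs), (min(ys) + max(ys)) / 2, text))
    # Sort top-to-bottom, then left-to-right
    words.sort(key=lambda w: (w[1], w[0]))
    lines, current, last_y = [], [], None
    for x, y, text in words:
        if last_y is not None and abs(y - last_y) > y_tol:
            # Vertical gap exceeds tolerance: start a new line
            lines.append(" ".join(t for _, t in sorted(current)))
            current = []
        current.append((x, text))
        last_y = y
    if current:
        lines.append(" ".join(t for _, t in sorted(current)))
    return lines
```

Paragraphs can then be recovered by merging consecutive lines whose vertical gaps are small relative to the line height.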

------
nacho_man
This doesn't meet most of your requirements (Go, Node, Python, and it's a
manual process...), but maybe this would be helpful?

On Mac I use a modified version of this Keyboard Maestro script to OCR a
user-selected area of the screen.

The script puts the OCR text on the clipboard. I'm sure Keyboard Maestro
could automagically append it to a text file or something; I'm kinda a noob
with Keyboard Maestro, so I don't know all of its functionality.

I have a couple of variations of this script, one of which uses the Mac's
built-in speech command to read the OCR text aloud, as I am a slow reader
and an auditory learner.

My father had a bunch of newspaper clippings scanned into the family tree
application and wanted the text. I used this method to get the text instead of
typing it all out.

[https://forum.keyboardmaestro.com/t/ocr-user-selected-area-macro-v9-0-1/15054](https://forum.keyboardmaestro.com/t/ocr-user-selected-area-macro-v9-0-1/15054)

------
kamalfariz
OCR techniques are general-purpose, trying to map any conceivable
text-looking shapes into actual text. Accuracy can vary wildly, but the good
ones will match against plausible words to eliminate low-quality guesses.

Is there an accuracy optimization to be found if I can pre-train the OCR
engine to look for a limited set of words instead of the entire dictionary
and printable-character space?

The use case I have is OCRing shipping labels for packages that arrive at an
office. The set of plausible matches is incredibly small, as it is just the
names of the employees who work in said office.

Further optimizations include reducing the problem space by only considering
computer-printed glyphs and not bothering with handwritten labels, and the
insight that the distribution of packages follows a power law, where a
disproportionately small group of people receive the largest number of
packages.

The end goal is to perform this entirely on device, with low latency and high
accuracy.
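Short of retraining the engine, a cheap post-processing step is to snap each OCR'd name onto the small roster with fuzzy matching, and let a package-count prior break near-ties. A standard-library-only sketch; all names and numbers are made up:

```python
import difflib

def match_recipient(ocr_text, roster, prior=None, cutoff=0.6):
    """Snap noisy OCR output to the closest roster entry, or None."""
    prior = prior or {}
    by_lower = {name.lower(): name for name in roster}
    # Keep only candidates whose string similarity clears the cutoff
    candidates = difflib.get_close_matches(
        ocr_text.lower(), list(by_lower), n=3, cutoff=cutoff)
    if not candidates:
        return None

    def score(cand):
        # String similarity first; historical package count breaks ties
        ratio = difflib.SequenceMatcher(None, ocr_text.lower(), cand).ratio()
        return (ratio, prior.get(by_lower[cand], 0))

    return by_lower[max(candidates, key=score)]
```

Because the roster is tiny, even brute-force similarity against every name is fast enough to run on-device with low latency.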

~~~
hsson
Consider looking into language models such as KenLM. It is used by ASR
models like wav2letter and DeepSpeech to correct speech-to-text transcripts.

------
kranner
Try [https://screenotate.com/](https://screenotate.com/)

(no affiliation, just a user)

------
inetsee
One problem that I have with OCR is dealing with images of warped pages. I
have some books that I would like to turn into electronic books, but not
enough to justify setting up a book-scanning rig (framework, two cameras,
platen, etc.). Setting up a document camera is fairly easy, but using it to
take pictures of a book lying flat on the base produces images where the
pages are warped, and most OCR software seems to have problems with warped
pages.

After a fair amount of searching I found ScanTailor:
[https://github.com/4lex4/scantailor-advanced#scan-tailor-advanced](https://github.com/4lex4/scantailor-advanced#scan-tailor-advanced)
which seems to have the capability of dealing with warped page images. I
haven't actually gone through the complete workflow with it yet, but it
seems to be a very capable package for cleaning up scans before OCR.

~~~
umvi
I used this[0] in conjunction with Tesseract, and it worked pretty well.

[0]
[https://github.com/mzucker/page_dewarp](https://github.com/mzucker/page_dewarp)

~~~
inetsee
Thank you. This does look like it has an easier workflow than ScanTailor. I'll
have to give it a try.

------
coderguy123
I hate to post my own app, but it does part of what you ask, and it does it
locally. Nothing is sent to any server.

[https://www.dizzybits.com/Photoplex](https://www.dizzybits.com/Photoplex)

It does on-device text recognition on your photos, stores the results in a
local SQLite database, and lets you do full-text search.

------
Jugurtha
Site: [https://openpaper.work/](https://openpaper.work/)

Repo:
[https://gitlab.gnome.org/World/OpenPaperwork/paperwork](https://gitlab.gnome.org/World/OpenPaperwork/paperwork)

------
Brainsnail
[https://github.com/axa-group/Parsr](https://github.com/axa-group/Parsr)

------
FloatArtifact
I'm interested in drawing bounding boxes around text that can be displayed
to the end user. In this way I don't care about OCR accuracy, but about the
ability to detect text accurately across different kinds of type. Any
thoughts on a framework for this that's low latency, under 150 ms or so?

------
jangia
You may set up your own OCR service on AWS Lambda.

I wrote a guide on how to do it here:

[https://typless.com/2020/05/21/tesseract-on-aws-lambda-ocr-as-a-service/](https://typless.com/2020/05/21/tesseract-on-aws-lambda-ocr-as-a-service/)

Hope it helps.

------
cl0rkster
Just search for "tesseract GUI". If you are more technical, you can write
code around Tesseract. For what you get for free, it's really impressive
what Google has done with this in just a few years to make it something that
the average person can really consider using.

ex.
[https://github.com/tesseract4java/tesseract4java](https://github.com/tesseract4java/tesseract4java)

------
misiti3780
I know you said you didn't want to upload stuff to third parties, but Amazon
Textract works great and supports HIPAA data.

------
crocodiletears
Plenty of fantastic suggestions in the comments, any one of which looks like
it could do the trick. Not having any experience in the problem domain, I'm
afraid I don't have much to contribute in response, but I look forward to
evaluating each framework/service.

------
lowdose
Why not upload it to Google Photos? It will do the OCR and make the text on
your photos/screenshots searchable, with a sweet UI in the browser.

If you still want to grab the text yourself, you can make a copy to Google
Keep and use the "grab text" function.

Works for me. I take full screenshots of interesting stuff, so the URL is
still visible when I want to go back to the original.

Obviously I have a paid G Suite account at Google, which comes with a very
good set of privacy-protecting rules. No matter how you roll your stack,
eventually you are going to be dependent on a 3rd party. Better to use one
that offers full encryption and 2FA to lock up your data.

[https://gsuite.google.com/learn-more/security/security-whitepaper/page-6.html](https://gsuite.google.com/learn-more/security/security-whitepaper/page-6.html)

~~~
crocodiletears
I have a number of screenshots of conversations, documents, and PII that I
don't necessarily trust in the hands of third parties, and that I don't feel
I'm at liberty to share with them.

Beyond that, as exceptional as G Suite is, I've been making a conscious
effort to excise Alphabet/Google services from my life - it's just not a
company I trust.

~~~
lowdose
Isn't that data already in the hands of 3rd parties when it's screenshots of
conversations and documents, or did you also build that communication stack
from the ground up?

~~~
crocodiletears
I'd frame it like this:

With respect to online conversations - most of them are on the open web;
anyone can see them. I don't care if their content gets out. Private
conversations should be kept between their participants, their host, and
their host's infrastructure provider.

More saliently, however, many of these screenshots contain incidental data
which I wouldn't necessarily want centralized off of my own hardware. This
ranges from the identities of multiple alt accounts and who they follow on
social media to general information about my social graph. They also include
receipts of much of my online transaction history.

While I'm under no delusion that much of that data doesn't travel all over the
universe via data brokers and information sharing agreements, I'm just not
comfortable directly handing it all to any one company.

If I was working on a commercial project, I'd leap at the opportunity to
outsource the task of content transcription - it would save me time, money,
and quite probably give me better results.

But since I want to feed it all into my personal archive, which runs on my own
hardware and is as much a learning project as it is a utility, and since I
like to keep my personal life as personal as possible, I make a point of
keeping everything self-hosted wherever possible.

I'll fully admit that it's paranoid, labor-intensive, likely ineffectual, and
by most measures a bit excessive.

But there are few places where one is at liberty to draw a line in the sand
anymore with how their data is distributed. This is simply where I've chosen
to draw one of mine.

~~~
lowdose
Look, I fully agree with you if that is what you want and you are fully
aware of the trade-off you are making.

If you pull this off, you are a very talented, skilled engineer. I hope you
open-source your solution so friction is removed for other people with a
similar dilemma in the future.

Our time is the only currency we have, and we can pursue activities we love
or fear. The line between paranoia and choosing personal freedom is thin and
very personal.

I came to the conclusion for myself that I have spent too much time on
home-grown solutions to problems others have solved better and cheaper.
Getting from "it works 80% of the time" to "99% and I can blindly trust my
infra" is the difference between a weekend and a year of full-time work.

I chose G Suite because at least Google offers me a paid option to exclude
my account from their advertisement-data-monetizing branch.

I do really respect that you make a deliberate effort in this.

