Ask HN: OCR framework for extracting formatted text?
146 points by crocodiletears 12 days ago | 42 comments
I'm a serial information hoarder, and often use screenshots in order to store comments, passages and fragments of conversations I find useful or insightful. This works well if I want to reference something recent, but obviously doesn't scale well. I'd like to integrate these into my personal archive, but don't know any frameworks (preferably for Go, Node, or Python) which could automatically extract the text from the images while retaining its formatting. I'm not against doing some image preprocessing myself, but I don't feel comfortable passing the images to a 3rd party service, since a portion of the images contain private or sensitive information that I can't readily sort out of my collection.

I built an application for exactly this. It's called A Personal Search Engine, APSE for short.[0]

It OCRs screenshots and stores the text in a search index, so you can query by keyword, date, boolean operators, the whole shebang.

It's all local. It is really useful for me - yesterday it saved me after Firefox wigged out and lost all my tabs. It's in a great place to try out, and I am actively developing it.

[0] https://apse.io

That's cool. How can I make sure it does not send my data somewhere? :)


You could block it at the firewall - same as you would for any application.

Ahh the paranoia of HN strikes again.

Maybe, stop and think before you ask this. Someone offers up an example of their hard work and you instantly accuse them of being a malware author that steals your data. Nice.

I would rather call it the new-found paranoia most legal departments now have, when IT dept. mentions rolling out new, unknown, non-auditable software in the company from a new vendor that wasn't grand-fathered in (= the legacy windows/cisco deployments). I'm happy about it. I'm just waiting for them to forbid closed-source firmwares/hardware.

To extract text from photos and non-OCRed PDFs, Tesseract[1] with a language-specific model[2] never fails me.

I use my shell utility[3] to automate the workflow with ImageMagick and Tesseract, with an intermediate step that converts each page to a monochrome TIFF. Extracting each page into a separate text file lets me ag/grep for a phrase and then easily find it again in the original PDF.
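The per-page layout described above makes it straightforward to map a search hit back to a page number. A minimal sketch, assuming one text file per page named like `page-0001.txt` (a hypothetical naming scheme, not necessarily what pdf2txt produces):

```python
import re
from pathlib import Path

# Hypothetical naming scheme: page-0001.txt, page-0002.txt, ...
PAGE_FILE = re.compile(r"page-(\d+)\.txt$")


def find_pages(text_dir, phrase):
    """Return the sorted page numbers whose text file contains the phrase."""
    # Normalize whitespace and case on both sides of the comparison.
    needle = " ".join(phrase.split()).casefold()
    hits = []
    for path in Path(text_dir).iterdir():
        match = PAGE_FILE.search(path.name)
        if not match:
            continue  # skip files that are not per-page text files
        text = path.read_text(encoding="utf-8", errors="replace")
        # Collapse whitespace so a phrase split across a line break still matches.
        haystack = " ".join(text.split()).casefold()
        if needle in haystack:
            hits.append(int(match.group(1)))
    return sorted(hits)
```

The returned numbers can then be used to jump to the matching page in the original PDF.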

Having greppable libraries of books on various domains and not having to crawl through the web search each time is very useful and time-saving.

[1] https://tesseract-ocr.github.io/

[2] https://github.com/tesseract-ocr/tessdata

[3] https://github.com/undebuggable/pdf2txt

If you're interested in grepping PDFs (among other formats) another option is ripgrep-all.


I struggled to get tesseract to OCR my image based PDFs directly, so resorted to using GhostScript to extract the pages to pngs which I then put through tesseract. As an added bonus though, I gained the ability to have a thumbnail png for the search front end.
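A rough sketch of that Ghostscript-then-Tesseract approach, driven from Python. The file naming, the 300 DPI default, and the helper names are assumptions for illustration, not the commenter's actual setup; both `gs` and `tesseract` need to be on the PATH.

```python
import subprocess
from pathlib import Path


def ghostscript_cmd(pdf_path, out_dir, dpi=300):
    """Build a Ghostscript command that renders each page to out_dir/page-NNNN.png."""
    pattern = str(Path(out_dir) / "page-%04d.png")
    return [
        "gs", "-dNOPAUSE", "-dBATCH", "-dSAFER",
        "-sDEVICE=png16m",           # 24-bit colour PNG output
        f"-r{dpi}",                  # rendering resolution in DPI
        f"-sOutputFile={pattern}",
        str(pdf_path),
    ]


def tesseract_cmd(png_path, lang="eng"):
    """Build a Tesseract command; it writes <image stem>.txt next to the image."""
    png_path = Path(png_path)
    return ["tesseract", str(png_path), str(png_path.with_suffix("")), "-l", lang]


def ocr_pdf(pdf_path, out_dir, dpi=300, lang="eng"):
    """Render every page to PNG, then OCR each PNG into a text file."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(ghostscript_cmd(pdf_path, out_dir, dpi), check=True)
    for png in sorted(out_dir.glob("page-*.png")):
        subprocess.run(tesseract_cmd(png, lang), check=True)
```

The rendered PNGs stay on disk, so they can double as thumbnails (after downscaling) for a search front end.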

> so resorted to using GhostScript to extract the pages to pngs which I then put through tesseract

I understand you use this to extract text from non-OCRed PDFs, especially ones consisting of low-quality scans or photos (e.g. low resolution, JPEG artifacts).

Occasionally, passing a higher resolution to ImageMagick when converting a page to TIFF helped, but this sounds like a reasonable fallback as well.

I tried using Tesseract to OCR just some numbers on an almost plain background, and it failed around 2-3% of the time. Which made the whole thing useless, because I needed 100% correctness.
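For digits-only input, one thing worth trying is restricting Tesseract's character set and rejecting any reading that does not have the expected shape. The sketch below assumes a fixed-length number on a single line. `expected_length` and the helper names are made up for illustration, and depending on the Tesseract version and engine the whitelist may be ignored. It cannot guarantee 100% correctness: a misread digit of the right length still passes, so it only catches malformed readings.

```python
import re
import subprocess


def digits_cmd(image_path):
    """Build a Tesseract command for a single line of digits, printed to stdout."""
    return [
        "tesseract", str(image_path), "stdout",
        "--psm", "7",                                # treat the image as one text line
        "-c", "tessedit_char_whitelist=0123456789",  # restrict output to digits
    ]


def accept_reading(raw, expected_length):
    """Return the digits if the output is exactly expected_length digits, else None."""
    candidate = raw.strip()
    if re.fullmatch(rf"[0-9]{{{expected_length}}}", candidate):
        return candidate
    return None  # malformed reading: flag for manual review


def read_number(image_path, expected_length):
    """Run Tesseract on the image and validate the result."""
    result = subprocess.run(
        digits_cmd(image_path), capture_output=True, text=True, check=True
    )
    return accept_reading(result.stdout, expected_length)
```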

Your bash script is totally broken. It doesn't properly parse command line arguments, and it ignores the page range set with -t and -f.

This should work now, thanks.

ocrmypdf and friends.

I've built an archival system based around Tesseract and PostgreSQL. It takes Images/PDFs, either scanned or generated, and rebuilds them as searchable PDFs before being extracted and inserted into Postgres' full-text search. I keep all of the original media because disk is cheap.

Originally I used Tesseract directly. But I found that ocrmypdf did a better job than my home-grown pipeline, so I switched.

I also built a system that extracted structured and unstructured text from images/pdfs. For the generated pdfs, I found pdftotext could pull with 100% fidelity, and so that was 'option #1'. for scanned-images-saved-as-pdfs, then tesseract could sometimes extract with 90+% accuracy. But never 100%. Combining pdftotext (with the right flags set) with some of the other associated pdf-tools, we were able to achieve what we were after: Building a searchable DB and auto-informing corpus of information derived entirely from various pdf sources. All in-house. No sending off to 3rd parties.
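The "pdftotext first, OCR as a fallback" decision could look roughly like this. It is a sketch under assumptions: the 50-characters-per-page threshold is an arbitrary illustrative value, `ocr` is whatever OCR callable you plug in, and the only pdftotext options used are `-layout` and `-` (write to stdout), which may not be the flags the commenter used.

```python
import subprocess


def pdftotext_cmd(pdf_path):
    """Build a pdftotext command that keeps the layout and writes to stdout."""
    return ["pdftotext", "-layout", str(pdf_path), "-"]


def looks_scanned(text, pages, min_chars_per_page=50):
    """Heuristic: very little alphanumeric text per page suggests no text layer."""
    alnum = sum(ch.isalnum() for ch in text)
    return alnum < min_chars_per_page * max(pages, 1)


def extract_text(pdf_path, pages, ocr):
    """Try the embedded text layer first; fall back to ocr(pdf_path) if it is empty."""
    result = subprocess.run(
        pdftotext_cmd(pdf_path), capture_output=True, text=True, check=True
    )
    if looks_scanned(result.stdout, pages):
        return ocr(pdf_path)  # scanned images saved as PDF: needs real OCR
    return result.stdout      # generated PDF: the text layer is authoritative
```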

> For the generated pdfs, I found pdftotext could pull with 100% fidelity, and so that was 'option #1'. for scanned-images-saved-as-pdfs, then tesseract could sometimes extract with 90+% accuracy.

I arrived at a similar conclusion, although I have never bothered with a DB or a locally running web interface. Simply grepping the text files works flawlessly for me.

Cool, I've done exactly the same with ocrmypdf, with a Django web app to search through the scanned documents (around 30k documents and 200k pages).

Tesseract is the best FOSS one I found when I looked a while back. I don't expect a superior FOSS one any time soon unless a commercial one gets open-sourced, or someone releases one built on deep learning.

Tesseract 4 has switched to using a deep learning engine by default.

Very nice. This is quite similar to what I've been building for myself.

Here's a blog post showing self hosted PyTesseract finding text in an image and preserving the format: https://stackabuse.com/pytesseract-simple-python-optical-cha...

There's a reason why the external services are popular though...lots of training data and tweaks to make them much more accurate. Try the Google demo here, for example: https://cloud.google.com/vision/docs/ocr

If you are (a) willing to take the word bounding boxes and convert them to paragraphs yourself, and (b) okay with a deep learning approach, you may want to give keras-ocr [0] a try.
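Step (a) can be done with a simple geometric pass. The sketch below assumes the detector's output has already been reduced to axis-aligned boxes (`x`, `y` of the top-left corner plus width and height). The `Word` type and the `gap_factor` threshold are hypothetical and this is not keras-ocr's own output format.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x: float  # left edge
    y: float  # top edge
    w: float  # width
    h: float  # height

    @property
    def cy(self):
        """Vertical centre of the box."""
        return self.y + self.h / 2


def group_lines(words):
    """Group words whose vertical centres are close, then sort each line left to right."""
    lines = []
    for word in sorted(words, key=lambda w: w.cy):
        if lines:
            last = lines[-1]
            ref = sum(w.cy for w in last) / len(last)
            height = max(w.h for w in last)
            if abs(word.cy - ref) <= height / 2:
                last.append(word)
                continue
        lines.append([word])
    return [sorted(line, key=lambda w: w.x) for line in lines]


def to_paragraphs(words, gap_factor=1.0):
    """Start a new paragraph when the gap to the previous line exceeds gap_factor line heights."""
    paragraphs = []
    prev_bottom = prev_height = None
    for line in group_lines(words):
        top = min(w.y for w in line)
        text = " ".join(w.text for w in line)
        if prev_bottom is not None and top - prev_bottom <= gap_factor * prev_height:
            paragraphs[-1].append(text)
        else:
            paragraphs.append([text])
        prev_bottom = max(w.y + w.h for w in line)
        prev_height = max(w.h for w in line)
    return ["\n".join(p) for p in paragraphs]
```

Skewed or multi-column pages would need more than this (deskewing, column detection), so treat it as a starting point for clean screenshots.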

Full disclosure: I'm the primary package developer. Shameless plug. :)

[0] https://github.com/faustomorales/keras-ocr

This doesn't meet most of your requirements (Go, Node, Python, and it's a manual process...), but maybe this would be helpful?

On Mac I use a modified version of this Keyboard Maestro script, to OCR a user selected area of the screen.

This script puts the OCR text on the clipboard. I'm sure Keyboard Maestro could automagically append it to a text file or something. I'm kind of a noob with Keyboard Maestro, so I don't know all of its functionality.

I have a couple of variations of this script, including one that uses the Mac's "speak this" command to read the OCR text aloud, as I am a slow reader and an auditory learner.

My father had a bunch of newspaper clippings scanned into the family tree application and wanted the text. I used this method to get the text instead of typing it all out.


OCR techniques are general purpose in trying to map any conceivable text-looking shapes into actual text. Accuracy can vary wildly but the good ones will match against plausible words to eliminate low quality guesses.

Is there an accuracy optimization to be found if I can pre-train the OCR engine to look for a limited set of words instead of the entire dictionary and printable-character space?

The use case I have is OCRing shipping labels for packages that arrive at an office. The set of plausible matches is incredibly small as it is the set of employee names that work in said office.

Further optimizations include reducing the problem space by only considering computer-printed glyphs and not bothering with handwritten labels, and the insight that the distribution of packages follows a power law where a disproportionately small group of people receive the largest number of packages.

The end goal is to perform this entirely on device, with low latency and high accuracy.
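Even without retraining the OCR engine, the small vocabulary can be exploited after the fact by snapping noisy OCR lines to the nearest known name. A minimal sketch using Python's standard `difflib`; the 0.8 cutoff and the sample names in the usage note are illustrative assumptions, and this is post-processing, not a change to the recognizer itself.

```python
import difflib


def match_recipient(ocr_lines, employees, cutoff=0.8):
    """Return the employee name that best matches any OCR line, or None."""
    by_key = {name.casefold(): name for name in employees}
    best_name, best_score = None, 0.0
    for line in ocr_lines:
        # Normalize whitespace and case before comparing.
        key = " ".join(line.split()).casefold()
        for candidate in difflib.get_close_matches(key, list(by_key), n=1, cutoff=cutoff):
            score = difflib.SequenceMatcher(None, key, candidate).ratio()
            if score > best_score:
                best_name, best_score = by_key[candidate], score
    return best_name
```

For example, `match_recipient(["SHIP TO:", "A1ice J0hnson"], ["Alice Johnson", "Bob Smith"])` resolves the misread line to "Alice Johnson". The power-law observation could be folded in by breaking near-ties in favour of frequent recipients.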

Consider looking into language models such as KenLM. It is used by ASR models like wav2letter and DeepSpeech to correct speech-to-text transcripts.

Try https://screenotate.com/

(no affiliation, just a user)

I hate to post my own app but it does do part of what you ask and it does it locally. Nothing is sent to any server.


It does on-device text recognition on your photos, stores on local SQLite database and lets you full text search.
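The "OCR text in local SQLite with full-text search" idea can be sketched in a few lines with Python's built-in `sqlite3`, assuming the underlying SQLite build includes the FTS5 extension (most do, but it is not guaranteed). The table and column names are made up for illustration and are not taken from the app above.

```python
import sqlite3


def open_index(path=":memory:"):
    """Open (or create) a local full-text index of OCRed screenshots."""
    conn = sqlite3.connect(path)
    # path is stored but not indexed; body holds the OCR text.
    conn.execute(
        "CREATE VIRTUAL TABLE IF NOT EXISTS shots USING fts5(path UNINDEXED, body)"
    )
    return conn


def add_shot(conn, path, text):
    """Store the OCR text for one image."""
    conn.execute("INSERT INTO shots (path, body) VALUES (?, ?)", (path, text))
    conn.commit()


def search(conn, query):
    """Return image paths matching an FTS5 query, best match first."""
    rows = conn.execute(
        "SELECT path FROM shots WHERE shots MATCH ? ORDER BY rank", (query,)
    )
    return [row[0] for row in rows]
```

FTS5 queries accept `AND`, `OR`, and `NOT`, so `search(conn, "firefox AND tabs")` behaves like a small boolean search engine, and nothing leaves the machine.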

One problem that I have with OCR is dealing with images of pages that are warped. I have some books that I would like to turn into electronic books, but not enough to justify setting up a book scanning rig (framework, two cameras, platen, etc). Setting up a document camera is fairly easy, but using it to take pictures of a book lying flat on the base produces images where the pages are warped, and most OCR software seems to have problems with warped pages.

After a fair amount of searching I found ScanTailor: https://github.com/4lex4/scantailor-advanced#scan-tailor-adv... which seems to have the capability of dealing with warped page images. I haven't actually gone through the complete workflow with it yet, but it seems to be a very capable OCR package.

I used this[0] in conjunction with Tesseract, and it worked pretty well.

[0] https://github.com/mzucker/page_dewarp

Thank you. This does look like it has an easier workflow than ScanTailor. I'll have to give it a try.

Although https://www.willus.com/k2pdfopt/ is meant for reformatting PDFs to view on e-readers, it does do a reasonable job of extracting text via OCR and storing as a PDF layer. The underlying engine can be either https://github.com/tesseract-ocr/tesseract or http://jocr.sourceforge.net/

I'm interested in drawing bounding boxes around text that can be displayed to the end user. So I don't care about OCR accuracy, only about the ability to detect text accurately across different typefaces and media. Any thoughts on a framework for this that's low latency, under 150 ms or so?

You may set up your OCR service on AWS Lambda.

I wrote a guide on how to do it here:


Hope it helps

Just search for "tesseract GUI". If you are more technical, you can write code around Tesseract. For something free, it's really impressive what Google has done with this in just a few years, making it something the average person can realistically consider using.

ex. https://github.com/tesseract4java/tesseract4java

I know you said you didn't want to upload stuff to third parties, but Amazon Textract works great and supports HIPAA data.

Plenty of fantastic suggestions in the comments, any one of which looks like it could do the trick. Not having any experience in the problem domain, I'm afraid I don't have much to contribute in response, but I look forward to evaluating each framework/service.

Why not upload it to Google Photos? It will do the OCR and make the text on your photos / screenshots searchable with a sweet UI in the browser.

If you still want to grab the text yourself, you can make a copy to Google Keep and use the "grab text" function.

Works for me, I take full screenshots of interesting stuff so the url is still visible when I want to go back to the original.

Obviously I have a paid G Suite account at Google, which comes with a very good set of privacy-protecting rules. No matter how you roll your stack, eventually you are going to be dependent on a third party. Better to use one that offers full encryption and 2FA to lock up your data.


I have a number of screenshots of conversations, documents, and PII that I don't necessarily trust in the hands of third parties, or that I don't feel I'm at liberty to share with them.

Beyond that, as exceptional as G Suite is, I've been making a conscious effort to excise Alphabet/Google services from my life. It's just not a company I trust.

Isn't that data already in the hands of third parties if these are screenshots of conversations and documents, or did you also build that communication stack from the ground up?

I'd frame it like this:

With respect to online conversations - most of them are on the open-web, anyone can see them. I don't care if their content gets out. Private conversations should be kept between their participants, their host, and their host's infrastructure provider.

More saliently however, many of these screenshots contain incidental data which I wouldn't necessarily want to be centralized off of my own hardware. This ranges from the identities of multiple alt-accounts, who they follow on social media, to generic information about my social graph. They also include receipts of much of my online transaction history.

While I'm under no delusion that much of that data doesn't travel all over the universe via data brokers and information sharing agreements, I'm just not comfortable directly handing it all to any one company.

If I was working on a commercial project, I'd leap at the opportunity to outsource the task of content transcription - it would save me time, money, and quite probably give me better results.

But since I want to feed it all into my personal archive, which runs on my own hardware and is as much a learning project as it is a utility, and since I like to keep my personal life as personal as possible, I make a point of keeping everything self-hosted wherever possible.

I'll fully admit that it's paranoid, labor-intensive, likely ineffectual, and by most measures a bit excessive.

But there are few places where one is at liberty to draw a line in the sand anymore with how their data is distributed. This is simply where I've chosen to draw one of mine.

Look I fully agree with you if that is what you want, and you are fully aware of the trade-off you make.

If you pull this off, you are a very talented and skilled engineer. I hope you open-source your solution so that friction is removed for other people with a similar dilemma in the future.

Our time is the only currency we have, and we can pursue activities we love or fear. The line between paranoia and choosing personal freedom is thin and very personal.

I came to the conclusion that I have spent too much time on home-grown solutions for problems others have solved better and cheaper. Getting from "it works 80% of the time" to "it works 99% of the time and I can blindly trust my infra" is the difference between a weekend and a year of full-time work.

I chose G Suite because at least Google offers me a paid option to exclude my account from their advertising data monetization branch.

I do really respect that you make a deliberate effort in this.
