Ask HN: Open source OCR library?
264 points by whbv on June 25, 2015 | 98 comments
Does anyone have a recommendation for a good character recognition library? I'm trying to pull text out of images. I tried ocrad.js but haven't had much success with it.

As others pointed out, Tesseract with OpenCV (for identifying and cropping the text region) is quite effective. On top of that, Tesseract is fully trainable with custom fonts.

In our use case, we've mostly had to deal with handwritten text, and that's where none of them really did well. Your next best bet would be to use HOG (Histogram of Oriented Gradients) along with SVMs. OpenCV has really good implementations of both.

Even then, we've had to write extra heuristics to disambiguate between 2 and z, s and 5, etc. That was too much work and a lot of if-else. We're currently putting our efforts into CNNs (Convolutional Neural Networks). As a start, you can look at Torch or Caffe.
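To give a flavor of what such a heuristic can look like, here is a minimal sketch. It assumes (hypothetically) that an ambiguous character should follow the majority class of the unambiguous characters in the same token. The confusion tables and the function name are made up for illustration and are not the heuristics described above.

```python
# Hypothetical confusion pairs; a real system would extend these per font or writer.
TO_DIGIT = {"z": "2", "Z": "2", "s": "5", "S": "5"}
TO_LETTER = {"2": "z", "5": "s"}

def disambiguate(token: str) -> str:
    """Resolve ambiguous characters from the majority class of the unambiguous ones."""
    # Count only characters that cannot be confused with the other class.
    digits = sum(c.isdigit() and c not in TO_LETTER for c in token)
    letters = sum(c.isalpha() and c not in TO_DIGIT for c in token)
    if digits > letters:
        return "".join(TO_DIGIT.get(c, c) for c in token)
    if letters > digits:
        return "".join(TO_LETTER.get(c, c) for c in token)
    return token  # no clear majority: leave the token untouched
```

Tokens with no clear majority (for example "25") are returned unchanged, which is exactly where this kind of rule runs out and the if-else pile starts growing.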

In my experience, CNNs offer the best performance. They are also easy to treat as a black box with many tunable parameters.

But the existing frameworks are mostly bad as engineering products. Caffe, for example, calls `exit` every time an error occurs, including recoverable ones like files that do not exist.

Yes, even we found Caffe to be overwhelming. This library seems promising, considering it has no dependencies and is quite portable to Android or iOS: https://github.com/nyanp/tiny-cnn

Does it have to be open-source? If free, but not trainable and restricted to Windows apps/phone is good enough, then I recommend the Microsoft OCR library. It gives you very, VERY good results out of the box. An excellent piece of work from Microsoft Research. To test it, see for example https://ocr.a9t9.com/ which uses Microsoft OCR inside.

And for comparison, an OCR application with Tesseract inside, which has a dramatically lower text recognition rate: http://blog.a9t9.com/p/free-ocr-windows.html

(Disclaimer: both links are my little open-source side projects)

Are you talking about this library from Microsoft: https://www.nuget.org/packages/Microsoft.Windows.Ocr/ ?


That https://ocr.a9t9.com/ link worked pretty impressively. I did a screen shot and uploaded it to test it out. Nice.

How does the PDF OCR process compare to images? I uploaded a sample PDF with very clear sans-serif text (printed to PDF from a webpage) and there seem to be some odd substitutions: "prohibitecL" instead of "prohibited", "ac" instead of "QC" (as part of an address), random clipping of the first letter in a few lines, and random use of a capital i instead of 1.

Overall very good, I'm just wondering if the library is better with image files than PDFs?

Interesting... I see it now. I assume some issue during the PDF to image conversion in the web app. PDF support is just a few days old.

The OCR library itself supports only image formats as input and is "innocent" with regards to this issue ;)

I tried your app with this image: http://cdn.swapweb.com.ar/estilo-web.net/publis/dell_inspiro... and the results were impressive!

Much much better than what I can get with tesseract. Would love to have it as an API service.

I tried using Tesseract and could not get it to work reliably. I tried a bunch of different pre-processing techniques and it turned into a very frustrating experience.

When I compared Tesseract to Abbyy, the difference was night and day. Abbyy straight out of the box got me 80%-90% accuracy of my text. Tesseract got around 75% at best with several layers deep of image pre-processing.

I know you said open source, and just wanted to say, I went down that path too and discovered in my case, proprietary software really was worth the price.

A counter-anecdote: I've used Tesseract for a largish project and it's worked wonderfully. Typically, on a full page of text, there'd be one or two minor errors, easily 99% accuracy. Irritatingly for me, but also impressively, it seemed to detect ligatures correctly (it used the Unicode 'ﬂ' ligature instead of 'fl').

That unicode ligature would be an interesting surprise if you wanted plain text.
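If plain text is what you want, one way to fold such ligatures back is Unicode compatibility normalization. A minimal sketch using Python's standard library follows (the function name is just for illustration). Note that NFKC also rewrites other compatibility characters, such as full-width letters and superscripts, so it may change more than ligatures.

```python
import unicodedata

def fold_ligatures(text: str) -> str:
    """Expand compatibility characters such as U+FB02 (the fl ligature) into plain letters."""
    return unicodedata.normalize("NFKC", text)

# "\ufb02" is the single-codepoint fl ligature an OCR engine may emit.
print(fold_ligatures("over\ufb02ow"))
```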

Exact same experience here. Omnipage also produced excellent results.

Finereader engine Abbyy for Linux looks fine, but I can't find price data, is it <10k/year? One time license?

You can get the Abbyy OCR Cloud SDK, and it's $0.10 per page (goes down with volume).



Abbyy, and most good commercial OCR, charge a runtime royalty based on the number of cores you are deploying to. Depends on volume.

And they limit you to scanned pages per year on the corporate/server offerings. Doubly-so for all their dev/api license options.

And the kicker: You can't buy those licenses from them directly. They put you in contact with some random monopoly local distributor that usually has mandatory "training" charges.

Messy, and I'm planning on staying away from that with a ten-foot pole.

If there is no opensource/free software with the same quality, what then? What are you using as an OCR server side system on Linux? I'm sure not good enough to write my own OCR better than Abbyy.

As the others in the thread have mentioned, constrain your problem as a computer-vision one, segmenting nice pieces of work for Tesseract, along with some nice training data and possibly human validation if that's feasible.

All do-able within Linux.

As the parent of this comment thread mentions, Tesseract is not great for mass usage because of its error rate; Abbyy is much better. I would be interested in experience, not opinion.

Where do you see this? I'm actually thinking of buying the software right now and I explicitly asked what other charges there are besides the initial fee (~$100) and they said there weren't any. Am I not asking the right question?

At my previous job we used them (on Windows).

I seriously doubt that it's just $100 -- but, I guess if you are making an application that you are just going to run yourself on a single machine, it would not cost much. The costs tended to increase with the number of computers you were deploying to. Or perhaps we used something more capable (does that come with all languages -- including Chinese/Japanese/Korean and RTL languages).

Aye, I just had a back and forth with a tech rep there. They advised me at the end to contact sales to 100% understand, but the gist I got was that the FineReader Engine is $100 and it can do anything the SDK can do, but you have limited functionality in how it can operate with other programs (on Mac, all you get is Automator) and it can't be scaled to many cores. If you don't need that, then it seems to be a pretty damn good deal. Will talk to the sales rep first though.

What kind of problem were you dealing with? Were you bumping up against any constraints?

We were an image-processing SDK with OCR add-ons which were us reselling 3rd party SDKs with integration to our codecs and the processing code. It needed to be completely general purpose use in any .NET program (web, service, desktop, console, etc) since we didn't really want to get into usage with our customers (we had no royalties or usage restrictions on our part).

I think we opted for the customer actually obtaining their Abbyy license from Abbyy in the end because of the licensing mis-match. We sold just the wrapper.

Our parent company was a big user of Abbyy, and I think had a totally custom deal. They needed it for all of the language support and similarly wanted the full power to run it at high-speed and inside of .NET programs.

Can you point me to some prices, per core / core-year?

This is from years ago, but it was $4,900 for an SDK license for development and testing, and then the runtime licenses that you actually deploy are negotiable depending on pages/month, cores, languages, and other options. I think if you're just buying 1 runtime license they try to make it so it is around $5K so the total deal size is $10K. But they are negotiable. Maybe mention if you're a startup.

They also sold Recognition Server which has fewer options to integrate with programmatically (it was only 'hot folders' at the time), and I think only runs on Windows, but costs less.

And their mobile OCR which is lightweight, designed to run on smartphones, they worked out deals where they get a percentage of revenue.

I believe you need to call them.

I was actually looking in to Tesseract yesterday, so this post coincides nicely.

I have a hobby project where I scrape instagram photos, and I actually only want to end up with photos with actual people in them. There are a lot of images being posted with motivational texts etc that I want to automatically filter out.

So far I've already built a binary that scans images and spits out the dominant color percentage, if 1 color is over a certain percentage (so black background white text for example), I can be pretty sure it's not something I want to keep.
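A minimal sketch of that kind of dominant-color check, assuming the image has already been decoded into (r, g, b) tuples by whatever imaging library you use. The function names, the 4-levels-per-channel quantization, and the 0.6 threshold are illustrative guesses, not the values used in the binary described above.

```python
from collections import Counter

def dominant_color_share(pixels, levels=4):
    """Return the share (0..1) of the most common color after coarse quantization.

    `pixels` is an iterable of (r, g, b) tuples with 0-255 channels;
    `levels` is the number of buckets per channel.
    """
    step = 256 // levels
    buckets = Counter((r // step, g // step, b // step) for r, g, b in pixels)
    total = sum(buckets.values())
    if total == 0:
        return 0.0
    return buckets.most_common(1)[0][1] / total

def looks_like_text_card(pixels, threshold=0.6):
    """Flag images where a single quantized color covers at least `threshold` of the pixels."""
    return dominant_color_share(pixels) >= threshold
```

Quantizing before counting keeps JPEG noise from splitting a flat background into thousands of near-identical colors.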

I've also tried OpenCV with facial recognition but I had a lot of false positives with faces being recognized in text and random looking objects, and I've tried out 4 of the haarcascades, all with different, but not 'perfect' end results.

OCR was my next step to check out, maybe I can combine all the steps to get something nice. I was getting weird texts back from images with no text, so the pre-processing hints in this thread are gold and I can't wait to check those out.

This thread is giving me so many ideas and actual names of algorithms to check out, I love it. But I would really appreciate it if anyone else has more thoughts about how to filter out images that do not contain people :-)

My solution with regards to bad facial detection in OpenCV is to do the following:

1. Use an LBP cascade on the picture. This is lower quality with more false positives, but it uses integer math, so it is fast. It's named lbpcascade_frontalface.xml

2. Capture the regions of interest that the LBP cascade identifies as a face and throw them into a vector<Mat>. This means you can capture a (potentially) arbitrary number of faces. Of course, with OpenCV you are limited to a minimum face size of 28px by 28px.

3. Run the Haar cascade for eye detection on the ROIs you saved in the vector. Ones that return eyes show a good match. Haar cascades are slower (because they use floats), but the reduction in pixels means it's relatively fast. It's named haarcascade_eye_tree_eyeglasses.xml

I can maintain 20fps with this setup at 800x600 on a slow computer.

For the motivational posters specifically, you might want to check out a Perceptual Hash type algorithm. Convert the image to a 64px square low-depth grayscale (4b?) and most of them should look more or less the same to PHash. Maybe you then classify them into clusters based on hash distance or something.
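To make the idea concrete, here is a minimal sketch of an average hash, which is a simpler relative of pHash rather than pHash itself. It assumes the image has already been shrunk to a small grayscale grid (a list of rows of 0-255 ints) by some imaging library. The function names are illustrative.

```python
def average_hash(gray):
    """Hash a small grayscale grid: one bit per pixel, set if the pixel is above the mean."""
    flat = [v for row in gray for v in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for v in flat:
        bits = (bits << 1) | (v > mean)
    return bits

def hamming(a, b):
    """Number of differing bits between two hashes; small distances mean similar images."""
    return bin(a ^ b).count("1")
```

Images whose hashes are within a few bits of each other could then be grouped into clusters, and a cluster that is mostly text cards can be dropped wholesale.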

I also like your dominant-color thing. If you have a couple approaches that each make sense, you can use them as an ensemble - the more classifiers that don't like something, the more likely it's junk.
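A minimal sketch of that voting idea, assuming each approach is wrapped as a function that returns True when it thinks the image is junk. The names and the default of two votes are hypothetical.

```python
def junk_votes(image, classifiers):
    """Count how many classifiers flag the image as junk."""
    return sum(1 for clf in classifiers if clf(image))

def is_junk(image, classifiers, min_votes=2):
    """Treat the image as junk once at least `min_votes` classifiers agree."""
    return junk_votes(image, classifiers) >= min_votes
```

Raising `min_votes` trades missed junk for fewer wrongly discarded photos, so it is worth tuning on a hand-labeled sample.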

I've done basically the same thing as you for a project I did for a newspaper where I collected 9000 selfies (see http://vk.nl/selfies).

It's a lot of manual work, but using OpenCV saves you a lot of time. I can't share the code unfortunately, but what I did was this:

* Get all Instagram photos with the '#selfie' tag

* Run it all through the haarcascade_frontalface_alt2 OpenCV cascade; I used the 1.3 and 5 values for the detectMultiScale() method.

* Check that there's only one face in the image, and make sure it's larger than 20% of the width of the image.
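The last check in the list above can be sketched as a small pure function. It assumes `faces` is the list of (x, y, w, h) boxes that OpenCV's detectMultiScale() returns; the function name is made up, and this is not the original (unshared) code.

```python
def is_single_large_face(faces, image_width, min_fraction=0.20):
    """True if exactly one face was detected and it is wider than min_fraction of the image.

    `faces` is a sequence of (x, y, w, h) boxes.
    """
    if len(faces) != 1:
        return False
    _, _, w, _ = faces[0]
    return w > min_fraction * image_width
```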

Even after that I still needed to go manually through the images. I guess around 10% were still false positives.

> But I would really appreciate it if anyone else has more thoughts about how to filter out images that do not contain people :-)

Google allows image searches to be filtered by is/isn't a face. I think you could tap into that knowledge, although it isn't immediately clear what the route would be.

Here's a detailed rundown of the more obscure Google search parameters: https://stenevang.wordpress.com/2013/02/22/google-search-url...

The relevant one to your interest is "tbs=itp:face".

dlib's face detection is the best of the open-source libs, in my experience.


It's a brilliant piece of software for a number of things.

Tesseract is ok, but I gather that a lot of the good work in the last few years on it has remained closed source within Google.

If you want to do text extraction, look at things like Stroke Width Transform to extract regions of text before passing them to Tesseract.

If you are using OpenCV and tesseract, you might have a look at Scene Text Detection in OpenCV 3. It's in the text module within OpenCV_contrib [0].

There are samples here [1] and here [2] to get you started. The paper is here: [3]


[0] http://docs.opencv.org/3.0-beta/modules/text/doc/erfilter.ht...

[1] https://github.com/Itseez/opencv_contrib/blob/master/modules...

[2] https://github.com/Itseez/opencv_contrib/blob/master/modules...

[3] http://cmp.felk.cvut.cz/~neumalu1/neumann-cvpr2012.pdf

Does Google provide any "OCR as a service" API that takes advantage of their closed-source advances? I know something OCR-ish happens when you upload an image-PDF to Google Docs, but I don't remember if there's any way to get the resulting text out.

Is it possible to extract text from pre-formatted documents? Let's say I have a document issued by the government, and I am only interested in the fields that have been filled. Could I process such a document fully automated using Tesseract? Maybe some image pre-processing would be needed?

If you'd be interested in using something like this as a paid API/service and have decent volume, feel free to contact me. I'm currently in pre-beta for API rollout of such a thing.

In summary, you need templates that map the field positions -> meaningful keys so that you can get back useful data as json/csv/xml. I have some tools that are still being polished that automate much of the template creation and do a lot of the pre-proc for you.
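Roughly, a template boils down to something like the sketch below. The field names, pixel coordinates, and the `ocr_region` callable are all hypothetical placeholders; plug in whatever crops a region and runs your OCR engine on it.

```python
import json

# Hypothetical template: field name -> (left, top, right, bottom) in pixels.
TEMPLATE = {
    "name": (120, 80, 520, 110),
    "date_of_birth": (120, 130, 320, 160),
}

def extract_fields(page, template, ocr_region):
    """Run `ocr_region(page, box)` on every templated box and return {field: text}."""
    return {key: ocr_region(page, box).strip() for key, box in template.items()}

def to_json(fields):
    """Serialize the extracted fields with stable key order."""
    return json.dumps(fields, sort_keys=True)
```

The hard part in practice is registering the scanned page against the template (deskewing, scaling, offset) so the boxes actually land on the fields.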

email is my username (at) gmail

"Maybe some image pre-processing would be needed?"

No, a whole lot of pre-processing would be needed. It all depends on the exact layout - if your tolerances are tight you need much more logic than if you have, let's say, 2cm white space around one sentence you're after.

(Disclaimer: I work for creatale GmbH)

We open-sourced the library that we use for exactly that purpose: https://github.com/creatale/node-fv

Not really. Everything out there is ancient and uses techniques from the 80s/90s.

It's pretty sad considering that OCR is basically a solved problem. We have neural nets that are capable of extracting entities in images and big data systems that can play Jeopardy. But no one has used that tech for OCR and put it out there.

It's because the kind of massive training sets that are required for accurate speech/text recognition are difficult and time-consuming to amass. It's just too difficult for hobbyists at the moment.

It's kind of the same reason there is no good open source solid modelling software (like Solidworks or NX). The problem has been "solved" for years, but actually doing it is too mammoth a task for open-source software.

Creating training data is a relatively simple task though (even though massive in scale), which can be done by untrained people in a crowd sourced fashion. Creating a modeling tool is both difficult, requiring experts _and_ time consuming.

[OT] What about FreeCAD? It looks like good open source solid modelling software: http://freecadweb.org/

FreeCAD is starting to get good, but the problem is it lacks core functionality that SolidWorks already has built in. Admittedly, the underlying Python calls are exposed, and you can use them... if you know how to make the appropriate calls.

When they get a good GUI hooked up to the calls, it will be at parity with the pay solutions.

Yep. Back when I was at NYPL Labs[0], we developed a bit of an obsession with OCR. It turns out that so much of it depends on the context of the documents, original scan quality (unpaper and deskewing be damned), typography, how much handwriting you're willing to suffer (can someone please pick up the work of applying Convnets to large handwriting corpora that aren't just MNIST?).

A lot of this comes down to how much time you're willing to put into it to get it up and running and if you're willing to put in any effort to additionally train/refine a model.

(ps all the elements I'm about to mention here have been mentioned throughout this thread. I'm just providing a bit more context and bringing what I know together).


Gotta have something now: Tesseract

Not saying it's "bad" but it ain't as great as you want. Bindings for every language, lots of configs, documented well enough that it's 80% of the suggestions here. For what it is, it's a damn miracle. Most languages and alphabets and lots of kinds of fonts likely start giving you results somewhere between 80% and the Uncanny Valley immediately. You also get a pretty good Layout Analysis engine [1] for if you're working with complete pages of text. The existing models it ships with are robust, but if you want to get better outputs, retraining it is a real pain (you have to segment on a character-by-character basis). You're better off trying to apply a general preprocessing to clean the image or...


Gotta have a proof of concept this afternoon: Send just text to Tesseract

By now you've realized that asking an OCR engine to OCR a tree (or a picture of a tree alongside some text) somehow always comes back looking like a cat just napped on your keyboard. As alluded to in a few other places here, Just Give It Text! Tesseract (and most traditional OCR) was developed in a text document centric world where you could safely assume it's just a page of words, and sometimes you might have to deal with columns (that's where the Layout Analysis comes in). Probably not your real world.

Depending on your type of documents you might be able to develop a few heuristics for identifying text regions in the picture, then sending only those sections over to Tesseract. This'll dramatically help Tesseract out, but it can increase some of the complexity you'll have to juggle. You might be able to come up with some heuristics specific to your documents (which can be very good, especially if it lets you infer more information about those regions that you might want later).

You can also use something like Stroke-Width Transform (all the links I would use were graciously linked in this earlier comment [2]) which was discovered by Microsoft trying to spot text in the wild for their Street View efforts. ccv has a very nice SWT implementation [3], and with a bit of makefile finagling its HTTP server for the whole library [4] can give you a very nice SWT preprocessor -> Tesseract API over HTTP in an hour or so.

Also it looks like the OpenOCR[5] project is now using SWT -> Tesseract with a nice HTTP API written in Go, conveniently packaged up for Docker and very well documented[6].


Take me down the rabbit hole, but maybe get results in 2 days:

The roots of tesseract were planted by HP in 1985, so there had to be a better way at some point. OCRopus[7] was supposed to be the great state-of-the-art hope that would save us all. The approach was incredible and the results being published were great. But the documentation came in the form of a mind map[8]. Recently the project was picked up again by the original developer and rechristened OCRopy[9] and has garnered a pretty active and growing community in the past few months.

You'll have to do a bit more work here than just tesseract, but the LSTM neural network[10] approach completely blows away Tesseract's results with just a little training.

To get started, Dan Vanderkam's tutorial is excellent to start working with the out-of-the-box model[11] immediately.

But results get INCREDIBLE when you take some time to train your own model[12]. I provided Dan with the source images for the project[13], from The New York Public Library's historical collections, and the source text was in a font style the default model had never encountered before. But in an hour or two of transcribing the training data (training data in OCRopus is awesome because you just feed in a line at a time; no need to align and segment each letter like in Tesseract), he was getting an under 1% error rate!
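For reference, an error rate like that is commonly computed as character-level edit distance divided by the length of the ground-truth text. The sketch below shows that calculation under that assumption; it is not necessarily how the figure above was measured, and the function names are illustrative.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute (free if equal)
        prev = cur
    return prev[-1]

def char_error_rate(truth: str, ocr: str) -> float:
    """Edit distance normalized by the length of the ground-truth text."""
    return edit_distance(truth, ocr) / max(len(truth), 1)
```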

Layout analysis isn't OCRopy's strongest suit, so you might get even better results if you pre-segment with something like SWT, but again, not completely necessary.


That's it. Way too much for a Newsy Combinator comment, but a pretty decent tour of the big stuff in open ocr world. There's some magic state of the art going on inside Google, Microsoft, and Abbyy, but hopefully you'll be able to teach your computer to appreciate whatever it is you want it to read through these now.

[0]: http://nypl.org/labs

[1]: https://en.wikipedia.org/wiki/Document_layout_analysis

[2]: https://news.ycombinator.com/item?id=9776034

[3]: http://libccv.org/doc/doc-swt/ and http://libccv.org/lib/ccv-swt/

[4]: http://libccv.org/doc/doc-http/ and http://libccv.org/lib/ccv-serve/

[5]: https://github.com/tleyden/open-ocr

[6]: https://github.com/tleyden/open-ocr/wiki/Stroke-Width-Transf...

[7]: https://en.wikipedia.org/wiki/OCRopus

[8]: https://www.mindmeister.com/192257150/ocropus-overview-publi...

[9]: https://github.com/tmbdev/ocropy | all day all week

[10]: https://en.wikipedia.org/wiki/Long_short_term_memory

[11]: http://www.danvk.org/2015/01/09/extracting-text-from-an-imag... | though you've got to download the model separately because it's too big for github

[12]: http://www.danvk.org/2015/01/11/training-an-ocropus-ocr-mode...

[13]: https://www.oldnyc.org/

Great list of links. You have done the research! Personally, I have used ocropy to great effect. Right now, I am moving beyond forms at work and researching cartographic map vectorization using deep learning techniques. There are few scientific papers on the subject, however. Have you done any research on that by any chance?

Indeed! Map vectorization is one of the things NYPL Labs is actively working on: https://github.com/nypl/map-vectorizer

It's designed to work primarily with old fire insurance atlases (e.g. Sanborns) and is a bunch of hand-tuned heuristics. But it powers Building Inspector (http://buildinginspector.nypl.org), which is where all the data is validated through consensus crowdsourcing, providing ground truth for the data coming out.

Unfortunately (or perhaps fortunately if you're trying to get a Computer Science PhD) I haven't come across anyone applying deep learning to map vectorization either. Frankly, it could open up a whole new field with respect to historical mapping (among other things).

Would love to talk more about it and see how deep learning could be applied here. We always structured the outputs of Building Inspector so they'd be useful training sets for unsupervised or reinforcement learning so it could hopefully apply here as well.

this is awesome! thanks for taking the time... Damn, HN really needs bookmarks

The accuracy of GOCR is very high and it doesn't require any learning.


GOCR is ok, certainly much better than GNU Ocrad, but I've had much better results with Tesseract.

GOCR is simple enough to install from your distro's repository and wrap into a script without fuss though. Tesseract is wildly more complicated to get started with.

If you're doing real time street sign recognition or involved with a book scanning and archival startup, investigate Tesseract for sure. But even then you'll probably want to prototype with gocr first.

(Disclaimer: I work for creatale GmbH)

As you mentioned ocrad.js, I assume you're looking for something in js/nodejs. Many others already recommended Tesseract and OpenCV. We have built a library around Tesseract for character recognition and OpenCV (and other libraries) for preprocessing, all for node.js/io.js: https://github.com/creatale/node-dv

If you have to recognize forms or other structured images, we also created a higher-level library: https://github.com/creatale/node-fv

I can attest to this. I used Node-DV in a production environment and it's pretty damn amazing with all its built-in functionality.

There are options to adjust the image in various ways and once Tesseract runs, it's easy to get the result in various formats.

I've used Tesseract to great effect. I don't know how your images are, but if only part of the image has text in it, you should only send that part to the OCR engine. If you send the entire image and only a portion of it has text in it, chances of the OCR extracting text are slim. There are pre-processing techniques [1] you can use to crop out the part of the image that has text.

[1]: https://en.wikipedia.org/?title=Hough_transform

Could you explain how to use a Hough transform to find areas that have text? I only recently managed to wrap my head around how it works for line detection and other shapes with the generalized form, but how would one recognize text?

and an implementation with source & binaries [1]. also check out unpaper [2]

[1] http://galfar.vevb.net/wp/2011/deskewing-scanned-documents/

[2] https://github.com/Flameeyes/unpaper

Tesseract is quite good and can get some really good results if you can train it. I just wish there was a more UI friendly way for generating the training file; for the project where I had to use it I ended up paying a freelancer with some strong knowledge of Tesseract training. Really happy with the results for my iOS app (universal app)

Tesseract does no layout analysis.

So if the source image contains text columns or pull quotes or similar, the output text will just be each row of text, from the far left to the far right.

Take a look at https://github.com/tmbdev/ocropy for layout analysis (at one point this project was called OCRopus).

Could you add another layer on top (eg. image processing) to detect boundaries of blocks of content, and send each block to Tesseract in sequence?

Yeah, you could do that. And I wouldn't think it would be that hard either.

I'd make a cascade that detects all letters and numbers from major font sets. That shouldn't be too terribly difficult.

Now, use the cascade to scan the document. Now, convert the document to a list of all detected characters (we don't actually care what the chars are).

Once you have this, do best fit bounding boxes around the data. You'll have to figure out what distance you want to exclude from the bounding boxes.

Now what you should end up with are a few boxes indicating the regions of data on the document. Now, crop each of these regions of interest and feed them into Tesseract.
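The bounding-box step could look something like the sketch below. It assumes the character detections arrive as (x0, y0, x1, y1) boxes and that "close" means within `gap` pixels on both axes. The function names and the default gap are made up, and the exclusion distance would need tuning per document.

```python
def _near(a, b, gap):
    """True if the space between boxes a and b is at most `gap` pixels on both axes."""
    return (a[0] - gap <= b[2] and b[0] - gap <= a[2] and
            a[1] - gap <= b[3] and b[1] - gap <= a[3])

def merge_boxes(boxes, gap=10):
    """Greedily merge nearby boxes until no two remaining regions are within `gap`."""
    regions = [tuple(b) for b in boxes]
    merged = True
    while merged:
        merged = False
        out = []
        for box in regions:
            for i, other in enumerate(out):
                if _near(box, other, gap):
                    # Grow the existing region to cover both boxes.
                    out[i] = (min(box[0], other[0]), min(box[1], other[1]),
                              max(box[2], other[2]), max(box[3], other[3]))
                    merged = True
                    break
            else:
                out.append(box)
        regions = out
    return regions
```

Each resulting region is a crop rectangle that can be fed to Tesseract on its own.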

Been looking at Number Plate Recognition and https://github.com/openalpr/openalpr has been my go to solution at the moment. It uses Tesseract and OpenCV.

Judging by the comments, there is much room for improvement in open source OCR libraries. Maybe someone with experience in the field should start a Kickstarter for it; there must be some here on HN itself. Maybe something with CLI and GUI and layout recognition.

I used Tesseract and OpenCV to process Words With Friends tiles.

Yeah, I used Tesseract to solve Countdown puzzles on Android, code here: https://github.com/danielsamuels/countdown/blob/master/scree...

Haha, this is awesome. Have you written about your program anywhere? Would love to hear more about it.

It's pretty much as other people talked about: I used OpenCV to try to detect the board and then isolate each tile, and then used Tesseract to try to figure out the letter. The biggest problem I had was that whenever the red score bubble landed on a tile, it would fall on top of the letter and thus lose information, so I ended up having to create training data and associating that to get the letter. I was able to get to about 98% accuracy and it was relatively robust until they created a new app that I haven't had time to adjust for.

I think this is the best solution, though a bit daunting without experience in either to approach. But you'll get access to processing all of your information in a low level language very quickly.

I'm just starting out on something similar (OCR'ing letters from a word game), so would also love to hear more about the process you're using with Tesseract and OpenCV.

A best-word solver I presume?

I did it to count the tiles, and was going to work on a full-on cheating program but never got around to it. I did find the algorithm online though, it's pretty interesting.

I've tried using Tesseract before, the biggest of the open source libraries. It goes from "okay" to "terrible" depending on the application.

Our particular application was OCRing brick and mortar store receipts directly from emulated printer feeds (imagine printing straight to PDF). We found that Tesseract had too many built-in goodies for scanned images, such as handling for warped characters, lighting and shadow defects, and photographic artifacts. When applied directly to presumably 1 to 1 character recognition, it failed miserably.

We found that building our own software to recognize the characters on a 1 to 1 basis produced much better results. See: http://stackoverflow.com/questions/9413216/simple-digit-reco...
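As a rough illustration of what 1 to 1 recognition can mean when the input is a clean, fixed-pitch render: each character cell is an exact bitmap, so a dictionary lookup is enough. The 3x5 glyphs, cell width, and spacing below are invented for the example and are not our actual implementation.

```python
# Hypothetical 3x5 glyph bitmaps, one string per row ('#' = ink, '.' = blank).
GLYPHS = {
    ("###", "#.#", "#.#", "#.#", "###"): "0",
    (".#.", "##.", ".#.", ".#.", "###"): "1",
    ("###", "..#", "###", "#..", "###"): "2",
}

def read_line(rows, width=3, spacing=1):
    """Split a fixed-pitch bitmap line into cells and look each cell up exactly."""
    out = []
    for x in range(0, len(rows[0]), width + spacing):
        cell = tuple(row[x:x + width] for row in rows)
        out.append(GLYPHS.get(cell, "?"))  # '?' marks a cell with no exact match
    return "".join(out)
```

Exact matching only works because a printer feed has no noise; anything scanned or photographed needs a tolerant comparison instead.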

It depends on what you're trying to do. For myself, I wanted to OCR scanned documents and I've been moderately successful using ScanTailor to process the images and then Tesseract to OCR the result. Certainly, it's far from perfect and the documentation on Tesseract and its options is spotty, but I've been moderately happy. As a note, I've had better luck with the current trunk in the git repo of Tesseract than the point releases. Some of the older versions of Tesseract modified the images when processing to a PDF and this was unacceptable to me.

I used the Go wrappers for Tesseract[0] in my full text search indexer for disks[1] and it worked great. No issues; it handled pretty much anything I threw at it.

With the caveat that none of the stuff was handwritten.

[0] http://godoc.org/gopkg.in/GeertJohan/go.tesseract.v1

[1] https://bitbucket.org/zaphar/goin/

You may try the Baidu OCR API: https://github.com/JeremyWei/baidu-ocr

Is that trainable for new alphabets? API use policy?

Tried tesseract, it worked very poorly. I don't think there is anything that compares well to Abbyy.

Unfortunately the native Linux version is a bit pricey: http://www.ocr4linux.com/en:pricing

Otherwise I would use the command line version to help me index all my data.

An important aspect is stripping text from image, before OCR'ing: http://sourceforge.net/projects/tirg/


Examples here: http://funkybee.narod.ru/

I'm working on these problems for applications in the financial industry right now. We are building some interesting technologies and have quite a bit of funding from top VCs. We are looking to hire people who know this problem cold. Let me know if you're curious. nate@qualia-app com



Apologies, I am not sure if it's open source.

It seems this is just built on top of Tesseract as well.

We've had great success with CCV's SWT to identify text zones and Tesseract to extract the text.

[1] http://libccv.org/doc/doc-swt/

A coworker had great success using Ocropus (which uses neural nets under the covers) extracting text from the backs of many photos for Old New York.

Tesseract with ccv for SWT works but is a bit of a slow dog. Anyone used deep learning (caffe maybe?) for OCR?

Thought I'd reply on my own comment.

Caffe has been used for handwriting, seems like OCR of typefaces would work just the same with a typeface dataset.


Here is the only other OCR example with caffe i could find:


Any references on OCR to extract MICR (font E13B) for desktop as well as Android?

Dunno about desktop, but as far as Android is concerned, tess-two is a fork of Tesseract tools for Android: https://github.com/rmtheis/tess-two

This dude did a pretty nice writeup on getting it going: http://gaut.am/making-an-ocr-android-app-using-tesseract/

for mobile OCR check out microblink SDK

apache tika + ocr is an excellent combination - https://wiki.apache.org/tika/TikaOCR

A few years ago I used Cuneiform, but it looks like it's dead now.
