
Extracting text from an image using Ocropus - danvk
http://www.danvk.org/2015/01/09/extracting-text-from-an-image-using-ocropus.html
======
martingordon
Thanks for this. I tried using Tesseract over the weekend to extract text from
a game screenshot and had no luck. The documentation for Tesseract is rather
opaque; maybe I'll have better luck with Ocropus.

~~~
joaomsa
My main gripe with Tesseract is how convoluted and poorly documented the
training procedure is, even though training is critical to getting better
results. I'll be sure to check out Ocropus.

~~~
danvk
You'll enjoy my follow-up post then, which talks about training:
[http://www.danvk.org/2015/01/11/training-an-ocropus-ocr-mode...](http://www.danvk.org/2015/01/11/training-an-ocropus-ocr-model.html)

------
iskander
I wonder if it's possible to remove the need for post-processing of the LSTM's
output by integrating transcription into the neural network model directly.

~~~
danvk
The first row of output from the neural net is a special "no character" output
which effectively gives you the character segmentation. You can distinguish
"aa" from "a" because the former shows up as "(no)a(no)a(no)" whereas the
latter is "(no)a(no)". You can read more about this in the Ocropus paper:
[http://www.helsinki.fi/~mpsilfve/ocr_course/materials/2008-b...](http://www.helsinki.fi/~mpsilfve/ocr_course/materials/2008-breuel-ocropus-open-source.pdf)
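
To make the "(no)a(no)a(no)" point concrete, here is a minimal sketch of that kind of collapsing step. It is not taken from the Ocropus source; the `collapse` function, the `BLANK` label, and the per-frame label lists are all made up for illustration, and it assumes you have already picked the most likely label for each column of the network output.

```python
from itertools import groupby

# Hypothetical stand-in for the special "no character" output.
BLANK = "(no)"


def collapse(frame_labels, blank=BLANK):
    """Merge runs of identical per-frame labels, then drop the blanks."""
    # Step 1: a character usually spans several frames, so merge repeats.
    merged = [label for label, _ in groupby(frame_labels)]
    # Step 2: the blank only marks boundaries, so it is removed at the end.
    return "".join(label for label in merged if label != blank)


# Two separate "a" characters: a blank sits between the two runs of "a".
frames_aa = [BLANK, "a", "a", BLANK, "a", BLANK]
# One wide "a": a single unbroken run, however many frames it covers.
frames_a = [BLANK, "a", "a", "a", BLANK]

print(collapse(frames_aa))
print(collapse(frames_a))
```

The order matters: merging repeats before dropping blanks is what keeps the two "a"s in the first example from fusing into one.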

------
sherjilozair
Slightly off-topic, but is anyone aware of a similarly capable library for
handwritten text recognition, i.e. ICR?

~~~
danvk
If you're able to provide enough training data, there's no reason Ocropus
couldn't do this. If you're open to using a commercial OCR program, FineReader
is excellent.

------
soperj
Are there any OCR/ICR open source projects that are actively being worked on
that anyone knows about?

~~~
danvk
Ocropus is fairly actively developed. Lots of commits in 2015:
[https://github.com/tmbdev/ocropy/commits/master](https://github.com/tmbdev/ocropy/commits/master)

