

OCRopus: high-quality open source OCR sponsored by Google and used by reCAPTCHA - henning
http://code.google.com/p/ocropus/

======
chime
I've used <http://code.google.com/p/tesseract-ocr/> by Google too. It appears
OCRopus uses that as one of the plugins for OCR.

------
dlib
Is there a web service that lets you send in files (PDFs with images, PNG,
JPEG, etc.) through an API and have the service send you back a .txt, .doc, or
PDF of the same file with the text embedded?

~~~
perplexes
Docsplit comes to mind, but it's not in web-API form:
<http://documentcloud.github.com/docsplit/>

------
nearestneighbor
Is this the same one as featured here:

<http://recaptcha.net/digitizing.html>

If so, I'm not impressed.

~~~
henning
What exactly do you expect? Some of those old documents they're trying to
digitize are in such bad shape that you practically need an electron
microscope to decipher them. Document recognition in its full generality is
still an open problem. The examples shown on that page constitute highly
adversarial challenges. For simpler examples of the kind that would prevail
with recently printed material, much better results can be achieved.

~~~
nearestneighbor
> Some of those old documents they're trying to digitize are in such bad shape
> that you practically need an electron microscope to decipher them.

Red herring. I'm talking about the examples pictured.

Do you know _for a fact_ that this is the same software package?

I don't want to waste my time arguing about why it doesn't live up to my
expectations, if it's not.

------
cowsandmilk
What evidence do you have that this is used for reCAPTCHA?

~~~
gometro33
A quick search revealed that this title ("...and used by reCAPTCHA") has been
reused many times and is likely just a rumor at this point.

I would definitely be interested in seeing evidence of it though.

------
aschobel
Is there something similar for handwritten text?

~~~
ZeroGravitas
The link says: "_The OCRopus engine is based on two research projects: a
high-performance handwriting recognizer developed in the mid-90's and deployed
by the US Census bureau, and novel high-performance layout analysis methods._",
so it appears so.

------
korch
Uh, why is Google releasing this? Wouldn't this code give hackers a good head
start to create an OCR system capable of trivially defeating CAPTCHA
everywhere?

Or maybe they've realized any human computer test based on text recognition is
flawed, and so what better way to force the web to upgrade than to make OCR
trivial? I rather like this shotgun approach to AI.

~~~
modoc
I think the benefits of having high-quality free OCR tools available to
developers outweigh the CAPTCHA abuse risk. Information organization is a
huge problem/area of opportunity, and being able to extract
text/content/context out of scans/photos and the like is key.

