

OCR by uploading images to Google Docs - gintas
http://docs.google.com/support/bin/answer.py?answer=176692

======
anty
In case anyone wonders: I tested whether Google could solve its own captchas. It
can if each character is separated, but once they overlap, as they usually
do, it doesn't work.

------
nodata
Does anyone know if this uses Google's open source tesseract-ocr software?

~~~
l0nwlf
I was unable to find any reference regarding this. Personally, gocr worked
better for me than tesseract/pytesseract. Google Docs' built-in OCR gives pretty
satisfactory results too.

------
CWuestefeld
I find it tremendously frustrating that so many people are creating this
problem for themselves.

Anything that needs to be data should be data, not images. Except for some
very specific cases, you're not doing anybody any favors by outputting PDF.
That format is a data black hole. It allows you to transmit very well-
formatted output, but it absolutely _stops_ you from reliably _using_ anything
in that content.

I beg you all: if it's anything that contains data, or really, if it's
anything for which layout and formatting is not absolutely critical, please
don't use PDF. Send data as data.

~~~
slouch
99% of the time I want to OCR a document or image, I am not the creator of said
item.

~~~
CWuestefeld
Obviously. But if we could make _everyone_ understand, then we'd be covered.

Every few months here, we get a customer asking why we can't automatically
handle purchase orders that they send us in PDF format, and every time they
get the same explanation.

~~~
nostrademons
If we could make _everyone_ understand, we wouldn't need computer programmers.
We could just have computers talk to each other, and all their formats would
be magically compatible, and the vast body of data conversion code wouldn't
exist.

The problem is that computers are made for humans, and humans are often
wantonly illogical. You're not going to change this, short of Skynet and the
rise of the machines. So it makes sense to put up with a fair amount of coding
pain to make things easier for your users. It's lucrative, at least.

Think of it as a full-employment theorem for data-miners.

------
ylem
Has anyone checked to see if this works with Japanese, Korean, or Chinese?
What about Arabic or Hindi? This would shed some light on whether it's likely
to be tesseract or OCRopus...

------
joakin
Wow, I just tested with an image, and you get a GDoc with the image on top and
the OCRed text at the bottom.

Pretty cool.

I wonder what they are using for Google Goggles and this.

------
Estragon
Incidentally, I noticed that if you try to use tesseract on an image taken
from a Google Books page, you get terrible OCR accuracy. Anyone know why that
is?
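
One way to put a number on "terrible accuracy" is a character error rate: the edit distance between a hand-typed reference transcription and the OCR output, divided by the reference length. The sketch below is only an illustration; the function names are made up here, and it assumes you have a trusted reference text to compare against.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_error_rate(reference, ocr_output):
    """Edit distance normalized by reference length; 0.0 means a perfect match."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, ocr_output) / len(reference)
```

Comparing the same page run through two engines against one reference gives a rough, like-for-like figure instead of an impression.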

~~~
zzleeper
I recall that on some Google-scanned books, there was some metadata from ABBYY
FineReader. So that may be why.

Also, tesseract often needs to be configured.
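
As a rough illustration of what "configured" can mean, here is a minimal sketch that calls the tesseract command-line tool with an explicit language and page segmentation mode. It assumes a Tesseract 4 or later binary on the PATH (3.x releases spelled the flag `-psm`); the helper names and the default values are hypothetical choices, not recommended settings.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, lang="eng", psm=6):
    """Return the argv list for a tesseract run that prints text to stdout."""
    # Using "stdout" as the output base makes tesseract write the text
    # to standard output instead of creating a .txt file.
    return ["tesseract", image_path, "stdout", "-l", lang, "--psm", str(psm)]


def run_ocr(image_path, lang="eng", psm=6):
    """Run tesseract if it is installed; return the text, or None if it is not."""
    if shutil.which("tesseract") is None:
        return None
    result = subprocess.run(
        build_tesseract_command(image_path, lang, psm),
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Trying a few `psm` values on the same page is usually the first thing worth doing, since the default layout analysis can split a clean book page badly.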

------
Tichy
Is there an API by any chance?

~~~
mseebach
I suppose this will work: <http://code.google.com/apis/documents/>

You can both upload and specify format when downloading. I guess that includes
the OCR applied if necessary.

------
mikecane
G1ver whar OCR locks lice in g00gLe ePubs in g0og1e Buuks, th1s w111 du we11.

------
trezor
Trying to improve some scanned forms I have, I got an average of 5 characters
per page recognized. Also, the form formatting was recognized as "1 1 1 1 1 1 1
1 1 1 1 1 1".

I may not rely entirely on google docs for my OCR needs in future ;)

~~~
andybak
Can you upload your document or part of it somewhere where others can take a
look?

I find it hard to believe Google would release this if it was that useless (no
jokes about Buzz or Wave, please).

------
rorrr
I wonder which is stronger: Google OCR or Google captcha?

