Right this very moment (well, a few moments ago, when I wasn't procrastinating on HN) I was in the midst of extracting data from a client's old website in preparation for creating a new one.
A lot of that data is contained within images.
From a few preliminary tests, I'm hugely impressed. This seems on par with any other OCR software I've used, and the fact that it happens in real time in the browser is amazing.
I tried it on a piece of content I'd just had to type out, that was originally in an image. Typing out the content took about 10 minutes. Copying and pasting with Naptha, and then making some minor edits/corrections, did the same thing in about 2 minutes.
My MSc thesis was on reducing OCR error rates through various forms of pre-processing, and while I managed to get some reduction in error rates, one of the things I found was that, given how low the error rates generally were to begin with, you have a very tiny budget in terms of extra processing time before further error reduction just isn't worth it. If a human needs to check the document for errors anyway, a "quick and dirty" scan+OCR is often far better than spending the time to get "as good as possible" results. Spending even a few extra seconds per page to place the page perfectly in a scanner, or waiting a few extra seconds for more complicated processing, can be a net loss.
It's a perfect example of "worse is better": OCR, at least for typed text, is good enough today that the best available solutions aren't really worthwhile to spend resources on (for users) unless/until they give results so perfect it doesn't need to be checked by a person afterwards.
With really low-res scanners I can imagine it could make a big difference.
The biggest problem was stuffing too many files into an NTFS directory. Apparently, NTFS didn't like tens of thousands of files in one directory. :)
Doesn't that depend entirely on what you're using the text for and how accurate it needs to be?
From my own experiments, I tend to find that you can read through and correct errors only marginally faster than you can type, because you either follow along with the cursor or need to be able to position the cursor very quickly when you find an error; as the error rate increases, jumping the cursor to each error quickly becomes too slow.
Lowering your accuracy bar while correcting the text doesn't really seem to speed things up much. You can likely speed it up if you're willing to assume that anything that passes the spellchecker is OK (but it won't be, especially since modern OCRs often rely on data about letter sequences, or dictionaries, when they're uncertain about characters).
If you're ok with lower accuracy, e.g. for search, and the alternative is not processing the document at all, then it'd be drastically different.
FF 28 seems to be working fine with the "Weenie Hut Jr." version...is it just the add-on that isn't supported?
awesome tech, btw
Of course then it was "easy": almost all the text would have been rendered with one of a tiny number of fonts available on the system, with little to no distortion.
Even though it solved a problem we don't usually have today (this story notwithstanding), it was still one of the most amazingly useful programs ever.
If the window was rendered with multiple fonts that wouldn't be reliable, but I guess it'd likely be "good enough" to avoid a wider search most of the time.
 Here's the RastPort struct from AROS (open source re-implementation of AmigaOS): http://repo.or.cz/w/AROS.git/blob/HEAD:/compiler/include/gra...
Maybe soon I won't feel guilty for leaving my alt attributes empty.
here's what the project does now with js + web workers:
processing time is < 1500ms in Chrome and < 2000ms in FF
the code is open source, though using it isn't yet polished. I'm working slowly on a blog post series to detail how to use the lib(s). https://github.com/leeoniya/pXY.js
a walkthrough of the base lib is here: http://o-0.me/pXY/
But in my experience, the recognition quality isn't good enough to replace Tesseract if you have that capability.
Here is a copy/paste example from imgur:
Top: vou SAID w[ W[R[
Bottom: TN[ FACTTNATl'M MAWING TNISM[M[ g
INST[AD of DRIVING D[TERMIN[D TN#rWASA ll[
Maybe it needs to be a certain font for better results. Still pretty cool. Hopefully all the kinks get worked out. I would definitely find this useful.
EDIT: need to make sure the language is set to "internet meme" and it works much better.
Top: YOU SAID WE WERE
Bottom: THE FACT THAT I'M MAKING THIS MEME
INSTEAD OF DRIVING DETERMINED THAT WAS A LIE
Next time I'll RTFM.
YOU SAID WE WERE
TN[ FACTTNAT |'M MAWING TNIS M[M[
INST[AD of DRIVING D[TERMIN[D TN#rWASA ll[
I imagine that the thick outline of the font makes it hard to detect the edge of the letters, especially since it obscures the true "background".
e: using the Internet Meme language worked much better!
THE FACT THAT I'M MAKING THIS MEME
INSTEAD OF DRIVING DETERMINED THAT WAS A LIE
Am I alone?
2) Erase Text option menu location
Using version 0.7.2, the "Erase Text" option is displayed under the "Translate" section (certainly not where I would ever intentionally look for it).
3) Select Text -> Right-click changes selection
After selecting my text, when I right-click, the selected text often (almost always) changes. For example, with the kitten text, I selected both paragraphs, but when I right-clicked to go to Translate -> Erase, the first paragraph ceased to be highlighted. After erasing the second paragraph I tried in vain to select and erase the first paragraph, but every time I right-clicked, only a single word of the selected paragraph would still be highlighted. I eventually tried erasing text while only one word was highlighted, and the entire first paragraph was erased.
4) I really appreciate the Security & Privacy section of the project page.
5) I would love to see a Firefox version of Project Naptha!
For starters, it's based on GNU Ocrad but fails to state a license or publish any source code.
I wonder if you could get better performance when running locally by sending the result through a spellchecker and doing some Bayesian magic on the word choice...
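A minimal sketch of that idea, in the noisy-channel style: generate candidate words by undoing known OCR character confusions, then pick the candidate a language model considers most likely. The word frequencies and confusion table below are made up for illustration; a real system would train both on a corpus.

```python
from itertools import product

# Hypothetical unigram counts (stand-in for a real language model).
WORD_FREQ = {"were": 500, "we're": 120, "where": 400, "said": 300, "you": 900}

# Characters a lightweight OCR commonly confuses (illustrative, not Ocrad's
# actual error profile), treated here as equally cheap substitutions.
CONFUSIONS = {"[": "e", "(": "c", "|": "i", "0": "o", "1": "l"}

def candidates(word):
    """Generate plausible originals by optionally undoing each confusion."""
    options = [[ch, CONFUSIONS[ch]] if ch in CONFUSIONS else [ch]
               for ch in word.lower()]
    return {"".join(combo) for combo in product(*options)}

def correct(word):
    """Pick the candidate with the highest corpus frequency (the prior);
    fall back to the raw OCR output if no candidate is a known word."""
    best = max(candidates(word), key=lambda w: WORD_FREQ.get(w, 0))
    return best if WORD_FREQ.get(best, 0) > 0 else word

print(correct("w[r["))  # -> "were"
```

A fuller version would weight the channel model too (some confusions are likelier than others) and score word sequences rather than isolated words.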
It can't read the top-right text correctly.
Awesome tech though
Once again, such a simple implementation: somebody grabs some components that have been around for ages and mashes them up in a way that makes people question why it wasn't invented before.
I've got this installed and it'll probably never leave my Chrome profiles. Keep up the awesome work!
http://www.xkcd.com/ bottom line here is recognized as:
This made me realize I never saw such a thing as OWR, i.e. software that would first try to recognize whole words, then drop down to the character level if no satisfying match is found.
Found out this exists already: https://en.wikipedia.org/wiki/Intelligent_word_recognition
In my experience, the ability to handle overlapping letters (which is very common on type-written text and professionally typeset material) is one of the key things that separate the relatively lightweight OCRs (like Ocrad and GOCR) from the big complicated ones (Tesseract, Cuneiform, Abbyy etc). Whitespace character segmentation cannot be taken for granted if you want to do any useful OCR of "historical" material.
I posted a review on my blog here: http://www.sinosplice.com/life/archives/2014/04/24/can-proje...
OP, I'd be happy to work with you on improving the recognition of Chinese text. Just get in touch with me through my blog (linked to above).
1. The implementation of Stroke Width Transform is not super good. So far, http://libccv.org/ has the best implementation of SWT. But again, you can make neither head nor tail of that implementation.
2. There are just too many false text regions, and the text detection accuracy is nowhere near what you could call good. A mixed use of multiple OCR engines might give better results.
All that said, you can't take away the cleverness of the application of detecting text. Mind == Blown, on that area.
Ocrad is being used as the default because it runs locally and it's small enough that it's easy to ship with. The remote OCR engine uses Tesseract which gets much closer to acceptable in a lot of circumstances.
But there is a lot of work that can be done to improve it. I have a friend who constantly nags me for not having a solid test corpus to run regression analysis/parameter tuning/science on. Certainly it lacks the rigor of an academic and scientific endeavor, but I've always imagined this as a sort of advanced proof of concept. I think the application of transparent and automatic computer vision deserves to be part of the interaction paradigm for the next generation of operating systems and browsers.
In case anyone from the project is monitoring - text selection did seem to work fine for me in Firefox (ESR 24.3) despite the "Not Supported" text being displayed.
Here were the API references I could find for the remote OCR:
- GET https://sky-lighter.appspot.com/api/read/<chunk.key>
- GET https://sky-lighter.appspot.com/api/lookup?url=<image.src>
- POST https://sky-lighter.appspot.com/api/translate
Apparently the author was one of the winners of HackMIT 2013 according to some of the comments. Couple of fun things in there if you decide to poke around in the code. Jump into naptha-wick.js for the remote logic.
Note from the Dev (http://challengepost.com/users/antimatter15, http://antimatter15.com/wp/, https://twitter.com/antimatter15):
It's April 16, 2014.
It's been six months since I started this project.
Just under two years after I first came up with the idea.
It's weird to think of time as something that happens,
to think of code as something that evolves. And it may
be obvious to recognize that code is not organic, that
it changes only in discrete steps as dictated by some
intelligence's urging, but coupled with a faulty and
mortal memory, its gradual slopes are indistinguishable
Hopefully, this project is going to launch soon. It
looks like there's actually a chance that this will
be able to happen.
The proximity of its launch has kind of been my own little
perpetual delusion. During the hackathon, I announced that
it would be released in two weeks time.
When winter break rolled by, I had determined to finish
and release before the end of the year 2013.
This deadline rolled further away, to the end of January
term, IAP as it is known. But like all the artificial
dates set earlier, it too folded against the tides of
I'll spare you February and March, but they too simply
happened with a modicum of dread. This brings us to the
present day, which hopefully will have the good luck to
be spared from the fate of its predecessors.
After all, it is the gaseous vaporware that burns.
Yeah, the code is super messy, but I'd prefer if you didn't play around too much with the remote OCR service, specifically, the translation parts because Google Translate is pretty expensive per-use.
The website was not very clear about whether the work is done client-side (it mentions server calls). It turns out that server calls can be disabled, and the extension works quite well without them. I would disable this option by default and offer it as opt-in; that is better for privacy, I think.
I have a big problem with various people sending me screenshots with stack dumps in them. This is perfect for extracting them into the ticket bodies, and it does it flawlessly (I've just done 20 with it and manually checked them!).
This is the sort of stuff that really improves people's lives by making all data equal.
I'm using the latest version of Chrome on a modern Mac and have Naptha properly installed and Chrome has been relaunched.
Any hints would be appreciated.
Anyway, good luck!
Here is the picture: http://thesuperslice.com/wp-content/uploads/2012/04/downtown...
And the text output - I found it most interesting which symbols it thought it recognized:
. o C-‘7' H ' .-.”-." «'~3;
It (sort of) worked:
"I AB5ENTH|NDEDLY5ELECT RANDU1 Bl.OO<5 OFTEXTHSI READ,
PND FEEL SLRONSCDUSLY SATISFIED LHEN THE HIGHUGHTED AREA
|"PKE5 H 5Yl’R1ETRICHL 5|-PPE"
However, it seems to confuse the letter O with the number 0.
Since serial numbers are not English words, I'm not sure how you would solve this unless you had a lookup for commonly used web fonts.
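Short of a font lookup, one cheap heuristic is to use the neighbouring characters: in a serial number, an ambiguous O/0 surrounded by digits is probably a zero, and one surrounded by letters is probably an O. The sketch below is purely hypothetical post-processing, not anything Naptha does.

```python
# Hypothetical post-processing pass for serial numbers: decide between the
# letter O and the digit 0 from the characters on either side.
def fix_o_zero(serial):
    chars = list(serial)
    for i, ch in enumerate(chars):
        if ch not in "O0":
            continue
        neighbours = [c for c in (chars[i - 1] if i > 0 else "",
                                  chars[i + 1] if i + 1 < len(chars) else "")
                      if c and c not in "O0"]  # skip other ambiguous chars
        digits = sum(c.isdigit() for c in neighbours)
        letters = sum(c.isalpha() for c in neighbours)
        # Mostly digits nearby -> zero; mostly letters -> O.
        chars[i] = "0" if digits >= letters else "O"
    return "".join(chars)

print(fix_o_zero("SN-4O71"))  # the "O" sits between digits -> "SN-4071"
```

It obviously fails on mixed runs, which is why a per-font glyph model would still be the more robust fix.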
Going back to the page after closing it once, I noticed, written in smaller characters, that this somewhat pointless page is for a useless extension, as it is exclusively limited to the worst privacy offender among web browsers, one I would not touch with a stick. Google Chrome is the new Internet Explorer to me, as its main use is to download Firefox.
In conclusion this looked promising but a confusing web page and browser lock-in renders it useless and shows that it is far from doing what it claims.
"... on every image you see while browsing the web" should be "...on every image you see while browsing the web in google chrome".
No GitHub and no open license tells me that, as a Linux user of Opera, I'm pretty much assured I will never see a version of this extension.
Webpage is not to the point and design has some room for improvement. See point 1 and 2 of http://www.webpagesthatsuck.com/biggest-mistakes-in-web-desi...
In any case, pretty cool project, I'm a bit amazed how far we've come since I've last played with OCRs (and defeated one bad CAPTCHA implementation, still in use at pastebin.com it seems).
Then it just binarizes the image by whether the internal histogram is larger than the corresponding value of the color on the external histogram.
It's a strategy that works quite well on machine-printed text, but probably less effective than existing strategies when it comes to scans or photographs.
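The rule described above can be sketched in a few lines: build one grayscale histogram from pixels inside the detected text box and one from the pixels outside it, then mark a pixel as "text" when its gray value is relatively more common inside than outside. The region shape and lack of smoothing here are simplifying assumptions; the extension's actual heuristics are more involved.

```python
import numpy as np

def binarize(image, region_mask):
    """image: 2-D uint8 grayscale array; region_mask: bool array, True inside
    the detected text region. Returns a bool array of foreground pixels."""
    inside = np.bincount(image[region_mask].ravel(), minlength=256)
    outside = np.bincount(image[~region_mask].ravel(), minlength=256)
    # Normalize so the histograms are comparable despite different areas.
    inside = inside / max(inside.sum(), 1)
    outside = outside / max(outside.sum(), 1)
    is_text_value = inside > outside            # per gray-level decision
    return is_text_value[image] & region_mask   # lookup table, per pixel

# Toy example: a dark "glyph" (value 10) on a light page (value 200).
img = np.full((6, 6), 200, dtype=np.uint8)
img[2:4, 2:4] = 10                  # the glyph
mask = np.zeros((6, 6), dtype=bool)
mask[1:5, 1:5] = True               # detected text box around the glyph
print(binarize(img, mask).sum())    # -> 4 (the four glyph pixels)
```

On a clean machine-rendered image the background color dominates outside the box, so only the glyph values survive the comparison; noisy scans would need smoothing of both histograms first.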
I remember seeing that from the project list and really wishing I could download it right away.
Just another example that the "ideas are worthless!" saying is bullshit. This was a great idea, and whoever implemented it first, decently, would find success with it.
Oh god... how does it finish! I need closure!
(PS: this is awesome)
Too many webpages make it too hard to select even actual plain text.
"The fundamental problem of communication is that of reproducing atone point either exactly or approximately a message selected at anotherpoint. Frequently the messages have meamlng; that is they refer to or arecorrelated according to some system with certain physical or conceptualentities. These semantic aspects of communication are irrelevant to theengineering problem. The signiﬁcant aspect is that the actual message isone selected from a set of possible messages. The system must be designedto operate for each possible selection, not just the one which will actuallybe chosen since this is unknown at the time of design."
That seems to make sense to me, at least. Use Ocrad mode by default; if it doesn't perform well, switch to Tesseract and you'll hopefully get a better result.
Indistinguishable from magic.
It's spelled Naphtha (http://en.wikipedia.org/wiki/Naphtha). And for the HN hordes - read the bottom of the linked project page, it is supposed to be a reference to Naphtha.
Sarcasm is hard to read on the internet. I'm usually pretty good at it, but this one flew right past me.