Was this ever contributed back to opencv?
"BSD (unlike some other licenses) does not require that source code be distributed at all." 
Looks like there is no legal obligation to provide their source code or changes. But I really hope they did push their changes upstream. It is a real shame when companies that benefit so much from the open source community don't give back.
If someone there is reading this, please release this code in any way you like, even if you think it's too rushed or ugly. I'm sure the community can clean it up.
One dumb but surprisingly effective thing to try is to apply a bunch of random binarizing filters before putting your text into an OCR engine, then pick the most common output from those filtered images. Some combination of morphological operations (dilates/erodes in different orders and different strengths), different levels of blur and sharpening, adaptive binary threshold at different levels, resizing the image (hilarious, but some OCR engines are sensitive to relative scale), adding white borders of different thicknesses, rotating by small angles, denoising, ...
This gave us ~95% word level accuracy on our dataset (with tesseract as the OCR engine), without tuning the filters (we had used the filters individually before with some brittle logic for deciding when to use them: the random filter approach was kind of a joke which ended up working). Tesseract on its own had an abysmal ~40% word level accuracy on that same dataset. Dataset was of words segmented out of scanned bank statements.
Main downside was that this was pretty slow (we were generating ~10k filters per word), but we found a way to make it work (hierarchical stacked AWS Lambdas, lol).
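The approach described above can be sketched in a few lines. This is a minimal illustration of the voting idea only: the filter functions are stubbed out as identities (in practice they would be OpenCV morphology, blur, and threshold calls), and `ocr_engine` stands in for whatever engine you use (e.g. Tesseract).

```python
import random
from collections import Counter

# Hypothetical single-step filters, stubbed as no-ops. In a real
# pipeline these would be cv2.dilate, cv2.erode, cv2.GaussianBlur,
# cv2.adaptiveThreshold, cv2.resize, etc.
def dilate(img):  return img
def erode(img):   return img
def blur(img):    return img
def sharpen(img): return img

FILTER_STEPS = [dilate, erode, blur, sharpen]

def random_pipeline(max_steps=4):
    """Compose a random sequence of preprocessing steps."""
    steps = random.choices(FILTER_STEPS, k=random.randint(1, max_steps))
    def pipeline(img):
        for step in steps:
            img = step(img)
        return img
    return pipeline

def vote_ocr(img, ocr_engine, n_pipelines=100):
    """Run the OCR engine on many randomly filtered copies of the image
    and return the most common output (majority vote)."""
    outputs = [ocr_engine(random_pipeline()(img)) for _ in range(n_pipelines)]
    return Counter(outputs).most_common(1)[0][0]
```

Noisy misreads tend to be scattered across many different wrong strings, while the correct reading shows up repeatedly, so the plurality vote suppresses them.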
Later I played around with the open source Tesseract OCR, but it was way behind commercial desktop packages in accuracy/usability. It lagged especially in dealing with different document layouts. What's the current leader in desktop OCR? Has deep learning yet revolutionized the "traditional" OCR market for desktop document processing, or is it just making it easier for companies like Dropbox to quickly get something reasonably good working?
1. Full page OCR is trivial these days. Anyone who's just doing full-page OCR has no business using one of these commercial offerings. The commercial stuff is really for extraction of structured info from unstructured and semi-structured documents. ABBYY and Nuance are the only ones with products that can handle it. There are alternatives for simple capture tasks (like DocParser), but not for complex ones.
2. ABBYY and Nuance have a lot of IP in the space on lock.
3. The market for complex data extraction (at least the kind of stuff I do) may not actually be big enough for smaller players to bother pursuing (most new players are going after tasks of intermediate difficulty, like invoice and receipt capture).
4. There's still a need for on-prem style solutions for people doing huge volumes (hundreds of thousands of pages a year), whereas most of the new market entrants are cloud only.
5. I haven't used Nuance's stuff much, but ABBYY's products are actually incredibly robust under the hood: stuff you wouldn't want to build yourself unless you absolutely had to. FlexiCapture will do 95%+ accuracy out of the box (on a character-by-character basis) with no human verification.
So what are you supposed to use for full-page OCR?
You can compare all three here: https://ocr.space/compare-ocr-software
Replying to points 1 and 3: For smaller players and/or complex tasks you can always implement your own custom parser.
I'm doing work as a contractor in this space.
Respectfully, I disagree about this being a parsing issue. The whole reason so-called "zonal ocr" exists is because of the challenges of reliably inferring the structure of a document at parsing time. Yes, there are some kinds of documents where parsing logic alone will suffice, but for more complex tasks you need what ABBYY and Nuance are selling.
I think it boils down to the needs of each individual application.
Some cases have a lot of templates and need the "automatic fuzzy matching" functionality and the extra bells and whistles.
But smaller players often deal with just a handful of relatively simple templates where FlexiCapture would be overkill (not to mention a couple of other problems that I cover at the end of the post). Going custom is of course not an easy task, because you need someone who can design and implement an end-to-end system that possibly involves image processing, "zonal OCR", integrating an OCR engine, and reliable text extraction from images/PDFs (extracting text from PDFs is tricky). It's way easier for a non-developer to think about what rulesets/logic to apply without having to think about the image processing/OCR bits. I think that is one of the main selling points of FlexiCapture: it abstracts away the OCR bits so that the system designer can focus on the problem itself, design a spec, and think about the logic. Do you need deskewing of documents? Click a button and you get deskewing.
Which brings me to the second point. The products sold by ABBYY/Nuance are meant to be used by integrators (no programming needed other than the occasional VB.net script), not image processing specialists/developers. In my (biased) opinion, it makes more sense for some businesses to go the custom route instead of investing in FlexiCapture.
There is also FlexiCapture Engine that is meant for developers.
This has the same problems as the other offerings by ABBYY (I don't know about Nuance but I suspect it's the same):
- vendor lock-in
- ridiculous extra costs for things like "cloud/VM license", exporting to PDF, etc.
- limits on how many pages you can process per year or in total (complex licensing schemes)
- ABBYY really wants to sell you their own cluster/cloud management services which is all proprietary
- limited flexibility in implementing distributed services, costs that add up fast, you have to be trained in their own stack
1. FlexiCapture makes pre-processing incredibly painless and training-free. Beyond the usual binarization stuff, we extensively use the built in auto-rotation and cleanup (skew, noise, speckle, etc.).
2. Templating is really the big win for FlexiCapture. I have not seen anything else with a template GUI that comes close to being as usable, robust, or simple. That's really important to us because we build a LOT of templates. I have a really hard time imagining having to code them.
3. FlexiCapture's template engine is super strong for the kinds of documents we work with, which is mainly complex repeating groups with nested structures. It's also really good at handling both photos of documents (e.g. mobile) and scans. One thing it also offers that I haven't seen elsewhere in a turnkey product or existing platform is the ability to define zones in purely relative terms without absolute positions. I don't know about Nuance but I've not seen any other template GUI that will allow you to spec something like "look for either this word or a two-line string containing these words in the upper left quadrant of the document."
4. There's a dearth of zonal OCR frameworks. Outside of ABBYY's and Nuance's SDKs, the only one I'm even aware of is OpenKM, and I don't write Java. The FlexiCapture Engine SDK is a terrible beast. The documentation is horrible, it's Windows only, and it's all COM objects.
In that case ... it would be nice if Tensorflow had OCR as one of its examples/tutorials.
Do you have an example of anyone working on that? I think those packages have a lot of inertia because you can buy a product with an API, documentation, etc. Something which says “Learn TensorFlow & enough AI to be dangerous” is going to be hard to get in the door at many large organizations so this might be an area where a solid open-source project could have a significant impact.
The issue here is that industries become stagnant when it appears difficult/costly to introduce a new product in that market. You usually need a significant technical and/or social change to occur before incumbents can be displaced. In this case, it was a technical change.
The other part of the issue is that investors want "0 to 1" ideas to invest in, and founders often chase those ideas. In reality, we need more "1.0 to 1.5" products, especially in crowded markets.
I think that getting there is a lot more work than OP anticipated.
I too tried the Acrobat OCR first and found it to be vastly inferior. One surprise was that (at the time at least) the OCR-by-Acrobat files were not only much less accurate in the text layer, but the file size also bloated up surprisingly.
1. There is no dataset/competition like ImageNet for OCR.
2. Most people/conferences/universities are going after natural images and "computer vision" problems. OCR is its own animal and while it shares some concepts with computer vision it's not the same thing.
3. A lot of IP, knowledge and talent locked up in a handful of very old companies doing this for a long time. ABBYY is for OCR what Google + Facebook are for deep learning (maybe more).
4. OCR is kind of a niche, a lot of knowledge is not available to many people outside of a few insiders (ABBYY/Nuance, universities, research labs, OCR conferences).
I'm sure Google uses it a lot internally (e.g. Google Street View numbers etc.).
5. The incumbents don't just do OCR. They do preprocessing (computer vision/image processing) + OCR + NLP.
6. Hard to find data. ABBYY Finereader supports 190 languages. Collecting this data is no easy task.
I'm probably missing other reasons as well, but this is just off the top of my head.
That being said, I'm sure that there's going to be a lot of progress in the OCR + deep learning space soon.
This is a strong statement. Do you have any links to code or documentation that implements high-quality OCR with tensorflow?
I realize that this is a result of my lack of knowledge about the field, but the ML hype train makes it easy to overestimate the capabilities of deep learning as a layperson.
Now, creating a Word document from a scan is a different beast because it requires layout analysis. This is where Abbyy with its long experience still has a good lead.
Extracting data from documents requires a solution which uses OCR but is a different product (e.g. ABBYY FlexiCapture).
This is most commonly referred to as zonal OCR and comes with added functionality: handling multiple templates, defining zones/fields, specifying special rules for fields, a verification process for manual inspection (e.g. triggered when the image receives a low confidence recognition score), etc. This is different from, and more complex than, a product that does full page OCR (e.g. ABBYY Finereader).
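To make the zonal OCR idea concrete, here is a hypothetical, minimal template definition: each field gets a zone (in relative coordinates so it survives different scan resolutions), a validation rule, and a confidence threshold below which the value is routed to manual verification. The field names, zones, and rules are invented for illustration; `ocr_zone` stands in for whatever engine crops and recognizes a region.

```python
import re

# Hypothetical template: two fields on an invoice-like document.
TEMPLATE = {
    "invoice_number": {
        "zone": (0.60, 0.05, 0.95, 0.12),  # (left, top, right, bottom) as page fractions
        "rule": r"^INV-\d{6}$",
        "min_confidence": 0.80,
    },
    "total": {
        "zone": (0.70, 0.85, 0.95, 0.92),
        "rule": r"^\d+\.\d{2}$",
        "min_confidence": 0.90,
    },
}

def extract_field(ocr_zone, name, template=TEMPLATE):
    """OCR one templated zone and flag it for manual review when the
    validation rule fails or recognition confidence is too low.
    `ocr_zone` is any callable (zone) -> (text, confidence)."""
    spec = template[name]
    text, conf = ocr_zone(spec["zone"])
    needs_review = conf < spec["min_confidence"] or not re.match(spec["rule"], text)
    return {"value": text, "confidence": conf, "needs_review": needs_review}
```

Products like FlexiCapture wrap exactly this kind of logic (plus fuzzy template matching, operator UIs, and export) behind a GUI.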
Handwritten OCR is a whole different story. The products that do zonal OCR will fail to recognize handwritten text, unless it's in boxes (PDF forms). I'm working on a prototype that can handle handwritten text outside of boxes too.
They don't offer OCR publicly; instead, they generate a list of possible candidates to match against search queries (fuzzy matching).
From one of the blog posts: "Employing multiple reco engines is important not only because they specialize on different types of text and languages, but also as this allows to implement ‘voting’ mechanism — analyzing word alternatives created by diverse engines for the same word allows for better suppression of false recognitions and giving more confidence to consensus variants."
It was not perfect, but I was very impressed with the quality, speed, and price.
Preprocessing the images (OCR pipeline) is very important for OCR. For generic scanned PDF documents Finereader does a pretty good job.
There is a lot of stuff going on in an OCR engine: layout analysis, dewarping, binarization, deskewing, despeckling (and others), and then there's the OCR itself.
With Tesseract you have to do a lot of things yourself, you have to provide it with a clean image. The commercial packages do that for you automatically. ABBYY and other solutions also use NLP to augment/check the OCR results from a semantic analysis perspective.
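As an example of one of those pipeline stages, here is a pure-Python sketch of Otsu's method, a classic binarization technique: it picks the grayscale threshold that best separates ink from background by maximizing between-class variance. (Real pipelines would use an optimized implementation, e.g. OpenCV's; this just shows the idea.)

```python
def otsu_threshold(pixels):
    """Otsu's method: return the 0-255 threshold maximizing the
    between-class variance of the two resulting pixel groups.
    `pixels` is a flat iterable of grayscale intensities."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg = sum_bg = 0
    for t in range(256):
        w_bg += hist[t]            # background = pixels at or below t
        if w_bg == 0:
            continue
        w_fg = total - w_bg
        if w_fg == 0:
            break
        sum_bg += t * hist[t]
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

def binarize(pixels, threshold):
    """Map each pixel to pure black (0) or white (255)."""
    return [0 if p <= threshold else 255 for p in pixels]
```

Commercial engines chain many such stages (deskew, despeckle, dewarp) automatically, which is exactly what you end up hand-rolling around Tesseract.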
Also, there is no "one size fits all" OCR. It is highly specific to the nature of the application. Consider the following use cases:
- scanned PDF document
- scanned document with a non-standard font (e.g. Fraktur script in a historic book)
- photo of scanned document acquired with a mobile phone's camera
- passport OCR (MRZ)
- credit card OCR
- text appearing in natural image (e.g. store sign)
There is a growing number of papers using deep learning that get submitted to ICDAR (the premier OCR conference) and the other OCR conferences.
One of the problems is the lack of a universal dataset/competition like ImageNet.
The SmartDoc competition (documents captured from smartphones) was cancelled this year due to an insufficient number of participants.
If anyone is doing work with OCR + deep learning, I'd love to discuss!
How long ago did you use it? We use it to OCR photos of paper checks taken via smartphone and pull contact information out. It works pretty well.
I was trying to OCR scanned scientific publications with multi-column layout and figures. I didn't expect anything to get all the captions right, or handle specialized notation, but it was important to recognize all the ordinary English words on the page and to have the text flow in the right order.
Snap a picture of a fixed layout form from smartphone & extract TYPED data
If you want to extract data out of documents/forms then you need to develop your own solution (I'm doing work in this area) or use expensive packages like ABBYY FlexiCapture.
Images taken with a smartphone (as opposed to scanned documents) are going to make the problem harder.
Glad to hear this.
There was a recent post on the Google Developers blog about a similar project to recognize bottle-cap codes in a mobile app:
Solving problems like DropTurk with stronger guarantees around turnaround/quality/confidentiality is exactly what we do at Scale. Would love to hear more about DropTurk’s experiences around building this internally and how it compares to our product:
Are there any Evernote -> Dropbox Paper migration tools out there?
Online is when you have info about the strokes (e.g. writing on a tablet or on an HTML5 canvas); offline is when you don't have anything other than the image.
Online is mostly solved; offline is not working well yet. Deep learning will certainly help accelerate progress.
I'm working on a prototype that extracts (and transcribes) handwriting info from PDF documents (like forms).
The online system is not so useful, I prefer to type when I have the option anyway. It's faster.
The offline system initially doesn't even need to be that accurate. OCRing printed text must be accurate because of the sheer volume being OCRed. Handwritten stuff tends to be of far smaller volume, and so can be hand-corrected more practically.
I mostly use GoodNotes. I don't convert my handwriting to text, but it is nice that I can search my notes.
And yes, context is needed to recognize cursive. At the very least, word frequencies combined with a dictionary should get "most likely" recognitions.
For example, my OCR software (for printed text) often recognizes ? as }. Come on.
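The dictionary-plus-frequency idea is easy to sketch: rank dictionary words by string similarity to the raw OCR output and break ties with corpus frequency. The tiny frequency table here is invented for illustration; a real one would be built from a large corpus.

```python
from difflib import SequenceMatcher

# Hypothetical toy frequency dictionary (word -> corpus count).
WORD_FREQ = {"what": 450, "that": 900, "chat": 30, "hat": 60}

def best_guess(ocr_output, word_freq=WORD_FREQ):
    """Return the dictionary word most similar to the raw OCR output,
    preferring more frequent words when similarity ties."""
    def score(word):
        sim = SequenceMatcher(None, ocr_output, word).ratio()
        return (sim, word_freq[word])
    return max(word_freq, key=score)
```

So an OCR misread like "wha+" would still resolve to "what", because no other dictionary word matches the garbled string as closely.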
StackOverflow is filled with such answers. Which is why many people ask questions defensively, also mentioning the alternatives that they don't want to do. If I were to ask this as a StackOverflow question, I'd do it like ...
How can I make Dropbox ignore certain files from syncing? Is there some sort of .dropboxignore mechanism?
PS: I don't want to save files outside Dropbox's directory, I do want to synchronize my projects on Dropbox, and let's assume for the sake of argument that I know what I'm doing.
(at which point I'd realize that I'm becoming aggressive and that I probably answered my own question, therefore closing the browser window in frustration)
Check out a WIP branch and push it. You can pull it on any other computer you want and keep working on it till you're comfortable enough to merge back into master/develop.
I commit WIP code to a separate branch, change computers, and now if I go back to the original computer, I need to make sure to push, go back, pull again... god forbid I had done interactive rebasing or amended a commit, since it'll be all out of sync... it's just more complicated than simply saving a file and having it reflected. "Temporary commits" should be a rare thing, not mandatory.
I end up using both git and dropbox for some projects and it mostly works, but having projects on dropbox is a mess because of node_modules, having sync running continuously because some tools/projects/IDEs will keep generating temporary files, etc.
Why not use a common git server? Why not use a dedicated backup solution? Many backup apps can store backups to Dropbox if that's what you want.
Myself, when I'm at home, I just check out the project on my desktop, use SSHFS to mount it on my laptop, edit it on my laptop or desktop as needed, and have an SSH window open from my laptop to my desktop for doing things like running npm (because that'd be way too slow if I ran it from the laptop and depended on SSHFS to sync it.)
If X11 or RDP was snappier, I'd just use my laptop as a thin client, but it's not, so I'm stuck with figuring out multi-master realtime source-code tree sync.
A similar problem obtains when using VMs, but at least there, there isn't the requirement that I have the ability to run a GUI text editor in the VM that sees changes I make outside the VM quickly enough to be unnoticeable.
There are options for not syncing some files, for how to resolve conflicts, etc.
But for personal projects I use syncthing, which supports .stignore, gives you versioning options, and most importantly doesn't randomly revert your files back to a previous version because of clock drift.
basically, use FUSE to create an alternative view of your filesystem, but that FUSE layer will also exclude certain files matching a regex. Only the FUSE view into your files should be put in the Dropbox folder and synced. This particular FUSE setup is read-only, which was fine for my purposes, but you might need an alternative which allows write access for Dropbox.
> To peek at personal data
How is this a concern when that is the purpose of the company's software?
I think you are confusing dropbox for something else?
In fact I think dropbox would be a horrible backup solution! Don't use it for backup or you will be disappointed.
(Not saying you're wrong)
They provide storage to lots of individuals, but that isn't really where they are trying to grow their business.
If you read further into the article, they did exactly what you suggested to train their own system:
"A corpus with words to use...A collection of fonts for drawing the words...A set of geometric and photometric transformations meant to simulate real world distortions...The generation algorithm simply samples from each of these to create a unique training example."
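The sampling scheme that quote describes is simple to sketch. The words, fonts, and transform names below are invented placeholders; actual rendering and distortion (e.g. with PIL) are omitted, since the point is just that each training example is an independent random draw from the three ingredients.

```python
import random

# Hypothetical stand-ins for the three ingredients the post describes.
CORPUS = ["invoice", "total", "receipt", "dropbox"]
FONTS = ["Helvetica", "Times", "Courier"]
TRANSFORMS = ["rotate_2deg", "gaussian_blur", "perspective_warp", "jpeg_noise"]

def generate_example(rng=random):
    """Sample one synthetic training example: a word, a font to render
    it in, and a set of distortions to apply to the rendered image."""
    return {
        "word": rng.choice(CORPUS),
        "font": rng.choice(FONTS),
        "transforms": rng.sample(TRANSFORMS, k=rng.randint(0, len(TRANSFORMS))),
    }
```

Because the generator can produce unlimited labeled pairs (rendered image, ground-truth word), it sidesteps the need to hand-label real scans.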