Creating a Modern OCR Pipeline Using Computer Vision and Deep Learning (dropbox.com)
460 points by wwarner 3 months ago | 125 comments

> Our Word Detector was also a significant bottleneck. We ended up essentially rewriting OpenCV’s C++ MSER implementation in a more modular way to avoid duplicating slow work when doing two passes (to be able to handle both black on white text as well as white on black text); to expose more to our Python layer (the underlying MSER tree hierarchy) for more efficient processing; and to make the code actually readable.
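The two-pass trick mentioned in the quote (detecting both polarities of text) can be sketched in a few lines; here `detector` is just a placeholder for an MSER-style region finder, and note this naive version does the full detection work twice, which is exactly the duplication the rewrite avoids:

```python
def invert(gray):
    """Flip polarity so white-on-black text becomes black-on-white.

    `gray` is a 2D list of 8-bit grayscale pixel values.
    """
    return [[255 - p for p in row] for row in gray]

def detect_both_polarities(gray, detector):
    """Run a single-polarity region detector on the original and the
    inverted image, then merge the candidate text regions."""
    return detector(gray) + detector(invert(gray))
```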

Was this ever contributed back to opencv?

OpenCV is licensed under the BSD license. [1]

"BSD (unlike some other licenses) does not require that source code be distributed at all." [2]

Looks like there is no legal obligation to provide their source code or changes. But I really hope they did push their changes upstream. It is a real shame when companies that benefit so much from the open source community don't give back.

1. https://github.com/opencv/opencv/blob/master/LICENSE

2. https://en.wikipedia.org/wiki/BSD_licenses

Even under GPL there would be no requirement, since this happens on the server, not the client. The AGPL is the only license I'm aware of that places this requirement when not distributing the software.

Hopefully, those who miss that Dropbox code today will think about using a more powerful license (i.e. AGPL) on the code they write themselves in the future, so that their users will contribute back the way they want Dropbox to. :-)

Pretty sure that both OpenCV and Dropbox knew what they were doing when they avoided the affero license.

Looking forward to seeing this, Dropbox.

If someone there is reading this, please release this code in any way you like, even if you think it's too rushed or ugly. I'm sure the community can clean it up.

A better baseline than an out-of-the-box OCR engine is applying some computer vision techniques before using an out-of-the-box OCR engine. This can significantly outperform a pure OCR engine approach. (Speaking from experience working at startups doing this kind of stuff for ~10 months.)

One dumb but surprisingly effective thing to try is to apply a bunch of random binarizing filters before putting your text into an OCR engine, then pick the most common output from those filtered images. Some combination of morphological operations (dilates/erodes in different orders and different strengths), different levels of blur and sharpening, adaptive binary thresholds at different levels, resizing the image (hilarious, but some OCR engines are sensitive to relative scale), adding white borders of different thicknesses, rotating by small angles, denoising, ...

This gave us ~95% word level accuracy on our dataset (with tesseract as the OCR engine), without tuning the filters (we had used the filters individually before with some brittle logic for deciding when to use them: the random filter approach was kind of a joke which ended up working). Tesseract on its own had an abysmal ~40% word level accuracy on that same dataset. Dataset was of words segmented out of scanned bank statements.

Main downside was that this was pretty slow (we were generating ~10k filters per word), but we found a way to make it work (hierarchical stacked AWS Lambdas, lol).
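A minimal sketch of the voting scheme described above, with the preprocessing and OCR calls abstracted into stand-in callables (in practice `run_ocr` might be Tesseract via pytesseract and `apply_filter` an OpenCV pipeline; the filter parameters here are purely illustrative, not tuned):

```python
import random
from collections import Counter

def random_filter_params(rng):
    """One random preprocessing recipe; all values are illustrative."""
    return {
        "blur_kernel": rng.choice([1, 3, 5]),   # Gaussian blur size
        "threshold": rng.randint(100, 180),     # binarization level
        "dilate_iters": rng.randint(0, 2),      # morphological dilate passes
        "border_px": rng.choice([0, 5, 10]),    # added white border
        "rotate_deg": rng.uniform(-2.0, 2.0),   # small rotation
    }

def vote(outputs):
    """Pick the most common non-empty OCR output among filtered variants."""
    counts = Counter(o.strip() for o in outputs if o.strip())
    return counts.most_common(1)[0][0] if counts else ""

def ensemble_ocr(image, apply_filter, run_ocr, n_filters=100, seed=0):
    """Apply `n_filters` random filters, OCR each variant, and vote.

    `apply_filter(image, params)` and `run_ocr(image)` are hypothetical
    stand-ins for the preprocessing and OCR engine calls.
    """
    rng = random.Random(seed)
    outputs = [run_ocr(apply_filter(image, random_filter_params(rng)))
               for _ in range(n_filters)]
    return vote(outputs)
```

The voting step is what makes the randomness safe: an individual bad filter produces a garbled string that almost never agrees with the others, so the majority answer dominates.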

Selfless plug, but for some examples of these see a side project of mine: https://www.juusohaavisto.com/northern-nike-nabob.html

Or post processing with domain specific knowledge. eg, https://www.taggun.io/ using google vision for receipts and then applying some ml, nlp, etc.

...and the result was better than what you could buy from Google cloud vision or ocr.space out of the box?

Interestingly, when you think about it, the filters in a convolutional neural net aren't too different from your setup, and they provide similar, if not better, performance.

I would be interested in exchanging notes, send me an email if you're up to it.

This is called an ensemble method.

c'mon -- that's the easy way!

10 years ago I used Abbyy Finereader to add an OCR text layer to about 500,000 pages of scanned PDF documents. At the time it appeared to be the best OCR software available, at least of the options I could try.

Later I played around with the open source Tesseract OCR, but it was way behind commercial desktop packages in accuracy/usability. It lagged especially in dealing with different document layouts. What's the current leader in desktop OCR? Has deep learning yet revolutionized the "traditional" OCR market for desktop document processing, or is it just making it easier for companies like Dropbox to quickly get something reasonably good working?

I had to research the commercial OCR market recently for a client project. It's abysmal. Most of the solutions are horrendously overpriced, windows only enterprise packages, and the ones that I was able to try out bordered on unusable (bad results and terrible UI). Since OCR is basically "hello world" for tensorflow, I don't understand why these incumbents haven't been wiped off the map.

Lots of reasons they haven't been wiped off the map.

1. Full page OCR is trivial these days. Anyone who's just doing full-page OCR has no business using one of these commercial offerings. The commercial stuff is really for extraction of structured info from unstructured and semi-structured documents. ABBYY and Nuance are the only ones with products that can handle it. There are alternatives for simple capture tasks (like DocParser), but not for complex ones.

2. ABBYY and Nuance have a lot of IP in the space on lock.

3. The market for complex data extraction (at least the kind of stuff I do) may not actually be big enough for smaller players to bother pursuing (most new players are going after tasks of intermediate difficulty, like invoice and receipt capture).

4. There's still a need for on-prem style solutions for people doing huge volumes (hundreds of thousands of pages a year), whereas most of the new market entrants are cloud only.

5. I haven't used Nuance's stuff much but ABBYY's products are actually incredibly robust under the hood -- stuff you wouldn't want to build yourself unless you absolutely had to. FlexiCapture will do 95%+ accuracy out of the box (on a character-by-character basis) with no human verification.

> Anyone who's just doing full-page OCR has no business using one of these commercial offerings.

So what are you supposed to use for full-page OCR?

As far as hosted solutions go, the best are Google Cloud Vision, Azure OCR and OCR.space

You can compare all three here: https://ocr.space/compare-ocr-software

For an opensource solution that uses Tesseract, check out ocrmypdf: https://github.com/jbarlow83/OCRmyPDF

I wouldn't say that full page OCR is trivial. Using an opensource solution (99% based on Tesseract) is going to get you ok-ish results if your input is relatively clean (no complex layout, scanned documents from a flatbed scanner, standard fonts) and you don't care about speed. If you care about recognition accuracy then Tesseract isn't going to cut it (at least not without some serious effort).

Replying to points 1 and 3: For smaller players and/or complex tasks you can always implement your own custom parser.

I'm doing work as a contractor in this space.

I agree with you that Tesseract isn't great out of the box, but if you aren't doing huge volumes, there are plenty of cloud options available.

Respectfully, I disagree about this being a parsing issue. The whole reason so-called "zonal ocr" exists is because of the challenges of reliably inferring the structure of a document at parsing time. Yes, there are some kinds of documents where parsing logic alone will suffice, but for more complex tasks you need what ABBYY and Nuance are selling.

Just to be sure that we're talking about the same thing, by "custom parser" I meant implementing your own barebones "zonal OCR" functionality with just the features that are needed for the specific problem.
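For the simple-template case, such a barebones "zonal OCR" pass can be little more than assigning full-page OCR word boxes to named field rectangles. A sketch (the zone coordinates and field names are hypothetical):

```python
def inside(box, zone):
    """True if the word box's center falls inside the zone rectangle.

    Boxes and zones are (x0, y0, x1, y1) tuples in page coordinates.
    """
    cx = (box[0] + box[2]) / 2
    cy = (box[1] + box[3]) / 2
    return zone[0] <= cx <= zone[2] and zone[1] <= cy <= zone[3]

def extract_fields(words, zones):
    """Group OCR'd words into named template zones.

    `words` is a list of (text, box) pairs from full-page OCR;
    `zones` maps field names to rectangles defined per template.
    """
    fields = {name: [] for name in zones}
    for text, box in words:
        for name, zone in zones.items():
            if inside(box, zone):
                fields[name].append(text)
    return {name: " ".join(ws) for name, ws in fields.items()}
```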

I think it boils down to the needs of each individual application.

Some cases have a lot of templates and need the "automatic fuzzy matching" functionality and the extra bells and whistles.

But smaller players often deal with just a handful of relatively simple templates where FlexiCapture would be overkill (not to mention a couple of other problems that I cover at the end of the post). This is of course not an easy task, because you need someone who can design and implement an end-to-end system that possibly involves image processing, "zonal OCR", an OCR engine, and reliable text extraction from images/PDFs (extracting text from PDFs is tricky). It's way easier for a non-developer to think about what rulesets/logic to apply without having to think about the image processing/OCR bits. I think that is one of the main selling points of FlexiCapture: it abstracts the OCR bits so that the system designer can focus on the problem itself, design a spec, and think about the logic. Do you need deskewing of documents? Click a button and you get deskewing.

Which brings me to the second point. The products sold by ABBYY/Nuance are meant to be used by integrators (no programming needed other than the occasional VB.net script), not image processing specialists/developers. In my (biased) opinion, it makes more sense for some businesses to go the custom route instead of investing in FlexiCapture.

There is also FlexiCapture Engine that is meant for developers. This has the same problems as the other offerings by ABBYY (I don't know about Nuance but I suspect it's the same):

  - expensive
  - vendor lock-in
  - ridiculous extra costs for things like "cloud/VM license", exporting to PDF, etc.
  - limits on how many pages you can process per year or in total (complex licensing schemes)
  - ABBYY really wants to sell you their own cluster/cloud management services which is all proprietary
  - limited flexibility in implementing distributed services, costs that add up fast, you have to be trained in their own stack

Can you provide an example where you think that a custom solution would not work? I'm curious.

First of all, why don't you shoot me an email at info@jurymatic.com and we can talk further. In a nutshell, it would have been way more expensive and difficult for us to roll our own than even the high cost of a FlexiCapture license. But here's a reasonably complete explanation of the build vs buy analysis we did.

1. FlexiCapture makes pre-processing incredibly painless and training-free. Beyond the usual binarization stuff, we extensively use the built in auto-rotation and cleanup (skew, noise, speckle, etc.).

2. Templating is really the big win for FlexiCapture. I have not seen anything else with a template GUI that comes close to being as usable, robust, or simple. That's really important to us because we build a LOT of templates. I have a really hard time imagining having to code them.

3. FlexiCapture's template engine is super strong for the kinds of documents we work with, which is mainly complex repeating groups with nested structures. It's also really good at handling both photos of documents (e.g. mobile) and scans. One thing it also offers that I haven't seen elsewhere in a turnkey product or existing platform is the ability to define zones in purely relative terms without absolute positions. I don't know about Nuance but I've not seen any other template GUI that will allow you to spec something like "look for either this word or a two-line string containing these words in the upper left quadrant of the document."
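A rough sketch of what such a relative zone rule might look like in code, assuming all you have is word boxes from full-page OCR (the quadrant check, anchor phrases, and offsets are made up for illustration):

```python
def find_anchor(words, phrases, page_w, page_h):
    """Locate the first anchor phrase found in the upper-left quadrant.

    `words` is a list of (text, (x0, y0, x1, y1)) pairs from OCR;
    `phrases` is a list of acceptable anchor strings.
    """
    wanted = {p.lower() for p in phrases}
    for text, (x0, y0, x1, y1) in words:
        in_quadrant = x1 <= page_w / 2 and y1 <= page_h / 2
        if in_quadrant and text.lower() in wanted:
            return (x0, y0, x1, y1)
    return None

def zone_right_of(anchor, width, pad=10):
    """Define a capture zone immediately to the right of the anchor box,
    i.e. purely relative to where the anchor was found."""
    x0, y0, x1, y1 = anchor
    return (x1 + pad, y0, x1 + pad + width, y1)
```

The point of the relative formulation is that the zone survives shifted scans and photos: it moves with the anchor instead of depending on absolute positions.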

4. There's a dearth of zonal OCR frameworks. Outside of ABBYY's and Nuance's SDKs, the only one I'm even aware of is OpenKM, and I don't write Java. The FlexiCapture Engine SDK is a terrible beast. The documentation is horrible, it's Windows only, and it's all COM objects.

OCR is not "hello world" for tensorflow. The entire article is devoted to actually developing the OCR pipeline. They ported the model over to tensorflow as a secondary thing.

Character recognition might be "hello world" but full page OCR definitely is not. When I read the article and think about all the obstacles Dropbox overcame, it looks like a pretty huge achievement. I can tell you, it would take me a lot longer than 8 months to pull it off.

I am certain they had a team on this, I just wonder what its size was. Quite an achievement!

OCR for really basic documents might be "hello world", but as usual real world practical applications have all sorts of interesting things that make this a harder problem than it seems.

> Since OCR is basically "hello world" for tensorflow [...]

In that case ... it would be nice if Tensorflow had OCR as one of its examples/tutorials.

> Since OCR is basically "hello world" for tensorflow, I don't understand why these incumbents haven't been wiped off the map.

Do you have an example of anyone working on that? I think those packages have a lot of inertia because you can buy a product with an API, documentation, etc. Something which says “Learn TensorFlow & enough AI to be dangerous” is going to be hard to get in the door at many large organizations so this might be an area where a solid open-source project could have a significant impact.

The parent post is not implying that companies adopt Tensorflow themselves and solve this problem. The post is implying that a competitor come into the market using the newer, better technology, and replace them with a better product (with an API, documentation, etc).

The issue here is that industries become stagnant when it appears difficult/costly to introduce a new product in that market. You usually need a significant technical and/or social change to occur before incumbents can be displaced. In this case, it was a technical change.

The other part of the issue is that investors want "0 to 1" ideas to invest in, and founders often chase those ideas. In reality, we need more "1.0 to 1.5" products, especially in crowded markets.

I think your last point is correct: there's a huge gap between the kind of “hello world” demo the original poster was talking about and a product-level tool which has reasonable training data and performance across a non-trivial range of documents.

I think that getting there is a lot more work than OP anticipated.

OP here; yes, you've summed it up perfectly.

I recently looked at a number of packages, and the best option in my opinion is ABBYY Finereader Pro for Mac. The Mac version can easily be automated with Hazel or Automator, which adds features that cost $400+ more on the PC. The results are worlds better than Acrobat's.

When I did my big project I was using the cheap standard edition of Finereader on (virtualized) Windows, and I built an automated pipeline for it with a combination of AutoHotkey, pdftk, Multivalent pdf tools, and Python.

I too tried the Acrobat OCR first and found it to be vastly inferior. One surprise was that (at the time at least) the OCR-by-Acrobat files were not only much less accurate in the text layer, but the file size also bloated up surprisingly.

Some thoughts (adding to staticautomatic's post):

1. There is no dataset/competition like ImageNet for OCR.

2. Most people/conferences/universities are going after natural images and "computer vision" problems. OCR is its own animal and while it shares some concepts with computer vision it's not the same thing.

3. A lot of IP, knowledge and talent locked up in a handful of very old companies doing this for a long time. ABBYY is for OCR what Google + Facebook are for deep learning (maybe more).

4. OCR is kind of a niche, a lot of knowledge is not available to many people outside of a few insiders (ABBYY/Nuance, universities, research labs, OCR conferences). I'm sure Google uses it a lot internally (e.g. Google Street View numbers etc.).

5. The incumbents don't just do OCR. They do preprocessing (computer vision/image processing) + OCR + NLP.

6. Hard to find data. ABBYY Finereader supports 190 languages. Collecting this data is no easy task.

I'm probably missing other reasons as well, but this is just off the top of my head.

That being said, I'm sure that there's going to be a lot of progress in the OCR + deep learning space soon.

>Since OCR is basically "hello world" for tensorflow,

This is a strong statement. Do you have any links to code or documentation that implements high-quality OCR with tensorflow?

Not high quality. But since MNIST is typically used as a baseline reference, I was surprised that more advanced character recognition wasn't a staple in the field. From the tensorflow introduction[0]: "When one learns how to program, there's a tradition that the first thing you do is print 'Hello World.' Just like programming has Hello World, machine learning has MNIST."

I realize that this is a result of my lack of knowledge about the field, but the ML hype train makes it easy to overestimate the capabilities of deep learning as a layperson.

[0] https://www.tensorflow.org/get_started/mnist/beginners

Because, for example, ABBYY replaces the image with a PDF with invisible text overlays. I think it is a bit more complicated once you account for the formatting of documents.

ABBYY has had that ability for a longer period of time, so it may be more accurate, but this is something Tesseract supports[1]. In fact I'd say most OCR systems support it with varying degrees of accuracy.


The ability to create searchable PDFs is very useful and convenient. But creating searchable PDFs does not require a deep understanding of the document format (like column detection etc). You just place the words at the right coordinates of their bounding boxes. You can test this for example here: https://ocr.space - select the option to create a searchable PDF. It works even for the most complex documents.
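The coordinate bookkeeping is the only fiddly part of that word placement: OCR reports boxes in image pixels with a top-left origin, while PDF points use a bottom-left origin. A sketch of the mapping (the default page size assumes US Letter at 72 points per inch):

```python
def px_box_to_pdf_points(box, img_w, img_h, page_w=612.0, page_h=792.0):
    """Map an OCR bounding box from image pixels to PDF points.

    `box` is (x0, y0, x1, y1) with a top-left origin; the result is
    (x, y, width, height) in PDF points with a bottom-left origin,
    suitable for placing an invisible text run under the page image.
    """
    sx = page_w / img_w
    sy = page_h / img_h
    x0, y0, x1, y1 = box
    w = (x1 - x0) * sx
    h = (y1 - y0) * sy
    x = x0 * sx
    y = page_h - y1 * sy  # flip the vertical axis
    return (x, y, w, h)
```

With each word dropped at its mapped position (and rendered in an invisible text mode), text search and copy/paste work without any layout analysis at all.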

Now, creating a Word document from a scan is a different beast because it requires layout analysis. This is where Abbyy with its long experience still has a good lead.

My issue was that I needed one that exposes an API for doing OCR on pictures taken from mobile devices. It was really difficult to find non-desktop packages.

Out of curiosity, what did you need beyond something like OpenCV/Metal + Tesseract?

The client needed an OCR solution for supplier invoices with a variety of layouts and a combination of printed and hand-written characters, and didn't have the budget for a bespoke solution. To be fair, it's a very hard problem, I was just surprised that given all the much hyped recent advancements in deep learning for computer vision, most of the solutions in the market seem to be running on decades old technology.

Well it is more complex than it appears.

Extracting data from documents requires a solution which uses OCR but is a different product (e.g. ABBYY FlexiCapture).

This is most commonly referred to as zonal OCR and comes with the added functionality of handling multiple templates, defining zones/fields, specifying special rules for fields, verification process for manual inspection (e.g. triggered when the image receives a low confidence recognition score) etc. This is different and more complex than a product that does full page OCR (e.g. ABBYY Finereader).

Handwritten OCR is a whole different story. The products that do zonal OCR will fail to recognize handwritten text, unless it's in boxes (PDF forms). I'm working on a prototype that can handle handwritten text outside of boxes too.

Do you know what Evernote uses? Although I don't really use it anymore, when I did, I was astonished at how well it found and indexed text.

They use multiple OCR engines. Some developed in-house and some proprietary ones. A blog post mentions I.R.I.S. as one of the proprietary ones.

They don't offer OCR publicly; instead, they generate a list of possible candidates to match against search queries (fuzzy matching).

From one of the blog posts: "Employing multiple reco engines is important not only because they specialize on different types of text and languages, but also as this allows to implement ‘voting’ mechanism — analyzing word alternatives created by diverse engines for the same word allows for better suppression of false recognitions and giving more confidence to consensus variants."

More: http://blog.evernote.com/tech/2013/07/18/how-evernotes-image...



If I recall, they don't do traditional OCR; rather, they build a term probability database that lets them do matching on search terms. Both Evernote and ABBYY seem to have ties to a specific Russian research group.

I know you are talking about Desktop OCR, but I had a very good experience with Google Vision API. On my previous gig we were trying to automate receipt scanning, and it gave very good results with no previous image manipulation whatsoever (meaning, no special camera alignment, lighting condition, rotation, skewing, etc).

It was not perfect, but I was very impressed with the quality, speed, and price.

ABBYY has dominated the field for many years (decades really) and still outperforms every solution out there. OmniPage by Nuance is probably the second best.

Preprocessing the images (OCR pipeline) is very important for OCR. For generic scanned PDF documents Finereader does a pretty good job.

There is a lot of stuff going on in an OCR engine. Layout analysis, dewarping, binarization, deskewing, despeckling (and others), and then there's the OCR itself. With Tesseract you have to do a lot of things yourself; you have to provide it with a clean image. The commercial packages do that for you automatically. ABBYY and other solutions also use NLP to augment/check the OCR results from a semantic analysis perspective.

Also, there is no "one size fits all" OCR. It is highly specific to the nature of the application. Consider the following use cases:

  - scanned PDF document
  - scanned document with a non-standard font (e.g. Fraktur script in a historic book)
  - photo of scanned document acquired with a mobile phone's camera
  - passport OCR (MRZ)
  - credit card OCR
  - text appearing in natural image (e.g. store sign)

These are all "OCR projects" but they require very different approaches. You cannot just throw any input image at an OCR engine and expect it to work. It often requires a mix of computer vision/image processing, machine learning and OCR engine.
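To make the passport case concrete: the machine-readable zone (MRZ) carries its own check digits, computed with the ICAO 9303 weighting scheme, so an OCR pipeline can validate what it read instead of trusting the engine blindly. A minimal version:

```python
def mrz_check_digit(field):
    """Compute the ICAO 9303 check digit for an MRZ field.

    Digits keep their value, letters map A=10..Z=35, and the filler
    character '<' counts as 0; positions are weighted 7, 3, 1 repeating,
    and the check digit is the weighted sum modulo 10.
    """
    weights = (7, 3, 1)
    total = 0
    for i, ch in enumerate(field.upper()):
        if ch.isdigit():
            value = int(ch)
        elif ch == "<":
            value = 0
        else:
            value = ord(ch) - ord("A") + 10
        total += value * weights[i % 3]
    return total % 10
```

If the recomputed digit doesn't match the one printed in the MRZ, the pipeline knows at least one character in that field was misread and can retry or flag it.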

There is a growing number of papers using deep learning that get submitted to ICDAR (the premier OCR conference) and the other OCR conferences. One of the problems is the lack of a universal dataset/competition like ImageNet. The SmartDoc competition (documents captured from smartphones) was cancelled this year due to an insufficient number of participants.

If anyone is doing work with OCR + deep learning, I'd love to discuss!

The unreleased v4 version of tesseract has a new engine based on state of the art deep learning techniques. In some initial poking at it, it looks like it might blow the existing commercial solutions away in terms of quality.

In my humble experience it is much better than v3 but still not as good as ABBYY.

Can you elaborate on this a bit? It still has some edge case quality issues, but I haven't seen anything where ABBYY does better so far. That said, the default app is missing a bunch of preprocessing that you have to add (e.g. page deskew, flipping, etc).

Do you have an idea of when it is going to be released? Or this is only going to be a Google internal thing?

it's in the master branch, it just hasn't been tagged for release yet.

Oh, thanks. I had taken a look at the commit history, which didn't show many commits after the last release, so I thought it hadn't been merged yet.


It uses LSTM for the line recognizer.
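For context on how an LSTM line recognizer turns per-timestep network outputs into a character sequence: engines like Tesseract 4's are trained with CTC, whose simplest (greedy) decode collapses repeated labels and drops blanks. A toy version of that decode step:

```python
def ctc_greedy_decode(timestep_labels, blank=0):
    """Collapse a per-timestep best-label sequence into an output sequence.

    Repeated labels are merged first, then blanks are removed, so the
    network can emit the same character twice in a row by separating
    the two emissions with a blank (e.g. the "ll" in "hello").
    """
    decoded = []
    prev = None
    for label in timestep_labels:
        if label != prev and label != blank:
            decoded.append(label)
        prev = label
    return decoded
```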

CuneiForm was a close competitor to ABBYY that got open sourced; both were Russian companies. CuneiForm OCR is certainly better than Tesseract OCR. The open source Windows version is feature complete and very similar to ABBYY's, though it compiles only with VS 6. Some converted it to a Linux application, but most advanced features never got ported, and the great native UI isn't available, unfortunately. It would be great if a community would reactivate this open source OCR gem, though some Russian-speaking devs are needed to translate the comments. (Tesseract OCR is based on a 1995 HP OCR application and is further behind than CuneiForm OCR.) https://en.m.wikipedia.org/wiki/CuneiForm_(software)

> I played around with the open source Tesseract OCR, but it was way behind commercial desktop packages in accuracy/usability. It lagged especially in dealing with different document layouts.

How long ago did you use it? We use it to OCR photos of paper checks taken via smartphone and pull contact information out. It works pretty well.

It was 4+ years ago. I don't remember exactly when, but the project was still hosted on Google Code at the time.

I was trying to OCR scanned scientific publications with multi-column layout and figures. I didn't expect anything to get all the captions right, or handle specialized notation, but it was important to recognize all the ordinary English words on the page and to have the text flow in the right order.

The alpha version (4.0) is significantly better than the stable version (3.05).

I have a similar need. Are there any OSS projects which do this?

Snap a picture of a fixed-layout form from a smartphone & extract TYPED data

If you're looking for full page OCR, check out ocrmypdf (uses Tesseract).

If you want to extract data out of documents/forms then you need to develop your own solution (I'm doing work in this area) or use expensive packages like ABBYY FlexiCapture.

Images taken from a smartphone (compared to scanned documents) are going to make the problem harder.

> "At some point, we will open source and release this synthetically generated data for others to train and validate their own systems and research on."

Glad to hear this.

This article was posted in April. Did Dropbox follow through and release? It's been 7 months...

Yeah that would be a great read. You want your system to learn what's in real photographs, not how to see through your simulation.

I'd kill for a self-hostable solution like that. I just don't have the resources to spend like Dropbox. Can anyone recommend a poor man's approach to this? I can afford a lot of AWS instances, but not the 8 months of engineering.

Not self-hostable, but Google Vision API OCR was pretty good when I used it last. AWS also has an offering for that, but I haven't personally tested it.

I'm offering a self-hosted solution, mail is in profile.

It's pretty remarkable that they were able to build a system more or less from scratch that "gave us the accuracy, precision, and recall numbers that all met or exceeded the OCR state-of-the-art."

There was a recent post on the Google Developers blog about a similar project to recognize bottle-cap codes in a mobile app:


HN discussion:


Seems like a lot of value is locked away in these training sets. Is there a commercial platform to buy and sell trained sets yet? Answering my own question: yes, https://algorithmia.com/ - any others?

Super cool!

Solving problems like DropTurk with stronger guarantees around turnaround/quality/confidentiality is exactly what we do at Scale. Would love to hear more about DropTurk’s experiences around building this internally and how it compares to our product:


Hadn't really used the document scanner yet...but it sounds like I might have to now. If they keep moving on this it's going to replace my primary use of Evernote...and since Dropbox has a Linux client but Evernote doesn't it'll end up being a pretty easy switch.

Are there any Evernote -> Dropbox Paper migration tools out there?

Was this done by the Endorse team Dropbox acquired a few years back? I know they were big into this same line of work.

Are there any fully-offline alternatives to this? What is the most capable open-source solution?

I'd love to see a system that worked on handwriting. With all the image recognition progress in the last few years, I never see a mention of recognizing handwriting.

There are two options for this, one is online and the other is offline handwriting recognition.

Online is when you have info about the strokes (e.g. writing on a tablet or on an HTML5 canvas); offline is when you don't have anything other than the image.

Online is mostly solved; offline is not working well yet. Deep learning will certainly help accelerate progress.

I'm working on a prototype that extracts (and transcribes) handwriting info from PDF documents (like forms).

Nice to see someone working on it!

The online system is not so useful, I prefer to type when I have the option anyway. It's faster.

The offline system initially doesn't even need to be that accurate. OCRing printed text must be accurate because of the sheer volume being OCRed. Handwritten stuff tends to be of far smaller volume, and so can be hand-corrected more practically.

IIRC, Evernote will index—though not transcribe—handwritten text.

I use two apps on my iPad that do a scarily good job of recognizing my handwriting. They are Nebo and GoodNotes. Neither app was expensive.

I mostly use GoodNotes. I don't convert my handwriting to text, but it is nice that I can search my notes.

Character separation is much harder, bordering on impossible, with handwriting. Also look at all the handwriting where you need to understand the context to e.g. distinguish an 'm' from an 'nn'.

I thought facial recognition would border on the impossible, but that seems to be a solved problem today.

And yes, context is needed to recognize cursive. At the very least, word frequencies combined with a dictionary should get "most likely" recognitions.

For example, my OCR software (for printed text) often recognizes ? as }. Come on.

There are a couple of apps that do this, notably Evernote and GoodNotes. Recently, the built-in Notes app on iOS has added this feature as well, but it's still a bit finicky.

Do you mean recognize text as you draw it, or recognize text in a scanned image?

You can draw notes, and then the text of the notes is available for searching. I think it appears in the list of notes as well, but I'm not sure if there's an easy way to extract it.

I just tried that in the Notes app on my iPhone; it didn't transcribe it, or I didn't do it right.

I hope some better OCR comes along. I use Nuance, and either I have confused it with my training file or it makes some very dumb decisions. It seems like it should scan something that it sees as "wi1l" or "s0mething" and realize that it isn't very likely that someone would put a number in the middle of a word or put random superscript in the middle of a word.
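That kind of sanity check can be approximated with a confusion table and a dictionary: generate candidate respellings by swapping commonly confused glyphs and keep the first one that is a known word. A toy sketch (the confusion pairs and dictionary are illustrative, and lowercase input is assumed):

```python
from itertools import product

# Glyphs an OCR engine commonly confuses (illustrative, not exhaustive).
# Each value starts with the original glyph, so the unmodified spelling
# is always the first candidate tried.
CONFUSIONS = {"1": "1l", "0": "0o", "l": "l1", "o": "o0", "5": "5s"}

def candidates(word):
    """All respellings obtained by swapping confusable glyphs."""
    options = [CONFUSIONS.get(ch, ch) for ch in word]
    return ("".join(chars) for chars in product(*options))

def correct(word, dictionary):
    """Return the first candidate that is a real word, else the input."""
    for cand in candidates(word):
        if cand in dictionary:
            return cand
    return word
```

A production version would rank candidates by word frequency rather than taking the first dictionary hit, but even this crude pass catches the "wi1l"/"s0mething" class of errors.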

Very interesting post. Is there any benchmark data showing how their system compares to the competition?

All this and yet they still can't implement support for a .dropboxignore file

You can just place any such file outside the Dropbox directory. Doesn't seem like a big deal.

When people have a problem, it's very, very frustrating for somebody to show up and disagree, as if that person couldn't figure out that he can save files outside the Dropbox directory. Clearly that's not what he freaking wants, and what isn't a big deal for your usage patterns might be a big deal to somebody else.

StackOverflow is filled with such answers. Which is why many people ask questions defensively, also mentioning the alternatives that they don't want to do. If I were to ask this as a StackOverflow question, I'd do it like ...

How can I make Dropbox ignore certain files from syncing? Is there some sort of .dropboxignore mechanism?

PS: I don't want to save files outside Dropbox's directory, I do want to synchronize my projects on Dropbox, and let's assume for the sake of argument that I know what I'm doing.

(at which point I'd realize that I'm becoming aggressive and that I probably answered my own question, therefore closing the browser window in frustration)

It's very frustrating if you keep your programming projects backed up through Dropbox, and want to back up the code but not gigabytes of node_modules and python virtualenvs. Not that Dropbox has any incentive to make it easier to use less space...
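For what it's worth, the matching half of such a feature is trivial to implement; a hypothetical .dropboxignore matcher (the file name and glob semantics here are my assumptions, not anything Dropbox supports) could be as little as:

```python
import fnmatch

def is_ignored(relpath, patterns):
    """Return True if any path component, or the whole relative path,
    matches an ignore glob (e.g. 'node_modules', '*.pyc')."""
    parts = relpath.split("/")
    return any(
        fnmatch.fnmatch(part, pat) or fnmatch.fnmatch(relpath, pat)
        for pat in patterns
        for part in parts
    )

patterns = ["node_modules", "*.pyc", ".venv"]
print(is_ignored("app/node_modules/left-pad/index.js", patterns))  # True
print(is_ignored("app/src/index.js", patterns))                    # False
```

The hard part for a sync service is presumably the UX around partially-synced directories, not the matching itself.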

Isn't git or mercurial more appropriate for archiving programming projects?

For archival, yes. But it's not rare that I want to swap computer without committing WIP changes.

Wouldn't a branching do the trick?

Check out a WIP branch and push it. You can pull it on any other computer you want and keep working on it until you're comfortable enough to merge back into master/develop.

While that sort of works and I do that, it's a much more cumbersome solution.

I commit WIP code to a separate branch, switch machines, and now if I go back to the original computer, I need to make sure to push, go back, pull again... god forbid I had done interactive rebasing or amended a commit, since it'll be all out of sync... it's just more complicated than simply saving a file and having it reflected. "Temporary commits" should be a rare thing, not mandatory.

I end up using both git and dropbox for some projects and it mostly works, but having projects on dropbox is a mess because of node_modules, having sync running continuously because some tools/projects/IDEs will keep generating temporary files, etc.

Oh I do agree, it's cumbersome and Dropbox should have a .dropboxignore file. I was just suggesting a workaround :)

What you're looking for is an automated backup solution. What you're using is a file syncing service. Those are not necessarily the same things.

Why not use a common git server? Why not use a dedicated backup solution? Many backup apps can store backups to Dropbox if that's what you want.

No, I do want a file-syncing service—so that I can have the same dirty git worktree on both my laptop and my desktop.

Myself, when I'm at home, I just check out the project on my desktop, use SSHFS to mount it on my laptop, edit it on my laptop or desktop as needed, and have an SSH window open from my laptop to my desktop for doing things like running npm (because that'd be way too slow if I ran it from the laptop and depended on SSHFS to sync it.)

If X11 or RDP was snappier, I'd just use my laptop as a thin client, but it's not, so I'm stuck with figuring out multi-master realtime source-code tree sync.

A similar problem obtains when using VMs, but at least there, there isn't the requirement that I have the ability to run a GUI text editor in the VM that sees changes I make outside the VM quickly enough to be unnoticeable.

FYI, I have been using the unison command line tool for this scenario. It is fast enough that I can edit files locally, and then have a file system watcher on my remote machine run tests when files are changed, and it only adds a second or so to the process.

There are options for not syncing some files, for how to resolve conflicts, etc.

I use SSHFS to work from home, and then X11-forwarding to run gitk/git-gui for the occasional git ops that take too long otherwise.

But for personal projects I use syncthing, which supports .stignore, gives you versioning options, and most importantly doesn't randomly revert your files back to a previous version because of clock drift.

here's a shitty solution to your problem: https://github.com/gburca/rofs-filtered

basically, use FUSE to create an alternative view of your filesystem, but that FUSE layer will also exclude certain files matching a regex. Only the FUSE view into your files should be put in the Dropbox folder and synced. This particular FUSE setup is read-only, which was fine for my purposes, but you might need an alternative which allows write access for Dropbox.

Other syncing programs (eg, Syncthing has .stignore) support this feature. It really is a wonderful feature and I'd find it hard to live without too.

If I have something in my dropbox sync folder, and then open it with a viewer or editor that creates other files beside the original in the same directory (eg. temp files, working backups, etc.), then that is a possible use for a .ignore.

Dropturk sounds useful, any plans of open-sourcing it?

Why would a personal backup company invest in OCR?

> To peek at personal data

This statement is obvious. I pay Dropbox specifically so they can hold my data and yes - look at my personal data. And by OCR'ing it they make my data even more understandable and searchable.

How is this a concern when that is the purpose of the company's software?

I understand that you are fine with them looking at your data, but how exactly is it obvious that the primary purpose of a company you are paying to backup your data is to mine that data?

The primary purpose of Dropbox was never "backup". It was always to intelligently store and share data, and to be able to identify information inside of it - like if it's an image, have a gallery; if it's a Word doc, have a lightweight interface for reading and editing, etc. It was also one of the primary ways to get data out of cell phones because of the "sharing" and "syncing" options.

I think you are confusing dropbox for something else?

In fact I think dropbox would be a horrible backup solution! Don't use it for backup or you will be disappointed.

Right: Dropbox isn't a safe-deposit box for your data; it's a checking account for your data.

Dropbox has never been sold as "pure backup" like Backblaze, and isn't priced like it (I'm a paying customer of both). I'm not paying Dropbox to backup my data, I'm paying them to sync my data across machines, make it available to me on devices I don't own, make it easy to share with family and friends, etc. I want Dropbox to OCR my documents -- I don't really want Backblaze to (and Backblaze's margins probably don't have any wiggle room for added features).

Normal people like to find the things which they store. If they're storing images, 99% of the time there won't be much metadata or text alternatives available — it's the same reason why everyone is investing in image classification and face recognition, so you can search for “uncle bob” and find pictures which include him.

I am curious how they will expose the search indexes they are presumably building to the desktop client users. I almost never use the Dropbox website for anything, and primarily use Spotlight to search on my desktop for any file, Dropbox or not. I don't really see how their indexing of my files for search can mix into my workflow.

A more innocent explanation would be "make your scans searchable"

(Not saying you're wrong)

Unless you're encrypting it first (via Cryptomator etc.), they already have lots of your data in easier-to-peek-at formats. You could always do your own OCR, and encrypt the results before storing it with them - though I accept that most won't, or use encryption at all for the files they keep in Dropbox, gDrive, etc.

Dropbox isn't a personal backup company.

They provide storage to lots of individuals, but that isn't really where they are trying to grow their business.


Just like autonomous driving isn't happening because it's not open source?

comma got open-sourced: https://github.com/commaai/openpilot

Only due to the fact that geohot got destroyed legally peddling his untested bullshit basic ML with no responsibility for the real-world side effects the work might have.

Why do they use mechanical turk to train printed text? You could easily build a tool that transforms ASCII into something that looks like photographed printed text, with different levels of quality (using e.g. erosion and dilation, different fonts, etc.)

They were using Mechanical Turk to evaluate existing OCR systems before they started building their own.

If you read further into the article, they did exactly what you suggested to train their own system:

"A corpus with words to use...A collection of fonts for drawing the words...A set of geometric and photometric transformations meant to simulate real world distortions...The generation algorithm simply samples from each of these to create a unique training example."
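The morphological part of those synthetic distortions is easy to sketch. Below is a toy, stdlib-only version operating on binary glyph grids — a real pipeline would use something like OpenCV's cv2.erode/cv2.dilate on rendered font images, but the idea is the same: dilation fattens strokes (simulating ink bleed), erosion thins them (simulating faded print):

```python
def _neighbors(img, y, x):
    """4-neighbourhood values, with out-of-bounds treated as off."""
    h, w = len(img), len(img[0])
    return [img[ny][nx] if 0 <= ny < h and 0 <= nx < w else 0
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1))]

def dilate(img):
    """A pixel turns on if it or any 4-neighbour is on (stroke thickening)."""
    return [[1 if img[y][x] or any(_neighbors(img, y, x)) else 0
             for x in range(len(img[0]))] for y in range(len(img))]

def erode(img):
    """A pixel stays on only if it and all 4-neighbours are on (stroke thinning)."""
    return [[1 if img[y][x] and all(_neighbors(img, y, x)) else 0
             for x in range(len(img[0]))] for y in range(len(img))]

glyph = [[0, 0, 0],
         [0, 1, 0],
         [0, 0, 0]]
print(dilate(glyph))  # single pixel grows into a plus shape
```

The generation step then just samples a random sequence of such transforms (plus rotations, noise, blur, etc.) per rendered word.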

Isn't that training for a different task, though?

They discuss exactly this, beginning with the third paragraph under "Word Deep Net". Did you stop reading before then?

Yes. This is called data augmentation by synthesis. They used this.

Keep reading, you might be pleasantly surprised.
