How Google Translate squeezes deep learning onto a phone (googleresearch.blogspot.com)
403 points by xwintermutex on July 29, 2015 | 92 comments

This is great. I particularly like that they also automatically generated dirty versions for their training set, because that's exactly what I ended up doing for my dissertation project (a computer vision system [1] that automatically referees Scrabble boards). I also used dictionary analysis and the classifier's own confusion matrix to boost its accuracy.

If you're also interested in real time OCR like this, I did a write up [2] of the approach that worked well for my project. It only needed to recognize Scrabble fonts, but it could be extended to more fonts by using more training examples.

[1] http://brm.io/kwyjibo/

[2] http://brm.io/real-time-ocr/

It seems your dissertation paper is behind something password protected [1]. It would be nice to see that too.

Can't get [1]https://www.dcs.shef.ac.uk/intranet/teaching/campus/projects...

Hmm looks like they have, well here's another link to it: https://dl.dropboxusercontent.com/u/1672291/scrabble-referee...

Glad there's prior art on that. I had a small project where I iterated all the fonts on the system and used them to generate glyph training images. The next step was to dirty them up, but I never continued the project.

More generally, I really like the idea of generating controlled synthetic images and then messing them up for regularization.
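A minimal sketch of the "generate clean glyphs, then dirty them up" idea. The 5x5 glyph, the shift range and the 5% noise rate are all made-up values for illustration; a real pipeline would render actual fonts and use richer distortions (blur, rotation, lighting).

```python
import random

# A clean 5x5 "T" glyph: 1 = ink, 0 = background (hypothetical toy data).
CLEAN_T = [
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
]

def shift(glyph, dx):
    """Shift the glyph horizontally by dx pixels, padding with background."""
    width = len(glyph[0])
    out = []
    for row in glyph:
        new_row = [0] * width
        for x, value in enumerate(row):
            if 0 <= x + dx < width:
                new_row[x + dx] = value
        out.append(new_row)
    return out

def add_noise(glyph, flip_prob, rng):
    """Flip each pixel with probability flip_prob (salt-and-pepper noise)."""
    return [[1 - v if rng.random() < flip_prob else v for v in row]
            for row in glyph]

def dirty_variants(glyph, count, seed=0):
    """Generate `count` randomly shifted and noised copies of a clean glyph."""
    rng = random.Random(seed)
    return [add_noise(shift(glyph, rng.choice([-1, 0, 1])), 0.05, rng)
            for _ in range(count)]

samples = dirty_variants(CLEAN_T, 100)
```

Every sample keeps the label "T", so one clean glyph turns into as many labelled training examples as you care to generate.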

Funny, just read an article today proposing the same feature detection algorithm (the one you called 'grid merge'). Have you tried applying these techniques on scanned/photographed documents?

Could you link to it please?

I've not tried it on anything else, but I remember thinking that it has a lot of potential uses. Also I only used it on gray-scale features, but I'm sure it could make use of full RGB too. I'll have to try it some time!

"We also investigated hierarchical features where the image is overlaid with a grid of cell size c × c and pixels withins each cell are added up. This is same as downsampling the image and using the raw pixels in the downsampled image as features." (p. 3)

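The cell-sum feature in that quote can be sketched in a few lines. The function name and the 4x4 example image are assumptions for illustration, not code from the paper or from the Scrabble project.

```python
def grid_features(image, c):
    """Sum the pixels inside each c x c cell; dimensions must be multiples of c."""
    height, width = len(image), len(image[0])
    assert height % c == 0 and width % c == 0
    features = []
    for top in range(0, height, c):
        for left in range(0, width, c):
            features.append(sum(image[y][x]
                                for y in range(top, top + c)
                                for x in range(left, left + c)))
    return features

image = [
    [1, 2, 0, 0],
    [3, 4, 0, 1],
    [0, 0, 5, 5],
    [1, 0, 5, 5],
]
print(grid_features(image, 2))  # [10, 1, 1, 20]
```

Dividing each sum by c * c would give the average instead, which is ordinary box downsampling.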

Sounds similar to one level of a pyramid:


excellent project. as a scrabble player, i'm very interested - it would be a great way to run a blitz tournament, for instance.

I am 15 years into this computers thing and this blog post made me feel like "those guys are doing black magic".

Neural networks and deep learning are truly awesome technologies.

They are, but once you start learning about them, you realize the "black magic" part comes mostly from their mathematical nature and very little from them being "intelligent computers".

A neural net is a graph, in which a subset of nodes are "inputs" (that's where the net gets information), some are outputs, and there are other nodes which are called "hidden neurons".

The nodes are interconnected in a pattern called the "topology" or sometimes the "architecture" of the net. For example, I-H-O is a typical feedforward net, in which I is the input layer, H is the hidden layer and O is the output layer. Every hidden neuron takes the output of every input neuron, and every output neuron takes the output of every hidden neuron. The connections carry "weights", and training adjusts the weights of all the neurons over lots of cases until the desired output is achieved. There are also algorithms and criteria to stop before the net "learns too much" and loses the ability to generalize (this is called overfitting). In particular, a net with one hidden layer and one output layer is a universal function approximator -- that is, given enough hidden neurons it can approximate any continuous function of the form f(x1, x2, x3, ..., xn) = y to arbitrary accuracy.
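To make the I-H-O layout concrete, here is a rough sketch of a forward pass through a 2-2-1 net. The weights are hand-picked (not trained) so that the net computes XOR, a function no net without a hidden layer can represent; the names and values are illustrative assumptions only.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(inputs, weights, biases):
    """One fully connected layer: each neuron sees every input."""
    return [sigmoid(sum(w * x for w, x in zip(neuron_w, inputs)) + b)
            for neuron_w, b in zip(weights, biases)]

# Hand-picked weights: the hidden neurons act roughly as OR and NAND,
# and the output neuron roughly as AND of the two.
HIDDEN_W, HIDDEN_B = [[20.0, 20.0], [-20.0, -20.0]], [-10.0, 30.0]
OUTPUT_W, OUTPUT_B = [[20.0, 20.0]], [-30.0]

def forward(x1, x2):
    hidden = layer([x1, x2], HIDDEN_W, HIDDEN_B)
    return layer(hidden, OUTPUT_W, OUTPUT_B)[0]

for a in (0, 1):
    for b in (0, 1):
        print(a, b, round(forward(a, b)))
```

Training is the process of finding weights like these automatically instead of choosing them by hand.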

Deep learning means you're using a feedforward net with lots of hidden layers (I think it's usually between 5 and 15 now), which apply convolution operators (hence the "convolutional" in the name), and lots of neurons (on the order of thousands). All this was nearly impossible until GPGPUs came along, because of the time it took to train even a modest network (minutes to hours for a net with between 50 and 150 neurons in one hidden layer).

This is a very shortened explanation -- if you want to read more I recommend this link[1], which gives some simple Python code to illustrate and implement the innards of a basic neural network so you can learn from the inside. Once you get that you should move to more mature implementations, like Theano or Torch, to get the full potential of neural nets without worrying about implementation.

[1] http://iamtrask.github.io/2015/07/12/basic-python-network/

>>They are, but once you start learning about them, you realize the "black magic" part comes mostly from their mathematical nature and very little from them being "intelligent computers".

Oh humbug! The black magic comes from the vast resources Google drew on to obtain perfect training datasets. Each step in the process took years to tune, demonstrating that data is indeed for those who don't have enough priors.

You could say very much the same about the brain...

> [...] the "black magic" part comes mostly from their mathematical nature and very little from them being "intelligent computers". A brain is a graph, in which a subset of neurons are "inputs", some are outputs, and others are "hidden". The nodes are interconnected between each other in a fashion, which is called the "topology" or sometimes "architecture" of the net.

The deep question about deep learning is "Why is it so bloody effective?"

I work in the field, and while some models are based on biological structures/systems, there's a lot of fuss about them being "based on biological foundations" that is now best avoided. Yes, it is true the model is based on them, but it's a model that only covers very little of the real complexity. So in a sense, it's naive to say "put a billion neurons in there and you'll get a rat brain" (as was publicized one time).

The effectiveness comes from their non-linear nature and their ability to "learn" (store knowledge in the weights, that is derived from the training process). And black magic, of course!

If there is magic to be found, it may be in that question. What is it about graphs (namely the subset that are deep neural networks) that allows them not only to contain such powerful heuristics, but also to be created from scratch with barely any knowledge of the problem domain?

As a side note, I was playing a board game last night (Terra Mystica I believe) and wondering if you could get 5 different neural networks to play the game and then train them against each other (and once they are good enough, against players). I wonder how quickly one could train a network that is unbeatable by humans? Maybe even scale it up to training it to play multiple board games until it is really good at all of them before setting it loose on a brand new one (of a similar genre). Maybe Google could use this to make a Go bot.

But what happens if this is used for evil instead? Say a neural network that reads a person's body language and determines how easily they can be intimidated by either a criminal or the government. Or one that is used to hunt down political dissidents. Imagine the first warrant to be signed by a judge for no reason other than a neural network saying the target is probably committing a crime...

The best Go bot approach (as of some years ago, but it's not like neural networks are a new idea) uses a very different strategy. Specifically, the strategy of "identify a few possible moves, simulate the game for several steps after each move using a very stupid move-making heuristic instead of using this actual strategy recursively, and then pick the move that yielded the best simulated board state".
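A rough sketch of that "try each move, play it out with a dumb heuristic, keep the best" loop. To keep it tiny it uses a Nim-like game instead of Go (take 1 to 3 stones, whoever takes the last stone wins), and the "very stupid heuristic" is uniformly random play. The game, function names and playout count are all assumptions for illustration.

```python
import random

def legal_moves(stones):
    return list(range(1, min(3, stones) + 1))

def random_playout(stones, my_turn, rng):
    """Play random moves to the end; return True if we take the last stone."""
    while True:
        stones -= rng.choice(legal_moves(stones))
        if stones == 0:
            return my_turn  # whoever just moved took the last stone
        my_turn = not my_turn

def best_move(stones, playouts=200, seed=0):
    """Score each legal move by the win rate of random playouts after it."""
    rng = random.Random(seed)
    scores = {}
    for move in legal_moves(stones):
        remaining = stones - move
        if remaining == 0:
            scores[move] = 1.0  # taking the last stone wins outright
            continue
        wins = sum(random_playout(remaining, False, rng)
                   for _ in range(playouts))
        scores[move] = wins / playouts
    return max(scores, key=scores.get), scores

print(best_move(3)[0])  # 3
```

Full Monte Carlo Tree Search goes further by growing a tree and spending more playouts on promising branches, but the playout-as-evaluation step is the same.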

Monte Carlo Tree Search (random playouts) is currently the best computer strategy for evaluating a Go position.

This is likely due to the way Go works: random playouts provide a rough estimate of who controls what territory (this is how Go is scored).

Recently two deep-learning papers showed very impressive results.



The neural networks were tasked with predicting what move an expert would make given a position.

MCTS takes a long time (100,000 playouts are typical); once trained, the neural nets are orders of magnitude faster.

The neural nets output a probability for each move (that an expert would make that move); all positions are evaluated in a single forward pass.

Current work centers around combining the two approaches: MCTS evaluates the best suggestions from the neural net.
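One simple way to picture the combination: let the network's move probabilities prune the candidate list, then spend the expensive search only on the survivors. This is a sketch under made-up numbers, not how any particular Go program does it; the move names, priors and win rates below are all hypothetical.

```python
def shortlist(move_probs, k):
    """Keep the k moves the policy network rates most likely."""
    ranked = sorted(move_probs, key=move_probs.get, reverse=True)
    return ranked[:k]

def choose_move(move_probs, evaluate, k=3):
    """Run the expensive evaluator only on the shortlisted moves."""
    candidates = shortlist(move_probs, k)
    return max(candidates, key=evaluate)

# Hypothetical numbers: network priors and a stand-in for playout win rates.
priors = {"D4": 0.40, "Q16": 0.30, "C3": 0.20, "A1": 0.05, "T19": 0.05}
playout_win_rate = {"D4": 0.48, "Q16": 0.55, "C3": 0.51, "A1": 0.90, "T19": 0.10}

print(choose_move(priors, playout_win_rate.get))  # Q16
```

Note the trade-off: "A1" has the best playout score here but is never examined, because the network gave it a low prior.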

Expert Human players are still unbeatable by computer Go.

For Chess see David Silver's work on TreeStrap

It learns to master level from self-play.


also his lecture bootstrapping from tree based search


and Silver's overview on board game learning


The "use a stupid heuristic as part of the evaluation function" idea is, in fact, also an important part of how chess AIs work (as quiescence search), though for different reasons.

> Maybe Google could use this to make a Go bot.

There was in fact a group within Google that worked on this: http://www.cs.toronto.edu/~cmaddis/pubs/deepgo.pdf

and the follow-up from Google's DeepMind group:

Move Evaluation in Go Using Deep Convolutional Neural Networks Chris J. Maddison, Aja Huang, Ilya Sutskever, David Silver


Before clicking I was assuming it would fail. Then read this in the summary: "When the trained convolutional network was used directly to play games of Go, without any search, it beat the traditional search program GnuGo in 97% of games, and matched the performance of a state-of-the-art Monte-Carlo tree search that simulates a million positions per move."

They are effective because:

- They use more parameters (and fewer computations per parameter.)

- They are hierarchical (convolutions are apparently useful at different levels of abstraction of data).

- They are distributed (word2vec, thought-vectors). Not restricted to a small set of artificial classes such as parts-of-speech or parts of visual objects.

- They are recurrent (RNN).


word2vec isn't "deep" in the relevant sense. Both the skip-gram and CBOW forms have a single hidden layer.

It's not really that deep, imo: a typical deep net these days has O(10^8) parameters (e.g. http://stackoverflow.com/questions/28232235/how-to-calculate...). You can store a hell of a lot of patterns in that many parameters, making them the best pattern matchers the world has ever seen. (Un)fortunately, pattern matching != intelligence. More interesting deep questions for which there is precious little theory revolve around the design of the networks themselves.

Is "pattern matching != intelligence" what occurred when the Google image recognition stuff in the news recently was shown to recognize the pattern of a "dumbbell" as always having a large muscular arm attached to it?

Seemed like a great way to highlight the limitations of patterns.

I hadn't heard about that but it sounds like what I'm talking about. With their ever expanding training corpus Google's net will eventually learn that dumbbells and arms are separate entities, but it will never deduce that on its own. And if it did it would not be able to generalize that to the fact that wedding rings and fingers are different (I hypothesize). Basically there is a whole other component of "intelligence" that feels absent from neural nets, which is why visions of AI lording over humanity don't exactly keep me up at night. (Autonomous weapons otoh...)

> Deep learning means ... which apply convolution operators

Convolutional networks are only one kind of deep learning. In particular, they generally apply only to image processing.

They are doing matrix multiplications. Passing an input a single time through even a very large neural network is a relatively fast operation (compared to training such a network, that is). Training requires data centers and arrays of GPUs. For passing the input through the network, you can usually get away with a single core and vectorized operations. Unless you are doing high resolution computer vision in real time... You can still get away with a single core even then, but that requires some very smart sublinear processing.

Completely right. Applying a neural network is much faster than training one. The main trick here is fitting the trained model into cache (or smaller) so that the matrix multiplies are fast.
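For anyone wondering what "fitting into cache" means for a matrix multiply, the usual trick is blocking (tiling): work on small sub-blocks so the data being reused stays in fast memory. Below is a sketch of the loop structure only. Pure Python gets no actual cache benefit from this, the block size is arbitrary, and a production implementation would be hand-tuned native code with SIMD.

```python
def matmul_blocked(a, b, block=2):
    """Multiply a (n x m) by b (m x p), working on block x block tiles."""
    n, m, p = len(a), len(b), len(b[0])
    out = [[0.0] * p for _ in range(n)]
    for i0 in range(0, n, block):
        for k0 in range(0, m, block):
            for j0 in range(0, p, block):
                # The three inner loops only touch one small tile each of
                # a, b and out, which is what lets them stay cache-resident.
                for i in range(i0, min(i0 + block, n)):
                    for k in range(k0, min(k0 + block, m)):
                        a_ik = a[i][k]
                        for j in range(j0, min(j0 + block, p)):
                            out[i][j] += a_ik * b[k][j]
    return out

a = [[1, 2, 3], [4, 5, 6]]
b = [[7, 8], [9, 10], [11, 12]]
print(matmul_blocked(a, b))  # [[58.0, 64.0], [139.0, 154.0]]
```

The result is identical to the naive triple loop; only the order in which the multiply-adds happen changes.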

> this blog post made me feel like "those guys are doing black magic".

Two remarks. First, these guys probably don't know very well why what they are doing works so well ;) It requires a lot of trial and error, and a lot of patience and a lot of compute power (the latter being the reason why we are seeing breakthroughs only now).

Second, training a neural net requires different computing power from deploying the net. The neural network that is installed on your phone has been trained using a lot of time and/or a very large cluster. Your phone is merely "running" the network, and this requires much less compute power.

Of course they are

they are awesome, but not that difficult to implement

The most awesome and surprising thing about this is that the whole thing runs locally on your smartphone! You don't need a network connection. All dictionaries, grammar processing, image processing, DNN - the whole stack runs on the phone. I used this on my trip to Moscow and it was truly a godsend because it didn't need expensive international data plans (assuming you have connectivity!). English usage is fairly rare in Russia and it was just fun to learn Russian this way by pointing at interesting things.

I used this in Brazil this last March to read menus. It works extremely well. The mistranslations make it even more fun. Much faster than learning Portuguese!

I took a few screen shots. Aligning the phone, focus, light, shadows on the small menu font was difficult. You must keep steady. Sadly, I ended up hitting the volume control on this best example. Tasty cockroaches! Ha! http://imgur.com/j9iRaY0

I had some Brazilian roommates who didn't speak English (and I don't speak Portuguese). We used a combination of my poor Spanish and Google Translate on my phone to communicate.

It worked OK (much better than nothing). However, there were a number of times when very large issues in the translations created some pretty big misunderstandings. Luckily we had a friend who was fluent in English and Portuguese and would translate when things got too confused.

To reduce errors, you do need to be really careful to use short, complete sentences with simple and correct grammar. It's also better to use sentences that contain words that aren't ambiguous. (Those two sentences would probably not translate well.)

e.g. Please write simple words, short phrases and simple phrases. Please write words with just one meaning. Those phrases and words are easier to translate.

> Please write words with just one meaning.

Those words are very rare and tend to only be useful in very technical contexts.

Fair enough. The idea it was intended to express is 'unambiguous'. I tend to try to avoid more obscure words when writing text for automatic translation, often at the expense of explicit accuracy.


It seems it can't really handle context, so 'cockroaches' may have been a mistranslation of 'cheap' in some contexts, as the 'it had stopped chestnut' may have simply been 'brazil nuts'

Most probably the OCR read "batata" (potato) as "barata" (cockroach).

Yeah, this is very likely as well, especially if the t was printed incorrectly

Word Lens is impressive. It came from a small startup. Google didn't develop it; it was a product before Google bought it. I saw an early version being shown around TechShop years ago, before Google Glass, even. It was quite fast even then, translating signs and keeping the translation positioned over the sign as the phone was moved in real time. But the initial version was English/Spanish only.

I see no mention of it, but I'd be surprised if they didn't use some form of knowledge distilling [1] (which Hinton came up with, so really no excuse), to condense a large neural network into a much smaller one.

[1] http://arxiv.org/abs/1503.02531
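For the curious, the core of distillation as described in that paper is training the small net to match the big net's temperature-softened output probabilities. Here is a sketch of just the loss computation; the logits and the temperature of 4 are made-up example values, and whether Google Translate uses anything like this is pure speculation.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax with a temperature; higher temperature gives a softer output."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target, predicted):
    return -sum(t * math.log(p) for t, p in zip(target, predicted))

teacher_logits = [6.0, 2.0, 1.0]  # hypothetical output of the large net
student_logits = [4.0, 2.5, 0.5]  # hypothetical output of the small net

T = 4.0
soft_targets = softmax(teacher_logits, T)
distill_loss = cross_entropy(soft_targets, softmax(student_logits, T))
```

At T = 1 the teacher's output is almost one-hot; raising T exposes how it ranks the wrong classes, which is the extra signal the student learns from.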

WordLens/Google Translate is the most futuristic thing that my phone is able to do. It's especially useful in countries that don't use the Latin alphabet.

"Squeezes" is very relative. These phones are equal to or larger than most desktops 10-15 years ago, back when I was doing AI research with evolutionary computing and genetic algorithms. We did some pretty mean stuff on those machines, and now we have them in our pockets.

The main issue here is probably not squeezing memory but squeezing performance. Even using regular SIMD is not good enough if your network is medium sized. They apply linear quantization, lookups and special SIMD operations to make it speedy.

See here for what they did for offline speech recognition: http://static.googleusercontent.com/media/research.google.co...
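As a rough picture of what linear quantization means: map each float weight onto a small integer range with a scale and an offset, then do the heavy arithmetic on the integers. The sketch below is one generic 8-bit scheme with made-up weights; the exact scheme used in the paper or in Translate may well differ.

```python
def quantize(values, bits=8):
    """Map floats linearly onto integers in [0, 2**bits - 1]."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo

def dequantize(codes, scale, lo):
    """Recover approximate floats from the integer codes."""
    return [lo + c * scale for c in codes]

weights = [-0.50, -0.10, 0.00, 0.25, 0.75]  # hypothetical layer weights
codes, scale, lo = quantize(weights)
restored = dequantize(codes, scale, lo)
```

Each weight now takes one byte instead of four, and the round-trip error is at most half a quantization step.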

They did this even more impressively when squeezing their speech recognition engine to mobile devices.


A possibly relevant research paper that they didn't mention: "Distilling the Knowledge in a Neural Network" http://arxiv.org/abs/1503.02531

International travel now has a new source of entertainment: On-the-spot generation of humorous mistranslations.

The oddest result I ever got from WordLens was when using it to translate a page of poetry on a plaque. The output was wonderful :)

WordLens was awesome for translating fragments of foreign languages - stuff like signs, menus and so on. But its offline translation seemed to be little more than a word->word translation, so there is a huge scope for improvement there. Very difficult when working offline!

Reddit is going to have a field day

Chinese restaurants did it first.

Just capture the screenshot and you have a meme generator as well!

What are the advantages of using a neural network over generating classification trees or using other machine learning methods? I'm not too familiar with how neural nets work, but it seems like they require more creator input than other methods, which could be good or bad I suppose.

Neural networks, and the plain old trusted logistic regression :) handle raw, continuous data better than the other learning algorithms. For example, if your inputs are images or audio recordings, it's really hard to do classification with decision trees or random forests as you'd need to construct the features manually. What would be a feature: color densities, color histograms, edges, corners, Haar-like, etc.? The promise of multilayer neural networks is that given a lot of data, the right network structure, an appropriate learning strategy, and a huge farm of GPUs, the network can automatically learn the right features from raw data in the first layers, and utilize those features in later layers. The big advantage of this approach is that you abstract away the domain problems (hopefully), and focus on picking the right network design, the right learning strategy, collecting a good data set etc. Neural network training is also easy to parallelize, so Google and the like can leverage their huge infrastructures.

Now if the features in the domain problem are better defined, like credit ratings, and data is sparse, and domain expertise is available, decision trees are perfectly valid options.

> For example, if your inputs are images or audio recordings,

Just wanted to add: and word/character/phrase embeddings.

The article mentions algorithmically generating the training set. See here for some earlier research in this area: http://bheisele.com/heisele_research.html#3D_models

Here's a short video about Google Translate just released.


This technology has been around since 2010 and was developed by Word Lens, which was acquired by Google in 2014:


For those unfamiliar with Google's deep learning, this talk covers their recent efforts pretty well https://youtu.be/kO-Iw9xlxy4 (not technical)

Would be great to see a more in depth article about this, and maybe even some open source code?

I would be interested in this as well. So far I found a similar app called Mitzuli which is based on open source tools:


Doesn't this article seem to say that the size of the training set is related to the size of the resulting network? It should be proportional to the number of nodes/layers that the network is configured for, not proportional to the number of training instances. Am I missing something?

The network is sized to be able to learn the training data reasonably well (e.g. via hyper-parameter optimization). If there is too much variation in data that is not seen in the real application (like rotation of letters mentioned in the article), an appropriately sized network will still learn it, but would be an overkill for the application at hand.
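To put numbers on the question: for a plain fully connected net, the size of the model follows directly from the layer sizes, and the training set does not appear anywhere in the formula. A quick sketch, where the 784-100-10 layout is just an example and not the network from the article:

```python
def count_parameters(layer_sizes):
    """Weights plus biases of a fully connected net with the given layer sizes."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

print(count_parameters([784, 100, 10]))  # 79510
```

The training set only matters indirectly: more varied data usually needs more capacity to fit, so people pick bigger layer sizes, which is the link the article is making.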

I generated training sets for an OCR project in JavaScript [1] a while ago using a modified version of a captcha generator [2] (practically the same technique mentioned in this article).

[1] https://github.com/mateogianolio/mlp-character-recognition

[2] https://github.com/mateogianolio/mlp-character-recognition/b...

I wonder if they use some kind of (neural) language model for their translations. Using only a dictionary (as in the text) would be about 60 years behind the state of the art...

Anyone want to do a $1 bet on an over/under for how long until word lens can handle Chinese?

There is an app called Waygo that's already capable of handling Chinese, so I guess it's not too far.

Why do they need a deep learning model for this? They are obviously targeting signs, product names, menus and similar. The model will obviously fail at translating large texts.

Was there any advantage of using a deep learning model instead of something more computationally simple?

I don't get it. They say they use a dictionary, and they say it works without an Internet connection. How can both things be true? I'm pretty sure there's not, say, a Quechua dictionary on my phone.

It doesn't come with all the languages, you have to download them.

I think it's Android only that you can download the language packs, FYI. The language packs + offline maps are super helpful when travelling abroad.

You can download the language packs on Android and iOS, and each one is about 4 MB in size.

Are you sure?

On my desktop, the English dictionary is ~1 megabyte uncompressed, and compresses to ~250k with gzip. The download for Google Translate is somewhere around 30 megabytes.

You have to download them beforehand, and offline translating is limited to just a few languages.

Given the reliability of closed captions on YouTube and the frequency of errors in plaintext Google translate, I wouldn't be surprised if this service fails often, and often when you need it most.

WordLens was an awesome app and it's good to see that Google is continuing the development.

The new fad for using the 'deep' learning buzzword annoys me though. It seems so meaningless. What makes one kind of neural net 'deep', and are all the other ones suddenly 'shallow'?

> What makes one kind of neural net 'deep' and are all the other ones suddenly 'shallow' ?

If this is a serious question, then googling "what is a deep neural network" would take you to any number of explanations. But to summarize very briefly, it's not a buzzword; it's a technical term referring to a network with multiple nonlinear layers that are chained together in sequence. Deep networks have been talked about for as long as neural networks have been a research subject, but it's only in the last few years that the mathematical techniques and computational power have been available to do really interesting things with them.

The "fad" (as you call it) is not mainly because the word "deep" sounds cool, but because companies like Google have been seeing breakthrough results that are being used in production as we speak. For example:




I honestly didn't realise that it had any definition - I see now that calling it a 'fad' is unfair. However, the boundary between deep learning and (representational) machine learning still seems murky.

Considering the very significant accuracy gains deep learning has achieved over previous approaches (and across a number of fields), it's certainly not a simple fad. Having worked in computer vision for a good 8+ years, deep learning is basically amazing.

Deep learning is a form of representation/feature learning.

Machine learning proper encompasses a swath of applied statistical techniques, of which deep learning is only one. Machine learning could refer to linear regression, SVM, hidden Markov models, dimensionality reduction, neural nets, or any number of other loosely related methods. Intro ML classes often don't even get to deep learning because there's so much more fundamental stuff to cover.

So was Word Lens doing this before Google even bought them? Because Word Lens worked fine, locally on a phone, long before Google was doing its whole deep learning thing.

It's not entirely clear to me, but this sentence from the article:

In the end, we were able to get our networks to give us significantly better results while running about as fast as our old system—great for translating what you see around you on the fly.

Suggests that they previously were not using neural networks, or were using less powerful ones.

> What makes one kind of neural net 'deep' and are all the other ones suddenly 'shallow'

Number of layers

It's that simple

To expand on this some more: for a long time, thanks to Cybenko's theorem[1], people just used 1 hidden layer in their neural networks (also because computing was sloowww..). So, your typical NN architecture was input_layer --> hidden_layer --> output_layer.

Eventually, people realized that you could improve performance by adding more hidden layers. So while theoretically Cybenko was correct, practically stacking a bunch of hidden layers made more sense. These network architectures with stacks of hidden layers were then labelled as "deep" neural networks.

[1] https://en.wikipedia.org/wiki/Universal_approximation_theore...

It is that simple, but the more complex story is that when the number of hidden layers exceeds 2, training becomes difficult. Also, convnets for example cheat by having the connections between layers be incomplete bipartite graphs (not every node is connected to every other node), usually chosen because of some physical property, e.g. nearest neighbors for computer vision.
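A tiny 1-D illustration of that "incomplete bipartite graph": each output only looks at a few neighboring inputs, and all outputs share the same few weights. The signal and kernel values are arbitrary examples. As in most convnet code, this is strictly a sliding dot product (cross-correlation), with no kernel flip.

```python
def conv1d(signal, kernel):
    """Slide the kernel over the signal ('valid' positions only)."""
    k = len(kernel)
    return [sum(kernel[j] * signal[i + j] for j in range(k))
            for i in range(len(signal) - k + 1)]

signal = [1, 2, 4, 7, 11]
kernel = [-1, 0, 1]            # three shared weights
print(conv1d(signal, kernel))  # [3, 5, 7]

# A fully connected layer with the same input and output sizes would need
# one weight per (input, output) pair instead of three shared ones.
dense_weights = len(signal) * (len(signal) - len(kernel) + 1)
print(dense_weights, len(kernel))  # 15 3
```

Fewer weights means less to store, less to compute, and less to overfit, which is a big part of why convnets train well on images.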

Use another deep learning network to supervise training of your DLN. You can also use it to supervise itself. It is a simple idea invented about a decade ago (at least I heard about it a decade ago here, in Ukraine).

Well, if all it cares about is looks...


Really? You're posting that in a thread about Google Translate? Come on.

Just waiting for the paper to come out that'll detail all the transformations that were done on the training data specifically for the phone and how they arrived at the decision to use them.

> To achieve real-time, we also heavily optimized and hand-tuned the math operations. That meant using the mobile processor’s SIMD instructions and tuning things like matrix multiplies to fit processing into all levels of cache memory.

Let's see how this turns out. I'm still wondering whether other apps might crash because of this.

Not fitting into cache just means it will run slower. Why would it crash?

Other apps getting slow is also not a very good thing!
