
How Google Translate squeezes deep learning onto a phone - xwintermutex
http://googleresearch.blogspot.com/2015/07/how-google-translate-squeezes-deep.html
======
liabru
This is great. I particularly like that they also automatically generated
dirty versions for their training set, because that's exactly what I ended up
doing for my dissertation project (a computer vision system [1] that
automatically referees Scrabble boards). I also used dictionary analysis and
the classifier's own confusion matrix to boost its accuracy.

If you're also interested in real-time OCR like this, I did a write-up [2] of
the approach that worked well for my project. It only needed to recognize
Scrabble fonts, but it could be extended to more fonts by using more training
examples.

[1] [http://brm.io/kwyjibo/](http://brm.io/kwyjibo/)

[2] [http://brm.io/real-time-ocr/](http://brm.io/real-time-ocr/)

~~~
joe_the_user
It seems your dissertation is behind something password-protected [1].
It would be nice to see that too.

Can't access it:

[1] [https://www.dcs.shef.ac.uk/intranet/teaching/campus/projects...](https://www.dcs.shef.ac.uk/intranet/teaching/campus/projects/archive/l31011/pdf/LBrummitt_aca08lb_com3021.pdf)

~~~
liabru
Hmm, looks like they have. Here's another link to it:
[https://dl.dropboxusercontent.com/u/1672291/scrabble-referee...](https://dl.dropboxusercontent.com/u/1672291/scrabble-referee.pdf)

------
motoboi
I am 15 years into this computers thing and this blog post made me feel like
"those guys are doing black magic".

Neural networks and deep learning are truly awesome technologies.

~~~
dr_zoidberg
They are, but once you start learning about them, you realize the "black
magic" part comes mostly from their mathematical nature and very little from
them being "intelligent computers".

A neural net is a graph, in which a subset of nodes are "inputs" (that's where
the net gets information), some are outputs, and there are other nodes which
are called "hidden neurons".

The nodes are interconnected in a particular pattern, which is called the
"topology" or sometimes the "architecture" of the net. For example, I-H-O is a
typical feed-forward net, in which I is the input layer, H is the
hidden layer, and O the output layer. Every hidden neuron connects to the
outputs of all the input neurons, and every output neuron connects to the
outputs of all the hidden neurons. The connections carry "weights", and training
adjusts the weights of all the neurons over lots of cases until the desired
output is achieved. There are also algorithms and criteria to stop before the
net "learns too much" and loses the ability to generalize (this is called
overfitting). In particular, a net with one hidden layer and one output layer
is a universal function approximator -- that is, it can approximate any
continuous function of the form f(x1, x2, x3, ..., xn) = y.

Deep learning means you're using a feed-forward net with lots of hidden layers
(I think it's usually between 5 and 15 now), which often apply convolution
operators (hence the "convolutional" in the name), and lots of neurons (on the
order of thousands). All this was nearly impossible until GPGPUs came along,
because of the time it took to train even a modest network (minutes to hours
for a net with between 50 and 150 neurons in one hidden layer).

This is a very shortened explanation -- if you want to read more, I recommend
this link [1], which gives some simple Python code to illustrate and implement
the innards of a basic neural network so you can learn it from the inside. Once
you get that, you should move to more mature implementations, like Theano or
Torch, to get the full potential of neural nets without worrying about
implementation.

[1] [http://iamtrask.github.io/2015/07/12/basic-python-network/](http://iamtrask.github.io/2015/07/12/basic-python-network/)
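To make the I-H-O picture concrete, here's a minimal sketch in the spirit of the code at [1] (not the tutorial's exact code; the layer sizes, seed, and iteration count are illustrative choices): a feed-forward net with one hidden layer learning XOR via backpropagation.

```python
import numpy as np

# XOR training data; the third input column is a constant bias term.
X = np.array([[0, 0, 1],
              [0, 1, 1],
              [1, 0, 1],
              [1, 1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

np.random.seed(1)
W1 = 2 * np.random.random((3, 4)) - 1   # input -> hidden weights
W2 = 2 * np.random.random((4, 1)) - 1   # hidden -> output weights

for _ in range(60000):
    hidden = sigmoid(X @ W1)            # hidden layer activations
    output = sigmoid(hidden @ W2)       # output layer activations
    # Backpropagation: push the output error back through the weights.
    output_delta = (y - output) * output * (1 - output)
    hidden_delta = (output_delta @ W2.T) * hidden * (1 - hidden)
    W2 += hidden.T @ output_delta
    W1 += X.T @ hidden_delta

print(np.round(output).ravel())         # rounded predictions for the 4 cases
```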

~~~
beambot
You could say very much the same about the brain...

> [...] the "black magic" part comes mostly from their mathematical nature and
> very little from them being "intelligent computers". A brain is a graph, in
> which a subset of neurons are "inputs", some are outputs, and others are
> "hidden". The nodes are interconnected between each other in a fashion,
> which is called the "topology" or sometimes "architecture" of the net.

The deep question about deep learning is "Why is it so bloody effective?"

~~~
Lawtonfogle
If there is magic to be found, it may be in that question. What is it about
graphs (namely the subset that are deep neural networks) that allows them not
only to contain such powerful heuristics, but also to be created from scratch
with barely any knowledge of the problem domain?

As a side note, I was playing a board game last night (Terra Mystica, I
believe) and wondering if you could get 5 different neural networks to play
the game and then train them against each other (and once they are good
enough, against players). I wonder how quickly one could train a network that
is unbeatable by humans? Maybe even scale it up by training it to play
multiple board games until it is really good at all of them before setting it
loose on a brand new one (of a similar genre). Maybe Google could use this to
make a Go bot.

But what happens if this is used for evil instead? Say a neural network that
reads a person's body language and determines how easily they can be
intimidated by either a criminal or the government. Or one that is used to
hunt down political dissidents. Imagine the first warrant to be signed by a
judge for no reason other than a neural network saying the target is probably
committing a crime...

~~~
thaumasiotes
The best Go bot approach (as of some years ago, but it's not like neural
networks are a new idea) uses a very different strategy: identify a few
possible moves, simulate the game for several steps after each move _using a
very stupid move-making heuristic instead of using this actual strategy
recursively_, and then pick the move that yielded the best simulated board
state.
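The random-playout idea can be sketched with a toy game (this is flat Monte Carlo rather than the full tree-based variant; Nim and the playout count are my own illustrative choices): from each candidate move, finish the game many times with uniformly random moves and keep the move with the best average outcome.

```python
import random

random.seed(0)

def random_playout(stones, my_turn):
    """Finish a game of Nim (take 1-3 stones; taking the last stone
    wins) with uniformly random moves; return True if 'I' win."""
    while True:
        take = random.randint(1, min(3, stones))
        stones -= take
        if stones == 0:
            return my_turn          # whoever just moved took the last stone
        my_turn = not my_turn

def best_move(stones, playouts=3000):
    """Pick the move whose random playouts win most often."""
    win_rate = {}
    for take in range(1, min(3, stones) + 1):
        if stones - take == 0:
            win_rate[take] = 1.0    # immediate win
            continue
        wins = sum(random_playout(stones - take, my_turn=False)
                   for _ in range(playouts))
        win_rate[take] = wins / playouts
    return max(win_rate, key=win_rate.get)

# With 5 stones, taking 1 (leaving a multiple of 4) is the optimal move,
# and the dumb random playouts are enough to discover it.
print(best_move(5))
```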

~~~
deepnet
Monte Carlo Tree Search (random playout) is currently the best computer
strategy for evaluating a Go position.

This is likely due to the way Go works: random playout provides a rough
estimate of who controls what territory (which is how Go is scored).

Recently two deep-learning papers showed very impressive results.

[http://arxiv.org/abs/1412.3409](http://arxiv.org/abs/1412.3409)

[http://arxiv.org/abs/1412.6564](http://arxiv.org/abs/1412.6564)

The neural networks were tasked with predicting what move an expert would make
given a position.

MCTS takes a long time (100,000 playouts are typical); once trained, the
neural nets are orders of magnitude faster.

The neural nets output a probability for each move (that an expert would make
that move), and all positions are evaluated in a single forward pass.

Current work centers around combining the two approaches, MCTS evaluates the
best suggestions from the neural net.

Expert human players are still unbeatable by computer Go programs.

~~~
deepnet
For chess, see David Silver's work on TreeStrap.

It learns to master level from self-play.

[http://www0.cs.ucl.ac.uk/staff/D.Silver/web/Applications_fil...](http://www0.cs.ucl.ac.uk/staff/D.Silver/web/Applications_files/bootstrapping.pdf)

There's also his lecture on bootstrapping from tree-based search:

[http://www.cse.unsw.edu.au/~cs9414/15s1/lect/1page/TreeStrap...](http://www.cse.unsw.edu.au/~cs9414/15s1/lect/1page/TreeStrap.pdf)

and Silver's overview of board game learning:

[http://www0.cs.ucl.ac.uk/staff/D.Silver/web/Teaching_files/g...](http://www0.cs.ucl.ac.uk/staff/D.Silver/web/Teaching_files/games.pdf)

------
sytelus
The most awesome and surprising thing about this is that the whole thing runs
_locally_ on your smartphone! You don't need a network connection. All the
dictionaries, grammar processing, image processing, DNN -- the whole stack runs
on the phone. I used this on my trip to Moscow and it was truly a godsend,
because it didn't need expensive international data plans. English usage is
fairly rare in Russia, and it was just fun to learn Russian this way by
pointing at interesting things.

------
eosrei
I used this in Brazil last March to read menus. It works extremely well.
The mistranslations make it even more fun. Much faster than learning
Portuguese!

I took a few screenshots. Aligning the phone with the focus, light, and
shadows on the small menu font was difficult. You must keep it steady. Sadly,
I ended up hitting the volume control on this best example. Tasty cockroaches! Ha!
[http://imgur.com/j9iRaY0](http://imgur.com/j9iRaY0)

~~~
shkkmo
I had some Brazilian roommates who didn't speak English (and I don't speak
Portuguese). We used a combination of my poor Spanish and Google Translate on
my phone to communicate.

It worked OK (much better than nothing). However, there were a number of times
when very large issues in the translations created some pretty big
misunderstandings. Luckily, we had a friend fluent in English and Portuguese
who would translate when things got too confused.

To reduce errors, you do need to be really careful to use short, complete
sentences with simple and correct grammar. It's also better to use sentences
that contain words that aren't ambiguous. (Those two sentences would probably
not translate well.)

e.g. Please write simple words, short phrases and simple phrases. Please write
words with just one meaning. Those phrases and words are easier to translate.

~~~
thaumasiotes
> Please write words with just one meaning.

Those words are very rare and tend to only be useful in very technical
contexts.

~~~
shkkmo
Fair enough. The idea I intended to express is 'unambiguous'. I tend to
avoid more obscure words when writing text for automatic translation,
often at the expense of explicit accuracy.

------
Animats
Word Lens is impressive. It came from a small startup. Google didn't develop
it; it was a product before Google bought it. I saw an early version being
shown around TechShop years ago, before Google Glass, even. It was quite fast
even then, translating signs and keeping the translation positioned over the
sign as the phone was moved in real time. But the initial version was
English/Spanish only.

------
murbard2
I see no mention of it, but I'd be surprised if they didn't use some form of
knowledge distillation [1] (which Hinton came up with, so really no excuse) to
condense a large neural network into a much smaller one.

[1] [http://arxiv.org/abs/1503.02531](http://arxiv.org/abs/1503.02531)

------
josu
WordLens/Google Translate is the most futuristic thing that my phone is able
to do. It's especially useful in countries that don't use the Latin alphabet.

------
api
"Squeezes" is very relative. These phones are equal to or larger than most
desktops 10-15 years ago, back when I was doing AI research with evolutionary
computing and genetic algorithms. We did some pretty mean stuff on those
machines, and now we have them in our pockets.

~~~
afsina
The main issue here is probably not squeezing memory but squeezing
performance. Even using regular SIMD is not good enough if your network is
medium-sized. They apply linear quantization, lookups, and special SIMD
operations to make it speedy.

See here for what they did for offline speech recognition:
[http://static.googleusercontent.com/media/research.google.co...](http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/41176.pdf)
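The linear quantization part can be sketched quickly (this is the generic technique; the exact scheme they used is in the linked paper, and the matrix size here is arbitrary): map float32 weights onto int8 with a single scale factor, so the matrix shrinks 4x and the multiplies can run on cheap integer SIMD instructions.

```python
import numpy as np

np.random.seed(0)
W = np.random.randn(256, 256).astype(np.float32)   # float32 weight matrix

# Linear (symmetric) quantization: map [-|W|max, |W|max] onto [-127, 127].
scale = np.abs(W).max() / 127.0
W_q = np.round(W / scale).astype(np.int8)          # 4x smaller

# Dequantizing shows the rounding error is bounded by half a step.
W_deq = W_q.astype(np.float32) * scale
print(W_q.nbytes, W.nbytes)                        # int8 copy is 4x smaller
print(float(np.abs(W - W_deq).max()) <= scale / 2 + 1e-6)
```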

------
afsina
They did this even more impressively when squeezing their speech recognition
engine to mobile devices.

[http://static.googleusercontent.com/media/research.google.co...](http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/41176.pdf)

------
teraflop
A possibly relevant research paper that they didn't mention: "Distilling the
Knowledge in a Neural Network"
[http://arxiv.org/abs/1503.02531](http://arxiv.org/abs/1503.02531)

------
cossatot
International travel now has a new source of entertainment: On-the-spot
generation of humorous mistranslations.

~~~
joosters
The oddest result I ever got from WordLens was when using it to translate a
page of poetry on a plaque. The output was wonderful :)

WordLens was awesome for translating fragments of foreign languages - stuff
like signs, menus and so on. But its offline translation seemed to be little
more than a word->word translation, so there is a huge scope for improvement
there. Very difficult when working offline!

------
zippzom
What are the advantages of using a neural network over generating
classification trees or using other machine learning methods? I'm not too
familiar with how neural nets work, but it seems like they require more
creator input than other methods, which could be good or bad I suppose.

~~~
boomzilla
Neural networks (and plain old trusted logistic regression :) ) handle raw,
continuous data better than the other learning algorithms. For example, if
your inputs are images or audio recordings, it's really hard to do
classification with decision trees or random forests, as you'd need to
construct the features manually. What would be a feature: color densities,
color histograms, edges, corners, Haar-like features, etc.? The promise of a
multilayer neural network is that, given a lot of data, the right network
structure, an appropriate learning strategy, and a huge farm of GPUs, the
network can automatically learn the right features from raw data in the first
layers and utilize those features in later layers. The big advantage of this
approach is that you abstract away the domain problems (hopefully) and focus
on picking the right network design, the right learning strategy, collecting
a good data set, etc. Neural network training is also easy to parallelize, so
Google and the like can leverage their huge infrastructures.

Now, if the features in the domain problem are more well defined, like credit
ratings, and data is sparse, and domain expertise is available, decision trees
are perfectly valid options.
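The "construct the features manually" point can be illustrated with a tiny sketch (the 8-bin histogram is just one hypothetical hand-crafted feature, chosen for illustration): a tree or forest consumes a compact engineered vector, while a neural net can take the raw pixels directly.

```python
import numpy as np

np.random.seed(0)
img = np.random.randint(0, 256, size=(32, 32))   # toy grayscale "image"

# Hand-crafted feature a decision tree/forest might use:
# a normalized 8-bin intensity histogram.
hist, _ = np.histogram(img, bins=8, range=(0, 256))
features = hist / hist.sum()                     # length-8 feature vector

# What a neural net can consume instead: the raw pixels themselves.
raw = img.reshape(-1) / 255.0                    # length-1024 input vector

print(features.shape, raw.shape)
```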

~~~
danieldk
_For example, if your inputs are images or audio recordings,_

Just wanted to add: and word/character/phrase embeddings.

------
poslathian
The article mentions algorithmically generating the training set. See here for
some earlier research in this area:
[http://bheisele.com/heisele_research.html#3D_models](http://bheisele.com/heisele_research.html#3D_models)

------
modfodder
Here's a short video about Google Translate just released.

[https://www.youtube.com/watch?v=0zKU7jDA2nc&index=1&list=PLe...](https://www.youtube.com/watch?v=0zKU7jDA2nc&index=1&list=PLeqAcoTy5741GXa8rccolGQaj_nVGw76g)

------
up_and_up
This technology has been around since 2010 and was developed by Word Lens,
which was acquired by Google in 2014:

[https://en.wikipedia.org/wiki/Word_Lens](https://en.wikipedia.org/wiki/Word_Lens)

------
mrigor
For those unfamiliar with Google's deep learning, this talk covers their
recent efforts pretty well [https://youtu.be/kO-Iw9xlxy4](https://youtu.be/kO-Iw9xlxy4) (not technical)

------
dharma1
Would be great to see a more in depth article about this, and maybe even some
open source code?

~~~
sarwechshar
I would be interested in this as well. So far, I've found a similar app called
Mitzuli, which is based on open-source tools:

[http://www.mitzuli.com/en/](http://www.mitzuli.com/en/)

------
pschanely
Doesn't this article seem to say that the size of the training set is related
to the size of the resulting network? The network's size should be
proportional to the number of nodes/layers it is configured with, not the
number of training instances. Am I missing something?

~~~
alok-g
The network is sized to be able to learn the training data reasonably well
(e.g. via hyper-parameter optimization). If there is too much variation in the
data that is not seen in the real application (like the rotation of letters
mentioned in the article), an appropriately sized network will still learn it,
but it would be overkill for the application at hand.

------
megalodon
I generated training sets for an OCR project in JavaScript [1] a while ago
using a modified version of a captcha generator [2] (practically the same
technique mentioned in this article).

[1] [https://github.com/mateogianolio/mlp-character-recognition](https://github.com/mateogianolio/mlp-character-recognition)

[2] [https://github.com/mateogianolio/mlp-character-recognition/b...](https://github.com/mateogianolio/mlp-character-recognition/blob/master/captcha.js)

------
hellrich
I wonder if they use some kind of (neural) language model for their
translations. Using only a dictionary (as in the text) would be about 60 years
behind the state of the art...

------
tdaltonc
Anyone want to do a $1 bet on an over/under for how long until word lens can
handle Chinese?

~~~
agazso
There is an app called Waygo that's already capable of handling Chinese, so I
guess it's not too far off.

------
birdsbolt
Why do they need a deep learning model for this? They are obviously targeting
signs, product names, menus, and the like. The model will obviously fail at
translating large texts.

Was there any advantage to using a deep learning model instead of something
more computationally simple?

------
Uhhrrr
I don't get it. They say they use a dictionary, and they say it works without
an Internet connection. How can both things be true? I'm pretty sure there's
not, say, a Quechua dictionary on my phone.

~~~
mattmanser
It doesn't come with all the languages, you have to download them.

~~~
rndn
You can download the language packs on Android and iOS, and each one is about
4 MB in size.

------
xigency
Given the reliability of closed captions on YouTube and the frequency of
errors in plaintext Google translate, I wouldn't be surprised if this service
fails often, and often when you need it most.

------
joosters
WordLens was an awesome app and it's good to see that Google is continuing the
development.

The new fad for using the 'deep learning' buzzword annoys me, though. It seems
so meaningless. What makes one kind of neural net 'deep', and are all the
other ones suddenly 'shallow'?

~~~
teraflop
> What makes one kind of neural net 'deep' and are all the other ones suddenly
> 'shallow'?

If this is a serious question, then googling "what is a deep neural network"
would take you to any number of explanations. But to summarize very briefly,
it's not a buzzword; it's a technical term referring to a network with
multiple nonlinear layers that are chained together in sequence. Deep networks
have been talked about for as long as neural networks have been a research
subject, but it's only in the last few years that the mathematical techniques
and computational power have been available to do really interesting things
with them.
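The definition is easy to see in code (the layer sizes here are arbitrary, chosen for illustration): a "deep" net is just several linear maps with a nonlinearity between each, composed in sequence.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)     # the nonlinearity between layers

np.random.seed(0)
sizes = [64, 128, 128, 128, 10]   # input, three hidden layers, output
weights = [np.random.randn(a, b) * 0.1 for a, b in zip(sizes, sizes[1:])]

x = np.random.randn(64)
for W in weights[:-1]:
    x = relu(x @ W)               # each layer: linear map + nonlinearity
logits = x @ weights[-1]          # final linear layer
print(logits.shape)               # a 10-way output
```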

The "fad" (as you call it) is not mainly because the word "deep" sounds cool,
but because companies like Google have been seeing breakthrough results that
are being used in production as we speak. For example:

[http://papers.nips.cc/paper/4687-large-scale-distributed-dee...](http://papers.nips.cc/paper/4687-large-scale-distributed-deep-networks.pdf)

[http://static.googleusercontent.com/media/research.google.co...](http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43793.pdf)

[http://static.googleusercontent.com/media/research.google.co...](http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/42538.pdf)

~~~
joosters
I honestly didn't realise that it had any definition - I see now that calling
it a 'fad' is unfair. However, the boundary between deep learning and
(representational) machine learning still seems murky.

~~~
strebler
Considering the very significant accuracy gains deep learning has achieved
over previous approaches (and across a number of fields), it's certainly not a
simple fad. Having worked in computer vision for a good 8+ years, I can say
deep learning is basically amazing.

Deep learning is a form of representation/feature learning.

------
anantzoid
Just waiting for the paper to come out that details all the transformations
that were done on the training data specifically for the phone, and how they
decided to use them.

> To achieve real-time, we also heavily optimized and hand-tuned the math
> operations. That meant using the mobile processor’s SIMD instructions and
> tuning things like matrix multiplies to fit processing into all levels of
> cache memory.

Let's see how this turns out. I'm still wondering whether other apps might
crash because of this.

~~~
sp332
Not fitting into cache just means it will run slower. Why would it crash?

~~~
anantzoid
Other apps getting slow is also not a very good thing!

