
Large Scale Deep Learning – Jeff Dean [pdf] - coderush
http://static.googleusercontent.com/media/research.google.com/en/us/people/jeff/CIKM-keynote-Nov2014.pdf
======
brandonb
For those of you who want to learn the nuts and bolts of deep neural networks,
Andrew Ng's tutorial on Unsupervised Feature Learning and Deep Learning is
getting older but still great:
[http://ufldl.stanford.edu/wiki/index.php/UFLDL_Tutorial](http://ufldl.stanford.edu/wiki/index.php/UFLDL_Tutorial)

The best research results from 2014 and 2013 make less use of the unsupervised
techniques than initially expected, so I would start with the sections below,
which emphasize supervised learning with deep neural networks:

Sparse Autoencoder: Neural Networks, Backpropagation Algorithm

Building Deep Networks for Classification: Deep Networks: Overview, Fine-
tuning Stacked AEs

Working with Large Images: Feature extraction using convolution
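
To give a flavor of that last section, "feature extraction using convolution" just means sliding a small filter over the image and recording the response at each position. A minimal numpy sketch (the edge filter here is hand-picked for illustration; in a real network it would be learned):

```python
import numpy as np

def conv2d_valid(image, kernel):
    """'Valid' 2-D convolution (cross-correlation, as CNNs use the
    term): dot the kernel with each image patch it fits over."""
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kH, j:j + kW] * kernel)
    return out

# Toy image: dark on the left, bright on the right (one vertical edge).
image = np.zeros((6, 6))
image[:, 3:] = 1.0
edge_filter = np.array([[1.0, -1.0]])  # hand-picked, not learned
features = conv2d_valid(image, edge_filter)
# The feature map responds only at column 2, where the edge sits.
```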

You'll need some background in matrix algebra, calculus, and probability to
understand this. A previous machine learning course, while not strictly
necessary, is probably extremely helpful--I'd recommend any standard ML course
on Coursera or Udacity, or working through any standard textbook.

EDIT: I almost forgot that Michael Nielsen (who co-wrote the standard textbook
on quantum computation) is also writing a free online textbook on Neural
Networks and Deep Learning. Chapters 1-4 are currently available and would get
you pretty far:
[http://neuralnetworksanddeeplearning.com/](http://neuralnetworksanddeeplearning.com/)

~~~
kastnerkyle
Bookwise, Yoshua Bengio, Aaron Courville, and Ian Goodfellow are nearly
finished with their MIT Press book on deep learning:
[http://www.iro.umontreal.ca/~bengioy/dlbook/](http://www.iro.umontreal.ca/~bengioy/dlbook/)
. It is pretty strong on the true theory of what is going on in deep networks,
and has fairly good intuition for how and why things work. Paired with the
deep learning tutorials
[http://www.deeplearning.net/tutorial/](http://www.deeplearning.net/tutorial/),
as well as the content from UFLDL, it makes a pretty strong foundation for
advanced study.

Michael's book seems to target a more introductory level - a beginner might be
better off starting with that, following it with Andrew Ng's ML course (which
has a section on neural nets, including an assignment implementing
backpropagation), and then continuing with the deep learning book and the
{deep learning, UFLDL} tutorials. That should be solid enough to at least read
most of the cutting-edge work and papers, if that is the aim.

Hugo Larochelle's youtube course
[https://www.youtube.com/playlist?list=PL6Xpj9I5qXYEcOhn7Tqgh...](https://www.youtube.com/playlist?list=PL6Xpj9I5qXYEcOhn7TqghAJ6NAPrNmUBH)
and Hinton's coursera course
[https://www.coursera.org/course/neuralnets](https://www.coursera.org/course/neuralnets)
are also great references.

~~~
brandonb
Didn't realize Yoshua & co had a book coming! That would definitely be the one
to read.

BTW, for anybody who wants to learn machine learning in general, Kyle's blog
also seems to be packed full of clear explanations with working demo code:
[http://kastnerkyle.github.io/](http://kastnerkyle.github.io/)

Very nice!

~~~
kastnerkyle
Thanks for checking it out! I am planning to add a few deep learning related
posts during the holidays. The recent results for NLP, captioning and speech
using encoder/decoder models are just too cool not to demo.

------
nl
This whole slide deck is worth reading. A couple of highlights:

Pg 26, quote: "Anything humans can do in 0.1 sec, the right big 10-layer
network can do too". That is a very bold claim. It encompasses the entire
fields of image and voice recognition as well as knowledge encoding. It's
slowly becoming clear that this is likely to be true.

Pg 39, 40: The 2012 ImageNet-winning system (AlexNet) had 8 layers and a top-5
error rate of around 16%. Google's 2014 winner, GoogLeNet, had 22 layers and an
error rate of 6.66%. Note that trained humans have an error rate of around
5%[1].

Pages 50-57 talk about the miracle that is Word2Vec, and what is possible with
it.

Pages 60-70 talk about paragraph embedding. I haven't seen this published
before.

Pages 70-73 extend word/paragraph embedding to translation. I've seen a slide
deck showing this works before, but I need to read the new paper cited there.

Pages 74+ talk about cross-modal embeddings, especially the caption-generation
stuff. HN has had a few things on that over the past month or so.

[1] [http://karpathy.github.io/2014/09/02/what-i-learned-from-competing-against-a-convnet-on-imagenet/](http://karpathy.github.io/2014/09/02/what-i-learned-from-competing-against-a-convnet-on-imagenet/)

~~~
Strilanc
Whelp, Page 54 blew my mind right out the window.

> E(hotter) - E(hot) + E(big) ≈ E(bigger)

> E(Rome) - E(Italy) + E(Germany) ≈ E(Berlin)

These things are linearly separable?!

~~~
abrichr
Deep neural networks essentially transform input data into a vector space
where the data is "easier" to model. So while the input vectors may not be
linearly separable in the input space, the network learns how to transform the
input vectors into a space where they are.
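
To make that concrete, here is a toy sketch of the vector-offset arithmetic. The embeddings below are hand-made 3-d vectors, not learned word2vec ones, so this only illustrates the geometry, not the learning:

```python
import numpy as np

# Hand-made 3-d "embeddings" purely for illustration; real word2vec
# vectors are learned from text.
E = {
    "hot":    np.array([1.0, 0.0, 0.0]),
    "hotter": np.array([1.0, 1.0, 0.0]),  # hot + comparative direction
    "big":    np.array([0.0, 0.0, 1.0]),
    "bigger": np.array([0.0, 1.0, 1.0]),  # big + the same direction
}

def nearest(v, vocab):
    """Word whose embedding has the highest cosine similarity to v."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(vocab, key=lambda w: cos(vocab[w], v))

# E(hotter) - E(hot) + E(big) lands on E(bigger)
print(nearest(E["hotter"] - E["hot"] + E["big"], E))  # → bigger
```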

------
agibsonccc
I know distributed neural nets are 10000 miles out there for most, but I just
want to add a few nuggets for those considering it.

I work with distributed deep nets quite a bit. It's a different animal than
training on a GPU.

I am working on benchmarks with my framework deeplearning4j now.

That aside, here are a few neat references/projects that should be digestible.

For those of you already in neural net land, there are a few key takeaways when
doing distributed neural nets:

parameter averaging across mini batches

(depending on the algorithm) adagrad

momentum
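
The first of those, parameter averaging, can be sketched in a few lines. This is a toy single-process simulation (four "workers" on shards of a linear-regression problem; all names and sizes are illustrative), not how any particular framework implements it:

```python
import numpy as np

def train_round(global_w, shards, lr=0.1):
    """Each worker starts from the global params, takes one SGD step
    on its own mini-batch shard, then the results are averaged."""
    local_ws = []
    for X, y in shards:
        grad = X.T @ (X @ global_w - y) / len(y)  # MSE gradient
        local_ws.append(global_w - lr * grad)     # one local SGD step
    return np.mean(local_ws, axis=0)              # parameter averaging

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
X = rng.normal(size=(200, 2))
y = X @ true_w
shards = [(X[i::4], y[i::4]) for i in range(4)]  # 4 workers' shards

w = np.zeros(2)
for _ in range(200):
    w = train_round(w, shards)
# w converges to true_w
```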

[1] Project Adam: [http://www.wired.com/2014/07/microsoft-adam/](http://www.wired.com/2014/07/microsoft-adam/)

[2] Associated paper: [https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-chilimbi.pdf](https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-chilimbi.pdf)

[3] Hogwild algorithm: [http://www.eecs.berkeley.edu/~brecht/papers/hogwildTR.pdf](http://www.eecs.berkeley.edu/~brecht/papers/hogwildTR.pdf)

[4] A variation of this I use, called Iterative Reduce, by my partner Josh Patterson: [https://github.com/jpatanooga/KnittingBoar/wiki/Iterative-reduce](https://github.com/jpatanooga/KnittingBoar/wiki/Iterative-reduce)

[5] Sandblaster L-BFGS by Dean and co.: [http://research.google.com/archive/large_deep_networks_nips2012.html](http://research.google.com/archive/large_deep_networks_nips2012.html)
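
To give a flavor of Hogwild [3]: workers apply SGD updates to shared parameters with no locking at all, counting on sparse gradients to keep write conflicts rare. A toy sketch with Python threads (the linear-regression task, sizes, and rates here are purely illustrative):

```python
import threading
import numpy as np

rng = np.random.default_rng(1)
true_w = np.array([1.0, 2.0, 3.0])
X = rng.normal(size=(3000, 3))
y = X @ true_w

w = np.zeros(3)  # shared parameters, no lock anywhere

def worker(rows, lr=0.05):
    for i in rows:
        grad = (X[i] @ w - y[i]) * X[i]  # per-example gradient
        w[:] -= lr * grad                # racy, lock-free update

threads = [threading.Thread(target=worker, args=(range(t, 3000, 4),))
           for t in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# Despite the races, w ends up close to true_w.
```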

------
andrewcamel
Reading through documents like this pains me: it seems like such interesting
work, and I immediately want to understand it better, but then I realize the
time required to acquire the knowledge and experience necessary to understand
and apply this technology is so great that it almost seems like a waste of
time. After all, think of all the things one could build in the two full-time
years it would take to comprehend all of this well enough to use it in any
practical application.

~~~
frozenport
You should be complaining about the computing resources it takes to train 24
hidden layers.

~~~
tonydiv
GPUs make a lot of this tractable. Nvidia is actually offering some free
compute time for researchers:

[http://www.nvidia.com/object/gpu-test-drive.html](http://www.nvidia.com/object/gpu-test-drive.html)

------
pacala
This video uses the same slide deck: [https://www.youtube.com/watch?v=vvK-XOiKXOs](https://www.youtube.com/watch?v=vvK-XOiKXOs)

------
cr4zy
The second half of this video has an earlier talk that goes along with the
slides:
[http://youtu.be/S9twUcX1Zp0?t=22m49s](http://youtu.be/S9twUcX1Zp0?t=22m49s)

------
ivan_ah
HN readers in the Montreal area will have a chance to listen to the talk in
person at the McGill colloquium:

_Scaling Deep Learning_, Wednesday, December 10th, 2:00-3:00 PM, in the M1
amphitheater of the Strathcona building at McGill University, 3640 University
Street.

------
mr_overalls
Does deep learning at this scale offer any obvious benefits for genomic
analysis? Would it make sense to take data from the 1000 Genomes Project (or a
similar large-scale sequencing effort) and perform association studies?

------
forgotAgain
I wonder how many machines are chomping on raw data generated by Chrome,
Android, and search. What is Google trying to learn, and what does it already
know?

