
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-Of-Experts Layer - aaronyy
https://arxiv.org/abs/1701.06538
======
2bitencryption
Designing a neural network is a thousand times harder than I imagined.

After AlphaGo, I tasked myself with creating a neural network that would use
Q-Learning to play Reversi (aka Othello).

At that point, I had already utilized Q-Learning (the tabular version, not
using a neural network) for some very simple and mostly proof-of-concept
projects, so I understood how it worked. I read up on perceptrons, ReLU, the
benefits/disadvantages of having more/fewer layers, etc.

Then I actually started on the project, thinking "I know about Q-Learning, I
know about neural networks, now I just need to use Keras and I'll have a
network ready to learn in about twenty lines of python."

Boy was that naive. Regardless of how much you understand the CONCEPTS of
neural networks, actually putting together an effective one that matches the
problem state perfectly is so, so difficult (especially if there are no
examples to build off of). How many layers? Dropout or no, and if so, how
much? Do you flatten this layer, do you use relu, do you need a SECOND neural
network to approximate one part of the q-function and another to approximate a
different part?

I spent MONTHS messing with the hyperparameters, and got nowhere because I'm
doing this on a desktop PC without CUDA, so it takes days to train a new
configuration only to find out it hardly "learned" anything.

At one point after days of training, my agent actually had a 90% LOSE rate
against an opponent that played totally randomly. To this day I am baffled by
this.

I went into the project thinking "I have this working with a table, the
q-learning part is in place -- just need to drop in a neural net in place of
the table and I'm good to go!" It's been almost a year and I still haven't
figured this thing out.
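
(For concreteness, the kind of drop-in I mean is roughly the sketch below. It's
not my actual code, and the layer sizes are arbitrary guesses: the board goes
in as two 8x8 planes and 64 Q-values come out, one per square.)

    # Rough sketch of a Q-network for 8x8 Reversi (hypothetical layer sizes).
    # Input: the board as two 8x8 planes (my discs, opponent's discs).
    # Output: one Q-value per square, i.e. 64 candidate moves.
    from keras.models import Sequential
    from keras.layers import Dense, Flatten

    model = Sequential([
        Flatten(input_shape=(2, 8, 8)),   # board planes -> 128-dim vector
        Dense(256, activation='relu'),
        Dense(256, activation='relu'),
        Dense(64, activation='linear'),   # Q(s, a) for each square
    ])
    model.compile(optimizer='adam', loss='mse')

    # Q-learning target for a transition (s, a, r, s'):
    #   target[a] = r + gamma * max over a' of Q(s', a'), or just r if terminal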

~~~
general_ai
Doing anything large on a machine without CUDA is a fool's errand these days.
Get a GTX1080 or if you're not budget constrained, get a Pascal-based Titan. I
work in this field, and I would not be able to do my job without GPUs -- as
simple as that. You get 5-10x speedup right off the bat, sometimes more. A
very good return on $600, if you ask me.

~~~
solomatov
And why a Pascal-based Titan? Is it the best investment in terms of performance
per $ spent?

Also, how cost effective is it to use cloud GPUs for real world machine
learning?

~~~
nl
Cloud GPUs are cost effective if you need to fine-tune a pretrained network
(e.g. use a pretrained ResNet/VGG/AlexNet for custom classes, see [1]), if you
only need inference, or if you don't want the upfront costs.

A 4GB GTX1050 is ~$180. A p2 instance on Amazon is $0.90/hour. The cost
effectiveness depends on whether you already have a PC.

[1] https://blog.keras.io/building-powerful-image-classification-models-using-very-little-data.html
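
The fine-tuning case is basically the recipe in [1]: freeze a pretrained base
and train a small new head on your own classes. A minimal Keras sketch (the
input shape and class count are placeholders):

    # Frozen pretrained VGG16 base + small trainable head for custom classes.
    from keras.applications import VGG16
    from keras.models import Model
    from keras.layers import Flatten, Dense

    base = VGG16(weights='imagenet', include_top=False,
                 input_shape=(150, 150, 3))
    for layer in base.layers:
        layer.trainable = False           # only the new head gets trained

    x = Flatten()(base.output)
    x = Dense(256, activation='relu')(x)
    out = Dense(1, activation='sigmoid')(x)   # binary custom class, as in [1]

    model = Model(inputs=base.input, outputs=out)
    model.compile(optimizer='rmsprop', loss='binary_crossentropy')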

~~~
dTal
And the cost of your electricity, don't forget.

~~~
robotresearcher
Current mean cost to US domestic consumers is $0.12/kWh, says Google.

TDP specs: GTX 1050 75W, GTX 1080 180W, Titan 250W.

Triple these numbers for the overall machine's PSU spec.

So even the Titan costs $0.03 per hour to run, or maybe $0.10 if the rest of
the machine is flat-out.

edit: $0.19/kWh for San Francisco residents. Domestic rates.
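
Spelled out, in case anyone wants to plug in their own rate (same TDPs as
above; the 3x whole-machine factor is a rough guess):

    # dollars per hour = (watts / 1000) * dollars per kWh
    rate = 0.12                          # US mean domestic rate, $/kWh
    for name, watts in [('GTX 1050', 75), ('GTX 1080', 180), ('Titan', 250)]:
        gpu_only = watts / 1000.0 * rate        # GPU alone
        whole_box = 3 * watts / 1000.0 * rate   # rest of machine flat-out
        print(name, round(gpu_only, 3), round(whole_box, 3))
    # Titan: ~$0.03/h alone, ~$0.09/h with the whole machine flat-out.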

------
jamesk_au
"It is our goal to train a trillion-parameter model on a trillion-word corpus.
We have not scaled our systems this far as of the writing of this paper, but
it should be possible by adding more hardware."

------
aabajian
I'm excited for the applications of deep learning in radiology. I'll be
starting radiology residency in a couple years (intern year first).

Can someone explain whether this solves the problem of identifying multiple
pathologies within the same image? For example, if a patient has pneumonia and
a pleural effusion, could this method activate two subnetworks and come up
with a consensus diagnosis based on both those networks? Or does only one
pathway get activated at a time?

~~~
barbolo
Sure you can. The term you are looking for is "multi-task learning".
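
Roughly: one shared trunk that looks at the image, plus a separate output head
per finding, trained jointly. A toy Keras sketch (the layer sizes and findings
are just made-up examples, not a real radiology model):

    # Toy multi-task model: shared encoder, one sigmoid head per pathology.
    from keras.models import Model
    from keras.layers import Input, Conv2D, GlobalAveragePooling2D, Dense

    img = Input(shape=(256, 256, 1))              # e.g. a chest X-ray
    x = Conv2D(32, 3, activation='relu')(img)
    x = Conv2D(64, 3, activation='relu')(x)
    x = GlobalAveragePooling2D()(x)               # shared representation

    pneumonia = Dense(1, activation='sigmoid', name='pneumonia')(x)
    effusion = Dense(1, activation='sigmoid', name='pleural_effusion')(x)

    model = Model(inputs=img, outputs=[pneumonia, effusion])
    model.compile(optimizer='adam', loss='binary_crossentropy')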

------
ben_mann
Impressive! With this work they managed to increase quality for most language
pairs while halving runtime computational cost and decreasing training time.
Since it's a generic component, I can see it getting used on all sorts of
problems with lots of data.

Especially interesting given that most Kaggle competitions are won with
mixture of experts models.
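
The compute saving comes from the gating network picking only a few experts
per example, so most of the parameters sit idle on any given input. A
simplified numpy sketch of the top-k gating idea (the paper's noise term and
load-balancing losses are left out):

    import numpy as np

    def moe_forward(x, W_gate, experts, k=2):
        """x: input vector, W_gate: (dim, n_experts), experts: list of fns."""
        logits = x @ W_gate                     # one gating logit per expert
        topk = np.argsort(logits)[-k:]          # indices of the k best experts
        gates = np.exp(logits[topk] - logits[topk].max())
        gates = gates / gates.sum()             # softmax over the selected k
        # Only the k chosen experts are evaluated; the rest stay idle.
        return sum(g * experts[i](x) for g, i in zip(gates, topk))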

------
meow_mix
More parameters is very exciting! I wonder if anyone here has read about the
relationship between the number of parameters in a network and the size of the
dataset it is trained on? Wouldn't such a large number of parameters risk
serious overfitting by giving the model the capacity to memorize individual
training samples?

~~~
nshazeer
I am one of the authors of "Outrageously Large Neural Networks". Yes -
overfitting is a problem. We employed dropout to combat overfitting. Even with
dropout, we found that adding additional capacity provides diminishing returns
once the capacity of the network exceeds the number of examples in the
training data (see sec. 5.2). To demonstrate significant gains from really
large networks, we had to use huge datasets, up to 100 billion words.
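
(Dropout here in the usual sense: randomly zeroing a fraction of activations
during training so the network can't memorize individual examples. A generic
Keras illustration, not the configuration we used in the paper:)

    from keras.models import Sequential
    from keras.layers import Dense, Dropout

    model = Sequential([
        Dense(512, activation='relu', input_shape=(1024,)),
        Dropout(0.5),           # zero 50% of activations on each training step
        Dense(10, activation='softmax'),
    ])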

~~~
petra
Impressive work, mate!

Does mixture-of-experts work well the other way around, as a way to minimize
power and hardware on common-sized problems?

And would it work in low-precision networks, like BinaryConnect?

