
Outrageously Large Neural Nets: Sparsely-Gated Mixture-of-Experts Layer (2017) - msoad
https://arxiv.org/abs/1701.06538
======
merricksb
Discussion from two years ago, at the time of publication:

[https://news.ycombinator.com/item?id=13518039](https://news.ycombinator.com/item?id=13518039)

------
jcims
N00b here.

I can’t tell if this is primarily intended to provide effective sharding of a
largely homogeneous network, or if it’s intended to allow for incorporation of
diverse networks and use the gating to classify and route the inputs to the
appropriate networks.
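For what it's worth, the gating in the paper routes among a pool of identically-shaped experts, and the routing is learned jointly with the experts rather than acting as a classifier over hand-built diverse networks. A minimal sketch of the top-k gating idea (NumPy, toy sizes; the noise term and load-balancing losses from the paper are omitted):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, k = 16, 8, 2

# A pool of identically-shaped feed-forward "experts".
W_experts = rng.normal(size=(n_experts, d, d)) * 0.1
W_gate = rng.normal(size=(d, n_experts)) * 0.1  # learned gating weights

def moe_layer(x):
    logits = x @ W_gate                    # gating score for every expert
    top = np.argsort(logits)[-k:]          # keep only the top-k experts
    gates = np.exp(logits[top])
    gates /= gates.sum()                   # softmax over the survivors
    # Conditional computation: only the k selected experts actually run.
    return sum(g * np.tanh(x @ W_experts[i]) for g, i in zip(gates, top))

y = moe_layer(rng.normal(size=d))
print(y.shape)  # (16,)
```

So it is less "classify and dispatch to specialist networks" and more "sharded capacity where specialization emerges from training the gate".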

~~~
hadsed
Heh. I'm not sure you'd be wrong either way.

------
sgillen
This work uses conditional computation to allow for "Outrageously large"
networks which may work better in practice but which will be even harder to
understand.

I'm very interested in working on using this same sort of conditional
computation to make reasonably sized neural networks easier to understand. Has
anyone seen papers on this sort of work?

~~~
plutonorm
What is it with needing to understand NNs? It's the same as not trusting a
human to drive a car because you cannot understand every stage of computation
happening in their brain. If a neural network learns a task, test it well
enough to know that it performs well in the target domain before using it.
Don't expect to understand how it works and, from there, claim to know that it
will work well and so have more confidence in it. That approach barely works
in standard software, let alone a neural network. Stop worrying and learn to
love the NN; after all, it is a mirror of your own ineffable nature.

~~~
varjag
_" Those are scary things, those gels. You know one suffocated a bunch of
people in London a while back?"

Yes, Joel's about to say, but Jarvis is back in spew mode. "No shit. It was
running the subway system over there, perfect operational record, and then one
day it just forgets to crank up the ventilators when it's supposed to. Train
slides into station fifteen meters underground, everybody gets out, no air,
boom."

"These things teach themselves from experience, right?," Jarvis continues. "So
everyone just assumed it had learned to cue the ventilators on something
obvious. Body heat, motion, CO2 levels, you know. Turns out instead it was
watching a clock on the wall. Train arrival correlated with a predictable
subset of patterns on the digital display, so it started the fans whenever it
saw one of those patterns."

"Yeah. That's right." Joel shakes his head. "And vandals had smashed the
clock, or something."_

~~~
sgt101
To summarise: if a system is not understood there exists the possibility of
sudden, unexpected harm. The system is unsafe. This is (probably) ok if the
system is putting icing on donuts (you might get a bad batch) but is
definitely not ok if the system is deciding on dosing levels for drugs or
controlling machines that could suddenly smash into queues of school children.

~~~
varjag
Moreover, if (as is customary in all technology) you build systems upon these
systems, even low malfunction probabilities will multiply into nearly assured
failure. With a system you understand, you can find the cause and fix the
issue for everyone: this is what allows us to build ever more complex systems
over decades of engineering R&D.

But with a system you don't understand, you are at the whim of cascading
patterns of errors in underlying behavior.
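To put numbers on this: even a component that behaves correctly 99.9% of the time, composed a thousand deep (hypothetical figures, assuming independent failures), gives a system that fails more often than not:

```python
p_ok = 0.999            # probability one component behaves on a given run
n = 1000                # components the system depends on
system_ok = p_ok ** n   # all must behave, assuming independent failures
print(f"P(system ok) = {system_ok:.3f}")  # ≈ 0.368
```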

~~~
p1esk
Two counter examples:

1. A CPU you rely on is a well-understood system, yet unexpected failures do
happen (e.g. the Pentium FDIV bug, Spectre exploits, etc.).

2. A human you rely on might fail unexpectedly (tired, drunk, heart attack,
going crazy, embracing terrorism, etc.).

After testing a reasonable number of things, if NNs perform more reliably than
the other systems we currently rely on, it will become increasingly hard to
justify using those other systems, especially when doing so means more people
accidentally dying every year.

~~~
varjag
I believe "With a system you understand you can find the cause and fix the
issue for all" covers the first case. The second case is applicable to both NN
and traditional control systems.

~~~
p1esk
So which system would you personally prefer to rely on in life or death
situation, the one that is well understood (accident rate 0.0001%), or poorly
understood (accident rate 0.000001%)?

------
p1esk
This is from 2017, so probably obsolete by now.

~~~
sanxiyn
It is. For example, it uses LSTM, which is obsolete now.

~~~
currymj
People keep saying this with extreme confidence; I’m not sure I buy it.

Certainly recurrent networks in general are not obsolete, even if
attention/convolution works better for some applications.

Perhaps one ought to try GRU before LSTM but there’s no reason to suppose that
it would dominate in all cases.

~~~
terminalhealth
Indeed. Here is a very fresh paper finding that attention is certainly not all
you need, as sometimes recurrence is necessary.

[https://arxiv.org/abs/1906.01603](https://arxiv.org/abs/1906.01603)

This is also obvious: without recurrence you cannot remember information that
is not externally visible, but it can be computationally very convenient, and
often necessary, to maintain such hidden state.

The hard part is learning representations for hidden information, as
recurrences are plagued by vanishing and shattering gradients.
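The vanishing part is visible in a one-line toy calculation: backprop through T recurrent steps multiplies T Jacobians together, so with a contractive recurrent weight the gradient from the distant past shrinks geometrically (scalar linear recurrence, purely illustrative):

```python
T = 100
w = 0.9       # scalar "recurrent weight"; |w| < 1 is contractive
# d h_T / d h_0 for the linear recurrence h_t = w * h_{t-1}
grad = w ** T
print(grad)   # ~2.7e-5: the learning signal from 100 steps back is all but gone
```

Gating mechanisms like the LSTM's were designed precisely to keep that product from collapsing.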

