
Deep Learning Breakthrough Made by Rice University Scientists - Tomte
https://arstechnica.com/gadgets/2019/12/mach-ai-training-linear-cost-exponential-gain/
======
primitivesuave
I would hardly characterize this as a breakthrough:
[https://openreview.net/forum?id=r1RQdCg0W](https://openreview.net/forum?id=r1RQdCg0W)

~~~
Audoenus
A good rule of thumb I always use: if a science article title has the word
"breakthrough" in it, then it's probably not a breakthrough.

If the title does nothing to describe the actual discovery and consists
solely of "Breakthrough in [Field]", then it's definitely not a breakthrough.

~~~
jvm_
Sounds like the rule where if the headline ends in a question mark, the answer
to the question is No.

~~~
taneq
Oh, you mean Cunningham's law.

~~~
rflrob
I see what you did there...

------
m0zg
Word to the wise: as someone who actually works in the field, trust NO claims
until you can verify them with real code.

Papers very often report the very uppermost bound of what's _theoretically_
possible when it comes to benchmarks. Researchers rarely have the skill to
realize those gains in practice, so any performance numbers in papers should
be assumed theoretical and unverified unless you can actually download the
code and benchmark it yourself, or unless they come from a research
organization known for competent benchmarking (e.g. Google Brain). In
particular, any "sparse" approach is deeply suspect as far as its practical
performance or memory-efficiency claims go: current hardware does not deal
with sparsity well unless things are _really_ sparse (a density of 1/10th or
less) and the sparsity is able to outweigh architectural inefficiencies.
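
If you want a feel for where the crossover sits on your machine, a quick CPU
probe (SciPy CSR vs. dense NumPy; numbers will vary wildly with hardware and
BLAS build) looks like this:

    import time
    import numpy as np
    import scipy.sparse as sp

    n = 2000
    x = np.random.rand(n, n)

    for density in (0.5, 0.1, 0.01):
        a = sp.random(n, n, density=density, format="csr")  # random sparse
        a_dense = a.toarray()

        t0 = time.perf_counter(); _ = a_dense @ x
        t_dense = time.perf_counter() - t0
        t0 = time.perf_counter(); _ = a @ x
        t_sparse = time.perf_counter() - t0
        print(f"density {density}: dense {t_dense:.3f}s, "
              f"sparse {t_sparse:.3f}s")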

~~~
ganzuul
[https://github.com/Tharun24/MACH/](https://github.com/Tharun24/MACH/)

~~~
m0zg
[https://github.com/Tharun24/MACH/blob/master/amazon_670k/src...](https://github.com/Tharun24/MACH/blob/master/amazon_670k/src/run.sh)

Run on a single machine by logically partitioning GPUs. Don't get me wrong,
I'm not disputing that this could work or that it could be a "breakthrough".
I'm just saying that unless it's independently replicated and confirmed, it's
just a paper like a million others.

~~~
ganzuul
It's an interesting premise nonetheless. Perhaps another similar approach
could come from mathematical manifolds, which have charts and atlases; I
believe the atlas is built from overlapping charts.

------
mpoteat
Not a full-time ML researcher, but I thought I understood that batching is
already an extremely common practice. I don't see the novelty here.

~~~
dnautics
Bigger batches are good, but they result in locking. Picking a good batch
size relative to how much data you have is important. This new technique
effectively lets you buy a "meta batch" for free (that is a terrible analogy,
but it's the best I can do).

As batches get bigger and can't fit inside a single GPU or single compute
node, your challenge becomes data transport. So anything that can decouple
your computational agents can be a win.

In this case, it's a cleverer way of decoupling your agents. Normally
asynchronous batches are awful, but this allows for asynchronous batching of
your data in a way that sidesteps the usual problems.

If I may opine on the matter, I think we're reaching a point where machine
learning researchers should start thinking about abandoning Python as a
programming medium. For example, the other decoupling strategy (decoupled
neural net backpropagation) doesn't really seem like something I would want
to write in Python, much less debug in someone else's code. Python is really
not an appropriate language for tackling difficult problems in distribution
and network coordination.

~~~
comicjk
As long as the big ML libraries support these strategies, people will use
them. The choice of user language is not critical. TensorFlow/PyTorch are
basically an ML-specific programming model with a Python interface.

~~~
dnautics
They don't; that's my point. I can find only one library for this:
[https://arxiv.org/abs/1608.05343](https://arxiv.org/abs/1608.05343)
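
To give a flavor of what that paper (synthetic gradients, or "decoupled
neural interfaces") does, here's a toy PyTorch sketch of my reading of the
idea; not the authors' code, just the shape of it:

    import torch
    import torch.nn as nn

    # Decoupled backprop via synthetic gradients (arXiv:1608.05343, toy form).
    # `sg` predicts dL/dh for layer1's output, so layer1 can update without
    # waiting for the true gradient to flow back from layer2.
    layer1 = nn.Linear(32, 64)
    layer2 = nn.Linear(64, 10)
    sg = nn.Linear(64, 64)  # synthetic-gradient module

    opt1 = torch.optim.SGD(layer1.parameters(), lr=0.01)
    opt2 = torch.optim.SGD(
        list(layer2.parameters()) + list(sg.parameters()), lr=0.01
    )
    loss_fn = nn.CrossEntropyLoss()

    x, y = torch.randn(8, 32), torch.randint(0, 10, (8,))

    # layer1 updates immediately, using the *predicted* gradient.
    h = layer1(x)
    opt1.zero_grad()
    h.backward(sg(h.detach()).detach())
    opt1.step()

    # layer2 and the synthetic-gradient module train on the true loss.
    h_det = h.detach().requires_grad_(True)
    opt2.zero_grad()
    loss = loss_fn(layer2(h_det), y)
    loss.backward()  # fills h_det.grad with the true dL/dh
    sg_loss = ((sg(h_det.detach()) - h_det.grad.detach()) ** 2).mean()
    sg_loss.backward()
    opt2.step()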

------
ganzuul
This seems to be their latest work:
[https://arxiv.org/abs/1910.13830](https://arxiv.org/abs/1910.13830)

------
gambler
_> Instead of training on the entire 100 million outcomes—product purchases,
in this example—Mach divides them into three "buckets," each containing 33.3
million randomly selected outcomes._

So, uh, they're doing what random forests were doing for decades? What is the
key difference?

~~~
overlords
Random forests split the features. This splits the outcomes.

So each tree in RF only looks at a few features. In this, each model looks at
all the features.

RF can handle multiclass problems with tens to hundreds (maybe thousands) of
classes. This MACH algo can handle multiclass problems with
millions/billions of classes (extreme classification).
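
A back-of-the-envelope numpy sketch of the hashing trick, as I understand it
from the paper (not the authors' code): K classes get hashed into B << K
buckets, R times independently; each repetition trains an ordinary B-way
classifier, and at inference a class's score is the average of its bucket's
probability across the R models.

    import numpy as np

    K, B, R = 1_000_000, 1_000, 4  # classes, buckets per model, repetitions
    rng = np.random.default_rng(0)

    # R independent random class->bucket maps (standing in for the paper's
    # 2-universal hash functions).
    hashes = [rng.integers(0, B, size=K) for _ in range(R)]

    def mach_score(bucket_probs, candidates):
        """bucket_probs: R arrays of shape (B,), one per small model."""
        # Average the probability of the bucket each candidate falls into.
        return np.mean(
            [bucket_probs[r][hashes[r][candidates]] for r in range(R)], axis=0
        )

    # Dummy model outputs, just to show the plumbing:
    probs = [np.full(B, 1.0 / B) for _ in range(R)]
    print(mach_score(probs, np.array([0, 42, 999_999])))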

------
m3kw9
Looks like any advancement can be called a breakthrough; an onion-paper
breakthrough can be a breakthrough.

------
deadens
Umm... here's an obvious idea: what if you don't store the entire model in
memory, and instead use a message-passing architecture to distribute the
model, kinda like how HPC people have been doing this entire time?
Non-distributed models are a dead end anyway.
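
Something like this naive two-GPU split, just to make the idea concrete (a
PyTorch sketch; assumes two CUDA devices are available):

    import torch
    import torch.nn as nn

    # Naive model parallelism: each half of the net lives on its own GPU,
    # and activations are copied between devices every forward pass. The
    # per-step device-to-device copy is the message-passing part.
    part1 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
    part2 = nn.Linear(4096, 10).to("cuda:1")

    x = torch.randn(64, 1024, device="cuda:0")
    h = part1(x)
    out = part2(h.to("cuda:1"))  # the expensive hop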

~~~
derision
Latency between GPUs kills performance

~~~
sudosysgen
It depends on just how huge the model is. Some models take multiple seconds to
run/backpropagate and might take hundreds of gigabytes of memory, in which
case it could be useful.

~~~
strbean
Also seems like a problem that could be partially solved by tailoring the NN
architecture. Does that make sense?

~~~
ganzuul
Do you mean like Stochastic Gradient Descent does?

