
Deep learning has a size problem - jamesonthecrow
https://heartbeat.fritz.ai/deep-learning-has-a-size-problem-ea601304cd8
======
buboard
The article starts with NLP models and then mentions the successes of
increasingly smaller vision models. NLP seems to be the outlier here,
increasingly becoming a pissing contest. The models are too big and not
particularly useful. OpenAI spread FUD about their model, but after its
release it's rather underwhelming. Yes, you can output some text that's
readable and paraphrases Reddit, but what about understanding, intention,
doing actually useful stuff with text? Hallucinating text in itself isn't
interesting. It seems this line of NLP with transformers has hit some kind of
dead end and they are trying to brute-force the next breakthrough - doubtful
that this will happen, though. And then we have bizarre decisions like
Microsoft releasing DialoGPT yesterday without including a generation script
because "it might be racist". This whole thing seems more like marketing than
research.

~~~
Al-Khwarizmi
Large transformer-based models like BERT and its ilk are not only useful for
hallucinating text. They have achieved measurable improvements in various
(though not all) classic NLP tasks, such as parsing, entailment recognition
and question answering. Google has reportedly used BERT to improve their
search algorithm, so it is indeed being used to do "actual useful stuff with
text".

It pains me to say this, as I'm a researcher at an institution without the
huge resources of the big tech companies, so I can't compete in the pretrained
model arms race (and also, it has made the field more boring, as creative
solutions to problems are outperformed by approaches that just pile up more
millions of parameters). But it's the truth. Although I think this will only
be a stage: at some point, performance will plateau and we will need to put
our minds to work again, rather than our GPUs.

~~~
buboard
With BERT, Google seemed to make a genuine effort to build a model that is
useful rather than record-breaking. But I think it's wrong to consider it the
"final" model upon which everything else will be built.

~~~
bitL
BERT is already outdated, but still useful, as you need only one Titan RTX to
retrain the BERT_large model via transfer learning.
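
A minimal fine-tuning sketch, assuming a recent version of the Hugging Face
transformers library (the texts, labels and hyperparameters here are toy
placeholders, not a real training setup):

    import torch
    from transformers import BertTokenizer, BertForSequenceClassification

    # Load pretrained weights; only the new classification head starts random.
    tokenizer = BertTokenizer.from_pretrained("bert-large-uncased")
    model = BertForSequenceClassification.from_pretrained(
        "bert-large-uncased", num_labels=2
    )

    # Toy batch; in practice you would iterate over a real labeled dataset.
    inputs = tokenizer(["a great movie", "a dull movie"],
                       return_tensors="pt", padding=True)
    labels = torch.tensor([1, 0])

    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
    model.train()
    outputs = model(**inputs, labels=labels)  # loss is computed when labels are given
    outputs.loss.backward()
    optimizer.step()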

~~~
turnersr
What methods make BERT outdated? Do you have pointers to other options?

~~~
bitL
e.g. XLNet:

[https://arxiv.org/abs/1906.08237](https://arxiv.org/abs/1906.08237)

~~~
phreeza
XLNet is BERT with a bunch of additional training tricks.

~~~
bitL
BERT is a Transformer with a bunch of additional training tricks. Transformer
is self-attention with a bunch of additional training tricks...
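
To make the point concrete, the self-attention at the bottom of that stack
really is just a few lines; a minimal single-head sketch (no masking or
multi-head machinery):

    import numpy as np

    def self_attention(X, Wq, Wk, Wv):
        """Single-head scaled dot-product self-attention (no masking)."""
        Q, K, V = X @ Wq, X @ Wk, X @ Wv          # project to queries/keys/values
        scores = Q @ K.T / np.sqrt(K.shape[-1])   # pairwise similarity, scaled
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)  # softmax over each row
        return weights @ V                        # attention-weighted mix of values

    # Toy example: 4 tokens with 8-dimensional embeddings.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(4, 8))
    Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
    print(self_attention(X, Wq, Wk, Wv).shape)  # (4, 8)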

------
blt
IMO this is not a problem. The people building insanely huge models are
expanding the set of tasks that can be done by a computer. Who cares how much
memory it takes?

Historically, computationally expensive methods eventually become cheap. In
the 1980s, researchers had access to Crays to develop physics models,
graphics, etc. requiring lots of floating-point math and memory. Meanwhile, on
home computers, game programmers had to implement all their math in fixed
point. Nowadays, game engines run the same algorithms that were running on the
Crays before.

Same with learning. It's great to use tricks to make models fit on phones.
Even better: use tricks to make training new models within the budget of a
small academic research lab. That doesn't mean we should invalidate all the
work that requires a huge cluster.

~~~
joe_the_user
_IMO this is not a problem. The people building insanely huge models are
expanding the set of tasks that can be done by a computer. Who cares how much
memory it takes?_

But are they? The example in the article describes an incremental improvement
on a benchmark in exchange for a massive increase in training time.

Deep learning has achieved success on a number of tasks that computers had
previously been unable to do. Since that initial period of success, it has
been a matter of debate whether deep learning has expanded its basic area of
applicability or merely improved incrementally on its initial achievements.

And if it is true that deep learning is stuck on just extending what it's
already doing, it may be that the next fundamental advance will come from one
person with one machine rather than a massive team with a massive machine.
Consider that neural nets as a theory had been around since the 1990s, if not
the 1960s, but the fundamental advantage of DL arrived when grad students
could use GPUs in the 2010s, not when massively parallel machines came into
existence (quite a bit earlier).

Here, the further wrinkle is that Moore's law is gradually ending. We won't
have access to that much more computing power twenty years hence, so making
less do more does make sense.

~~~
acollins1331
I disagree. There are lots of advancements that DL has yet to fully realize,
even with current technology. You're focused on commercial applications, but
applying neural network models, especially CV models, to many types of
scientific research has yet to be explored due to lack of funding.

~~~
joe_the_user
I'd like to think I put my comments as "potential problems", since I can't
claim to follow everything that's being done in deep learning.

Still, to continue the devil's advocate position: deep learning comes up with
a lot of things that are _suggestive_ but not tight enough in their
approximation to be useful.

I would guess there are a huge number of correlations that seem plausible but
aren't really causal. You can employ a monster stream of sort-of-intelligent-
seeming claims and predictions and find they don't yield any progress in any
firm scientific domain. The application of deep learning to finding cancer and
related diagnostic processes has been "exciting and promising" for a long time
but has effectively yielded nothing so far, because "quite accurate in highly
controlled situations" turns out to seldom be that useful.

------
pixelpoet
How did they use an elephant as cover image without mentioning von Neumann's
famous and relevant quote: "With four parameters I can fit an elephant, and
with five I can make him wiggle his trunk."

A great article on it: [https://www.johndcook.com/blog/2011/06/21/how-to-fit-an-elephant/](https://www.johndcook.com/blog/2011/06/21/how-to-fit-an-elephant/)

~~~
RocketSyntax
That came up in The Dream Machine! Reading it now.

------
galkk
I never understand such remarks.

> Given the power requirements per card, a back of the envelope estimate put
> the amount of energy used to train this model at over 3X the yearly energy
> consumption of the average American.

So what? Training the model is the hardest part; then you just reuse the
results.

> First, it hinders democratization. If we believe in a world where millions
> of engineers are going to use deep learning to make every application and
> device better, we won’t get there with massive models that take large
> amounts of time and money to train.

So what? I can't run weather simulation on my laptop.
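
That said, the quoted figure itself is easy to sanity-check; a rough sketch,
where every number (GPU count, wattage, training time, per-capita consumption)
is my own ballpark assumption rather than taken from the article:

    # Rough sanity check of the quoted estimate (all figures are assumptions:
    # 512 V100 GPUs at ~300 W each for ~9 days, vs. ~12,000 kWh/year
    # per-capita US electricity use).
    gpus, watts_per_gpu, days = 512, 300, 9
    training_kwh = gpus * watts_per_gpu / 1000 * days * 24
    american_kwh_per_year = 12_000
    print(training_kwh / american_kwh_per_year)  # ~2.8, i.e. roughly the 3X quoted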

~~~
chongli
_So what? Training the model is the hardest part; then you just reuse the results_

I doubt anyone is going to want to run a 33GB model on their phone.

_So what? I can't run weather simulation on my laptop._

You only need to run the weather simulation once and then broadcast your
forecast to everyone’s devices. You can’t do that with NLP. In order to be
useful, NLP models need to run on different input data for every user. With a
giant 33GB model, that means round-tripping to the data centre.

If you have to run everything in the cloud, your applications are limited. The
cost is also very high, given that there are way more user devices than
servers in the world. That means you need to build more data centres if you
plan to run these giant models for every application you want to offer your
users.

~~~
phoboslab
> I doubt anyone is going to want to run a 33GB model on their phone.

Why not? Many modern phones have upwards of 512GB of storage. 33 GB for a
useful model seems entirely reasonable to me.

~~~
chongli
That’s for one application. Phones have dozens of apps. If they all use
different, giant models like this, then 512GB won’t be nearly enough.

Moreover, what is the performance going to be like? It can’t be too
spectacular if your model doesn’t fit in RAM. 33GB is manageable on a beefy
server with a ton of RAM. You’re not going to have the same luxury on your
phone.

The other major aspect of it is memory bandwidth. If the model was designed to
run on a high end GPU, with all 33GB stored in graphics memory, then it’s
going to perform terribly if it has to be paged in and out of flash on a
phone.
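
A rough illustration, with all bandwidth figures being my own ballpark
assumptions for each memory tier:

    # Back-of-envelope (assumed figures): time to stream a 33 GB model once
    # through different memory tiers.
    model_gb = 33
    bandwidth_gb_per_s = {
        "GPU HBM":         900,  # high-end datacenter GPU memory
        "phone LPDDR RAM":  30,  # mobile DRAM
        "phone flash":       1,  # NAND storage
    }
    for tier, bw in bandwidth_gb_per_s.items():
        print(f"{tier}: {model_gb / bw:.2f} s per full pass")
    # GPU HBM: ~0.04 s, phone RAM: ~1.1 s, phone flash: ~33 s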

------
visarga
It's not such a problem, unless you want to train a large model (NLP or CV)
from scratch; it isn't one if you just want to fine-tune it for a related
task. So one trained model can be reused many times. In general training data
is scarce; only in a few situations is it abundant.

------
gok
The MegatronLM example is a weird one. Neural network language models are
replacing n-gram language models that grow to several terabytes for SotA
results; 8 billion parameters is tiny by comparison.
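
A rough size comparison, assuming fp32 weights and taking "several terabytes"
at face value:

    # Rough comparison (assumption: fp32 weights, 4 bytes per parameter).
    params = 8_300_000_000            # MegatronLM-scale model
    model_gb = params * 4 / 1e9       # ~33 GB
    ngram_tb = 3                      # "several terabytes" for a big n-gram LM
    print(model_gb, ngram_tb * 1000)  # ~33 GB vs ~3000 GB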

------
bitL
We are already past the point of no return. The RTX 8000 is now the
entry-level GPU for training some of the latest NLP models. Attention is
spreading over to computer vision models as well, so one can expect memory
bloat to arrive there quickly. Only large companies that can deploy thousands
of GPUs in parallel will be able to compete.

~~~
latchkey
I am working on it... (well, the company I work for)... except instead of
thousands... it is hundreds of thousands.

------
phkahler
The article seems to focus on reducing the size of existing models through
optimization. Better would be to find ways to train smaller models to start
with. Still interesting.
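
For reference, that post-hoc kind of shrinking looks roughly like this; a
minimal sketch using PyTorch's dynamic quantization (the toy model here is a
placeholder, not anything from the article):

    # Post-hoc shrinking via dynamic quantization: Linear weights are stored
    # as int8 (1 byte each) instead of fp32 (4 bytes each).
    import torch

    model = torch.nn.Sequential(
        torch.nn.Linear(512, 512),
        torch.nn.ReLU(),
        torch.nn.Linear(512, 10),
    )
    quantized = torch.quantization.quantize_dynamic(
        model, {torch.nn.Linear}, dtype=torch.qint8
    )
    print(quantized)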

~~~
rytill
why would that be better?

~~~
gumby
Compressing it means it may take less storage, but not having to look at it in
the first place is the win. It simply takes time to process all the data. Less
data: faster computation.

------
boyadjian
Size matters. If you want an intelligent neural network, you need some watts.
There is nothing astonishing in that. It is also thanks to constant progress
in hardware performance that deep learning has become what it is.

------
cellular
I just hope Hinton finishes his Hinton Network idea that is supposed to
replace these NNs.

------
jgalt212
> I don’t mean to single out this particular project. There are many examples
> of massive models being trained to achieve ever-so-slightly higher accuracy
> on various benchmarks.

Sounds like particle colliders and Big Science in general.

------
RocketSyntax
Deep learning doesn't parallelize well. It would be cool if you could lend the
CPU cycles of your phone or home computer while at work.

~~~
question_away
In what way does it not parallelize well? There are mounds of research in
federated learning.
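
The core federated-averaging step is simple in structure; a minimal sketch,
with all names and shapes being illustrative assumptions:

    # Minimal sketch of federated averaging: each device trains locally and
    # the server just averages the returned per-layer weights.
    import numpy as np

    def federated_average(device_weights):
        """Average the per-layer weights returned by participating devices."""
        return [np.mean(layer_stack, axis=0)
                for layer_stack in zip(*device_weights)]

    # Toy round: 3 devices, each holding a 2-layer model's weights.
    devices = [[np.random.rand(4, 4), np.random.rand(4)] for _ in range(3)]
    new_global_weights = federated_average(devices)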

~~~
RocketSyntax
I've read that you can't split up large layers to be trained on separate
processors, either horizontally (one layer per processor) or vertically (parts
of many layers).

~~~
pheug
On a shared-memory system there's little need to do that - there's much more
parallelism to be had from accelerating fine-grained operations, like the
matrix multiplications that compute each layer's output.

On a distributed system, splitting up layers between machines for distributed
training is pretty much what Google initially designed TensorFlow for.
Generally it scales less well, due to the need to communicate massive amounts
of data between nodes and network throughput that is much lower than what
GPU/TPU memory provides.
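
In PyTorch, for instance, the basic layer-splitting variant is just explicit
device placement; a minimal sketch, assuming two local GPUs:

    # Model parallelism: two halves of a network pinned to different GPUs,
    # with activations copied across the link between them.
    import torch

    class TwoGPUNet(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.first = torch.nn.Linear(1024, 1024).to("cuda:0")
            self.second = torch.nn.Linear(1024, 10).to("cuda:1")

        def forward(self, x):
            h = torch.relu(self.first(x.to("cuda:0")))
            return self.second(h.to("cuda:1"))  # activation crosses devices here

    model = TwoGPUNet()
    out = model(torch.randn(8, 1024))  # requires two CUDA devices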


