
Why is so much memory needed for deep neural networks? - breck
https://www.graphcore.ai/blog/why-is-so-much-memory-needed-for-deep-neural-networks
======
taeric
This would have benefited heavily from a couple of figures showing the
calculations. In particular, it is interesting where they discuss that a
50-layer ResNet needs about 168MB of memory.

This is followed by a paragraph that quickly jumps from that 168MB to 2GB. But
that paragraph is also discussing training. I'm assuming the 168MB was just the
net for execution, but I'm not sure about that. (In particular, this raises a
question I've had about when a GPU really helps with the execution of a
network, rather than just with processing enough data to train it.)

All of that said, really cool article. Any points people think casual readers
are likely to miss?

~~~
inconsistency
The 168MB is for storing all of the weights (parameters of the model) and the
activations from passing one sample through the network. When you are
training, you want to pass a whole minibatch (32 samples in this article)
through the network, so the activations (not the weights) are repeated 32
times - that's where the 2GB figure comes from. You need to keep these
activations around for backprop to work, so you either store them all, or as
the article later suggests, you keep the ones that are expensive to compute
and re-calculate the cheap ones. Also note that when training you need to
store the gradient of every weight too, so this adds 104MB in this example -
almost negligible next to the 2GB of activations.
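
To put numbers on it, here's a back-of-the-envelope sketch in Python (the
per-sample figures follow the article; real frameworks add workspace and other
overhead on top of this):

    # Rough fp32 memory accounting for training a 50-layer ResNet.
    BYTES_FP32 = 4
    n_weights = 26e6                 # ~26M parameters -> ~104MB of weights
    n_activations_per_sample = 16e6  # ~16M activation values -> ~64MB per sample
    minibatch = 32

    weight_mb = n_weights * BYTES_FP32 / 1e6
    activation_mb = n_activations_per_sample * BYTES_FP32 * minibatch / 1e6
    gradient_mb = n_weights * BYTES_FP32 / 1e6   # one gradient per weight

    print(f"weights:     {weight_mb:.0f} MB")
    print(f"activations: {activation_mb:.0f} MB (x{minibatch} for the minibatch)")
    print(f"gradients:   {gradient_mb:.0f} MB")
    print(f"total:       {weight_mb + activation_mb + gradient_mb:.0f} MB")

That prints roughly 104MB of weights, ~2GB of activations and 104MB of
gradients, which is where the article's 2GB figure comes from.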

For running a trained model, it depends how it's being used. If you're
computing one sample at a time, then the 168MB figure is an upper bound on
that (one copy of weights + one set of activations). You could optimize this
by only keeping the activations for the current layer because you don't need
to save the intermediate activations. If, however, you're batch-processing
samples at inference time and want to make the most efficient use of the
compute resources, chances are minibatches are going to be the best route;
again, the same optimization of not keeping intermediate activations applies
here.
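
To illustrate the difference (the per-layer activation sizes below are made up;
the point is that training keeps the sum over all layers times the batch size,
while one-sample-at-a-time inference only needs roughly the current layer's
input and output):

    # Hypothetical per-layer activation sizes in MB; 104MB of weights as above.
    layer_mb = [20, 16, 12, 8, 4, 2, 1, 1]
    weights_mb, batch = 104, 32

    training_mb = weights_mb + sum(layer_mb) * batch   # keep everything
    inference_mb = weights_mb + 2 * max(layer_mb)      # ~ current layer in + out
    print(training_mb, inference_mb)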

As for whether GPUs are useful at inference time, that heavily depends on your
model and available deployment hardware. TensorFlow has some support for 8-bit
quantized models for targeting mobile/embedded/CPU, so there are ways to
optimize for this use case. Even at inference time, though, most of your
compute work is going to be in matrix multiplies or convolutions, so there are
plenty of cases where GPUs can help.
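
As a rough illustration of the memory side of 8-bit quantization, here's a
minimal numpy sketch of affine quantization of a weight tensor (illustrative
only, not TensorFlow's actual quantization pipeline):

    import numpy as np

    weights = np.random.randn(1000, 1000).astype(np.float32)   # ~4MB at fp32

    # Map the fp32 range onto 256 uint8 levels.
    w_min, w_max = weights.min(), weights.max()
    scale = (w_max - w_min) / 255.0
    quantized = np.round((weights - w_min) / scale).astype(np.uint8)  # ~1MB

    # Approximate reconstruction; the error is bounded by ~scale/2 per weight.
    dequantized = quantized.astype(np.float32) * scale + w_min
    print(weights.nbytes, quantized.nbytes)        # 4000000 1000000
    print(np.abs(weights - dequantized).max())

That's a 4x reduction in weight storage (and memory bandwidth), at the cost of
a small per-weight error that in practice tends not to hurt accuracy much.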

~~~
taeric
Apologies, I did not mean to say that the information was not in the article.
It mostly was. (I think the only thing missing is that you are saying that of
that 168MB, 104MB is the weights and 64MB is the activations. That gives 64MB *
32 for the 2GB of memory. Right?)

The difficulty for me is merely in visualizing.

And I should have said that I was not looking for a single answer on whether
GPUs make sense for prediction. A chart with rough guidance on when you would
target specific execution environments is all I'm hunting for. My bias is that
lower-power targets are more common - phones and the like. But I would not be
willing to bet heavily on that.

------
jameshart
Short answer is surely an information theoretic one: if a neural net acquires
knowledge, it needs to put that knowledge somewhere, and the state of the net
is the only place it can go. Sure, training the net forces it to store
that knowledge as efficiently/lossily as it can get away with per the training
criteria, but the fundamental information has to go somewhere. Doesn't it
follow that if you run a DNN with a smaller memory footprint it will be able
to learn less stuff (or its learning will be more sloppy/lossy)?

It sounds like some of this piece argues that implementation choices like SIMD
instructions force particular pieces of DNN state to be constrained to certain
sizes for efficiency. I guess the argument there is that a lot of implemented
state storage has wasted precision that isn't meaningfully useful for the
network to capture its learning. That's reasonable - the bit resolution for a
network weight should be driven by the mathematical precision of the model,
not the practical constraints of hardware. But if you ran those numbers, it's
also perfectly possible (I guess) that you would find your DNN wants 128-bit
weights and on current hardware is actually limited by the precision of the
weights and states we permit it to store - and that given more memory depth it
would be able to learn more.

This seems to start from the premise that we're giving DNNs more memory than
they need. Is that actually a given?

~~~
dharma1
Interestingly, the precision of the weights doesn't seem to affect accuracy
much - it doesn't seem like 128-bit weights would be that useful. At the moment
a lot of work is being done on 8-bit (and lower precision, see XNOR) weights.

------
jacquesm
If this is the sort of thing that keeps you awake at night: Nvidia has a $700
card (the GTX 1080 Ti) with 11GB of RAM and a GPU that is a slightly modified
version of the GP102, which powers the more expensive Titan X.

~~~
deepnotderp
Perhaps more efficient memory use is the answer instead. This article has
mentioned most of the published techniques.

As for the 1080 Ti, a better option might be the P100, which can have up to
16GB.

~~~
gwern
A better option... if someone else is paying for it. It is literally ~10x the
price of a 1080ti ($700 vs a suggested $7000).

~~~
BrailleHunting
The P100 has much better double-precision performance, but its single-precision
perf doesn't justify the cost in most instances.

The DGX-1 (with 8 P100s) is another example of an overpriced, turnkey money
extraction "solution."

Also worth noting that it's hard to get a non-Founders Edition Ti; the FE
carries a hefty premium for the privilege of being first. The regular 1080
non-FE cards are probably the best bang/buck right now.

Folks doing ML/AI/DL, or any HPC really, need to profile their apps to find
bottlenecks and root out resource waste: single-precision math? RAM? algorithm
complexity? storage IOPS? network latency? etc. Throwing money at a problem is
no substitute for having a clue about how to use money, time and electricity
wisely.

~~~
llukas
Compared to what it can replace, the DGX-1 is not overpriced in any way.

Feel free to invest your time and money in building and debugging issues on
your own system. Companies that pay big $$$ to their data scientists have
different incentives than regular users - you do not want to waste your
expensive staff's time chasing hardware/software issues.

~~~
Roritharr
$129,000 for the DGX-1, for anyone interested.

Looked at from that angle, it saves you debugging your way to a workable 8x
1080 Ti setup. Not sure whether being able to swap out the cards is what makes
building them yourself worth it, on top of the cost saving.

~~~
modeless
The P100 has double-rate FP16, which makes it twice as fast, so make that 16x
1080 Tis. Also, those won't fit into one machine, so now you're building a
cluster, and what are you going to use for the interconnect? You won't get
linear scaling, so make that more like 24x 1080 Tis plus extra development
effort for scaling on a cluster, if your problem even scales that way. Now
account for the power usage...

------
drewm1980
"In GPUs the vector paths are typically 1024 bits wide, so GPUs using 32-bit
floating-point data typically parallelise the training data up into a mini-
batch of 32 samples, to create 1024-bit-wide data vectors. This mini-batch
approach to synthesizing vector parallelism multiplies the number of
activations by a factor of 32, growing the local storage requirement to over 2
GB."

I would think about that the other way around... batching your data is an ugly
hack to get around not being able to load it all simultaneously.

~~~
pjreddie
Mini-batches are not an ugly hack. Batch gradient descent is too slow, since
you have to go through the whole dataset for every update, and stochastic
gradient descent has too high variance (plus you can't do cool things like
batch norm). Mini-batches give you stability and speed, the best of both
worlds.
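
If you want to see the trade-off as a single knob, here's a minimal sketch on a
toy least-squares problem (names and sizes made up): batch_size=1 is SGD,
batch_size=len(X) is full-batch gradient descent, and anything in between is a
mini-batch.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1024, 10))
    true_w = rng.normal(size=10)
    y = X @ true_w + 0.01 * rng.normal(size=1024)

    w = np.zeros(10)
    batch_size, lr = 32, 0.1           # the knob being discussed
    for epoch in range(20):
        perm = rng.permutation(len(X))
        for i in range(0, len(X), batch_size):
            idx = perm[i:i + batch_size]
            grad = X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
            w -= lr * grad

    print(np.linalg.norm(w - true_w))  # should be close to 0

Larger batches give a lower-variance gradient per update (and better hardware
utilization, per the article); smaller batches give more updates per pass over
the data.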

~~~
p1esk
A brain learns one example at a time, which probably means gradient descent is
not a very good learning mechanism.

~~~
jacquesm
These are not brains, they are not even models of brains even if they use the
word 'neuron'.

~~~
p1esk
That was kind of my point. The brain learns well using one example at a time.
ANNs don't. Hence my conclusion.

------
femto
> Combining memory and processing resources in a single device has huge
> potential to increase the performance and efficiency of DNNs as well as
> other forms of machine learning systems.

Isn't that describing a Field Programmable Gate Array: memory embedded in a
sea of logic cells?

What's the best resource to get someone who knows FPGAs, a decent amount of
maths and 1990s style neural nets up to speed on today's deep neural networks?

~~~
jacquesm
[http://course.fast.ai/](http://course.fast.ai/)

~~~
femto
Thanks for prompting me to do further research on this. I didn't realise from
the initial posting that this course is based on prerecorded videos and can be
completed asynchronously, to fit in with life. The site says students "need
access to a computer with an Nvidia GPU", but I reckon it could be an
exceedingly interesting exercise to try to do the course with an FPGA in place
of the GPU.

~~~
jacquesm
Alternatively you could use some GPU cloud instances, but that will probably
end up costing you more than buying a low-end GPU you can complete the course
with.

It'd be very interesting to see how you do this with an FPGA; that would be
quite the accomplishment, especially given the monstrous compute capability
that GPUs provide, and the fact that the communications between the high level
(say, Python) and the low level (a GPU or some other co-processor) have all
been worked out and debugged for GPUs.

If there is room for FPGA work in there it would probably be some specialized
function.

Regardless I'd love a write-up if you do this.

~~~
dnautics
If my understanding is correct, FPGAs aren't that fast, and while the really
small ones are cheap, they don't fit that much. I synthesized a 32-bit
multiplier on a Spartan-6 and it took up about 1/5 of the footprint. I figure
maybe you could fit four 16-bit fused multiply-adds (which ought to be able to
do machine learning) onto a Spartan-6, which is not that great.

What is really going to slaughter you for ML performance on FPGAs is memory
bandwidth.

If you have access to a really big FPGA with lots of memory (on the order of
$10-30k...) it might make a difference.

~~~
jacquesm
I'd see more applications on the evaluation side than on the training side,
but even in training there might be a few situations where it could pay off;
you'd still be stuck with a huge communications issue to solve.

------
syntaxing
This question might be a bit off topic, but does anyone know of a calculator
or library (I primarily use TensorFlow and Keras) to calculate the GPU memory
footprint of a network? I only have a 1050 Ti, and I fit my networks in memory
by manually adjusting different parameters. It would be nice to know how much
room I have left to expand my network.

~~~
Q6T46nT668w6i3m
model.summary() in Keras will give you the number of parameters. You can use
this together with the shape of your data to estimate the memory cost.
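
Something like this gives a rough lower bound (a sketch assuming fp32 values
and single-output layers; it ignores gradients, optimizer state and cuDNN
workspace memory, which can be substantial during training):

    import numpy as np

    def estimate_memory_mb(model, batch_size=1, bytes_per_value=4):
        # Parameters (weights) are stored once, regardless of batch size.
        param_bytes = model.count_params() * bytes_per_value
        # Activations: one output tensor per layer, scaled by the batch size.
        activation_values = 0
        for layer in model.layers:
            shape = layer.output_shape          # e.g. (None, 10, 20, 3)
            activation_values += np.prod([d for d in shape if d is not None])
        activation_bytes = activation_values * batch_size * bytes_per_value
        return (param_bytes + activation_bytes) / 1e6

    # e.g. estimate_memory_mb(model, batch_size=32)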

~~~
syntaxing
Thanks! Can you explain how I can calculate the memory cost from the shape of
my data, assuming a batch size of 1 and an input data shape of (10, 20, 3)?

------
radarsat1
> So a transformation called 'lowering' is used to convert those convolutions
> into matrix-matrix multiplications (GEMMs) which GPUs can execute
> efficiently.

This confuses me a bit, I'd love to know the details of this "lowering"? How
is convolution not already a very straight-forward linear algebra operation?

~~~
zebrafish
[https://cs.brown.edu/~sk/Publications/Papers/Published/bck-l...](https://cs.brown.edu/~sk/Publications/Papers/Published/bck-lowering-opt-trans-frp/paper.pdf)

 _FrTime induces construction of the dataflow graph by redefining operations
through an implicit lifting transformation. Lifting takes a function that
operates on constant values and produces a new function that performs the same
operation on time-varying values. Each time the program applies a lifted
function to time-varying arguments, it builds a new node and connects it to
the nodes representing the arguments..._

 _...Unfortunately, this implicit graph construction can be very inefficient.
Every application of a lifted function may create a new dataflow node, whose
construction and maintenance consume significant amounts of time and
space...._

 _...The technique works by collapsing regions of the dataflow graph into
individual nodes. This moves computation from the dataflow model back to
traditional call-by-value, which the runtime system executes much more
efficiently. Because this technique undoes the process of lifting, we call it
lowering._

edit* or maybe?:

[http://physics.gmu.edu/~joe/PHYS428/Topic5.pdf](http://physics.gmu.edu/~joe/PHYS428/Topic5.pdf)

slide 16?

~~~
radarsat1
Interesting. Thanks!

------
dnautics
I didn't see any mention of striding. For convnets, you have to do a very
unwieldy data unroll that either copies your data by (n x n), where n is the
width of your kernel, or requires you to manually stride, which can cause all
sorts of problems with memory bank collisions, paging problems, etc.
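
Roughly, the unroll looks like this; a minimal single-channel numpy sketch of
im2col, which is what the article's 'lowering' of convolutions to GEMM amounts
to (real implementations also handle channels, strides and padding):

    import numpy as np

    def conv2d_via_im2col(x, kernels):
        # x: (H, W) image, kernels: (n_filters, k, k)
        h, w = x.shape
        n_filters, k, _ = kernels.shape
        out_h, out_w = h - k + 1, w - k + 1
        # The unroll: every kxk patch becomes one column, so the input data
        # is copied roughly k*k times - the blow-up mentioned above.
        cols = np.empty((k * k, out_h * out_w), dtype=x.dtype)
        for i in range(out_h):
            for j in range(out_w):
                cols[:, i * out_w + j] = x[i:i + k, j:j + k].ravel()
        # The convolution is now a single matrix-matrix multiply (GEMM).
        out = kernels.reshape(n_filters, k * k) @ cols
        return out.reshape(n_filters, out_h, out_w)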

~~~
jacquesm
Yes, this is a real problem. I can easily tell if my GPU is having striding
issues by looking at the power consumption of the machine. If I manage to get
>40% utilization I consider that pretty good. GPUs are useful but far from
perfect for this application, and the quickest way to sink performance is to
'starve the beast' with sub-optimal memory access patterns.

~~~
deepnotderp
So don't use a GPU, don't suffer from old paradigms :)

