Why is so much memory needed for deep neural networks? (graphcore.ai)



This would have benefited heavily from a couple of figures showing the calculations. In particular, it is interesting where they discuss that a 50-layer ResNet needs about 168 MB of memory.

This is followed by a paragraph that quickly jumps from that 168 MB to 2 GB. But that paragraph is also discussing training. I'm assuming the 168 MB was just the net for inference, but I'm not sure about that. (In particular, this raises a question I've had: when does a GPU really help with running a trained network, as opposed to just churning through enough data to train it?)

All of that said, really cool article. Any points people think casual readers are likely to miss?


The 168MB is for storing all of the weights (parameters of the model) and the activations from passing one sample through the network. When you are training, you want to pass a whole minibatch (32 samples in this article) through the network, so the activations (not the weights) are repeated 32 times - that's where the 2GB figure comes from. You need to keep these activations around for backprop to work, so you either store them all, or as the article later suggests, you keep the ones that are expensive to compute and re-calculate the cheap ones. Also note that when training you need to store the gradient of every weight too, so this adds 104MB in this example - almost negligible next to 2GB of activations.
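
In numbers (a rough sketch; the ~26M parameter count for ResNet-50 and the 64MB-per-sample activation figure are assumptions consistent with the totals above):

  # Back-of-the-envelope memory arithmetic for a 50-layer ResNet in FP32 (4 bytes per value).
  BYTES = 4
  MB, GB = 1e6, 1e9

  num_params = 26e6                      # assumed ~26M parameters for ResNet-50
  weight_mem = num_params * BYTES        # ~104 MB of weights
  act_per_sample = 64 * MB               # ~64 MB of activations for one sample
  batch = 32

  inference = weight_mem + act_per_sample             # weights + one set of activations
  training = 2 * weight_mem + batch * act_per_sample  # weights + gradients + batched activations

  print(f"inference: ~{inference / MB:.0f} MB")       # -> ~168 MB
  print(f"training:  ~{training / GB:.1f} GB")        # -> ~2.3 GB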

For running a trained model, it depends how it's being used. If you're computing one sample at a time, then the 168MB figure is an upper bound on that (one copy of weights + one set of activations). You could optimize this by only keeping the activations for the current layer because you don't need to save the intermediate activations. If however you're batch processing samples at inference time, that's where you want to make most efficient use of the compute resources, so chances are minibatches are going to be the best route; again the same kind of optimization with not keeping intermediate activations applies here.
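
As a minimal sketch of that single-sample optimization (assuming a purely sequential model where each layer only needs the previous layer's output; skip connections would force you to keep a bit more around):

  import numpy as np

  def forward(layers, x):
      # Keep only one activation buffer alive: once a layer has consumed the
      # previous activation it can be freed, so peak activation memory is a
      # single layer's output rather than the sum over all layers.
      act = x
      for layer in layers:
          act = layer(act)
      return act

  # toy usage: three "layers" that are just ReLU(matmul)
  rng = np.random.default_rng(0)
  weights = [rng.standard_normal((64, 64)).astype(np.float32) for _ in range(3)]
  layers = [lambda a, W=W: np.maximum(a @ W, 0) for W in weights]
  y = forward(layers, rng.standard_normal((1, 64)).astype(np.float32))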

As for whether GPUs are useful at inference time, that heavily depends on your model and available deployment hardware. TensorFlow has some support for 8-bit quantized models for targeting mobile/embedded/CPU, so there are ways to optimize for this use case. Even at inference time though, most of your compute work is going to be in matrix multiplies or convolutions so there's plenty of cases where GPUs can help.
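
For instance, post-training 8-bit weight quantization with today's TensorFlow Lite converter looks roughly like this (a sketch with an arbitrary stand-in model, not the exact tooling available when this was written):

  import tensorflow as tf

  # Stand-in for whatever trained tf.keras model you want to deploy.
  model = tf.keras.applications.MobileNetV2(weights=None)

  converter = tf.lite.TFLiteConverter.from_keras_model(model)
  converter.optimizations = [tf.lite.Optimize.DEFAULT]  # quantizes weights to 8 bits
  tflite_model = converter.convert()

  with open("model_quant.tflite", "wb") as f:
      f.write(tflite_model)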


Apologies, I did not mean to say that the information was not in the article. It mostly was. (I think the only thing missing is that you are saying that of that 168MB, 104MB is the weights and 64MB is the activations. This gives 64MB * 32 for the 2GB of memory. Right?)

The difficulty for me is merely in visualizing.

And I should have said that I was not looking for a single answer on whether GPUs make sense for inference. A chart with rough guidance on when you would target specific execution environments is all I'm hunting for. My bias is that lower-power targets are more common. Phones and the like. But I would not be willing to bet heavily on that.


Short answer is surely an information theoretic one: if a neural net acquires knowledge, it needs to put that knowledge somewhere, and the state of the net is the only place it can be going. Sure, training the net forces it to store that knowledge as efficiently/lossily as it can get away with per the training criteria, but the fundamental information has to go somewhere. Doesn't it follow that if you run a DNN with a smaller memory footprint it will be able to learn less stuff (or its learning will be more sloppy/lossy)?

It sounds like some of this piece argues that implementation choices like SIMD instructions are forcing particular pieces of DNN state to be constrained to certain sizes for efficiency. So I guess the argument there is that a lot of implemented state storage has wasted precision that isn't meaningfully useful for the network to capture its learning. That's reasonable - the bit resolution for a network weight should be driven by the mathematical precision the model needs, not the practical constraints of hardware. But if you ran those numbers, it's also perfectly possible (I guess) that you would find your DNN wants 128-bit weights, and that on current hardware it is actually limited by the precision of the weights and states we permit it to store - and that given more memory depth it would be able to learn more.

This seems to start from the premise that we're giving DNNs more memory than they need. Is that actually a given?


Interestingly, the precision of the weights doesn't seem to affect accuracy much - it doesn't seem like 128-bit weights would be that useful. At the moment a lot of work is being done on 8-bit (and lower-precision, see XNOR-Net) weights.


Often most of the memory is for intermediate activations of the network, not the parameters used to calculate those intermediate activations.


If this is the sort of thing that keeps you awake at night: Nvidia has a $700 card (GTX 1080 Ti) with 11 GB of RAM and a GPU that is a slightly modified version of the GP102, which powers the more expensive Titan X.


Perhaps more efficient memory use is the answer instead. This article has mentioned most of the published techniques.

As for the 1080 Ti, a better option might be the P100, which can have up to 16GB.


A better option... if someone else is paying for it. It is literally ~10x the price of a 1080ti ($700 vs a suggested $7000).


The P100 has much better double-precision performance, but its single-precision perf doesn't justify the cost in most instances.

The DGX1 (w 8 P100's) is another example of an overpriced, turnkey money extraction "solution."

Also worth noting it's hard to get a non-Founders Edition Ti; the FE carries a hefty premium for the privilege of being first. The regular non-FE 1080 cards are probably the best bang for the buck right now.

Folks doing ML/AI/DL, or any HPC really, gotta profile their apps to find bottlenecks and root out resource waste: single-precision math? RAM? algorithmic complexity? storage IOPS? network latency? etc. Throwing money at a problem is no substitute for having a clue how to use money, time and electricity wisely.


Compared to what it can replace, the DGX-1 is not overpriced in any way.

Feel free to invest your time and money into building and debugging issues on your own system. Companies which pay big $$$ to their data scientists have different incentives than regular users - you do not want to waste the time of your expensive staff on chasing hardware/software issues.


$129,000 for the DGX-1, for anyone interested.

Saves you debugging your way to a workable 8x 1080 Ti setup, when you look at it from that angle. Not sure if being able to swap out the cards is what makes building it yourself worth it, on top of the cost saving.


The P100 has double-rate FP16, which makes it twice as fast, so make that 16x 1080 Tis. Also, those won't fit into one machine, so now you're building a cluster - and what are you going to use for the interconnect? You won't get linear scaling, so make that more like 24x 1080 Tis, plus extra development effort for scaling on a cluster, if your problem even scales that way. Now account for the power usage...


What interconnect do you have between those 8x 1080 Tis? If your problem doesn't require communication then it may make sense.

If you require non-trivial communication, no PCIe setup will beat NVLink's bandwidth on the DGX-1.


> $129,000 for the DGX-1, for anyone interested.

If you are allowed to buy one. Nvidia doesn't have nearly enough chips to meet demand; the bulk of their production has been earmarked for quite a while ahead, and what little is left over and makes it into the DGX-1 is sold to a very few hand-picked customers.


There is also the unified memory on the DGX-1, which is supposed to make streaming data to/from the GPUs a lot more efficient, as well as enable much larger data sets.


The P100 also has 2x the performance on FP16 and NVLink, which is about 10x faster than PCIe.


> As for the 1080 Ti, a better option might be the P100, which can have up to 16GB.

$4K or so, vs $700?


Most of the major labs I've seen pay six figures easily to their top researchers. Good compute is one of the biggest reasons why researchers would want to work somewhere. Skimping on compute seems foolish at that stage.

Here's a half-joking, half-serious take on the matter: https://twitter.com/deliprao/status/842636496666419200


> Most of the major labs I've seen pay six figures easily to their top researchers.

Even a normal programmer easily makes six figures; I hope they are doing better than that, given the demand in this space.


That was a conservative estimate.


I thought you might have meant monthly


Yeah but is that P100 better than five or six 1080Ti cards?


Because parallelization is so difficult, yes.


If you are using a GPU, parallelisation is not difficult.


I believe you may be missing something. Even on a GPU there are synchronization points in some algorithms and data needs to be passed from one processor to another once you spill over the boundary of what can be held in the RAM.

Also, some parts of an algorithm may not be parallelizable at all.

GPU parallelization is only 'not difficult' if:

  - all your data fits on a single GPU

  - your code is embarrassingly parallel for the total duration of one computation

Bonus if you can use the output of one computation as the input of the next. In all other kinds of computations the usual bottlenecks apply.

https://en.wikipedia.org/wiki/Amdahl%27s_law


I was talking about multi-GPU parallelization. Multi-GPU SGD parallelization is not easy. Single-GPU parallelization is trivial, of course.


My understanding from this article is that to make use of a larger GPU RAM you will need to increase the minibatch size, which has a negative impact on the model's ability to generalize [0].

[0] https://arxiv.org/abs/1609.04836


"In GPUs the vector paths are typically 1024 bits wide, so GPUs using 32-bit floating-point data typically parallelise the training data up into a mini-batch of 32 samples, to create 1024-bit-wide data vectors. This mini-batch approach to synthesizing vector parallelism multiplies the number of activations by a factor of 32, growing the local storage requirement to over 2 GB."

I would think about that the other way around... batching your data is an ugly hack to get around not being able to load it all simultaneously.


Mini-batches are not an ugly hack: batch gradient descent is too slow since you have to go through the whole data set for every step, and stochastic gradient descent is too high-variance (plus you can't do cool things like batch norm). Mini-batches give you stability and speed, the best of both worlds.
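
A toy sketch of the three variants on a least-squares problem (everything here is made up for illustration; only the batch sizes matter):

  import numpy as np

  rng = np.random.default_rng(0)
  X = rng.standard_normal((10_000, 20))
  true_w = rng.standard_normal(20)
  y = X @ true_w + 0.1 * rng.standard_normal(10_000)

  def grad(w, Xb, yb):
      # gradient of mean squared error on the (mini-)batch
      return 2.0 * Xb.T @ (Xb @ w - yb) / len(yb)

  def fit(batch_size, lr=0.01, epochs=5):
      w = np.zeros(20)
      for _ in range(epochs):
          idx = rng.permutation(len(X))
          for start in range(0, len(X), batch_size):
              b = idx[start:start + batch_size]
              w -= lr * grad(w, X[b], y[b])
      return w

  w_gd   = fit(batch_size=len(X))  # full-batch GD: one low-variance but expensive step per epoch
  w_sgd  = fit(batch_size=1)       # pure SGD: many cheap, noisy steps
  w_mini = fit(batch_size=32)      # mini-batch: the usual compromise, and vector-friendly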


drewm1980 is not totally incorrect, though. I don't believe that people began using stochastic (mini-batch) gradient descent because they knew a priori that it had decent properties. I would venture to guess that, initially, people were constrained by memory and computational tractability and only observed after the fact that SGD actually worked and, in fact, often worked better than non-stochastic descent.

I'm not really sure at what point SGD was linked to stochastic approximation and the theoretical explanation of its convergence behavior was pinned down. It feels recent (see work by Léon Bottou, in particular), but I'm certainly no expert in this area.

--- edits: grammar


So interestingly, SGD has a nice intuitive explanation for why it is better than GD.

If you compute the gradient step over all the data, you're expending computational power on redundant data. You're going to get to the minimum having seen less data if you take steps as soon as you get useful information.


A brain learns one example at a time, which probably means gradient descent is not a very good learning mechanism.


These are not brains; they are not even models of brains, even if they use the word 'neuron'.


That was kind of my point. The brain learns well using one example at a time. ANNs don't. Hence my conclusion.


You can use gradient descent one example at a time. It still works just fine. The gradients are more unstable, but you will still converge eventually.


"works just fine" is a relative term. I'm pretty sure when I'm learning a (new) alphabet, I don't need to see 1000 examples of each letter 100 times.


> Combining memory and processing resources in a single device has huge potential to increase the performance and efficiency of DNNs as well as other forms of machine learning systems.

Isn't that describing a Field Programmable Gate Array: memory embedded in a sea of logic cells?

What's the best resource to get someone who knows FPGAs, a decent amount of maths and 1990s style neural nets up to speed on today's deep neural networks?



Thanks for prompting me to do further research on this. I didn't realise from the initial posting that this course is based on prerecorded videos and can be completed asynchronously, to fit in with life. The site says students "need access to a computer with an Nvidia GPU", but I reckon it could be an exceedingly interesting exercise to try to do the course with an FPGA in place of the GPU.


Alternatively you could use some GPU cloud instances, but that will probably end up costing you more than buying a low-end GPU that you can complete the course with.

It'd be very interesting to see how you'd do this with an FPGA; that would be quite the accomplishment, especially given the monstrous compute capability that GPUs provide, and the fact that the communication between the high level (say, Python) and the low level (a GPU or some other co-processor) has already been worked out and debugged for GPUs.

If there is room for FPGA work in there it would probably be some specialized function.

Regardless I'd love a write-up if you do this.


If my understanding is correct, FPGAs aren't that fast, and while the really small ones are cheap, they don't fit that much. I synthesized a 32-bit multiplier on a Spartan-6, and it took up about 1/5 of the footprint. I figure maybe you could fit 4 16-bit fused multiply-adds (which ought to be enough to do machine learning) onto a Spartan-6, which is not that great.

What is really going to slaughter you for ML performance on FPGAs is memory bandwidth.

If you have access to a really big FPGA with lots of memory (on the order of $10-30k...), it might make a difference.


I'd see more applications on the evaluation side than on the training side, but even in training there might be a few situations where it could pay off; you'd still be stuck with a huge communications issue to solve.


http://neuromorphic.eecs.utk.edu/pdfs/2016-IJCNN-Schuman.pdf

> These include SNN implementations on field-programmable gate arrays (FPGAs), one that is trained using a mixture of unsupervised/supervised learning [61] and another that is trained using a GA [62]

> An initial implementation of DANNA on field programmable gate arrays (FPGAs)


> Isn't that describing a Field Programmable Gate Array

Hard to say; my wristwatch has processing and memory, too.


This question might be a bit off-topic, but does anyone know of a calculator or library (I primarily use TensorFlow and Keras) to calculate the GPU memory footprint of a network? I only have a 1050 Ti, and I fit my networks in memory by manually adjusting different parameters. It would be nice to know how much room I have left to expand my network.


model.summary() in Keras will give you the number of parameters. You can use this, together with the shape of your data, to estimate the memory cost.
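
Roughly like this, for example (an illustrative sketch with tf.keras: it counts FP32 parameters plus one output tensor per layer per sample, and ignores gradients, optimizer state and framework workspace, so treat it as a lower bound):

  import numpy as np
  import tensorflow as tf

  def estimate_memory_mb(model, batch_size=1, bytes_per_value=4):
      # weights: one FP32 value per parameter
      param_bytes = model.count_params() * bytes_per_value
      # activations: one output tensor per layer (assumes single-output layers),
      # repeated for every sample in the batch
      act_values = 0
      for layer in model.layers:
          shape = layer.output.shape[1:]  # drop the batch dimension
          act_values += int(np.prod([d for d in shape if d is not None]))
      return (param_bytes + act_values * bytes_per_value * batch_size) / 1e6

  model = tf.keras.applications.ResNet50(weights=None)
  print(estimate_memory_mb(model, batch_size=32))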


Thanks! Can you explain how I can calculate the memory cost from the shape of my data, assuming a batch size of 1 and an input data shape of (10, 20, 3)?


> So a transformation called 'lowering' is used to convert those convolutions into matrix-matrix multiplications (GEMMs) which GPUs can execute efficiently.

This confuses me a bit; I'd love to know the details of this "lowering". How is convolution not already a very straightforward linear algebra operation?


https://cs.brown.edu/~sk/Publications/Papers/Published/bck-l...

FrTime induces construction of the dataflow graph by redefining operations through an implicit lifting transformation. Lifting takes a function that operates on constant values and produces a new function that performs the same operation on time-varying values. Each time the program applies a lifted function to time-varying arguments, it builds a new node and connects it to the nodes representing the arguments...

...Unfortunately, this implicit graph construction can be very inefficient. Every application of a lifted function may create a new dataflow node, whose construction and maintenance consume significant amounts of time and space....

...The technique works by collapsing regions of the dataflow graph into individual nodes. This moves computation from the dataflow model back to traditional call-by-value, which the runtime system executes much more efficiently. Because this technique undoes the process of lifting, we call it lowering.

edit* or maybe?:

http://physics.gmu.edu/~joe/PHYS428/Topic5.pdf

slide 16?


Interesting. Thanks!


I'm not 100% sure, but I think Justin is explaining the lowering transformation here:

https://www.youtube.com/watch?v=dUTzeP_HTZg&t=36m30s


I didn't see any mention of striding. For convnets, you have to do a very unwieldy data unroll that either copies your data by roughly (n x n), where n is the width of your kernel, or you have to stride manually, which can cause all sorts of problems with memory bank collisions, paging problems, etc.
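
That unroll is usually called im2col; a minimal NumPy sketch for a single-channel input with stride 1, which makes the roughly n x n data duplication explicit:

  import numpy as np

  def im2col(x, k):
      # Unroll each k x k patch of a 2-D input into a row, so the convolution
      # becomes a single matrix multiply. Each input pixel ends up copied into
      # up to k*k patches - that's the memory blow-up being described above.
      H, W = x.shape
      out_h, out_w = H - k + 1, W - k + 1
      cols = np.empty((out_h * out_w, k * k), dtype=x.dtype)
      for i in range(out_h):
          for j in range(out_w):
              cols[i * out_w + j] = x[i:i + k, j:j + k].ravel()
      return cols

  x = np.arange(36, dtype=np.float32).reshape(6, 6)
  kernel = np.ones((3, 3), dtype=np.float32) / 9.0  # a 3x3 box filter
  cols = im2col(x, 3)                               # shape (16, 9): ~9x the input data
  y = (cols @ kernel.ravel()).reshape(4, 4)         # equivalent to a 'valid' convolution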


Yes, this is a real problem. I can easily tell if my GPU is having striding issues by looking at the power consumption of the machine. If I manage to get > 40% utilization I consider that pretty good. GPUs are useful but far from perfect for this application and the quickest way to sink performance is to 'starve the beast' by having sub-optimal memory access patterns.


So don't use a GPU, don't suffer from old paradigms :)


Couldn't you organize the input data in a "pre-strided" layout? Not sure if that makes any sense.. But I mean, re-arrange the pixels so that consecutive pixels per stride are consecutive in memory.


> For convnets, you have to do a very unwieldy data unroll that either copies your data by (nxn) where n is the width of your kernel

Justin does a WAY better job of explaining it than I do

https://www.youtube.com/watch?v=dUTzeP_HTZg&t=36m30s



