This is followed by a paragraph that quickly jumps from that 168M to 2G. But that paragraph is also discussing training. I'm assuming the 168M was just the network for execution, but I'm not sure about that. (In particular, this raises a question I've had: when does a GPU really help with executing a network, as opposed to just churning through enough data to train it?)
All of that said, really cool article. Any points people think casual readers are likely to miss?
For running a trained model, it depends how it's being used. If you're computing one sample at a time, then the 168MB figure is an upper bound (one copy of the weights plus one set of activations). You could lower that by keeping only the activations for the current layer, since you don't need to save the intermediate activations. If you're batch-processing samples at inference time, that's where you want to make the most efficient use of the compute resources, so chances are minibatches are the best route; the same optimization of not keeping intermediate activations applies there too.
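A minimal sketch of that "only keep the current layer's activations" point, assuming a plain MLP with made-up layer sizes (nothing here is from the article):

```python
# Sketch: running an MLP one sample at a time while holding only the
# current layer's activation in memory. Peak activation memory is one
# layer's output, because we overwrite the previous activation at every
# step instead of keeping them all (which training needs, inference doesn't).
import numpy as np

def run_forward(weights, biases, x):
    act = x
    for W, b in zip(weights, biases):
        act = np.maximum(W @ act + b, 0.0)  # ReLU layer; old activation is dropped
    return act

# Toy 3-layer network with arbitrary sizes
rng = np.random.default_rng(0)
sizes = [1024, 512, 256, 10]
weights = [rng.standard_normal((m, n)) * 0.01 for n, m in zip(sizes, sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]
print(run_forward(weights, biases, rng.standard_normal(1024)).shape)  # (10,)
```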
As for whether GPUs are useful at inference time, that heavily depends on your model and available deployment hardware. TensorFlow has some support for 8-bit quantized models for targeting mobile/embedded/CPU, so there are ways to optimize for this use case. Even at inference time though, most of your compute work is going to be in matrix multiplies or convolutions so there's plenty of cases where GPUs can help.
The difficulty for me is merely in visualizing.
And I should have said that I was not looking for a single answer on whether GPUs make sense in predicting. A chart with rough guidance on when you will target specific execution environments is all I'm hunting for. My bias is that lower power targets are more common. Phones and the like. But, I would not be willing to bet heavily on that.
It sounds like some of this piece argues that implementation choices like SIMD instructions force particular pieces of DNN state into certain sizes for efficiency. The argument, I guess, is that a lot of the stored state carries wasted precision that isn't meaningfully useful for the network to capture its learning. That's reasonable: the bit resolution of a network weight should be driven by the mathematical precision the model needs, not by the practical constraints of the hardware. But if you ran those numbers, it's also perfectly possible (I guess) that you'd find your DNN wants 128-bit weights, and that on current hardware it is actually limited by the precision of the weights and states we permit it to store; given more memory depth it might be able to learn more.
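One way to poke at that question empirically is a toy sweep like the one below (hand-rolled with numpy, not anything from the article; sizes and bit widths are arbitrary): quantize the same weights at several bit widths and watch where the output error stops changing.

```python
# Toy experiment: quantize a random weight matrix to b bits and measure the
# effect on a layer's output. Where the curve flattens, the extra precision
# is arguably wasted; if it never flattens, the model really wants more bits.
import numpy as np

rng = np.random.default_rng(1)
W = rng.standard_normal((512, 512))
x = rng.standard_normal(512)
ref = W @ x

for bits in (4, 8, 16, 32):
    levels = 2 ** (bits - 1) - 1            # symmetric signed range
    scale = np.abs(W).max() / levels
    W_q = np.round(W / scale) * scale        # quantize, then dequantize
    err = np.abs(W_q @ x - ref).max() / np.abs(ref).max()
    print(f"{bits:2d}-bit weights -> relative output error {err:.2e}")
```

Of course this says nothing about whether training itself would exploit extra precision, which is the harder half of the question.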
This seems to start from the premise that we're giving DNNs more memory than they need. Is that actually a given?
As for the 1080 Ti, a better option might be the P100, which can have up to 16GB.
The DGX1 (w 8 P100's) is another example of an overpriced, turnkey money extraction "solution."
Also worth noting it's hard to get a non-Founders Edition Ti; the FE carries a hefty premium for the privilege of being first. The regular non-FE 1080 cards are probably the best bang/buck right now.
Folks doing ML/AI/DL, or any HPC really, gotta profile their apps to find bottlenecks and root out resource waste: single-precision math? RAM? algorithm complexity? storage IOPS? network latency? etc. Throwing money at a problem is no substitute for having a clue how to use money, time and electricity wisely.
Feel free to invest your time and money into building and debugging issues on your own system. Companies which pay big $$$ to their data scientists have different incentives than regular users - you do not want to waste the time of your expensive staff on chasing hardware/software issues.
Looked at from that angle, it saves you debugging your way to a workable 8x1080 Ti setup. Not sure if being able to swap out the cards is what makes building them yourself worth it, in addition to the cost saving.
If you require non-trivial communication, no PCIe setup will beat the NVLink bandwidth on a DGX-1.
If you are allowed to buy one. Nvidia doesn't have nearly enough chips to meet demand; the bulk of their production has been earmarked for quite a while ahead, and what little is left over and makes it into the DGX-1 is sold to a very few hand-picked customers.
$4K or so, vs $700?
Here's a half-joking, half-serious take on the matter: https://twitter.com/deliprao/status/842636496666419200
Even a normal programmer easily makes six figures; I hope they're doing better than that given the demand in this space.
Also, some parts of an algorithm may not be parallelizable at all.
GPU parallelization is only 'not difficult' if:
- all your data fits on a single GPU
- your code is embarrassingly parallel
- for the total duration of one computation
I would think about that the other way around... batching your data is an ugly hack to get around not being able to load it all simultaneously.
I'm not really sure at what point SGD was linked to stochastic approximation and the theoretical explanation of its convergence behavior was really pinned down. It feels recent (see work by Leon Bottou, in particular), but I'm certainly no expert in this area.
If you compute the gradient step over all the data, you're expending computational power on redundant data. You'll reach the minimum having touched less data if you take steps as soon as you get useful information.
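A toy illustration of that point (a hand-rolled least-squares example, not from the thread; the batch size, learning rate and dimensions are made up): on a deliberately redundant dataset, minibatch steps reach roughly the same loss as full-batch gradient descent while touching a tiny fraction of the rows.

```python
# Full-batch gradient descent vs. minibatch SGD on a redundant least-squares
# problem (each sample repeated 50 times), to show SGD extracting the same
# information from far less data per step.
import numpy as np

rng = np.random.default_rng(0)
w_true = rng.standard_normal(10)
X = np.tile(rng.standard_normal((100, 10)), (50, 1))  # 5000 rows, 100 unique
y = X @ w_true
loss = lambda w: np.mean((X @ w - y) ** 2)
lr = 0.05

# Full-batch: every one of the 20 steps looks at all 5000 rows.
w = np.zeros(10)
for _ in range(20):
    w -= lr * 2 * X.T @ (X @ w - y) / len(X)
print("full-batch, 20 x 5000 rows:", loss(w))

# Minibatch SGD: 20 steps of 64 rows each, ~1/80th of the data touched.
w = np.zeros(10)
for _ in range(20):
    idx = rng.integers(0, len(X), size=64)
    Xb, yb = X[idx], y[idx]
    w -= lr * 2 * Xb.T @ (Xb @ w - yb) / len(Xb)
print("minibatch SGD, 20 x 64 rows:", loss(w))
```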
Isn't that describing a Field Programmable Gate Array: memory embedded in a sea of logic cells?
What's the best resource to get someone who knows FPGAs, a decent amount of maths and 1990s style neural nets up to speed on today's deep neural networks?
It'd be very interesting to see how you do this with an FPGA; that would be quite the accomplishment, especially given the monstrous compute capability GPUs provide, and the fact that the communication between the high level (say, Python) and the low level (a GPU or some other co-processor) has already been worked out and debugged for GPUs.
If there is room for FPGA work in there it would probably be some specialized function.
Regardless I'd love a write-up if you do this.
What is really going to slaughter you for ML performance on FPGAs is memory bandwidth.
If you have access to a really big FPGA with lots of memory (on the order of $10-30k...), it might make a difference.
These include SNN implementations on field-programmable gate arrays (FPGAs), one that is trained using a mixture of unsupervised/supervised learning and another that is trained using a GA ... implementation of DANNA on field programmable gate arrays
Hard to say; my wristwatch has processing and memory, too.
This confuses me a bit; I'd love to know the details of this "lowering". How is convolution not already a very straightforward linear algebra operation?
FrTime induces construction of the dataflow graph by redefining operations through an implicit lifting transformation. Lifting takes a function that operates on constant values and produces a new function that performs the same operation on time-varying values. Each time the program applies a lifted function to time-varying arguments, it builds a new node and connects it to the nodes representing the arguments.

...Unfortunately, this implicit graph construction can be very inefficient. Every application of a lifted function may create a new dataflow node, whose construction and maintenance consume significant amounts of time and space....

...The technique works by collapsing regions of the dataflow graph into individual nodes. This moves computation from the dataflow model back to traditional call-by-value, which the runtime system executes much more efficiently. Because this technique undoes the process of lifting, we call it lowering.
edit* or maybe?:
Justin does a WAY better job of explaining it than I do
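For what it's worth, here's a tiny sketch of what the "lifting" in that quote looks like in spirit: a function over plain values gets wrapped so that, applied to time-varying signals, it builds a dataflow node instead of computing directly. (This is an illustrative Python toy, not FrTime's actual implementation; the Signal/lift names are made up.)

```python
# Toy dataflow "lifting": wrap a function on constants so that, applied to
# time-varying signals, it builds a graph node that recomputes on demand.
class Signal:
    def __init__(self, value=None, compute=None, inputs=()):
        self.value = value          # current value
        self.compute = compute      # how to recompute from inputs
        self.inputs = inputs        # upstream dataflow nodes

    def update(self):
        if self.compute is not None:
            for s in self.inputs:
                s.update()
            self.value = self.compute(*(s.value for s in self.inputs))
        return self.value

def lift(f):
    """Turn a function on constant values into one on Signals.
    Each application creates a new dataflow node -- the per-application
    overhead the quoted passage complains about."""
    def lifted(*signals):
        return Signal(compute=f, inputs=signals)
    return lifted

add = lift(lambda a, b: a + b)
mul = lift(lambda a, b: a * b)

x = Signal(2)
y = Signal(3)
z = add(mul(x, x), y)      # builds a small graph: (x*x) + y
print(z.update())          # 7
x.value = 5                # the "time-varying" input changes
print(z.update())          # 28
```

Lowering, in the quoted sense, would collapse the add/mul nodes back into a single ordinary call-by-value function, so you pay for one node instead of one per operation.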