+ P2 (K80) with single GPU: ~95 seconds per epoch
+ P3 (V100) with single GPU: ~20 seconds per epoch
Admittedly this isn't exactly fair to either GPU - the K80 cards are straight up ancient now and the Volta isn't sitting at 100% GPU utilization as it burns through the data too quickly (CUDA kernel launch and Python overheads suddenly become major bottlenecks).
It still gives you an indication of what a leap this is if you're using GPUs on AWS, however.
Oh, and the V100 comes with 16GB of (faster) RAM compared to the K80's 12GB of RAM, so you win there too.
For anyone using the standard set of frameworks (TensorFlow, Keras, PyTorch, Chainer, MXNet, DyNet, DeepLearning4j, ...) this type of speed-up will likely require you to do nothing - except throw more money at the P3 instance :)
If you really want to get into the black magic of speed-ups, these cards also feature full FP16 support, which means you can double your TFLOPS by dropping to FP16 from FP32. You'll run into a million problems during training due to the lower precision but these aren't insurmountable and may well be worth the pain for the additional speed-up / better RAM usage.
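At its simplest the switch is just a couple of casts - a minimal PyTorch sketch (PyTorch only because that's what my numbers above use); the "million problems" mostly come from gradients underflowing FP16's narrow range during training:

```python
import torch
import torch.nn as nn

model = nn.Linear(4096, 4096).cuda().half()  # cast the weights to FP16
x = torch.randn(64, 4096).cuda().half()      # inputs must be FP16 too
y = model(x)                                 # this matmul now runs at FP16 throughput
```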
- Good overview of Volta's advantages compared to even the recent P100: https://devblogs.nvidia.com/parallelforall/inside-volta/
- Simple table comparing V100 / P100 / K40 / M40: https://www.anandtech.com/show/11367/nvidia-volta-unveiled-g...
- NVIDIA's V100 GPU architecture white paper: http://www.nvidia.com/object/volta-architecture-whitepaper.h...
- The numbers above were using my PyTorch code at https://github.com/salesforce/awd-lstm-lm and the Quasi-Recurrent Neural Network (QRNN) at https://github.com/salesforce/pytorch-qrnn which features a custom CUDA kernel for speed
Some of the slowdowns now just seem silly and aren't even listed in the per epoch timings: PyTorch doesn't have an asynchronous torch.save(). This means that if you save your model after each epoch, and the model save takes a few seconds, you're increasing your per epoch timings by 5-10% just by saving the damn thing!
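In the meantime a workaround is easy enough to hack up yourself - a sketch (not a PyTorch API; torch.save itself remains synchronous) that snapshots the weights on the CPU and pushes the disk write to a background thread:

```python
import threading
import torch

def save_in_background(model, path):
    # Snapshot the parameters on the CPU (fast) so training can keep mutating
    # the GPU copies, then let a background thread pay the serialization cost.
    state = {k: v.cpu().clone() for k, v in model.state_dict().items()}
    thread = threading.Thread(target=torch.save, args=(state, path))
    thread.start()
    return thread  # join() before the next save or at exit
```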
Regarding FP16, PyTorch supports it, and there's even a pull request that updates the examples repo with FP16 support for language modeling and ImageNet. It's not likely to be merged as it greatly complicates a codebase that's meant primarily for teaching purposes, but it's lovely to look at. I also think many of the FP16 issues will get a general wrapper and become far more transparent to the end user. For the most part they're all outlined in NVIDIA / Baidu's "Mixed Precision Training" paper. It might be useful for DeepLearning4j to go through the most common high-throughput use cases and get them running (just as an example of how to work around the issues, really) if customers are using P100s/V100s?
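The core trick from that paper - an FP32 "master" copy of the weights plus loss scaling - is short enough to sketch (assuming a model already cast to .half() as above; a static scale is used here, though the paper also covers dynamic scaling):

```python
import torch
import torch.nn as nn

model = nn.Linear(4096, 4096).cuda().half()  # FP16 model for fast forward/backward

# FP32 master copy of the weights; the optimizer updates these, not the FP16
# parameters, so tiny updates aren't rounded away.
master = [p.detach().clone().float().requires_grad_() for p in model.parameters()]
optimizer = torch.optim.SGD(master, lr=0.1)

scale = 128.0  # static loss scale

def fp16_step(loss):
    optimizer.zero_grad()
    (loss * scale).backward()                 # scale up so small grads survive FP16
    for p16, p32 in zip(model.parameters(), master):
        p32.grad = p16.grad.float() / scale   # unscale in FP32
        p16.grad = None
    optimizer.step()
    for p16, p32 in zip(model.parameters(), master):
        p16.data.copy_(p32.data)              # cast updated master weights back to FP16
```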
I'm really interested in exploring the FP16 aspect as the QRNN, especially on a single GPU, is sitting at basically 100% utilization, with almost all the time spent on matrix multiplications. FP16 is about the only way to speed it up at that stage. This gets a tad more complicated as the CUDA kernel is not written in FP16 (and converting it isn't easy), but even FP16->FP32->(QRNN element-wise CUDA kernel)->FP16 ("pseudo" FP16) should still be a crazy speedup. I tested that on the P100 and it took AWD-QRNN from ~28 seconds per epoch to ~18.
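That round-trip is trivial to express; in the sketch below, qrnn_elementwise_kernel is a hypothetical stand-in for the FP32-only custom CUDA op:

```python
import torch

def pseudo_fp16(x_fp16):
    # The expensive matmuls before/after this op stay in FP16; only the
    # element-wise kernel pays for an FP32 round-trip, which is cheap by
    # comparison.
    x_fp32 = x_fp16.float()                     # FP16 -> FP32
    out_fp32 = qrnn_elementwise_kernel(x_fp32)  # hypothetical FP32-only CUDA op
    return out_fp32.half()                      # FP32 -> FP16
```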
- PyTorch async save issue: https://github.com/pytorch/pytorch/issues/1567
- PyTorch FP16 examples pull request: https://github.com/pytorch/examples/pull/203
- "Mixed Precision Training": https://arxiv.org/abs/1710.03740
P.S. with that memory speed, it can probably run 300-400 MH/s on ETH.
The P100 instances on Softlayer would cost around $2,000/mo, and would generate approximately $170/mo in ETH when fully optimized. One could probably build a DIY rig with the same hashing power for less than 2k total.
p3.2xlarge: 8 vCPUs, 61 GB RAM, $3.06/hr
p3.8xlarge: 32 vCPUs, 244 GB RAM, $12.24/hr
p3.16xlarge: 64 vCPUs, 488 GB RAM, $24.48/hr
Think of us as the DigitalOcean for GPUs with simple, transparent pricing and effortless setup & configuration:
AWS: $3.06/hr V100*
Paperspace: $2.30/hr or $980/month for dedicated (effective hourly is only ~$1.3/hr)
Learn more here: https://www.paperspace.com/pricing
[Disclosure: I am one of the founders]
Getting the data into and out of compute services is the most difficult part financially, at least in my experience.
You should never forget that this is entirely because of compute services ripping you off, not because they're providing a valuable service in return for the transfer pricing.
Even their "direct connect" services cost more than my transit does.
Both the interface and GPU prices are fantastic.
Keep up the good work!
I'm looking for a way to run serverless (Amazon Lambda style) GPU operations (preferably using OpenCL). Are there any plans for such a service in your platform?
(I'm an engineer on Google Compute Engine with a deep interest in customer networking use stories, particularly heavy utilization customers, even if they're not my customers :)
All our prebuilt binaries have been built with CUDA 8 and cuDNN 6.
We anticipate releasing TensorFlow 1.5 with CUDA 9 and cuDNN 7.
Testing the new Tesla V100 on AWS. Fine-tuning VGG on the DeepSent dataset for 10 epochs.
GRID K520 (4GB) (baseline):
* 780s/epoch @ minibatch 8 (GPU saturated)
Tesla V100 (16GB):
* 30s/epoch @ minibatch 8 (GPU not saturated)
* 6s/epoch @ minibatch 32 (GPU more saturated)
* 6s/epoch @ minibatch 256 (GPU saturated)
Yes, the support should already be there for both frameworks.
These look very good for half-precision training.
I'm sure this will change with demand, though. :(
I can easily see people paying full price for that. Still, spot price is currently $2.40.
You’d still build your own for that money, I think, but it’s an interesting datapoint.
Still nice if you quickly need to get some model results though.
Cryptocurrencies are the invisible robot hand of the market. (Which is, I think, not a claim about whether they're good, but certainly a claim about whether they are to be feared. If you squint hard enough, the giant Bitcoin mines in China are the work of an unfriendly AI employing people to make paperclips.)
For the P3 (Volta V100) instances you'll want to ensure you use an AMI preloaded with CUDA 9, though not all DL frameworks are happy with that yet.
CUDA 8 programs will run, but terribly slowly as they JIT their GPU code without optimization for Volta. You want the CUDA 9 AMI version (https://aws.amazon.com/marketplace/pp/B076TGJHY1?qid=1509090...), but it currently only has MXNet and TF.
If you need other frameworks there's the NVIDIA AMI (https://aws.amazon.com/marketplace/pp/B076K31M1S?qid=1509090...) and Volta optimized containers for NVCaffe, Caffe2, CNTK, Digits, MXNet, PyTorch, TensorFlow, Theano, Torch, CUDA 9/CuDNN7/NCCL.
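If you're not sure which CUDA runtime a given AMI actually gives you, here's a quick sanity check from Python (a sketch using ctypes; it assumes libcudart is on the loader path, so adjust the library name for your setup):

```python
import ctypes

cudart = ctypes.CDLL("libcudart.so")  # e.g. libcudart.so.9.0 on the CUDA 9 AMIs
version = ctypes.c_int()
assert cudart.cudaRuntimeGetVersion(ctypes.byref(version)) == 0
major, minor = version.value // 1000, (version.value % 1000) // 10
print("CUDA runtime %d.%d" % (major, minor))
# Below 9.x on a V100 means your kernels get JIT-compiled for Volta
# without Volta-specific optimization.
```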
GPUs just work very well when you have a lot of data and you are able to run the operations on the data set in parallel. Machine learning seems to fit this model quite well, which is why you see many GPUs used in this field. Other things that take advantage of parallelism would be graphics and cryptocurrency mining.
ML might be a bit of a moving target though.
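For intuition on that parallelism point, here's a toy comparison (a sketch in PyTorch; absolute timings will vary wildly by hardware):

```python
import time
import torch

x = torch.randn(4096, 4096)

start = time.time()
y = x @ x                    # one big matmul on the CPU
print("CPU: %.3fs" % (time.time() - start))

x = x.cuda()
x @ x                        # warm-up: the first CUDA call pays init costs
torch.cuda.synchronize()
start = time.time()
y = x @ x                    # the same matmul across thousands of GPU threads
torch.cuda.synchronize()     # kernel launches are async; wait for completion
print("GPU: %.3fs" % (time.time() - start))
```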