
Introducing Amazon EC2 P3 Instances - aseidl
https://aws.amazon.com/about-aws/whats-new/2017/10/introducing-amazon-ec2-p3-instances/
======
Smerity
The P3 instances are the first widely and easily accessible machines that use
the NVIDIA Tesla V100 GPUs. These GPUs are straight up scary in terms of
firepower. To give an understanding of the speed-up compared to the P2
instances for a research project of mine:

\+ P2 (K80) with single GPU: ~95 seconds per epoch

\+ P3 (V100) with single GPU: ~20 seconds per epoch

Admittedly this isn't exactly fair for either GPU - the K80 cards are straight
up ancient now and the Volta isn't sitting at 100% GPU utilization as it burns
through the data too quickly ([CUDA kernel, Python] overheads suddenly become
major bottlenecks). It still gives you an indication of what a leap this is if
you're using GPUs on AWS, however. Oh, and the V100 comes with 16GB of (faster)
RAM compared to the K80's 12GB of RAM, so you win there too.

For anyone using the standard set of frameworks (Tensorflow, Keras, PyTorch,
Chainer, MXNet, DyNet, DeepLearning4j, ...) this type of speed-up will likely
require you to do nothing - except throw more money at the P3 instance :)

If you really want to get into the black magic of speed-ups, these cards also
feature full FP16 support, which means you can double your TFLOPS by dropping
to FP16 from FP32. You'll run into a million problems during training due to
the lower precision but these aren't insurmountable and may well be worth the
pain for the additional speed-up / better RAM usage.
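
To make that concrete, here's a minimal sketch of the usual FP16 recipe in
PyTorch (not the code from the repos linked below): run the forward/backward
pass in FP16, keep an FP32 "master" copy of the weights for the optimizer
update, and scale the loss so small gradients don't underflow. The toy model
and loss scale are made up for illustration.

```python
import torch
import torch.nn as nn

# Hypothetical toy model; the recipe applies to any nn.Module.
model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 10)).cuda()

# FP32 "master" copy of the weights for the optimizer step...
master_params = [p.detach().clone().float().requires_grad_(True)
                 for p in model.parameters()]
model.half()  # ...while the working model runs in FP16

optimizer = torch.optim.SGD(master_params, lr=0.1)
criterion = nn.CrossEntropyLoss()
loss_scale = 128.0  # scale the loss so tiny FP16 gradients don't underflow

x = torch.randn(32, 1024).cuda().half()
y = torch.randint(0, 10, (32,)).cuda()

model.zero_grad()
loss = criterion(model(x), y)
(loss * loss_scale).backward()

# Unscale the FP16 gradients into the FP32 master weights and step.
for master, p in zip(master_params, model.parameters()):
    master.grad = p.grad.float() / loss_scale
optimizer.step()

# Copy the updated FP32 weights back into the FP16 model for the next iteration.
with torch.no_grad():
    for master, p in zip(master_params, model.parameters()):
        p.copy_(master)
```

The tensor cores only kick in for the FP16 matrix multiplies themselves (and
which ops hit them depends on the cuBLAS/cuDNN versions on the instance), so
the actual speed-up varies by model.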

\- Good overview of Volta's advantages compared to even the recent P100:
[https://devblogs.nvidia.com/parallelforall/inside-
volta/](https://devblogs.nvidia.com/parallelforall/inside-volta/)

\- Simple table comparing V100 / P100 / K40 / M40:
[https://www.anandtech.com/show/11367/nvidia-volta-
unveiled-g...](https://www.anandtech.com/show/11367/nvidia-volta-unveiled-
gv100-gpu-and-tesla-v100-accelerator-announced)

\- NVIDIA's V100 GPU architecture white paper:
[http://www.nvidia.com/object/volta-architecture-
whitepaper.h...](http://www.nvidia.com/object/volta-architecture-
whitepaper.html)

\- The numbers above were using my PyTorch code at
[https://github.com/salesforce/awd-lstm-lm](https://github.com/salesforce/awd-
lstm-lm) and the Quasi-Recurrent Neural Network (QRNN) at
[https://github.com/salesforce/pytorch-
qrnn](https://github.com/salesforce/pytorch-qrnn) which features a custom CUDA
kernel for speed

~~~
Beltiras
I can't easily find pricing information on the P3 instances. Have you come
across a simple table with the prices?

~~~
joelhaasnoot
Unfortunately, P3 isn't listed yet, but this is my go-to site for EC2 pricing:
[http://www.ec2instances.info/](http://www.ec2instances.info/)

~~~
joelhaasnoot
Oh, it is now listed. Be sure to click on "Columns" and add "GPU" to see the
different options

------
DTE
Hi guys, Dillon here from Paperspace
([https://www.paperspace.com](https://www.paperspace.com)). We are a cloud
that specializes in GPU infrastructure and software. We launched V100
instances a few days ago in our NY and CA regions and it's much less expensive
than AWS.

Think of us as the DigitalOcean for GPUs, with simple, transparent pricing
and effortless setup & configuration:

AWS: $3.06/hr V100*

Paperspace: $2.30 /hr or $980/month for dedicated (effective hourly is only
$1.3/hr)

Learn more here:
[https://www.paperspace.com/pricing](https://www.paperspace.com/pricing)

[Disclosure: I am one of the founders]

~~~
Veratyr
Your pricing page notably omits transfer pricing. Do you have free bandwidth
between yourself and AWS/GCP/Azure or do you peer at any major exchanges?

Getting the data into and out of compute services is the most difficult part
financially, at least in my experience.

~~~
dkobran
Dan here (also Paperspace team). Totally agree that transfer costs are a
significant pain point, which is why we don't charge for it. We can peer with
other providers (e.g. with AWS we can leverage Direct Connect directly from our
datacenters) but most of our customers don't implement this unless they're
moving major traffic.

~~~
Veratyr
That's a good start but do you have a partnership with anyone that can provide
storage with free/low cost bandwidth to your service? Even Direct Connect is
ridiculously expensive compared to transit.

------
plantain
But where are the C5 instances? It's been 11 months since Amazon announced
Skylake C5's and we're still waiting!

[https://aws.amazon.com/about-aws/whats-new/2016/11/coming-
so...](https://aws.amazon.com/about-aws/whats-new/2016/11/coming-soon-amazon-
ec2-c5-instances-the-next-generation-of-compute-optimized-instances/)

~~~
STRML
Waiting for them as well. Most of all, we really need fast-CPU instances with
the ENA, not the Intel NIC.

~~~
jsolson
Out of professional curiosity, what are you looking for from ENA?

(I'm an engineer on Google Compute Engine with a deep interest in customer
networking use stories, particularly heavy utilization customers, even if
they're not _my_ customers :)

------
jeffbarr
More details in my blog post at [https://aws.amazon.com/blogs/aws/new-amazon-
ec2-instances-wi...](https://aws.amazon.com/blogs/aws/new-amazon-
ec2-instances-with-up-to-8-nvidia-tesla-v100-gpus-p3/)

~~~
SloopJon
This post states, "In order to take full advantage of the NVIDIA Tesla V100
GPUs and the Tensor cores, you will need to use CUDA 9 and cuDNN7." What
version of TensorFlow does it use? From what I can tell, TensorFlow doesn't
fully support the latest versions yet.

~~~
sipherhex
Chris from NV here. You can also get a full complement of DL framework
containers, as well as a CUDA 9/cuDNN 7/NCCL 2 base container, optimized for
Volta by NVIDIA via this AMI
[https://aws.amazon.com/marketplace/pp/B076K31M1S?qid=1509089...](https://aws.amazon.com/marketplace/pp/B076K31M1S?qid=1509089922554)

~~~
dharma1
Are there any frameworks yet supporting the advertised 120 TFLOPS mixed-
precision training?

------
eggie5
Here are my results:

Testing the new Tesla V100 on AWS. Fine-tuning VGG on the DeepSent dataset for 10
epochs.

GRID K520 (4GB) (baseline):

* 780s/epoch @ minibatch 8 (GPU saturated)

V100 (16GB):

* 30s/epoch @ minibatch 8 (GPU not saturated)

* 6s/epoch @ minibatch 32 (GPU more saturated)

* 6s/epoch @ minibatch 256 (GPU saturated)
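
The saturation effect is easy to see by timing a fixed number of training
steps at different minibatch sizes. A rough PyTorch sketch (torchvision's
VGG16 and random data are stand-ins here, not the actual DeepSent setup):

```python
import time
import torch
import torch.nn as nn
import torchvision.models as models

model = models.vgg16().cuda()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

def time_steps(batch_size, steps=50):
    """Time `steps` training iterations at a given minibatch size."""
    x = torch.randn(batch_size, 3, 224, 224).cuda()
    y = torch.randint(0, 1000, (batch_size,)).cuda()
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(steps):
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
    torch.cuda.synchronize()  # GPU work is asynchronous; sync before reading the clock
    return time.time() - start

# Larger batches amortise kernel launch / Python overhead until the GPU
# saturates (or you run out of the 16GB of memory).
for bs in (8, 32, 64):
    print(bs, round(time_steps(bs), 1), "s")
```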

~~~
dharma1
Thanks! Curious how this would scale on the 8x or 16x instances

~~~
eggie5
What do you mean? 8 or 16 GPUs? That'd require changing the code to use
distributed TensorFlow...

~~~
dharma1
Yes exactly. The instances with 8 or 16 GPUs. Does the training time reduce
linearly, is the GPU utilisation 100%, is it plug and play with TF?
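
(On the PyTorch side at least, single-node data parallelism across the GPUs in
one instance is close to plug and play; a minimal sketch with a made-up model,
for comparison:)

```python
import torch
import torch.nn as nn

# Hypothetical model; DataParallel splits each minibatch across the visible GPUs,
# runs the replicas in parallel, and gathers the outputs on the default device.
model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 10))
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)
model = model.cuda()

x = torch.randn(256, 1024).cuda()  # the 256-sample batch is chunked per GPU
out = model(x)
print(out.shape)
```

Scaling is rarely perfectly linear because of the replication/gather overhead,
and small per-GPU batches can leave each card under-utilised.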

------
mamon
Slightly off-topic but I'm curious: Nvidia Volta is advertised as having
"tensor cores" \- what does it take for a programmer to use them? Will typical
TensorFlow or Caffe code take advantage of it? Or should we wait for some new
optimized version of ML frameworks?

~~~
exDM69
> Will typical TensorFlow or Caffe code take advantage of it?

Yes, the support should already be there for both frameworks.

------
corford
Hmm just tried to spool up a p3.2xlarge in Ireland but hit an instance limit
check (it's set at 0), went to request a service limit increase but P3
instances are not listed in the drop down box :(

~~~
avvakum
Same problem here and it does not seem to be zone specific. I wonder how
others worked around this ...

~~~
corford
Maybe by being bigger customers... :)

~~~
avvakum
they fixed the dropdown for quota request

~~~
corford
\o/ I reached out to support and got them to do it manually but good to know.

------
dharma1
Price: p3.2xlarge - $3/hr, p3.8xlarge - $12/hr, p3.16xlarge - $25/hr

These look very good for half precision training

~~~
moonbug22
Come on, no one with any sense pays the on demand price for these things.
Watch the spots.

~~~
dx034
There are enough companies out there with deep pockets that want to do some
ML. They'll pay those prices, no questions asked.

~~~
IanCal
Per the marketing material it's up to a PFLOP of mixed precision (is that the
same as just saying "half precision"? or is it 8 bit?) for $25/hour.

I can easily see people paying full price for that. Still, spot price is
currently $2.40.

~~~
Beltiras
It must be a misplaced comma. 15.7 single precision can never translate to 125
mixed.

~~~
dharma1
The reason Nvidia quotes 120 TFLOPS mixed precision on the V100 is the new
tensor cores.

[https://devblogs.nvidia.com/parallelforall/inside-
volta/](https://devblogs.nvidia.com/parallelforall/inside-volta/)
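
The headline number only applies to FP16 matrix multiplies that cuBLAS/cuDNN
can route to the tensor cores. A rough way to see the gap yourself from
PyTorch (actual numbers depend on the library versions on your AMI and won't
hit the theoretical peak):

```python
import time
import torch

def gemm_tflops(dtype, n=8192, iters=50):
    """Rough sustained TFLOPS for n x n matrix multiplies of the given dtype."""
    a = torch.randn(n, n, device='cuda', dtype=dtype)
    b = torch.randn(n, n, device='cuda', dtype=dtype)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        torch.mm(a, b)
    torch.cuda.synchronize()
    return 2 * n ** 3 * iters / (time.time() - start) / 1e12

print("FP32:", gemm_tflops(torch.float32))  # regular CUDA cores, ~15 TFLOPS peak
print("FP16:", gemm_tflops(torch.float16))  # FP16 GEMM, eligible for Volta's tensor cores
```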

------
kshnell
Looks like Paperspace announced Volta support yesterday:
[https://blog.paperspace.com/tesla-v100-available-
today/](https://blog.paperspace.com/tesla-v100-available-today/) One nice
thing here is you can do monthly plans instead of reserved instances on AWS,
which require a minimum of $8-17k upfront. Really great to see the cloud
providers adopting modern GPUs.

------
moconnor
An exaflop of mixed-precision compute for $250M over 3 years. That’s ballpark
what the HPC community is paying for their exaflop-class machines.

You’d still build your own for that money, I think, but it’s an interesting
datapoint.

~~~
dx034
How long would it take to break even if you built your own, including
electricity costs? If margins are similar to other EC2 instances, probably
around 6 months, which makes EC2 uneconomical for any lab/company that can
utilise the cluster 24/7.

Still nice if you quickly need to get some model results though.

~~~
laumars
Amazon prices are for the pay as you go model. You can shave a significant
amount off the price if you know you're going to be running them for 12
months.

~~~
dexterdog
And even less if it's 36 months.

------
bprasanna
...advanced workloads such as machine learning (ML), high performance
computing (HPC), data compression, and crypto__________.

~~~
Yuioup
How many bitcoins can you mine out of this on max power and would it be
profitable? I'm sure that Amazon has done the math on this but I'm still
curious.

~~~
geofft
It's not just that Amazon has done the math, it's that sufficiently liquid
cryptocurrencies will, by the efficient market hypothesis, quickly gain enough
value to make mining on whatever Amazon offers no longer profitable. As soon
as you're able to profitably mine without an up-front capital investment,
people will take advantage of the arbitrage opportunity until the market
adjusts its price, and if the currency is designed at least somewhat
competently and has enough of a working market (both of which are definitely
true of Bitcoin), that won't take very long.

Cryptocurrencies are the invisible robot hand of the market. (Which is, I
think, not a claim about whether they're good, but certainly a claim about
whether they are to be feared. If you squint hard enough, the giant Bitcoin
mines in China _are_ the work of an unfriendly AI employing people to make
paperclips.)

------
againa
Use reserved instances or use spot. The price decreases substantially. Then
when you don't need it... you don't pay for it... it's a good deal.

~~~
jerianasmith
yaah

------
sethgecko
Is there an AMI that comes with TensorFlow/Keras with GPU support
preinstalled, or do you have to do it yourself?

~~~
Smerity
Amazon offer an official AMI which comes preloaded with various deep learning
frameworks: MXNet, TensorFlow, CNTK, Caffe/2, Theano, Torch and Keras.

For the P3 (Volta V100) instances you'll want to ensure you use an AMI
preloaded with CUDA 9, though not all DL frameworks are happy with that yet.

[https://aws.amazon.com/amazon-ai/amis/](https://aws.amazon.com/amazon-
ai/amis/)
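
A quick way to sanity-check what a given AMI actually ships (this is the
PyTorch route; the other frameworks have their own version attributes):

```python
import torch

print(torch.cuda.is_available())        # True if the instance's GPU is visible
print(torch.cuda.get_device_name(0))    # should report a Tesla V100 on a P3
print(torch.version.cuda)               # CUDA version the framework was built against
print(torch.backends.cudnn.version())   # cuDNN version, e.g. 7xxx for cuDNN 7
```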

~~~
sipherhex
Be careful with the non-CUDA 9 AMIs.

CUDA 8 programs will run, but terribly slowly as they JIT their GPU code
without optimization for Volta. You want the CUDA 9 AMI version
([https://aws.amazon.com/marketplace/pp/B076TGJHY1?qid=1509090...](https://aws.amazon.com/marketplace/pp/B076TGJHY1?qid=1509090457271)),
but it currently only has MXNet and TF.

If you need other frameworks there's the NVIDIA AMI
([https://aws.amazon.com/marketplace/pp/B076K31M1S?qid=1509090...](https://aws.amazon.com/marketplace/pp/B076K31M1S?qid=1509090567000))
and Volta optimized containers for NVCaffe, Caffe2, CNTK, Digits, MXNet,
PyTorch, TensorFlow, Theano, Torch, CUDA 9/CuDNN7/NCCL.

------
science404
Why Ireland and not the UK? I can imagine a lot of startups/banks in London
could use this... Brexit fears?

~~~
maffydub
I wouldn't read too much into this - Amazon's Ireland region was deployed
earlier (2008?) than London (2016?) and seems to receive updates earlier too.

------
psychometry
Random question: Why are we still using mostly GPUs for computation rather
than CPUs custom-designed for ML tasks?

~~~
zolthrowaway
GPUs are quite good at doing arithmetic in parallel. A large part of machine
learning is doing arithmetic on large data sets. It makes sense to do these
operations in parallel. For example, implementing k-nearest neighbors on a GPU
is almost 2 orders of magnitude faster than on a CPU[0].

GPUs just work very well when you have a lot of data and you are able to run
the operations on the data set in parallel. Machine learning seems to fit this
model quite well which is why you see many GPUs used in this field. Other
things that take advantage of parallelism would be graphics and crypto-
currency mining.

[0]
[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.159...](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.159.9386&rep=rep1&type=pdf)
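
As a concrete illustration of the "parallel arithmetic on large data" point:
brute-force k-NN boils down to one big matrix multiply plus a top-k, which is
exactly what a GPU is built for. A minimal PyTorch sketch (not the method from
the cited paper):

```python
import torch

def knn(queries, points, k):
    """Brute-force k-nearest neighbours via pairwise squared distances:
    ||q - p||^2 = ||q||^2 - 2 q.p + ||p||^2, computed as one large GEMM."""
    q2 = (queries ** 2).sum(dim=1, keepdim=True)   # (Q, 1)
    p2 = (points ** 2).sum(dim=1)                  # (P,)
    d2 = q2 - 2.0 * queries @ points.t() + p2      # (Q, P) distance matrix
    return d2.topk(k, dim=1, largest=False)        # smallest k distances and indices

device = 'cuda' if torch.cuda.is_available() else 'cpu'
points = torch.randn(100000, 128, device=device)
queries = torch.randn(1000, 128, device=device)
dists, idx = knn(queries, points, k=5)
```

The same code runs on a CPU; the GPU wins because every entry of the distance
matrix can be computed independently.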

------
jerianasmith
P3 instances no doubt provide a powerful platform and are going to be useful
for data compression.

~~~
arnon
And GPU databases that use compression will gain another big advantage

------
JeanMarcS
If you ever have password hashes to crack :)

------
g105b
Bitcoin?

~~~
pyvpx
Bitcoin mining with any hope of profitability requires custom, purpose-built ASICs.
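
A back-of-the-envelope sketch of why (every figure below is a rough
placeholder from memory, not a measurement):

```python
# Expected BTC/day = (your hashrate / network hashrate) * blocks per day * block reward.
gpu_hashrate      = 8 * 1e9    # ~1 GH/s per GPU on SHA-256 is already generous; 8 GPUs per p3.16xlarge
network_hashrate  = 10e18      # the network was on the order of exahashes/s in late 2017
blocks_per_day    = 144
block_reward      = 12.5       # BTC per block at the time
btc_price_usd     = 6000       # roughly, late 2017
instance_cost_day = 25 * 24    # p3.16xlarge on-demand, USD/day

btc_per_day = gpu_hashrate / network_hashrate * blocks_per_day * block_reward
print("BTC/day:", btc_per_day)                                # ~1e-6 BTC
print("revenue/day: $%.4f" % (btc_per_day * btc_price_usd))   # under a cent
print("cost/day: $%d" % instance_cost_day)                    # ~$600
```

ASICs are several orders of magnitude more efficient per watt and per dollar,
which is why GPU mining of Bitcoin stopped making sense years ago.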

