Not even that: spot pricing on an 8-GPU instance (the p3.16xlarge, I believe; the larger one has the same number of GPUs but more memory per GPU) is something like $6/hr. I use this for personal projects sometimes: get all the data into S3, set up a good launch template, then spin up a spot instance and be super efficient about training quickly. I've even run evals on a separate, cheaper machine so the 16xl can spend all its time training. It's still not "cheap", but $50 for 8 hours of training on a machine like that, with $64k of GPUs on board, is really not bad.
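For what it's worth, the launch step can be a few lines of boto3. A minimal sketch, assuming the launch template already carries the AMI, key pair, IAM role, and user-data that kicks off training from S3 (the template name "gpu-training" and the region are hypothetical):

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")  # assumed region

    # Request a one-time spot instance from a pre-built launch template.
    resp = ec2.run_instances(
        MinCount=1,
        MaxCount=1,
        InstanceType="p3.16xlarge",  # 8x V100
        LaunchTemplate={"LaunchTemplateName": "gpu-training", "Version": "$Latest"},
        InstanceMarketOptions={
            "MarketType": "spot",
            "SpotOptions": {
                "SpotInstanceType": "one-time",
                "InstanceInterruptionBehavior": "terminate",
            },
        },
    )
    print(resp["Instances"][0]["InstanceId"])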
Except that cloud-ified V100s are significantly less powerful than if you have direct access to the hardware. Last time I checked, in AWS they're actually external devices mapped in over GBit ethernet, which is far slower than the ~8 GB/s that PCIe 3.0 x8 provides.
I think you are confusing this with AWS Elastic Inference.
If you use AWS Elastic Inference, then you get network-attached devices. But these are Amazon's own (non-NVidia) devices and they're only used for inference, so it's not really comparable.
Presumably that depends on how much PCIe bandwidth your workload actually consumes before it bottlenecks elsewhere? A 2018 benchmark (https://www.pugetsystems.com/labs/hpc/PCIe-X16-vs-X8-with-4-...) seems to indicate that x8 isn't generally a bottleneck for common (at the time) workloads. x8 is a far cry from the claimed gigabit ethernet, though!
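Back-of-the-envelope, the gap is huge. Assuming PCIe 3.0 (8 GT/s per lane, 128b/130b encoding) and ignoring further protocol overhead:

    # Rough theoretical bandwidth: gigabit ethernet vs. PCIe 3.0 links.
    GBE_GBPS = 1 / 8                       # 1 Gbit/s ~= 0.125 GB/s
    PCIE3_LANE_GBPS = 8 * (128 / 130) / 8  # ~0.985 GB/s per lane

    for lanes in (4, 8, 16):
        bw = PCIE3_LANE_GBPS * lanes
        print(f"PCIe 3.0 x{lanes}: {bw:5.2f} GB/s (~{bw / GBE_GBPS:.0f}x GbE)")

    # x8 works out to ~7.9 GB/s, roughly 60x gigabit ethernet.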
AWS is tricky in terms of how storage is provisioned. I don't remember the details, but it's easy to put your datasets on storage that's connected to your GPU servers over a 1 Gb link, and that could easily become a bottleneck. Datasets should live on Elastic Block Store (EBS) or something like that, over high-speed links. It's been a while since I looked into this, though.
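An easy sanity check is to time a sequential read of a dataset shard from the mounted volume; throughput plateauing around ~110 MB/s smells like a 1 Gb link. A sketch (the path is a placeholder, and re-runs will hit the page cache, so use a file larger than RAM):

    import time

    def read_throughput(path, chunk_mb=8):
        """Sequentially read `path` and return throughput in MB/s."""
        chunk = chunk_mb * 1024 * 1024
        total = 0
        start = time.perf_counter()
        with open(path, "rb") as f:
            while buf := f.read(chunk):
                total += len(buf)
        elapsed = time.perf_counter() - start
        return total / (1024 * 1024) / elapsed

    # Hypothetical dataset shard on the volume under test.
    print(f"{read_throughput('/data/train-00000.tfrecord'):.0f} MB/s")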
The earlier comment claimed that the GPUs (!!!) were located elsewhere on the network; I suspect that the scenario you describe is what they intended to refer to.
(IIRC AWS offers compute-optimized instances with a volume that's guaranteed to be backed by blocks on a local NVMe drive.)
I think they are confusing it with AWS Elastic Inference. That is a different thing, which does have network-attached accelerators:
Amazon Elastic Inference accelerators are GPU-powered hardware devices that are designed to work with any EC2 instance, SageMaker instance, or ECS task to accelerate deep learning inference workloads at a low cost. When you launch an EC2 instance or an ECS task with Amazon Elastic Inference, an accelerator is provisioned and attached to the instance over the network.
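For reference, that attachment happens at launch time. A hedged boto3 sketch (the AMI ID is a placeholder, and the instance also needs a VPC endpoint for the Elastic Inference service, which is omitted here):

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")  # assumed region

    # Launch a cheap CPU instance with a network-attached EI accelerator.
    resp = ec2.run_instances(
        ImageId="ami-0123456789abcdef0",  # placeholder AMI
        InstanceType="c5.xlarge",
        MinCount=1,
        MaxCount=1,
        ElasticInferenceAccelerators=[{"Type": "eia2.medium", "Count": 1}],
    )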