I'm not mentioning this to be smug - if you don't need massive GPU clusters on demand, the cost difference is substantial. I can build a GPU rig for $1,200 and it's going to cost $40/mo to host it. Compare that to the $500+ per month you're going to spend at a cloud provider.
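To make the break-even point concrete, here's the rough math with the numbers above (quoted figures, not measurements; the cloud number is the low end of "$500+"):

    # Break-even for a $1,200 home GPU rig hosted at $40/mo
    # versus ~$500/mo at a cloud provider (figures quoted above).
    rig_upfront = 1200    # USD, one-time
    rig_hosting = 40      # USD/month
    cloud_cost = 500      # USD/month, low end of "$500+"

    monthly_savings = cloud_cost - rig_hosting         # $460/mo
    break_even_months = rig_upfront / monthly_savings  # ~2.6 months
    print(f"break-even after {break_even_months:.1f} months")

So the rig pays for itself in under three months, and that's before counting any cloud usage above the $500 floor.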
I'm actually shipping one out today. It looks like this https://twitter.com/neocitiesweb/status/804065175045218304
"The cloud" is just marketing on top of shared hosting and virtual private servers. Those things have been around for decades. At the end of the day, your code is still executing on some poor Xeon out there, but maybe it's using some kernel features to restrict memory access and you deploy it using a fancy whale with a shipping container. We've had LXC, cgroups, jails, and xen for many years as well (LXC was 2008, Xen 2003). Solaris had zones in ~2004. We have SSDs, DDR4, 10GbE, and 32-core-per-socket Xeons now, so you don't have to really give two shits about crazy amounts of random reads / writes, storing state in 4 separate network-connected nosql systems with REST APIs, 4 layers of reverse proxies, 3 layers of virtualization, and 60-frame-deep call stacks as most stuff just works well enough with little thought put into the systems aspect of it. Despite all of this, Google Sheets is less responsive than the 1985 release of Excel, and takes longer in wall-clock time to enter in 100 rows of integers, then sum up a column. See also: https://en.m.wikipedia.org/wiki/Wirth's_law
The only thing which is really new is a generation of people who don't understand anything system-related deeper than the first abstraction layer.
Good points re: abstraction of the deeper system layers, though. I guess my position as a frontend developer is that every brain cell I spend thinking about things like hardware and memory allocation is one less to spend thinking about the UX of my app. Abstracting layers away is a good thing - except when they leak, which is often...
So how does elastically providing GPUs across a rack actually work? Given that you can attach GPUs to some common instance types at any time, how do you avoid having a ton of GPUs just sitting around idle because of the physical constraints of PCI-E? And how do you avoid running out of capacity and just having to say no?
*There are newer/better/faster versions of the above available; these were just two examples I had handy, as my guess at how to make this happen. I'm confident that even if AWS is doing something similar, it's being done on customized hardware.
Latency to start and stop "jobs" is critical in gaming, since you are trying to hit a 60-144 Hz frame rate, i.e., a per-frame budget of roughly 7-16 ms.
Fortunately the amount of additional latency introduced is likely to be negligible (another comment cites PCI-E switches incurring <= 1µs).
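To put that in perspective, here's a quick back-of-the-envelope check in Python; the 1 µs figure is the switch latency cited above, the rest is just arithmetic:

    # PCIe switch latency vs. the per-frame budget at 60 and 144 Hz.
    switch_latency_s = 1e-6  # <= 1 us per switch hop (cited above)

    for hz in (60, 144):
        frame_budget_ms = 1000.0 / hz
        overhead_pct = 100 * switch_latency_s / (frame_budget_ms / 1000)
        print(f"{hz} Hz: {frame_budget_ms:.2f} ms/frame, "
              f"switch adds {overhead_pct:.4f}% of the budget")

Even at 144 Hz the switch hop eats around a hundredth of a percent of the frame budget.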
The goal is to cut our capital costs, allow access to GPU-based graphics apps "anywhere", and host all of our virtual film production assets on AWS, where they are accessible for rendering, simulation, etc.
I don't know the answer to your other questions, but EC2 can and often does refuse our RunInstances requests because they don't have the resources available in the AZ we've requested.
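For anyone who hasn't hit this, here's a minimal sketch of what that refusal looks like through boto3 (the AMI, instance type, and AZ are placeholders):

    # Minimal sketch: catching an EC2 capacity refusal with boto3.
    import boto3
    from botocore.exceptions import ClientError

    ec2 = boto3.client("ec2", region_name="us-east-1")
    try:
        ec2.run_instances(
            ImageId="ami-xxxxxxxx",   # placeholder AMI
            InstanceType="p2.16xlarge",
            MinCount=1,
            MaxCount=1,
            Placement={"AvailabilityZone": "us-east-1a"},
        )
    except ClientError as err:
        if err.response["Error"]["Code"] == "InsufficientInstanceCapacity":
            # No capacity for this type in this AZ right now;
            # the usual options are retrying or trying another AZ.
            print("refused:", err.response["Error"]["Message"])
        else:
            raise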
Or you could use PCIe switches and cables to dynamically reassign GPUs to different physical servers.
(This is speculation/questioning, not implying knowledge. I noticed the placement of the ? makes that a bit ambiguous)
I agree that external PCIe is the obvious good/best solution. I didn't realize that external PCIe switch boxes existed.
Although, recently, GPU rendering has gotten some traction in larger facilities. Cloud rendering makes it easier to move toward these kinds of things because you don't have to commit to the hardware upfront. However, the big problem with cloud rendering at even modestly sized animation/VFX houses is transferring the terabytes to and from the cloud (the consensus is to leave the data in the cloud, or to use a local cloud).
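A quick calculation shows why the transfer is the bottleneck (the 10 TB dataset and the 1 Gbps uplink are hypothetical, but in the right ballpark for a mid-size facility):

    # Time to push a render dataset to the cloud.
    dataset_tb = 10   # terabytes, hypothetical
    link_gbps = 1     # dedicated uplink, hypothetical

    seconds = (dataset_tb * 1e12 * 8) / (link_gbps * 1e9)
    print(f"{dataset_tb} TB over {link_gbps} Gbps: "
          f"{seconds / 3600:.1f} hours ({seconds / 86400:.1f} days)")

That's about 22 hours each way, which is why the consensus is to park the data in the cloud rather than round-trip it.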
I'm curious what the pricing will be.
It's clear what they are doing: you call Amazon's OpenGL library, which applies some batching and compression when talking to a remote GPU somewhere else in the cluster. You aren't allowed to know (and don't need to know) what kind of hardware is on the other side; they could even mix different manufacturers. And because of this proxying, you can only use an open, vendor-neutral API like GL, hence no CUDA support.
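If that guess is right, the client-side shim could conceptually look something like this sketch: serialize each GL call, batch, compress, and flush over the network. To be clear, this is entirely made up to illustrate the idea, not Amazon's actual library:

    # Conceptual sketch of a batching/compressing OpenGL proxy.
    # Purely illustrative; not Amazon's implementation.
    import socket
    import struct
    import zlib

    class RemoteGLProxy:
        def __init__(self, host, port, batch_size=64):
            self.sock = socket.create_connection((host, port))
            self.batch = []
            self.batch_size = batch_size

        def call(self, opcode, *args):
            # Serialize one GL call as (opcode, packed float args).
            self.batch.append(struct.pack(f"<I{len(args)}f", opcode, *args))
            if len(self.batch) >= self.batch_size:
                self.flush()

        def flush(self):
            if not self.batch:
                return
            # One compressed blob per batch amortizes the network
            # round trip over many GL calls.
            blob = zlib.compress(b"".join(self.batch))
            self.sock.sendall(struct.pack("<I", len(blob)) + blob)
            self.batch.clear()

A command stream like GL is straightforward to proxy this way; CUDA, with its device pointers and explicit memory transfers, is much harder to virtualize, which fits the no-CUDA observation.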
Just looking at the spot pricing, I see that a p2.xlarge is $0.57 an hour, while a p2.8xlarge is $72/hour and a p2.16xlarge is $144/hour. That tells me there is extreme demand for the heavy GPU instances, and a home cluster would be one way to insulate myself from that.
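If you want to watch those prices yourself, here's a sketch using boto3's describe_spot_price_history, normalized per GPU (the GPU counts are the published p2 specs):

    # Per-GPU spot prices for the p2 family via boto3.
    import boto3

    GPUS = {"p2.xlarge": 1, "p2.8xlarge": 8, "p2.16xlarge": 16}
    ec2 = boto3.client("ec2", region_name="us-east-1")
    resp = ec2.describe_spot_price_history(
        InstanceTypes=list(GPUS),
        ProductDescriptions=["Linux/UNIX"],
        MaxResults=len(GPUS) * 3,
    )
    for rec in resp["SpotPriceHistory"]:
        price = float(rec["SpotPrice"])
        per_gpu = price / GPUS[rec["InstanceType"]]
        print(f'{rec["InstanceType"]}: ${price:.2f}/hr (${per_gpu:.2f}/GPU)')

With the prices quoted above, that works out to $0.57 per GPU-hour on the xlarge versus $9.00 on both of the bigger instances, which is the demand signal in a nutshell.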