
Combine Multiple AWS Instances into a 16-GPU Monster Machine - Noughmad
http://www.bitfusion.io/2016/03/31/introducing-monster-machines-worlds-largest-cloud-gpu-instances-aws/
======
manav
I've found Amazon GPU instances to be really expensive (even the spot prices
have been high recently), especially if you need them for longer deep learning
runs. The other issue is that the additional layers of virtualization add
bandwidth overhead.

I'd like to see something in the cloud that's bare-metal / full access to GPUs
(maybe a good idea to start one). For scaling higher with a very large number
of GPUs, you'd need InfiniBand, but at some point there is going to be a
bandwidth tradeoff.

It would be interesting if someone could run some benchmarks of these
instances versus a physical server.

~~~
ant6n
Somebody should make a startup that allows people to sell access to their
computers by the minute. Like spot instances in the cloud ... in people's
basements. The true sharing economy.

~~~
eximius
That sounds like a nightmare. You'd have zero uptime guarantees!

~~~
Johnny555
Does Amazon have any instance uptime guarantees?

~~~
joombaga
Yes, >99.95% Monthly Uptime Percentage.

"“Monthly Uptime Percentage” is calculated by subtracting from 100% the
percentage of minutes during the month in which Amazon EC2 or Amazon EBS, as
applicable, was in the state of “Region Unavailable.” Monthly Uptime
Percentage measurements exclude downtime resulting directly or indirectly from
any Amazon EC2 SLA Exclusion (defined below)."

[https://aws.amazon.com/ec2/sla/](https://aws.amazon.com/ec2/sla/)
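
For context, a quick back-of-the-envelope calculation (assuming a 30-day month) shows what 99.95% monthly uptime actually allows:

```python
# Downtime budget implied by a 99.95% monthly uptime SLA,
# assuming a 30-day month for simplicity.
minutes_per_month = 30 * 24 * 60          # 43,200 minutes
allowed_downtime = minutes_per_month * (1 - 0.9995)
print(f"{allowed_downtime:.1f} minutes")  # → 21.6 minutes
```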

~~~
Johnny555
That's for regional outages; they provide no SLA for individual instances:

 _Amazon EC2 SLA Exclusions... (v) that result from failures of individual
instances or volumes not attributable to Region Unavailability_

Presumably, with a cloud hosted in people's basements, if someone's basement
server dies you'd just pick one from someone else's basement, so this model
could provide better availability than AWS.

------
TheGuyWhoCodes
That's some really cool tech. It seems like it's Linux only. Is there Windows
support planned? That would solve the problem of wanting to run code on the
GPU within a Linux VM while the host is Windows.

~~~
mtweak
Windows support is coming in mid-April, stay tuned!

~~~
TheGuyWhoCodes
That's great! Reading the documentation, it seems there is no support for
multiple clients and multiple GPUs (many-to-many). Is there anything planned
on that side?

~~~
mtweak
You can absolutely do that. That's actually one of the more interesting
configurations: the ability to pool GPU systems.

Just go to the custom link at the bottom of the page, the link is:
[https://console.aws.amazon.com/cloudformation/home?region=us...](https://console.aws.amazon.com/cloudformation/home?region=us-
east-1#/stacks/new?stackName=BitfusionCluster&templateURL=https:%2F%2Fs3.amazonaws.com%2Fbitfusionio%2Fcfn%2Fbitfusion-
boost-cluster.cfn)

There you can select any number of clients and servers. For example: 5 clients
and 1 server (many to one), or 5 clients to 5 servers (many to many).

~~~
TheGuyWhoCodes
Nice. The doc at [https://bitfusionio.readme.io/docs/bitfusion-
boost](https://bitfusionio.readme.io/docs/bitfusion-boost) is a bit misleading
about the possible configurations. Maybe add one configuration with multiple
Boost Clients (CPU) and many Boost Servers (GPU).

------
mchahn
At first I thought this was the same problem as automatically breaking up apps
to run on multiple CPUs. That problem has been heavily researched with no
success.

Is it the fact that GPU code already runs in parallel streams that makes this
possible?

~~~
mtweak
Yes, your app would have to support multiple GPUs. What's done here is
remoting CUDA/OpenCL/etc. calls so that remote GPUs can be accessed from a
single instance. When performing device/platform enumeration, all GPUs appear
to be directly connected to a single instance -- hence no change to the
application required.
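
A minimal sketch of that enumeration trick (all names and counts here are hypothetical, not Bitfusion's actual implementation): an interposed device-count call answers with the sum of GPUs across remote servers, so existing multi-GPU code sees one big machine:

```python
# Toy model of API remoting: an interposed "get device count" call
# aggregates GPUs from remote servers so they all appear local.
# Server names and GPU counts are made up for illustration.

REMOTE_SERVERS = {"gpu-host-1": 8, "gpu-host-2": 8}  # hypothetical hosts
LOCAL_GPUS = 0                                       # e.g. a CPU-only client

def get_device_count():
    """Stands in for an intercepted cudaGetDeviceCount()-style call."""
    return LOCAL_GPUS + sum(REMOTE_SERVERS.values())

def get_device(index):
    """Map a flat device index back to (server, local index) for routing."""
    for server, count in REMOTE_SERVERS.items():
        if index < count:
            return server, index
        index -= count
    raise IndexError("no such device")

print(get_device_count())  # → 16
print(get_device(9))       # → ('gpu-host-2', 1)
```

The application only ever sees the flat index space; the routing back to a specific server happens behind the API boundary.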

~~~
derefr
Sounds like Plan9's concept of "CPU server mounts" has been reborn as "GPU
server mounts." Could actually get traction this time, given that existing
multi-GPU programs will Just Work.

~~~
LoSboccacc
I can't wait for a company to provide OpenCL/CUDA MFLOPS as a service instead
of giving you VMs as a whole, so one could just attach a remote engine to any
smallish controller VM

~~~
mbajkowski
What you suggest is technically possible by installing our Boost software on
any GPU machine, and then accessing that machine from any clients running our
Boost software as well. That client does not need to have a GPU. This
configuration is supported in AWS today, where for example you can connect one
or more t2.large isntance to a g2.8xlarge. All that would have to be done is
some metering on the GPU machine to implement the service you suggest :)
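
That metering could be as simple as accumulating GPU time per client on the server side. A hypothetical sketch (not part of Boost; all names invented):

```python
from collections import defaultdict

# Hypothetical per-client GPU-time metering for a shared GPU server.
class GpuMeter:
    def __init__(self):
        self.usage = defaultdict(float)  # client id -> GPU seconds used

    def record(self, client, seconds):
        """Accumulate billable GPU time for one remoted call."""
        self.usage[client] += seconds

    def bill(self, client, rate_per_second):
        """Total charge for a client at a given per-second rate."""
        return self.usage[client] * rate_per_second

meter = GpuMeter()
meter.record("client-a", 120.0)  # 2 minutes of kernel time
meter.record("client-a", 60.0)
meter.record("client-b", 30.0)
print(round(meter.bill("client-a", 0.001), 2))  # → 0.18
```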

We are not limiting our software to AWS, so you can build this kind of service
on any kind of cluster by installing our software directly from
[https://boost.bitfusion.io](https://boost.bitfusion.io). I say cluster
because we have played with the idea of thin devices accessing remote GPU
instances in the cloud, but over public networks the network performance was a
limiting factor.

------
minimaxir
Are there benchmarks/code examples for the Monster Machines?

~~~
mtweak
Yes! Whenever you spin up one of our AMIs, there is a README that will guide
you through a couple of simple examples. We are about to publish performance
results on the monster machines in a few days, so watch for it. Scaling
depends on the compute density of the GPU workload, but in general we've seen
pretty good results with 1) deep learning (Caffe) scaling to 16 GPUs (near-
native scaling compared to local GPUs, especially for deep nets), 2)
raytracing of photo-realistic and complex scenes (near-linear scaling with
increasing GPUs), and 3) physical modeling and simulation, which does very
well too.

~~~
semi-extrinsic
Have you done any molecular dynamics benchmarks? If so, what kind/what system?
I'd be very interested to see those.

If you haven't, I could probably contribute some strong and weak scaling
testcases.

~~~
mtweak
We've only done a cursory evaluation of NAMD scaling. We saw a 7X improvement
going from a non-GPU system to remote GPUs located in a different datacenter
over a shared 10GbE link. We're not sure if that was with a representative
dataset (MD is not our area of expertise), so if you can help us with a case
study we'd be excited to work with you. Please do contact me.

~~~
semi-extrinsic
I sent you a message with some more info and my email via the contact form at
bitfusion.io

------
yankoff
Do you support spot instances?

~~~
mbajkowski
Not yet, but it is on our roadmap. We have had several customers inquire about
it. Drop us a note on our site and I will ping you when it becomes available.

------
mtanski
Congrats guys this is a really neat hack, really impressive.

Did you guys think about building it further out to provide a GPU load
balancer for multiple frontend machines running Cuda / OpenCL?

~~~
mtweak
Hmm, can you elaborate? Do you mean having multiple smaller instances talk to
a single GPU instance?

~~~
mtanski
I'm talking about time-sharing. It doesn't matter if it's smaller instances
sharing a single GPU instance or many instances sharing many GPU instances.
Essentially N:M sharing (with some scheduling).

Since the GPU client is now abstracted from the GPU devices by placing the
GPUs across the network, it seems like time-sharing should be the next logical
step.
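
Such an N:M scheduler could be sketched as routing each client job to the currently least-loaded of M GPU servers (a hypothetical illustration; server names and job costs are made up):

```python
import heapq

# Toy N:M scheduler: hand each client job to the least-loaded of M
# GPU servers, tracked as a min-heap of (accumulated load, server).
class GpuPool:
    def __init__(self, servers):
        self.heap = [(0.0, s) for s in sorted(servers)]
        heapq.heapify(self.heap)

    def submit(self, job_cost):
        """Assign a job to the least-loaded server and return its name."""
        load, server = heapq.heappop(self.heap)
        heapq.heappush(self.heap, (load + job_cost, server))
        return server

pool = GpuPool(["gpu-1", "gpu-2"])
assignments = [pool.submit(cost) for cost in (3.0, 1.0, 1.0, 1.0)]
print(assignments)  # → ['gpu-1', 'gpu-2', 'gpu-2', 'gpu-2']
```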

~~~
mtweak
Got it, this is actually already supported. At the very end of the blog post
there is a link to create a custom configuration. You can create any N:M
configuration, that is, any number of clients to servers, and therefore
choose the level of performance scaling or GPU pooling.

Check it out:
[https://console.aws.amazon.com/cloudformation/home?region=us...](https://console.aws.amazon.com/cloudformation/home?region=us-
east-1#/stacks/new?stackName=BitfusionCluster&templateURL=https:%2F%2Fs3.amazonaws.com%2Fbitfusionio%2Fcfn%2Fbitfusion-
boost-cluster.cfn)

~~~
mtanski
Great, thanks for clearing that up.

------
vessenes
This is really cool; publishing an AMI seems like such a good win for you
guys: configuration is already done, and you get paid as customers use it.

Hopefully you'll see some good uptake.

~~~
Noughmad
It goes nicely with the "supercomputing to the masses" mission. Especially
when the alternative is buying lots of machines and installing all the
required software manually.

------
skamma77
Congrats Bitfusion Team. This is really exciting!

------
homero
You can do this 10x cheaper at home

------
man5quid
Congratulations, it's awesome to see the AMIs published.

------
flamethrower
When will Bitfusion be available for Google and Microsoft?

