
Nvidia Digits DevBox - hendler
https://developer.nvidia.com/devbox
======
modeless
Nvidia owns deep learning. They are alone at the top. Intel and AMD aren't
even in the picture. I think this could end up being a bigger business than
graphics accelerators. There's a huge opportunity here for the first company
to put out a specialized deep learning chip that can beat GPUs (which is
definitely possible; probably by 10x or more).

~~~
trsohmers
Shameless self promotion... my startup
([http://rexcomputing.com](http://rexcomputing.com)) is producing a standalone
chip capable of 64 GFLOPs/watt double precision (128 GFLOPs/watt single
precision), compared to NVIDIA's next-generation chips only hitting
20 GFLOPs/watt single precision... and that is before you take into account the
power waste of the CPU controlling the NVIDIA GPU.

Our biggest plus factor compared to a GPU is that we are a fully
standalone/independent chip that does not need a CPU with your main system
memory attached to it. Large machine learning data sets are getting into the
terabytes in size, and the biggest bottleneck with GPUs is the PCIe link,
which limits them to 16 GB/s and adds a whole lot of additional latency. In our
case, we have a direct connection to DRAM (we've been looking at DDR4 and HMC).
In addition, we have designed the architecture to allow massive scalability,
with up to 384 GB/s of aggregate chip-to-chip bandwidth... NVIDIA's NVLink is
aiming for 80 GB/s in the 2018/2019 timeframe and will still need a connected
CPU to issue jobs.
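
To make that PCIe point concrete, a rough back-of-the-envelope sketch in
Python, using only the bandwidth figures quoted above (the 1 TB dataset size
is just illustrative):

    # Streaming a large training set over PCIe vs. the aggregate
    # chip-to-chip bandwidth claimed above. Bandwidth figures are from
    # the comment; the dataset size is a hypothetical example.
    DATASET_BYTES = 1e12       # hypothetical 1 TB training set
    PCIE_BPS = 16e9            # ~16 GB/s PCIe x16 link
    CHIP_LINK_BPS = 384e9      # claimed aggregate chip-to-chip bandwidth

    print("PCIe:      %.1f s per full pass" % (DATASET_BYTES / PCIE_BPS))       # ~62.5 s
    print("Chip link: %.1f s per full pass" % (DATASET_BYTES / CHIP_LINK_BPS))  # ~2.6 s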

EDIT: I should also mention that our chip is fully general purpose, but we
perform really well when it comes to dense matrix math (most deep learning
workloads), with a 10 to 15x efficiency advantage over GPUs. Our real killer
app is FFTs, which GPUs do abysmally on, and our current benchmarks are
showing a 25x efficiency advantage over the best DSPs and FPGAs built for
large constellation FFTs.

~~~
modeless
Single precision isn't low enough. You want half precision or maybe even
lower. You also want to throw IEEE 754 out the window. Save on power and area:
no denormals, no infinities, no NaNs, relaxed precision requirements. It may
even be worth looking at exotic things like logarithmic number systems or
analog logic (deep learning should tolerate noise extremely well). You're also
going to need vast amounts of memory bandwidth, which means on-package memory,
and probably specialized caches and compute units for convolution.
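
To illustrate, a toy software model of that kind of stripped-down format (the
4-bit exponent / 3-bit mantissa split is an arbitrary choice for the sketch,
not anything a real chip uses):

    import math

    def quantize_minifloat(x, exp_bits=4, man_bits=3):
        """Round x to a toy minifloat with no denormals, infinities, or
        NaNs: values below the smallest normal flush to zero, and
        overflow saturates instead of going to infinity."""
        if x == 0.0:
            return 0.0
        bias = (1 << (exp_bits - 1)) - 1
        sign = math.copysign(1.0, x)
        m, e = math.frexp(abs(x))   # abs(x) = m * 2**e with m in [0.5, 1)
        e -= 1                      # renormalize so the mantissa 2*m is in [1, 2)
        if e < 1 - bias:            # too small: no denormals, flush to zero
            return 0.0
        e = min(e, bias)            # too large: saturate, no infinities
        mant = round(2.0 * m * (1 << man_bits)) / (1 << man_bits)
        # (mantissa carry on rounding can slightly overshoot the max; fine for a toy)
        return sign * mant * 2.0 ** e

    print(quantize_minifloat(0.1))    # -> 0.1015625, a coarse 0.1
    print(quantize_minifloat(1e-9))   # -> 0.0, flushed (no denormals)
    print(quantize_minifloat(1e9))    # -> 240.0, saturated (no infinities)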

A truly specialized deep learning chip probably wouldn't be useful for much
else, but it would be a monster at deep learning. And the thing about deep
learning is it scales really well. If you have a 10x faster machine you're
almost certain to set world records on any machine learning benchmark you try.

~~~
trsohmers
While I personally dislike IEEE float, we decided to remain compliant for our
first chip, as that is a checkbox for a lot of businesses that we want to sell
into. We are looking at a new variable-precision floating point format, called
unum, created by one of our advisers (and HPC industry legend) John Gustafson.
Unum would be fantastic for deep learning, as you would only use the precision
actually required, bringing the program size and memory bandwidth numbers
down and total energy efficiency up at least ~30-50% over the same system
with IEEE float... You can check out a previous HN discussion on it here:
[https://news.ycombinator.com/item?id=9943589](https://news.ycombinator.com/item?id=9943589)
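
A toy illustration of that variable-precision idea (this only measures how
many of a float64's 52 mantissa bits a value actually needs; it is emphatically
not the real unum encoding, which also carries size fields and an exactness
bit):

    import struct

    def mantissa_bits_needed(x):
        # Count float64 mantissa bits up to the last non-zero one: a crude
        # proxy for the "precision actually required" to store x exactly.
        bits = struct.unpack('<Q', struct.pack('<d', x))[0]
        mant = bits & ((1 << 52) - 1)
        if mant == 0:
            return 0
        width = 52
        while mant & 1 == 0:
            mant >>= 1
            width -= 1
        return width

    for x in [1.0, 1.5, 0.5, 3.14159, 0.1, 1.0 / 3.0]:
        print("%10g: %2d of 52 mantissa bits used" % (x, mantissa_bits_needed(x)))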

We still have the option of including a 16-bit (half-precision float) packed
SIMD mode in our FPUs, which would add a bit of complexity (bringing our
efficiency numbers down a bit for double-precision float, which we like to
talk about as it is over 10x better than anything out there), but if there is
enough customer interest we may decide to include it.
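
The bandwidth side of that trade-off is easy to see in software; numpy's
float16 below is just a stand-in for the packed-SIMD FPU mode being described,
which is hypothetical:

    import numpy as np

    # Half-precision values take half the bytes of float32, so twice as
    # many fit through a fixed-width memory bus or SIMD lane, at the cost
    # of some rounding error.
    a32 = np.random.rand(1_000_000).astype(np.float32)
    a16 = a32.astype(np.float16)

    print(a32.nbytes, a16.nbytes)                       # 4000000 vs 2000000 bytes
    print(np.abs(a32 - a16.astype(np.float32)).max())   # worst-case rounding error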

~~~
modeless
Variable precision sounds scary, but maybe it could work. You should look into
building support for your chips into Theano, Torch7, and/or Caffe. You should
be able to hide Unum or any other quirks behind the interfaces of those
libraries, so people can drop in their existing models with no work. If you
can show a significant training speed advantage over GPUs, you'll have a
market.
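
A minimal sketch of that "hide it behind the interface" idea: float16 stands
in for unum or any other exotic internal format, and this matmul function is
purely hypothetical, not any real Theano/Torch/Caffe API:

    import numpy as np

    def matmul(a, b):
        # Library-facing op with a standard float32 interface; internally
        # it could compute in any exotic format without callers changing
        # their models. float16 is the stand-in quirk here.
        a16, b16 = a.astype(np.float16), b.astype(np.float16)
        return (a16 @ b16).astype(np.float32)

    # Existing model code calls the op as usual and never sees the format.
    x = np.random.rand(64, 128).astype(np.float32)
    w = np.random.rand(128, 32).astype(np.float32)
    y = matmul(x, w)
    print(y.dtype, y.shape)   # float32 (64, 32)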

~~~
trsohmers
You should check unum out... It is actually higher accuracy than IEEE float,
as it does not have "rounding", which causes errors over time. Unum is to
floating point as floating point is to integer.

The other nice thing is that it is a superset of IEEE float, and has an "IEEE
mode" where you can convert to IEEE float, which is also jokingly called the
"guess" function.

As for support, that is the plan. Right now our customers have been most
interested in high-end signal processing, so we have been taking time to port
FFTW, but supporting Theano/Torch/Caffe, etc. is a relatively straightforward
process.

~~~
tim333
I just stuck a bit on unums on Wikipedia as there didn't seem to be much.

[https://en.wikipedia.org/wiki/John_Gustafson_(scientist)#Unu...](https://en.wikipedia.org/wiki/John_Gustafson_\(scientist\)#Unums)

Feel free to correct/improve it anyone.

~~~
trsohmers
The End of Error ([http://www.amazon.com/The-End-Error-Computing-Computational/...](http://www.amazon.com/The-End-Error-Computing-Computational/dp/1482239868))
is the book describing them in great detail... I'm working with John to put
together a publicly accessible wiki, which I hope will be up in the next month
or two. That being said, the book is worth having.

------
sandGorgon
_installed standard Ubuntu 14.04 w/ Caffe, Torch, Theano, BIDMach, cuDNN v2,
and CUDA 7.0_

Whoa - are you telling me that the nVidia drivers on Linux are so stable that
they are building a commercial deep learning system on top of them? Are these
the same as the normal graphics drivers?

~~~
Galaxeblaffer
Nvidia's drivers for Linux are pretty darn good these days...

~~~
yulaow
Those for laptops are pretty shitty; we are still using third-party programs
to support Optimus technology (with not-great results).

~~~
ris
Optimus technology is a fairly stupid idea in the first place, though, one
that exists for business reasons rather than technical ones.

~~~
StavrosK
What's stupid about it? Sounds pretty smart to not have to use the powerful
GPU if I'm only browsing the web, to me.

~~~
ris
"The powerful GPU."

You've been sold the idea that a "powerful GPU" needs to suck a lot of power
all the time.

There is no real reason a "powerful GPU" shouldn't be able to scale its power
usage way down when doing something simple like browsing the web. The only
reason NVidia weren't able to do the "low power" thing on these systems is
they weren't able to be the ones putting their GPUs on the same die as the CPU
like Intel (& AMD) were. But of course they still wanted part of the action,
so people ended up being sold this massive engineering bodge and told it's a
good thing.

~~~
Sanddancer
I have a laptop that has an Intel CPU and a powerful discrete AMD GPU. Even
though on-die GPUs have gotten better, there is still a considerable
difference in performance between on-die and dedicated. AMD and NVidia both
realize that there is an onboard chip that can do the common stuff, and so
they have made it possible to turn the dedicated GPU off when it isn't needed.

~~~
ris
If you imagine a properly, holistically designed product, which wasn't full of
chips from different warring companies, the high-power GPU could be used to
_augment_ the power of the on-die GPU, instead of having to turn it off and
deal with a whole bunch of mad signal-switching issues. AMD products can do
this to an extent with CrossFire, but generally, this is a world that we don't
live in.

------
sxp
The price for a custom build with these specs is ~$8k:
[http://pcpartpicker.com/p/NP4MNG](http://pcpartpicker.com/p/NP4MNG). Spending
that on EC2 GPU instances would be a better use of the money unless you
really need a local workstation.
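
A quick break-even sketch: the $8k figure is from the pcpartpicker build
above, and the hourly rate below is a hypothetical placeholder to swap for the
actual EC2 GPU on-demand price:

    # Rough break-even between buying the box and renting GPU instances.
    BOX_COST = 8000.0   # ~$8k DIY build (pcpartpicker link above)
    EC2_RATE = 2.60     # $/hour, hypothetical on-demand GPU instance rate

    hours = BOX_COST / EC2_RATE
    print("Break-even after ~%.0f instance-hours (~%.0f days of 24/7 use)"
          % (hours, hours / 24))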

~~~
ericjang
I'd recommend a local custom build over EC2. EC2 GPU instances are virtualized
via a hypervisor, which dramatically reduces performance for multi-GPU
networks. And this doesn't take into account the large amounts of disk space
needed for the training set.

~~~
sandGorgon
Does anyone know which hypervisor they use? Can I build a local EC2-style GPU
instance with these GPUs? I'm quite amazed that they are able to get the
drivers, etc. working with these GPUs on top of a hypervisor.

~~~
UK-AL
There are dedicated instances specifically for this purpose.

~~~
sandGorgon
Agreed - but which hypervisor do they use?

~~~
UK-AL
Normally Xen?

------
bobjordan
We built our own quad-Titan devbox a few months ago with the same general
components as this, except we used a Core i7-5960X and threw in a few 1TB
Samsung SSDs in RAID. It came in at just about $9,000 USD in hardware cost,
versus the roughly $15,000 I think Nvidia was charging. Still, I'm sure they
aren't making a ton of money, and you get a hardware guarantee along with the
configuration (but the config wasn't so bad...).

~~~
viklas
Agree - we crunched the numbers and came up with the same figure to do it
yourself (~USD $9K). Although, I remember from the day this was announced (a
few months back) that Nvidia were loud and proud that they weren't going to
make money from this. Each box was hand-built and tested, so it wasn't deemed
to be a large-scale device - they recognized that it's a niche market.

$15K is probably OK(ish) if you figure in your own time for the DIY
build... probably a few days. Plus you get some vendor support, a warranty on
the whole package, a certified working stack, a future test bed for CUDA
upgrades (it will work first), etc., as you say.

In wild agreement. Save maybe 30% doing a custom build... so they aren't
adding a huge mark-up, as they would for a gaming machine. Apparently...
someone at Nvidia is looking a bit further into the future than just
short-term revenue.

------
seiji
Newegg has been selling quad 12GB Titan X GPU combo packs for a while.
Single-click add-to-cart for 18 components:
[http://www.newegg.com/Product/ComboBundleDetails.aspx?ItemLi...](http://www.newegg.com/Product/ComboBundleDetails.aspx?ItemList=Combo.2349536)

------
alricb
FWIW, the case is a Corsair Carbide Air 540 with hard drive sleds in the two
5.25" bays:
[http://www.corsair.com/en-us/carbide-series-air-540-high-air...](http://www.corsair.com/en-us/carbide-series-air-540-high-airflow-atx-cube-case)

Makes sense to me, since you want the best airflow possible getting to the
cards in a multi-GPU setup, and unlike conventional cases, the Air 540 doesn't
have a drive cage between the front fans and the video cards.

------
kfor
I wonder how Nvidia building their own machines goes over with the many, many
third party partners building similar rigs. On the Supercomputing 2014
showroom floor it seemed like half the booths were selling something like this
and were covered in Nvidia branding.

~~~
Sanddancer
This is a developer platform, in a rather inconvenient form factor for any
sort of deployment at scale. The partners honestly probably love it, because
it means they won't be hit with support requests when a driver acts up, and
they'll just be getting the sales for the finished product.

------
afsina
I think Boxx Apexx-5 boxes are already on par with these (if not more powerful).

[http://www.boxxtech.com/products/apexx-5](http://www.boxxtech.com/products/apexx-5)

~~~
choppaface
That unit with _only one_ K40 Tesla appears to be $12,000. If the nVidia box
is really $15,000 (with 4 Titans, 9TB SATA, etc), then the nVidia box looks
like a much better value (and probably more powerful).

~~~
afsina
Keep in mind this has a dual-socket motherboard. My colleagues bought 2 of
those boxes with 4 Titan Xs, and they were actually cheaper than the price on
the website.

------
happycube
The Pascal cards are going to be _much_ better, with HBM2 memory and possibly
even real double-precision performance (not that deep learning needs it, but
still...)

~~~
ris
NVidia in particular are _very_ good at selling the future - I'll warn you
that much.

~~~
happycube
True 'dat, but combined with the first GPU die shrink in years, there's a
decent chance a generational jump is actually coming. Whether the first
version will be bug-free is much more questionable...

------
bagels
Why would I buy this vs. renting a cluster of EC2 GPU nodes?

~~~
Sanddancer
Latency, support, specifications. The EC2 GPU nodes have less and slower
memory, graphics cards with half the performance and a third of the memory,
and less drive space. Additionally, said compute resources are not next to
you, which means certain things, like deep learning combined with AR, are not
possible. If you're just doing number crunching, you may be fine with EC2
instances, but for any sort of realtime development you probably want a
supported platform right next to you.

------
Twirrim
Titans are $1.5k each, so that's $6k down before you even account for the rest
of the hardware to run it. Ouch.

~~~
brudgers
The whole machine off the lot is going to cost less than a drywall
contractor's Ford F150, yet the potential payback is many times higher while
the operating costs are several orders of magnitude lower. Throw in the
three-year depreciation and the box is a bargain so long as it can be put to
work.

~~~
DanielBMarkham
_The whole machine off the lot is going to cost less than a drywall
contractor's Ford F150... so long as it can be put to work_

Yes. If you're a drywall contractor, you go to Craigslist and start humping
some jobs. Make some bucks and buy yourself a nice F150.

Conversely, there aren't a lot of "need deep learning in my dentistry" ads on
CL -- and there's no path from point A to point B that's as easy or clear-cut.

So the investment makes sense, if the right conditions hold. To the average
developer, those conditions look very murky and hard to gauge. So while it's
obviously some sweet tech, it takes a lot more than tech.

~~~
brudgers
I don't disagree. The price is the price because the price makes sense for
businesses. Prosumers may buy one for the same reason pump and valve
salespeople buy 911 GT3s... to drive something fast slowly. It's on a
continuum with the drywall feller ordering the 6" lift and 34" tires, leather
seats, and dual climate control, when all he really needed was the four-wheel
drive and the diesel V8.

Which, BTW, the drywall impresario who's buying a new truck for his business
isn't finding jobs on Craigslist. He's got business relationships with serious
people who call him when a bid needs bidding and drywall needs hanging. It's
deal flow, just as it is for the sort of person who needs a 4-GPU box and CUDA
code for their business. It's only a lot of money if it sits idle.

If there isn't a business case, there isn't a business case, and buying it is
an inefficient allocation of resources.

------
erikj
It looks like the NeXTcube:
[https://upload.wikimedia.org/wikipedia/commons/2/27/NeXTcube...](https://upload.wikimedia.org/wikipedia/commons/2/27/NeXTcube.jpg)

------
nextos
I'm working on probabilistic programming. Hierarchical models are _very_ close
to deep learning. PyMC3 has a Theano backend, so this kind of setup is very
exciting. Anyone else with the same thoughts/interests?
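
For anyone curious, a minimal partial-pooling sketch in PyMC3 with synthetic
data (just to show the kind of hierarchical model meant here; the priors are
arbitrary):

    import numpy as np
    import pymc3 as pm

    # Toy grouped data: 5 groups with different means.
    groups, n = 5, 200
    idx = np.random.randint(0, groups, n)
    y = np.random.randn(n) + idx * 0.5

    with pm.Model():
        mu = pm.Normal('mu', 0.0, 10.0)                      # shared hyperprior
        sigma_g = pm.HalfNormal('sigma_g', 5.0)
        mu_g = pm.Normal('mu_g', mu, sigma_g, shape=groups)  # per-group means
        sigma = pm.HalfNormal('sigma', 5.0)
        pm.Normal('obs', mu_g[idx], sigma, observed=y)
        trace = pm.sample(1000)   # Theano does the heavy lifting underneath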

------
mobileexpert
NVidia should also market this to people who want to do molecular dynamics
and other GPU-enabled physics sims locally.

~~~
madengr
I have one of these for electromagnetics sim:

[http://www.microway.com/product/whisperstation-tesla/](http://www.microway.com/product/whisperstation-tesla/)

------
z3t4
Why not get a server tower case and motherboard while you're at it? Supermicro
has some good ones.

~~~
Sanddancer
With a machine like this, you're buying the support, with the hardware as an
add-on. This isn't made for deployment; it's made for development and
debugging, where being able to call Nvidia at any hour and get a decent
engineer is worth it.

