
Nvidia CEO Reveals New TITAN X at Stanford Deep Learning Meetup - bcaulfield
https://blogs.nvidia.com/blog/2016/07/21/titan-x/
======
cs702
Great news all-around for deep learning practitioners.

Nvidia says memory bandwidth is "460GB/s," which will probably have the most
impact on deep learning applications (lots of giant matrices must be fetched
from memory, repeatedly, for extended periods of time). For comparison, the
GTX 1080's memory bandwidth is quoted as "320GB/s." The new Titan X also has
3,584 CUDA cores (1.4x the GTX 1080's 2,560) and 12GB of RAM (1.5x the GTX
1080's 8GB).

We'll have to wait for benchmarks, but based on specs, this new Titan X looks
like the best single GPU card you can buy for deep learning today. For certain
deep learning applications, if properly configured, two GTX 1080's might
outperform the Titan X and cost about the same, but that's not an apples-to-
apples comparison.

A beefy desktop computer with four of these Titan X's will have 44 Teraflops
of raw computing power, about "one fifth" of the raw computing power of the
world's current 500th most powerful supercomputer.[1] While those 44 Teraflops
are usable only for certain kinds of applications (involving 32-bit floating
point linear algebra operations), the figure is still kind of incredible.

[1]
[https://www.top500.org/list/2015/06/?page=5](https://www.top500.org/list/2015/06/?page=5)
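
For those checking the arithmetic, here's a back-of-the-envelope estimate (it
assumes 2 FLOPs per fused multiply-add and the 1.53GHz boost clock quoted
elsewhere in this thread):

    3,584 FP32 units x 2 FLOPs/FMA x 1.53 GHz ≈ 11 TFLOPS (FP32) per card
    4 cards x ~11 TFLOPS ≈ 44 TFLOPS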

~~~
pavlov
_A beefy desktop computer with four of these Titan X's will have 44 Teraflops
of raw computing power, about "one fifth" of the raw computing power of the
world's current 500th most powerful supercomputer._

People have been calling Moore's Law dead for at least 10 years, and yet
desktops keep catching up to supercomputers.

I know this increase is not in single-threaded general-purpose computing power
in the fashion of the old gigahertz race... But on the other hand, the scope
of what's considered "general purpose" keeps expanding too. Machine learning
may be part of mainstream consumer applications in 5 years.

~~~
rasz_pl
Moore's Law IS dead for single core CPU performance.

~~~
mrb
_" Moore's Law IS dead for single core CPU performance."_

Moore's Law was never about single core CPU performance. It is about the
number of transistors in a single chip; this has always been its definition.

~~~
honkhonkpants
Actually Moore's law is about the number of transistors you can etch for a
single dollar.

~~~
astrodust
Moore's Law: "Moore's law (/mɔərz.ˈlɔː/) is the observation that the number of
transistors in a dense integrated circuit doubles approximately every two
years."

That's the definition I've always known. It has nothing to do with money or
speed or performance, but usually these things are correlated.

------
tylerwhipple
As great as this product may be, I went to the Stanford Deep Learning Meetup
to learn more about how Baidu Research/Andrew Ng are solving large scale deep
learning problems. I am disappointed by how much (unannounced) time was
dedicated to the keynote/sales pitch.

~~~
argonaut
I'm pretty sure Andrew Ng was hired by Baidu as a recruiting tool /
evangelist. He hasn't done much research recently.

~~~
nabla9
If his job as Chief Scientist at Baidu is similar to the Director of Research
at Google (Peter Norvig), he is too busy to be part of individual research
projects.

He is probably one of the guys who decide the direction of company research
and who supervise those projects.

------
kartD
Since this question has come up so many times in the thread, my take is that
FP64 and FP16 won't be as good as on the GP100. If the TITAN X is based on the
consumer parts, it misses out on the GP100's FP improvements.

From Anandtech's GTX 1080 review, page 2: "As a result, while GP100 has some
notable feature/design elements for HPC – things such as faster FP64 & FP16
performance, ECC, and significantly greater amounts of shared memory and
register file capacity per CUDA core – these elements aren't present in GP104
(and presumably, future Pascal consumer-focused GPUs)."

This requires confirmation though, it depends on whether it uses the consumer
chip or the HPC chip.

Edit: AT has an article up [http://www.anandtech.com/show/10510/nvidia-
announces-nvidia-...](http://www.anandtech.com/show/10510/nvidia-announces-
nvidia-titan-x-video-card-1200-available-august-2nd)

They think it's likely a consumer card, so lower FP16 and FP64 performance. It
should be a gaming monster, though.

~~~
mrb
Well, the Titan X is a new chip: GP102. So they could have picked and chosen
features from either GP100 (professional Tesla) or GP104 (consumer GeForce).

We know almost certainly that it has lower FP64 performance than GP100, because
it has 22% fewer transistors and FP64 units take up a lot of transistors.
However, it is less clear what the FP16 performance is. Nvidia could have
decided to match GP100 in that regard.

~~~
kartD
Doubt it, why have three versions of a chip (one with fp16 improv, one with
fp16/fp64 improv and one without them)? Especially when the Titan X has the
same number of CUDA cores as the GP100.

~~~
jhj
"CUDA cores" is a misleading term.

When they quote "CUDA cores", they've been counting float32 fma functional
units; e.g., Tesla K40 has 192 float32 fma units per SM x 15 SMs => 2880 "CUDA
cores".

[http://docs.nvidia.com/cuda/cuda-c-programming-
guide/index.h...](http://docs.nvidia.com/cuda/cuda-c-programming-
guide/index.html#arithmetic-instructions__throughput-native-arithmetic-
instructions)

fp16 and fp64 are likely different functional units with different issue
rates, as is the case with old hardware; unless they've managed to share the
same hardware (since for P100 the quoted fp64 rate is exactly half the fp32
rate, and the fp16 is exactly double the fp32 rate).
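
A minimal host-side sketch of the core counting described above (the per-SM
figure is architecture-specific and hard-coded here as an assumption; the CUDA
runtime itself only reports SM count and clock):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        // FP32 FMA units per SM depend on the architecture; 192 on Kepler,
        // so a K40's 15 SMs give the quoted 2880 "CUDA cores".
        int fp32_units_per_sm = 192;  // assumption: Kepler-class part
        long long cuda_cores = (long long)prop.multiProcessorCount * fp32_units_per_sm;
        printf("SMs: %d, \"CUDA cores\": %lld, clock: %.2f GHz\n",
               prop.multiProcessorCount, cuda_cores, prop.clockRate / 1e6);
        return 0;
    }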

~~~
kartD
True, I forgot NVIDIA (and AMD) play around with the "core" definition

~~~
jhj
There's no real analogue to a CPU core (or thread) on a GPU; there are warp
schedulers (Nvidia) and functional units with varying throughput rates. The
closest analogue to a CPU thread is the GPU warp (Nvidia), which shares a
program counter and executes as a 32-wide vector. AMD wavefronts (64-wide) are
a little bit different, but not by much. The CUDA "thread" is really an
illusion that makes a SIMD lane easier to program to (sometimes...)
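
A minimal CUDA sketch of that thread/warp/lane mapping (names are illustrative;
it assumes a 1-D launch with blockDim.x a multiple of 32):

    // Each CUDA "thread" is one lane of a 32-wide warp; 32 consecutive threads
    // share a program counter and execute together as a vector.
    __global__ void whoAmI(int *warp_of, int *lane_of, int n) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;  // the "thread" illusion
        if (tid < n) {
            warp_of[tid] = tid / 32;  // which warp (the real unit of execution)
            lane_of[tid] = tid % 32;  // which SIMD lane within that warp
        }
    }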

~~~
kartD
I agree; I think an SMX is the closest thing to a CPU core - it contains
dispatch, cache, schedulers, etc. CUDA threads of course have important
differences, since all threads in a warp move in lockstep. IMO, all of these
contrived definitions are just to ease programmers into the CUDA/OpenCL model,
or for bragging rights (like how a Jetson has 192 cores!).

------
bryanlarsen
For gaming, it's an estimated 24% faster than a 1080 for twice the price.

[http://www.anandtech.com/show/10325/the-nvidia-geforce-
gtx-1...](http://www.anandtech.com/show/10325/the-nvidia-geforce-
gtx-1080-and-1070-founders-edition-review)

~~~
baq
does it have a place where you can plug a display into...?

~~~
jsheard
Yes, the spec sheet says it has DisplayPort 1.4, HDMI 2.0b and Dual-Link DVI.

[http://www.geforce.co.uk/hardware/10series/titan-x/](http://www.geforce.co.uk/hardware/10series/titan-x/)

The Tesla cards are the ones with no display outputs.

~~~
astrodust
Nikola Tesla was a titan, so the confusion is natural.

------
stephanheijl
I'm looking forward to seeing this in stores, as I've wanted to build a new
machine learning rig for some time. The GTX 1080 just didn't seem like it
would do the trick, with ostensibly limited software support and all.

I'm specifically wondering about FP16 handling though. Half-precision FLOPS
are never mentioned in the blog, nor on the NVidia page. It would be a shame
if the FP16 units on this card were gimped in the same way as the GTX 1080...

~~~
nitrogen
The GTX1080 still isn't available anyway, unless you pay an exorbitant sum to
a scalper.

~~~
Jach
Funny, my friend had no issue getting two of them. The only impossible-to-get
model is Gigabyte's Xtreme version, and it's not even the 'best'.

~~~
nitrogen
From which vendor, may I ask?

~~~
Jach
Zotac Extreme.

~~~
nitrogen
Cool. Who has it in stock?

------
jesperhh
Interesting that this does not have HBM2 memory. Apparently that will only be
for the Tesla parts of the Pascal generation, unless they are going to put it
on the 1080 Ti, which does not seem likely when the Titan does not have it.

~~~
gbrown_
I don't see this showing up in consumer parts soon. It seems Nvidia can get by
with GDDR5(X) outside of the HPC space for this generation, which is good for
cost and also reduces the risk, in terms of reliability, of throwing lots of
new technology into a single product.

HPC obviously has different requirements, but Nvidia can work with integrators
and customers with less of a backlash when fixing issues in this segment.

------
nl
This is so great.

NVidia actually cares about research, researchers and the scientific computing
market.

Next time someone complains about the lack of OpenCL support in yet another
framework, remember how much work NVidia puts into supporting people who use
their cards for scientific computing, and how they listen to them.

~~~
Cybiote
Microsoft also carefully listened to developers while building DirectX, and by
versions 8 and especially 9, it really showed. But only Windows benefited from
this. Having control of important GPU tech so strongly centered on a single
company is never a good idea; it sets up a conflict of interest.

Something like OpenCL does not face the conflict of interest nVidia would face
in porting core APIs across a wide set of competing technologies. With CUDA,
nVidia prioritizes itself above AMD, Intel, FPGAs, and whatever parallel
compute technology the future holds.

~~~
nl
Oh, I agree 100%.

But the truth is that without the hardware vendors putting significant
resources into OpenCL it just isn't competitive and won't be until that
happens.

The truth is that most of the work in Deep Learning is developing new NN
architectures and other algorithmic optimisations. If you are working in the
field, there is no reason to put up with second-class support from non-NVidia
vendors - just build in TensorFlow, Torch or a couple of other frameworks and
wait for the day (one day, we are promised!) when OpenCL is competitive. Then
the framework backends get ported, your code keeps running the same, and it
can run on all those other architectures.

Everyone has been waiting for that day since One Weird Trick[1]. There isn't
really anything to indicate it is getting closer, and AMD's dismissal of
NVidia as "doing something in the car industry"[2] doesn't give me a lot of
confidence.

Anyway, I hope I'm wrong. Maybe Intel will step up.

[1] [https://arxiv.org/abs/1404.5997](https://arxiv.org/abs/1404.5997)

[2] [http://arstechnica.co.uk/gadgets/2016/04/amd-focusing-on-
vr-...](http://arstechnica.co.uk/gadgets/2016/04/amd-focusing-on-vr-mid-range-
polaris/)

------
imtringued
Intel already surpassed the old Titan X with their Xeon Phi Knights Landing
CPUs, which have a peak performance of up to 7 TFLOPS. It was about time that
Nvidia released a new Titan X.

~~~
vegabook
yeah but where can you actually buy the Xeon Phi KL? I only see dev-programme
versions. I would love to pick one up.

~~~
lorenzhs
The webshops where I've seen it say it will be available from August 9th, so
hopefully it won't be long.

------
hkhall
As this thread is filled with people who know way, way more about CUDA and
OpenCL than I do, I hope you will indulge a serious question: I get that
graphics cards are great at floating point operations, and that bitwise binary
operations are supported by these libraries, but are GPUs similarly efficient
at them?

Some background: I occasionally find myself doing FPGA design for my doctoral
work, and am realizing that the job market when I finish may be better for me
if I am fluent in GPGPU programming, as it is easier to build, manage, and
deploy a cluster of such machines than the same for FPGAs.

My current problem has huge numbers of XOR operations on large vectors, and if
OpenCL or CUDA could be learned and spun up quickly (I have a CS background), I
may be inclined to jump aboard this train versus buying another FPGA for my
problem.
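
For concreteness, the kind of kernel this boils down to would look roughly like
the following (a sketch only; the names and the choice of 64-bit words are
placeholders, and a kernel like this is usually limited by memory bandwidth
rather than by the XORs themselves):

    // Elementwise XOR over large bit vectors packed into 64-bit words,
    // launched as e.g. xorVectors<<<(n_words + 255) / 256, 256>>>(...).
    __global__ void xorVectors(const unsigned long long *a,
                               const unsigned long long *b,
                               unsigned long long *out,
                               size_t n_words) {
        size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
        if (i < n_words) {
            out[i] = a[i] ^ b[i];  // bitwise ops run on the GPU's integer units
        }
    }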

~~~
ldargin
AMD GPUs have a reputation for speedy integer operations, which are
essentially bit-wise operations, so they are often chosen for bitcoin mining.
So you might want to consider learning OpenCL, since CUDA runs only on NVidia
cards.

~~~
sounds
I've spent a lot of time using both OpenCL and CUDA, and I would recommend
CUDA not because I like NVidia as a company, but because your productivity
will be so much higher.

NVidia has really invested into their developer resources. Of course, if your
time to write code and debug driver issues isn't that important, then an AMD
card using OpenCL might be the right choice.

(I'll try to be honest about my bias against NVidia, so you can more
accurately interpret my suggestions. I think along the lines of Linus Torvalds
with regard to NVidia... [http://www.wired.com/2012/06/nvidia-linus-
torvald/](http://www.wired.com/2012/06/nvidia-linus-torvald/) )

------
epaulson
For comparison, the first supercomputer that was in the Teraflops range and
that was available outside of the nuclear labs was the Pittsburgh Terascale
machine.

[http://www.psc.edu/publicinfo/news/2001/terascale-10-01-01.h...](http://www.psc.edu/publicinfo/news/2001/terascale-10-01-01.html)

It cost $45M, and peaked at 6 Teraflops. (I think that was on 32 bit floats,
but I can't find the specs. It might have been 64 bit floats)

"Total TCS floor space is roughly that of a basketball court. It uses 14 miles
of high-bandwidth interconnect cable to maintain communication among its 3,000
processors. Another seven miles of serial, copper cable and a mile of fiber-
optic cable provide for data handling.

The TCS requires 664 kilowatts of power, enough to power 500 homes. It
produces heat equivalent to burning 169 pounds of coal an hour, much of which
is used in heating the Westinghouse Energy Center. To cool the computer room,
more than 600 feet of eight-inch cooling pipe, weighing 12 tons, circulate up
to 900 gallons of water per minute, and twelve 30-ton air-handling units
provide cooling capacity equivalent to 375 room air conditioners."

------
ucaetano
It will be interesting to see how this compares to Google's custom TPU:
[https://cloudplatform.googleblog.com/2016/05/Google-
supercha...](https://cloudplatform.googleblog.com/2016/05/Google-supercharges-
machine-learning-tasks-with-custom-chip.html)

NVidia is still taking the one-size-fits-all approach to AI and graphics;
maybe it is time to develop AI-specific hardware.

~~~
melling
Isn't this simple economics? The R&D cost is amortized over a larger market.
You can always try to build specialized chips but the market might not be
nearly as big so there's a lot less money. The general purpose market still
moves faster.

~~~
ucaetano
Given that the market is big enough for a single player to develop a chip for
internal usage, minimum efficient scale is probably not an issue.

I'd guess NVidia sells far more cards than Google will produce TPUs.

------
dbalan
HN hug of death. Google cache here:
[https://webcache.googleusercontent.com/search?q=cache:https%...](https://webcache.googleusercontent.com/search?q=cache:https%3A%2F%2Fblogs.nvidia.com%2Fblog%2F2016%2F07%2F21%2Ftitan%2Dx%2F)

~~~
clw8
It baffles me how a company as large as Nvidia could get death hugged by HN. I
guess people are just that excited for Titan X.

~~~
anc84
The company is fine; just one of their blogs is down. It's WordPress; maybe
they have bad or no caching.

------
p1esk
Does anyone know what this new INT8 instruction does?

~~~
nabla9
NVIDIA says: 44 TOPS INT8 (new deep learning inferencing instruction)

I think it's related to storing float arrays as arrays of 8-bit integers in
memory and converting them into floats just before use. It's 2x more space
efficient than fp16.

[http://www.kanadas.com/program-e/2015/11/method_for_packing_...](http://www.kanadas.com/program-e/2015/11/method_for_packing_int8_arrays.html)
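
A minimal sketch of that storage idea (int8 in memory, float in registers; the
scale factor and names are my own assumptions, and this is not confirmed to be
what the new instruction actually does):

    // Widen 8-bit integers to floats just before use; storage costs 1 byte
    // per value, half of fp16.
    __global__ void dequantize(const signed char *q, float *out, float scale,
                               int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            out[i] = scale * (float)q[i];
        }
    }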

~~~
astrodust
You can get away with 8-bit floats on many neural network type applications so
that might be the idea here.

~~~
bitfiddler0
It's an 8-bit int type though. Perhaps even that's sufficient for inference?

~~~
astrodust
For a lot of network types that will do the job. 2x the performance is often
better than 2x the fidelity. You can get higher accuracy with more nodes.

------
mrb
Slightly more tech details at [http://wccftech.com/nvidia-geforce-gtx-titan-x-
pascal-unleas...](http://wccftech.com/nvidia-geforce-gtx-titan-x-pascal-
unleashed/) Notably: 1.53GHz is the boost clock, while the base clock is
1.41GHz.

Does anyone know the performance of the half-precision units (16-bit floating
point)? It's probably 1/64 the FP32 rate, but Nvidia may have been generous
and uncapped it at 2× FP32 like GP100, which would be a big difference (128×
factor!)

------
zk00006
Can we expect the price of "old" Titan X to drop after this announcement? I
would like to upgrade 4x SLI.

~~~
dogma1138
If you have a 40-PCIe-lane CPU (and no NVMe drives), was the price the thing
that actually held you back? The scaling past 2 cards is also pretty horrible.
Depending on how much of a performance increase you need, it would most likely
still be cheaper to get 2 new Titan X's instead of upgrading your Maxwell
cards, if you can offload them at $400-500.

~~~
akiselev
It would probably be pretty useless for gaming beyond 2x or 3x SLI, but
depending on whether the CUDA work you need to do is compute or memory (not
bandwidth) bound, it can still make a difference worth the cost.

~~~
dogma1138
Well, SLI has nothing to do with CUDA; you can have as many CUDA-capable cards
as you want, they don't even have to be from the same model/generation, and you
can use them all. Since he mentioned SLI, I assumed it was for gaming, since
that's the only thing that actually limits you. That said, the number of PCIe
lanes is still a problem. This is why people who work on compute opt (well, are
forced) to use Xeon parts with QPI to get enough PCIe lanes, because the number
of lanes available on standard desktop parts (including the PCH) is pretty
pathetic, even if you go with the full 40-PCIe-lane E-series CPUs. This can be
even worse if you want to use SATA Express or NVMe drives, since they also use
your PCIe lanes, as do a few other things like M.2 wireless network cards
(pretty common these days on upper-midrange and high-end motherboards),
Thunderbolt, and so on.

~~~
akiselev
To repeat what I just said with slightly different wording: Yes, my point was
specifically referring to the difference between users who want the Titan X
for supercomputing versus those who want it for gaming. You can have as many
CUDA cards as you want but you face the same problem of limited PCIe bandwidth
just like you do with gamers who want SLI. If your use case is CUDA and not
gaming, then the usefulness of four Titans depends entirely on whether your
algorithms are limited by memory, memory bandwidth, or computing power.

------
steve19
Annoying Apple-style branding: "new Titan X" and "previous Titan X".

X2 would have been easier all round.

~~~
anoother
I agree, but X2 already holds the connotation of a dual-GPU card.

~~~
setrofim_
Could have gone with "TITAN XX" then. The added advantage is that the next one
would be "TITAN XXX", and think of the marketing opportunity there!

~~~
earthnail
The Titan XXX will be a breakthrough in VR.

------
dharma1
Fp16?

~~~
quantumhobbit
While we're at it, Fp64?

------
ipunchghosts
What is the FP64 performance???

~~~
ipunchghosts
[https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_proces...](https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units)

Looks like it's still crap :(

Long live the titan black!

------
akhilcacharya
>$1200

Well I'm glad I got my 1080 now.

------
314
About 50-60% faster than a 980 Ti / old Titan. Quite pedestrian for a
generation of cards.

~~~
zeroer
Do you think we're in the mid-2000s?

~~~
314
Do you think this is any different from the trend in the past five generations
from Nvidia?

~~~
dogma1138
Yes; gen to gen was a 15-20% overall improvement (which is also questionable,
because when a new gen arrives the previous gen all of a sudden starts losing
performance in games; it was really bad between Kepler and Maxwell, where the
780/780 Ti suddenly lost 5-15% performance in certain games with the initial
"Maxwell drivers").

~~~
314
Depends on which figures you look at.
[http://www.techspot.com/article/928-five-generations-
nvidia-...](http://www.techspot.com/article/928-five-generations-nvidia-
geforce-graphics-compared/) has a reasonable comparison. Varies between 20-50%
over the range of that sample. The increase in performance between the
980/1080 or the titans does not seem out of the "ordinary".

~~~
dogma1138
Look at the final trend chart:

780 Ti > 980 = ~15% increase
780 > 780 Ti = ~12% increase
680 > 780 = ~27% increase (they potentially should have used the 690 as the
reference, even though it was a dual-GPU card)

For the most part, in generations where there wasn't a nearly two-year gap, and
where there wasn't a huge change in GPU memory type or a major architecture
change (like dropping Fermi's hardware scheduler for Kepler in order to save
silicon space for things that actually matter(ed) for games at the time), there
isn't a 50% increase gen to gen. A 15-20% increase gen to gen between cards at
comparable price points (Nvidia has been making this harder lately by charging
$150-200 more per card, effectively bumping up each price point) is what you
should be expecting.

