
Nvidia Pascal GPU to Feature 17B Transistors and 32GB HBM2 VRAM - cma
http://wccftech.com/nvidia-pascal-gpu-17-billion-transistors-32-gb-hbm2-vram-arrives-in-2016/
======
gwern
> With 8Gb per DRAM die and 2 Gbps speed per pin, we get approximately 256
> GB/s bandwidth per HBM2 stack. With four stacks in total, we will get 1 TB/s
> bandwidth on NVIDIA’s GP100 flagship Pascal which is twice compared to the
> 512 GB/s on AMD’s Fiji cards and three times that of the 980 Ti’s 334GB/s.

> The Pascal GPU would also introduce NVLINK which is the next generation
> Unified Virtual Memory link with Gen 2.0 Cache coherency features and 5 – 12
> times the bandwidth of a regular PCIe connection. This will solve many of
> the bandwidth issues that high performance GPUs currently face.

To point out the obvious, this sounds like it could be fantastic for deep
learning. Not only is the RAM big enough to hold a lot of current datasets,
it'll also help alleviate the latency bottlenecks in updating parameters.
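
To make the quoted arithmetic concrete, here's a rough sketch (the
1024-bit-per-stack interface width is an assumption taken from the HBM spec,
not from the article):

```python
# Back-of-the-envelope check of the quoted HBM2 numbers.
# Assumption: each HBM2 stack exposes a 1024-bit interface (per the HBM spec).
pins_per_stack = 1024        # interface width in bits per stack
pin_rate_gbps = 2            # 2 Gbps per pin, as quoted
stacks = 4
die_capacity_gb = 8          # 8 Gb per DRAM die, as quoted
dies_per_stack = 8           # assuming 8-Hi stacks

stack_bw = pins_per_stack * pin_rate_gbps / 8                   # Gb/s -> GB/s
total_bw = stack_bw * stacks
total_capacity = die_capacity_gb * dies_per_stack / 8 * stacks  # Gb -> GB

print(stack_bw)         # 256.0 GB/s per stack
print(total_bw)         # 1024.0 GB/s, i.e. ~1 TB/s
print(total_capacity)   # 32.0 GB of VRAM
```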

~~~
bd
Yup, this is exactly the marketing angle Nvidia's CEO was using in his GPU
Technology Conference presentation (a claimed 10x speed-up on deep learning vs.
their current Maxwell architecture):

[http://blogs.nvidia.com/blog/2015/03/17/pascal/](http://blogs.nvidia.com/blog/2015/03/17/pascal/)

~~~
ak217
This also seems to be the motivation behind the already available Titan X with
its 12GB of RAM. Some of the gaming hardware reviewers are scratching their
heads as to why anyone would need 12GB attached to one die. I chuckled when I
saw those reviews that were totally oblivious to the deep learning
applications.

~~~
deelowe
The community as a whole is completely oblivious. It's pretty funny to see the
youtube reviewers get all worked up over how nvidia and amd are going at it
again and such. As if gaming is what's driving this battle. Anyone working in
computer science knows the battle is over machine learning, not first person
shooters.

Soon headless, socketed solutions will be the preferred form factor for HPC. I
imagine the desktop and server product lines will diverge at that point. It'll
be interesting to see what happens to PC gaming then.

~~~
ehvatum
The gaming hardware junkie community is aware, if only as a result of most
Titan Black/Z reviews arriving at the conclusion, "a gamer's money is better
spent elsewhere, as all a Titan really provides over the latest consumer
offerings is good workstation performance (double precision floating point)."

For reference (double-precision floating point):

Titan Z: 2707 GFLOPS

ATI Radeon HD 5870 (released Sept. 2009): 544 GFLOPS

Titan X (current generation; NVidia recognized that gamers don't value DP
performance): 192 GFLOPS

I sometimes wonder how it came to be that the Titan Z preceded the Titan X...

E: to clarify, review sites are aware that STEM people need double precision
and 6/12GiB GPU memory for _something_

------
blinkingled
AMD, get it together, pretty please. Nobody wants to live in a world where
only NVidia produces discrete GPUs!

That 32GB seems too high for the cost-constrained consumer market, though -
maybe they will have a leaner variant for desktops/gaming.

~~~
bryanlarsen
It could be argued that AMD does have it together. If you look at the
benchmarks for pretty much any particular price point, AMD & nVidia cards are
roughly comparable. AMD is better at some games and resolutions, and nVidia is
better at others. Rarely is the difference more than 10%. nVidia used to have
a sizable power/noise advantage, but Fury appears to close that gap. AMD even
seems to be closing the driver stability gap.

But the Internet effect on the video card market is huge. There are hundreds
of online benchmarks, huge fanboy communities, et cetera. The benchmarks
magnify small differences causing a winner-take-all effect. The communities
create a bandwagon effect. It's a wonder that AMD maintains the share that it
does.

Yes, AMD is obviously #2, but it should be a close #2. In most markets close
#2 is not a bad position to be in.

~~~
raverbashing
Not sure about their graphics division, but AMD has stopped innovating in the
CPU department...

~~~
bryanlarsen
So has Intel. Skylake (2015) and Sandy Bridge (2011) have roughly comparable
single-core desktop benchmarks. Performance per watt has improved drastically,
but that's mostly because of process improvements. AMD has been stuck on 28nm
for a _long_ time, but that's not AMD's fault.

~~~
cfallin
> Skylake (2015) and Sandy Bridge (2011) have roughly comparable single-core
> desktop benchmarks.

That's not quite true -- Intel aims for a big IPC (instructions/clock)
improvement for each "tock" generation (Nehalem -> Sandy Bridge -> Haswell ->
Skylake), and IIRC has pretty much delivered. Some benchmarks are really hard
to push because they're memory/cache-miss bound (so it's really just about
throwing in more memory channels and clocking them up), but a lot of things
have gotten seriously better for tricky integer code, especially in e.g.
branch prediction/uop cache in the frontend and available execution ports in
the backend.

~~~
hajile
[http://arstechnica.com/gadgets/2015/07/intel-confirms-tick-t...](http://arstechnica.com/gadgets/2015/07/intel-confirms-tick-tock-shattering-kaby-lake-processor-as-moores-law-falters/)

While the headline is hyperbole, the fact is that Intel has tacitly admitted
that newer process nodes are taking more time to achieve.

------
hyperpallium
A 4096-bit memory interface is surely supercomputer territory.

My ZX81 only had 8192 bits of _memory_ (1KB).

~~~
onion2k
And it could do 3D graphics. Sort of. 3D Monster Maze:
[https://www.youtube.com/watch?v=nKvd0zPfBE4](https://www.youtube.com/watch?v=nKvd0zPfBE4)

~~~
foobarge
Nice - who needs 8K. 8. K. I haven't even updated to 4K. I don't think I have
anything that runs 1080p.

~~~
shasheene
Virtual reality would benefit a lot from 16000x16000 PER EYE, rendered at 120+
frames per second with 10,000Hz eye tracking for foveated rendering (120fps
may be too low for that, though).

For the goal of full VR, computing tech has a long way to go.
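
A back-of-the-envelope sketch of what that would take, assuming uncompressed
4-byte pixels and ignoring the savings from foveated rendering:

```python
# Raw pixel throughput for 16000x16000 per eye at 120 fps.
# Assumptions: 4 bytes per pixel, no compression, no foveation.
width = height = 16000
eyes = 2
fps = 120
bytes_per_pixel = 4

pixels_per_second = width * height * eyes * fps
bytes_per_second = pixels_per_second * bytes_per_pixel

print(pixels_per_second / 1e9)   # ~61.4 gigapixels per second
print(bytes_per_second / 1e9)    # ~246 GB/s just for the raw frames
```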

------
whoisthemachine
> The 17 Billion transistors on the Pascal GPU are twice the transistors found
> on the GM200 Maxwell and the Fiji XT GPU core which is literally insane

Seems like Moore's law is alive and well in the graphics/attached processor
space.

~~~
icegreentea
That's because GPUs have been frozen at the 28nm node since about 2011. It'll
be ~4 years of no die shrinks at the top end of GPUs when they finally
transition to 16nm. If anything, Moore's law is behind schedule in the GPU
space.

Note that both NVidia and AMD rely on TSMC to manufacture their chips, so
they're completely constrained by TSMC's ability to implement new process
nodes.

~~~
varelse
2011: GTX 580, 1.5 TFLOPS

2012: GTX 680, 3.0 TFLOPS (~2.0 attainable)

2013: GTX Titan, 4.4 TFLOPS (~3.2 attainable)

2014: GTX 980, 4.6 TFLOPS

2015: GTX Titan X, 6.7 TFLOPS

Looks to me like they're doubling perf roughly every 2 years.

Meanwhile, my Core i7-5930K's SOL (theoretical peak) is <1/2 of 2011's GTX
580, at 672 GFLOPS, and it still doesn't have fast approximate
transcendentals. Skylake begins to fix this, but c'mon, GPUs have had these
for almost a decade now...
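
A quick sketch of where those numbers come from (the doubling time is just fit
to the endpoints above, and the 672 GFLOPS assumes single-precision FMA on
both AVX2 ports):

```python
import math

# Implied doubling time from the endpoints above:
# GTX 580 (2011, 1.5 TFLOPS) to GTX Titan X (2015, 6.7 TFLOPS).
years = 2015 - 2011
doubling_time = years * math.log(2) / math.log(6.7 / 1.5)
print(doubling_time)   # ~1.9 years

# Core i7-5930K theoretical peak ("SOL"), assuming 2 x 256-bit FMA units:
# 8 single-precision lanes * 2 ops (mul + add) * 2 ports = 32 FLOP/cycle/core.
cores = 6
base_clock_ghz = 3.5
flop_per_cycle_per_core = 8 * 2 * 2
print(cores * base_clock_ghz * flop_per_cycle_per_core)   # 672.0 GFLOPS
```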

~~~
CHY872
Moore's law isn't about performance, it's the number of transistors on a
single IC. Perf is related but irrelevant to the discussion.

~~~
varelse
GTX 580: 3.0B transistors

GTX 680: 3.5B transistors

GTX Titan: 7.1B transistors

GTX 980: 5.2B transistors

GTX Titan X: 8B transistors

Core i7-5930k: 2.6B transistors

What the data above suggests to me is that relying solely on Moore's Law to
predict performance is a fool's errand. Going forward, process transitions are
obviously slowing down and IMO victory will go to those who make the best use
of the available transistors. Just like programmers who make the best use of
the caches and registers in these processors get dramatically better
performance than those who can't be bothered to even think about such things.

Intel's business strategy of backwards compatibility is a giant albatross for
them here, in that they spend a lot of transistors on it, but it's clearly
profitable otherwise. In contrast, while GPUs are mostly backwards-compatible,
they usually oops I meant nearly always oops I meant always need some
refactoring to hit close to peak performance. But that has usually led to ~2x
performance improvements per generation so far.

Whenever someone complains about having to do this, I ask them whether they'd
prefer hand-coded assembler inner loops for maximally exploiting
SSE/SSE2/SSE3/SSE4/AVX2/AVX512. Usually, I get some dismissive remark about
leaving that to the compiler. Good luck with that plan IMO.

~~~
CHY872
Just to nitpick, backwards compatibility isn't really a huge issue for Intel.
Most of the really old stuff that's a pain to maintain can be shoved in
microcode; compilers won't emit those instructions.

There are obvious downsides to the architecture, but the need to be backwards
compatible shouldn't hurt it too much.

GPU workloads are very different in that generally you don't have to look
particularly hard to find a bunch of parallelism that you can exploit (if you
did, your code would run terribly); so you can generally gain a load of
performance by just scaling up your design.

CPUs are super restricted by the single threaded, branching nature of the code
you run on them, and this is what makes CPU performance a little more nuanced,
and not directly comparable.

~~~
arcticbull
That's not really true; backwards compatibility on x86 architectures takes a
tremendous amount of power and die space, and the 'throw it in microcode'
solution only partially mitigates this issue.

A paper
([http://www.ic.unicamp.br/~ra045840/cardoso2013wivosca.pdf](http://www.ic.unicamp.br/~ra045840/cardoso2013wivosca.pdf))
states that a mostly-microcode solution would still require 20% of the die
area to be dedicated solely to microcode ROM.

I can't remember where I read it but something like 30+% of an Intel CPU die
area/power consumption is due to the x86 ISA. Apparently the original Pentium
CPU was 40% instruction decoding by die area. And the ISA has grown enormously
since then.

------
samch
Now, if only I could get a MacBook Pro with 32GB of system memory... Does
anybody happen to know why that is such a difficult thing to achieve? It's
amazing that, starting next year, there will be mainstream graphics cards with
more memory than top-of-the-line laptops.

~~~
kozukumi
I don't think you could get 16GB SO-DIMMs until earlier this year, could you?
I know you can get 32GB in a W550s, which is probably the thinnest/lightest
laptop capable of 32GB.

~~~
rincebrain
Can you? I have a W550s and didn't see anyone hawking 16GB SODIMMs even
earlier this year.

~~~
kozukumi
Yeah, a colleague ordered one about 2 weeks ago. Lovely machine, and Lenovo
have done an amazing job with the weight. I still prefer the T series over the
W series, as I don't need that much power and prefer the much lighter machine.

------
raverbashing
I'm waiting for comparisons between NVidia's offerings and the Xeon Phi in
real benchmarks.

~~~
semi-extrinsic
I believe the Xeon Phi is doing quite badly in that comparison, so you don't
see much benchmark trumpeting from Intel. Here's one from Nvidia though (so
take it with a pinch of salt or two), showing a 2x - 5x advantage of Tesla K80
over Knights Corner:

[http://www.nvidia.com/object/justthefacts.html](http://www.nvidia.com/object/justthefacts.html)

------
mtarnovan
"The 17 Billion transistors on the Pascal GPU are twice the transistors found
on the GM200 Maxwell and the Fiji XT GPU core which is literally insane. "

[https://xkcd.com/725/](https://xkcd.com/725/)

~~~
chillingeffect
[https://search.yahoo.com/yhs/search?p=literally+now+means+fi...](https://search.yahoo.com/yhs/search?p=literally+now+means+figuratively)

;)

------
Zekio
Looks like the next few generations of GPUs are gonna be fast!

~~~
norea-armozel
I think these may be the specs for their Tesla cards. I doubt they would put
32GB of HBM2 even on a Titan unless they can keep the price point the same
(even folks who can spend 1k USD have limits on their budget, or at least
their perception of a budget). But I wouldn't doubt that by Q2 2016 Nvidia
will be bringing more competition to the market with HBM-based chips.

IMO, AMD pulled the trigger a tad too soon on their HBM cards, but to be fair
their last line of R9 and R7 cards weren't interesting at least to me (and I
have a Radeon HD 7870). So, they had to go first to get their customer base
excited for the future.

~~~
noir_lord
It surprises me that RAM hasn't increased more. The machine I'm on is 5 years
old, and its GPU (an HD 6950, which I paid ~200 quid for) has 2GB of RAM.

~~~
icegreentea
Memory bandwidth/throughput has been a far larger bottleneck on gaming
performance than total video memory for the last few years. They've been
trying to deal with it by using ever faster and slightly wider memory
interfaces, but now they've hit the wall, hence the move to HBM.
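
A rough sketch of that wall, assuming a 980 Ti-class 384-bit GDDR5 bus at
7 Gbps versus Fiji-style stacks of four 1024-bit HBM interfaces:

```python
# Peak bandwidth = bus width (bits) * per-pin data rate (Gbps) / 8 -> GB/s
def bandwidth_gbs(bus_width_bits, pin_rate_gbps):
    return bus_width_bits * pin_rate_gbps / 8

print(bandwidth_gbs(384, 7))        # ~336 GB/s: 384-bit GDDR5 at 7 Gbps
print(bandwidth_gbs(4 * 1024, 1))   # 512 GB/s: first-gen HBM at 1 Gbps
print(bandwidth_gbs(4 * 1024, 2))   # 1024 GB/s: HBM2 at the quoted 2 Gbps
```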

------
rwmj
What are the parts marked "R125" around the edge?

~~~
bearbin
I believe they are inductors, part of a switch-mode power supply to reduce the
voltage of incoming power to the level required by the cores (AFAIK ~1V).

~~~
zokier
They definitely are inductors (being labeled "Lxx" on the PCB is a telltale
sign), and the small 48-ish pin packages near them are most likely SMPS
controllers. Seems a bit crazy that there are a total of 18 of them, but I
suppose that's what you gotta do when you are pushing possibly hundreds of
amps of current.
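
A rough sanity check on that current, assuming a ~250W board power and a ~1V
core rail (both ballpark guesses, not figures from the article):

```python
# Ohm's-law-level sketch of the current on the core rail and per VRM phase.
board_power_w = 250     # assumed high-end board power, not from the article
core_voltage_v = 1.0    # assumed core rail voltage (see comment above)
phases = 18             # phase count from the comment above

total_current_a = board_power_w / core_voltage_v
print(total_current_a)            # 250.0 A on the core rail
print(total_current_a / phases)   # ~14 A per phase
```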

~~~
steckerbrett
Makes the normal buck converter arrangement too. Controller, inductor, diode.
The GPU will be taking like half a volt so things are definitely going to be
getting toasty.

~~~
zokier
Those big two-legged components look like caps to me though.

------
CyberDildonics
This is not a real site; they have a review from 4 months ago benchmarking a
990 Ti 24GB card with 'minesweeper' and 'solitaire'.

~~~
largote
Was that review written on April 1st?

~~~
CyberDildonics
It must have been. They mixed in it with all their other headlines and the
date was '4 months ago' so it wasn't labeled April 1st. All their other
stories are shaky rumors so I thought the whole site was joke articles
predicting extrapolating the next press release.

------
Lejendary
Why so much VRAM? Is that really necessary?

------
curiousjorge
I've been looking at the GTX 750 Ti, which has a Maxwell processor in it. I
don't care too much about playing games on ultra, but the sheer quietness,
cool temperature under load, and low power consumption are what attract me to
it. I wonder if the new Pascal GPUs will make room for more such low-power,
quiet cards that pack plenty of punch?

