
NVIDIA Announces the GeForce GTX 1000 Series - paulmd
http://anandtech.com/show/10304/nvidia-announces-the-geforce-gtx-1080-1070
======
onli
This article buys into the hype. It reads like they just copied statements
from the press release. The article on AnandTech[0] is a lot better.

Also, do not forget that benchmarks made by Nvidia mean nothing. How fast the
cards really are will only be clear once independent people run real
benchmarks. That's true for games, and it will be just as true for all the
speculation found here about the ML performance.

[0] [http://anandtech.com/show/10304/nvidia-announces-the-geforce...](http://anandtech.com/show/10304/nvidia-announces-the-geforce-gtx-1080-1070)

~~~
varelse
Fair enough, but when neither Intel nor AMD is bothering to put out
competitive software/hardware solutions for deep learning, which is apparently
going to be a ~$40B market by 2020, I predict continued dominance here. That
said, since AMD is getting beaten by a factor of 2-3 in performance while
Intel's Accelerator/FPGA solutions are getting beaten by a factor of 5-10, AMD
has a real opportunity to be the competitor to NVIDIA here.

And despite questions about it, GTX 1080 will be a fine deep learning board,
especially if the framework engineers get off their lazy butts and implement
Alex Krizhevsky's one weird trick algorithm, an approach which allows an 8-GPU
Big Sur with 12.5GB/s P2P bandwidth to go toe to toe with an 8-GPU DGX-1 on
training AlexNet.

[http://arxiv.org/abs/1404.5997](http://arxiv.org/abs/1404.5997)

~~~
jhj
That's just a combination of GPU model and data parallelism, which have been
supported for about 2 years in Torch among others. You can add in the outer
product trick for fc layers as well.

[https://github.com/soumith/imagenet-multiGPU.torch/blob/mast...](https://github.com/soumith/imagenet-multiGPU.torch/blob/master/models/alexnetowt.lua)

NCCL is in Torch as well, but it doesn't always win, and its use of persistent
kernels has some weird interactions with streams and other such things.

[https://github.com/torch/cunn/blob/master/DataParallelTable....](https://github.com/torch/cunn/blob/master/DataParallelTable.lua)

However, this feels more like a benchmarking thing now. Networks tend to be
over-parameterized and redundant in many ways; the action has been in sending
less data with much deeper networks that have fewer parameters (e.g., residual
networks, GoogLeNet, etc.), or in non-synchronous forms of communication among
workers. Trying to squeeze out every last drop of GPU-to-GPU bandwidth is not
as important as iterating on the architecture and learning algorithms
themselves.

~~~
varelse
I don't use NCCL, I wrote my own ring collectives after reinventing them in
2014 (I felt really clever for a whole 20 minutes before a Google search
popped my balloon) to avoid working with a vendor that was trying to gouge us
on servers: they beat NCCL by ~50%.
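
For the curious, the ring idea is compact enough to sketch. Here's a toy
single-process simulation of a ring allreduce (reduce-scatter, then allgather)
in plain C++; on real hardware each per-step copy would be a P2P transfer
(e.g. cudaMemcpyPeerAsync) between neighboring GPUs:

    // Toy ring allreduce over N "ranks": after 2*(N-1) steps every rank
    // holds the elementwise sum of all buffers, and each step only moves
    // 1/N of the buffer to the next neighbor on the ring.
    #include <cstdio>
    #include <vector>

    int main() {
        const int N = 4, CHUNK = 2;            // 4 ranks, 2 floats per chunk
        std::vector<std::vector<float>> buf(N, std::vector<float>(N * CHUNK));
        for (int r = 0; r < N; ++r)
            for (float& x : buf[r]) x = r + 1; // toy data: rank r holds r+1

        // Phase 1, reduce-scatter: in step s, rank r sends chunk (r-s) to
        // rank r+1, which accumulates it. After N-1 steps, rank r owns the
        // fully summed chunk (r+1) mod N.
        for (int s = 0; s < N - 1; ++s)
            for (int r = 0; r < N; ++r) {
                int dst = (r + 1) % N, c = ((r - s) % N + N) % N;
                for (int i = 0; i < CHUNK; ++i)
                    buf[dst][c * CHUNK + i] += buf[r][c * CHUNK + i];
            }

        // Phase 2, allgather: in step s, rank r forwards its finished
        // chunk (r+1-s) to rank r+1, which simply copies it.
        for (int s = 0; s < N - 1; ++s)
            for (int r = 0; r < N; ++r) {
                int dst = (r + 1) % N, c = ((r + 1 - s) % N + N) % N;
                for (int i = 0; i < CHUNK; ++i)
                    buf[dst][c * CHUNK + i] = buf[r][c * CHUNK + i];
            }

        printf("%g\n", buf[0][0]);             // 1+2+3+4 = 10 on every rank
    }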

That said, most people couldn't write them so I advise NCCL. You're the third
person to tell me NCCL has ish(tm), fascinating.

And sure, you could do the outer product trick. You could use non-
deterministic ASGD. And you can do a lot of weird(tm) tricks. But why do these
(IMO ad-hoc and potentially sketchy task-specific) things when there's an
efficient way to parallelize the original network in a deterministic manner
for training that allows performance on par with a DGX-1 server?

Because for me, the action is in automagically discovering the optimal
deterministic distribution of the computation so researchers don't need to
worry about it. And IMO that's where the frameworks fail currently.

------
neverminder
What really blew me away is that they went straight for DisplayPort 1.4
([https://en.wikipedia.org/wiki/DisplayPort#1.4](https://en.wikipedia.org/wiki/DisplayPort#1.4)),
which was only announced on 1 March 2016. DP 1.3 was approved on 15 September
2014, and as of today there are no cards supporting it except this one (with
backwards compatibility).

The bad news is that there's no news about new-gen displays that could take
advantage of this graphics card. I'm talking about 4K 120Hz HDR
([https://en.wikipedia.org/wiki/High-dynamic-range_rendering](https://en.wikipedia.org/wiki/High-dynamic-range_rendering))
displays. This is a total WTF - we have a graphics card with DP 1.4 and we
don't even have a single display with so much as DP 1.3...

~~~
memonkey
What cable transfers 4K at 120-144Hz? What are the benefits of DisplayPort
1.4 over something like dual-link DVI or the newest HDMI?

~~~
neverminder
DisplayPort 1.4 is a standard for both the socket and the cable, and it has
the bandwidth to carry 4K at 120Hz.

Advantages over other standards:
[https://en.wikipedia.org/wiki/DisplayPort#Advantages_over_DV...](https://en.wikipedia.org/wiki/DisplayPort#Advantages_over_DVI.2C_VGA_and_FPD-Link)

Also, another major advantage is USB Type-C compatibility
([http://www.displayport.org/what-is-displayport-over-usb-c/](http://www.displayport.org/what-is-displayport-over-usb-c/))

~~~
bawana
Why didn't video cables go optical, like TOSLINK did with audio?

~~~
stevep98
Thunderbolt was supposed to go optical. You can read about it here, under
"Copper vs. optical":

[https://en.wikipedia.org/wiki/Thunderbolt_(interface)](https://en.wikipedia.org/wiki/Thunderbolt_\(interface\))

~~~
seanp2k2
USB3 was also supposed to have fiber components. It's really hard to find
references for this online, since at CES 2008 they showed the connectors
without any optical stuff, but I found this:
[http://www.theregister.co.uk/2007/09/19/idf_usb_3_announced/](http://www.theregister.co.uk/2007/09/19/idf_usb_3_announced/)

Anyone know why they didn't end up going optical + copper for USB3?

------
mrb
Note that in terms of pure compute performance the new 16nm Nvidia GTX 1080
(2560 shaders at up to 1733 MHz = 8873 SP GFLOPS) barely equals the
performance of the previous-generation 28nm AMD Fury X (4096 shaders at up to
1050 MHz = 8602 SP GFLOPS). Of course the 16nm chip does so at a significantly
lower TDP (180 Watt) than the 28nm chip (275 Watt), so it will be interesting
to see what Nvidia can achieve at a higher thermal/power envelope with a more
high-end card... I am waiting impatiently to see how AMD's upcoming 14nm/16nm
Polaris chips will fare, but from the looks of it, Polaris will beat Nvidia in
terms of GFLOPS per watt.
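
For anyone checking those numbers, peak single-precision throughput is just
shaders x clock x 2 FLOPs per fused multiply-add; a quick sanity check:

    // Peak SP GFLOPS = shaders * clock (MHz) * 2 FLOPs per FMA / 1000
    #include <cstdio>

    int main() {
        auto gflops = [](int shaders, double mhz) { return shaders * mhz * 2 / 1000; };
        printf("GTX 1080: %.0f GFLOPS\n", gflops(2560, 1733)); // ~8873
        printf("Fury X:   %.0f GFLOPS\n", gflops(4096, 1050)); // ~8602
    }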

~~~
redtuesday
Yeah, it will be interesting to see how good AMD's Polaris chips will be,
especially when it comes to energy efficiency. Fury was a step in the right
direction (I wonder how much of that can be attributed to HBM's lower voltage,
etc.). But I'm even more interested in AMD's Vega - the high-end competitor to
Nvidia's GP100 that is equipped with HBM2 - and in whether AMD has an
advantage over Nvidia when it comes to implementing HBM, since they developed
it together with SK Hynix and have previous experience with Fury, or whether
that makes no difference.

Vega is also exciting because it should be released around when AMD's Zen CPUs
are released, and there were rumors about an HPC Zen APU with Vega and HBM.

~~~
mtgx
If we're discussing future GPUs, does anyone know what exactly AMD means by
"next-gen memory" when it talks about Navi?

At first I thought it meant that HBM2 would be available across their whole
lineup, but if that were the case, I think they would've just said so.

I've also read a bit about how Nvidia wanted to go with HMC (Hybrid Memory
Cube) instead of HBM, but it seems HMC is still at least twice as expensive
per GB, and they needed at least 16GB for their high-end GPUs; HMC isn't even
at 8GB yet. Intel also seems to have adopted HMC for some servers.

So is the "next-gen" memory HMC, or something else? AMD is supposed to come
out with it in 2018, hopefully at 10nm.

~~~
redtuesday
Wasn't the difference between HMC and HBM basically that HMC had good latency
comparable to normal main RAM like DDR3 plus good bandwidth (up to 400 GB/s),
while HBM had worse latency than DDR3 (like GDDR5) with a large bus and really
good bandwidth (1 TB/s)? It will indeed be interesting to see what AMD means
by next-gen memory.

~~~
Tuna-Fish
No. HMC is fully buffered and presents a packet-based interface to the CPU
it's connected to. This means the CPU doesn't need to know anything about the
memory it uses, and it saves die space on the CPU side, but it adds latency.
HBM has lower latency than HMC.

Also, GDDR5 does not have higher latency than DDR3. The memory arrays
themselves are the same and exactly as fast, and the bus doesn't add any extra
latency. GDDR5 doesn't trade latency for bandwidth, it trades density for
bandwidth.

Things built to use GDDR5 often have much higher latencies than things built
to use DDR3, but this has nothing to do with the memory and everything to do
with how the memory controllers in GPUs delay accesses, merging ones close to
each other to optimize for bandwidth.

~~~
redtuesday
Thank you for the detailed explanation. Always thought the added latency came
from GDDR5. Good to know.

------
gavanwoolery
Worth noting - IIRC the stream mentioned that it could do up to 16
simultaneous projections at little additional performance cost. This is
important for VR... a big part of the cost, when you are dumping many vertices
to the GPU, is performing a transform on each vertex (a four-component vector
multiplied by a 4x4 matrix).+ An even bigger cost comes from filling the
resulting polygons, which, if done in two passes (as is fairly common),
results in something that violates cache across the tiles that get filled. So,
in other words, it's expensive to render something twice, as is needed for
each eye in VR - from what they have shown, their new architecture largely
reduces this problem.

\+ This is a "small" part of the cost, but doing 5m polygons at 60 fps can
result in about 30 GFLOPS of compute for that single matrix operation alone
(in reality, there are many vertex operations and often many more fragment
operations).
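
To make the footnote concrete, here's roughly what that per-vertex work looks
like as a CUDA kernel (a sketch: one 4x4 matrix times a vec4 is 16 multiplies
+ 12 adds = 28 FLOPs, and at ~3 unshared vertices per polygon, 5m polygons x
60 fps x 3 x 28 FLOPs is ~25 GFLOPS, the same ballpark as the ~30 GFLOPS
above):

    // Sketch of the per-vertex transform under discussion: each thread
    // multiplies one vertex by a row-major 4x4 matrix (28 FLOPs).
    __global__ void transformVertices(const float4* in, float4* out,
                                      const float* m, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        float4 v = in[i];
        out[i] = make_float4(
            m[0]  * v.x + m[1]  * v.y + m[2]  * v.z + m[3]  * v.w,
            m[4]  * v.x + m[5]  * v.y + m[6]  * v.z + m[7]  * v.w,
            m[8]  * v.x + m[9]  * v.y + m[10] * v.z + m[11] * v.w,
            m[12] * v.x + m[13] * v.y + m[14] * v.z + m[15] * v.w);
    }

Rendering the second eye traditionally means running this (and the far more
expensive fill) all over again with a different matrix, which is exactly the
redundancy the multi-projection hardware targets.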

~~~
paulmd
I was thinking about this earlier - the transition from a warp of 32 to a warp
of 64 that Pascal supposedly made sounds _exactly_ like how you would
accelerate multiple projections (by at least a factor of 4).

edit: apparently Pascal still has a warp-size of 32.

~~~
pandaman
I am not sure I follow - how does processing 64 vertices instead of 32 at once
accelerate anything, provided the total number of ALUs is the same?

~~~
paulmd
A major concept in GPGPU programming is "warp coalescing".

Threads are executed an entire warp at a time (32 or 64 threads), and all
threads execute all paths through the code block - e.g. if ANY thread takes an
if-statement, ALL threads execute the if-statement; the threads where the
condition is false execute NOPs until the combined control flow resumes. This
applies recursively to nested branches (which is why "warp divergence/branch
divergence" absolutely _murders_ performance on GPUs).

When threads execute a memory load/store, they do so as a warp-wide bank of
requests, and the warp controller is designed to combine them whenever
possible. If the 32 threads request 32 _sequential_ words, it combines them
into 1 wide request. It can also handle a _strided_ pattern (thread N wants
index X+N*s for some fixed stride s), though large strides spread the warp
across more memory segments and cost more transactions. In other words, the
access doesn't have to be contiguous, but it does have to be uniform; and if
every thread requests the _same_ word, the result is simply broadcast to the
whole warp. Getting this right is a huge factor in accelerating compute loads.
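
The difference is easy to see in a pair of toy CUDA kernels (a sketch for
illustration, not a tuned benchmark):

    // Coalesced: thread k of a warp reads in[base + k], 32 consecutive
    // floats, so the warp's loads merge into a single wide transaction.
    __global__ void copyCoalesced(const float* in, float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];
    }

    // Strided: thread k reads in[k * stride]. With a large stride each
    // thread lands in a different memory segment, and the same warp can
    // generate up to 32 separate transactions.
    __global__ void copyStrided(const float* in, float* out, int n, int stride) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i * stride < n) out[i] = in[i * stride];
    }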

Having a warp-size of 64 hugely accelerates certain patterns of math,
particularly wide linear algebra.

edit: apparently Pascal still has a warp-size of 32.

~~~
pandaman
Wow, memory access on NVidia is pretty bad. AMD has a separate unit that
coalesces memory requests and goes through the cache, so if you do strided
loads, for example, the next load will likely be reading data cached by the
previous one, and it does not matter how many lanes are active. AMD has
64-wide "warps" (wavefronts) btw, and that does not seem superior to NV in
computation on the same number of ALUs.

~~~
paulmd
I did my grad research on disease transmission simulations on GPUs, so this is
super interesting to me. Could you please hit me with some papers or
presentations?

The NVIDIA memory model also goes through L1 cache - but that's obviously not
very big on a GPU processor (also true on AMD IIRC). Like <128 bytes per
thread. It's great if your threads hit it coalesced, otherwise it's pretty
meaningless.

~~~
pandaman
I program AMD chips in game consoles, so I use a different set of manuals, but
AMD has a lot of docs available to the public at
[http://developer.amd.com/resources/documentation-articles/de...](http://developer.amd.com/resources/documentation-articles/developer-guides-manuals/)

At a glance there is a lot of legacy stuff there, so I'd look at anything
related to GCN, Sea Islands and Southern Islands. Evergreen, R600-800 etc. are
legacy VLIW ISAs as far as I know.

~~~
SXX
There's also the fairly recent GCN3 ISA documentation from 2015 available,
which sheds light on their modern hardware architecture.

~~~
robbies
Well, sheds light on their Compute Unit architecture.

------
paperwork
Can someone provide a quick overview of the current GPU landscape?

There seem to be Nvidia's Pascal, GTX, Titan, etc., and something called
GeForce. And I believe these are all just from Nvidia.

I'm interested in building out a desktop with a GPU for: 1. learning
computation on GPUs (matrix math such as speeding up linear regression, deep
learning, CUDA) using C++11, and 2. trying out the Oculus Rift.

Is this the right card? Note that I'm not building production models; I'm
learning to use GPUs. I'm also not a gamer, but am intrigued by Oculus. Which
GPU should I look at?

~~~
paulmd
Do you need FP64? If so, right now the OG GTX Titan is the default choice - it
offers full double-precision performance with 6 GB of VRAM. There's nothing
better south of $3000.

If not, the 980 Ti or Titan X offer _excellent_ deep learning performance,
albeit _only_ at FP32. And their scheduling/preemption is not entirely there;
they may not be as capable of Dynamic Parallelism as Kepler was. The 780 Ti is
actually a more capable card in some respects.

The new consumer Pascal cards will almost certainly _not_ support FP64, NVIDIA
has considered that a Quadro/Tesla feature since the OG Titan. If DP
performance is a critical feature for you and you _need_ more performance than
an OG Titan will deliver, you want the new Tesla P100 compute cards, and
you'll have to convince NVIDIA you're worthwhile and pay a 10x premium for it
if you want it within the next 6 months. But they _probably_ will support
compute better, although you should wait for confirmation before spending a
bunch of money...

For VR stuff or deep learning, the consumer Pascal cards sound ideal. Get a
1070 or 1080, definitely. The (purportedly) improved preemption performance
alone justifies the premium over Maxwell, and the FP16 capability will
significantly accelerate deep learning (FP16 vs FP32 is not a significant
difference in overall net output in deep learning).
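
The FP16 win comes from packed math: ops on half2 values do two FP16
operations per instruction, where the hardware has the fast path. A minimal
CUDA sketch, assuming the cuda_fp16.h intrinsics (compute capability 5.3+):

    // Packed-half AXPY: each __half2 holds two FP16 values, so one
    // __hfma2 performs two fused multiply-adds.
    #include <cuda_fp16.h>

    __global__ void axpyHalf2(__half2 a, const __half2* x, __half2* y, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = __hfma2(a, x[i], y[i]);  // y = a*x + y, two lanes at once
    }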

~~~
viperscape
Nothing better south of $3k? Does that include the new AMD Radeon Pro Duo, or
is that just useful for VR?

~~~
paulmd
The AMD Pro Duo seems aimed at the workstation/rendering market rather than
the compute market.

NVIDIA is totally dominant in the compute market. They have an enormous amount
of software running on CUDA that you would be locking yourself out of with
AMD, and since NVIDIA has such a dominant share of the compute hardware you
would also be tuning for what amounts to niche hardware.

AMD has recently been working on getting a CUDA compatibility layer working,
hopefully this will improve in the future.

------
frik
I am waiting for the Nvidia GTX 1070, and sincerely hope NVidia doesn't fuck
it up again like the GTX 970.

GTX 970: It was revealed that the card was designed to access its memory as a
3.5 GB section, plus a 0.5 GB one, access to the latter being 7 times slower
than to the former. --
[https://en.wikipedia.org/wiki/GeForce_900_series#False_adver...](https://en.wikipedia.org/wiki/GeForce_900_series#False_advertisement)

~~~
nness
I have a GTX 970 and have not yet noticed any issues as a result of that
memory issue. Maybe I'm not pushing the card hard enough...

~~~
frik
Weird comment. Like a grandma driving a Ferrari with a broken gearbox: "I am
perfectly fine driving at moderate speed, maybe I am not pushing the car hard
enough. I am perfectly fine that the mechanic deactivated the malfunctioning
higher gears".

You may not notice it nowadays because several games received patches and
newer Nvidia drivers limit memory consumption to just below 3.5 GB; otherwise
you would experience a major slowdown. It's also a major problem for CUDA.

~~~
ozi
I see your point, but I use a 970 and just finished Dark Souls 3 on max
settings without a single hiccup.

No doubt that with future games the 970 will not perform as well as it should,
but I will have a different card by then. What they did is a shame and
shouldn't happen again, but I haven't noticed any real-world ramifications
yet.

~~~
sqldba
How was DS3? I liked DS but haven't tried DS2 yet.

~~~
ozi
It's amazing(ly frustrating).

------
tostitos1979
I was helping a friend put together a Titan X "rig" and we realized that case
space, power supply and motherboard slots were some mundane but frustrating
challenges. For someone building out a new rig for personal/hobbyist work in
deep learning, any recommendations? Is the best setup to get two 1080s, 16-32
GB of RAM and a 6th generation i7?

~~~
kmike84
I built a rig a couple of days ago; it works nicely both for deep learning
and for RAM- and CPU-heavy tasks:
[http://pcpartpicker.com/p/GvTGsY](http://pcpartpicker.com/p/GvTGsY) - two
8-core Xeons, 128GB RAM, and slots for 2 GPUs, for ~$2000. Currently it has a
GTX 970; I'll likely add a 1080 at some point in the future.

I've heard the 6th-gen i7 is not good for deep learning because its PCIe
connectivity is crippled (16 PCIe lanes instead of 40 in previous generations;
it should matter for the dual-GPU use case). Don't quote me on that ;) Used
Xeon E5-2670 v1 chips are dirt cheap on eBay now, and they are modern enough.
Single-core performance is worse than in modern desktop CPUs, but not by too
much; multi-core performance is great, and these Xeons let you install lots of
RAM.

If you don't want that much RAM, then for the same price a single desktop CPU
(i7 5th gen?) could work better because of its higher clock rate.

~~~
dweekly
Neat, I hadn't heard of PCpartPicker before. The Xeon 2670 you listed for $90
goes for >$1500 on NewEgg; that is quite a retail / eBay price split! Any idea
why there's such a gap?

~~~
T-A
"late last year, the supply became huge when thousands of these processors hit
the market as previous-gen servers from Facebook and other big Internet
companies were decommissioned by used equipment recyclers" [1]

[1] [http://www.techspot.com/review/1155-affordable-dual-xeon-pc/](http://www.techspot.com/review/1155-affordable-dual-xeon-pc/)

------
marmaduke
The lower TDP is just as significant as speed. I've got a pair of GTX 480 that
I can use to heat my office with a light workload. How many 1080s could run in
a single workstation?

~~~
chx
Haswell-E chips have 40 PCIe lanes. If you give 8 to each card then 4 is
doable, although cooling that will be fun. That already exceeds the ATX 7-slot
layout, but many cases (especially those that offer SSI EEB compatibility)
offer 8 slots. Four such GPUs, a Haswell-E, the motherboard -- it'd be
challenging to assemble this below $3K, but I think $4K is a reasonable
target.

Going beyond that is not impossible but requires server-grade hardware. For
example, the Dell R920/R930 has 10 PCIe slots, as does the Supermicro
SuperServer 5086B. The barebone for the latter is above $8K. You need to buy
Xeon E7 chips for these, and those will cost you more than a pretty penny. I
do not think $20K is an unreasonable target.

Not enough? A single SGI UV 300 chassis (5U) provides 4 Xeon E7 processors, up
to 96 DIMMs, and 12 PCIe slots. You can stuff 8 Xeon E7 CPUs into the new HP
Integrity MC990 X and have 20 (!) PCIe slots. How much do these cost? An arm
and _two_ legs. I can't possibly imagine how such a single workstation would
be worth it instead of a multitude of cheaper workstations with just 4 GPUs
each (you'd likely end up with an order of magnitude more GPUs that way -- E7
CPUs and their base systems are hideously expensive), but to each their own.

~~~
chx
I left out something: the Supermicro 6037R-TXRF server is dual-socket and has
ten slots, and since it's an older dual-socket system, even with CPUs and RAM
it can be had for <$3K - so together with five GPUs you can have it for less
than $6K. It's a whole other question whether almost doubling the price is
worth adding one more GPU (no).

------
kayoone
I am just curious (and a total machine learning novice): if you were to
experiment with ML, what are the benefits of getting a fast consumer card like
this over using something like AWS GPU instances (or some other cloud
provider)? Or, phrased differently: when does it make sense to move from a
local GPU to the cloud, and vice versa?

~~~
nightski
I'm a relative novice also, but we have run an EC2 8-way GPU instance.
Basically the GPUs are a bit slower and they only have 4 GB of RAM each. Also,
getting everything set up and uploading a data set is tedious. Costs weren't
too bad. I could see AWS being used for training smaller problems and for
learning. If you have a serious project it probably won't fit the bill.

------
jkldotio
Isn't the Titan X also about RAM? It has 12GB to the GTX 1080's 8GB. At the
low end you could just buy more than one GTX 1080, so it looks like a good
deal there, but at the top end you are running out of slots for cards.

~~~
vessenes
Larger training sets, essentially. I'm a neural net novice, but the general
perspective seems to be that 8 vs 12 is not a big deal; either you are
training something that's going to have to get split up across multiple GPUs
anyway, or there's probably a fair amount of efficiency you can get in your
internal representations by shrinking RAM usage.

One thing not mentioned in this gaming-oriented press release is that the
Pascal GPUs have additional support for really fast 32-bit (and, do I recall,
16-bit?) processing; this is almost certainly more appealing for machine
learning folks.

On the VR side, my guess is that the 1080 won't be a minimum required target
for some time; the enthusiast market is still quite small. That said, it can't
come too quickly; better rendering combined with butter-smooth frame rates has
a visceral impact in VR that on-screen improvements don't.

~~~
fscherer
8 vs 12 is a really big difference - especially with state-of-the-art deep
learning architectures. The problem isn't the size of the dataset but the size
of the model. Large convolutional neural networks often need the whole 12 GB
of VRAM on a Titan card for themselves, and distributing over multiple
cards/machines adds a lot of complexity.

------
voltagex_
I wonder what the idle power usage of one of these would be? I wish my
motherboard allowed me to turn off the dedicated card and fall back to the
Intel chipset when I didn't need the performance.

~~~
tjohns
From the slides in Nvidia's presentation, it looks like the max power
consumption is about 180 W. That puts it squarely between the GTX 980 (165 W)
and the GTX 980 Ti (250 W).

Idle power consumption is an order of magnitude less. For the 980 Ti, you're
looking at about 10 W of power consumption while running outside of a gaming
application. Maybe 40 W if you're doing something intensive. [1]

[1]: [http://www.tomshardware.com/reviews/nvidia-geforce-gtx-980-t...](http://www.tomshardware.com/reviews/nvidia-geforce-gtx-980-ti,4164-7.html)

------
tjohns
All the benchmarks I can find are comparing this against the GTX 980. I'm
curious how it compares to a 980 Ti.

~~~
jon_richards
I've always been amazed by this chart
[http://www.videocardbenchmark.net/high_end_gpus.html](http://www.videocardbenchmark.net/high_end_gpus.html)

I'm sure some of those expensive cards have extra features, but the pricing
differences are still crazy.

~~~
TylerE
A big part of what you're paying for with the workstation-grade cards (e.g.
Quadro/FirePro) is certification. If you're using many "serious" applications
like 3D CAD, you must be using a certified (i.e. not consumer-grade) card to
be fully supported. They run fine on the consumer cards - but don't expect
support.

~~~
hengheng
And as it happens, the companies that specialize in cad support will use every
opportunity to shove these overpriced cards down your throat. As far as I've
understood their business model, they earn almost nothing from selling
licenses, so they live on their maintenance contracts and selling Quadro/Fire
cards.

Meanwhile we have no problems with an office running on 950Ms and 780/980
GTXes, except for some weird driver bugs that only occur on the single Quadro
machine that we keep for reference.

------
drewg123
Does anybody know what kind of performance a modern Nvidia card like this can
provide for offloading SSL ciphers (aes gcm 128 or aes gcm 256)?

~~~
robert_foss
There are data dependencies in several AES modes (though not in ECB, nor in
the CTR core of GCM) which prevent parallelization within a stream. If you run
2k separate AES encryption streams, however, you can expect a nice performance
increase. For a single chained stream there is little performance to be had.
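
The shape of the parallelism, as a CUDA skeleton: one thread per independent
stream, with no interaction between streams. The encrypt_block function below
is a placeholder (a toy XOR, not real AES), just to show the structure:

    #include <stdint.h>

    // Placeholder standing in for a real AES-128 block encryption -- a
    // toy XOR, NOT cryptography.
    __device__ void encrypt_block(const uint8_t* key, const uint8_t* in,
                                  uint8_t* out) {
        for (int i = 0; i < 16; ++i) out[i] = in[i] ^ key[i];
    }

    // One thread per independent stream; each walks its own chain of
    // 16-byte blocks, so per-stream data dependencies never cross threads.
    __global__ void encryptStreams(const uint8_t* keys, const uint8_t* in,
                                   uint8_t* out, int numStreams,
                                   int blocksPerStream) {
        int s = blockIdx.x * blockDim.x + threadIdx.x;
        if (s >= numStreams) return;
        const uint8_t* key = keys + s * 16;
        for (int b = 0; b < blocksPerStream; ++b) {
            size_t off = ((size_t)s * blocksPerStream + b) * 16;
            encrypt_block(key, in + off, out + off);
        }
    }

In practice you'd also interleave the per-stream buffers so the 16-byte
accesses coalesce across the warp (see the memory discussion elsewhere in this
thread), which is where most of the real throughput lives.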

~~~
drewg123
We run tens of thousands of streams in parallel, so that's not a problem. :)

I'm just realizing that GPUs could be a viable alternative to specialized
(e.g., expensive) encryption offload engines or very high-end (e.g.,
expensive) v4 Xeons with many, many cores. However, it is rather hard to find
data on the bandwidth at which GPUs can process a given cipher. I'm just now
starting to look into this, so I could very well be misunderstanding
something.

~~~
paulmd
Throughput 100% depends on how hard they need to hit memory. GPUs _love_ stuff
that they can do entirely within registers. Having to hit (the limited
quantity of) shared memory slows things down much more. Inefficient/non-
coaslesced use of shared memory or global memory too often can trash
performance. Hitting host/CPU memory in any non-trivial amount usually dooms a
program.

With each step you not just cut down your _bandwidth_ but also you increase
your _latency_ and that has a huge impact.

~~~
serialx
Guys, have a look at SSLShader [1].

[1]: [http://shader.kaist.edu/sslshader/](http://shader.kaist.edu/sslshader/)

~~~
nwrk
Thanks!

------
rl3
I'm somewhat dismayed that the GTX 1080 has only 8GB of VRAM considering that
previous generation AMD products already had 8GB, and the previous generation
Titan model had 12GB—the latter of course being outperformed by the GTX 1080.

Then again, rendering a demanding title at 8192x4320 on a single card and
downsampling to 4k is probably wishful thinking anyways. However, it's
definitely a legitimate concern for those with dual/triple/quad GPU setups
rendering with non-pooled VRAM.

On the bright side, 8GB should hopefully be sufficient to pull off 8k/4k
supersampling[0] with less demanding titles (e.g. Cities: Skylines).
Lackluster driver or title support for such stratospheric resolutions may
prove to be an issue, though.

It's possible Nvidia is saving the 12GB configuration for a 1080 Ti model down
the road. If they release a new Titan model, I'm guessing it'll probably weigh
in at 16GB. Perhaps less if those cards end up using HBM2 instead of GDDR5X.

[0]
[https://en.wikipedia.org/wiki/Supersampling](https://en.wikipedia.org/wiki/Supersampling)

~~~
rl3
Correction: 7680×4320.

Accidentally multiplied DCI 4K, not UHD 4K.

------
mioelnir
Towards the end of the article, in the table overview, they list an `NVIDIA
GP100` model with a memory bus width of 4096 bits. It is still shared memory,
but considering bcrypt only requires a 4KB working set, that now fits in 8 bus
transfers instead of the 128 needed on 256-bit bus architectures...

Am I wrong to think this card could really shake up bcrypt GPU cracking?

~~~
gnoway
The AMD Radeon R9 Fury series has a 4096-bit bus (HBM1) and has been out for
months, so if that were enough, it would already be shaking things up.

~~~
mioelnir
Ah, did not know that. Nevermind.

------
tormeh
I'll buy when there's Vulkan/DX12 benchmarks and real retail prices for both
AMD and NVIDIA next-gen cards. Buying now seems slightly premature.

But oh man am I excited!

------
ChuckMcM
This feels like a leap similar to the one I felt when the 3dfx Voodoo 2 SLI
came out. The possibilities seem pretty amazing.

I'm interested to know how quickly I can plug in a machine learning toolkit;
it was a bit finicky to get up and running on a box with a couple of 580 GTs
in it, but that might just be because it was an older board.

------
BuckRogers
I watched the livestream; it looked good, and I love the performance/watt of
this 180-watt card. It's way more GPU power than I need professionally or for
fun, though. I'm actually all-in on the new Intel gaming NUCs. Skull Canyon
looks fantastic and has enough GPU performance for the gaming I still do:
mostly older games, some CS:GO (my friends and I play at 1280x1024 anyway
since it's competitive gaming) and League of Legends.

It's also nice to have an all-Intel machine for Linux. I'd use a low-end NV
Pascal to breathe new life into an older desktop machine, since NV always
seems to have a bit better general desktop acceleration, which really helps
out old CPUs. If building a high-end gaming rig I'd probably wait for AMD's
next chip; I've liked them more on the high end for a few generations now.
Async compute and fine-grained preemption, generally better hardware for
Vulkan/DX12. AMD is also worth keeping an eye on for their newfound console
dominance, and the subsequent impact if they push dual 'Crossfire' GPUs into
the XboxOne v2, the PS4K and the Nintendo NX. That would be a master stroke:
getting games programmed at a low level for their specific dual GPUs by
default. Also, the removal of the driver middleware mess, with the return of
low-level APIs to replace OGL/DX11, will take the software monkey off AMD's
back. That always plagued them, and the former ATI, a bit.

I'll probably buy the Kaby Lake 'Skull Canyon' NUC revision next year, and if
I end up missing the GPU power, hook up the highest-end AMD Polaris over
Thunderbolt. Combining the 256MB L4 cache that Kaby Lake-H will have with
Polaris will truly be next-level. Kaby also has Intel Optane support in the
SODIMM slots; it's possible we'll finally see the merge of RAM+SSDs into a
single chip.

But more than anything, I want Kaby Lake because it's Intel's last 14nm
product, so here's hoping they finally sort out their electromigration
problems. Best to take a wait-and-see approach on these 16nm GPUs for the same
reason. I'm moving to a 'last revision' purchase policy on these <=16nm
processes.

~~~
rjbwork
>my friends and I play at 1280x1024 anyway since it's competitive gaming

Can you elaborate on this? Never heard of this.

~~~
BuckRogers
It's common for pros and old timer Counterstrike players. My friends and I
have been PC gaming since the mid-80s so we've been in on CS since the
earliest days. Here's a spreadsheet of some pro players' setups.

[https://docs.google.com/spreadsheets/d/1UaM765-S515ibLyPaBtM...](https://docs.google.com/spreadsheets/d/1UaM765-S515ibLyPaBtMnBz7xiao0HL5f-F1zk_CSF4/htmlview?sle=true#gid=1762004852)

A little bit of everything but note how common 1024x768 is. It looks a little
sharper, enough that it doesn't need AA (which I always felt introduced some
strange input lag).

Guarantees good performance, makes the models slightly bigger. Gives you a
slight edge. You can also run a much slower GPU, that's how I'm getting away
with using stuff like Intel Iris Pro. I'd run 1280x1024 with all options on
lowest even if I had a Fury X. Other people I know have GF780s and do the
same. I only turn up graphics options in single player games, at which point I
don't care about FPS dips. When playing competitively I want every edge I can
get.

~~~
knodi123
> When playing competitively I want every edge I can get.

Jesus christ, I can't even imagine the leaps in skill I'd have to make before
the difference between 50fps and 150fps had any impact on my performance.
Watching gaming at that level feels like being a civilian in Dragonball Z,
just seeing a bunch of blurs zipping in the sky while wondering if I'm about
to become obsolete.

~~~
chrisan
> I can't even imagine the leaps in skill I'd have to make before the
> difference between 50fps and 150fps had any impact on my performance

Keep in mind FPS numbers are normally given as averages. You can walk up to a
wall with nothing else on screen and get an insane FPS, or back out and take
in a whole area at max view distance and see a much lower FPS. Add in
players/objects/explosions etc. and your FPS dips and dives.

Having an average of 50 FPS means you likely ARE going to notice certain dips,
while at a 150 average you're much less likely to see a perceptible dip.
------
Szpadel
> The GP104 Pascal GPU Features 2x the performance per watt of Maxwell.

And

> The GTX 1080 is 3x more power efficient than the Maxwell Architecture.

I think someone got carried away by their imagination.

The 980 has 4.6 TFLOPS (single precision). Assuming the 1080's 9 TFLOPS figure
is also single precision and that the new card had the same TDP, that would be
a 1.96x increase, so ~2x.

EDIT: I found that the 1080 will have a 180W TDP, where the 980 has 165W, so
corrected, it's a 1.79x increase in performance per watt.
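
A quick check of that arithmetic:

    // Perf/watt ratio: (9 TFLOPS / 180 W) vs (4.6 TFLOPS / 165 W)
    #include <cstdio>

    int main() {
        double gtx980 = 4.6 / 165, gtx1080 = 9.0 / 180;       // TFLOPS per watt
        printf("same-TDP ratio:  %.2fx\n", 9.0 / 4.6);        // ~1.96x
        printf("perf/watt ratio: %.2fx\n", gtx1080 / gtx980); // ~1.79x
    }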

------
agumonkey
Not long ago I found this 2009 project
[https://en.wikipedia.org/wiki/Fastra_II](https://en.wikipedia.org/wiki/Fastra_II)

built to give cluster-class performance in a 'desktop' using GPUs. I wonder
how it would fare against a 1080.

~~~
Sanddancer
On paper, the Fastra II would still be a fair bit faster, and would have more
RAM. However, in a standoff I'd put my money on the 1080, simply because of
the overhead of wrangling all those GPUs to work together.
------
partiallypro
I'm more interested in the impact it will have on the 10 series as a whole. I
literally just bought a 950; now I'm wondering if the 1050 will be priced just
as reasonably and offer big performance gains. Also, what is the timetable for
the rest of the 10 series?

~~~
touristtam
Based on the previous gens (since the 400 series) you can expect about a
20-25% gain from one generation to the next. Traditionally the release cycle
has been: April to June, paper launch (read: very limited supply of the top of
the range); July to October, the x70/x60 range (which are the main gaming
GPUs); November-December, the rest of the range released with ample supply for
the end-of-year festive season; January to March, announcement of the next
generation. Rinse and repeat.

------
MichaelBurge
How good does it look for a hobbyist manipulating large matrices for use in
machine-learning?

~~~
slashcom
Pretty good; 8GB of memory and impressive transfer speeds. The 980 Ti is
pretty common in the deep learning community. Only the Titan X is more
popular, and that thing is far outside any average person's budget. I'm
definitely looking to pick up one of these soon.

~~~
aab0
Titan budgets are definitely a problem, which is why you see so many
acknowledgements of Nvidia in deep learning papers, thanking them for donating
a Titan.

------
kosmic_k
Does anyone know why video cards stayed on a 28nm process for so long? It
appears that a significant factor in this incredible leap in performance is
the process change, but I'm puzzled as to why 22nm was skipped.

~~~
msbarnett
TSMC doesn't have a 22nm process -- they offer a 20nm process instead, but it
was heavily delayed, had huge yield issues, and most of its production
capability was double-booked with Apple, all of which meant it wasn't feasible
to use in time for Nvidia's previous generation (Maxwell) launch.

Maxwell had been planned as 20nm, but when TSMC couldn't solve their process
issues the first Maxwell was taped out at 28nm (GM107, aka the GeForce 750)
while they waited on the rest of the lineup. 6 months later, with TSMC still
having issues at 20nm, the GM204 (GeForce 980/970) was launched at 28nm as
well.

16nm FinFET wasn't subject to massive delays (it launched early, in fact), and
seems to have better yields than 20nm early on, so Pascal skipped right down
to it.

~~~
kosmic_k
Thanks for the explanation! I haven't been following fab process news for some
time so I've been very much out of the loop.

------
valine
I'm hopeful this will provide a significant speedup for my 3D modeling /
rendering. The number of CUDA cores is only slightly higher than the 780's.
I'll definitely wait for more benchmarks.

------
cheapsteak
Could someone clarify if they meant that one 1080 is faster than two 980s in
SLI? Or did they mean two 1080s were faster than two 980s?

~~~
gnoway
He meant one 1080 is faster than two 980s in SLI, but that seems pretty hard
to believe; we're going to need to wait for some 3rd party reviews.

~~~
mkhalil
I might believe 980s, but not 980 Tis. That would be a feat.

------
science404
No word on double-precision performance? It could replace the Titan Z and
Titan (Black) as a good CUDA-application card..

~~~
programmer_dude
DP will be better than on Maxwell cards, but you'll need to wait and see if it
can beat Kepler (IMO it will).

------
mtgx
"GTX 1080"

"10 Gaming Perfected"

Seems like a missed marketing opportunity to make it a 10 TFLOPS card.

~~~
Kephael
There is likely to be a GTX 1080 Ti with a significantly larger GPU (GP200)
that will exceed 10 TFLOPS.

~~~
tasnent
Any idea when it'll be released?

------
tronreg
What is the "founders edition"?

~~~
yohui
[http://www.anandtech.com/show/10304/nvidia-announces-the-gef...](http://www.anandtech.com/show/10304/nvidia-announces-the-geforce-gtx-1080-1070/2)

> _There will be two versions of the card, the base /MSRP card at $599, and
> then a more expensive Founders Edition card at $699. At the base level this
> is a slight price increase over the GTX 980, which launched at $549.
> Information on the differences between these versions is limited, but based
> on NVIDIA’s press release it would appear that only the Founders Edition
> card will ship with NVIDIA’s full reference design, cooler and all.
> Meanwhile the base cards will feature custom designs from NVIDIA’s partners.
> NVIDIA’s press release was also very careful to only attach the May 27th
> launch date to the Founders Edition cards._

------
rasz_pl
and still not fully DX 12, oh nvidia

~~~
bpye
Maxwell currently leads when it comes to DX12 feature support (along with
Skylake); see
[https://en.wikipedia.org/wiki/Feature_levels_in_Direct3D#Dir...](https://en.wikipedia.org/wiki/Feature_levels_in_Direct3D#Direct3D_12).
I don't know what Pascal has, but I doubt it'd be a regression.

------
interdrift
This will be pretty nice for VR, because it will push the prices of
older-generation cards down.

------
amelius
Why are these coprocessors still positioned as graphics cards?

~~~
ChrisLTD
Because consumers buy them for video games.

~~~
amelius
But you can look at it this way: if it ever becomes interesting from a
business (money) perspective to lock these graphics cards down to specific
games, then the compute folks lose their processing power. They were not the
market in the first place (relatively speaking).

