Also, do not forget that benchmarks made by Nvidia mean nothing. How fast the cards really are will only be clear once independent people do real benchmarks. That's true for games, and it will be just as true for all the speculation here about ML performance.
And despite questions about it, the GTX 1080 will be a fine deep learning board, especially if the framework engineers get off their lazy butts and implement Alex Krizhevsky's "one weird trick" algorithm, an approach which allows an 8-GPU Big Sur with 12.5GB/s P2P bandwidth to go toe to toe with an 8-GPU DGX-1 on training AlexNet.
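For anyone unfamiliar with it, the trick is roughly hybrid parallelism: keep the convolutional layers data-parallel and make the big fully connected layers model-parallel, so workers exchange small activations instead of huge FC weight matrices. A toy NumPy sketch of the forward pass (all names and shapes here are made-up for illustration, and a plain linear layer stands in for the conv stack -- this is not anyone's actual implementation):

    import numpy as np

    rng = np.random.default_rng(0)
    workers, batch_per_worker, in_dim, feat_dim, fc_dim = 2, 4, 32, 16, 8

    # Data-parallel part: every worker holds a full copy of the "conv" weights
    # and processes only its own shard of the batch.
    conv_w = rng.standard_normal((in_dim, feat_dim))
    shards = [rng.standard_normal((batch_per_worker, in_dim)) for _ in range(workers)]
    feats = [x @ conv_w for x in shards]               # each worker: (4, 16)

    # Model-parallel part: the big FC matrix is split column-wise, one slice per
    # worker. Workers all-gather the (small) activations, not the (large) weights.
    fc_w = rng.standard_normal((feat_dim, fc_dim))
    fc_slices = np.split(fc_w, workers, axis=1)        # each worker: (16, 4)
    all_feats = np.concatenate(feats, axis=0)          # the "all-gather": (8, 16)
    out_slices = [all_feats @ w for w in fc_slices]    # each worker: (8, 4)

    # Stitching the per-worker output slices together matches the serial result.
    assert np.allclose(np.concatenate(out_slices, axis=1),
                       np.concatenate(shards, axis=0) @ conv_w @ fc_w)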
NCCL is in Torch as well, but it doesn't always win, and its use of persistent kernels has some weird interactions with streams and other such things.
However, this feels like more of a benchmarking thing now. Networks tend to be over-parameterized and redundant in many ways; the action has been in sending less data via much deeper networks with fewer parameters (e.g., residual networks, GoogLeNet, etc.), or in non-synchronous forms of communication among workers. Trying to squeeze out every last drop of GPU/GPU bandwidth is not as important as iterating on the architecture and learning algorithms themselves.
That said, most people couldn't write them, so I advise NCCL. You're the third person to tell me NCCL has ish(tm), fascinating.
And sure, you could do the outer product trick. You could use non-deterministic ASGD. And you can do a lot of weird(tm) tricks. But why do these (IMO ad-hoc and potentially sketchy task-specific) things when there's an efficient way to parallelize the original network in a deterministic manner for training that allows performance on par with a DGX-1 server?
Because for me, the action is in automagically discovering the optimal deterministic distribution of the computation so researchers don't need to worry about it. And IMO that's where the frameworks fail currently.
E.g. Caffe: https://github.com/BVLC/caffe/blob/master/docs/multigpu.md
Nope... This is the stupid way... They should be using ring reductions, handily provided by the NCCL library:
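To make the idea concrete, here's a toy NumPy simulation of the reduce-scatter + all-gather passes that a ring all-reduce is built from ("workers" are just a list of arrays here; the real NCCL call overlaps sends and receives over PCIe/NVLink, this only shows the chunk bookkeeping):

    import numpy as np

    n = 4                                              # number of "GPUs" in the ring
    grads = [np.random.rand(8) for _ in range(n)]      # per-worker gradients
    expected = np.sum(grads, axis=0)

    chunks = [np.array_split(g, n) for g in grads]     # each worker sees n chunks

    # Reduce-scatter: after n-1 steps, worker i fully owns reduced chunk (i+1) % n.
    for step in range(n - 1):
        for i in range(n):                             # worker i sends to its right neighbor
            src, dst, c = i, (i + 1) % n, (i - step) % n
            chunks[dst][c] = chunks[dst][c] + chunks[src][c]

    # All-gather: n-1 more steps pass the reduced chunks around the ring.
    for step in range(n - 1):
        for i in range(n):
            src, dst, c = i, (i + 1) % n, (i + 1 - step) % n
            chunks[dst][c] = chunks[src][c].copy()

    assert all(np.allclose(np.concatenate(chunks[i]), expected) for i in range(n))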
My heuristic whenever I think I've invented something is to aggressively exploit Google to prove me wrong. It usually does so, but sometimes in very unusual ways. Glad to hear about Torch though. Do they have automagic model parallelization (not data parallelization) as well?
Not that I know of.
Regarding claims of novelty, I don't think the Caffe maintainers are claiming that their multi-GPU update method is novel or even very good. I think it was just the easiest thing someone could think of. I think Flickr originally wrote the multi-GPU extensions and the maintainers simply accepted the pull request.
If anything, I think the maintainers are more than willing to listen to people in the scientific computing community with experience. Even better if they have a pull request in hand. But otherwise, they probably won't know about better methods and won't care.
Am I missing something here?
The bad news is that there's no news about new-gen displays that could take advantage of this graphics card. I'm talking about 4K 120Hz HDR (https://en.wikipedia.org/wiki/High-dynamic-range_rendering) displays. This is a total WTF - we have a graphics card with DP 1.4 and we don't even have a single display with so much as DP 1.3...
Displays get to play catchup for a while. This is great - it means we can avoid stupid hacks like MST.
It would also be useless to have the displays first and cards later
I guess some manufacturers have it lined up already
Your link is a bit outdated now in that these days there are plenty of consumer HDR standards/names out there. HDR10 and Dolby Vision are the two main competing standards, and we have marketing terms like "Ultra HD Premium" or "HDR Compatible".
I think NCX is being a bit overly negative (in his characteristic style) in his predictions. Sure, there will be cheap crap released, but considering that almost all major manufacturers are pushing this tech heavily on TV screens, it can't be that long until it trickles down to monitors too. Of course you can plug your PC into one of those fancy new HDR TVs right now; I'm just not sure if you can get HDR actually working. I guess that would be mostly a software issue?
So if it isn't just more contrast, then what does it actually mean? In images it means values are stored above what can be displayed.
Dual-link DVI is significantly old hat compared to the newer DisplayPort and HDMI standards. A single dual-link DVI cable can carry either half a 4K frame (1920x2160) at 8 bits per color, 60 Hz, OR a single 1080p frame at 10 bits per color, 60 Hz. To get a 4K 60 Hz display running you need to use two dual-link DVI cables, which is insanely bulky. Plus you get at best 8-bit color out of it, and no audio.
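Back-of-the-envelope check of those numbers (active pixels only; real links also lose headroom to blanking, so this is optimistic):

    DUAL_LINK_DVI_MPX = 2 * 165e6          # two 165 MHz TMDS links

    for name, w, h, hz in [("full 4K UHD", 3840, 2160, 60),
                           ("half 4K (one cable)", 1920, 2160, 60),
                           ("1080p", 1920, 1080, 60)]:
        rate = w * h * hz
        fits = "fits" if rate <= DUAL_LINK_DVI_MPX else "does NOT fit"
        print(f"{name}: {rate/1e6:.0f} Mpx/s -> {fits} on one dual-link DVI cable")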
The latest HDMI standard can do 4096x2160 60Hz in deep color (12 bits per color).
DisplayPort 1.4 has a higher maximum bandwidth than HDMI 2.0 (32.4 Gbps vs 18 Gbps). This allows it to do 4K at 120 Hz with deep color/HDR, or to carry multiple 4K streams (multiple monitors or 3D) on one cable. It's by far the most advanced standard out there for consumer video transmission.
Source: I'm an engineer working with video systems, and Wikipedia
Advantages over other standards: https://en.wikipedia.org/wiki/DisplayPort#Advantages_over_DV...
Also, another major advantage - USB Type-C compatibility (http://www.displayport.org/what-is-displayport-over-usb-c/)
A major advantage of optical is that there are no EMI problems and that you are electrically decoupled (e.g. no ground/hum loops). This might be one of the reasons why TOSLINK is optical. For displays, ground loops are typically not a problem compared to amplifiers/speakers.
Anyone know why they didn't end up going optical + copper for USB3?
Their press statements say it's visually transparent, but the technical overview says this:
> All of the analyses showed that the DSC algorithm outperformed five other proprietary algorithms on these picture quality tests, and was either visually lossless or very nearly so for all tested images at 8 bits/pixel.
Here are two examples. First, the engine behind Folding@Home:
See also AlexNet (Deep Learning's most prominent benchmark):
versus GTX TitanX:
TLDR: AMD losing by a factor of 2 or more...
So unless this changes dramatically, NVIDIA IMO will continue to dominate.
But hey, they both crush Xeon Phi and FPGAs so it doesn't suck to be #2 when #3 is Intel.
600 images/s inference (highest ever publicly reported), projected to be as high as 900 images/s. ~$5000 per FPGA.
Compare and contrast with the Titan X/GTX 980 Ti now topping 5000 images/s. The GTX 1080 will only be faster than this, and $600.
And so far, no FPGA training numbers. But here are some distributed training numbers for $10Kish Xeon servers:
64 of them train AlexNet in ~5 hours. A DGX-1 does it in 2. Don't have/can't get a $129K DGX-1? Fine, buy a Big Sur from one of several vendors and throw in 8 GTX 1080s as soon as they're out, implement "One Weird Trick", and you'll do it in under 3. That ought to run you about $30-$35K versus >$600K for those 64 Xeon servers.
FPGAs OTOH are fantastic at low-memory-bandwidth, embarrassingly parallel computation like bitcoin mining. Deep learning is not such a domain.
Even your previous post had links to two pages that were not measuring the same thing... and thus could not support your claims?
I think you need to clarify.
Vega is also exciting because it should be released around the time AMD's Zen CPUs are released, and there were rumors about an HPC Zen APU with Vega and HBM.
At first I thought it meant that HBM2 would be available across their whole lineup, but I think if that were the case they would've just said so.
I've also read a bit about how Nvidia wanted to go with HMC (Hybrid Memory Cube) instead of HBM, but it seems it's still at least twice as expensive per GB, and they needed it for high-end GPUs with at least 16GB; HMC is not even at 8GB yet. Intel also seems to have adopted HMC for some servers.
So is the "next-gen" memory HMC, or something else? AMD is supposed to come out with it in 2018, hopefully at 10nm.
Also, GDDR5 does not have higher latency than DDR3. The memory arrays themselves are the same and exactly as fast, and the bus doesn't add any extra latency. GDDR5 doesn't trade latency for bandwidth, it trades density for bandwidth.
Things built to use GDDR5 often have much higher latencies than things built to use DDR3, but this has nothing to do with the memory itself; it's about how the memory controllers in GPUs delay accesses to merge ones close to each other, to optimize for bandwidth.
nVidia released new features and capabilities on CUDA that later versions of OpenCL also supported, but were not available on nVidia cards due to lack of OpenCL driver support. Hence, lots of people developed libraries and tools for CUDA, with great encouragement and support from nVidia (which attracted developers). These tools were useful and powerful, encouraging others to also use nVidia hardware to take advantage of those tools. CUDA has now taken root.
If nVidia were to start keeping their drivers updated with the latest version of OpenCL, I suspect CUDA's dominance would start to wane. Though the CUDA roots have gone deep, so it may take several years to reverse its dominance. However, it's highly unlikely that nVidia will update their drivers any time soon. The only thing that would probably force them is AMD gaining market dominance for a few years.
If you read between the lines of how dismissive AMD were of NVidia's work on self-driving cars, you can see they don't care about compute.
They are chasing the VR market, and that is fine.
Witness also PyOpenCL versus PyCUDA - both are essentially equally convenient to use, while raw OpenCL C is a nightmare compared to CUDA C.
Python has some good support for this already. I haven't done a project like this yet, but if setting up something fresh I'd immediately jump on NumPy interacting with my own kernels, plus some Python glue for the node-level parallelism.
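PyCUDA makes that workflow pretty painless. A minimal sketch of the "NumPy plus my own kernel" setup, assuming PyCUDA is installed and a CUDA-capable GPU is present (the SAXPY kernel and sizes are just for illustration):

    import numpy as np
    import pycuda.autoinit            # creates a context on the default GPU
    import pycuda.driver as drv
    from pycuda.compiler import SourceModule

    mod = SourceModule("""
    __global__ void saxpy(float *y, const float *x, float a, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }
    """)
    saxpy = mod.get_function("saxpy")

    n = 1 << 20
    x = np.random.rand(n).astype(np.float32)
    y = np.random.rand(n).astype(np.float32)
    expected = 2.0 * x + y

    # drv.In/InOut handle the host<->device copies; block/grid pick the launch shape.
    saxpy(drv.InOut(y), drv.In(x), np.float32(2.0), np.int32(n),
          block=(256, 1, 1), grid=((n + 255) // 256, 1))

    assert np.allclose(y, expected)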
I don't understand why AMD doesn't just jump on the CUDA bandwagon and support the API?
OpenCL (using an ATI card) was much harder to program, since the abstraction level was much higher. You end up writing two separate kernels so that each is faster than a generic version that compromises for compatibility.
The OpenCL one ended up being faster, but I suspect that's due to ATI hardware being superior at the time.
The current versions only work with a couple of the latest AMD chips, though. And CUDA is just one part of it; they have nothing equivalent to cuDNN.
2. It's not just the API. It's the entire SDK that has NV-specific hooks that AMD can't tap into.
+ This is a "small" part of the cost, but doing 5m polygons at 60 fps can result in about 30 GFLOPS of compute for that single matrix operation (in reality, there are many vertex operations and often many more fragment operations).
edit: apparently Pascal still has a warp-size of 32.
Threads are executed an entire warp at a time (32 or 64 threads). All threads execute all paths through the code block - e.g. if ANY thread takes an if-statement, ALL threads execute the if-statement. The threads where the if-statement is false execute NOP instructions until the combined control flow resumes. This is applied recursively where applicable (this is why "warp divergence/branch divergence" absolutely murders performance on GPUs).
When threads execute a memory load/store, they do so as a group. The warp controller is designed to combine these requests if at all possible. If 32 threads all issue requests for 32 sequential words, it will combine them into one request for the whole block. However, it cannot do anything if the request isn't either contiguous (a sequential block the size of the warp) or strided (thread N wants index X, thread N+1 wants X+s, thread N+2 wants X+2s, etc). In other words - it doesn't have to be contiguous, but it does have to be uniform. The resulting memory access will be broadcast to all units within the warp, and this is a huge factor in accelerating compute loads.
Having a warp-size of 64 hugely accelerates certain patterns of math, particularly wide linear algebra.
The NVIDIA memory model also goes through L1 cache - but that's obviously not very big on a GPU processor (also true on AMD IIRC). Like <128 bytes per thread. It's great if your threads hit it coalesced, otherwise it's pretty meaningless.
At a glance there is a lot of legacy stuff, so I'd look at anything related to GCN, Sea Islands and Southern Islands. Evergreen, R600-800 etc. are legacy VLIW ISAs as far as I know.
memory accesses in a warp do not necessarily have to be contiguous, but it does matter how many 32 byte global memory segments (and 128 byte l1 cache segments) they fall into. the memory controller can load 1, 2 or 4 of those 32 byte segments in a single transaction, but that's read through the cache in 128 byte cache lines.
thus, if every lane in a warp loads a random word in a 128 byte range, then there is no penalty; it's 1 transaction and the reading is at full efficiency. but, if every lane in a warp loads 4 bytes with a stride of 128 bytes, then this is very bad: 4096 bytes are loaded but only 128 are used, resulting in ~3% efficiency.
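A rough PyCUDA sketch of that effect: the same amount of useful data is written, but one kernel reads its input coalesced and the other with a 128-byte stride (the kernel names and sizes are my own; the exact timings and ratios will vary by GPU):

    import numpy as np
    import pycuda.autoinit
    import pycuda.driver as drv
    import pycuda.gpuarray as gpuarray
    from pycuda.compiler import SourceModule

    mod = SourceModule("""
    __global__ void coalesced(float *out, const float *in, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];             // neighboring lanes read neighboring words
    }
    __global__ void strided(float *out, const float *in, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[(i * 32) % n];  // 128-byte stride: each lane touches its own line
    }
    """)

    n = 1 << 24
    src = gpuarray.to_gpu(np.random.rand(n).astype(np.float32))
    dst = gpuarray.empty_like(src)

    def bench(name):
        fn = mod.get_function(name)
        start, end = drv.Event(), drv.Event()
        start.record()
        fn(dst.gpudata, src.gpudata, np.int32(n), block=(256, 1, 1), grid=(n // 256, 1))
        end.record()
        end.synchronize()
        print(name, round(start.time_till(end), 3), "ms")

    bench("coalesced")
    bench("strided")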
look up "dual-issue":
There seems to be Nvidia's Pascal, GTX, Titan, etc. Something called GeForce. And I believe these are just from Nvidia.
If I'm interested in building out a desktop with a gpu for:
1. Learning computation on GPUs (matrix math such as speeding up linear regression, deep learning, CUDA) using C++11
2. Trying out oculus rift
Is this the right card? Note that I'm not building production models. I'm learning to use gpus. I'm also not a gamer, but am intrigued by oculus. Which GPU should I look at?
If not, the 980 Ti or Titan X offer excellent deep learning performance, albeit only at FP32. And their scheduling/preemption is not entirely there; they may not be as capable at Dynamic Parallelism as Kepler was. The 780 Ti is actually a more capable card in some respects.
The new consumer Pascal cards will almost certainly not support fast FP64; NVIDIA has considered that a Quadro/Tesla feature since the OG Titan. If DP performance is a critical feature for you and you need more performance than an OG Titan will deliver, you want the new Tesla P100 compute cards, and you'll have to convince NVIDIA you're worthwhile and pay a 10x premium if you want one within the next 6 months. But they probably will support compute better, although you should wait for confirmation before spending a bunch of money...
For VR stuff or deep learning, the consumer Pascal cards sound ideal. Get a 1070 or 1080, definitely. The (purportedly) improved preemption performance alone justifies the premium over Maxwell, and the FP16 capability will significantly accelerate deep learning (the difference between FP16 and FP32 has little effect on the overall net output in deep learning).
I am purely after FP64 performance for scientific compute.
a) what does the "OG" stand for?
b) what about the Titan Black model? It seems to offer a bit more FP64 performance than the normal Titan?
NVIDIA is totally dominant in the compute market. They have an enormous amount of software running on CUDA that you would be locking yourself out of with AMD, and since NVIDIA has such a dominant share of the compute hardware you would also be tuning for what amounts to niche hardware.
AMD has recently been working on getting a CUDA compatibility layer working, hopefully this will improve in the future.
- Pascal: codename for Nvidia's latest microarchitecture, used in the new GeForce 10-series announced today. (AMD counterpart is "Polaris".)
- GeForce: Nvidia's consumer graphics card brand. (AMD equivalent is "Radeon".)
- GTX: label for GeForce graphics cards meant for gaming and other demanding tasks (see AMD Radeon "R9").
- Titan: special name for super high-end GeForce graphics cards. (For AMD, see "Fury".)
You might also encounter Nvidia Quadro graphics cards (comparable to AMD FirePro), which are meant for professional workstations, and Nvidia Tesla graphics cards, which target high performance computing.
The 1080 will probably be the best card available until next year, when HBM2 cards (~3x memory bandwidth) reach general availability. I'm hoping for a 1080 Ti or a new Titan then.
What are the characteristics of NVIDIA GPUs that make them superior for deep learning applications?
Phrased another way, if you're designing a card specifically to be good for training deep neural nets, how does it come out differently from cards designed to be good at other typical applications of GPGPU?
On the hardware end, Nvidia has slightly superior floating point performance (which is the only thing that matters for neural nets). Pascal also adds 16-bit floating point instructions, which will be a big boost to neural net training performance.
I would be interested to hear about the difference in floating point performance. I would have guessed that, at this point, pretty much every chip designer in the world knows equally well how to make a floating-point multiplier. So it must be that Nvidia vs AMD have made different trade-offs when it comes to caching memory near the FP registers or some such thing?
GTX 970: It was revealed that the card was designed to access its memory as a 3.5 GB section, plus a 0.5 GB one, access to the latter being 7 times slower than the first one. -- https://en.wikipedia.org/wiki/GeForce_900_series#False_adver...
You may not notice it nowadays because several games received patches and newer nVidia drivers limit the memory consumption to just below 3.5 GB. Otherwise you would experience a major slowdown. Also a major problem for CUDA.
Just because not everyone writes CUDA applications that need access to all 4GB does not mean that people are using it wrong, or that they should mind that the last .5GB is slow. If it works for others' use then that's great for them.
No doubt that with future games the 970 will not perform as well as it should, but I will have a different card by then. What they did is a shame and shouldn't happen again, but I haven't noticed any real-world ramifications yet.
Even with 3.5GB it's still a great cost/performance card.
I've heard the 6th gen i7 is not good for deep learning because its PCIe performance is crippled (16 PCIe lanes instead of 40 in previous generations; it should matter for the dual-GPU use case). Don't quote me on that ;) Used Xeon E5-2670 v1 chips are dirt cheap on eBay now, and they are modern enough. Single-core performance is worse than in modern desktop CPUs, but not by too much, multi-core performance is great, and these Xeons let you install lots of RAM.
If you don't want that much RAM then for the same price a single desktop CPU (i7 5th gen?) could work better because of a higher clock rate.
Usually pcpartpicker builds have comments, did you not submit your build for review? (I don't actually know how the site works).
The ASRock EP2C602-2T/D16 has an EEB 12" x 13" form factor, so it may be a bit harder to find a case for it (but it's possible; at least the form factor is standard). Also, it has 2 PCIe slots, which means you can insert only a single GPU - each GPU usually takes 2 adjacent slots. The nice thing is that you can put in more RAM sticks; that means you can either get more RAM or save money by using e.g. 8GB sticks (cheaper per GB) instead of 16GB sticks.
There are Supermicro motherboards similar to the ASUS Z9PA-D8C, but AFAIK they have non-standard mounting points, and Supermicro-branded cases are very pricey. There are also a few other ASUS motherboards (Z9PE-D8 WS, Z9PE-D16); they have nice features - the D16 has more RAM slots and the D8 WS has more PCIe slots, but the D8 WS costs more and neither uses the ATX form factor (though their form factor is still standard).
The motherboard I've chosen is not without gotchas - the CPU sockets are very close to each other, so you can't use a cooler that overhangs the socket area; most popular coolers don't fit. To install Ubuntu I had to first use the integrated video to install the nvidia drivers, and then disable the integrated video using an on-board jumper (there were some Ubuntu configuration problems with both the internal and nvidia GPUs active), but this is likely not specific to this MB.
There may be other motherboards; look for the C602 chipset, check the form factor, how many RAM and PCIe 16x slots there are, how the PCIe speed is adjusted when several slots are occupied (e.g. often a 16x slot becomes an 8x slot when something is inserted in another slot), and check the MB image to understand how the PCIe slots are laid out: when two of them are next to each other, a GPU covers them both. E.g. the ASRock Rack EP2C602 looks fine (it is EEB and has no USB3, but the CPU sockets are laid out better than on the ASUS Z9PA-D8C, you can use more cooler models, and it is a bit cheaper). No idea about which brands are better; I only built such a computer once.
Going beyond that is not impossible but requires server-grade hardware. For example, the Dell R920/R930 has 10 PCIe slots, as does the Supermicro SuperServer 5086B. The barebone for the latter is above $8K. You need to buy Xeon E7 chips for these and those will cost you more than a pretty penny. I do not think $20K is unreasonable to target.
Not enough? A single SGI UV 300 chassis (5U) provides 4 Xeon E7 processors, up to 96 DIMMs, and 12 PCIe slots. You can stuff 8 Xeon E7 CPUs into the new HP Integrity MC990 X and have 20 (!) PCIe slots. How much do these cost? An arm and two legs. I can't possibly imagine how such a single workstation would be worth it instead of a multitude of cheaper workstations with just 4 GPUs each (you'd likely end up with an order of magnitude more GPUs that way -- E7 CPUs and their base systems are hideously expensive), but to each their own.
Setting aside this 1080, just on price per CUDA core, cheaper workstations with pairs of 970s are much better deals, because everything gets cheaper, not just the GPUs.
Consumer cards are a good deal if you know your algo works well on GPUs in single (32-bit) floating point and you don't mind dicking around with hardware configurations to save money.
E.g. if you play around for a little bit, you could end up spending $50 on the cloud. By contrast, a card will set you back $350+.
If you're doing lots of machine learning, the cloud costs will quickly rack up.
There are other factors too. E.g. the cloud's performance can vary depending on the time of day (and who's using the application).
Don't buy a graphics card unless you're certain you're going to do lots of ML. Or buy one if you think it'll motivate you to justify the purchase by doing lots of ML.
Generally, it makes sense to have a local GPU to try out models, and move to the cloud for large scale computing, once you have a better idea of the model.
One thing not mentioned in this gaming-oriented press release is that the Pascal GPUs have additional support for really fast 32 bit (and do I recall 16 bit?) processing; this is almost certainly more appealing for machine learning folks.
On the VR side, the 1080 won't be a minimum required target for some time is my guess; the enthusiast market is still quite small. That said, it can't come too quickly; better rendering combined with butter-smooth rates has a visceral impact in VR which is not like on-screen improvements.
Idle power consumption is an order of magnitude less. For the 980 Ti, you're looking at about 10 W of power consumption while running outside of a gaming application. Maybe 40 W if you're doing something intensive. 
I'm sure some of those expensive cards have extra features, but the pricing differences are still crazy.
Meanwhile we have no problems with an office running on 950Ms and 780/980 GTXes, except for some weird driver bugs that only occur on the single Quadro machine that we keep for reference.
AFAICT, PassMark's GPU benchmarks are most often used simply because it's usually the first Google result for "video card benchmark".
I'm just realizing that GPUs could be a viable alternative to specialized (i.e., expensive) encryption offload engines or very high-end (i.e., expensive) v4 Xeons with many, many cores. However, it is rather hard to find data on the bandwidth at which GPUs can process a given cipher. I'm just now starting to look into this, so I could very well be misunderstanding something.
With each step you not only cut down your bandwidth but also increase your latency, and that has a huge impact.
Also, would the data dependencies be different on a GPU than with parallelization on a CPU? Because with modes like GCM (i.e. CTR-based), both encryption and decryption are parallelizable (with CBC it's just decryption).
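Right - in CTR the keystream block depends only on the key and the counter, so any chunk can be processed independently. A small sketch with the Python `cryptography` package showing two halves encrypted separately (only the starting counter differs) matching the serial result; a GPU implementation is the same idea with one keystream block per thread:

    import os
    from cryptography.hazmat.backends import default_backend
    from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

    def ctr_encrypt(key, counter_block, data):
        enc = Cipher(algorithms.AES(key), modes.CTR(counter_block), default_backend()).encryptor()
        return enc.update(data)

    key = os.urandom(32)
    iv = os.urandom(16)                  # initial 128-bit counter block
    data = os.urandom(1 << 20)           # 1 MiB of plaintext

    serial = ctr_encrypt(key, iv, data)

    # Split the work in two: the second half just starts from an advanced counter.
    half = len(data) // 2                # multiple of the 16-byte block size
    iv2 = ((int.from_bytes(iv, "big") + half // 16) % (1 << 128)).to_bytes(16, "big")
    parallel = ctr_encrypt(key, iv, data[:half]) + ctr_encrypt(key, iv2, data[half:])

    assert parallel == serial            # blocks are independent, so halves can run concurrently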
I wonder if AMD GPUs would be better suited to the task than Nvidia's (at least their existing GPUs), provided you could actually saturate the GPU with enough work to make it worth it.
Issue rate is in the CUDA documentation for Maxwell (cc 5.2) at least, presumably they'll update with Pascal soon:
Where they would win is in memory bandwidth for intermediate results (e.g., in/out of registers or smem or even global memory), assuming that you aren't ultimately bound by I/O over the bus from the CPU.
Then again, rendering a demanding title at 8192x4320 on a single card and downsampling to 4k is probably wishful thinking anyways. However, it's definitely a legitimate concern for those with dual/triple/quad GPU setups rendering with non-pooled VRAM.
On the bright side, 8GB should hopefully be sufficient to pull off 8k/4k supersampling with less demanding titles (e.g. Cities: Skylines). Lackluster driver or title support for such stratospheric resolutions may prove to be an issue, though.
It's possible Nvidia is saving the 12GB configuration for a 1080 Ti model down the road. If they release a new Titan model, I'm guessing it'll probably weigh in at 16GB. Perhaps less if those cards end up using HBM2 instead of GDDR5X.
Accidentally multiplied DCI 4K, not UHD 4K.
Am I wrong to think this card could really shake up bcrypt GPU cracking?
But oh man am I excited!
I'm interested to know how quickly I can plug in a machine learning toolkit. It was a bit finicky to get up and running on a box with a couple of 580 GTs in it, but that might just be because it was an older board.
It's also nice to have an all-Intel machine for Linux. I'd use a low-end NV Pascal to breathe new life into an older desktop machine, since NV always seems to have a bit better general desktop acceleration, which really helps out old CPUs. If building a high-end gaming rig I'd probably wait for AMD's next chip. I've liked them more on the high end for a few generations now: async compute and fine-grained preemption, generally better hardware for Vulkan/DX12. AMD is also worth keeping an eye on for their newfound console dominance and the subsequent impact if they push dual 'Crossfire' GPUs into the Xbox One v2, the PS4K and the Nintendo NX. That would be a master stroke: games programmed at a low level for their specific dual GPUs by default. Also, the removal of the driver middleware mess, with the return of low-level APIs to replace OGL/DX11, will get the software monkey off AMD's back. That always plagued them and the former ATI a bit.
I'll probably buy the KabyLake 'Skull Canyon' NUC revision next year and if I end up missing the GPU power, hook up the highest end AMD Polaris over Thunderbolt. Combining the 256MB L4 cache that Kabylake-H will have with Polaris will truly be next-level. Kaby also has Intel Optane support in the SODIMM slots, it's possible we'll finally see the merge of RAM+SSDs into a single chip.
But more than anything, I want Kabylake because it's Intel's last 14nm product, so here's hoping they finally sort out their electromigration problems. Best to take a wait-and-see approach on these 16nm GPUs for the same reason. I'm moving to a 'last revision' purchase policy on these <=16nm processes.
Can you elaborate on this? Never heard of this.
A little bit of everything but note how common 1024x768 is. It looks a little sharper, enough that it doesn't need AA (which I always felt introduced some strange input lag).
Guarantees good performance, makes the models slightly bigger. Gives you a slight edge. You can also run a much slower GPU, that's how I'm getting away with using stuff like Intel Iris Pro. I'd run 1280x1024 with all options on lowest even if I had a Fury X. Other people I know have GF780s and do the same. I only turn up graphics options in single player games, at which point I don't care about FPS dips. When playing competitively I want every edge I can get.
Jesus christ, I can't even imagine the leaps in skill I'd have to make before the difference between 50fps and 150fps had any impact on my performance. Watching gaming at that level feels like being a civilian in Dragonball Z, just seeing a bunch of blurs zipping in the sky while wondering if I'm about to become obsolete.
Keep in mind fps are normally given in averages. You can go up to a wall and have nothing on screen and get an insane FPS or you can back out and take in all of the area with max view distance and see a lower FPS. Add in players/objects/explosions etc and your FPS dips and dives.
Having an average of 50 FPS means you likely ARE going to notice certain dips, while at a 150 average you'd be much less likely to get a perceivable FPS dip.
Also there are plenty of full HD displays with 120Hz+ now ...
Back in the CS < 1.3 days everyone used 640x480 but I never really understood why.
> The GTX 1080 is 3x more power efficient than the Maxwell Architecture.
I think someone got carried away by their imagination.
I found that the 980 has 4.6 TFLOPS (single precision).
And assuming that the 1080's figure (9 TFLOPS) is also single precision and the new card has the same TDP, this is a ~1.96x increase, so it is ~2x.
EDIT: I found that the 1080 will have a 180W TDP, where the 980 has 165W, so correction: it will be a 1.79x increase per watt.
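Same arithmetic spelled out (numbers as quoted above):

    gtx980  = {"tflops": 4.6, "tdp": 165}   # single precision, figures from the posts above
    gtx1080 = {"tflops": 9.0, "tdp": 180}

    raw = gtx1080["tflops"] / gtx980["tflops"]                                            # ~1.96x
    per_watt = (gtx1080["tflops"] / gtx1080["tdp"]) / (gtx980["tflops"] / gtx980["tdp"])  # ~1.79x
    print(f"raw: {raw:.2f}x  perf/W: {per_watt:.2f}x")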
built to give 'desktop' cluster-class performance based on GPUs. I wonder how it would fare against a 1080.
Maxwell had been planned as 20nm, but when TSMC couldn't solve their process issues the first Maxwell was taped out at 28nm (GM107, aka the GeForce 750), while they waited on the rest of the lineup. 6 months later, with TSMC still having issues at 20nm, the GM204 (GeForce 980/970) was launched at 28nm.
16nm FinFET wasn't subject to massive delays (it launched early, in fact), and seems to have better yields than 20nm early on, so Pascal skipped right down to it.
> The performance of NVIDIA’s GeForce GTX 1080 is just off the charts. NVIDIA’s CEO, Jen-Hsun, mentioned at the announcement that the GeForce GTX 1080 is not only faster than one GeForce GTX 980 but it crushes two 980s in SLI setup.
From the article, I got the impression that the difference was like night and day. I wonder when we will be getting some early real-world benchmarks.
"10 Gaming Perfected"
Seems like a missed marketing opportunity to make it a 10 TFLOPS card.
> There will be two versions of the card, the base/MSRP card at $599, and then a more expensive Founders Edition card at $699. At the base level this is a slight price increase over the GTX 980, which launched at $549. Information on the differences between these versions is limited, but based on NVIDIA’s press release it would appear that only the Founders Edition card will ship with NVIDIA’s full reference design, cooler and all. Meanwhile the base cards will feature custom designs from NVIDIA’s partners. NVIDIA’s press release was also very careful to only attach the May 27th launch date to the Founders Edition cards.