Wow. It has 15.3 billion transistors. It's amazing we can buy something with that many engineered parts. Even if the transistors are the result of duplication and lithography, it's an astonishing number. Creating the mask must have taken a while.
Does anyone know what the failure rate is for the transistors (or transistors of a similar production process)? Do they all have to function to spec for a GPU, or are malfunctioning transistors disabled or corrected? What does the QC process look like?
Exact failure and bin rates for most semiconductor companies are considered deep, dark internal trade secrets. Other than pure scale, yield rate is one of the biggest factors in semiconductor cost and profit margin.
And the answer is: it depends. If you lose certain critical transistors, you lose the entire chip. But the vast majority of the transistors on each chip are part of caches or of many, many duplicate GPU cores; if those fail to pass tests, they can be disabled or downclocked, and the chip is binned into the appropriate product line.
With GPUs this is much easier than with other types of chips, because the level of functional duplication allows a lot of flexibility. If a core is bad, you use a different one, and GPU cores are small enough that they'd be stupid not to put some spares on each chip. Same with memories.
Generally one can safely assume:
* Most chips that come off the line are binned into a lower category and do not function at max spec for everything, which is why the price jump is so high at the extreme upper end of a hardware series.
* With ASIC lithography most transistor malfunction isn't correctable: you mostly have to either downclock (for some types of faults) or disable (for the rest) the affected piece.
* Rates of transistor malfunction are still incredibly fucking amazingly phenomenally low. With 15B transistors on a chip, you have trouble affording a failure rate of even one in a billion.
So your line has to be, as the kids say: on fleek.
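To put a number on that: assuming (unrealistically) that transistor failures are independent, the fraction of fully perfect chips falls off a cliff as the per-transistor defect rate rises. A quick sketch:

```python
# Sketch: chip yield vs. per-transistor defect rate, assuming
# (hypothetically) independent failures across all transistors.
N = 15_300_000_000          # transistors per chip (GP100)

def full_chip_yield(per_transistor_rate, n=N):
    """Probability that every single transistor on the chip works."""
    return (1.0 - per_transistor_rate) ** n

# At 1-in-a-billion, almost no chip is perfect:
print(full_chip_yield(1e-9))    # ~2e-7
# You'd need roughly 1-in-a-trillion for most dies to come out perfect:
print(full_chip_yield(1e-12))   # ~0.985
```

This is exactly why the redundancy above matters: a vendor doesn't need every transistor to work, only enough duplicated blocks to bin the chip into some product line.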
I used to work for an OEM, and the Intel and AMD engineers would quietly explain to us how this worked on a number of occasions.
The AMD X3 chips were, I think, the best example of this being done. These were quad-core parts that AMD was manufacturing at the time, but a defect made one core faulty. So that core was disabled, and the chip was sold as a triple-core part.
http://www.zdnet.com/article/why-amds-triple-core-phenom-is-...
An interesting note: processes improve over time, so most companies end up binning processors much more conservatively as a line matures. Price discrimination means companies want to restrict their highest bin even if they could sell it for less. https://en.wikipedia.org/wiki/Price_discrimination
I don't have direct knowledge of current yield rates, so this is speculation. That said, I did give what I think is a reasonable order of magnitude in the above comment when I said a company would have trouble affording a rate of 1 in a billion transistors for unclustered defects. I meant that literally. Some chips just probably can't be profitable with a rate that high. Nvidia might be able to make yield on that rate since GPUs have enough functional duplication, but I'd expect Intel's rate to be under one in a billion and over one in a trillion.
Also note that clustered failures are different. Some whole wafers might be junk if alignment is off, or if there's a bubble somewhere, a series of adjacent chips would be destroyed. If you throw a whole wafer away, that thankfully doesn't mean you have to produce a billion perfect wafers to make up for it. So the yield rate above would only need to apply for the parts where the process is otherwise dialed in.
If you have a bubble or alignment issue, it really doesn't matter at that point whether you kill tens of transistors or a few billion; any chips where the bubble is are likely just gonna get marked and tossed. And on some types of lines it's routine that if the chip yield on a wafer is low enough, the whole thing is tossed, since it's not economical to cut, further test, and package up any working ones.
Semiconductor economics are pretty nuts.
It's actually more common to specify defect rate in terms of defects per mm^2, because the exact number of transistors involved is mostly irrelevant when a defect is wide-ranging.
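For illustration, the classic Poisson yield model works directly in defects per area rather than per transistor; the defect density below is an illustrative figure, while GP100's die really is around 610 mm^2:

```python
import math

# Sketch of the Poisson yield model: yield depends on defect density
# per unit area and die area, not on transistor count.
def poisson_yield(defect_density_per_mm2, die_area_mm2):
    """Fraction of dies with zero defects, assuming defects land randomly."""
    return math.exp(-defect_density_per_mm2 * die_area_mm2)

# A huge die like GP100 (~610 mm^2) gets hit much harder than a small one
# on the same line (defect density here is illustrative, not a real figure):
print(poisson_yield(0.001, 610))   # ~0.54
print(poisson_yield(0.001, 100))   # ~0.90
```

This is also why big GPUs lean so heavily on disabling bad cores: at that die size, zero-defect dies are the minority even on a healthy line.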
I do not have the answers to your questions (and I don't think anyone can share actual failure rates), but I would direct you to this video, which goes over a lot of modern chip fabrication techniques, circa 2009:
There are wafer test machines that interface with the wafer directly and do some testing (and which are $$$$), JTAG-type tests, which access parts of the chip out of band, and functional testing. Some products, like SD cards, actually have a microcontroller on board that provides the test routines and error correction without the need for an expensive machine. Design for test is extremely important.
I'm by no means an expert however, I mostly deal with JTAG and functional tests.
Hasn't half precision (16-bit float) been in NVIDIA GPUs forever? I could swear it was already available back in the very first shader-capable GeForce FX days.
(Post author here.) Yes, FP16 has been supported in NVIDIA GPUs as a texture format "forever" -- since before it was incorporated into the IEEE 754 standard. Indeed what is new in GP100 is hardware ALU support (and note denormals are full speed, which is even more important for lower precision formats).
IBM's Power9 and its future Power ISA 3.0 CPUs, which should increasingly focus on deep-learning/big-data optimization, combined with Nvidia's GPUs, which will increasingly optimize for the same, should make an interesting match over the next 5+ years.
On the gaming side, I do hope they continue to optimize for VR. I think AMD is even slightly ahead of them on that.
I'm not sure whether it was called Turbo Pascal, but we had this Pascal editor with built in 2D draw windows to learn programming in high school. I'm still looking for something equivalent (but maybe a bit more modern/portable) to give to my kid when he's showing some interest. Scratch is nice, but the visual programming becomes limiting very quickly. Is there anything like this today?
(Post author here.) Curious to hear more details about your workload, because a 5+-year-old Fermi would truly be hard pressed to outperform Maxwell or even a Kepler K40, let alone Pascal.
It's parameter sweeps of delay differential equations, one simulation per thread. This requires a lot of complex array indexing and global memory access, so the arithmetic density is nowhere near optimal. Still, it's a real-world workload that benefits hugely from GPU acceleration.
Moving from a GTX 480 to a Kepler or Maxwell card, the numbers go up, but not the performance. I might have a corner case, but before investing in new hardware, I would want to benchmark first and not blindly follow the numbers.
People bought 400 series cards for their compute performance long after they were outdated. If your software wants an nvidia card it was either that or go up to quadro. People bought the first titan for the same reason.
* "Up to 2x" performance on FP16 (compared to FP32, so about 22 TFLOPs).
* FP16 is also aimed at neural net training, because when the weights of the net are FP16, it's a more compact representation.
* 3840 general purpose processors.
* More/better texture units, memory units, etc. So it's not just about raw power, but also about a better design.
Guess that covers the important stuff. I just skimmed the article, reading a bit here and there, but that seemed to be the most remarkable part.
How does performance like this compare to the TFLOPs of a supercomputer? I.e., how does one draw parallels between the performance you can get here and what we've historically been able to do, per http://www.top500.org/statistics/perfdevel/?
Pascal won't be cheap, so comparing to a top-end Xeon E5 v4, it's about 7x the theoretical FP64 performance (assuming the Xeon is 2.2GHz * 22 cores * 8 AVX pipes * 2 for FMA per socket, given the price range). Similar story at FP32. However, the GPU wins on FP16.
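A quick back-of-the-envelope check of that 7x figure, using the same assumed Xeon numbers and the announced P100 FP64 peak:

```python
# Sketch of the comparison above (Xeon numbers are the ones assumed
# in the comment; 5.3 TFLOPS is the P100's announced FP64 peak).
xeon_fp64 = 2.2e9 * 22 * 8 * 2      # clock * cores * AVX pipes * FMA
p100_fp64 = 5.3e12

print(xeon_fp64 / 1e9)              # ~774 GFLOPS per socket
print(p100_fp64 / xeon_fp64)        # ~6.8x, i.e. "about 7x"
```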
For historical performance, just pick one of the machines that did x teraflops. E.g. the first teraflop computer used ~6000 200MHz pentium pro chips around 1996.
What I'm trying to figure out is whether a teraflop is directly comparable. That is, the first 4.9-teraflop computer appeared on the top 500 list in 2000, but does that mean that Pascal could provide performance similar to that supercomputer on the LINPACK benchmark?
In describing the benchmark, they say,
In an attempt to obtain uniformity across all computers in performance reporting, the algorithm used in solving the system of equations in the benchmark procedure must conform to LU factorization with partial pivoting. In particular, the operation count for the algorithm must be 2/3 n^3 + O(n^2) double precision floating point operations. This excludes the use of a fast matrix multiply algorithm like "Strassen's Method" or algorithms which compute a solution in a precision lower than full precision (64 bit floating point arithmetic) and refine the solution using an iterative approach.
So, to summarize, if in 2000 the fastest supercomputer on the planet ran at about 4.9 TFLOPs, does that mean, apples-apples on the LINPACK (and only the LINPACK), that Pascal today would outperform that Supercomputer?
Technically, yes, a teraflop is a teraflop, and is directly comparable. It just means you can do an awful lot of floating point operations per second. But many systems are sensitive to memory size, memory bandwidth, and as a result communication costs (i.e. latency/bandwidth of the interconnect between machines).
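Since LINPACK fixes the operation count at 2/3 n^3 + O(n^2), the apples-to-apples part can be made concrete. A rough sketch using peak rates only, ignoring efficiency, memory, and interconnect effects:

```python
# Sketch: time to do LINPACK's fixed FP64 operation count at peak rate.
# Problem size n is arbitrary; machine peaks are from the thread above.
n = 100_000
ops = (2 / 3) * n ** 3              # ~6.7e14 FP64 operations

for name, flops in [("2000-era #1 (4.9 TFLOPS)", 4.9e12),
                    ("P100 FP64 (5.3 TFLOPS)", 5.3e12)]:
    print(name, ops / flops, "seconds at peak")
```

In practice neither machine runs LINPACK at peak, and the supercomputer's score already accounts for its interconnect overheads, so the GPU's position is, if anything, understated by this comparison.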
The benchmark is essentially bottlenecked on FP64 matrix multiplies. If that's what you need to do, then sure, it's indicative.
Some machine learning workloads are also bottlenecked on matrix multiply, but don't need FP64 precision. They can use FP16. Fits a bigger model in a given memory size, makes better use of memory bandwidth, and given the right hardware support, you can get extremely high rates as on Pascal.
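A quick illustration of the footprint argument, using an arbitrary 4096x4096 weight matrix:

```python
# Sketch: storage for one 4096x4096 weight matrix at different precisions.
params = 4096 * 4096
for name, nbytes in [("fp64", 8), ("fp32", 4), ("fp16", 2)]:
    print(name, params * nbytes // 2**20, "MiB")
# fp16 halves the footprint vs fp32 (and quarters it vs fp64), so more
# of the model fits in GPU memory and each DRAM transfer carries more weights.
```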
Personally, I find the memory system on Pascal more interesting than the raw flops rate. Also, the use of NVLink to link multiple GPUs.
I agree on the memory model being the most interesting thing about this card. I sort of "under-sold" it on the "better design" part of my last bullet.
People/manufacturers tend to look at clock rates, fill rates (for GPUs), FLOPs, and "crunching power" in general, completely forgetting the memory part. For example, today most CPUs end up being bound by cache sizes, and performance tuning focuses on being nice to the cache rather than being optimal in your instructions (see for example Abrash's Pixomatic articles[0-2], which are about high-performance assembly programming in "modern environments").
With GPUs and "classic" HPC (I don't know about the new systems with "compute fabric interconnects"), memory usually becomes the bottleneck (except for embarrassingly parallel problems, of course). In fact, I'm pretty sure it was Ken Batcher who said that a supercomputer is a device for turning a CPU-bound problem into an I/O-bound problem.
This. Anything with global interactions (i.e. low flops per byte transferred from memory to core) is poorly suited for GPUs.
There is a hierarchy of HPC-type workloads called "Colella's seven dwarves" that ranks different workloads in terms of being CPU bound or memory bandwidth bound. See also the "roofline model". Both of these heuristics are made to reason about CPUs, but are also effective for thinking about GPUs.
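A minimal sketch of the roofline model mentioned above, with illustrative, roughly P100-class hardware numbers:

```python
# Roofline model sketch: attainable performance is capped by either
# peak compute or peak memory bandwidth, depending on how many FLOPs
# a kernel does per byte moved. Numbers below are roughly P100-class.
PEAK_FLOPS = 10.6e12        # FP32 peak, FLOP/s
PEAK_BW = 720e9             # memory bandwidth, bytes/s

def roofline(arithmetic_intensity):
    """Attainable FLOP/s for a kernel doing this many FLOPs per byte."""
    return min(PEAK_FLOPS, arithmetic_intensity * PEAK_BW)

# A stencil-like kernel at ~1 FLOP/byte is bandwidth-bound:
print(roofline(1.0) / 1e12)     # ~0.72 TFLOP/s
# Dense matrix multiply at ~50 FLOPs/byte hits the compute ceiling:
print(roofline(50.0) / 1e12)    # 10.6 TFLOP/s
```

The crossover point (peak FLOPs divided by bandwidth, about 15 FLOPs/byte here) is exactly the "low flops per byte" threshold the parent comment is describing.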
Comparison to CPU is also important IMHO, and for that you need to be aware that terminology is very different.
What nvidia calls a "core" is more like one entry in a SIMD unit on a cpu.
What nvidia calls a "SM" is closer to a CPU core.
There is more to it than that: GPU cores are more independent than lanes in a CPU vector unit, but on the other hand, GPU SMs are less independent than CPU cores.
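The mapping can be made concrete with GP100's published numbers (the CPU-side figures below are illustrative, matching the Xeon discussed earlier in the thread):

```python
# Sketch of the terminology mapping: GP100's published SM/core counts
# vs. a CPU's cores * SIMD lanes, to show they sit on the same axis.
sms = 60                    # GP100 "SMs" -- roughly analogous to CPU cores
cores_per_sm = 64           # FP32 "CUDA cores" per SM -- roughly SIMD lanes
print(sms * cores_per_sm)   # 3840, the "general purpose processors" figure

cpu_cores = 22              # illustrative Xeon E5 v4 figures
fp32_lanes = 8              # FP32 lanes per 256-bit AVX unit
print(cpu_cores * fp32_lanes)   # 176 "lanes" on the CPU side
```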
It's also worth keeping in mind that mediocre cpu code will run circles around mediocre gpu code. To get the gpu magic you have to invest a lot of effort in tuning for the architecture.