Inside Pascal: Nvidia's Newest Computing Platform (nvidia.com)
122 points by jonbaer on April 21, 2016 | 45 comments



Wow. It has 15.3 billion transistors. It's amazing we can buy something with that many engineered parts. Even if the transistors are the result of duplication and lithography, it's an astonishing number. Creating the mask must have taken a while.

Does anyone know what the failure rate is for the transistors (or transistors of a similar production process)? Do they all have to function to spec for a GPU, or are malfunctioning transistors disabled or corrected? What does the QC process look like?


Exact failure and bin rates are considered deep, dark internal trade secrets at most semiconductor companies. Other than pure scale, yield rates are one of the biggest factors in semiconductor cost and profit margin.

And the answer is: it depends. Lose certain transistors and you can expect to lose the entire chip. But the vast majority of the transistors on each chip are part of caches or of the many, many duplicate GPU cores, which, if they fail to pass tests, can be disabled or downclocked, with the chip then binned into the appropriate product line.

With GPUs this is much easier than with other types of chips, because the level of functional duplication allows a lot of flexibility. If a core is bad, you use a different one, and GPU cores are small enough that they'd be stupid not to put some spares on each chip. Same with the memories.

Generally one can safely assume:

* Most chips that come off the line are binned into a lower category and do not function at max spec for everything, which is why the price jump is so high at the extreme upper end of a hardware series.

* With ASIC lithography, most transistor malfunctions aren't correctable; you mostly have to either downclock (for some types of faults) or disable (for the rest) the affected piece.

* Rates of transistor malfunction are still incredibly, amazingly, phenomenally low. With 15B transistors on a chip, you have trouble affording a failure rate of even one in a billion (rough numbers sketched after this list).

So your line has to be, as the kids say: on fleek.
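
To put a rough number on that last bullet, here's a toy back-of-the-envelope sketch (not real yield data; the 15.3B transistor count is from the article, the failure rates are made-up illustrations):

  # Toy model: probability a chip with N transistors has zero failures,
  # assuming independent per-transistor failure probability p.
  import math

  N = 15.3e9  # transistor count from the article

  for p in (1e-9, 1e-10, 1e-11, 1e-12):
      perfect = math.exp(-p * N)  # (1 - p)^N ~= exp(-p*N) for tiny p
      print(f"p = {p:.0e}: fraction of fully perfect chips ~ {perfect:.3g}")

At one failure per billion transistors, essentially no die comes out fully perfect (~2 in 10 million), which is exactly why the redundancy and binning above matter.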


Spot on.

I used to work for an OEM, and the Intel and AMD engineers would quietly explain to us how this worked on a number of occasions.

The AMD X3 chips were, I think, the best example of this being done. These were quad-core parts that AMD was manufacturing at the time but that had a defect making one core faulty, so that core was disabled and the chip was sold as a triple-core part. http://www.zdnet.com/article/why-amds-triple-core-phenom-is-...


I would expect that partially defective chips are repaired during probe or (more likely) final test by blowing fuses.

A chip's yield, and therefore its cost, will depend on the foundry's natural defect rate per unit area and on the design quality.


An interesting note: processes improve over time, so most companies end up binning processors much more conservatively as a line matures. Price discrimination means companies want to restrict their highest bin even when they could sell it for less. https://en.wikipedia.org/wiki/Price_discrimination


Is there an approximation you can give, such as an order of magnitude? Is it around one failed transistor per thousand, per million, per billion?


I don't have direct knowledge of current yield rates, so this is speculation. That said, I did give what I think is a reasonable order of magnitude in the above comment when I said a company would have trouble affording a rate of 1 in a billion transistors for unclustered defects. I meant that literally. Some chips just probably can't be profitable with a rate that high. Nvidia might be able to make yield on that rate since GPUs have enough functional duplication, but I'd expect Intel's rate to be under one in a billion and over one in a trillion.

Also note that clustered failures are different. A whole wafer might be junk if alignment is off, or, if there's a bubble somewhere, a run of adjacent chips will be destroyed. Throwing a whole wafer away thankfully doesn't mean you have to produce a billion perfect wafers to make up for it, so the yield rate above only needs to hold for the parts of the process that are otherwise dialed in.

If you have a bubble or an alignment issue, it really doesn't matter at that point whether you kill tens of transistors or a few billion; any chips under the bubble are likely just gonna get marked and tossed. And on some types of lines it's routine that if the chip yield on a wafer is low enough, the whole thing is tossed, since it's not economical to cut, further test, and package up the few working ones.

Semiconductor economics are pretty nuts.

It's actually more common to specify the error rate in terms of defects per mm^2, because the exact number of transistors involved is mostly irrelevant when a defect is wide-ranging.
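
To make the defects-per-mm^2 framing concrete, the textbook Poisson yield model says die yield ~ exp(-D0 * A). A small sketch with made-up numbers (the defect densities are illustrative, and the ~600 mm^2 die area is just "big GPU"-sized, roughly what the article quotes for GP100):

  import math

  def poisson_yield(d0_per_mm2, die_area_mm2):
      # Fraction of dies with zero random defects, Poisson model
      return math.exp(-d0_per_mm2 * die_area_mm2)

  die_area = 600.0  # mm^2, roughly a GP100-sized die (assumed)
  for d0 in (0.0005, 0.001, 0.002):  # defects per mm^2, illustrative only
      print(f"D0 = {d0}/mm^2 -> die yield ~ {poisson_yield(d0, die_area):.0%}")

Note that "yield" here means defect-free dies; dies with defects landing in redundant structures can still be salvaged into lower bins, per the discussion above.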


I don't have the answers to your questions (and I don't think anyone can share actual failure rates), but I would direct you to this video, which goes over a lot of modern chip fabrication techniques, circa 2009:

https://www.youtube.com/watch?v=NGFhc8R_uO4

It's crazy stuff.

There are wafer test machines (which are $$$$) that interface with the wafer directly and do some testing; JTAG-type tests, which access parts of the chip out of band; and functional testing. Some products, like SD cards, actually have an on-board microcontroller that provides the test routines and error correction without the need for an expensive machine. Design for test is extremely important.

I'm by no means an expert, however; I mostly deal with JTAG and functional tests.


I just read that the Xilinx XCVU440 FPGA has >20B transistors, and that's one generation old (20nm, UltraScale+ is on 16nm finfet). Insane.


Half precision ftw! ML is the use case they're designing for, but we all get to reap the benefits.


Hasn't half precision (16-bit float) been in NVidia GPUs forever? I could swear it was available back in the very first shader-capable Geforce FX days already.


(Post author here.) Yes, FP16 has been supported in NVIDIA GPUs as a texture format "forever" -- since before it was incorporated into the IEEE 754 standard. Indeed, what is new in GP100 is hardware ALU support (and note that denormals run at full speed, which is even more important for lower-precision formats).
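
If anyone wants to poke at what binary16 actually gives you, numpy's float16 is the same IEEE format (a quick illustration, independent of any GPU):

  import numpy as np

  f16 = np.finfo(np.float16)
  print(f16.eps)   # ~0.000977 -> roughly 3 decimal digits of precision
  print(f16.tiny)  # ~6.1e-05  -> smallest normal value
  print(f16.max)   # 65504.0   -> largest finite value

  # Below f16.tiny you're into denormals (down to ~6e-08), which is why
  # full-speed denormal handling matters more at low precision.
  print(np.float16(1e-5))  # representable, but only as a denormal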


IIRC what's new here is native support in the ALUs, etc. I think the older support was probably in software.


FP has always been 16- or 20-bit precision in the Tegra (mobile) chips up to Tegra 4.


Makes sense, thanks.


IBM's Power9 and its future Power ISA 3.0 CPUs, which should increasingly focus on deep-learning/big-data optimization, combined with Nvidia's GPUs, which will increasingly optimize for the same, should make an interesting pairing over the next 5+ years.

On the gaming side, I do hope they continue to optimize for VR. I think AMD is even slightly ahead of them on that.


Please say it's programmable in Pascal.


Well, Delphi can call functions in external assemblies so...technically?

https://rosettacode.org/wiki/Call_a_function_in_a_shared_lib...


Calling out to FreePascal and Lazarus here - if they provide a DLL, yup!

I was just looking at controlling NGSpice from FreePascal - one of the examples of running a shared instance of NGSpice is done in a Pascal dialect:

http://ngspice.sourceforge.net/shared.html

I like Pascal much better than C++ and think the portable Lazarus GUI toolkit is pretty damn trick. Check it out: http://www.lazarus-ide.org/


Why not? Pascal to PTX should be trivial (e.g., there are a lot of LLVM Pascal frontends).


That will only be for the Turbo models. It was an awesome language.


I'm not sure whether it was called Turbo Pascal, but we had this Pascal editor with built-in 2D drawing windows to learn programming in high school. I'm still looking for something equivalent (but maybe a bit more modern/portable) to give to my kid when he shows some interest. Scratch is nice, but the visual programming becomes limiting very quickly. Is there anything like this today?


openFrameworks and Cinder are C++, so they're probably not a good fit for a first experience.

Check out LÖVE: https://love2d.org/

It's in Lua (so learning the language won't be a big part of the whole experience) and runs on many different platforms.


Löve looks indeed awesome, thanks!


There's Processing and Processing.js, or you might be able to find a modern LOGO. Some terms to search for: "live coding", "turtle graphics".


+1 Nostalgia. :-)


Hush you!


Is there a description somewhere without all the blah-blah hype? A comparison with past architectures would also be welcome, likewise without hype.


The article has several tables that juxtapose the specs of the previous, current, and new generations, and I think that you will enjoy reading it.


I have a Fermi board which, despite worse numbers, easily outpaces a Kepler and a Maxwell board for my workload.

So, yeah, those tables are hype too. I am asking about benchmarks on real workloads.


(Post author here.) Curious to hear more details about your workload, because a 5+-year-old Fermi would truly be hard pressed to outperform Maxwell or even a Kepler K40, let alone Pascal.


It's parameter sweeps of a delay differential equation, one simulation per thread. This requires a lot of complex array indexing and global memory access, so arithmetic density isn't near optimal. Still, it's a real-world workload that benefits hugely from GPU acceleration.

Moving from a GTX 480 to a Kepler or Maxwell card, the numbers go up, but not the performance. I might have a corner case, but before investing in new hardware, I would want to benchmark first and not blindly follow the numbers.


People bought 400-series cards for their compute performance long after they were outdated. If your software wanted an Nvidia card, it was either that or step up to a Quadro. People bought the first Titan for the same reason.


What I got from the article:

* ~5.5 TFLOPs on FP64

* "About 2x" performance on FP32 (so 11 TFLOPs)

* "Up to 2x" performance on FP16 (compared to FP32, so about 22 TFLOPs).

* FP16 is also aimed at neural-net training, because storing the network's weights in FP16 gives a more compact representation.

* 3840 general purpose processors.

* More/better texture units, memory units, etc. So it's not only about raw power, but also about a better design.

Guess that's about the important stuff. I just skimmed the article, reading a bit here and there, but that seemed to be the most remarkable material.
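
For anyone wondering where those numbers come from, they're basically cores x clock x 2 (an FMA counts as two flops). A quick sanity check, assuming the Tesla P100 configuration of 3584 enabled FP32 cores and a ~1.48 GHz boost clock (the full GP100 die has 3840 cores, but the shipping part has some disabled):

  cores_fp32 = 3584      # enabled CUDA cores on Tesla P100 (full GP100: 3840)
  boost_ghz  = 1.48      # approximate boost clock

  fp32 = cores_fp32 * boost_ghz * 2 / 1000  # TFLOPs; 2 flops per FMA
  fp64 = fp32 / 2                           # GP100 runs FP64 at half the FP32 rate
  fp16 = fp32 * 2                           # and FP16 at twice the FP32 rate
  print(f"FP32 ~ {fp32:.1f} TFLOPs, FP64 ~ {fp64:.1f}, FP16 ~ {fp16:.1f}")

That lands right around the ~5.3/10.6/21 TFLOPs figures quoted for FP64/FP32/FP16.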


How does performance like this compare to the TFLOPs of a supercomputer? I.e., how does one draw parallels between the performance you can get here and what we've historically been able to do, per http://www.top500.org/statistics/perfdevel/?


Pascal won't be cheap, so comparing it to a top-end Xeon E5 v4, it's about 7x the theoretical FP64 performance (assuming the Xeon does 2.2 GHz * 22 cores * 8 AVX FMA lanes * 2 flops per FMA per socket, a fair match given the price range). Similar story at FP32. At FP16, however, the GPU wins outright.

For historical performance, just pick one of the machines that did x teraflops. E.g., the first teraflop computer used ~6,000 200 MHz Pentium Pro chips, around 1996.
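
Spelling out that Xeon estimate as a quick sketch (using the same assumed figures; on Broadwell-EP each core can retire two 256-bit FMAs per cycle, i.e. 16 FP64 flops/cycle):

  # Theoretical peak FP64 for one 22-core, 2.2 GHz Xeon E5 v4 socket
  cores, ghz = 22, 2.2
  flops_per_cycle = 8 * 2          # 8 AVX FMA lanes x 2 flops per FMA
  xeon_tflops = cores * ghz * flops_per_cycle / 1000
  print(f"Xeon ~ {xeon_tflops:.2f} TFLOPs FP64")          # ~0.77

  gpu_tflops = 5.3                 # GP100 FP64 peak from the article
  print(f"GPU / Xeon ~ {gpu_tflops / xeon_tflops:.1f}x")  # roughly 7x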


What I'm trying to figure out is whether a teraflop is directly comparable. That is, on the TOP500 list, the first 4.9-teraflop computer appeared in 2000, but does that mean Pascal could provide performance similar to that supercomputer's on the LINPACK benchmark?

In describing the benchmark, they say,

In an attempt to obtain uniformity across all computers in performance reporting, the algorithm used in solving the system of equations in the benchmark procedure must conform to LU factorization with partial pivoting. In particular, the operation count for the algorithm must be 2/3 n^3 + O(n^2) double precision floating point operations. This excludes the use of a fast matrix multiply algorithm like "Strassen's Method" or algorithms which compute a solution in a precision lower than full precision (64 bit floating point arithmetic) and refine the solution using an iterative approach.

So, to summarize: if in 2000 the fastest supercomputer on the planet ran at about 4.9 TFLOPs, does that mean, apples to apples on LINPACK (and only LINPACK), that Pascal today would outperform that supercomputer?
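
In terms of raw operation count, yes: HPL's LU factorization needs about (2/3)*n^3 FP64 operations, so problem size, sustained rate, and runtime convert straight into each other. A toy estimate (the sustained-efficiency figure is purely an assumption; real HPL efficiency depends heavily on memory and interconnect):

  # Rough HPL runtime: flops = (2/3) * n^3, time = flops / sustained rate
  n = 100_000                  # matrix dimension, illustrative
  flops = (2 / 3) * n ** 3     # ~6.7e14 FP64 operations
  peak_tflops = 5.3            # GP100 FP64 peak from the article
  efficiency = 0.7             # assumed fraction of peak actually sustained
  seconds = flops / (peak_tflops * 1e12 * efficiency)
  print(f"{flops:.2e} flops -> ~{seconds:.0f} s at {efficiency:.0%} of peak")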


Technically, yes, a teraflop is a teraflop, and it is directly comparable. It just means you can do an awful lot of floating-point operations per second. But many systems are sensitive to memory size, memory bandwidth, and, in turn, communication costs (i.e., the latency/bandwidth of the interconnect between machines).

The benchmark is essentially bottlenecked on FP64 matrix multiplies. If that's what you need to do, then sure, it's indicative.

Some machine learning workloads are also bottlenecked on matrix multiply but don't need FP64 precision. They can use FP16: it fits a bigger model in a given memory size, makes better use of memory bandwidth, and, given the right hardware support, delivers extremely high rates, as on Pascal.

Personally, I find the memory system on Pascal more interesting than the raw flops rate. Also, the use of NVLink to link multiple GPUs.


I agree on the memory model being the most interesting thing about this card. I sort of "under-sold" it on the "better design" part of my last bullet.

People and manufacturers tend to look at clock rates, fill rates (for GPUs), FLOPs, "crunching power" in general, completely forgetting the memory side. For example, today most CPU code ends up being bound by cache size, and performance tuning focuses on being nice to the cache rather than being optimal in your instructions (see for example Abrash's Pixomatic articles [0-2], which are about high-performance assembly programming in "modern environments").

With GPUs and "classic" HPC (I don't know about the new systems with "compute fabric interconnects"), memory usually becomes the bottleneck (except for embarrassingly parallel problems, of course). In fact, I'm pretty sure it was Cray who said that a supercomputer is a way to turn a CPU-bound problem into an IO-bound problem.

[0] http://www.drdobbs.com/architecture-and-design/optimizing-pi...

[1] http://www.drdobbs.com/optimizing-pixomatic-for-modern-x86-p...

[2] http://www.drdobbs.com/optimizing-pixomatic-for-modern-x86-p...


This. Anything with global interactions (i.e. low flops per byte transferred from memory to core) is poorly suited for GPUs.

There is a hierarchy of HPC-type workloads called "Colella's seven dwarves" that ranks different workloads in terms of being CPU bound or memory bandwidth bound. See also the "roofline model". Both of these heuristics are made to reason about CPUs, but are also effective for thinking about GPUs.
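
A minimal sketch of that roofline idea, with assumed GP100-ish numbers (FP64 peak and HBM2 bandwidth as quoted for the Tesla P100; the arithmetic intensities are just illustrative points):

  # Roofline: attainable rate = min(compute peak, bandwidth x arithmetic intensity)
  peak_gflops = 5300.0   # FP64 peak (GFLOP/s), Tesla P100
  bw_gbs      = 720.0    # HBM2 memory bandwidth (GB/s), Tesla P100

  def roofline(flops_per_byte):
      return min(peak_gflops, bw_gbs * flops_per_byte)

  for ai in (0.25, 1.0, 7.4, 20.0):  # e.g. streaming kernels vs. dense matmul
      print(f"AI = {ai:5} flop/byte -> {roofline(ai):6.0f} GFLOP/s attainable")

Anything well to the left of the ridge point (~7.4 flop/byte with these numbers) is bandwidth bound, which is where the low-flops-per-byte workloads mentioned above end up.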


What's the memory bandwidth of the Xeon E5 v4? NVidia made a big deal of their increased memory bandwidth due to chip stacking.


Comparison to CPU is also important IMHO, and for that you need to be aware that terminology is very different.

What Nvidia calls a "core" is more like one lane of a SIMD unit on a CPU. What Nvidia calls an "SM" is closer to a CPU core.

There is more to it than that; GPU cores are more independent than lanes in a CPU vector unit, but on the other hand GPU "SM"s are less independent than CPU cores.

It's also worth keeping in mind that mediocre CPU code will run circles around mediocre GPU code. To get the GPU magic you have to invest a lot of effort in tuning for the architecture.


I think the unified memory is one of the big deals here. This makes working on large data sets much easier.


300 Watts.

Toasty.


Sorry, but does it run Crysis? ;-)

But seriously, quite an impressive piece of hardware.



