
I don't see any mention of this chip doing DGEMM at 1TF, just that it's sustaining 1TF performance. You can write assembler code that gets within 1% of theoretical peak FLOPS if you're not trying to get anything done, but if you have a source feel free to give it. (Not that that even means much; AMD is getting 80% of theoretical max FLOPS on that benchmark, and I assume Intel would pick the optimum benchmark for its chip even if they had to design the chip around the benchmark.)
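
For anyone keeping score, "percent of theoretical peak" is just sustained FLOPS over cores x clock x FLOPs-per-cycle. Quick sketch in C; the core count, clock, and per-core width here are made-up placeholders, not published Knights Corner numbers:

    /* Back-of-envelope peak-vs-sustained calculation.  All three machine
       parameters are placeholder guesses, not Intel-published specs. */
    #include <stdio.h>

    int main(void) {
        double cores = 50.0;              /* assumed core count              */
        double clock_ghz = 1.5;           /* assumed clock, GHz              */
        double dp_flops_per_cycle = 16.0; /* assumed DP FLOPs/cycle per core */

        double peak_gflops = cores * clock_ghz * dp_flops_per_cycle;
        double sustained_gflops = 1000.0;          /* the claimed 1 TF */

        printf("theoretical peak: %.0f GFLOPS\n", peak_gflops);
        printf("efficiency: %.0f%% of peak\n",
               100.0 * sustained_gflops / peak_gflops);
        return 0;
    }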

Also, I don't see anything that suggests it's anywhere close to a production chip. More important, Knights Ferry chips may help engineers build the next generation of supercomputing systems, which Intel and its partners hope to deliver by 2018. Not to mention you're comparing a preproduction chip with a year-old chip that's running on a two-year-old process when AMD, Intel, and Nvidia are all about to do a die shrink.




Sustaining 1TF on DGEMM was explicitly mentioned by Intel in the presentation/briefing.

It's also mentioned in the press release:

http://newsroom.intel.com/community/intel_newsroom/blog/2011...

"The first presentation of the first silicon of “Knights Corner” co-processor showed that Intel architecture is capable of delivering more than 1 TFLOPs of double precision floating point performance (as measured by the Double-precision, General Matrix-Matrix multiplication benchmark -- DGEMM). This was the first demonstration of a single processing chip capable of achieving such a performance level."

Does it mean much? It means something to me, and is a great first step for those of us running compute intensive codes. They really wouldn't get far if they designed the chip only around being able to do this.
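
For context on what the DGEMM number actually is: you time a big double-precision matrix multiply and divide ~2*M*N*K flops by the wall time. A minimal sketch against the standard CBLAS interface (the matrix size is arbitrary, link against whatever BLAS you have):

    /* Time one square DGEMM and report sustained GFLOPS.
       Build roughly as: cc dgemm_time.c -lcblas (or -lopenblas, etc). */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <cblas.h>

    int main(void) {
        int n = 4096;                      /* arbitrary problem size */
        double *A = malloc(sizeof(double) * n * n);
        double *B = malloc(sizeof(double) * n * n);
        double *C = malloc(sizeof(double) * n * n);
        for (long i = 0; i < (long)n * n; i++) { A[i] = 1.0; B[i] = 0.5; C[i] = 0.0; }

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        /* C = 1.0*A*B + 0.0*C, all n x n, row major */
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, 1.0, A, n, B, n, 0.0, C, n);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        double flops = 2.0 * (double)n * n * n;   /* ~2*M*N*K for GEMM */
        printf("sustained: %.1f GFLOPS\n", flops / secs / 1e9);

        free(A); free(B); free(C);
        return 0;
    }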

As I mentioned elsewhere in the thread, the article text is incorrect. The chip we're discussing is Knights Corner, not Knights Ferry. The latter has been in early users' hands for quite some time now and I've spent plenty of time hacking on it. Knights Corner is the new chip that is working its way to production via the usual process, with ship-for-revenue in 2012.

The 2018 target is for an exascale machine, not shipment of initial MIC devices. TACC has already announced they'll be building out a 10 petaflop MIC-based system next year, to go operational by 2013.

Yes, I'm comparing a chip that has not shipped, but given the perf advantage, the tools and productivity advantage, and the multiyear process advantage Intel is sustaining, this is not a chip to be ignored. Knights Corner is shipping on 22nm. Other vendors have notoriously had difficulty on previous processes, depend on fabs like TSMC who are doing 28nm for them, and will be later to 14nm, etc.


Thanks for clearing that up; my google-fu is weak when they use the wrong names.

Still, it looks like they really do design for benchmarks: "Xeon E5 delivers up to 2.1* times more performance in raw FLOPS (Floating Point Operations Per Second as measured by Linpack) and up to 70 percent more performance using real-HPC workloads compared to the previous generation of Intel Xeon 5600 series processors." 110% more on the benchmark = 70% more in real-world apps.

Granted, it would be great if this works out, but I have seen Intel blow too many new 'high performance' chip launches to expect much. Still, they might just pull this one off, unlike, say, the http://en.wikipedia.org/wiki/Itanium etc.

PS: I always look at what Intel gets x86 to do much like how Microsoft develops software: it's not that the capability is awesome so much as watching a mountain of hacks dance. They have a huge process advantage and can throw piles of money and talent at the process, but they are stuck with optimizations made when computers were less than 1% as powerful.


We should distinguish between designing for a benchmark and designing for a set of workloads. Everyone chooses representative workloads they care about and evaluates design choices on a variety of metrics by simulating execution of parts of those workloads.

Linpack is a common go-to number because, for all its flaws, it's a widely quoted number, e.g. it's used in the Top500 ranking. It tends to let the CPU crank away without stressing the interconnect, and is widely viewed as an upper bound on perf for the machine. In the E5 case it'll be particularly helped by the move to AVX-enabled cores, and will take more advantage of that than general workloads. Realistic HPC workloads stress a lot more of the machine beyond the CPU, interconnect performance in particular.
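
To put a number on the AVX point: Sandy Bridge doubles the DP vector width over the 128-bit SSE units in the 5600 series, so per-core peak roughly doubles for code that is nothing but vectorized FLOPs (which is basically Linpack), while cache- and interconnect-bound work sees far less. Rough sketch; the clock here is an assumed placeholder, not a specific SKU:

    /* Per-core peak DP FLOPS, SSE (Xeon 5600 era) vs AVX (Xeon E5).
       Assumes one vector add + one vector multiply issued per cycle;
       the clock speed is a placeholder. */
    #include <stdio.h>

    int main(void) {
        double clock_ghz = 2.7;            /* assumed clock           */
        double sse_flops_per_cycle = 4.0;  /* 2-wide add + 2-wide mul */
        double avx_flops_per_cycle = 8.0;  /* 4-wide add + 4-wide mul */

        printf("SSE peak per core: %.1f GFLOPS\n", clock_ghz * sse_flops_per_cycle);
        printf("AVX peak per core: %.1f GFLOPS\n", clock_ghz * avx_flops_per_cycle);
        return 0;
    }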

People like to dump on x86 but it's not that bad. There are plenty of features no one really uses that we still carry around, but those features will often end up being microcoded and not gunking up the rest of the core. The big issue is decoder power and performance: x86 decode is complex. On the flip side, the code density is pretty good and that is important. Secondly, Intel and others have added various improvements that help avoid the downsides, e.g. caching of decoded instructions, post-decode loop buffers, uop caches, etc. Plus the new ISA extensions are much kinder.


The problem with x86 is that when you scale the chips to N cores you have N copies of all that dead weight. You might not save many transistors by, say, dropping support for 16-bit floats, relative to how much people would hate you for doing so. However, there are plenty of things you can drop from a GPU or vector processor, and when you start having hundreds of them it's a real issue.

Still, with enough of a process advantage and enough manpower you can end up with something like the i7 2600, which has a near-useless GPU and a ridiculous pin count and still dominates all competition in its price range.


Is there a cost? Of course. But arguably it's in the noise on these chips. Knights Ferry and Corner are using a scalar x86 core derived from the P54C. How many transistors was that? About 3.3 million. By contrast, Nvidia's 16-core Fermi is a 3 billion transistor design. (No, Fermi doesn't have 512 cores; that's a marketing number based on declaring that a SIMD lane is a "CUDA core". If we do the same trick with MIC we'd start doing 50+ cores x 16 wide and claiming 800 cores.)
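
Spelling out the two counting conventions (the Fermi SM count and lane width are the usual published figures; the MIC core count is the "50+" from above):

    /* "Core" counts under the two conventions discussed above. */
    #include <stdio.h>

    int main(void) {
        int fermi_sms = 16, fermi_lanes_per_sm = 32;   /* 512 "CUDA cores" */
        int mic_cores = 50, mic_simd_width = 16;       /* "50+" per above  */

        printf("Fermi: %d real cores, %d marketing cores\n",
               fermi_sms, fermi_sms * fermi_lanes_per_sm);
        printf("MIC:   %d real cores, %d marketing cores\n",
               mic_cores, mic_cores * mic_simd_width);
        return 0;
    }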

How can we resolve this dissonance? Easy: ignoring the fixed-function and graphics-only parts of Fermi, most of the transistors are going to be in the caches, the floating point units, and the interconnect. These are places MIC will also spend billions of transistors, but they're not carrying legacy dead weight from x86 history; the FPU is 16 wide and by definition must have a new ISA. The cost of the scalar cores will not be remotely dominant.

I'm not sure why you are concerned about the pin count on the processor, except perhaps if you are complaining about changing socket designs, which is a different argument. The i7 2600 fits in an LGA 1155 socket (i.e. 1155 pins), whereas Fermi was using a 1981-pin design on the compute SKUs. The Sandy Bridge CPU design is a fine one. The GPU is rapidly improving (e.g. Ivy Bridge should be significantly better, and will be a 1.4 billion transistor design on the same 22nm process as Knights Corner).



