
No Free Lunch for Intel MIC (or GPU’s) - ajdecon
http://blogs.nvidia.com/2012/04/no-free-lunch-for-intel-mic-or-gpus/
======
noobface
Thanks for posting this. I love to see hardware stuff on the front page,
especially such a relevant topic.

I actually sat down with the director of an HPC shop a few months ago and
discussed the MIC question.

He was hopeful, but said nearly everything about MIC isn't as mature as it
needs to be to warrant adoption. His best-case scenario was adopting MIC for
his 2016 build.

------
modeless
I think this article is a bit off target. Yes, it's true that a simple
recompile of an OpenMP app won't give you peak performance, but it's still
nice to be able to do it as a starting point for incremental porting.

Furthermore, while OpenMP apps probably will have scaling issues at 50+ cores,
I think the bigger issue that the article hardly mentions is the new wide
vector unit in the MIC cores. That's where all the FLOPS happen, and it's a
completely new instruction set. x86 is a sideshow. Apps will need to be
rewritten for the vector unit to get anywhere near peak performance on MIC.
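
To make that concrete, a minimal OpenMP sketch (the kernel is illustrative;
Knights Corner's vector unit is 512 bits wide, i.e. 16 floats per operation):

    #include <stddef.h>

    /* Recompiling this unchanged for MIC (e.g. "icc -mmic -openmp" with
     * Intel's compiler) spreads iterations across the 50+ cores, but it
     * approaches peak FLOP/s only if the compiler also vectorizes the
     * loop body for the new 512-bit vector instruction set. */
    void saxpy(size_t n, float a, const float *x, float *y)
    {
        #pragma omp parallel for
        for (size_t i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];   /* vectorizable: 16 floats per op */
    }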

~~~
pavlov
Since apps need to be rewritten anyway, what's the value of doing the simple
recompile using OpenMP? It doesn't offer sufficient performance and it doesn't
function as a template for the rewritten app.

The whole thing seems like a marketing-driven exercise in fake compatibility.
Imagine a company with an old PalmOS app which needs to be ported to iPad if
the company wants to stay relevant. They can't just wrap the Palm app inside
an emulator and sell that -- it's just not good enough for a port because it
doesn't leverage any of the new platform's strengths. That's pretty much what
Intel is proposing.

~~~
modeless
I disagree that all apps need to be entirely rewritten. I think it would be
perfectly reasonable to use an existing OpenMP app as a starting point, and
it's likely that some parts would work fine. Even if the MIC chip runs some
parts of the app slower than a Xeon would, it could still be worth running
those parts on the MIC chip to avoid sending intermediate results through the
PCIe bottleneck.
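
A hedged back-of-envelope (bandwidth and sizes are illustrative assumptions,
not measurements): PCIe 2.0 x16 moves roughly 8 GB/s, so shipping a 1 GB
intermediate result to the host and back costs about 250 ms. If the MIC runs
that stage in 150 ms where a Xeon takes 50 ms, keeping it on the card still
wins: 150 ms versus 50 ms plus 250 ms of transfer.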

------
Andys
Nice try, Nvidia. This article is part fact, part FUD.

      "If you currently have access to MIC chips and have been testing
      real applications, I would love to hear from you"

The article's judgment of Intel's solution is based only on press releases,
not on any practical working knowledge.

      "We can no longer reduce voltage in proportion to transistor size"

The "we" here refers to Nvidia, and not Intel, because Nvidia uses TSMC to fab
their GPU chips.

    
    
      "there is no such thing as a “magic” compiler that will
      automatically parallelize your code"
    

That is funny, because Nvidia's newest GPU design offloads scheduling
decisions onto the compiler instead of handling them dynamically on-core.

So perhaps Nvidia is warming up the FUD machine because its newest range of
GPUs focuses on gaming at the expense of compute.

E.g., double-precision FLOP/s is now less than 10% of single-precision
FLOP/s.

~~~
codedivine
Your reply contains some weird non sequiturs. Nvidia is right on target when
they say there is no "magic" autoparallelizing compiler. The
autoparallelization they talk about and the scheduling issues you talk about
are completely unrelated, except that they both involve the word "compiler" :/

As for double-precision performance, it is also well known that the GTX 680
is only for consumers. The HPC version of Kepler is due later this year, and
I will not be surprised to see it offer much improved double-precision
performance compared to current Teslas. Intel MIC is not a consumer product,
so it is unfair to compare it to consumer products like the GTX 680.

~~~
pavanky
The compiler comment is pretty strange for a company that is (or was)
aggressively pushing OpenACC [1].

[1] <http://www.nvidia.com/object/openacc-gpu-directives.html>
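
For reference, the directive style Nvidia promotes looks like this (a minimal
C sketch; the pragma is standard OpenACC, the function itself is illustrative):

    /* A single directive asks the compiler to generate and launch a GPU
     * kernel for this loop -- the same "add a hint and recompile" story
     * the article dismisses as magic when Intel tells it about MIC. */
    void scale(int n, float a, float *restrict x)
    {
        #pragma acc parallel loop
        for (int i = 0; i < n; i++)
            x[i] *= a;
    }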

------
tambourine_man
Not my area of expertise, so I'm sorry if this doesn't make much sense:

Wouldn't it be possible to have a virtual machine that presents itself as a
single CPU and distributes the job efficiently across the many cores it is
running on?

~~~
extension
How would you distribute the work of a virtual machine? It has to execute
instructions in sequence, just like a non-virtual machine. Any parallelism
needs to be extracted at a much higher level of algorithmic abstraction, i.e.
by a programmer.

Imagine trying to make a virtual machine that would rewrite any bubble sort as
a quicksort. Parallelizing algorithms is much more difficult than even that.
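
A minimal C sketch of the problem (the kernel is illustrative): each
iteration depends on the previous one, so a VM executing the instructions as
written has no legal way to overlap iterations; only an algorithm-level
rewrite, e.g. a parallel scan, exposes the parallelism.

    #include <stddef.h>

    /* Loop-carried dependency: iteration i needs the result of i-1. */
    void smooth(size_t n, const float *x, float *y)
    {
        float prefix = 0.0f;
        for (size_t i = 0; i < n; i++) {
            prefix = 0.5f * prefix + x[i];  /* depends on previous value */
            y[i] = prefix;
        }
    }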

~~~
tambourine_man
But we are already able to extract parallelism deep down at the instruction
level with out-of-order execution.

Imagine some layer in between an abstract syntax tree and micro-ops. Again,
just thinking out loud.

~~~
extension
We can extract a _little bit_ of parallelism that way, but nothing even close
to the scale of what a GPU does. An out-of-order core keeps on the order of a
hundred instructions in flight; a GPU keeps tens of thousands of threads in
flight.

~~~
tambourine_man
relevant: <http://news.ycombinator.com/item?id=3816771>

------
pheon
Don't overlook the power of a brand. In this case it's x86, which means brand
Intel, and that is about as strong as it gets in tech.

A lot of decisions are made by people who are non-technical, or who are
technical but lack depth of understanding.

E.g., propose Intel MIC vs. Tilera (a similar thing, but with MIPS cores).

