
Intel's 50-Core Xeon Phi: The New Era of Inexpensive Supercomputing
http://www.drdobbs.com/parallel/intels-50-core-xeon-phi-the-new-era-of-i/240105810?donkey
======
Osmium
I'm really excited about our massively parallel future, not least because I
have to run scientific code that would greatly benefit from it. But at the
moment it's so hard to program for this sort of thing: can someone explain
why, in simple terms, something like OpenCL or CUDA is so damn complicated? Is
there any way to avoid having to have a low-level understanding of how a GPU
or co-processor works, rather than expecting the vendor to implement an easier
to use solution? I'm thinking about, e.g., Matlab's "parfor" (parallel for)
command, which is super easy to use.

The article states that "All of these [CUDA/OpenCL] problems go away with the
Phi. It's a pure x86 programming model that everyone is used to. It's a
question of reusing, rather than rewriting, code" but I find it hard to
believe I can just drop existing code into it and expect decent performance.

~~~
mturmon
Have you used OpenMP (<http://en.wikipedia.org/wiki/OpenMP>)? It has the
flavor of parfor -- you identify the embarrassingly parallel loops in your C
or Fortran, put in something like

    
    
        #pragma omp parallel for
    

in front of them, and your code carries over pretty much intact -- OpenMP
handles the thread wrappers for you. You can add other pragmas for the times
when you need locking.

This is a much less intrusive setup than CUDA; you don't have to worry about
loading data, or double/float conflicts.

The OpenMP extensions could be a very good fit for scientific programming on
this coprocessor.

~~~
Osmium
Thanks; I'll take a look. But OpenMP is CPU-only, right? Apple's got their
(currently less portable, admittedly) Grand Central Dispatch that does
something similar. But as far as I know, if you want portable GPU code your
only option is OpenCL, and even then it requires optimisation depending on
what device you're running it on (or so I've heard).

~~~
foxhill
OpenMP 4.0 is likely to have support for accelerator devices (i.e., move the
necessary data onto the device, run the computation, and move the results back
to the host). in fact, that's one of the ways you can use the Phi right now
(intel have extensions to OpenMP).

or if you can't be bothered to wait for such a standard, you should have a
look at OpenACC[1], which does exactly this, and exists now. you end up adding
code like

    
    
        #pragma acc kernels loop
    

on top of your for loops, and it does the low-level work for you.

[1] <http://www.openacc-standard.org/>

------
JoeAltmaier
Lots of cores means lots of threads - 4 hyperthreads per core. So 200+ threads
could be handy in high-bandwidth low-latency situations. E.g. it could make a
dandy server for delivering low-latency streams like stock quotes. You would
want a new kernel model, where you bound program threads to particular
hyperthreads and blocked in user-space on events - so your hyperthread cache
was always hot.

~~~
Symmetry
Not horribly relevant from a software perspective, but as a hardware geek I
think the way they're doing threading is really interesting. Big OoO
processors like a normal Xeon or a POWER7 usually use simultaneous
multithreading (SMT), which means that you have instructions from two threads
being fed to the execution units every clock cycle, and since they often
aren't in contention for the same resources you get higher throughput. Some
in-order processors like the Niagara often use block multithreading (BMT),
where you run one thread until you get a cache miss, then switch to another
thread with some delay as the pipeline is flushed.

What the Phi is doing is combining those approaches, running two threads
simultaneously and switching threads out on cache misses. This way you only
double rather than quadruple your control structures, but you don't have
your cores sitting entirely unutilized while you're swapping threads. A really
nifty
compromise, I think.

------
api
I wonder if you could use this to run lots and lots of small virtualized
nodes? I know that's not the intended use case but I wonder if it's possible
and would perform well?

~~~
rbanffy
Memory bandwidth would limit the performance unless your VMs were running
something with a very small memory footprint: the card's 8 GB works out to
about 160 megabytes per instance.

Having said that, I've used Unix workstations with less RAM attached than that
through much less than 7GBps worth of bus...

~~~
wmf
Phi's memory _bandwidth_ is very high but its memory _capacity_ is very low (a
normal Xeon can drive ~192 GB cheaply).

------
cefstat
I read about the Xeon Phi a few months ago and I really want to get my hands on
one. My problems are in the embarrassingly parallelizable class (or almost).
Having said that, does anybody know how each Xeon Phi core performs with
respect to a modern Intel processor (i7 or Xeon) for standard numerical code
(Linpack etc.)?

~~~
rys
They're Pentium-class x86 cores and barely any more than front end control
processors for the vector hardware. The fact it's x86 is almost incidental,
IMHO; the vector ISA is all programmers should really care about on the Phi.

~~~
berkut
I guess that means they've got primitive (Pentium Pro equivalent) branch
predictors and memory pre-fetchers then?

Are they even out-of-order? I.e. is it Pentium or Pentium Pro class?

~~~
stonemetal
[http://www.anandtech.com/show/6451/the-xeon-phi-at-work-at-tacc](http://www.anandtech.com/show/6451/the-xeon-phi-at-work-at-tacc)

 _Each core is a simple in order x86 CPU (derived from the original Pentium)
with a 512-bit SIMD unit._

~~~
berkut
So the branch predictors will be crap, but thanks to the hyperthreading, it
probably won't be noticeable on most workloads...

~~~
apendleton
Maybe I'm missing something, but do in-order architectures even have much use
for branch prediction? They can't speculatively execute based on the outcome
of a conditional, right?

~~~
wtallis
Sure they can. Branch prediction allows you to move an instruction along the
pipeline before the instruction determining its outcome has been retired.
Without branch prediction, every conditional jump will potentially stall the
pipeline. With branch prediction, a correctly predicted branch executes
quickly, and a mis-predicted branch results in a pipeline flush.

Instruction re-ordering is more about taking full advantage of multiple
execution units (ALUs, etc.), or not completely stalling the pipeline to wait
on a memory fetch.

------
joss82
I'm sure a lot of us here would love to have a cheap supercomputer to perform
some heavily parallelizable workloads on our servers. Is this going to
dramatically lower the cost of virtual private instances? I really can't wait
to see some benchmarks.

~~~
foxhill
cheap it is certainly not. additionally, all the tests i've seen indicate that
kepler/tahiti have got little to worry about.

~~~
Scene_Cast2
It's notoriously hard to extract high performance from GPGPUs. If you're an
enterprise customer with no readily available GPGPU code, the Xeon Phi makes
much more sense than GPUs for a few reasons.

First, the talent pool for HPC x86 programmers is an order of magnitude larger
than for expert GPGPU programmers - Xeon Phi is just a virtual x86 server rack
with TCP/IP messaging.

Second, it takes a lot of time and effort to extract useful performance from
GPGPUs; if it's for internal use and you're not selling the code to the
masses, you're likely to get the same amount of performance with less effort
on the Phi, unless you're going for "the best, regardless of money & time".

Last, most enterprise customers will want ECC and other compute features.
Those are only sold in the pro-level $3k+ Teslas, which happen to be more
expensive than the Phi.

Where GPGPU does make sense: consumer-level hardware using already-written
software (workstations and hobbyists in particular) and businesses where
performance/watt is crucial at any cost.

~~~
mich41
The Phi's architecture is closer to a GPU's than to a rack of x86 servers.

With 60 cores reading memory over a common ring bus, latency will kill you
unless you tile your loops to maximize cache reuse [1], at which point you
might as well write GPU code which preloads blocks of data into local memory
and works on them there.

Also, to beat the performance of a normal x86 CPU you must use vector
instructions, which gives you all the little problems GPU warps are known to
cause.

[1] [http://software.intel.com/en-us/articles/cache-blocking-techniques](http://software.intel.com/en-us/articles/cache-blocking-techniques)

------
stephengillie
Will ARM servers have PCIe slots?

~~~
Symmetry
A quick Google reveals there are already ARM servers with PCIe slots.
<http://www.globalscaletechnologies.com/t-openrdudetails.aspx>

------
praveenster
why does the url include donkey as the query string?

~~~
hellrich
Probably submitted before, so a change to the URL was necessary to circumvent
HN's duplicate filter.

~~~
sp332
Yup, just yesterday <http://news.ycombinator.com/item?id=4784834>

------
shasta
8 GB of RAM total?

~~~
fdej
Seems pretty good if you compare it to a GPU.

~~~
tsahyt
The impressive thing is the memory bandwidth though. That's the one thing I've
always loved GPUs for.

------
drudru11
OR... just spend ~$1000 and get _3070_ cores of what you really need (FLOPS)

How?

The latest and greatest Nvidia card at your favorite retailer.

~~~
foxhill
consumer kepler boards have no double precision hardware; in that case a Phi
would destroy them.

------
foxhill
don't know where the 50 core figure comes from, as a Phi has 60 cores (61 in
the "better" model).

~~~
wmf
Until a few days ago Intel was saying "over 50" cores; some people forgot to
flush their cache.

------
erichocean
That article is riddled with errors.

~~~
stcredzero
Examples?

------
rorrr
> _Suggested retail pricing for the initial model is $2649, with subsequent
> models expected to cost less than $2000_

That's $44.15 per 1 GHz core.

AMD FX-6300 Six-Core 3.5GHz is $138 = $23 per core (and much faster cores).

Intel Xeon 5148 2.33ghz is $18.

~~~
jlgreco
How do differences in supporting-hardware/density change the effective
pricing? It seems to me like the Phi cores could end up cheaper once you
factor in everything else that you need to get that many cores of something
else.

~~~
rorrr
You need just three of AMD's CPUs to match the 60 1GHz cores.

You can get a quad CPU motherboard relatively cheaply:

[http://www.ebay.com/itm/Arima-Quad-CPU-16-Core-AMD-Opteron-Motherboard-w-PSU-/110585626008](http://www.ebay.com/itm/Arima-Quad-CPU-16-Core-AMD-Opteron-Motherboard-w-PSU-/110585626008)

Also don't forget to add the same costs for the Phi solution.

~~~
astrodust
The 50-core Xeon sounded great up until the $2600 price tag. You can buy a lot
of CPU cores for that much money if you have quad-socket boards like that.

