Parallel Programming and Optimization with Intel Xeon Phi (2013) [pdf]

acqq · on Nov 22, 2014

Interestingly, Xeon Phi processors draw between 200 and 300 W alone, cost between under 2000 and up to 5000 USD.

http://en.wikipedia.org/wiki/Xeon_Phi

The cheapest one has 57 cores, 28.5 MB L2 cache, up to 6 GB RAM but 240 GB/s RAM bandwidth over 12 channels:

http://ark.intel.com/products/75797/Intel-Xeon-Phi-Coprocess...

Anybody knows some "prosumer" product (and its price when equipped with 6 GB) that uses Xeon Phi?

sspiff · on Nov 22, 2014

I recently ordered a Xeon Phi 31S1P. They're under $200 at the moment[1]. The biggest problem I have is that you can't just plug them into any computer like you would a graphics card. You need a compatible motherboard, and those tend to be in the high price range and 2011 sockets (which means expensive CPU). Most of the time, you won't find out if a board is compatible until you plug a card in and try it out.

I'm curious how well the card will work, I'd love to see something like Erlang on such a thing.

[1] https://software.intel.com/en-us/articles/special-promotion-...

nkurz · on Nov 22, 2014

I've stared at that promotion (and even submitted it here, I think) but I haven't really understood it. Who did you end up buying from? Are there limitations on who can participate? How are you going to deal with the passive cooling aspect?

sspiff · on Nov 25, 2014

Colfax International. No restrictions, although they can't ship to Europe unless you have a DHL/FedEx/UPS/... freight collect account.

They've been very helpful, agreed to ship the cards to a friend in the US, and were patient as my bank blocked my credit card twice because of the "suspicious" transaction. I ended up wiring the money instead.

I ended up paying 492 euros for 3 cards (2 of my friends also ordered one).

I'm just an individual enthousiast, and my friends are researchers at the local university. You don't need a company, you don't need to order a lot of them, you can just order a single card as a private individual.

iamsalman · on Nov 22, 2014

The whole premise behind introducing Phi to compete with discrete GPUs from NVIDIA/AMD was to have a plug-in accelerator which supports x86 which meant no code porting needed, hence enabling companies with millions of man hours invested in their code to simply take benefit of the accelerator. However, this is not the case -- The price/performance ratio for code which is not optimized to make use of massively parallel processors would be mediocre at best.

Besides, Xeon Phi's are reincarnation of project Larrabe which never took off.

If you have to end up optimizing your code for accelerators in any case, x86 or not -- you are better off optimizing it for GPUs instead.

acqq · on Nov 22, 2014

You are right, I was interested exactly for the possibility that even the code that isn't particularly customized for the coprocessor would benefit from using all the cores. I always believed the GPU code doesn't do well with a lot of branches, and I hoped that Phi would better run such code than GPUs. Now searching, here's Nvidia's take:

http://www.nvidia.com/object/justthefacts.html

rasz_pl · on Nov 22, 2014

From what I read a long time ago Phi has the same branch limitation - it can do branches by running same code twice, just like Cuda.

not to mention "x86 compatibility as a bouns" is a tired old BS line Intel uses on clueless decision makers at the golf club, it never amounts to anything, its not like you are going to just drop your binaries in there.

sspiff · on Nov 22, 2014

Not true. The cores run independant threads and processes on an embedded Linux system that's running on the card, meaning they're much easier to program, and they allow porting of existing software without completely going back to the drawing board.

daniel-cussen · on Nov 22, 2014

> From what I read a long time ago Phi has the same branch limitation - it can do branches by running same code twice, just like Cuda.

I'm fairly sure that's not the case now. Certainly the capability is there for it to do independent branches--just look at the GA144, which while limited in other ways, can have its 144 computers branching all over the place simultaneously. No, I'm pretty sure that's the whole point of this type of architecture: to allow more branching.

If it didn't, I'd be a little bit screwed, because I was counting on it for a compute-bound algorithm that really needs that branching.

sspiff · on Nov 22, 2014

It's not the case. These are 57 independant cores, much like you'd see in a quad core CPU, except that they're Pentium-vintage feature wise (with the addition of some modern vector instructions and SMT).

As far as I can tell, they're not binary compatible with existing software and software need recompilation, using Intels compilers.

PaulKeeble · on Nov 22, 2014

Many times I have had programming problems that would have usefully been able to use a lot of cores especially with reasonable IO wait on each of the threads of execution. But in order to really have just been able to use this it needs:

1) The cores need to run modern x86 so its a simple matter of just running the binary. 2) The offloading and concept of heterogeneous cores needs to be added into the operating system. All the cores should be exposed to programs and then putting threads onto accelerators ought to be something a hint can suggest or maybe a special type of parallel thread created. In essence the OS needs to be exposing them like any other core with shared memory and everything else this entails to make it native.

The current model that GPUs use is very good for doing matrices of data but it really doesn't lend itself to agents or other types of concurrency. Its a bit of a stretch to be writing in a restricted form of C which is pretty low level with openCL or DirectCompute combined with all the API and data passing overhead its a very specific type of program that benefits and it requires rewriting your code completely. The future can't possibly be this in the general case and its not really being adopted all that widely, some people are using it of course but most aren't.

In my opinion lots of low power cores that run the same instruction set as the primary CPU gives us a useful middle ground that is easier to use and optimise for and can be used alongside the primary fast cores of the main CPU. That is the future I am hoping for.

higherpurpose · on Nov 22, 2014

AMD/ARM's HSA doesn't need OpenCL (although it will support OpenCL 2.0 which is much more optimized for heterogeneous computing). I guess you're talking about the current state of GPU computing. The next-generation, that will be based on HSA, should be much better. You can even use Java or other languages to write for it.

ternaryoperator · on Nov 23, 2014

For what I thought was a good, short, clear intro to programming the Phi,see [1]

[1] http://www.drdobbs.com/240144160

jcr · on Nov 22, 2014

The linked pdf is of chapters 3 and 4, but chapters 1 and 2 are also available:

http://inside.mines.edu/~tkaiser/csci580fall13/one-day-1.pdf

All the other files from "Advanced High Performance Computing CSCI 580" class also look interesting:

http://inside.mines.edu/~tkaiser/csci580fall13/

rbanffy · on Nov 22, 2014

The Xeon Phi is a fascinating exercise in futurology. It may face strong competition from GPUs in HPC environments, but future end-user non-specialised processors will probably have more cores than current designs and any effort optimising code to run on more cores than currently available is an investment that'll bear fruit in the future.

seanmcdirmid · on Nov 22, 2014

The Phi is SIMD with a thin memory hierarchy. So it's not really that different from a GPU...it definitely isn't general purpose.

rbanffy · on Nov 23, 2014

It is different in that it looks general purpose enough (it looks like a lot of Atom-like cores with wider SIMD units hooked up to a pool of shared memory) and in that it can run off-the-shelf software (albeit poorly).

Learning to make it run effectively may give you some insight on how to persuade your personal computer of 2024 to use all its cores and make your browsing experience better.