I'm working on a low-level library that, in theory, will let you write code once and compile it for either an OpenCL or a CUDA backend [0]. It is still pre-alpha and completely unusable, but maybe you want to have a look or keep an eye on it.

I am trying to find out a) whether I can put together a portable interface (both for writing kernels and for coordinating cards, contexts, memory, queues/streams, etc.) and b) whether the performance is portable as well. I can already see that performance for trivial kernels carries over from AMD to NVIDIA, but as soon as I move to the Intel Xeon Phi, things are suddenly very different.
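
To give a rough idea of what "write once" can mean on the kernel side, here is a minimal sketch of one common technique: a thin macro layer that maps a single kernel source onto OpenCL C or CUDA. This is not the API of the library linked in [0]; every name in it (KERNEL, GLOBAL_PTR, GLOBAL_ID, scale, launch_scale_on_host) is hypothetical and made up for illustration. The plain C++ branch is only an assumed fallback so the kernel body can be run without a GPU.

```cpp
// Hypothetical macro layer; none of these names come from the library in [0].
#if defined(__OPENCL_VERSION__)          // compiled as OpenCL C
  #define KERNEL      __kernel
  #define GLOBAL_PTR  __global
  #define GLOBAL_ID() ((int)get_global_id(0))
#elif defined(__CUDACC__)                // compiled by nvcc
  #define KERNEL      extern "C" __global__
  #define GLOBAL_PTR
  #define GLOBAL_ID() ((int)(blockIdx.x * blockDim.x + threadIdx.x))
#else                                    // plain C++ host fallback, for testing only
  static int host_global_id = 0;         // index of the work item being emulated
  #define KERNEL
  #define GLOBAL_PTR
  #define GLOBAL_ID() (host_global_id)
#endif

// Kernel source written once against the macros above:
// multiplies each of the n elements of data by factor.
KERNEL void scale(GLOBAL_PTR float* data, float factor, int n)
{
    int i = GLOBAL_ID();
    if (i < n) {                         // guard against surplus work items
        data[i] *= factor;
    }
}

#if !defined(__OPENCL_VERSION__) && !defined(__CUDACC__)
// Host fallback "launch": run the kernel body once per emulated work item.
static void launch_scale_on_host(float* data, float factor, int n)
{
    for (host_global_id = 0; host_global_id < n; ++host_global_id) {
        scale(data, factor, n);
    }
}
#endif
```

A sketch like this only covers the kernel language. The host side (devices, contexts, memory, queues/streams) needs its own abstraction, and nothing here says anything about whether the resulting performance is portable.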

[0] https://github.com/sschaetz/aura

