
I guess "language support" was referring to the restrictions within the kernel language. And CUDA is better there - they support templates, you can typecheck a CUDA kernel call. With OpenCL you have to jump through hoops to get that. While some projects have managed to do it, it could be better still.

What bugs me about OpenCL is the intentional vagueness of the specification: it gives every implementer the freedom to do whatever they want, with the result that performance portability is often difficult to achieve.



templates are not something you'd want to use in your kernels (outside of simple uses). regardless, nvidia should be pushing their improvements into OpenCL by exposing extensions; they might get adopted into the core profile.

> What bugs me about OpenCL is the intentional vagueness of the specification that gives every implementer the freedom to do whatever they want with the result that performance portability is often difficult to achieve.

well, that flexibility is required for OpenCL to be meaningful. that's where the variation in the hardware platforms exists. it's what differentiates compute devices. if that vagueness weren't there, we couldn't have things like OpenCL on FPGAs (altera, xilinx).

as for your statement on performance portability, perhaps that is an issue (but it depends entirely on the type of problem you're trying to compute). but something i don't understand is this:

you could have picked a proprietary API to do your compute, but say you chose CL. you optimize for your hardware, and then what do you know - it's not really that fast on other hardware. but you're overlooking the biggest boon here: your code ran on the other hardware in the first place. getting performant code is now only a matter of optimizing for that piece of hardware.

you could argue that's entirely too complicated, but it's what we've already been doing with our regular C/C++ programs (SSE/AVX/SMP...).


Templates are an essential tool to write type-independent algorithms. They enable meta-programming, an invaluable tool to provide flexible yet efficient active libraries to users. They allow automated kernel-space exploration. So templates are exactly what you want.
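
A trivial sketch of what I mean (plain host-side C++, just to illustrate; in CUDA C++ the same template can be marked __global__ and instantiated per element type, which OpenCL C cannot express):

    #include <cstddef>

    // One definition serves float, double, and any other numeric type, and the
    // compiler still sees the concrete type, so nothing is paid at runtime.
    template <typename T>
    void axpy(T a, const T* x, T* y, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i)
            y[i] += a * x[i];
    }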

I understand the need for a standard that supports various architectures, even architectures that might not exist yet. I guess I just dislike the way they did it. Compared to other standards (that also leave various things to the implementer), I think they did a poor job. They should have defined the semantics and the types better. The entire buffer mapping for example is a huge mess. Nvidia went ahead and fitted pinned memory in there somewhere. Others didn't, with the result that the meaning of the code changes completely depending on which library you link against.

I'm not arguing against OpenCL here; I'm saying they could do even better. It should not be too much effort, either. And if companies like Apple and Google had chimed in, we would have a pretty awesome OpenCL standard and implementations today.

As for your argument about hand-optimization: C++ library implementers [0,1] (and probably compiler vendors too) have found abstractions, tricks and tools that give performance portability today. They are of course domain-specific, but it is possible.

[0] https://github.com/MetaScale/nt2
[1] http://eigen.tuxfamily.org/
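
To give an idea of what [1] looks like in practice (a rough sketch, sizes arbitrary): you write the expression once, and the library picks the SSE/AVX/NEON code paths and blocking for the target at compile time.

    #include <Eigen/Dense>

    // Written once; the vectorization and blocking behind operator* are chosen
    // per target architecture by the library, not by this code.
    Eigen::MatrixXf A = Eigen::MatrixXf::Random(512, 512);
    Eigen::MatrixXf B = Eigen::MatrixXf::Random(512, 512);
    Eigen::MatrixXf C = A * B + B.transpose();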


> Templates are an essential tool to write type-independent algorithms. They enable meta-programming, an invaluable tool to provide flexible yet efficient active libraries to users. They allow automated kernel-space exploration. So templates are exactly what you want.

but OpenCL C only has primitive types. templates become more useful when you have classes, but bringing classes to the GPU is... well, less than optimal.

> Compared to other standards (that also leave various things to the implementer), I think they did a poor job

i don't know what your complaints are exactly, but i don't share your opinions - i think OpenCL is almost as flexible as it needs to be.

> The entire buffer mapping for example is a huge mess

i disagree. clCreateBuffer creates a buffer, and clEnqueue(Read|Write)Buffer reads from or writes to it. you can do more advanced transfers with the *Rect variants, but by that point you probably know what you're doing.
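
roughly this (a sketch - the context, queue and host arrays come from elsewhere; blocking calls and no error checking, for brevity):

    #include <CL/cl.h>

    // plain read/write path: copy in, run kernels, copy out
    void roundtrip(cl_context ctx, cl_command_queue queue,
                   const float* host_in, float* host_out, size_t n) {
        cl_int err = CL_SUCCESS;
        cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, n * sizeof(float), nullptr, &err);
        clEnqueueWriteBuffer(queue, buf, CL_TRUE, 0, n * sizeof(float), host_in, 0, nullptr, nullptr);
        // ... enqueue kernels that read/write buf here ...
        clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, n * sizeof(float), host_out, 0, nullptr, nullptr);
        clReleaseMemObject(buf);
    }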

you want pinned memory? call clCreateBuffer with CL_MEM_ALLOC_HOST_PTR, and instead of clEnqueue(Read|Write)Buffer use clEnqueueMapBuffer/clEnqueueUnmapMemObject. whether or not you actually get pinned memory is up to the runtime (and nvidia's runtime does not guarantee it - it's a guarantee that's impossible to make).
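
sketch of that path (same assumptions as above - context and queue come from elsewhere, no error checks):

    #include <CL/cl.h>

    // ALLOC_HOST_PTR asks the runtime for host-accessible (possibly pinned)
    // memory; map gives you a host pointer, unmap hands the data back.
    void fill_via_map(cl_context ctx, cl_command_queue queue, size_t n) {
        cl_int err = CL_SUCCESS;
        cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                                    n * sizeof(float), nullptr, &err);
        float* p = (float*)clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE,
                                              0, n * sizeof(float), 0, nullptr, nullptr, &err);
        for (size_t i = 0; i < n; ++i) p[i] = 0.0f;   // write directly through the mapping
        clEnqueueUnmapMemObject(queue, buf, p, 0, nullptr, nullptr);
        clReleaseMemObject(buf);
    }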

> Others didn't, with the result that the meaning of the code changes completely depending on which library you link against

as mentioned, use map/unmap. it works on all the runtimes, and at least isn't any slower than read/write. as for which library you link to, that's also a moot point - we have ICDs now: you link to a shim layer that dynamically loads the appropriate runtime during context creation (and you can have several OpenCL platforms on one machine).
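
to illustrate the ICD point (sketch; just grabs the first platform and GPU it finds, no error handling):

    #include <CL/cl.h>

    // the app links only the thin ICD loader; each installed vendor runtime
    // shows up as one platform, and the choice is made at context creation.
    cl_context make_context() {
        cl_platform_id platform;
        cl_device_id dev;
        cl_int err = CL_SUCCESS;
        clGetPlatformIDs(1, &platform, nullptr);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &dev, nullptr);
        cl_context_properties props[] = {
            CL_CONTEXT_PLATFORM, (cl_context_properties)platform, 0 };
        return clCreateContext(props, 1, &dev, nullptr, nullptr, &err);
    }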

> As for your argument about hand-optimization: C++ library implementers [0,1] (and probably compiler vendors too) have found abstractions, tricks and tools that give performance portability today. They are of course domain-specific, but it is possible.

i haven't looked into either of your links in detail, but the various BLAS/LAPACK libraries that already exist, which are far more mature (and more widely used), would almost certainly be a better choice. lots of them already work on GPUs and are optimized to death by beings who think in assembly. most of them are written in fortran as well (although they have front ends for several languages).
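
for reference, the kind of interface those libraries give you (sketch; M/N/K and the row-major arrays A, B, C are assumed to exist in the caller):

    #include <cblas.h>

    // C = 1.0*A*B + 0.0*C - the same call works against OpenBLAS, MKL, ATLAS, ...
    // and the GPU BLAS libraries expose near-identical entry points.
    void sgemm_rowmajor(int M, int N, int K, const float* A, const float* B, float* C) {
        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    M, N, K, 1.0f, A, K, B, N, 0.0f, C, N);
    }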



