DCompute: GPGPU with Native D for OpenCL and CUDA (dlang.org)
111 points by ingve 70 days ago | 30 comments

D has always interested me a lot. Sure, it's been around a while, but the community seems rather small, comparatively. That said, that small community produces some really nice stuff - they are up to, what, 3 'official' compilers now? I hope to see its adoption rise now that the reference compiler is open source (iirc) - the community seems second to none as far as signal-to-noise ratio goes.

Nicholas Wilson has done an awesome job with DCompute - especially the ability to use D's lambdas and templates when writing compute kernels. It's going to be fun seeing this evolve.

I'd be excited for D to find its niche there. Good heterogeneous compute support with D's introspective design features has the potential to be a very powerful way to create number crunching applications

Thanks. Halide has recently caught my eye and I'd like to see how far I can replicate how they do things. It'll have to wait a bit though.

Halide has some interesting ideas for image processing - especially regarding algorithm separation and scheduling - so it's great to hear it's on your radar, and I'd be very interested to see what you come up with. My interest/focus is more on stencil codes, and that is certainly an area I hope to test with DCompute. Congrats again (and thanks again) for an awesome project.

I can't wait until compilers can start auto-generating GPU kernels. That will be when GPGPU really takes off for most people whose applications aren't critical enough to spend hours writing these by hand but would benefit from the significant speed-up.

I'm not sure that will ever happen, at least without changing languages.

Autovectorization alone is difficult in C and C++ because the languages provide almost no useful information about aliasing. Precise aliasing info is just the tip of the iceberg regarding what you would need for GPU-based autovectorization.

D has array expressions, which look like:

   a[] = c * b[];
The idea behind them is that they are parallelizable and do not require an auto-vectorizing loop optimizer. Auto-vectorizing is fraught with problems, like the user not being aware of whether the auto-vectorizing succeeded or not.

Another aspect of D that enables parallelization is the use of ranges + algorithms. Although currently unexploited, it has the potential to be able to express parallel algorithms, and then take advantage of that.

This can be vectorised for CPU execution (which is nice!), but it will not solve the problem for GPUs. Moving just that scalar-vector multiplication to a GPU would be detrimental to performance, because the run-time would be dominated by the cost of moving the 'b' array to the GPU, followed by moving the 'a' array back. You get into all kinds of issues, such as where to leave data, that have no obvious right answer.

It also does not help with problems that are not as embarrassingly parallel, but could still benefit from vectorisation, such as summations (or reductions in general), or things that contain their own inner control flow (think a Mandelbrot set, where you have a larger outer parallel loop with an inner serial while-loop).
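To make the Mandelbrot case concrete, here is a minimal plain-Python sketch (not D, and not GPU code; the function name `escape_time` and the sample points are made up for illustration). The outer loop over points is independent per element, but each element runs an inner `while` loop whose trip count depends on the data, and the final `sum` is a reduction:

```python
def escape_time(c: complex, max_iter: int = 50) -> int:
    """Count iterations of z = z*z + c until |z| exceeds 2 or max_iter is hit."""
    z = 0j
    n = 0
    # Inner serial loop: the number of iterations differs per point,
    # which is what makes naive lock-step vectorisation awkward.
    while n < max_iter and abs(z) <= 2.0:
        z = z * z + c
        n += 1
    return n


# Outer loop: every point is independent, so this part is parallel in principle.
points = [0j, 1 + 0j, 2 + 2j]
counts = [escape_time(c) for c in points]

# A reduction over the per-point results.
total = sum(counts)
```

With these three points, one never escapes and the other two escape after a few steps, so neighbouring work items do very different amounts of work.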

That actually depends upon the architecture. If you have shared memory between GPU and CPU the copy latency is reduced by orders of magnitude - you're only waiting on the values to be populated in GPU's caches.

It'll be interesting to see what sort of context-switch performance comes out of the Zen APU devices when they are released (next year I believe).

I'd be somewhat surprised if Intel and NVidia aren't working on similar things.

> If you have shared memory between GPU and CPU the copy latency is reduced by orders of magnitude - you're only waiting on the values to be populated in GPU's caches.

Shared memory by itself will not have a significant performance impact here (although it simplifies the data movement substantially). For a scalar-vector multiplication, you are strongly memory bound, and so you still have to wait for all the memory to be retrieved over the bus. You are correct if both CPU and GPU are behind the same bus (such as if the GPU is on the same die), but modern GPUs support shared virtual memory even for non-integrated GPUs.
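A back-of-envelope sketch of the "memory bound" point, in plain Python. The bandwidth figure is an assumption picked only for illustration (roughly the class of a PCIe 3.0 x16 link), and both function names are hypothetical; the only fixed facts are that `a[] = c * b[]` reads one element and writes one element per multiply:

```python
def arithmetic_intensity(bytes_per_element: int = 4) -> float:
    """FLOPs per byte moved for a[i] = c * b[i]: one multiply, one read, one write."""
    return 1 / (2 * bytes_per_element)


def transfer_bound_seconds(n_elements: int,
                           bytes_per_element: int = 4,
                           bandwidth_bytes_per_s: float = 16e9) -> float:
    """Lower bound on run time if every element must cross a link of the given bandwidth.

    The 16e9 default is an assumed figure, not a measurement.
    """
    bytes_moved = 2 * n_elements * bytes_per_element  # b in, a out
    return bytes_moved / bandwidth_bytes_per_s


intensity = arithmetic_intensity()                    # 0.125 FLOP per byte
seconds = transfer_bound_seconds(1_000_000_000)       # 8e9 bytes / 16e9 B/s = 0.5 s
```

However fast the multiplies are, the operation cannot finish sooner than the slowest link the data has to cross, which is why unifying the address space alone does not change the picture.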

> Another aspect of D that enables parallelization is the use of ranges + algorithms.

Is that similar to the C++ std::algorithm execution policy features introduced in C++17?

No. What Walter is talking about in D is if you have

Foo[] foo = ...;

foreach(f; foo) { f.bar.baz.quux; }

and the computation is parallelisable then you can write

import std.parallelism; foreach(f; parallel(foo)) { f.bar.baz.quux; }

and it will be parallelised across threads.

I guess std.parallelism's parallel might be more analogous to an "instance" of an executor.
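For readers who don't know D, a rough Python analogue of the `parallel(foo)` loop above, using the standard library's `concurrent.futures`. This is only a sketch of the same shape: `process` and `parallel_foreach` are made-up names, and `process` is a stand-in for the loop body `f.bar.baz.quux`.

```python
from concurrent.futures import ThreadPoolExecutor


def process(item: int) -> int:
    """Stand-in for the per-element loop body."""
    return item * item


def parallel_foreach(items, body, workers: int = 4):
    # Executor.map hands calls to `body` out to the worker threads and
    # yields the results in the original input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(body, items))


results = parallel_foreach(range(8), process)
```

Note that CPython threads will not speed up CPU-bound pure-Python work because of the GIL; the point here is only the structure of "same loop body, handed to a pool of workers".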

So that's similar to OpenMP parallel for looping?

I'm not familiar with that aspect of C++17.

It's the tag-based dispatching for the `par` in

   std::sort(std::execution::par, a.begin(), a.end());

to call a parallel implementation of sort on `a`.

(On the Mill CPU the retire stations snoop/snarf writes and spot aliasing as it happens at runtime. This is a big part of our auto-vectorisation)

Cool, I'm looking forward to seeing more news about the Mill.


probably won't get on the front page of HN :|

Interesting, are there benchmarks I can look at?

Is there any discussion about doing this in Rust? (maybe surrounding the SIMD RFCs)

There are several efforts including VexCL [0] expression templates, OpenACC [1] preprocessor, Kokkos [2] template metaprogramming, SYCL [3], and many other similar projects. Parallel programming for non-trivial parallelism is still...non-trivial.

[0] https://github.com/ddemidov/vexcl

[1] https://www.openacc.org/

[2] https://github.com/kokkos/kokkos

[3] https://www.khronos.org/sycl

If OpenACC worked as well as OpenMP (or OpenMP offloads worked as simply) it'd be huge, I would think

It is really difficult to extract parallelism from sequentially written programs. It is even worse when it comes to restricted fine-grained parallelism, as on a GPU. I'm not aware of any really robust auto-parallelising compiler, and even compilers for specialised parallel languages struggle to reach performance close to hand-written code for more complex problems.

One of the main goals for DCompute is to lower the barrier to entry, so that people who have embarrassingly parallel problems can take advantage of their hardware without being an expert.

Apparently Numba can (kind of) do this. It JITs Python/Numpy/Scipy code through the LLVM-based toolchain for CUDA.

[1] http://numba.pydata.org/numba-doc/latest/cuda/kernels.html [2] http://numba.pydata.org/numba-doc/latest/cuda/reduction.html...

You still need to write what's basically CUDA code for it to run.

It's not at all like you write python and it translates it to CUDA or anything.

Using the 'cuda.jit' method as linked does require you to do things like manually setting threads and blocks, though one could argue it makes it easier than doing it in CUDA C.

However, numba's 'vectorize' and 'guvectorize' decorators can also run code on the GPU. The current documentation doesn't show good GPU examples, but here are examples from the documentation for the deprecated numbapro (the CUDA things from numbapro were later added into numba): https://docs.continuum.io/numbapro/CUDAufunc

  @vectorize(['float32(float32, float32, float32)',
              'float64(float64, float64, float64)'],
             target='gpu')
  def cu_discriminant(a, b, c):
    return math.sqrt(b ** 2 - 4 * a * c)
The 'float32/64' type signatures are not strictly necessary, unless you want to define the output type (so if the inputs are 32-bit floats and you don't want it to return 64-bit floats); if given no signature numba will automatically compile a new kernel each time the function is called with a new type signature. So that function would become (but in current numba 'gpu' should be replaced with 'cuda'):

  @vectorize(target='gpu')
  def cu_discriminant(a, b, c):
    return math.sqrt(b ** 2 - 4 * a * c)
Vectorize is a little limited in that it only operates on scalars and broadcasts those scalar operations over arrays.

guvectorize is more powerful and can operate on arrays directly so something like convolution or a moving average are possible, but is slightly more complicated to use than vectorize.
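The difference can be sketched in plain Python, without numba or a GPU. This is only a conceptual illustration: `broadcast` and `moving_average` are hypothetical helpers, and the `zip` here is a simplification of real ufunc broadcasting rules.

```python
import math


def discriminant(a: float, b: float, c: float) -> float:
    # Scalar kernel: sees exactly one element from each input.
    return math.sqrt(b ** 2 - 4 * a * c)


def broadcast(kernel, *arrays):
    # Roughly what a vectorize-style wrapper does: apply the scalar
    # kernel elementwise across equally sized inputs.
    return [kernel(*args) for args in zip(*arrays)]


def moving_average(values, window: int):
    # Array kernel: each output needs several neighbouring inputs, so it
    # cannot be expressed as a function of a single scalar element.
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


roots = broadcast(discriminant, [1.0, 1.0], [3.0, 2.0], [2.0, 1.0])
smoothed = moving_average([1, 2, 3, 4, 5], 3)
```

The first function fits the vectorize model directly; the second is the kind of whole-array operation that needs the guvectorize style.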

Update: fixed code formatting

Not sure if this is what you are looking for, but I had success with this Java library: http://aparapi.com/. It looks at a function's JVM bytecode and runs it on the GPU if it can, otherwise it falls back to the CPU. I suspect the same thing can be done in other languages.


"Aparapi allows developers to write native Java code capable of being executed directly on a graphics card GPU by converting Java byte code to an OpenCL kernel dynamically at runtime. Because it is backed by OpenCL Aparapi is compatible with all OpenCL compatible Graphics Cards."

Interesting. CUDA kernels are plagued by an explosion of entry points and OpenCL C kernels by the lack of meta-programming.
