
A High-Level Lua API for GPU Parallelism - devy
http://willcrichton.github.io/terracuda/
======
dragandj
While the project is interesting, the main benchmark is very misleading. They
claim that Terracuda matrix multiplication is faster than one written in
CUDA. This can only be true if it is compared to a NAIVE matrix multiplication
in CUDA, which is confirmed by the reported time in seconds: it is orders of
magnitude slower than the "real" CUDA matrix multiplication.
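
For reference, a naive CUDA matrix multiplication is a one-thread-per-output-element kernel along these lines (a generic textbook sketch, not necessarily the kernel from the repo):

```cuda
// Naive matrix multiply: one thread per output element, every
// operand load goes straight to global memory.
__global__ void matmul_naive(const float* A, const float* B,
                             float* C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < N; k++)
            sum += A[row * N + k] * B[k * N + col];
        C[row * N + col] = sum;
    }
}
```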

Why is that benchmark completely meaningless? Because CUDA matrix
multiplication is so optimized that no one with a bit of sense would roll
their own, and even if they did, the implementation would have to use very
specialized and ugly code - can this be done in Terracuda, and is it simpler
than in C?

Additionally, the main issue in GPU kernels is not the ugliness of C itself,
but the need to optimize algorithms for the GPU hardware, which makes the
code much more verbose and ugly than naive C code. That's why Terracuda and
the like look fantastic when you compare them with naive C implementations,
but they are trying to solve the wrong problem IMO.
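
To make the verbosity point concrete: even the first standard optimization, staging tiles of the inputs in shared memory, roughly triples the kernel above (again a textbook sketch, and still nowhere near cuBLAS performance):

```cuda
#define TILE 16

// Tiled matrix multiply: each block cooperatively stages TILE x TILE
// sub-matrices of A and B in shared memory to cut global memory traffic.
__global__ void matmul_tiled(const float* A, const float* B,
                             float* C, int N) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;

    for (int t = 0; t < (N + TILE - 1) / TILE; t++) {
        // Bounds-checked cooperative loads, zero-padded at the edges.
        As[threadIdx.y][threadIdx.x] =
            (row < N && t * TILE + threadIdx.x < N)
                ? A[row * N + t * TILE + threadIdx.x] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] =
            (t * TILE + threadIdx.y < N && col < N)
                ? B[(t * TILE + threadIdx.y) * N + col] : 0.0f;
        __syncthreads();

        for (int k = 0; k < TILE; k++)
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    if (row < N && col < N)
        C[row * N + col] = sum;
}
```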

~~~
wcrichton
Project author here. Your first claim is true, and you can verify it here:
[https://github.com/willcrichton/terracuda/blob/master/matrix...](https://github.com/willcrichton/terracuda/blob/master/matrix/matrix.cu)

However, I would dispute that it makes the benchmark meaningless. I agree that
no one will write a hand-optimized matrix multiply routine in Terra, but I
think that misunderstands the use case of the language. Terracuda is intended
to make everyday GPGPU computations super easy, so you can speed up simple
scripting code that would otherwise be orders of magnitude slower. Look at the
other examples in the repo: if you wanted to write a simple hash function,
Mandelbrot generator, or renderer, Terracuda makes it simple to accelerate Lua
code.
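
To give a sense of what is being hidden: in plain CUDA, even a trivially parallel Mandelbrot carries the full kernel-plus-host ceremony below (a generic sketch; the image size and output handling are illustrative, not taken from the repo):

```cuda
#include <cstdio>
#include <cstdlib>

// Escape-time Mandelbrot: one thread per pixel.
__global__ void mandelbrot(int* out, int w, int h, int max_iter) {
    int px = blockIdx.x * blockDim.x + threadIdx.x;
    int py = blockIdx.y * blockDim.y + threadIdx.y;
    if (px >= w || py >= h) return;

    float x0 = -2.5f + 3.5f * px / w;  // map pixel to re in [-2.5, 1]
    float y0 = -1.0f + 2.0f * py / h;  // map pixel to im in [-1, 1]
    float x = 0.0f, y = 0.0f;
    int i = 0;
    while (x * x + y * y < 4.0f && i < max_iter) {
        float xt = x * x - y * y + x0;
        y = 2.0f * x * y + y0;
        x = xt;
        i++;
    }
    out[py * w + px] = i;
}

int main() {
    const int w = 1024, h = 768, max_iter = 256;
    int* d_out;
    cudaMalloc(&d_out, w * h * sizeof(int));

    dim3 block(16, 16);
    dim3 grid((w + 15) / 16, (h + 15) / 16);
    mandelbrot<<<grid, block>>>(d_out, w, h, max_iter);

    int* h_out = (int*)malloc(w * h * sizeof(int));
    cudaMemcpy(h_out, d_out, w * h * sizeof(int), cudaMemcpyDeviceToHost);
    printf("center pixel iterations: %d\n", h_out[(h / 2) * w + w / 2]);

    cudaFree(d_out);
    free(h_out);
}
```

The aim, as described above, is for Terracuda to fold the allocation, copies, and launch configuration into the library so only the per-element logic is written by hand.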

There will always be room for super optimized kernels for common functions
like matrix multiplication, but I believe there's a strong need for a language
that handles everything else.

~~~
dragandj
My point is that while it can sometimes accelerate Lua code, it will also (I'd
say) _de_celerate Lua code when the algorithm is not trivially parallelizable
(and that is the majority of cases). So, to get the GPU boost, or even to
understand how to implement the algorithm properly, I'd have to drop down to
the GPU specifics. The main question is then: does Terracuda help with that,
or will it be an obstacle? (It is not a rhetorical question; while I suspect
it will be an obstacle, I am interested to hear what your experience with
such cases is.)

~~~
wcrichto
I strongly doubt writing Terracuda GPU code would ever decelerate your
original application, although it might not be as fast as a hand-written CUDA
program (and that's an important distinction!). Terracuda doesn't actually
lack that many GPU specifics--really, it just doesn't provide access to the
shared memory constructs that are important for hand-tuned matrix
multiplication kernels and blocked algorithms. And those are available in the
NVPTX API, so it's not a true obstacle, just something I haven't gotten
around to writing yet.

~~~
jakub_h
> although it might not be as fast as a hand-written CUDA program

If you're dealing with larger pieces of code at once (that is, if you have a
bigger view of what the programmer wants to run, which you should have), one
interesting thing you could do is "exploratory compiling". Try different
compilation step sequences and transformations, if you have enough invariants
in the code. Machine time is cheap and this could yield some interesting
results. Of course, the more constrained the language model is, the more
opportunities there are for this.

(There have been some interesting results in this area, like superoptimizers for
local optimization, but larger scale compilation would probably use something
like genetic algorithms, as you're unlikely to have the opportunity for
exhaustive searches.)
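
A small-scale version of this already exists in practice as kernel autotuning: empirically timing a kernel over candidate launch configurations and keeping the fastest. A minimal sketch, assuming a placeholder saxpy kernel and hand-picked candidate block sizes:

```cuda
#include <cstdio>

// Placeholder kernel to tune over: y = a*x + y.
__global__ void saxpy(float a, const float* x, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// Time one launch configuration with CUDA events.
static float time_config(int block, const float* x, float* y, int n) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    saxpy<<<(n + block - 1) / block, block>>>(2.0f, x, y, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main() {
    const int n = 1 << 24;
    float *x, *y;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));
    saxpy<<<(n + 255) / 256, 256>>>(2.0f, x, y, n);  // warm-up launch

    // Exploratory step: try each candidate configuration, keep the fastest.
    // (A real tuner would average several runs per configuration.)
    int candidates[] = {64, 128, 256, 512, 1024};
    int best = 0;
    float best_ms = 1e30f;
    for (int b : candidates) {
        float ms = time_config(b, x, y, n);
        if (ms < best_ms) { best_ms = ms; best = b; }
    }
    printf("best block size: %d (%.3f ms)\n", best, best_ms);
    cudaFree(x);
    cudaFree(y);
}
```

A real tuner would search a much richer space (tilings, unrolling, fusion), but the exploratory structure is the same.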

------
6d65
This is great. I've been playing around with Terra for a few weeks now in my
spare time, and it's a great language.

As a side note, I'm not sure this is usable in software other than scientific
computing kernels, as relying solely on CUDA rules out AMD and Intel GPUs, if
I'm not mistaken.

Though I guess one could use it as an inspiration for implementing something
similar for OpenCL or Vulkan.

------
pavlov
Awesome project. Terra seems so powerful; it's great to see more applications
for it.

Sidenote -- I hate to be the guy complaining about web design, but the Raleway
font is very hard to read (Retina MacBook). Here's a screenshot:
[http://i.imgur.com/9DkVEaQ.png](http://i.imgur.com/9DkVEaQ.png)

The screenshot is at 2x resolution, so the readability problem is not
immediately obvious, but look at the vertical lines of the "m" letters -- the
line width is all over the place even within a single letter. Raleway's
miserable kerning doesn't help things either.

~~~
wcrichton
Totally valid complaint. I've updated it to something that's hopefully more
readable.

~~~
pavlov
Thank you! Much better now.

------
w0utert
I love stuff like this, but I'm wondering whether it's actually useful for
solving any real-world GPGPU problems.

My understanding is that the hardest part of moving stuff to the GPU is
moving data to and from the GPU without data dependencies killing the
computational performance gains, and transforming your problem into something
that doesn't need a lot of control flow or rich/complex data structures. Does
Terracuda address these things?

~~~
wcrichto
Good question. For your first point, Terracuda doesn't do anything
particularly intelligent with movement between host/device memories--it's
mostly there just to hide that away from you. However, empirically, if your
baseline is Lua or Python, almost anything you write in CUDA is going to be
more efficient so long as you have enough data to parallelize over.
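
For context, the movement being hidden is the standard allocate/copy/launch/copy-back round trip; a minimal sketch with a placeholder `scale` kernel:

```cuda
#include <cstdlib>

// Placeholder kernel: multiply every element in place.
__global__ void scale(float* data, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= s;
}

int main() {
    const int n = 1 << 20;
    float* h_data = (float*)malloc(n * sizeof(float));
    for (int i = 0; i < n; i++) h_data[i] = (float)i;

    // The round trip Terracuda hides: allocate, copy in, launch, copy out.
    float* d_data;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);
    scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);
    cudaMemcpy(h_data, d_data, n * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_data);
    free(h_data);
}
```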

As for the second point, that is still mostly left to the programmer. GPU code
doesn't necessarily preclude complex data structures, however. For one of our
benchmarks, we used a software renderer that generates a quadtree on the fly
and accesses that during the shading phase
([https://github.com/willcrichton/terracuda/blob/master/render...](https://github.com/willcrichton/terracuda/blob/master/renderer/cuda_renderer.t)).

~~~
w0utert
On typical consumer hardware you really want to batch up your processing in a
way that gives you the best computation/data ratio, to prevent performance
from being killed by data latencies; I don't think any kind of solution
exists that can do that for you automatically/transparently. For the usual
suspects like signal processing and such, these problems are relatively easy
to overcome because you typically have large amounts of data that can be
processed at once or in blocks.

What would make your project even more interesting is to see how it would
perform on a heterogeneous system architecture (HSA), i.e. what AMD is
working towards. In an HSA, the CPU and GPU share the same memory and address
space, and a program can basically move processing between the two at close
to no cost (no memory ever needs to be copied or mapped in/out of the GPU
address space).
For problems that allow some kind of pipelining to absorb start-up latencies
caused by data dependencies, a combination of Lua + Terracuda would be pretty
awesome, allowing you to move even small/short-running computational tasks to
the GPU.
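
For what it's worth, CUDA's unified memory already gives a taste of that model on NVIDIA hardware: one allocation addressable from both CPU and GPU, with the runtime migrating pages on demand instead of requiring explicit copies. A minimal sketch:

```cuda
#include <cstdio>

__global__ void increment(int* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

int main() {
    const int n = 1 << 20;
    int* data;
    // One allocation visible to both host and device; the runtime
    // migrates pages instead of requiring explicit cudaMemcpy calls.
    cudaMallocManaged(&data, n * sizeof(int));
    for (int i = 0; i < n; i++) data[i] = i;      // touched on the CPU

    increment<<<(n + 255) / 256, 256>>>(data, n); // touched on the GPU
    cudaDeviceSynchronize();

    printf("data[0] = %d\n", data[0]);            // back on the CPU: 1
    cudaFree(data);
}
```

On discrete GPUs this still migrates pages over the bus rather than sharing physical memory, so it only approximates HSA, but the programming model is the one you describe.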

------
nacs
Newbie question: Would this Terracuda system be the easiest way to get started
with GPU/CUDA programming for someone who hasn't done it before, or are there
easier approaches out there for running code in parallel on the GPU?

(I have worked with Lua for personal gamedev projects before)

~~~
wcrichto
Ideally this would be your easiest starting point for GPU programming, but I
haven't touched this particular course project in a year and a half, so it
would need some polishing before it's ready for prime time. Honestly, I would
just dive into the CUDA docs and mess around with their toolchain. Some
resources:

Lecture on "GPU Architecture and CUDA Programming":
[http://15418.courses.cs.cmu.edu/spring2016/lecture/gpuarch](http://15418.courses.cs.cmu.edu/spring2016/lecture/gpuarch)

CUDA C Programming Guide:
[http://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf](http://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf)
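
If you do dive in, the canonical first program is a vector add; a minimal sketch using managed memory so there are no explicit copies to worry about (compile with `nvcc vadd.cu`):

```cuda
#include <cstdio>

__global__ void vadd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1024;
    float *a, *b, *c;
    // Managed memory keeps a first program free of explicit copies.
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; i++) { a[i] = i; b[i] = 2.0f * i; }

    vadd<<<(n + 255) / 256, 256>>>(a, b, c, n);
    cudaDeviceSynchronize();

    printf("c[10] = %f\n", c[10]);  // expect 30.0
    cudaFree(a); cudaFree(b); cudaFree(c);
}
```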

