
Using LLVM to Accelerate Application Performance on GPUs - jtsymonds
http://www.mapd.com/blog/2016/04/27/massive-throughput-database-queries-with-llvm-on-gpus/
======
joe_the_user
"The most powerful GPU currently available is the NVIDIA Tesla K80
Accelerator, with up to 8.74 teraflops of compute performance and nearly 500
GB/sec of memory bandwidth."

Wow, I've been reading, on Nvidia's site and elsewhere, about CUDA programming
on GPUs, and a lot of the advice involves avoiding transfers between main
memory and the GPU. However, at the transfer rate quoted above, it seems like
you have a device that can read all of main memory in a few seconds.

This is great ... however, does this mean all the advice about avoiding memory
transfers, and the programming style that went with it, goes out the window?

Edit: as a beginner, I'm probably butchering the standard advice for C++/CUDA
programming. However, could someone summarize how things change as you get to
more capable processors? A lot of the Nvidia blog posts are geared toward stuff
valid for all CUDA versions, but since I'm starting fresh with a
compute-intensive program targeting CUDA, I'd like to target the best things
CUDA can do.

Edit2: While the article is fascinating for the possibilities it's talking
about, it's mostly a puff-piece for some (apparently) closed-source software.
Anyone know an open source equivalent or some more detailed sample code for
doing stuff similar to what MapD claims to do?

~~~
robbies
That 500 GB/s is the bandwidth between the NV shader cores and the GPU RAM,
not between the shader cores and CPU-accessible system RAM. You don't want to
be transferring between the two pools that often, as those buses are much
slower.

~~~
pmalynin
Yeah, system RAM is AFAIK about 22 GB/s

~~~
trsohmers
And PCIe Gen 3 is limited to 16 GB/s, which is the real bottleneck between the
CPU and GPU... if you can use all 16 lanes. Most GPGPU setups with multiple
GPUs are only able to use 4 or 8 lanes each, so you're then stuck with
multiple 4 or 8 GB/s bottlenecks.
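
To put rough numbers on the gap this sub-thread is describing, here's a
minimal CUDA sketch (buffer size and names are arbitrary, not from the
article) that times a host-to-device copy over PCIe against a device-to-device
copy inside GPU RAM using CUDA events. On typical hardware the first lands in
the single-digit-to-16 GB/s range discussed above while the second is an order
of magnitude higher, which is why the standard advice to keep working data
resident on the GPU still applies:

    #include <cstdio>
    #include <cuda_runtime.h>

    // Minimal sketch: time a host->device copy (over PCIe) against a
    // device->device copy (within GPU RAM) to see the bandwidth gap
    // discussed in this thread. The 256 MB buffer size is arbitrary.
    int main() {
        const size_t bytes = size_t(1) << 28;  // 256 MB

        void *h_buf, *d_src, *d_dst;
        cudaMallocHost(&h_buf, bytes);  // pinned host memory for a fair PCIe number
        cudaMalloc(&d_src, bytes);
        cudaMalloc(&d_dst, bytes);

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        float ms = 0.0f;

        // Host -> device: limited by the PCIe link.
        cudaEventRecord(start);
        cudaMemcpy(d_src, h_buf, bytes, cudaMemcpyHostToDevice);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        cudaEventElapsedTime(&ms, start, stop);
        printf("H2D: %.1f GB/s\n", bytes / ms / 1e6);

        // Device -> device: limited by GPU memory bandwidth (read + write).
        cudaEventRecord(start);
        cudaMemcpy(d_dst, d_src, bytes, cudaMemcpyDeviceToDevice);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        cudaEventElapsedTime(&ms, start, stop);
        printf("D2D: %.1f GB/s (read + write traffic: %.1f GB/s)\n",
               bytes / ms / 1e6, 2.0 * bytes / ms / 1e6);

        cudaEventDestroy(stop);
        cudaEventDestroy(start);
        cudaFree(d_dst);
        cudaFree(d_src);
        cudaFreeHost(h_buf);
        return 0;
    }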

------
Joky
"LLVM IR is quite portable over the various architectures we run on (GPU,
x86-64, ARM)."

I wonder what they mean by "portable" in this context.

~~~
tmostak
Meaning LLVM has backends that target these various architectures, with very
few (if any) changes required to the IR.
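
A toy way to see this (my own sketch, not MapD's pipeline; sum.c and sum.ll
are just placeholder filenames): compile a trivial function to LLVM IR once
with clang, then lower the same .ll file to different backends with llc by
changing only the target triple. Simple scalar IR like this usually lowers
cleanly; ABI- and target-specific details are where the "few (if any) changes"
can come in.

    // Toy illustration: the same LLVM IR lowered to several backends by
    // changing only the target triple passed to llc.
    //
    //   clang -S -emit-llvm sum.c -o sum.ll           # C -> LLVM IR
    //   llc -mtriple=x86_64-unknown-linux-gnu sum.ll  # IR -> x86-64 assembly
    //   llc -mtriple=aarch64-unknown-linux-gnu sum.ll # IR -> ARM (AArch64) assembly
    //   llc -mtriple=nvptx64-nvidia-cuda sum.ll       # IR -> PTX for NVIDIA GPUs
    int sum(const int *xs, int n) {
        int acc = 0;
        for (int i = 0; i < n; ++i)
            acc += xs[i];
        return acc;
    }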

------
brotchie
What's the current state-of-the-art for network interface -> GPGPU DMA?

I recall reading a while ago that NVIDIA was working on the ability to DMA
directly off certain network cards into GPU RAM. Has this come to fruition?

For this kind of application, is there a meaningful speed-up by fetching data
from some central data source directly into GPU memory rather than doing the
network interface -> RAM -> GPGPU RAM?

------
xcombelle
They mention that it's super fast without indexes. Is there a win with an index?

~~~
tmostak
The main focus of MapD is scan queries where you might need to look at
billions of rows to do the group-bys, joins, and aggregates needed to answer a
query. Indexes don't tend to do well for such use cases.

In the future we may add indexes so that looking up a single row or a small
number of rows is as fast as possible (such operations are fast now since the
GPU scans are so fast, but not as fast as they would be if we had indexes).
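
For anyone curious what a scan-style group-by/aggregate looks like on a GPU,
here's a toy CUDA kernel (my own sketch, definitely not MapD's generated
code): every thread strides over the rows of a column and atomically
accumulates into a small per-group array, so the cost is one pass over the
data with no index involved.

    #include <cstdio>
    #include <cuda_runtime.h>

    // Toy sketch of GROUP BY key, SUM(value): each thread strides over the
    // rows and atomically adds its value into a per-group accumulator.
    __global__ void group_by_sum(const int *group_keys, const float *values,
                                 int num_rows, float *group_sums, int num_groups) {
        int stride = blockDim.x * gridDim.x;
        for (int row = blockIdx.x * blockDim.x + threadIdx.x; row < num_rows; row += stride) {
            int g = group_keys[row];
            if (g >= 0 && g < num_groups)
                atomicAdd(&group_sums[g], values[row]);
        }
    }

    int main() {
        const int num_rows = 1 << 20, num_groups = 16;

        // Build a small example "table" on the host.
        int *h_keys = new int[num_rows];
        float *h_vals = new float[num_rows];
        for (int i = 0; i < num_rows; ++i) { h_keys[i] = i % num_groups; h_vals[i] = 1.0f; }

        int *d_keys; float *d_vals, *d_sums;
        cudaMalloc(&d_keys, num_rows * sizeof(int));
        cudaMalloc(&d_vals, num_rows * sizeof(float));
        cudaMalloc(&d_sums, num_groups * sizeof(float));
        cudaMemcpy(d_keys, h_keys, num_rows * sizeof(int), cudaMemcpyHostToDevice);
        cudaMemcpy(d_vals, h_vals, num_rows * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemset(d_sums, 0, num_groups * sizeof(float));

        group_by_sum<<<256, 256>>>(d_keys, d_vals, num_rows, d_sums, num_groups);

        float h_sums[num_groups];
        cudaMemcpy(h_sums, d_sums, num_groups * sizeof(float), cudaMemcpyDeviceToHost);
        printf("group 0 sum: %.0f\n", h_sums[0]);  // expect 65536

        cudaFree(d_sums); cudaFree(d_vals); cudaFree(d_keys);
        delete[] h_vals; delete[] h_keys;
        return 0;
    }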

~~~
joe_the_user
So does the database essentially feed the tables referenced in the query
through the GPU?

Does it accumulate values as it goes? Work on all the values together in
memory?

Anyway, it seems like this sort of speed should also allow one to work with
larger indices and do the sorts of queries those allow.

~~~
tmostak
MapD tries to cache compressed versions of the "hot" columns of a table in GPU
RAM, which could be up to 256 GB per node across 8 GPUs. If necessary, though,
it can stream the data from CPU RAM - of course such queries won't be as fast.

Yes, you could imagine accelerating index lookups with GPUs (I think there are
some research papers on this subject already) - maybe a future project for us.
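
To illustrate the "compressed hot columns" idea mentioned above in miniature
(my own toy sketch, not MapD's actual storage format): dictionary-encode a
low-cardinality column into small integer codes, pay the PCIe transfer once to
cache the codes in GPU RAM, and answer repeated filter queries against the
resident copy.

    #include <cstdio>
    #include <string>
    #include <unordered_map>
    #include <vector>
    #include <cuda_runtime.h>

    // Toy sketch: count rows whose dictionary-encoded value matches a target
    // code, scanning the column cached in GPU RAM.
    __global__ void count_equal(const int *codes, int num_rows, int target, unsigned *count) {
        int stride = blockDim.x * gridDim.x;
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < num_rows; i += stride)
            if (codes[i] == target)
                atomicAdd(count, 1u);
    }

    int main() {
        // Pretend this is a "hot" column, e.g. a country field.
        std::vector<std::string> column = {"US", "DE", "US", "FR", "US", "DE"};

        // Dictionary-encode on the CPU: each distinct string gets a small code.
        std::unordered_map<std::string, int> dict;
        std::vector<int> codes;
        for (const auto &v : column) {
            auto it = dict.emplace(v, (int)dict.size()).first;
            codes.push_back(it->second);
        }

        // Cache the encoded column in GPU RAM once (the only PCIe transfer).
        int *d_codes;
        cudaMalloc(&d_codes, codes.size() * sizeof(int));
        cudaMemcpy(d_codes, codes.data(), codes.size() * sizeof(int), cudaMemcpyHostToDevice);

        // Repeated queries hit only the resident copy,
        // e.g. SELECT COUNT(*) WHERE country = 'US'.
        unsigned *d_count, h_count = 0;
        cudaMalloc(&d_count, sizeof(unsigned));
        cudaMemset(d_count, 0, sizeof(unsigned));
        count_equal<<<64, 256>>>(d_codes, (int)codes.size(), dict["US"], d_count);
        cudaMemcpy(&h_count, d_count, sizeof(unsigned), cudaMemcpyDeviceToHost);
        printf("rows where country = 'US': %u\n", h_count);  // expect 3

        cudaFree(d_count);
        cudaFree(d_codes);
        return 0;
    }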

