
BlazingSQL – A GPU SQL Engine for RAPIDS Open-Source Software from Nvidia - felipe_aramburu
https://blog.blazingdb.com/announcing-blazingsql-a-gpu-sql-engine-for-rapids-open-source-software-from-nvidia-11e115ba7dd7
======
sgt101
Does anyone have a clue how this could glue together with YARN[1] and/or TonY
[2] from LinkedIn?

[1] [https://hortonworks.com/blog/gpus-support-in-apache-
hadoop-3...](https://hortonworks.com/blog/gpus-support-in-apache-
hadoop-3-1-yarn-hdp-3/)

[2] [https://engineering.linkedin.com/blog/2018/09/open-
sourcing-...](https://engineering.linkedin.com/blog/2018/09/open-sourcing-tony
--native-support-of-tensorflow-on-hadoop)

~~~
felipe_aramburu
I am going to look into this. It's a fantastic question, but we are too ignorant
on the matter to be able to comment properly at this time. I will respond here,
eventually :).

------
polskibus
I wonder how this will impact MapD's and Dremio's path in the market. It seems
to me like RAPIDS is going to include visualisation for free, so that would
undermine MapD's commercial offering.

~~~
randyzwitch
(OmniSci/MapD employee here)

In my opinion, I don't see this as undermining the OmniSci commercial
offering, but rather as opening up GPU-accelerated analytics and visualization
to a wider audience. A smaller piece of a much larger pie benefits all of the
contributors to the RAPIDS project (of which OmniSci is one).
~~~
felipe_aramburu
Yah, we are all solving different parts of this ENORMOUS puzzle. Our focus has
always been on reducing the pain of ETLing data into the GPU from all of the
crazy sources it can come from.

------
Steltek
What workloads is this good for?

I don't do GPGPU stuff but my layman's assumption was that shipping data to
the GPU was an expensive operation. Sending a ton of IO from "data lakes"[+]
to the GPU to do trivial comparisons seems like the worst case scenario?

+: Does that term annoy anyone else? It's just a hoarder's impression of a
database...

~~~
arnon
It is true, though. A data lake is what companies end up with by dumping
everything into HDFS, regardless of its importance and relevance. So, it is
just a hoarding database.

I'll let the BlazingSQL guys answer the workloads question, but yeah, copying
data to the GPU (host -> device) is an expensive operation, and it has to be
justified.

With SQream DB (disclaimer: am a developer on SQream DB), we decide in the
query compiler and optimizer whether to copy data over to the GPU or keep it
on the CPU for processing. The optimizer will know if it makes sense.

For example, when you're doing heavy transforms (e.g. `SELECT
(SQRT(POWER(x,2)+44)/w+RADIANS(COS(y)*z)) FROM t`), it may make sense to have
it on the GPU anyway.
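The trade-off described above (is the host-to-device copy justified by the compute savings?) can be sketched as a toy cost model. All constants below are illustrative assumptions for the sketch, not SQream's actual optimizer parameters:

```python
# Toy cost model for deciding whether to ship a batch of rows to the GPU.
# Every constant here is an illustrative assumption, not a real measurement.

PCIE_GBPS = 12.0    # assumed effective PCIe 3.0 x16 host->device bandwidth (GB/s)
CPU_GFLOPS = 50.0   # assumed sustained CPU throughput on the expression (GFLOP/s)
GPU_GFLOPS = 2000.0 # assumed sustained GPU throughput on the expression (GFLOP/s)

def offload_to_gpu(n_rows: int, bytes_per_row: int, flops_per_row: float) -> bool:
    """True if (transfer + GPU compute) beats staying on the CPU."""
    transfer_s = (n_rows * bytes_per_row) / (PCIE_GBPS * 1e9)
    gpu_s = (n_rows * flops_per_row) / (GPU_GFLOPS * 1e9)
    cpu_s = (n_rows * flops_per_row) / (CPU_GFLOPS * 1e9)
    return transfer_s + gpu_s < cpu_s

# A trivial comparison (~1 flop/row) is not worth the copy...
print(offload_to_gpu(100_000_000, 8, 1))    # -> False
# ...but a heavy transform (hundreds of flops/row) is.
print(offload_to_gpu(100_000_000, 8, 500))  # -> True
```

With these assumed numbers, the 67 ms it takes to push 800 MB over PCIe dwarfs the 2 ms a trivial filter costs on the CPU, while a 500-flop-per-row transform flips the decision.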

~~~
twtw
> copying data to the GPU (host -> device)

Copying data to the CPU (peripheral -> host) is expensive too, and it's
perfectly possible to just replace peripheral -> host with peripheral ->
device. I don't know if that is what blazingsql does, but it shouldn't be
discounted.

~~~
felipe_aramburu
If you mean something like network card ==> device via RDMA, we are not doing
that yet, but we are currently working on this implementation. These past few
months we have focused more on adding functionality to libgdf. We are now
optimizing some of the I/O paths like the one you mention here. As of now, if
data goes over the wire it is copied into pre-allocated pinned memory buffers
and then sent to the GPU. This will go away in about a month or so.
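The staging path described above (wire -> pinned host buffer -> GPU) can be sketched schematically. Pure-Python stand-ins are used here: `bytearray` stands in for pinned host memory (which would really come from something like `cudaHostAlloc`), and `send_to_gpu` stands in for an async host-to-device copy; this is a sketch of the pattern, not BlazingSQL's code:

```python
# Schematic of staging incoming wire data through preallocated buffers.
# bytearray stands in for pinned host memory; send_to_gpu() stands in for
# an async host->device copy. Both are stand-ins for illustration only.

BUF_SIZE = 4 * 1024 * 1024  # 4 MiB staging buffers, preallocated once

staging = [bytearray(BUF_SIZE), bytearray(BUF_SIZE)]  # double buffer

device_copies = []  # stands in for data landed in device memory

def send_to_gpu(buf: bytearray, n: int) -> None:
    """Stand-in for an async host->device copy of the n valid bytes."""
    device_copies.append(bytes(buf[:n]))

def stage_stream(chunks) -> int:
    """Copy incoming wire chunks into alternating staging buffers and
    hand each filled buffer to the 'GPU'. Returns total bytes staged."""
    total = 0
    for i, chunk in enumerate(chunks):
        buf = staging[i % 2]   # alternate buffers so the next receive
        n = len(chunk)         # could overlap the in-flight copy
        buf[:n] = chunk
        send_to_gpu(buf, n)
        total += n
    return total

total = stage_stream([b"a" * 1000, b"b" * 2000, b"c" * 500])
print(total)  # -> 3500
```

Preallocating (and pinning) the buffers up front is what lets the real implementation avoid a per-message allocation and use DMA-friendly memory for the device copy; RDMA would remove the host bounce entirely.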

------
bufferoverflow
Anybody tried PG-Strom in production?

~~~
usgroup
Personally I'm most excited about PGStrom, because it means faster Postgres
all the time, not just when it fits into memory, and not just for SQL.

~~~
arnon
It appears to be, more than anything else, some CUDA code wrapped in UDFs. Not
sure I'd want that anywhere near my production stuff in that format.

~~~
usgroup
This is false. PG has a well-defined and modular query engine. PG-Strom works
by replacing and adding modules which allow for GPU support.

~~~
arnon
You're right, I got confused with another Postgres-GPU plugin type product.

However, Postgres isn't columnar. How does PG-Strom (now HeteroDB) arrange
data for the GPU properly?

------
aogl
I'd be quite interested to benchmark this in production, possibly as a master
replica to see how it compares in real-time.

What is the best way to pull real-time metrics from it into some sort of
dashboard?

~~~
felipe_aramburu
You can reach out to me (my first name at blazingdb) if you want to talk about
this.

In a nutshell, users interact with Blazing through the Python API for the most
part. If you have small result sets like those that normally go into a
dashboard (very large data sets, small result sets, normally), then you could
write queries in Python that get distributed to a cluster. The result sets are
then available to be retrieved either via CUDA IPC locally or via TCP if you
want to pull them back to the user in Python. We will be incorporating faster
interconnects using UCX in the coming months for multi-node clusters.

------
jmakov
Will this also support rendering plots on the GPU? I think this is really the
killer feature.

------
meh2frdf
Is this in any way related to Blazegraph, the GPU-accelerated database?

~~~
ddorian43
This is from blazingdb.com, so SQL on GPU (Blazegraph is different).

~~~
meh2frdf
Blazegraph is now an Amazon-owned trademark; good luck defending your name!

------
aocsa
Excellent work!

------
glenrivard
Interesting stuff. I thought one of the most interesting papers at NIPS this
year was using GPUs or TPUs for intelligent indexes.

[https://arxiv.org/abs/1712.01208](https://arxiv.org/abs/1712.01208)

Paper is from Jeff Dean and a worthwhile read, IMO.

