
BlazingSQL is Now Open Source - roaramburu
https://blog.blazingdb.com/blazingsql-is-now-open-source-b859d342ec20
======
tombert
This seems pretty cool, and I'll probably play with this at some point, but
sadly literally all of my GPUs are AMD or Intel at this point.

I'm sure you had a good reason, so I'm genuinely curious as to why CUDA was
chosen instead of something like OpenCL?

(I'll add my typical disclaimer that I'm not saying this as some passive-
aggressive way to criticize; I'm genuinely curious about the reasoning behind
the choice.)

~~~
felipe_aramburu
That's a great question. The answer is two-fold.

Early on, when we first started playing around with general-purpose computing
on GPUs, we had Nvidia cards to begin with, and I started looking at the APIs
that were available to me.

The CUDA ones were easier for me to get started with, had tons of learning
content that Nvidia provided, and were more performant on the cards that I had
at the time compared to other options. So we built up lots of expertise in this
specific way of coding for GPUs. We also found, time and time again, that it
was faster than OpenCL for what we were trying to do, and the hardware
available to us on cloud providers was Nvidia GPUs.

The second answer to this question is that BlazingSQL is part of a greater
ecosystem, RAPIDS (rapids.ai), whose largest contributor by far is Nvidia. We
are really happy to be working with their developers to grow this ecosystem,
and that means the technology will probably be CUDA-only unless we somehow
program "backends" like they did with Thrust, but that would be eons away from
now.

~~~
tombert
Thanks for answering my question so quickly!

That seems like a pretty good reason... I have been looking to learn some GPU
programming to optimize some matrix math that I've been doing for a pet
project, and while my first instinct was telling me OpenCL since it's
portable, if people who actually know what they're talking about are saying
that CUDA is simpler to start with, it might be worth it to me to pick up a
cheap Nvidia GPU/Jetson Nano and do some processing that way.

~~~
felipe_aramburu
The Colab link below lets you use a GPU for free on Google Cloud

------
dang
Discussed twice last year:

[https://news.ycombinator.com/item?id=18186392](https://news.ycombinator.com/item?id=18186392)

[https://news.ycombinator.com/item?id=19192625](https://news.ycombinator.com/item?id=19192625)

Also 2017:
[https://news.ycombinator.com/item?id=15819489](https://news.ycombinator.com/item?id=15819489)

2016:
[https://news.ycombinator.com/item?id=12484568](https://news.ycombinator.com/item?id=12484568)

------
rburhum
This is great. The BlazingDB guys are awesome and now that the project is open
source this is another good reason for my teams to experiment with different
workloads and compare it against a SparkSQL approach

~~~
huac
+1, this is very cool, but would love for the BlazingDB team to show
benchmarks here

~~~
roaramburu
Tons of benchmarks at blog.blazingdb.com

Check it out, it's fast.

------
samstave
Can someone give me some use case examples?

I read that site, and the RAPIDS site, but would like to hear from some people
using this in prod/test and what they are using it for...

~~~
lmeyerov
We worked with the team early on. In turn, that means it's inside one of the
power tools at gov, bank, etc. teams, even if most of the users don't quite
know what a GPU DB is :) We do GPU visual graph analytics over event data
(security, fraud, customer 360, ...). We use it for a bunch: interactive
sub-100ms timebars, histograms, etc. Any full-table compute stuff you'd do in
pandas, SQL, Spark, etc. Any UI interaction like a filter can trigger tons of
queries, and w/ GPUs, that means they can quickly compute all sorts of things.

The reason Graphistry picked BlazingSQL is it fit in as part of our approach
of end-to-end GPU services that compose by sharing in-memory Apache Arrow
format columnar data. When the Blazing team aligned on Nvidia RAPIDS more
deeply than the other 2nd-wave GPU analytics engines, it made the most sense
as an embedded compute dependency. Going forward, that means Blazing can focus
on making a great SQL engine, and we know the rate of their GPU progress won't
be pegged to their team but to RAPIDS. A surprise win over just cudf (python)
was eliminating most of the constant overheads (10ms->1ms / call), and looking
forward, seems like an easier path to multi/many-GPU vs. cudf (dask).
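
To make the "any UI filter triggers tons of queries" pattern concrete, here's a CPU sketch using only the Python stdlib; in our stack the same full-table pass runs through cudf/BlazingSQL on GPU instead of a Python loop. The column names and data are invented for illustration.

```python
# Sketch of the "filter -> recompute histogram/timebar" pattern.
# On GPU this full-table pass would go through cudf/BlazingSQL;
# here it's plain stdlib Python over a toy column-wise event table.
from collections import Counter

# Toy event table stored column-wise, as Arrow/cudf would hold it.
events = {
    "kind": ["login", "login", "fraud", "login", "fraud", "purchase"],
    "hour": [0, 1, 1, 2, 2, 2],
}

def filtered_histogram(table, column, predicate):
    """Full-table pass: keep rows whose `kind` matches, histogram `column`."""
    keep = [predicate(k) for k in table["kind"]]
    return Counter(v for v, k in zip(table[column], keep) if k)

# UI interaction: user clicks a "fraud" filter -> the timebar recomputes.
print(filtered_histogram(events, "hour", lambda kind: kind == "fraud"))
# Counter({1: 1, 2: 1})
```

Every interaction re-runs passes like this over the whole table, which is why shaving per-call overhead (10ms->1ms) matters so much.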

We should share a tech report at some point - bravo to the team!

~~~
felipe_aramburu
Thanks Leo! We love having you all as early adopters of our tech!

------
pradn
Looks like a good way to do analytics on the GPU. The Python API is clean and
simple.

The premise is that GPUs will accelerate columnar data analytics. And, with
"Dask" [1], you can run those workloads on a cluster.

I wonder if careful indexing on initial write would outperform this system.
This system looks like it's best when you have totally raw, unindexed data.
Perhaps a future thing to do is to generate a side index during initial column
scans to speed up future queries?

Also, GPU memory is pretty expensive. How does the total cost of ownership
compare to just running in RAM with powerful multi-core CPUs? There are
512-bit vector operations these days.

[1]: [https://rapids.ai/dask.html](https://rapids.ai/dask.html)
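
For the curious, the BlazingSQL Python flow is roughly: create a context, register a DataFrame as a table, then run SQL against it. No GPU here, so this sketch mirrors the same shape with stdlib sqlite3 as a stand-in; the table and numbers are made up for illustration.

```python
# A CPU stand-in for the BlazingSQL workflow, using stdlib sqlite3.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE taxi (passenger_count INTEGER, fare REAL)")
con.executemany("INSERT INTO taxi VALUES (?, ?)",
                [(1, 7.5), (2, 12.0), (1, 5.25)])

# BlazingSQL equivalent (assuming a RAPIDS/GPU install):
#   from blazingsql import BlazingContext
#   bc = BlazingContext()
#   bc.create_table('taxi', gpu_df)        # register a cudf DataFrame
#   result = bc.sql('SELECT ... FROM taxi') # returns a cudf DataFrame
rows = con.execute(
    "SELECT passenger_count, AVG(fare) FROM taxi "
    "GROUP BY passenger_count ORDER BY passenger_count"
).fetchall()
print(rows)  # [(1, 6.375), (2, 12.0)]
```

The GPU version returns results as cudf DataFrames, which is what lets Dask distribute the same queries across a cluster.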

~~~
felipe_aramburu
GPU memory is expensive, but a big-as-#@$% computer is even more expensive.
When we show comparisons to things like Spark, we are doing so on a cost
basis. So if we say something like "we are x times faster than this technology
on this workload," what we did was launch clusters that have similar costs.
Total cost of ownership is also reduced by the fact that the engine itself is
totally ephemeral. You can turn it off and on within seconds.

------
orliesaurus
I had never heard of this before. Has anyone on HN used this before? If yes,
where, and more specifically, what was your use case? Thank you!

------
bsamuels
What kind of benefits does CUDA bring to databases? I've never heard of
running a database on a GPU before. Couldn't find anything on their homepage
other than comparison with a few other db options

~~~
reilly3000
Check out [https://www.omnisci.com/learn/resources/gpu-
database](https://www.omnisci.com/learn/resources/gpu-database)

In summary, you get snappy, interactive query speeds on large data sets. I've
run that locally and the results are pretty amazing compared to Postgres or
even Tableau in-memory.

I'm personally more excited about GPUs in stream processing; it's just quite a
natural fit:
[https://github.com/rapidsai/cudf](https://github.com/rapidsai/cudf)

~~~
kichik
If you're interested in stream processing, check out FASTDATA.io PlasmaENGINE.
We do both stream and batch processing with Apache Spark on the GPU.

[https://fastdata.io/plasma-engine/](https://fastdata.io/plasma-engine/)

* It's not open-source and I work there.

~~~
arnon
Hi Kichik :)

------
manojlds
Is this due to PartiQL?

~~~
manigandham
PartiQL is a query language, based on SQL and extended to be more natural with
unstructured and nested data. It can be used with various database and
querying engines.

BlazingDB/SQL is a querying engine, more similar to Presto or Apache Drill,
and specializes in using GPUs for processing power.

------
llampx
Does this only run on NVIDIA/CUDA systems?

~~~
bernaferrari
> BlazingSQL is built entirely on top of cuDF and cuIO.

Yes.

