
BlazingSQL – GPU SQL Engine Now Runs Over 20X Faster Than Apache Spark - felipe_aramburu
https://blog.blazingdb.com/blazingsql-the-gpu-sql-engine-now-runs-over-20x-faster-than-apache-spark-1b0bffc990a9
======
budde
Yeah, this is really, really far from an apples-to-apples comparison. First
off, the test dataset size is trivially small for the use cases where big data
systems are typically applied. I don't know why you'd introduce all the
complexity and overhead of a distributed MapReduce framework to ETL a dataset
that would fit in memory on consumer-grade hardware. It's not exactly fair to
compare a framework running on a single node to one where you've artificially
introduced multiple nodes and network overhead for a dataset that would easily
fit on one. You'll also notice a pretty stark difference between the level of
detail provided for the BlazingSQL test setup and the Spark one, which
(unless I'm missing something) is lacking any code or configuration details.
I've dipped my toes in the big data space long enough, and seen enough "${FANCY
NEW FRAMEWORK} beats ${INDUSTRY-STANDARD FRAMEWORK} by 123x!!" posts, to
recognize this as a gigantic red flag. How you manage partition sizes, the
order and choice of operations, and tuning parameters can make
orders-of-magnitude differences to your performance.

Maybe the future of frameworks like this will be on the GPU. I'm just not
seeing any evidence of it yet. Right now, Spark fills the space where you can
throw globs of memory at TB- to PB-scale problems. I could very well be wrong,
but I don't see how this is going to be cost-effective on GPUs given the
current cost of memory there.

~~~
felipe_aramburu
1\. You don't have to fit your whole workload on the GPU; you can process it
in batches, just as you would for a workload that doesn't fit into memory in a
non-GPU solution. You don't need PBs of GPU memory to run PB-scale workloads.
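The batching idea can be sketched in plain Python (a toy illustration, not BlazingSQL's API; `process_batch` is a hypothetical stand-in for a per-batch GPU step):

```python
from itertools import islice

def batches(rows, batch_size):
    """Yield fixed-size batches from an iterable, so only one batch
    has to fit in (GPU) memory at a time."""
    it = iter(rows)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

# Hypothetical stand-in for a per-batch GPU aggregation step.
def process_batch(batch):
    return sum(batch)

# Stream the dataset through in batches and combine the partial results.
total = sum(process_batch(b) for b in batches(range(10), batch_size=4))
```

The partial results per batch are combined at the end, so peak memory is bounded by the batch size rather than the dataset size.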

2\. The dataset is trivially small because this is a new engine built for the
RAPIDS ecosystem, and it is limited for the time being to a single node. We
are releasing our distributed version for GTC (mid-March) and will be able to
give you more reasonable comparisons. This follows a similar path of
development to our pre-RAPIDS engine, which went from single-node to
distributed in about a month, because we have built this engine to be
distributed. Right now we are finishing up UCX integration, which is the layer
we will be using to communicate between all the nodes.

3\. You can always try it out. It's on Docker Hub (see links in this post),
and if you want to run distributed workloads right now you can manage that
process using dask by handling the splitting up of the job yourself. In a few
weeks you will be able to have the job split up for you automatically, without
the user needing to be aware of the size of the cluster or how to distribute
data across it.
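For illustration, the user-managed splitting looks roughly like this, with Python's stdlib `ThreadPoolExecutor` standing in for dask and a hypothetical `query_shard` in place of the real engine call:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-shard step: in the real setup this would be a
# BlazingSQL query over one shard of files; here, a toy word count
# over "files" represented as lists of lines.
def query_shard(shard):
    return sum(len(line.split()) for f in shard for line in f)

def run_user_managed(files, n_workers=2):
    # User-managed distribution: split the file list into one shard per
    # worker, run the per-shard query in parallel, combine the partials.
    shards = [files[i::n_workers] for i in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = pool.map(query_shard, shards)
    return sum(partials)
```

The engine-managed version would do the same sharding and combining for you, which is why the user doesn't need to know the cluster size.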

~~~
lmeyerov
We're pretty excited near-term about getting to sub-second / sub-100ms
interactive times on real GB-scale workloads. That's pretty normal in GPU
land. More so, where this is pretty clearly going is using multi-GPU boxes
like DGX-2s that already have 2 TB/s memory bandwidth. Unlike multinode CPU
systems, I'd expect better scaling b/c there's no need to leave the node.

With GPUs, the software progression is single GPU -> multi-GPU -> multinode
multi-GPU. By far, the hardest step is single GPU. They're showing that.

------
paulsutter
Better headline: "Apache Spark Is Very Slow". Only about 1 megabyte per second
per node? (3.8 GB / (8 nodes * 500 seconds)) Even with single-core nodes,
that's utterly feeble (pro tip: please state the kind of node you're using).
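Spelling out that arithmetic:

```python
# Back-of-the-envelope check of the per-node throughput implied by the
# Spark numbers quoted above: 3.8 GB over 8 nodes in ~500 seconds.
data_gb = 3.8
nodes = 8
seconds = 500

mb_per_sec_per_node = data_gb * 1000 / (nodes * seconds)  # ~0.95 MB/s
```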

Seems fishy that the benchmark is small enough to fit into GPU RAM as well

~~~
pete23
I'd like a Postgres bake-off. Playing against Spark on a 24 GB dataset seems
silly, too.

~~~
felipe_aramburu
1\. It really isn't the kind of problem we are trying to solve. 2\. Last time
we did this we were over 100x faster, and so we said: OK, it's slow as &#@%.

We are more interested in comparing against tools that can provide workflows
like:

A. I have a ton of data in S3, HDFS, cloud storage solutions, or a centralized
NFS.

B. I want to use the GPU to perform learning, graph analytics, classification,
etc.

I want to get from A to B as fast as possible, utilizing the same resources
for A that I will need to complete step B.

------
michelpp
There is also GPU support for PostgreSQL using an extension called pg-strom

[https://github.com/heterodb/pg-strom](https://github.com/heterodb/pg-strom)

Not sure how it compares to this tool, but it can use the GPU in two ways: for
high-density analytics, by loading data straight from disk to the GPU with DMA
and avoiding the CPU entirely, or as a prefilter that removes lots of
candidate rows and then lets the CPU run the core query on much fewer rows.

~~~
felipe_aramburu
PG-Strom is a great way to make Postgres faster. Nice. But if the purpose here
is to build a very fast, responsive system that interacts with other GPU
tools, well then, snap, I don't see how it can compete.

The old-school workflow (run this query, get my results, read those results
using something like JDBC) is too slow. You can literally take out the "run
this query" portion and it will still be slower than what we believe needs to
exist to power GPU-accelerated workloads.

We are trying to keep up with the pace of hardware improvements that NVIDIA
has been churning out. Moving a result set from Postgres to wherever it needs
to be, performing copies and making allocations for that data along the way,
is a cost we want to save people from.

------
hbogert
Now do an experiment with 4 GPU nodes.

Horizontal scaling matters as well, which is not shown here. Furthermore, with
less than 16 GB of data, why would you use these frameworks in the first
place?

It is common knowledge that a grep and sort on 100 GB is faster than Hadoop;
however, 100 TB is a different story.

~~~
felipe_aramburu
Coordinated distribution is coming soon. We have been building this tool with
distribution in mind, and right now we are finishing up our first distributed
version of BlazingSQL, which we will be releasing just before GTC (~March
15th). You can already distribute using dask RIGHT now with BlazingSQL, but we
felt it would not be fair to make a comparison between distributed BlazingSQL,
which is user-managed, and something like Spark, where distribution happens
under the hood. We will be releasing distributed numbers in early March, where
we can show differences between BlazingSQL and larger Spark clusters with the
distribution handled by the engine instead of the user.

------
PeterZaitsev
Hi,

Apache Spark is an easy target to beat. I would be much more interested in
seeing a comparison to ClickHouse :)

~~~
felipe_aramburu
Do you have an example of workloads that are easy to reproduce using
ClickHouse, so we can see how much effort it would be to make a comparison? We
are always happy to test against other tools!

~~~
bicubic
I mean, it has pretty full-featured SQL support, so you can probably reproduce
your current scenario with it?

~~~
felipe_aramburu
And you think
[https://tech.marksblogg.com/billion-nyc-taxi-clickhouse-cluster.html](https://tech.marksblogg.com/billion-nyc-taxi-clickhouse-cluster.html),
for example, is something that can be considered fast? It takes the user 55
minutes just to load the data into a queryable state.

After importing, they spend 34 more minutes converting the data into a
columnar representation. Alright, so 89 minutes in and we still haven't run
any queries.

Oh, but it's not distributed yet. Darn, I have to run some non-standard SQL
commands like

    CREATE TABLE trips_mergetree_x3 AS trips_mergetree_third ENGINE =
    Distributed(perftest_3shards, default, trips_mergetree_third, rand());

OK, can I query my data yet? No, you have to move it into this distributed
representation, and that takes 15 more minutes. Oh, OK...

And now? Yes, you can run your queries, but they aren't really very fast.

    SELECT cab_type, count(*) FROM trips_mergetree_x3 GROUP BY cab_type;

2.5 seconds on a 108-CPU-core cluster for only 1.1BN rows? That's not fast.
That's particularly slow given that it requires you to ingest and optimize
your data first.

Maybe you want to show us an example of some simple tests you have run with
Blazing and ClickHouse. As I read it now, it's not worth our time to look
into, because it's so very different from what we are trying to offer, which
is: connect to your files wherever you have them, ETL quickly, train/classify,
move on!

~~~
bicubic
The ingest time is due to updating the merge tree. You don't need a merge tree
for ETL... It's like the worst backing store you could possibly choose. You're
also comparing an intentionally horizontally distributed query to a purely
vertical one on a single node. You can see just slightly below that the same
query takes 0.2 seconds on a single node.

I was hoping to see some serious consideration given to these kinds of
benchmarks, considering ClickHouse is one of the most cost-effective tools
I've used in the real world and occasionally outperforms things like MapD.

I was expecting your solution to outperform Clickhouse at least in some
aspects, and a benchmark showing where it wins. Instead you reveal ignorance
of Clickhouse and even the benchmarks you linked.

Your comment comes off as incredibly arrogant and at the same time incredibly
misinformed. Disappointing to see this attitude from the team.

~~~
felipe_aramburu
I am ignorant of ClickHouse. It doesn't really compete for the workloads we
are interested in. Sorry you feel this way, but we are a small team and need
to focus on tools that integrate with Apache Arrow and cuDF natively.

If it doesn't take input from Arrow or cuDF, and doesn't produce output in
Arrow, cuDF, or one of the file formats we are decompressing on the GPU, then
we don't care, unless one of our users asks us for it.

We are 16 people, and a year ago we were 5. We can't test everything, just the
tools our users need to replace in their stacks. I apologize if I came off as
arrogant. I have Tourette's syndrome and a few other things that make it
difficult for me to communicate, particularly when discussing technical
matters. If I have offended you, I do apologize, but not a single one of our
users has said to me, "I am using ClickHouse and want to speed up my GPU
workloads." Maybe it's so fast they don't mind paying a serialization cost
going from ClickHouse to a GPU workload, and if so, that's great for them!

~~~
bicubic
Understood.

I do suggest you seriously benchmark against ClickHouse, because where
single-node performance is concerned, it is _the_ tool to beat outside arcane
proprietary stuff like kdb+ and BrytlytDB. I have used single-node ClickHouse
and seen interactive query times where a >10-node Spark cluster was
recommended by supposed experts.

ClickHouse is not a mainstream tool (and I have discussed its limitations in
other threads), but it is certainly rising in popularity, and in my view it
comes pretty close to first place for general-purpose perf, short of
Google-scale datasets.

~~~
felipe_aramburu
OK. Right now we are in tunnel-vision mode to get our distributed version out
by GTC in mid-March. We will benchmark against ClickHouse sometime in March.
Do you know of any benchmark tests that are a bit more involved in terms of
query complexity? We are most interested in queries where you can't be clever
and use things like indexing and precomputed materializations.

The more complex the query, the less you can rely on being clever and the more
the guts need to be performant, and that is what matters most to us right now.

~~~
hodgesrm
I work for Altinity, which offers commercial support for ClickHouse. We like
benchmarks. :)

We use the DOT airline on-time performance dataset
([https://www.transtats.bts.gov/tables.asp?DB_ID=120](https://www.transtats.bts.gov/tables.asp?DB_ID=120))
and the Yellow Taxi trip data from NYC Open Data
([https://data.cityofnewyork.us/browse?q=yellow%20taxi%20data&sortBy=relevance](https://data.cityofnewyork.us/browse?q=yellow%20taxi%20data&sortBy=relevance))
for benchmarking real-time query performance on ClickHouse. I'm working on
publishing both datasets in a form that makes it easy to load them quickly.
Queries are an exercise for the reader, but see Mark Litwintschik's blog for
good examples:
[https://tech.marksblogg.com/billion-nyc-taxi-clickhouse.html](https://tech.marksblogg.com/billion-nyc-taxi-clickhouse.html).

We've also done head-to-head comparisons on time series using the TSBS
benchmark developed by the Timescale team. See
[https://www.altinity.com/blog/clickhouse-timeseries-
scalabil...](https://www.altinity.com/blog/clickhouse-timeseries-scalability-
benchmarks) for a description of our tests as well as a link to the TSBS
Github project.

~~~
einpoklum
On an unrelated note: Oh, if you guys are using the OnTime data, have a look
at this: [https://github.com/eyalroz/usdt-ontime-
tools](https://github.com/eyalroz/usdt-ontime-tools)

------
georgewfraser
Spark is not a very fast SQL engine. A more appropriate comparison would be
something like Redshift or Snowflake. This GitHub repo [1] contains the source
code to run TPC-DS on 5 data warehouses; it would be great to add BlazingDB.

[1]
[https://github.com/fivetran/benchmark](https://github.com/fivetran/benchmark)

------
dominotw
CMU talk series:
[https://www.youtube.com/watch?v=tUIrR_mj9fQ&index=4&list=PLSE8ODhjZXjbjOyrcqgE6_lCV6xvzffSN](https://www.youtube.com/watch?v=tUIrR_mj9fQ&index=4&list=PLSE8ODhjZXjbjOyrcqgE6_lCV6xvzffSN)

------
cavisne
Uber and others seem to be having success with GPU-accelerated big data
workloads, but it's hard to see how this would work in a cloud environment.
Nvidia's price discrimination works for ML/rendering tasks that almost mandate
GPUs, but it would significantly reduce the value proposition for generic
workloads.

~~~
felipe_aramburu
What makes it hard to work in a cloud environment? The on demand nature of it?
The fact that you only pay for the time you are using it? The ease of scaling
or deployment?

It would be helpful if you could qualify some of these statements.

------
dswalter
I couldn't find any license information for BlazingSQL. Where is that
available?

~~~
roaramburu
Sorry, we didn't have it easily available. You can see it here:
[https://www.blazingdb.com/#/eula](https://www.blazingdb.com/#/eula)

Our open source code (the compute kernels, GPU DataFrame, etc.) is part of
RAPIDS AI. You can see source and license here:
[https://github.com/rapidsai/cudf](https://github.com/rapidsai/cudf)

------
fulafel
To save time from anyone else curious, BlazingSQL requires a Nvidia GPU and
Linux.
([https://blazingdb.com/#/documentation](https://blazingdb.com/#/documentation))

------
continuations
So I saw this on their website: "SQL on RAPIDS, the open-source GPU data
science framework"

So is BlazingSQL open source? If it is, where is the source code, and what
license is it under?

Also, where is the source code for RAPIDS? And what license is it under?

~~~
twtw
Bit of a lmgtfy question, but here you go:

[https://github.com/rapidsai](https://github.com/rapidsai)

[https://github.com/rapidsai/cudf/blob/branch-0.6/LICENSE](https://github.com/rapidsai/cudf/blob/branch-0.6/LICENSE)

------
techie128
I am generally not critical of people posting benchmarks, and I'm not trying
to rain on your parade, but seriously, you do _not_ use Apache Spark to
process 3.8 GB or 15.6 GB of data. It is too trivial.

~~~
felipe_aramburu
I do not disagree with you. We are going to be publishing much larger
benchmarks soon. Before we could start building BlazingSQL we had to work on
cuDF in the RAPIDS ecosystem. There was a lot of work that went into building
all of the primitives that were needed for both our SQL engine and all of the
other wonderful tools that are now leveraging cuDF.

We have to start somewhere, and if you think we should compare against
something else, please tell us what you think we should compare ourselves to!
We are already looking into some of the other tools that have been mentioned
here in the comments.

------
thanatos_dem
What’s going on with that first graph? The scale changes from 1500” to 3000”,
but Spark’s bar sizes stay the same.

Did Spark have a 2x slowdown between the tests? It seems pretty misleading.

~~~
thestephen
Further down in the article:

 _In our original demo, we ran a V100 GPU against 8 CPU nodes. The new T4 GPUs
cut our cost in half, which meant we reduced the Apache Spark cluster to 4 CPU
nodes in order to maintain price parity. Even with the reduced GPU memory, the
whole workload ran significantly faster._

~~~
einpoklum
Actually, what you should normalize is operating cost, so power consumption is
probably the better figure to normalize by.

... on the other hand, that's more difficult to measure, and TDP figures don't
reflect what really happens.

------
bithavoc
How different is BlazingSQL to Uber's AresDB[0]?

[0][https://eng.uber.com/aresdb/](https://eng.uber.com/aresdb/)

~~~
felipe_aramburu
In lots of ways. AresDB has ingestion. Users take their data and say: here,
make a table from this, persist the data the way I want you to, and great, now
I can query it and go about my merry business. I see lots of nice, fancy
optimizations in there. The kinds of things a small company like Blazing can
only marvel at!

I also see a completely different use case. We don't want to ingest data into
Blazing. That is NOT our cup of tea. We will make tools to help you persist
data quickly (working on it). But we want to read data the way YOU store it,
without asking you to duplicate it into our system or requiring significant
effort from the user.

You have files? Great, we can read them, query them, and do all kinds of
acrobatics on them, and we don't need you to store redundant copies of what
you already had.

Your files are stored in S3, HDFS, or other file systems? Great, we have built
connectors to interface with those file systems, and you can query directly
from them.

A user should be able to get Blazing's Docker container, register a filesystem
with the engine, and start querying their files within minutes of having
launched the container. When we don't deliver this, we fail. It does not seem
to me that AresDB is trying to do this, but I could be completely wrong, and
if you think so please tell me why!

BlazingDB is meant to be connected to other tools in the RAPIDS ecosystem. You
interact with it through Python. You can zero-copy IPC the data to any other
tool in the RAPIDS ecosystem. Want to power a dashboard? Sure, but this isn't
the use case we are optimizing for. We don't have a JDBC connection. We don't
connect to Tableau.

Blazing focuses on interpreting rather than JITing. Instead of focusing on
making the most performant compiled code possible by JITing exactly what we
want, we pay a processing cost that reduces I/O. We convert logical plans into
physical execution plans that are suitable for a SIMD architecture, with
optimization centered on minimizing materializations and global memory
accesses. We tried JIT, but couldn't find a way to compile plans in less than
a few hundred nanoseconds, at which point the overhead of compiling exceeded
that of interpreting operations in a kernel.
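As a toy sketch of the interpret-vs-JIT distinction (nothing like BlazingSQL's actual internals): the physical plan is plain data that a generic loop dispatches at runtime, so there is no per-query compilation step, at the cost of materializing each intermediate result:

```python
import math

# Toy physical operators, dispatched by name at query time.
OPS = {
    "add": lambda a, b: [x + y for x, y in zip(a, b)],
    "mul_scalar": lambda a, k: [x * k for x in a],
    "sin": lambda a: [math.sin(x) for x in a],
}

def execute(plan, columns):
    """Interpret a physical plan: each step names an output column, an
    op, and its inputs. Every step materializes an intermediate result,
    which is exactly what a real engine tries hard to minimize."""
    for out, op, *args in plan:
        inputs = [columns[a] if isinstance(a, str) else a for a in args]
        columns[out] = OPS[op](*inputs)
    return columns

# Roughly 'SELECT colA + colB * 10' expressed as an interpreted plan.
cols = {"colA": [1.0, 2.0], "colB": [3.0, 4.0]}
plan = [("tmp", "mul_scalar", "colB", 10),
        ("res", "add", "colA", "tmp")]
result = execute(plan, cols)["res"]
```

A JIT engine would instead compile the whole plan into one fused kernel, trading compile-time overhead for fewer materializations.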

------
mastrsushi
Don't SQL engines typically run on servers which lack GPUs? That pretty much
kills the entire hardware demographic of enterprise SQL users.

~~~
felipe_aramburu
The purpose of Blazing is to accelerate GPU workloads. SQL is just a language
we provide users as an interface for performing complex transformations in the
RAPIDS ecosystem (rapids.ai).

We aren't targeting enterprise SQL users specifically. We are targeting people
using GPUs who want to accelerate the process of making information available
to those GPUs, by leveraging the GPUs themselves for parsing (CSV),
decompressing (Parquet), and then transforming data to make it suitable for
machine learning, classification, etc.

------
m0zg
Molasses is fast compared to Apache Spark. Compare to something like Redshift
or Google BigQuery, and on a much larger dataset.

------
ris
The source being where?

~~~
felipe_aramburu
Most of the execution kernels that we use are in
[https://github.com/rapidsai/cudf](https://github.com/rapidsai/cudf). The
engine itself is not open source, though it is free to use, and we have a
Docker Hub link which is pretty convenient:
[https://hub.docker.com/r/blazingdb/blazingsql/](https://hub.docker.com/r/blazingdb/blazingsql/)

Conda works too, but it's still a bit painful, and we are working on making
that a bit more user-friendly.

------
wazokazi
Is there support for loading data from Parquet files?

~~~
roaramburu
There is; this demo is done with CSVs b/c we can't read Apache Parquet files
with strings yet (changing very, very, very soon). You can check out how here:
[https://docs.blazingdb.com/docs/database-administration#section-apache-parquet](https://docs.blazingdb.com/docs/database-administration#section-apache-parquet)

------
jarjar12
Keep up the great work. Don’t listen to others.

------
heavenlyblue
So how's the string support on BlazingSQL?

~~~
lmeyerov
The reason our team (Graphistry) is working w/ the BlazingDB team is q's like
this. BlazingSQL subsumes RAPIDS / cudf... which now includes the
GPU-optimized nvstrings lib. And instead of a hacky grad student's GPU string
lib, this is Nvidia's.

We initially bet on OpenCL... but had to write almost everything ourselves. We
then switched to GOAI, despite having to write much of Apache Arrow[JS], among
other things. I love q's like that now that more and more pieces are falling
into place with best-of-class tech. That's a big deal for our users, and helps
our small team punch above our weight in helping them. 2019 will be fun!

------
hyperpallium

      SELECT colA + colB * 10, sin(colA) - cos(colD) FROM tableA
    

GPUs excel at parallel trigonometry computation.

~~~
einpoklum
Only if your data is already on the GPU. Otherwise, this small amount of
computational work is swamped by the cost of I/O back and forth to main
memory.

------
moocowtruck
link to the source? i couldn't find it on the page

~~~
felipe_aramburu
[https://github.com/rapidsai](https://github.com/rapidsai) is the source for
the RAPIDS tools. BlazingSQL itself is not open source, though it is free to
use.

------
aocsa
excellent work!

