Hacker News new | past | comments | ask | show | jobs | submit login
BlazingSQL – GPU SQL Engine Now Runs Over 20X Faster Than Apache Spark (blazingdb.com)
192 points by felipe_aramburu 31 days ago | hide | past | web | favorite | 76 comments

Yeah, this is really, really far from an apples-to-apples comparison. First of, the test dataset size is trivially small for usecases where big data systems are typically applied. I don't know why you'd introduce all the complexity and overhead of a distributed mapreduce framework to ETL a dataset that would fit in memory on consumer-grade hardware. It's not exactly fair to compare a framework running on a single node to one where you've artificially introduced multiple nodes and network overhead for a dataset that would easily fit on one. You'll also notice a pretty stark difference between the level of detail provided for the BlazingSQL test set up and the Spark one, which (unless I'm missing something) is lacking any code or configuration details. I've dipped my toes in the big data space long enough and seen enough "${FANCY NEW FRAMEWORK} beats ${INDUSTRY-STANDARD FRAMEWORK} by 123x!!" posts to recognize this as a gigantic red flag. How you manage partition sizes, order and choice of operations, and tuning parameters can make orders-of-magnitude level differences to your performance.

Maybe the future of frameworks like this will be on the GPU. I'm just not seeing any evidence of it yet. Right now, Spark fills the space where you can throw globs of memory at TB- to PB-scale problems. I could very well be wrong, but I don't see how this is going to be cost-effective on GPUs given the current cost of memory there.

1. You don't have to fit your whole workload on the GPU you can process it in batches like you would for a workload that doesn't fit into memory on a non gpu solution. You don't need PB of GPU memory to run PB workloads.

2. The dataset is trivially small because this is a new engine built for the rapids eco system and it is limited for the time being to a single node. We are releasing our distributed version for GTC (mid March) and will be able to give you more reasonable comparisons. This is a similar path of development to our pre Rapids engine which went from single node to distributed in about a month because we have built this engine to be distributed. Right now we are finishing up UCX integration which is the layer we will be using to communicate between all the nodes.

3. You can always try it out. Its own dockerhub (see links in this post) and if you want to run distributed workloads right now you can manage that process using dask by handling the splitting up of the job yourself. In a few weeks you will be able to have the job split up for you automatically without need for the user to be aware of the size of the cluster or how to distribute data across it.

We're pretty excited near-term for getting to sub-second / sub-100ms interactive time on real GB workloads. That's pretty normal in GPU land. More so, where this is pretty clearly going, is using multiGPU boxes like DGX2s that already have 2 TB/s memory bandwidth. Unlike multinode cpu systems, I'd expect better scaling b/c no need to leave the node.

With GPUs, the software progression is single gpu -> multi-gpu -> multinode multigpu. By far, the hardest step is single gpu. They're showing that.

1. If you process it in batches then you have to count the time it takes to send the data of each batch to and from the GPU. 2. It's fair to start out with small data sets, but then you don't compare against distributed frameworks like Spark, but rather against single-node solutions.

Also - Spark is very slow compared to analytic distributed DBMSes.

I used to think that experience meant the difference between believing benchmarks and being skeptical of them. Now I know it's the difference between being skeptical and ignoring them.

So it’s not a revolutionary game changer? Aw shucks, back to work I guess.

Better headline, "Apache Spark Is Very Slow". Only 1 megabyte per second per node? (3.8GB/(8 nodes*500 seconds)) Even with single core nodes, that's utterly feeble (pro tip, please state the kind of node you're using)

Seems fishy that the benchmark is small enough to fit into GPU RAM as well

I’d like a Postgres bake off. Playing against Spark on a 24Gb dataset seems silly, too.

1. It really isn't the kind of problem we are trying to solve. 2. Last time we did this we were over 100x faster and so we said. Ok its slow as &#@%.

We are more interested in trying to compare to tools that can provide workflows which are:

A. I have a ton of data in S3, HDFS, cloud storage solutions, centralized NFS B. I want to use the GPU to perform learning, graph analytics, classification, etc. I want to get from A to B as fast as possible utilizing the same resources for A that I will need to complete steb B.

As for size we will be releasing much larger workloads for GTC and are working hard on finishing up our work to make the workloads public and reproducible as well as releasing a new version of our engine where you can run much much larger workloads through distribution.

I've done some fairly intense Spark turning.

We were able to get a query that took 16 hours to 4 seconds. This took a long time, and included changing how the data was stored etc.

Anyway, fast Spark performance is possible but it is very sensitive to tuning (this was in ~2016, so maybe things have improved).

Interesting. This is worth writing a blog post on.

I did a talk on it at Hadoop World or some HortonWorks conference one year. Don't know where the slides when though.

Do you remember the title? I can try finding it on the Internet if you happen to be able to find the title (maybe from your submission email?). Thanks for doing the talk.

This was the conference: https://dataworkssummit.com/sydney-2017/, but they don't seem to have uploaded any of the sessions from Sydney to the SlideShare.

The one I did the previous year is nicely linked from http://hadoopsummit.org/melbourne/agenda/

Pro tip: justify why dividing data set size by time to solution and saying number is small is a good analysis.

Let's try imagenet training. Intel's best time is 3h25m on 128 nodes. Imagenet is ~150 gb.

(150 GB)/(3 hr * 3600 sec/hr * 128 nodes) = less than one megabyte per second per node! Caffe is slow! CPUs are slow!

Or bs metrics are bad?


This is ETL processing, not training a neural network. If their (undisclosed) actual task is not parallelizable in Spark, the comparison is even sillier so let's be generous and assume it is.

The GPU workload is used in feeding XGBOOST to perform classification. We are omitting this timing on purpose because the training time on purpose becuase SPARK took 1000's of times longer for this part of the workflow.

You can see the workflow for yourself and see how the task is quite parallelizable.

So just the ETL portion is that slow? That's really odd. I think I see faster performance than that ETLing on Hadoop. Unless there's something complex going on here.

Did you see the steps being performed in the workflow? How would you perform these steps on hadoop and what kind of times would you expect for the different steps in the workflow?

I went back to see and only just realized you had a “TLDR here’s the full details” bit. Missed that and saw only the 5x faster ETL which I figured was what you were crowing about. I’ll have to look at the rest later. I still don’t get why the ETL phase has to be so slow but maybe it’ll be obvious when I look, like you’re doing extensive transformations or something. But no one would just call that “just ETL”. Anyway, I’ll see for sure later.

Cacaw! I am glad the crowing reached you :). You probably did not get far if you are still on the old numbers that was in the first paragraph.

ETL can be very extensive. For example, we first built this when we had to take data from 15 different database systems that represented individuals and their pension contributions and join across these systems. It was a largish join about ~4 tables from each system so around a 60 table join.

That was "JUST ETL". The job was preparing the data for training. ETL is often times a large part of people's workflow. Looking for needle situations in a haystack. That can be JUST ETL. If you believe there is a more apt word for extracting data from a system, performing unspeakable transformations on it, then making that information available to another process then please tell me. Being of Peruvian stock myself I take great license with my language and grammar.

Intel's run wasn't 3:25 per epoch, which is the difference. A simple ETL should be one linear scan over the data, with maybe some merges, nothing I saw here seemed more complex than that?

You will notice that the second size does NOT fit into GPU ram on this benchmark since we ran it on a GPU that has less memory available than the DATA it is running on. Spark is feeble, perhaps but it is one of the tools being used to feed GPU workloads. Here we have replaced a workload where spark was previously used and now it is faster.

> Apache Spark Is Very Slow

I’m betting that the bottleneck is running ML training. Otherwise I can’t see why an ETL job on that amount of data would take several minutes.

An etl job could take an infinite amount of time. I am not trying to be cute by saying this but the point is that some ETL jobs might be as simple as loading data from a file others could involve millions of transformations. There is no limit to what you can do in ETL. This is not just loading the data from a file and providing it to xgboost.

And as we state in the post. This does not include ML times since we wanted to show how much faster we were making ETL. The ML timing differences are even starker than ETL.

The workload size is goofy to say the least.

Distributed workloads are not yet supported, so they showcase what they can fit on a single node right now.

There is also GPU support for PostgreSQL using an extension called pg-strom


Not sure how it compares to this tool, but it uses a GPU to do both high density analytics, by loading data straight from disk to the gpu with dma avoiding the cpu entirely to do the work, or by using the GPU as a prefilter to remove lots of candidate rows and then letting the CPU do the core query on much fewer rows.

PG-Strom is a great way to make postgres faster. Nice. But if the purpose is here to build very fast and responsive system which is interacting with other gpu tools well then snap I don't see how it can compete.

The old school way of, run this query, get my results, read those results using something like jdbc, is too slow. You can literally take out the "run this query" portion and it will still be slower than what we believe needs to exist to power GPU accelerated workloads.

We are trying to keep up with the pace of hardware improvements that NVIDIA has been churning out. Moving a result set from Postgres ==> "wherever I need to be" by performing copies and making allocations for that data is a cost we want to save people from.

Now an experiment with 4 gpu nodes.

Horizontal scaling matters as well, which is not shown here. Furthermore, using less than 16gb of data.. why would you use these frameworks in the first place?

It is common knowledge that a grep and sort on 100gb is faster than Hadoop, however 100tb is a different story

Uncoordinated distributed is coming soon. We have been building this tool with distribution in mind and right now are finishing up our first distributed version of BlazingSQL which we will be releasing just before GTC ~March 15th. We have already distributed using dask which you can do RIGHT now with blazingsql but we felt it would not be fair to make a comparison between distributed blazingsql which is user managed and something like spark which happens under the hood. We will be releasing distributed numbers early March where we can show differences between BlazingSQL and larger spark clusters where the distribution is handled by the engine instead of the user.


Apache Spark is easy target to beat. I would be much more interested looking at comparison to ClickHouse :)

Do you have an example of workloads that are easy to reproduce using clickhouse so we can see how much effort it would be to make a comparison? We are always happy to test against other tools!

I mean it has pretty full-featured sql support, so you can probably reproduce your current scenario with it?

And you think https://tech.marksblogg.com/billion-nyc-taxi-rides-clickhous... for example is something that can be considered fast? It takes the user 55 minutes just to load its data into a state so that it can be "queryable".

After importing then they spend 34 more minutes making the data into a columnar representation. Alright so 89 minutes in and we still haven't run queries.

Oh but its not distribute yet. Darn I have to run some non standard sql commands like

CREATE TABLE trips_mergetree_x3 AS trips_mergetree_third ENGINE = Distributed(perftest_3shards, default, trips_mergetree_third, rand());

Ok can I query my data yet? No you have to move it into this distributed representation and that takes 15 more minutes. Oh ok...

And now? Yes you can run your queries but they aren't really very fast.

SELECT cab_type, count(*) FROM trips_mergetree_x3 GROUP BY cab_type;

Can take 2.5 seconds on a 108 cpu core cluster for only 1.1BN rows? Thats not fast. That's particularly slow given that requires you to ingest and optimize your data.

Maybe you want to show us an example of some simple tests you have run with blazing and clickhouse. As I read it now its not worth our time to look into becuase its so very different from what we are trying to offer which is:

Connect to your files wherever you have them ETL quickly Train / Classify Move on!

The ingest time is due to updating the merge tree. You don't need a merge tree for etl... It's like the worst backing store you could possibly choose. You're also comparing an intentionally horizontally distributed query to a purely vertical one on a single node. You can see just slightly below the same query takes 0.2 seconds on a single node.

I was hoping to see some serious consideration given to these kinds of benchmarks, considering Clickhouse is one of the most cost effective tools I've used in the real world and occasionally outperforms things like mapd.

I was expecting your solution to outperform Clickhouse at least in some aspects, and a benchmark showing where it wins. Instead you reveal ignorance of Clickhouse and even the benchmarks you linked.

Your comment comes off as incredibly arrogant and at the same time incredibly misinformed. Disappointing to see this attitude from the team.

I am ignorant of clickhouse. It doesn't really compete in the workloads we are interested in. Sorry you feel this way but we are a small team and need to consider tools that integrate with Apache Arrow and CUDF natively.

If it doesn't take input from Arrow and CUDF and it doesn't produce output that is Arrow CUDF or one of the file formats we are decompressing on the GPU. Then we don't care unless one of our users asks us for this.

We are 16 people and a year ago were 5. We can't test everything out just the tools our users need to replace in their stacks. I apologize if I came off as arrogant. I have tourette's syndrome and a few other things that make it difficult for me to communicate, particularly when discussing technical matters. If I have offended you I do apologize but not a single one of our users has said to me I am using clickhouse and want to speed up my GPU workloads. Maybe its so fast they don't mind paying a serialization cost going from clickhouse to GPU workload and if so thats great for them!


I do suggest you seriously benchmark against clickhouse, because where single node performance is concerned, it is the tool to beat outside arcane proprietary stuff like kdb+ and brytlytdb. I have used single-node clickhouse and seen interactive query times where an >10 node spark cluster was recommended by supposed experts.

Clickhouse is not a mainstream tool (and I have discussed its limitations in other threads) but it is certainly rising in popularity, and in my view it comes pretty close to 1st place for general purpose perf short of Google scale datasets.

Ok. Right now we are in tunnel vision mode to get our distributed version out by GTC in mid march. We will benchmark against clickhouse sometime in March. Do you know of any benchmark tests that are a bit more involved in terms of query complexity? We are most interested in queries where you can't be clever and use things like indexing and precomputed materializations.

The more complex the query the less you can rely on being clever and the more the guts need to be performant and that is more important to us right now.

I work for Altinity, which offers commercial support for ClickHouse. We like benchmarks. :)

We use the DTC airline on time performance dataset (https://www.transtats.bts.gov/tables.asp?DB_ID=120) and Yellow Taxi trip data from NYC Open Data (https://data.cityofnewyork.us/browse?q=yellow%20taxi%20data&...) for benchmarking real-time query performance on ClickHouse. I'm working on publishing both datasets in a form that makes it easy to load them quickly. Queries are an exercise for the reader but see Mark Litwintschik's blog for good examples of queries: https://tech.marksblogg.com/billion-nyc-taxi-clickhouse.html.

We've also done head-to-head comparisons on time series using the TSBS benchmark developed by the Timescale team. See https://www.altinity.com/blog/clickhouse-timeseries-scalabil... for a description of our tests as well as a link to the TSBS Github project.

On an unrelated note: Oh, if you guys are using the OnTime data, have a look at this: https://github.com/eyalroz/usdt-ontime-tools

BTW, I think you do need to consider materialized views. ClickHouse materialized views function like projections in Vertica. They can apply different indexing and sorting to data. Unless your query patterns are very rigid it's hard to get high performance in any DBMS without some ability to implement different clustering patterns in storage.

Spark is not a very fast SQL engine. A more appropriate comparison would be something like Redshift or Snowflake. This GitHub repo [1] contains the source code to run TPC-DS on 5 data warehouses; it would be great to add BlazingDB.

[1] https://github.com/fivetran/benchmark

Uber and others seem to be having success with GPU accelerated big data workloads but it’s hard to see how this would work in a cloud environment. Nvidia price descrimination works for ML/rendering tasks that almost mandate GPUs but it would significantly reduce the value for generic workloads

What makes it hard to work in a cloud environment? The on demand nature of it? The fact that you only pay for the time you are using it? The ease of scaling or deployment?

It would be helpful if you could qualify some of these statements.

I couldn't find any license information for BlazingSQL. Where is that available?

Sorry, didn't have it available easily. You can see it here: https://www.blazingdb.com/#/eula

Our open source code (the compute kernels, GPU DataFrame, etc.) is part of RAPIDS AI. You can see source and license here: https://github.com/rapidsai/cudf

To save time from anyone else curious, BlazingSQL requires a Nvidia GPU and Linux. (https://blazingdb.com/#/documentation)

So I saw this on their website: "SQL on RAPIDS, the open-source GPU data science framework"

So is BlazingSQL open source? If it is where is the source code & what license is it?

Also where is the source code for RAPIDS? And what license is it?

I am generally not critical of people posting benchmarks. Neither am I trying to rain down on your parade but seriously you do not use Apache Spark to process 3.8G or 15.6G of data. It is too trivial.

I do not disagree with you. We are going to be publishing much larger benchmarks soon. Before we could start building BlazingSQL we had to work on CUDF in the rapids eco system. There was alot of work that went into building all of the primitives that were needed for both our sql engine and all of the other wonderful tools that are now leveraging CUDF.

We have to start somewhere and if you think we should compare against something else please tell us what you think we should compare ourselves to! We are already looking into some of the other tools that have been mentioned here in the comments.

What’s going on with that first graph? The scale changes from 1500” to 3000”, but Spark’s bar sizes stay the same.

Did Spark have a 2x slow down between the tests? It seems pretty misleading.

Further down in the article:

In our original demo, we ran a V100 GPU against 8 CPU nodes. The new T4 GPUs cut our cost in half, which meant we reduced the Apache Spark cluster to 4 CPU nodes in order to maintain price parity. Even with the reduced GPU memory, the whole workload ran significantly faster.

Actually, what you should normalize is operating costs, so probably power consumption is the better figure to normalize by.

... on the other hand, that's more difficult to measure, and TDP figures are not what really happens.

How different is BlazingSQL to Uber's AresDB[0]?


In lots of ways. AresDB has ingestion. Users take their data and say here make a table from this, persist that data the way I want you to great now I can query it and go along my merry business. I see lots of nice fancy optimizations in there. The kinds of things a small company like Blazing can only marvel at!

I also see a completely different use case. We don't want to ingest data into Blazing. That is NOT our cup of tea. We will make tools to help you persist data quickly. Working on it. But we want to read data the way YOU store it without asking you to duplicate it into our system and requiring significant effort from the user.

You have files. Great we can read them, query them, do all kinds of acrobatics on them and don't need you to store redundant copies of what you already had.

The files are stored in S3 or HDFS and other file systems. Great, we have built connections to interface with those file systems and you can query directly from them.

A user should be able to get blazing's docker container. Register a filesystem with the engine, and start querying their files within minutes of having launched the container. When we don't do this we fail. It does not seem to me that AresDB is trying to do this but I could be completely wrong and if you think so please tell me why!

BlazingDB is meant to be connected to other tools in the rapids eco system. You interact with it through python. You can 0 copy IPC the data to any other tool in the rapids eco system. Want to a power a dashboard, sure, but this isn't the use case that we are optimizing for. We don't have a JDBC connection. We don't connect to tableau.

Blazing Focuses on Interpreting rather than JITing. Instead of focusing on trying to make the most performant compiled code possible by JITing exactly what we want we pay a process cost that reduces I/O. We convert Logical Plans into physical execution plans that are suitable for a SIMD architecture whose optimization centers around minimizing materializations and global memory accesses. We tried JIT but couldn't find a way to compile plans in less than a few hundred nano seconds at which point the overhead of compiling superseded that of interpreting operations in a kernel.

Don't SQL engines typically run on servers which lack GPU's? That pretty much kills the entire hardware demographic of enterprise SQL users.

The purpose of blazing is to accelerate GPU workloads. SQL is just a language we provide users as an interface for performing complex transformations in the rapids eco system. rapids.ai

We aren't targeting enterprise sql users specifically. We are targeting people using GPUs that want to accelerate the process of making information available to those gpus by leveraging the GPUs for the process of parsing (CSV), decompressing (Parquet) and then transforming data to make it suitable for machine learning, classifying etc.

Molasses is fast compared to Apache Spark. Compare to something like Redshift or Google BigQuery, and on a much larger dataset.

The source being where?

Most of the execution kernels that we use are in https://github.com/rapidsai/cudf . The engine itself is not open source though it is free to use and we have a dockerhub link which is pretty convenient. https://hub.docker.com/r/blazingdb/blazingsql/

Conda works too but its still a bit painful and we are working on making that a bit more user friendly.

Is there support for loading data from Parquet files?

There is, this demo is done with CSV's b/c we can't read Apache Parquet files with strings yet (changing very very very soon). You can check out how here: https://docs.blazingdb.com/docs/database-administration#sect...

Keep up the great work. Don’t listen to others.

So how's the string support on BlazingSQL?

The reason our team (graphistry) is working w/ the BlazingDB team is q's like this. BlazingSQL subsumes RAPIDS / cudf... which now includes the GPU-optimized nvstrings lib. And instead of a hacky grad student's GPU string lib, this is Nvidia's.

We initially bet on OpenCL... but had to write almost everything ourselves. We then switched to GOAI, despite having to write much of Apache Arrow[JS], among other things. I love q's like that now that more and more pieces are falling into place with best-of-class tech. That's a big deal for our users, and helps our small punch above our weight for helping them. 2019 will be fun!

Coming in the current sprint. We have a PR open agaisnt CUDF that brings string support to many of the CUDF algorithms. We will be using this branch in the meantime in our engine while we wait for the PR to be accepted.

  SELECT colA + colB * 10, sin(colA) — cos(colD) FROM tableA
GPUs excel at parallel trigonometry computation.

Only if your data is already on the GPU. Otherwise, this little computational work is swamped by the cost of I/O back and forth to main memory.

link to the source ? i couldn't find it on the page

https://github.com/rapidsai is the source for the rapids tools. BlazingSQL itself is not open source though it is free to use.

excellent work!

Applications are open for YC Summer 2019

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact