BlazingSQL – A GPU SQL Engine for RAPIDS Open-Source Software from Nvidia (blazingdb.com)
125 points by felipe_aramburu 7 days ago | 34 comments

Does anyone have a clue how this could glue together with YARN[1] and/or TonY [2] from LinkedIn?

[1] https://hortonworks.com/blog/gpus-support-in-apache-hadoop-3...

[2] https://engineering.linkedin.com/blog/2018/09/open-sourcing-...

I am going to look into this. It's a fantastic question, but we are too ignorant on the matter to be able to comment properly at this time. I will respond here, eventually :).

What workloads is this good for?

I don't do GPGPU stuff but my layman's assumption was that shipping data to the GPU was an expensive operation. Sending a ton of IO from "data lakes"[+] to the GPU to do trivial comparisons seems like the worst case scenario?

+: Does that term annoy anyone else? It's just a hoarder's impression of a database...

The data is in the data lake. You need to get it, either onto the CPU or the GPU. And guess what: the bottleneck there will rarely be your interconnect between the CPU and the GPU, but rather your interconnect with the data lake.

You say trivial comparisons, but that is a pretty reductionist view of what a database can do. I assure you there are more than trivial comparisons going on. Distributed joins, aggregations, and even sorting are not trivial comparisons.

So how can the GPU help?

1. Data in data lakes is often compressed. These compression schemes are often amenable to being decompressed directly on the GPU. For most columnar compression schemes we see in files like Parquet and ORC, it is faster to decompress on the GPU than on the CPU.

2. There are many distributed caching strategies which would let you make fewer and fewer requests directly to your data lake. If you are really clever, you might even decide to store a more compressed representation in your cache than the actual files you are reading from. This was not so difficult to do for data formats that already come in chunks like Parquet and ORC.

What workloads is it good for?

Ones where the consumer will be a distributed solution that runs on GPUs, a non-distributed GPU solution coordinated by a tool like DASK, or even a single-node solution where the user is going to be using other tools from the rapidsAI ecosystem. You use this if you are already leveraging the GPU for your workloads and want to reduce the time it takes to get data from where it lies to the GPU.

It is true though. A data lake is what companies end up with by dumping everything into HDFS, regardless of its importance and relevance. So, it is just a hoarding database.

I'll let the BlazingSQL guys answer the workloads question, but yeah, copying data to the GPU (host -> device) is an expensive operation, and it has to be justified.

With SQream DB (disclaimer: am a developer on SQream DB), we make decisions in the query compiler and optimizer if we want to copy data over to the GPU, or keep it in CPU for processing. The optimizer will know if it makes sense.

For example, when you're doing heavy transforms (e.g. `SELECT (SQRT(POWER(x,2)+44)/w+RADIANS(COS(y)*z)) FROM t`), it may make sense to have the data on the GPU anyway.
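
To make that trade-off concrete, here is a minimal sketch of an offload decision, assuming a very simple cost model: the GPU only wins when the compute time it saves exceeds the host -> device copy cost. The function name and every throughput figure are hypothetical placeholders, not values or logic taken from SQream DB.

```python
def should_offload_to_gpu(n_bytes, ops_per_byte,
                          pcie_bytes_per_s=12e9,   # assumed host -> device bandwidth
                          cpu_ops_per_s=5e10,      # assumed CPU throughput
                          gpu_ops_per_s=2e12):     # assumed GPU throughput
    """Return True when estimated GPU time (copy + compute) beats CPU time."""
    work = n_bytes * ops_per_byte
    cpu_time = work / cpu_ops_per_s
    gpu_time = n_bytes / pcie_bytes_per_s + work / gpu_ops_per_s
    return gpu_time < cpu_time


# A cheap filter over 1 GB does too little work per byte to pay for the copy.
print(should_offload_to_gpu(1e9, ops_per_byte=0.5))
# A heavy per-row transform over the same 1 GB amortizes the copy.
print(should_offload_to_gpu(1e9, ops_per_byte=50))
```

Under these assumed numbers the first call favors the CPU and the second favors the GPU; a real optimizer would also account for data already resident on the device.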

> copying data to the GPU (host -> device)

Copying data to the CPU (peripheral -> host) is expensive too, and it's perfectly possible to just replace peripheral -> host with peripheral -> device. I don't know if that is what blazingsql does, but it shouldn't be discounted.

If you mean something like network card ==> device via RDMA, we are not doing that yet, but we are currently working on an implementation. These past few months we have focused more on adding functionality to libgdf. We are now optimizing some of the I/O paths like the one you mention here. As of now, if data goes over the wire it is copied into preallocated pinned memory buffers and then sent to the GPU. This will go away in about a month or so.

Will you consider offering community editions for SQream in the future? It's a considerable differentiator for BlazingDB and PG Strom, I guess.

There is no such plan that I'm aware of

High-end GPU memories are quite big (multi-GB), so the whole database fits in GPU memory for most databases. And this seems to support distributing the database over many servers and GPUs, so you can get a couple of orders of magnitude bigger via that route.

32GB is big?

The p2.16xlarge EC2 GPU instance has 16 GPUs, 732GB of GPU RAM.

You're assuming the data is split among the K80 GPUs (12GB per GPU instance), which may not be the case here. Who's doing the split? How is the data partitioned along the GPUs?

Hopefully the library that the user is employing! In our case we have a few different components that actually make up BlazingSQL. We have Relational Algebra nodes that are stateless and can do nothing but receive inputs and interpret relational algebra. They are coordinated by an orchestration node whose purpose is to divide up queries.

There are three cases to consider here for dividing up the data.

1. Data coming from the user in Python. This can be large or small. If it is large, you can partition it amongst the nodes; if small, you can just let every node have a copy. What counts as large or small depends on the size of your nodes, the interconnect, etc.

2. Data that resides in the data lake. You can partition the dataset by dividing up the files and having each node perform the I/O necessary to retrieve that data and start processing it.

3. Data that resides in previous distributed result sets. This is great because, well, it's already partitioned for you. If you have some nodes with large percentages of the result set you might make those partitions
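
The first two cases can be sketched roughly as follows. This is only an illustration under stated assumptions, not BlazingSQL's code: `assign_files`, `place_table`, and the `broadcast_threshold` default are hypothetical, and round-robin is just one simple way to divide up files.

```python
def assign_files(files, n_nodes):
    """Data lake case: node i reads files i, i + n_nodes, i + 2 * n_nodes, ..."""
    return [files[i::n_nodes] for i in range(n_nodes)]


def place_table(n_rows, n_nodes, broadcast_threshold=100_000):
    """User data case: return how many rows each node ends up holding.

    Small tables are copied to every node; large ones are split evenly.
    The threshold is an assumed tuning knob that would depend on node
    size and interconnect.
    """
    if n_rows <= broadcast_threshold:
        return [n_rows] * n_nodes  # every node holds a full copy
    base, extra = divmod(n_rows, n_nodes)
    return [base + (1 if i < extra else 0) for i in range(n_nodes)]


print(assign_files(["a.parquet", "b.parquet", "c.parquet"], 2))
print(place_table(10, 3))
print(place_table(1_000_001, 3))
```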

So that's just for getting the query started. After that there are loads of operations that are not trivial to distribute (distributing a + b is a heck of a lot easier than doing a distributed join). To reduce the amount of coordination we need between nodes, we sample before execution and generate partitioning strategies that allow each node to PUSH its information to another node whenever this is required. This is much simpler than trying to coordinate distribution during the execution phase, and it allows every node to keep moving its process forward.
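
To make the sample-then-push idea concrete, here is a minimal Python sketch assuming sample-based range partitioning on a single sortable key. It is an illustration of the general technique, not BlazingSQL's actual implementation; `choose_splitters`, `destination`, and `plan_pushes` are hypothetical names.

```python
import bisect


def choose_splitters(sample_keys, n_nodes):
    """Pick n_nodes - 1 cut points from a non-empty sample of the keys."""
    s = sorted(sample_keys)
    return [s[(i * len(s)) // n_nodes] for i in range(1, n_nodes)]


def destination(key, splitters):
    """Index of the node that owns `key`; every node can compute this locally."""
    return bisect.bisect_right(splitters, key)


def plan_pushes(local_keys, splitters):
    """Group one node's keys by the node they must be pushed to."""
    pushes = {}
    for key in local_keys:
        pushes.setdefault(destination(key, splitters), []).append(key)
    return pushes


# Splitters are agreed on once, before execution, from a sample.
splitters = choose_splitters(range(100), n_nodes=4)
# Afterwards each node routes its own rows without further coordination.
print(plan_pushes([3, 30, 80, 26], splitters))
```

Because the splitters are fixed up front, matching join keys from both tables land on the same node, and no node has to wait on a coordinator while it is executing.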

Can you point to a commercial GPU database offering that doesn't distribute calculations over multiple GPUs out-of-the-box? That's table stakes, not a point of differentiation.

Yes, in a database context. It's bigger than most databases. You can fit billions of items there.
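
A quick back-of-the-envelope check of the "billions of items" figure, assuming fixed-width values and ignoring the working memory a query needs for intermediates. The helper name is made up for illustration.

```python
def max_values(memory_gb, bytes_per_value=8):
    """How many fixed-width values fit in `memory_gb` (decimal GB)."""
    return int(memory_gb * 1e9) // bytes_per_value


print(max_values(32))      # 8-byte values (e.g. int64) in 32 GB
print(max_values(32, 4))   # 4-byte values (e.g. int32) in 32 GB
```

So a 32 GB card holds on the order of 4 billion 64-bit values before compression, which is a single wide column or a modest number of narrower ones.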

I've had larger excel sheets :P

I wonder how this will impact MapD's and Dremio's path on the market. It seems to me like RAPIDS is going to include visualisation for free, so that would undermine MapD's commercial offering.

(OmniSci/MapD employee here)

In my opinion, I don't see this as undermining the OmniSci commercial offering, but rather opening up GPU accelerated analytics and visualization to a wider audience. A smaller piece of a much larger pie benefits all of the contributors to the RAPIDS project (of which OmniSci is a contributor as well)

Yah, we are all solving different parts of this ENORMOUS puzzle. Our focus has always been on reducing the pains of ETLing data into the GPUs from all of the crazy sources it can come from.

NVIDIA invests in both OmniSci and BlazingDB so don't worry about it. They have it covered.

Anybody tried PG-Strom in production?

Personally I'm most excited about PGStrom, because it means faster Postgres all the time, not just when it fits into memory, and not just for SQL.

Also, to be clear, as of right now the only things that have to fit in memory are the inputs and outputs of this system, since these are being passed via CUDA IPC.

It appears to be, more than anything else, some CUDA code wrapped in UDFs. Not sure I'd want that anywhere near my production stuff in that format.

This is false. PG has a well defined and modular query engine. pgstrom works by replacing and adding modules which allow for GPU support.

You're right, I got confused with another Postgres-GPU plugin type product.

However, Postgres isn't columnar. How does PG-Strom (now HeteroDB) arrange data for the GPU properly?

I'd be quite interested to benchmark this in production, possibly as a master replica to see how it compares in real-time.

What is the best way to pull real-time metrics from it into some sort of dashboard?

You can reach out to me (my first name at blazingdb) if you want to talk about this.

In a nutshell users interact with blazing through the python API for the most part. If you have small result sets like those that normally go into a dashboard (very large data sets, small result sets normally), then you could write queries in python that get distributed to a cluster. The result sets are then available to be retrieved either via CUDA IPC locally or via TCP if you want to pull the result sets back to the user in Python. We will be incorporating faster interconnects using UCX in the following months for multi node clusters.

Is this in any way related to Blazegraph, the GPU-accelerated database?

This is from blazingdb.com, so sql on gpu (blazegraph is different).

Blazegraph is now an Amazon owned trademark, good luck defending your name!

Excellent work!

Interesting stuff. I thought one of the most interesting papers at NIPS this year was using GPU or TPU for intelligent indexes.


Paper is from Jeff Dean and a worthwhile read, IMO.
