
What workloads is this good for?

I don't do GPGPU stuff but my layman's assumption was that shipping data to the GPU was an expensive operation. Sending a ton of IO from "data lakes"[+] to the GPU to do trivial comparisons seems like the worst case scenario?

+: Does that term annoy anyone else? It's just a hoarder's impression of a database...




The data is in the data lake. You need to get it, either onto the CPU or the GPU, and guess what: the bottleneck there will rarely be the interconnect between the CPU and the GPU, but rather your interconnect with the data lake.

You say trivial comparisons, but that is a pretty reductionist view of what a database can do. I assure you there is more than trivial comparison going on. Distributed joins, aggregations, and even sorting are not trivial comparisons.

So how can the GPU help?

1. Data in data lakes is often compressed, and these compression schemes are often amenable to being decompressed directly on the GPU. Most of the columnar compression schemes we see in formats like Parquet and ORC are faster to decompress on the GPU than on the CPU (see the sketch after this list).

2. There are many distributed caching strategies which let you make fewer and fewer requests directly to your data lake. If you are really clever, you might even decide to store a more compressed representation in your cache than the actual files you are reading from. This is not so difficult to do for data formats that already come in chunks, like Parquet and ORC.
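
For point 1, here is a minimal sketch using cuDF, the RAPIDS GPU DataFrame library (the file name and column names are made up): the Parquet reader decodes the compressed columnar chunks on the GPU, so the decompressed columns land directly in device memory.

    import cudf  # RAPIDS GPU DataFrame library

    # Hypothetical file and columns. cudf.read_parquet decompresses and
    # decodes the columnar chunks on the GPU, so the result is already a
    # GPU DataFrame with no pandas/CPU detour.
    gdf = cudf.read_parquet("events.parquet", columns=["user_id", "amount"])

    # Follow-on work stays on the GPU as well.
    print(gdf.groupby("user_id")["amount"].sum().head())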

What workloads is it good for?

Ones where the consumer will be a distributed solution that runs on GPUs, a non-distributed GPU solution coordinated by a tool like Dask, or even a single-node solution where the user is going to be using other tools from the RAPIDS AI ecosystem. You use this if you are already leveraging the GPU for your workloads and want to reduce the time it takes to get data from where it lies to the GPU.
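
To make that concrete, a minimal sketch of what it looks like from Python, assuming the BlazingContext API; the table name and path are made up, and in a distributed setup you would point the context at a Dask-CUDA cluster instead of running it single-node:

    from blazingsql import BlazingContext

    bc = BlazingContext()  # single node here; can also be backed by a Dask-CUDA cluster

    # Hypothetical table name and file path
    bc.create_table("taxi", "/data/taxi/*.parquet")

    # The query runs on the GPU and the result stays in GPU memory as a
    # cuDF DataFrame, ready for other RAPIDS tools.
    gdf = bc.sql("SELECT passenger_count, AVG(fare_amount) AS avg_fare "
                 "FROM taxi GROUP BY passenger_count")
    print(gdf)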


It is true though. A data lake is what companies end up with by dumping everything into HDFS, regardless of its importance and relevance. So it is just a hoarding database.

I'll let the BlazingSQL guys answer the workloads question, but yeah, copying data to the GPU (host -> device) is an expensive operation, and it has to be justified.

With SQream DB (disclaimer: I am a developer on SQream DB), we make decisions in the query compiler and optimizer about whether to copy data over to the GPU or keep it on the CPU for processing. The optimizer will know if it makes sense.

For example, when you're doing heavy transforms (e.g. `SELECT (SQRT(POWER(x,2)+44)/w+RADIANS(COS(y)*z)) FROM t`), it may make sense to have it on the GPU anyway.
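
Purely as an illustration of that trade-off (this is not SQream's optimizer logic, just a back-of-the-envelope rule): offloading pays when the compute time saved on the GPU exceeds the host-to-device transfer cost.

    # Illustrative only -- not SQream DB's actual cost model.
    PCIE_GBPS = 12  # rough effective host->device bandwidth, assumption

    def worth_offloading(bytes_to_copy: int, est_time_saved_on_gpu_sec: float) -> bool:
        """Offload only if the compute time saved exceeds the copy cost."""
        transfer_sec = bytes_to_copy / (PCIE_GBPS * 1e9)
        return est_time_saved_on_gpu_sec > transfer_sec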


> copying data to the GPU (host -> device)

Copying data to the CPU (peripheral -> host) is expensive too, and it's perfectly possible to just replace peripheral -> host with peripheral -> device. I don't know if that is what BlazingSQL does, but it shouldn't be discounted.


If you mean something like network card ==> device via RDMA, we are not doing that yet but are currently working on this implementation. These past few months we have focused more on adding functionality to libgdf. We are now optimizing some of the I/O paths like the one you mention here. As of now, if it goes over the wire it is copied into pre-allocated pinned memory buffers and then sent to the GPU. This will go away in about a month or so.
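
For readers unfamiliar with that staging pattern, a minimal illustration with Numba (buffer size and function are made up, not BlazingSQL's actual code): bytes arriving over the wire land in a pre-allocated page-locked host buffer, and the host-to-device copy is done from that pinned memory.

    import numpy as np
    from numba import cuda

    # Hypothetical staging buffer. Pinned (page-locked) host memory lets the
    # host->device copy run at full PCIe bandwidth.
    BUF_SIZE = 1 << 20
    pinned_buf = cuda.pinned_array(BUF_SIZE, dtype=np.uint8)

    def stage_to_gpu(payload: bytes):
        """Copy incoming bytes into the pinned buffer, then to the device."""
        n = len(payload)
        pinned_buf[:n] = np.frombuffer(payload, dtype=np.uint8)
        return cuda.to_device(pinned_buf[:n])  # H2D copy from pinned memory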


Will you consider offering community editions for SQream in the future? It's a considerable differentiator for BlazingDB and PG Strom I guess


There is no such plan that I'm aware of


High-end GPU memories are quite big (multi-GB), so for most databases the whole database fits in GPU memory. And this seems to support distributing the database over many servers and GPUs, so you can get a couple of orders of magnitude bigger via that route.


32GB is big?


The p2.16xlarge EC2 GPU instance has 16 GPUs with 192GB of GPU RAM in total, plus 732GB of host RAM.


You're assuming the data is split among the K80 GPUs (12GB per GPU), which may not be the case here. Who's doing the split? How is the data partitioned across the GPUs?


Hopefully the library that the user is employing! In our case we have a few different components that actually make up BlazingSQL. We have Relational Algebra nodes that are stateless and can do nothing but receive inputs and interpret relational algebra. They are coordinated by an orchestration node whose purpose is to divide up queries.

There are three cases to consider here for dividing up the data.

1. Data coming from the user in Python. This can be large or small; if it is large you can partition it amongst the nodes, and if it is small you can just let every node have a copy. What counts as large or small depends on the size of your nodes, the interconnect, etc.

2. Data that resides in the data lake. You can partition the dataset by dividing up the files and having each node perform the I/O necessary to retrieve that data and start processing it (see the sketch after this list).

3. Data that resides in previous distributed result sets. This is great because, well, it's already partitioned for you. If you have some nodes with large percentages of the result set you might make those the partitions.
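
For case 2, the simplest version of dividing up the files is just handing each node its own slice of the file list (node count and file names here are hypothetical):

    # Hypothetical file listing from the data lake; each node reads only its share.
    files = [f"s3://lake/sales/part-{i:05d}.parquet" for i in range(128)]

    def assign_files(files, n_nodes):
        """Round-robin the files across nodes; node i reads files[i::n_nodes]."""
        return {node: files[node::n_nodes] for node in range(n_nodes)}

    partitions = assign_files(files, n_nodes=4)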

So that's just for getting the query started. After that there are loads of operations that are not trivial to distribute (distributing a + b is a heck of a lot easier than doing a distributed join). To reduce the amount of coordination we need between nodes, something we do is sample before execution and generate partitioning strategies that allow each node to PUSH its information to another node whenever this is required. This is much simpler than trying to coordinate distribution during the execution phase and allows every node to keep moving its process forward.
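
A minimal sketch of that sample-then-push idea (illustrative only, not BlazingSQL's actual code): sample the join keys up front, derive splitter values, and then every node can compute which peer to push each row to without any further coordination.

    import numpy as np

    def build_splitters(sampled_keys, n_nodes):
        """Pick (n_nodes - 1) splitter values from a sample of the join keys."""
        qs = np.linspace(0, 100, n_nodes + 1)[1:-1]
        return np.percentile(sampled_keys, qs)

    def destination_node(key, splitters):
        """Node index this row should be pushed to during the distributed join."""
        return int(np.searchsorted(splitters, key))

    # Hypothetical usage: sample 1,000 keys, partition across 4 nodes.
    splitters = build_splitters(np.random.randint(0, 10_000, size=1_000), n_nodes=4)
    assert 0 <= destination_node(4_321, splitters) < 4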


Can you point to a commercial GPU database offering that doesn't distribute calculations over multiple GPUs out-of-the-box? That's table stakes, not a point of differentiation.


Yes, in a database context. It's bigger than most databases. You can fit billions of items there.


I've had larger Excel sheets :P



