I don't do GPGPU stuff but my layman's assumption was that shipping data to the GPU was an expensive operation. Sending a ton of IO from "data lakes"[+] to the GPU to do trivial comparisons seems like the worst case scenario?
+: Ddoes that term annoy anyone else? It's just a hoarder's impression of a database...
You say trivial comparisons but that is a pretty reductionist view of what a database can do. I assure you there are more than trivial comparisons going on. Distributed joins, aggregations, and even sorting is not a trivial comparison.
So how can the GPU help?
1. Data in datalakes is often compressed. These compression schemes are often amenable to being decompressed directly on the GPU. It is faster for most columnar compression schemes we see in files like Parquet and ORC to be decompressed on the GPU rather than on the CPU.
2. There are many distributed caching strategies which would let you make fewer and fewer requests directly to your data lake. If you are really clever, you might even decide to store a more compressed representation in your cache than the actual files you are reading from. This was not so difficult to do for data formats that already come in chunks like Parquet and ORC.
What workloads is it good for?
Ones where the consumer will be a distributed solution that runs on GPUs, a non distributed GPU solution coordinated by a tool like DASK, or even a single node solution where the user is going to be using other tools from the rapidsAI ecosystem. You use this if you are already leveraging the GPU for your workloads and want to reduce the timeline of getting data from where it lies, to the GPU.
I'll let the BlazingSQL guys answer the workloads question, but yeah, copying data to the GPU (host -> device) is an expensive operation, and it has to be justified.
With SQream DB (disclaimer: am a developer on SQream DB), we make decisions in the query compiler and optimizer if we want to copy data over to the GPU, or keep it in CPU for processing. The optimizer will know if it makes sense.
For example, when you're doing heavy transforms (eg. `SELECT (SQRT(POWER(x,2)+44)/w+RADIANS(COS(y)*z)) FROM t`), it may make sense to have it in the GPU anyway.
Copying data to the CPU (peripheral -> host) is expensive too, and it's perfectly possible to just replace peripheral -> host with peripheral -> device. I don't know if that is what blazingsql does, but it shouldn't be discounted.
There are three cases to consider here for dividing up the data.
1. Data coming from the user in Python
this can be large or small, if it is large you can partition it amongst the nodes, if small you can just let every node have a copy, what is large or small depends on the size of your nodes, the interconnect, etc.
2. Data that resides in the datalake
You can partition the dataset by dividing up the files and having each node perform the i/o necessary to retrieve that data and start processing it
3. Data that resides in previous distributed result sets
this is great because well its already partitioned for you. If you have some nodes with large percentages of the result set you might make those partitions
So thats just for getting the query started. After that there are loads of operations that are not trivial to distribute. ( distributiong a + b is a heck of a lot easier than doing a distributed join). To reduce the amount of coordination we need between nodes something we do is sample before execution and generate partitioning strategies that will allow each node to PUSH its information to another node whenever this is required. This is much simpler than trying to coordinate distribution during the execution phase and allows every node to keep moving its process forward.