
32GB is big?



The p2.16xlarge EC2 GPU instance has 16 GPUs with 192GB of total GPU memory (plus 732GB of host RAM).


You're assuming the data is split among the K80 GPUs (12GB per GPU), which may not be the case here. Who's doing the split? How is the data partitioned across the GPUs?


Hopefully, the library the user is employing! In our case there are a few different components that make up BlazingSQL. We have relational algebra nodes that are stateless and do nothing but receive inputs and interpret relational algebra. They are coordinated by an orchestration node whose purpose is to divide up queries.
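Roughly, that architecture might look like the sketch below (hypothetical names and API, not BlazingSQL's actual code): an orchestrator splits a query plan into per-worker fragments, and the workers only execute whatever fragment they receive.

  from dataclasses import dataclass
  from typing import List

  @dataclass
  class PlanFragment:
      relational_algebra: str   # e.g. "Scan(lineitem) -> Filter(...) -> Project(...)"
      partition_ids: List[int]  # which data partitions this worker should read

  class Orchestrator:
      def __init__(self, num_workers: int):
          self.num_workers = num_workers

      def divide_query(self, plan: str, num_partitions: int) -> List[PlanFragment]:
          # Every worker gets the same relational algebra, but a disjoint
          # round-robin slice of the data partitions.
          fragments = []
          for w in range(self.num_workers):
              parts = list(range(w, num_partitions, self.num_workers))
              fragments.append(PlanFragment(plan, parts))
          return fragments

  class Worker:
      """Stateless: receives a fragment, executes it, returns a result."""
      def execute(self, fragment: PlanFragment) -> str:
          # In the real system this would run the relational algebra on GPU;
          # here we just report what we were asked to do.
          return f"executed '{fragment.relational_algebra}' on partitions {fragment.partition_ids}"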

There are three cases to consider here for dividing up the data.

1. Data coming from the user in Python. This can be large or small: if it is large you can partition it amongst the nodes, and if it is small you can just let every node have a copy. What counts as large or small depends on the size of your nodes, the interconnect, etc.

2. Data that resides in the data lake. You can partition the dataset by dividing up the files and having each node perform the I/O necessary to retrieve its share of the data and start processing it.

3. Data that resides in previous distributed result sets. This is great because, well, it's already partitioned for you. If some nodes hold a large percentage of the result set, you might rebalance those partitions.
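The three cases above might look something like this (an illustrative sketch with hypothetical helpers, not BlazingSQL's API):

  def plan_input(source, num_nodes, broadcast_threshold_bytes=256 * 1024 * 1024):
      """Return a list of per-node work descriptions for one query input."""
      if source["kind"] == "python_data":
          # Case 1: data handed to us from Python. Broadcast if small,
          # otherwise split it into num_nodes chunks.
          if source["size_bytes"] <= broadcast_threshold_bytes:
              return [{"node": n, "data": "full copy"} for n in range(num_nodes)]
          return [{"node": n, "data": f"chunk {n} of {num_nodes}"} for n in range(num_nodes)]

      if source["kind"] == "data_lake":
          # Case 2: files in the data lake. Divide the file list so each node
          # does its own I/O for its share of the files.
          files = source["files"]
          return [{"node": n, "files": files[n::num_nodes]} for n in range(num_nodes)]

      if source["kind"] == "previous_result":
          # Case 3: an earlier distributed result set is already partitioned;
          # keep each partition on the node where it already lives.
          return [{"node": n, "data": "existing partition"} for n in range(num_nodes)]

      raise ValueError(f"unknown source kind: {source['kind']}")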

So that's just getting the query started. After that there are loads of operations that are not trivial to distribute (distributing a + b is a heck of a lot easier than doing a distributed join). To reduce the amount of coordination we need between nodes, one thing we do is sample before execution and generate partitioning strategies that allow each node to PUSH its information to another node whenever this is required. This is much simpler than trying to coordinate distribution during the execution phase and allows every node to keep moving its process forward.
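A simplified sketch of that idea (hypothetical code, not BlazingSQL internals): sample the join keys up front to agree on a range-partitioning strategy, then every node independently pushes each of its rows to the node that owns that key range, with no per-row coordination during execution.

  import random

  def build_partition_bounds(sample_keys, num_nodes):
      """Pick num_nodes - 1 split points from a sample of the join keys."""
      ordered = sorted(sample_keys)
      step = max(1, len(ordered) // num_nodes)
      return [ordered[i * step] for i in range(1, num_nodes)]

  def owner_of(key, bounds):
      """Which node owns this key under the agreed partitioning strategy."""
      for node, bound in enumerate(bounds):
          if key < bound:
              return node
      return len(bounds)

  def push_plan(local_rows, bounds):
      """Group local rows by destination node so they can be pushed in bulk."""
      outgoing = {}
      for row in local_rows:
          dest = owner_of(row["join_key"], bounds)
          outgoing.setdefault(dest, []).append(row)
      return outgoing

  # Example: sample keys once before execution, agree on bounds, and each
  # node then pushes rows to their owners without further coordination.
  sample = [random.randint(0, 1_000_000) for _ in range(1000)]
  bounds = build_partition_bounds(sample, num_nodes=4)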


Can you point to a commercial GPU database offering that doesn't distribute calculations over multiple GPUs out-of-the-box? That's table stakes, not a point of differentiation.


Yes, in a database context. It's bigger than most databases. You can fit billions of items there.


I've had larger Excel sheets :P




