Hacker News new | past | comments | ask | show | jobs | submit login

Hopefull the library that the user is employing! In our case we have a few different components that actually make up BlazingSQL. We have Relational Algebra nodes that are stateless and can do nothing but receive inputs and interpret relational algebra. They are coordinated by an orchestration node whose purpose it is to divide up queries.

There are three cases to consider here for dividing up the data.

1. Data coming from the user in Python this can be large or small, if it is large you can partition it amongst the nodes, if small you can just let every node have a copy, what is large or small depends on the size of your nodes, the interconnect, etc. 2. Data that resides in the datalake You can partition the dataset by dividing up the files and having each node perform the i/o necessary to retrieve that data and start processing it 3. Data that resides in previous distributed result sets this is great because well its already partitioned for you. If you have some nodes with large percentages of the result set you might make those partitions

So thats just for getting the query started. After that there are loads of operations that are not trivial to distribute. ( distributiong a + b is a heck of a lot easier than doing a distributed join). To reduce the amount of coordination we need between nodes something we do is sample before execution and generate partitioning strategies that will allow each node to PUSH its information to another node whenever this is required. This is much simpler than trying to coordinate distribution during the execution phase and allows every node to keep moving its process forward.




Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact

Search: