There are three cases to consider here for dividing up the data.
1. Data coming from the user in Python
This can be large or small. If it is large you can partition it among the nodes; if it is small you can just give every node a copy. What counts as large or small depends on the size of your nodes, the interconnect, etc. (see the first sketch after this list).
2. Data that resides in the datalake
You can partition the dataset by dividing up the files and having each node perform the I/O needed to retrieve its share of the data and start processing it (see the second sketch after this list).
3. Data that resides in previous distributed result sets
This is great because, well, it is already partitioned for you. If some nodes hold a disproportionately large share of the result set, you may want to split or rebalance those partitions.
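To make case 1 concrete, here is a toy Python sketch of the broadcast-vs-partition decision. The node names, the size cutoff, and the round-robin split are all made up for illustration; as noted above, the real cutoff depends on your nodes and interconnect.

```python
import sys

# Illustrative values only: node names and the "small enough to broadcast"
# cutoff would come from your cluster config and interconnect in practice.
NODES = ["node-0", "node-1", "node-2", "node-3"]
BROADCAST_LIMIT_BYTES = 64 * 1024 * 1024

def plan_user_data(rows):
    """Replicate small user inputs to every node; split large ones up."""
    size = sys.getsizeof(rows) + sum(sys.getsizeof(r) for r in rows)
    if size <= BROADCAST_LIMIT_BYTES:
        # Small: every node gets a full copy.
        return {node: rows for node in NODES}
    # Large: node i gets rows i, i + n, i + 2n, ... (a simple round-robin split).
    return {node: rows[i::len(NODES)] for i, node in enumerate(NODES)}

plan = plan_user_data([("user_row", i) for i in range(10)])
for node, chunk in plan.items():
    print(node, len(chunk), "rows")
```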
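And for case 2, a similarly hand-wavy sketch of splitting the data lake files across nodes so each one does its own I/O. The S3 paths and node names are placeholders; in practice the file list would come from an object-store or catalog listing.

```python
from itertools import cycle

# Placeholder file listing and node names, for illustration only.
FILES = [f"s3://lake/sales/part-{i:04d}.parquet" for i in range(10)]
NODES = ["node-0", "node-1", "node-2", "node-3"]

def assign_files(files, nodes):
    """Round-robin the files so each node reads and processes its own share."""
    assignment = {node: [] for node in nodes}
    for node, path in zip(cycle(nodes), files):
        assignment[node].append(path)
    return assignment

for node, paths in assign_files(FILES, NODES).items():
    print(node, "reads", len(paths), "files")
```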
So that's just for getting the query started. After that there are loads of operations that are not trivial to distribute (distributing a + b is a heck of a lot easier than doing a distributed join). To reduce the amount of coordination needed between nodes, one thing we do is sample before execution and generate partitioning strategies that allow each node to PUSH its information to another node whenever that is required. This is much simpler than trying to coordinate distribution during the execution phase and lets every node keep moving its own work forward.
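To make that push idea concrete, here is a toy Python sketch of "sample before execution, then let each node compute destinations locally." The range-partitioning-by-quantiles scheme, the node names, and the simulated sample are all assumptions for illustration, not a description of any particular engine.

```python
import random
from collections import defaultdict

NODES = ["node-0", "node-1", "node-2", "node-3"]

def plan_partitioning(sample_keys, nodes):
    """Turn a pre-execution sample of the join key into range boundaries,
    so any node can later compute a row's destination on its own and push it."""
    s = sorted(sample_keys)
    # One cut point per node boundary, taken at evenly spaced quantiles.
    cuts = [s[(i * len(s)) // len(nodes)] for i in range(1, len(nodes))]

    def owner(key):
        for node, cut in zip(nodes, cuts):
            if key < cut:
                return node
        return nodes[-1]

    return owner

# Before execution: sample the join key (simulated here with random ints).
sample = [random.randint(0, 1_000_000) for _ in range(1_000)]
owner_of = plan_partitioning(sample, NODES)

# During execution: each node just pushes rows to their owners, e.g. for a
# distributed join on this key, with no further cross-node negotiation.
outbox = defaultdict(list)
for key in (42, 250_000, 600_000, 999_999):
    outbox[owner_of(key)].append(key)
print(dict(outbox))
```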