It's worse than that. Shuffle for Spark on Kubernetes is fundamentally broken and hasn't been fixed yet. The problem is that Docker containers cannot (for security reasons) share the same host-level disks. There is no external shuffle service, and disk caching is container-local (it doesn't use kernel-level disk I/O buffering), which kills performance. Google's proposed solution below is to use NFS to store shuffle files, which is not going to be performant. Stick with YARN for Spark and only switch once shuffle is fixed for k8s. Databricks are in no rush to get shuffle fixed for k8s.
I agree that Spark on Kubernetes will have a hard time fixing the shuffle problem. If they choose local disks for a per-node shuffle service, a performance issue arises because disk caching is container-local. If they choose NFS to store shuffle files, a different kind of performance issue arises because intermediate files are no longer on local disks. And all of these issues arise even before fault tolerance is properly implemented in Spark.
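For the NFS route specifically, the mount itself is easy to express with Spark's k8s volume conf; here's a minimal sketch (Spark 3.1+ syntax, from memory — the master URL, server, and paths are all placeholders):

    from pyspark.sql import SparkSession

    # All names below are placeholders. Volumes whose names start with
    # "spark-local-dir-" are used by Spark as scratch space, so shuffle
    # files would land on the NFS mount.
    vol = "spark.kubernetes.executor.volumes.nfs.spark-local-dir-1"
    spark = (
        SparkSession.builder
        .master("k8s://https://kube-apiserver:6443")         # placeholder API server
        .config(f"{vol}.mount.path", "/tmp/spark-scratch")   # path inside the executor
        .config(f"{vol}.options.server", "nfs.example.com")  # placeholder NFS server
        .config(f"{vol}.options.path", "/export/shuffle")    # exported directory
        .getOrCreate()
    )

The mechanics are trivial; the performance concern is exactly that every shuffle write and read then crosses the network instead of hitting a local disk.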
We are currently trying to fix the first problem in a different context (not Spark), where worker containers store intermediate shuffle files on local disks mounted as hostPath volumes. The performance penalty is about 50% compared with running everything natively, and occasionally some containers get stuck for long periods. I believe the Spark community will encounter the same problems down the line if they choose to use local disks for storing intermediate files.
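In Spark-on-k8s terms, a hostPath setup like ours would look roughly like this (a sketch only — we're not on Spark, and every name and path here is a placeholder; same volume mechanism as the NFS sketch above, just with the hostPath type):

    from pyspark.sql import SparkSession

    # Placeholders throughout. The hostPath volume exposes a node-local
    # disk inside the executor container, and the "spark-local-dir-" name
    # prefix tells Spark to put shuffle/spill files on it.
    vol = "spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1"
    spark = (
        SparkSession.builder
        .master("k8s://https://kube-apiserver:6443")        # placeholder
        .config(f"{vol}.mount.path", "/tmp/spark-local")    # path inside the container
        .config(f"{vol}.options.path", "/mnt/disks/ssd0")   # node-local disk on the host
        .getOrCreate()
    )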
Glad our post sparked some pretty deep discussions on the future of spark-on-k8s! The open-source community is working on several projects to address this problem. You've mentioned NFS (proposed by Google), but there's also the possibility of using object storage: mappers would first write to local disks, and the shuffle data would then be moved asynchronously to the cloud. A toy sketch of that pattern is below.
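To make the "write locally, move async" idea concrete, here's what the pattern looks like — illustrative only, not Spark's actual shuffle code; the bucket name is made up, and it assumes boto3 with S3 credentials already configured:

    import os
    from concurrent.futures import ThreadPoolExecutor

    import boto3

    s3 = boto3.client("s3")
    uploader = ThreadPoolExecutor(max_workers=4)  # background upload pool

    def write_shuffle_partition(local_dir, map_id, partition_id, data):
        """Mapper writes to fast local disk first, then hands the file to a
        background thread that pushes it to object storage."""
        path = os.path.join(local_dir, f"shuffle_{map_id}_{partition_id}.data")
        with open(path, "wb") as f:
            f.write(data)  # local write stays on the hot path
        key = f"shuffle/{map_id}/{partition_id}"
        # Async move: reducers (or recovery after an executor is lost) can
        # fetch from the bucket instead of the mapper's local disk.
        return uploader.submit(s3.upload_file, path, "my-shuffle-bucket", key)

The appeal is that losing a node no longer loses its map outputs, at the cost of an extra network hop on reads that miss the local copy.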
References:
- https://youtu.be/GbpMOaSlMJ4?t=1617
- https://t.co/KWDNHjudfY?amp=1
- https://issues.apache.org/jira/browse/SPARK-25299