It's worse than that. Shuffle for Spark on Kubernetes is fundamentally broken and hasn't been fixed yet. The problem is that Docker containers cannot (for security reasons) share the same host-level disks. There is no external shuffle service, and disk caching is container-local (it doesn't use kernel-level disk I/O buffering), which kills performance. Google's proposed solution below is to use NFS to store shuffle files, which is not going to be performant. Stick with YARN for Spark and only switch once shuffle is fixed for k8s. Databricks are in no rush to get shuffle fixed for k8s.
I agree that Spark on Kubernetes will have a hard time fixing the shuffle problem. If they choose local disks for a per-node shuffle service, a performance issue arises because disk caching is container-local. If they choose NFS to store shuffle files, a different kind of performance issue arises because intermediate files are no longer on local disks. And all of these issues arise even before fault tolerance is properly implemented in Spark.
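For the NFS route specifically, the mount itself is easy to express with Spark's k8s volume conf; here's a minimal sketch (Spark 3.1+ syntax, from memory — the master URL, server, and paths are all placeholders):

    from pyspark.sql import SparkSession

    # All names below are placeholders. Volumes whose names start with
    # "spark-local-dir-" are used by Spark as scratch space, so shuffle
    # files would land on the NFS mount.
    vol = "spark.kubernetes.executor.volumes.nfs.spark-local-dir-1"
    spark = (
        SparkSession.builder
        .master("k8s://https://kube-apiserver:6443")         # placeholder API server
        .config(f"{vol}.mount.path", "/tmp/spark-scratch")   # path inside the executor
        .config(f"{vol}.options.server", "nfs.example.com")  # placeholder NFS server
        .config(f"{vol}.options.path", "/export/shuffle")    # exported directory
        .getOrCreate()
    )

The mechanics are trivial; the performance concern is exactly that every shuffle write and read then crosses the network instead of hitting a local disk.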
We are currently trying to fix the first problem in a different context (not Spark), where worker containers store intermediate shuffle files on local disks mounted as hostPath volumes. The performance penalty is about 50% compared with running everything natively, and occasionally some containers get stuck for long periods. I believe the Spark community will encounter the same problems down the line if they choose to use local disks for storing intermediate files.
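In Spark-on-k8s terms, a hostPath setup like ours would look roughly like this (a sketch only — we're not on Spark, and every name and path here is a placeholder; same volume mechanism as the NFS sketch above, just with the hostPath type):

    from pyspark.sql import SparkSession

    # Placeholders throughout. The hostPath volume exposes a node-local
    # disk inside the executor container, and the "spark-local-dir-" name
    # prefix tells Spark to put shuffle/spill files on it.
    vol = "spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1"
    spark = (
        SparkSession.builder
        .master("k8s://https://kube-apiserver:6443")        # placeholder
        .config(f"{vol}.mount.path", "/tmp/spark-local")    # path inside the container
        .config(f"{vol}.options.path", "/mnt/disks/ssd0")   # node-local disk on the host
        .getOrCreate()
    )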
Glad our post sparked some pretty deep discussions on the future of spark-on-k8s! The open-source community is working on several projects to address this problem. You've mentioned NFS (proposed by Google), but there's also the possibility of using object storage: mappers would first write to local disks, and the shuffle data would then be moved asynchronously to the cloud. A toy sketch of that pattern is below.
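To make the "write locally, move async" idea concrete, here's what the pattern looks like — illustrative only, not Spark's actual shuffle code; the bucket name is made up, and it assumes boto3 with S3 credentials already configured:

    import os
    from concurrent.futures import ThreadPoolExecutor

    import boto3

    s3 = boto3.client("s3")
    uploader = ThreadPoolExecutor(max_workers=4)  # background upload pool

    def write_shuffle_partition(local_dir, map_id, partition_id, data):
        """Mapper writes to fast local disk first, then hands the file to a
        background thread that pushes it to object storage."""
        path = os.path.join(local_dir, f"shuffle_{map_id}_{partition_id}.data")
        with open(path, "wb") as f:
            f.write(data)  # local write stays on the hot path
        key = f"shuffle/{map_id}/{partition_id}"
        # Async move: reducers (or recovery after an executor is lost) can
        # fetch from the bucket instead of the mapper's local disk.
        return uploader.submit(s3.upload_file, path, "my-shuffle-bucket", key)

The appeal is that losing a node no longer loses its map outputs, at the cost of an extra network hop on reads that miss the local copy.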
References:
- https://youtu.be/GbpMOaSlMJ4?t=1617
- https://t.co/KWDNHjudfY?amp=1
- https://issues.apache.org/jira/browse/SPARK-25299