Abstractions almost always end up leaky. Spark SQL, for example, does whole-stag...

paulluuk · on Sept 8, 2021

What's wrong with relying on data engineers for data engineering?

disgruntledphd2 · on Sept 8, 2021

Spark is very, very odd to tune. From my limited experience, it has the problems common to all distributed data processing (skew, it's almost always skew), but because evaluation is lazy, people end up really confused about what actually drives the performance problems.
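To make the skew point concrete, here is a minimal pure-Python sketch (not Spark code). It assumes rows are routed to partitions by `hash(key) % num_partitions`, which is roughly how hash partitioning behaves. The function names and the sample data are hypothetical, made up for illustration.

```python
def partition_sizes(keys, num_partitions):
    """Count rows per partition when rows are routed by hash(key) % num_partitions."""
    sizes = [0] * num_partitions
    for key in keys:
        sizes[hash(key) % num_partitions] += 1
    return sizes


def skew_ratio(sizes):
    """Largest partition divided by the mean partition size (1.0 means perfectly even)."""
    mean = sum(sizes) / len(sizes)
    return max(sizes) / mean


# Hypothetical data: small non-negative integer keys, so hash(key) == key
# in CPython and the result is deterministic.
uniform_keys = list(range(1000))               # every key appears once
skewed_keys = [0] * 900 + list(range(1, 101))  # one hot key holds 90% of the rows

uniform = partition_sizes(uniform_keys, 8)
skewed = partition_sizes(skewed_keys, 8)

print("uniform:", uniform, "ratio:", skew_ratio(uniform))
print("skewed: ", skewed, "ratio:", skew_ratio(skewed))
```

With one hot key, a single partition ends up with most of the rows, so the task that processes it takes far longer than the rest no matter how many executors you add.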

That being said, Spark is literally the only (relatively) easy way to run distributed ML that's open source. The competitors are GPUs (if you have a GPU-friendly problem) and running multiple Python processes across the network.

(I'm really hoping that people will now school me, and I'll discover a much better way in the comments).

b9a2cab5 · on Sept 10, 2021

Data engineers should be building pipelines and delivering business value, not fidgeting with some JVM or Spark parameter that saves them runtime on a join (or for that matter, from what I've seen at a certain bigco, building their own custom join algorithms). That's why I said it's only economical for big companies to run efficient abstractions and everyone else just throws more compute at the issue.