

Tuning Java Garbage Collection for Spark Applications - datascientist
https://databricks.com/blog/2015/05/28/tuning-java-garbage-collection-for-spark-applications.html

======
estefan
The more help the better on this one. I've spent the last week trying to debug
why my job fails with a large dataset after 2 hours. I could see that the heap
was being exhausted but it took me ages to realise that coalescing to a single
output partition meant the entire dataset needed to fit into RAM :-(
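
A rough back-of-the-envelope sketch of why that hurts, in plain Python rather than Spark. The row count, row size, and partition counts are made-up illustrative numbers, and the helper names are hypothetical. It assumes rows spread evenly and that a task has to hold its whole partition at once, which is not true of every Spark operation.

```python
def rows_per_partition(total_rows, num_partitions):
    """Spread rows as evenly as possible across partitions."""
    base, extra = divmod(total_rows, num_partitions)
    return [base + (1 if i < extra else 0) for i in range(num_partitions)]


def largest_partition_bytes(total_rows, bytes_per_row, num_partitions):
    """Size of the biggest partition, i.e. what one task may need to hold."""
    return max(rows_per_partition(total_rows, num_partitions)) * bytes_per_row


# Hypothetical dataset: one billion rows of roughly 100 bytes each.
TOTAL_ROWS = 1_000_000_000
BYTES_PER_ROW = 100

single = largest_partition_bytes(TOTAL_ROWS, BYTES_PER_ROW, 1)
spread = largest_partition_bytes(TOTAL_ROWS, BYTES_PER_ROW, 200)

print(f"1 partition:    {single / 1e9:.1f} GB in a single task")
print(f"200 partitions: {spread / 1e9:.1f} GB per task")
```

With one output partition the single task is responsible for the full dataset, so the per-task footprint grows with the data instead of staying bounded by the partition count.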

I've got to say I enjoy using Spark & Scala far more than Crunch & Java.

By the way - does anyone know how to connect to the Spark UI when running
under YARN on EMR?

~~~
threeseed
The YARN UI has the same information as the Spark UI.

But I have to agree that Spark is amazing.

~~~
estefan
The only thing it's missing that I really need is secondary sorting... I wish
they'd add that soon. I've read it can be done but it doesn't look
straightforward.
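
For reference, the usual idea behind secondary sorting is: partition on the primary key only, then sort each partition on the composite (key, secondary field), so every key's values arrive already ordered. In Spark this is typically done with a custom partitioner plus `repartitionAndSortWithinPartitions`. The sketch below is plain Python, not Spark code; the function names and the sample events are hypothetical, and the CRC-based partitioner is just a deterministic stand-in for a hash partitioner.

```python
import zlib
from itertools import groupby


def partition_of(key, num_partitions):
    # Hypothetical stand-in for a hash partitioner; looks at the primary key only.
    return zlib.crc32(str(key).encode("utf-8")) % num_partitions


def secondary_sort(records, num_partitions):
    """records: iterable of (key, secondary, value) tuples."""
    partitions = [[] for _ in range(num_partitions)]
    for key, secondary, value in records:
        # All rows for one key land in the same partition.
        partitions[partition_of(key, num_partitions)].append((key, secondary, value))
    for rows in partitions:
        # Sort on the composite (key, secondary) within each partition.
        rows.sort(key=lambda r: (r[0], r[1]))
    return partitions


def values_per_key(partition):
    # Rows are already ordered, so one streaming pass groups them by key.
    return [
        (key, [value for _, _, value in rows])
        for key, rows in groupby(partition, key=lambda r: r[0])
    ]


# Made-up sample data: (user, timestamp, event).
events = [
    ("user_b", 3, "logout"),
    ("user_a", 2, "click"),
    ("user_a", 1, "login"),
    ("user_b", 1, "login"),
]

result = {}
for part in secondary_sort(events, num_partitions=2):
    result.update(values_per_key(part))

print(result)
```

The point of the design is that no per-key list has to be sorted in memory after grouping; the ordering falls out of the partition-level sort.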

