
A lot of the differences between the systems arise from the implementation choice of how to do aggregation in Hadoop 2.4.0 and Spark 1.3. There's nothing inherent in the RDD model, for example, that says the aggregation has to be done eagerly at the mapper; nor in the MapReduce model that says it has to be done at the reducer. Either system could support the other aggregation mechanism, and the only challenge would be in choosing which one to use.
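To make the distinction concrete, here is a toy word count in plain Python, with no Hadoop or Spark APIs involved. The function names `reduce_side` and `map_side` and the sample partitions are made up for illustration. Both strategies give the same totals; they differ only in how many records would cross the shuffle.

```python
from collections import Counter

def reduce_side(partitions):
    """Mappers emit every (word, 1) pair; all aggregation happens at the reducer."""
    shuffled = [(word, 1) for part in partitions for word in part]
    totals = Counter()
    for word, n in shuffled:
        totals[word] += n
    return dict(totals), len(shuffled)

def map_side(partitions):
    """Each mapper pre-aggregates its own partition; the reducer merges the partials."""
    shuffled = []
    for part in partitions:
        # One (word, count) record per distinct word in this partition.
        shuffled.extend(Counter(part).items())
    totals = Counter()
    for word, n in shuffled:
        totals[word] += n
    return dict(totals), len(shuffled)

# Hypothetical input: two partitions, one per mapper.
partitions = [["a", "b", "a", "a"], ["b", "b", "c", "a"]]
full_result, full_records = reduce_side(partitions)
partial_result, partial_records = map_side(partitions)
```

On this input the reduce-side version shuffles 8 records and the map-side version shuffles 5. With few repeated keys per partition the saving shrinks, and the mapper still pays for the hash table, which is roughly why choosing between the two is the hard part.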

Some former colleagues wrote a nice paper about the performance trade-offs for different styles of distributed aggregation in DryadLINQ (a MapReduce-style system), and evaluated it at scale:

http://sigops.org/sosp/sosp09/papers/yu-sosp09.pdf




> Either system could support the other aggregation mechanism, and the only challenge would be in choosing which one to use.

Hive implements something similar to what the paper describes: partial aggregation on the mappers, followed by a sorted final aggregation on the reducer.
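A rough sketch of that two-stage shape in plain Python, not Hive's actual operator code. The names `mapper_partial_sums` and `sorted_final_aggregation` are invented for this example, and a simple SUM stands in for whatever aggregate the query uses.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter

def mapper_partial_sums(rows):
    """Hash-based partial aggregation over one mapper's rows."""
    partial = defaultdict(int)
    for key, value in rows:
        partial[key] += value
    return list(partial.items())

def sorted_final_aggregation(mapper_outputs):
    """Reducer side: records arrive sorted by key, so each group is merged in one pass."""
    # sorted() stands in for the shuffle's sort-by-key step.
    shuffled = sorted((kv for out in mapper_outputs for kv in out), key=itemgetter(0))
    for key, group in groupby(shuffled, key=itemgetter(0)):
        yield key, sum(value for _, value in group)

# Hypothetical input: two mappers, each with a few (key, value) rows.
mapper_inputs = [
    [("x", 1), ("y", 2), ("x", 3)],
    [("y", 4), ("z", 5), ("x", 6)],
]
final = list(sorted_final_aggregation(mapper_partial_sums(rows) for rows in mapper_inputs))
```

Because the reducer sees its input in key order, it only needs to hold one running aggregate at a time instead of a hash table over all keys.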

You'll find Hive beating MapReduce[1], even though it is implemented using MR.

[1] - https://www.cl.cam.ac.uk/research/srg/netos/musketeer/eurosy...



