Hacker News

I find it very odd that the author completely glossed over the recent Spark developments in the Hadoop ecosystem. In many ways, Spark is meant to replace the Map-Reduce paradigm to enable easier access to the data.



Hi, co-founder here.

This is something we probably should have spoken to directly in the article, given how popular Spark is. The main reason we didn't is that we like Spark and don't feel it needs to be replaced. It doesn't seem to have most of the ecosystem problems we discuss because it has a company behind it. My understanding is that Spark is designed to work with different storage backends, and I'm very curious to see what would happen if we got it talking to pfs. I think it could work very well, because Spark's notion of immutable RDDs seems very similar to the way pfs handles snapshots.
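To illustrate the parallel being drawn here, a tiny conceptual sketch in plain Python (this is not the actual Spark API; `MiniRDD` is a made-up name): transformations on an RDD-like value return a new immutable value rather than mutating the original, much like snapshots leave prior versions intact.

```python
# Conceptual sketch only — illustrates immutability, not real Spark.

class MiniRDD:
    def __init__(self, data):
        self._data = tuple(data)  # freeze the data, like a snapshot

    def map(self, fn):
        # Returns a NEW MiniRDD; self is never modified.
        return MiniRDD(fn(x) for x in self._data)

    def collect(self):
        return list(self._data)

base = MiniRDD([1, 2, 3])
doubled = base.map(lambda x: x * 2)
print(base.collect())     # [1, 2, 3] — original unchanged
print(doubled.collect())  # [2, 4, 6]
```

Every transformation yields a new lineage of data while earlier versions stay readable, which is the same property a snapshotting filesystem provides.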

Hope this clears a few things up, and apologies if I've mischaracterized Spark here; I've only used it a little.


It's not that odd. Spark is very similar to the Hadoop way of doing things, and there are a few other projects like it attempting to be a better Hadoop. So I don't think much was lost by not mentioning it.


How is Spark similar to the Hadoop way of doing things?

It operates very differently from Hadoop for us. SparkSQL lets third-party apps (e.g. analytics) use JDBC/ODBC rather than going through HDFS, and the in-memory model and ease of caching data from HDFS enable different use cases. We do most of our work now via SQL.

Combining Spark with Storm, Elasticsearch, etc. also permits a true real-time ingestion and search architecture.


Spark is a more general data-processing framework than Hadoop. It can do map-reduce, run on top of Hadoop clusters, and use Hadoop data. It can also do streaming, interactive queries, machine learning, and graph processing.
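For readers unfamiliar with the pattern Spark generalizes, here is the classic map-reduce word count sketched in plain Python (no Spark dependency; the phase functions are illustrative names, not a real API):

```python
from collections import defaultdict

# Map phase: emit a (word, 1) pair for every word in every line.
def map_phase(lines):
    return [(word, 1) for line in lines for word in line.split()]

# Reduce phase: sum the counts for each key.
def reduce_phase(pairs):
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["spark runs map reduce", "spark also runs streaming"]
print(reduce_phase(map_phase(lines)))
```

In Spark the same two steps would be distributed transformations over an RDD or DataFrame, and the other workloads mentioned (streaming, SQL, ML, graphs) build on that same execution engine.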



