I'd love to, but it would take about an hour to run through everything.

Here's the short version: there's a collective ecosystem problem of fragmented applications, not-quite-right command-line utilities, web interfaces that look like they were designed in 1995, noisy log files that people actually have to read constantly, and cross-coupled dependencies that make keeping a cluster live for production use a full-time job.

There's also the programming problem that nobody actually writes Hadoop MapReduce code, because it's impossibly complicated. Everybody uses Hive, Pig, and half a dozen other tools that compile down to pre-templated Java classes (which costs you 5% to 30% of the performance you'd get writing the same job by hand).
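
To make that concrete, below is the canonical word-count job in raw MapReduce, essentially the standard Hadoop tutorial example (a sketch; details vary by Hadoop version). Compare its length to the roughly equivalent Hive query, which is a single statement like SELECT word, count(1) FROM (SELECT explode(split(line, ' ')) AS word FROM docs) w GROUP BY word.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
      // Mapper: emit (word, 1) for every token in the input split.
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        public void map(Object key, Text value, Context ctx)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            ctx.write(word, ONE);
          }
        }
      }

      // Reducer (also used as combiner): sum the counts for each word.
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();
        public void reduce(Text key, Iterable<IntWritable> values, Context ctx)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values) sum += v.get();
          result.set(sum);
          ctx.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }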

It hasn't grown because it's so amazing, performant, and company-saving. It grows because people jumped on the bandwagon and then got stuck with a few hundred TB in HDFS. No competing project has equal mindshare or battle-testing, so there's nothing pushing it forward. It's the MySQL of distributed processing systems: it works (mostly), but it breaks (in a few dozen known ways), so people keep adding features and building on top of it.




seiji pretty much nails it. Hadoop seems to have come out of a weird culture. It is a distributed system with a single point of failure (the NameNode) because its designers insisted on avoiding Paxos ("distributed systems are too hard, so we'll just make a broken-by-design protocol instead"). Another example: a lot of the database code built on top of Hadoop is designed around one Java HashMap per row, which really limits performance.
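
To illustrate the per-row overhead (a hypothetical sketch, not code from any actual Hadoop-based database): in the map-per-row layout every cell pays for an entry node, string-key hashing, and a boxed value, while a columnar layout is just primitive arrays.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class RowLayouts {
      public static void main(String[] args) {
        int rows = 1_000_000;

        // "One HashMap per row": each cell costs an entry node,
        // key hashing, and a boxed Long on top of the value itself.
        List<Map<String, Long>> mapRows = new ArrayList<>(rows);
        for (int i = 0; i < rows; i++) {
          Map<String, Long> row = new HashMap<>(4);
          row.put("user_id", (long) i);
          row.put("score", i * 2L);
          mapRows.add(row);
        }

        // Columnar alternative: two primitive arrays, no per-row objects.
        long[] userId = new long[rows];
        long[] score = new long[rows];
        for (int i = 0; i < rows; i++) { userId[i] = i; score[i] = i * 2L; }

        // Scanning a column chases ~1M pointers in the map version,
        // but reads contiguous memory in the array version.
        long a = 0, b = 0;
        for (Map<String, Long> row : mapRows) a += row.get("score");
        for (long s : score) b += s;
        System.out.println(a + " == " + b);
      }
    }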

There are all sorts of oddities, and you can mostly work around them, but it is... exhausting, and I spend a lot of time thinking "surely there must be a better way".


> surely there must be a better way

http://www.spark-project.org/
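
For the curious, here's the same word count from above under Spark's model (a sketch in the Java lambda API from later Spark releases; the Scala API at that link is similarly terse):

    import java.util.Arrays;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public class SparkWordCount {
      public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
            new SparkConf().setAppName("wordcount").setMaster("local[*]"));
        sc.textFile(args[0])
          .flatMap(line -> Arrays.asList(line.split(" ")).iterator())
          .mapToPair(word -> new Tuple2<>(word, 1))
          .reduceByKey(Integer::sum)
          .saveAsTextFile(args[1]);
        sc.stop();
      }
    }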


Wait, so ZooKeeper (the distributed consensus thingie, which implements a Paxos-like protocol) is a Hadoop project but not actually used in Hadoop MapReduce?


That's correct. I believe they are using it in some new "high availability" stuff coming down the road.
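
For a sense of what ZooKeeper gives you, the sketch below uses an ephemeral znode as a self-releasing "active" lock, the kind of primitive HA failover schemes build on (the connect string and path here are made up for illustration, not Hadoop's real HA layout):

    import java.io.IOException;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class WhoIsActive {
      public static void main(String[] args)
          throws IOException, InterruptedException, KeeperException {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, event -> {});
        try {
          // An ephemeral znode disappears when its session dies, so the
          // "active" lock releases itself if the holder crashes.
          zk.create("/demo-active-lock", "node-1".getBytes(),
              ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
          System.out.println("became active");
        } catch (KeeperException.NodeExistsException e) {
          System.out.println("another node is active; standing by");
        }
      }
    }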


Thanks for your insightful comments! I appreciate that you took the time to back up your opinion by distilling your thoughts into something quickly digestible.

Have you heard of any other projects, besides Disco, that are more performant than Hadoop for similar applications?


I'd also just like to say: NameNode = single point of failure.

A while back I worked on a contract for a large, very well-known social networking company that refused to consider Hadoop because of this.



