
Why the days are numbered for Hadoop as we know it
http://gigaom.com/cloud/why-the-days-are-numbered-for-hadoop-as-we-know-it/
======
monstrado
Just because Google is no longer using something doesn't mean it's no longer
useful for other companies. Most companies will never need to compute at the
scale Google does.

The fact is, Hadoop is becoming easier and easier to set up, with
distributions like Cloudera's and Hortonworks' letting users stand up clusters
with minimal upfront knowledge of Hadoop and how it works. The article uses
HBase and Hive together in the same sentence, which is kind of weird; Hive and
HBase are completely different. Hive was made by Facebook to let users map
schemas onto a large set of files so their employees could run MapReduce jobs
with SQL (HiveQL) instead of Java...basically a way to bridge the gap between
Hadoop and non-Java programmers. HBase is a different beast altogether and has
no relation to MapReduce; the only thing it has in common with Hadoop is that
it uses HDFS as its underlying filesystem (but doesn't have to). It's a NoSQL
database modeled after BigTable (key -> value) and is pretty complicated...
more so than plain Hadoop/MapReduce.
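
To make the contrast concrete, here's a rough Java sketch (table names, column
families, and connection strings are all made up): the Hive half is plain SQL
pushed through JDBC and compiled into MapReduce jobs over files in HDFS; the
HBase half is a single-row random read with no MapReduce anywhere.

    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HiveVsHBase {
        // Hive: declare SQL over files in HDFS; the query is compiled into
        // MapReduce jobs behind the scenes.
        static void hiveAggregate() throws Exception {
            java.sql.Connection hive =
                DriverManager.getConnection("jdbc:hive2://localhost:10000/default");
            Statement stmt = hive.createStatement();
            ResultSet rs = stmt.executeQuery(
                "SELECT page, COUNT(*) AS hits FROM access_logs GROUP BY page");
            while (rs.next()) {
                System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
            }
            hive.close();
        }

        // HBase: low-latency random access to one row by key; no MapReduce involved.
        static void hbasePointRead() throws Exception {
            org.apache.hadoop.hbase.client.Connection conn =
                ConnectionFactory.createConnection(HBaseConfiguration.create());
            Table users = conn.getTable(TableName.valueOf("users"));
            Result row = users.get(new Get(Bytes.toBytes("user123")));
            byte[] email = row.getValue(Bytes.toBytes("info"), Bytes.toBytes("email"));
            System.out.println(Bytes.toString(email));
            conn.close();
        }
    }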

I think Hadoop will be around for quite a while, especially since it's
becoming almost trivial to deploy. I see a lot of companies repurposing older
servers into Hadoop clusters, and they are VERY happy with its performance. I
rarely see cases where Hadoop is not providing enough to the user; far more
often, the user is not providing enough to Hadoop.

------
strlen
A few things:

* The existence of tools beyond Map/Reduce at Google does not imply that Map/Reduce's "days are numbered."

Map/Reduce is still enormously useful for many tasks even when other
approaches (BSP, traditional distributed RDBMS techniques like Dremel) are
available.

* Hadoop is not restricted to Map/Reduce. HDFS, cluster management, and more can be used and are used by other applications.

I am not too heavily involved with the query-processing side of Hadoop, but as
far as I understand, the long-term idea is that Map/Reduce will become just
another application (of many) running on top of Hadoop's cluster management
and storage infrastructure. See
<http://hadoop.apache.org/common/docs/r0.23.0/hadoop-yarn/hadoop-yarn-site/YARN.html>
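
As a rough illustration of that direction (my own sketch using the YARN client
API from later Hadoop 2.x releases, not code from the linked docs): a client
can ask the ResourceManager for its applications, and MapReduce shows up as
just one application type among many.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.yarn.api.records.ApplicationReport;
    import org.apache.hadoop.yarn.client.api.YarnClient;

    public class ListYarnApps {
        public static void main(String[] args) throws Exception {
            YarnClient yarn = YarnClient.createYarnClient();
            yarn.init(new Configuration()); // picks up yarn-site.xml from the classpath
            yarn.start();
            // Every framework -- MapReduce included -- is just another
            // application managed by the ResourceManager.
            for (ApplicationReport app : yarn.getApplications()) {
                System.out.printf("%s\t%s\t%s%n",
                    app.getApplicationId(), app.getApplicationType(), app.getName());
            }
            yarn.stop();
        }
    }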

(Disclosure: I contribute to HDFS and HBase)

------
yaroslavvb
Other technologies to watch:

1. GraphLab2. Unlike Pregel's Bulk Synchronous Parallel model, GraphLab2
allows asynchronous updates, which is more efficient for approximate
quantities. For instance, on AltaVista's web graph, most nodes only need to be
updated a couple of times, while some nodes need more than 60 updates (a toy
sketch of the difference follows this list).

2. Flume: an abstraction on top of MapReduce. You program as if your data
were held in ordinary Java-like containers, and it turns your program into a
series of regular MapReduce jobs.

3. ScalOps (<http://cs.markusweimer.com/pub/2012-DataEng.pdf>): a
higher-level abstraction prototyped at Yahoo Research; it might get
resurrected at Microsoft.

4. AllReduce
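
To make the first point concrete, here's a toy single-machine sketch (my own
illustration, not GraphLab or Pregel code) of distance-from-source on a chain
of vertices. The BSP version reads only the previous superstep's values, so
information crosses one edge per barrier; the asynchronous version reads the
freshest values in place and converges in a sweep or two.

    import java.util.Arrays;

    public class SyncVsAsync {
        static final int N = 8;                      // vertices 0..7 on a chain
        static final int INF = Integer.MAX_VALUE - 1;

        public static void main(String[] args) {
            // BSP / Pregel style: double-buffered values, one barrier per superstep.
            int[] dist = new int[N];
            Arrays.fill(dist, INF);
            dist[0] = 0;
            int supersteps = 0;
            boolean changed = true;
            while (changed) {
                changed = false;
                int[] next = dist.clone();           // reads see the previous round only
                for (int v = 1; v < N; v++) {
                    if (dist[v - 1] + 1 < next[v]) {
                        next[v] = dist[v - 1] + 1;
                        changed = true;
                    }
                }
                dist = next;
                supersteps++;
            }
            System.out.println("BSP supersteps: " + supersteps);  // grows with N

            // Asynchronous style: update in place; new values are immediately
            // visible to neighbors, so most vertices settle after 1-2 updates.
            int[] d = new int[N];
            Arrays.fill(d, INF);
            d[0] = 0;
            int sweeps = 0;
            changed = true;
            while (changed) {
                changed = false;
                for (int v = 1; v < N; v++) {
                    if (d[v - 1] + 1 < d[v]) {
                        d[v] = d[v - 1] + 1;
                        changed = true;
                    }
                }
                sweeps++;
            }
            System.out.println("async sweeps: " + sweeps);        // stays constant
        }
    }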

~~~
AaronBBrown
Flume is only related to MapReduce in that it is able to write to HDFS. All
Flume is is a transport mechanism for sending log-like data from one place to
another. It can be dressed up by adding decorators that manipulate the data
along the way, but at its core it just moves bytes around. Unfortunately, at
this stage it's highly unreliable, with its fault-tolerant design causing more
problems than it solves. Hopefully FlumeNG will improve on this. I speak as
someone who runs Flume in production and has to deal with it constantly
failing on me, usually silently.

It's a great technology, but it just isn't there yet.

~~~
yaroslavvb
Oops, it looks like Flume is a Google-only name; the open-source
implementation is called Crunch -- <https://issues.apache.org/jira/browse/MAPREDUCE-1849>
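
For a flavor of the model, here's the canonical word count in Apache Crunch's
Java API (the input and output paths are placeholders); the planner compiles
the pipeline down into a series of MapReduce jobs:

    import org.apache.crunch.DoFn;
    import org.apache.crunch.Emitter;
    import org.apache.crunch.PCollection;
    import org.apache.crunch.PTable;
    import org.apache.crunch.Pipeline;
    import org.apache.crunch.impl.mr.MRPipeline;
    import org.apache.crunch.types.writable.Writables;

    public class WordCount {
        public static void main(String[] args) {
            Pipeline pipeline = new MRPipeline(WordCount.class);
            PCollection<String> lines = pipeline.readTextFile("/in/corpus");
            // Looks like ordinary operations on a Java collection; the planner
            // turns it into MapReduce.
            PCollection<String> words = lines.parallelDo(
                new DoFn<String, String>() {
                    @Override
                    public void process(String line, Emitter<String> emitter) {
                        for (String word : line.split("\\s+")) {
                            emitter.emit(word);
                        }
                    }
                }, Writables.strings());
            PTable<String, Long> counts = words.count();
            pipeline.writeTextFile(counts, "/out/counts");
            pipeline.done();
        }
    }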

------
joshu
Google talks very selectively about the technologies it uses and how they
work. Pretending you understand what Google is doing based on this output is
an enormous mistake. This article is vague speculation wrapped in the clothing
of something more. It isn't.

------
anxrn
Hadoop is popular not (only?) because of some PR-driven mania around big data.
It is popular because it is an organic, evolving open source project that
solves hard but common problems faced by a lot of companies in a variety of
industries, and does so rather cheaply, despite being somewhat susceptible to
the common pitfalls of design-by-committee projects.

I fail to see how the existence of other tools that solve different and/or
narrower problems takes anything away from the success of Hadoop, let alone
spells its demise.

------
benbjohnson
It seems like Hadoop will move into the areas where it shines -- namely,
custom bulk processing of medium-to-large files. Big data is an umbrella for a
lot of different areas (clickstream processing, financial analysis, image &
video processing). When it comes to processing large datasets, you really need
tools designed for each particular type of data to get the performance you're
after.

Hadoop was a nice general-purpose tool, but niche-specific tools will
eventually supersede it for many types of processing.

------
dumm
Pregel: map-reduce in 15 lines of code = Google tries to teach its
Java/C++-loving staff functional programming.

Once again, very old ideas are being published as some amazing new discovery
from Google, mesmerizing geeks, tech companies, and wannabe tech companies
everywhere. The fact that they are trying to sell it as a service, using buzz
phrases like "time to value" and "time to insight", shows they are behind the
curve.

How about the time it takes to get programmers to stop using iterative,
loop-based programming and braindead IDEs?

The author of the Pregel article talks about graphs and vertexes. "Everything
is a graph." "Think in terms of vertexes." No, everything is a list. That's a
very old idea. You must think in terms of multi-dimensional lists and vectors.
The old new thing.

Processing trillions of rows in minutes. This is old hat for many folks in the
financial world.

Iterative programming is ingrained. And Google is a victim of this as much as
anyone else.

You can show a CS grad how to generate highly efficient C replete with gotos
using a high-level language like Scheme; they will see the performance
benefit, and yet they will still go back to using some crippled "expressive"
language, because that's what they are used to. They want to write algorithms
that no one needs and programs that no one will ever use. Users want stuff
that is FAST. But a lot of programming is not for users; it's to entertain the
programmers doing it. Sadly, they are not entertained by functional
programming and short programs of a few lines. They want to write thousands of
lines of code. FAIL.

Give me someone whose mind has not been poisoned with the idea of loops and
the scripting languages du jour, preferably someone who has not majored in
Computer Science, and I can make them 100x as productive as today's average
and even above-average programmers.

People will be stuck on Hadoop for a long time. Just as people are stuck on
C++, Perl, Java, Python and other verbose iterative languages.

~~~
wumpushunter
Okay, I'll bite. I'm ready to be 100x more productive—where do I begin?

~~~
rjurney
Try Spark: <http://www.spark-project.org/>
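
For a taste, here's word count in Spark's Java API (a sketch against a recent
Spark release, not the 2012-era API; paths are placeholders). The whole job is
a few transformations on a resilient distributed dataset:

    import java.util.Arrays;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public class SparkWordCount {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("wordcount").setMaster("local[*]");
            JavaSparkContext sc = new JavaSparkContext(conf);
            JavaRDD<String> lines = sc.textFile("/in/corpus");
            JavaPairRDD<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey(Integer::sum);   // runs in memory across the cluster
            counts.saveAsTextFile("/out/counts");
            sc.stop();
        }
    }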

------
AaronBBrown
Can HN just ban gigaom articles? Have they ever written a single technically
interesting or accurate article in their entire existence? It's all just spin.

------
big_data
The title sounds a bit ominous for an article that really only attempts to
show how Hadoop was a stepping stone to a more refined set of tools at Google.

------
rjurney
Here's the reality: every sliver of data is going to land on HDFS as the most
trusted and authoritative resource, the 'record of truth.' It is the most
cost-effective highly available storage mechanism there is. Batch computing is
here to stay, and Hadoop MapReduce will be a big part of it.

One might bet against Hadoop MapReduce, but betting against the Hadoop
filesystem as cheap storage built on commodity hardware that can serve large
data in a highly available fashion is... misguided. Nothing else scales to
10,000 nodes per cluster while providing data locality (processors near the
disk spindles) so that data stays accessible, or even comes close.

Many systems will sit in front of Hadoop to do things other than batch
computing, and many new types of distributed compute systems will sit on top
of Hadoop and Zookeeper. Hadoop is here to stay.

MapReduce is too low-level, and systems like Pig and Hive will continue to
grow, improve, and be the standard interfaces for working with Hadoop.

------
grammr
tl;dr: MapReduce is bad at a lot of things that it was never designed for.

------
wmf
I have a feeling any new open source analytics tools will be rolled into
Hadoop to preserve the value of the brand. I don't know what the future will
look like, but it will probably be called Hadoop.

~~~
benbjohnson
Not necessarily. Not everyone wants to build on top of a large pre-existing
code base or be locked into limitations because of design choices within
Hadoop or HDFS. And some people just don't want to write Java.

I'm writing a behavioral database in C, and I didn't add it to the Hadoop
umbrella specifically for those reasons. For example, my database's query
language uses LLVM for compilation and optimization. That's something I
couldn't do if I were locked into the JVM.

------
CurtMonash
If we read this as an argument that "therefore MRv2/YARN will be important,"
it's not crazy. The Hadoop project itself is opening up to break its
dependency on pure MapReduce. First out of the gate: the Apache Hama folks,
who I believe have gotten their own Hacker News attention by (somewhat
ironically) attacking Hadoop.

------
zitterbewegung
This is silly. Just because Google offers these things doesn't mean having
your own isn't an advantage. The other thing is that there are MapReduce-like
systems where you can start to break the rules.

The days aren't numbered. More and more companies are going to keep
contributing to Hadoop and Hadoop-like projects. One can imagine instead that
the days are numbered for Google: the moat will become less and less. Google
is running from Hadoop, not the other way around.

I mean, look at this: <http://opencloudconsortium.org/>. Eventually everyone
is going to figure out that you can not just copy Google but out-engineer
Google. Google can't hire every good engineer.

~~~
vineet
This point is not really relevant to the article. The article is calling for
new non-MapReduce-based architectures that leverage the Hadoop core (as
opposed to the entire Hadoop stack).

------
danielhlockard
There are certain kinds of data (financial, etc.) that you are always going to
want to process on your own locked-down cluster, not upload to Google.

~~~
sanxiyn
Good point, but not relevant for the article.

