
Your command-line tools may be 235x faster but they don’t have the same features - Samus_
http://blog2samus.tumblr.com/post/108605064468/your-command-line-tools-may-be-235x-faster-but
======
dalke
Adam Drake's essay concludes:

> If you have a huge amount of data or really need distributed processing,
> then tools like Hadoop may be required, but more often than not these days I
> see Hadoop used where a traditional relational database or other solutions
> would be far better in terms of performance, cost of implementation, and
> ongoing maintenance.

This new essay says "I agree with the points made but in the end I feel like
it’s missing the point." It then argues that scalability, reliability, and
performance are the missing parts. (Though only in the context of batch jobs
and large data sets that don't fit on a single machine. While it mentions
scaling down, the overhead of going through Hadoop is too much if you want to
scale down to search 1K of data.)

In the end, I think Adam Drake made exactly the point that this new essay says
is missing.

Several people in the HN thread (and earlier threads on the same topic) have
commented about people using Hadoop for "large" data which wasn't large and
wasn't ever going to scale past a single machine[1]. Complaining against that
misapplication of tools was the thrust of Drake's essay. It wasn't a statement
that the command-line approach is better than Hadoop in all cases.

[1] For the record, data in my field takes about 10 years to double. The same
algorithms that took a small cluster 5 minutes to search in 1990 now take a
second on a laptop.

~~~
Samus_
Hey there! I agree about the misuses, but I don't think the reasons stated are
the right ones.

For instance, the fault-tolerance abilities of those frameworks are really
important and aren't mentioned; in some cases they may even offset the speed
differences.

The data growth you mention is also a relevant factor that is missing. In your
case you have an argument against a tool that helps you scale, and that's the
right mindset, because you're evaluating the use of such a tool over time
rather than in a single situation.

As for the scaling: if you go as low as 1K, sure, Hadoop becomes useless, but
what if you had to scale from 500 machines to 100? If you bought a
supercomputer that costs millions and three months later you're no longer
using it, then "scaling down" would be a great advantage.

So in short, I like the "cons" of the original and I agree with them, but there
are "pros" that should also be considered.

~~~
dalke
The original essay's point was to say that people over-weigh the pros, that
there are cons, and that some of those cons have considerable real-world
significance.

It wasn't to say that there are no pros. It wasn't to provide a balanced
analysis of all the pros and cons. It had a specific point, it made it, and it
acknowledged that the point wasn't universal.

This new essay misinterprets the point in order to set itself up as a more
realistic viewpoint, when in fact it's arguing against a straw interpretation
of the original essay.

Yes, fault-tolerance can be really important. And sometimes not. Such a
discussion is orthogonal to the point of the original. A single node is quite
reliable. There are many ways to get redundancy on top of that. For the same
price as a Hadoop cluster, you could buy a machine with more reliable
hardware, plus a hot backup, plus a load balancer. This may be all that's
needed.

If you need 100 machines then you are still in the realm where you "really
need distributed processing", which was one of the times where the original
essay said that Hadoop could be the right solution.

If you have 100 machines, and the single node can go 200x faster, then you
really only need 1 machine.

~~~
Samus_
> The original essay's point was to say that people over-weigh the pros, that
> there are cons, and that some of those cons have considerable real-world
> significance.
>
> It wasn't to say that there are no pros. It wasn't to provide a balanced
> analysis of all the pros and cons. It had a specific point, it made it, and it
> acknowledged that the point wasn't universal.
>
> This new essay misinterprets the point in order to set itself up as a more
> realistic viewpoint, when in fact it's arguing against a straw interpretation
> of the original essay.

I don't see the pros mentioned, at all.

What I see is another "Hadoop is fast, Hadoop is slow" comparison based
entirely on speed, which on its own isn't either a "pro" or a "con"; it's just
one factor among many.

Without the complete picture we'll keep misinterpreting and arguing over this
technology for the wrong reasons.

> Yes, fault-tolerance can be really important. And sometimes not. Such a
> discussion is orthogonal the point of the original. A single node is quite
> reliable. There are many ways to get redundancy on top of that. For the same
> price as a Hadoop cluster, you could buy a machine with more reliable
> hardware, plus a hot backup, plus a load balancer. This may be all that's
> needed.

Redundancy and fault-tolerance are not the same thing, and you're also thinking
in the context of web development, which is a different scenario.

On the web you can use redundancy to improve reliability because you're either
serving the same content to many clients or performing _independent_
computations for each of them; data processing frameworks are used to perform a
_single_ computation with several different inputs over many machines.
Replicating it may give you some safety regarding hardware failures but none
for software issues, and you'll also be wasting a lot of power, thus limiting
the range of applications.

Big Data frameworks partition the input, they do not replicate it; they also
keep track of the results and allow you to retry failed chunks. It's a
different approach than the ones used on the web.
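
Roughly, the pattern looks like this. This is a minimal plain-Python sketch of
partition-and-retry, not the actual API of Hadoop or any other framework, and
the per-chunk work is only a placeholder:

```python
# Sketch of partition-and-retry, not any framework's real API.
from concurrent.futures import ProcessPoolExecutor

def partition(lines, chunk_size=10000):
    """Split the input into fixed-size chunks; each chunk is processed independently."""
    for i in range(0, len(lines), chunk_size):
        yield lines[i:i + chunk_size]

def process_chunk(chunk):
    """Placeholder per-chunk computation (here: count lines containing a result)."""
    return sum(1 for line in chunk if "1-0" in line)

def run(lines, max_retries=3):
    chunks = list(partition(lines))
    results = {}
    pending = set(range(len(chunks)))        # chunks still to be computed
    for attempt in range(max_retries):
        failed = set()
        with ProcessPoolExecutor() as pool:
            futures = {i: pool.submit(process_chunk, chunks[i]) for i in pending}
            for i, fut in futures.items():
                try:
                    results[i] = fut.result()  # keep track of which chunks succeeded
                except Exception:
                    failed.add(i)              # only the failed chunks are retried
        pending = failed
        if not pending:
            break
    return sum(results.values())

if __name__ == "__main__":
    sample = ['[Result "1-0"]\n'] * 25000
    print(run(sample))
```

The point is that the unit of recovery is a chunk of the single computation,
not a whole replicated copy of it.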

> If you need 100 machines then you are still in the realm where you "really
> need distributed processing", which was one of the times where the original
> essay said that Hadoop could be the right solution.
>
> If you have 100 machines, and the single node can go 200x faster, then you
> really only need 1 machine.

Like I said, speed is just one factor; you can't determine the right solution
based exclusively on it, and that's what the original article missed. You're
disregarding all the other advantages that a cluster and a Big Data framework
provide.

Also, the very concept of "machine" can vary wildly; there's an infinite range
of combinations of disk, RAM, and processor, so what is "a machine" and how can
you say when you need one, ten, or 100?

The real choice here is a single "powerful enough" machine vs. a set of
"commodity" hardware, then you can compare prices, speed, scalability and
reliability of each scenario in all the possible situations you expect to
encounter and hopefully some unforeseen ones as well.

~~~
dalke
The four pros I spotted were being "cool", "learning and having fun", handling
"a huge amount of data", and "distributed processing."

There's no claim that it's trying to give the complete picture. It's trying to
solve a specific task for chess board analysis. Nor did it claim that Hadoop
is always the wrong solution, only that a single node _can_ be faster than
Hadoop. An existence proof is a low barrier to cross.

You say "this is a different scenario" and I have to shrug. I used to do
molecular dynamics distributed across a compute cluster or using a
supercomputer. That's yet another different scenario, and also one where
Hadoop doesn't work. Last I heard, simulation software I helped write now runs
across 8,400 cores. (See
[http://www.ks.uiuc.edu/Research/gpu/files/gtc2013_phillips.pdf](http://www.ks.uiuc.edu/Research/gpu/files/gtc2013_phillips.pdf).)
The visualization project I used to lead now analyzes 100+ TB of simulated
trajectory data. (See
[http://on-demand.gputechconf.com/gtc/2014/presentations/S4410-visualization-analysis-petascale-moledular-sims.pdf](http://on-demand.gputechconf.com/gtc/2014/presentations/S4410-visualization-analysis-petascale-moledular-sims.pdf).)

MD definitely falls into "Big Data", yet the examples I gave don't use Hadoop
or data partitioning like you suggest, so your statement about Big Data
frameworks is really restricted to only a subset of what people do in Big
Data.

So yes, you're right. The original essay isn't complete. Neither is this new
one, which didn't include all the problems with Hadoop, like poor inter-node
latency, and lack of support for MPI. The 'real choice' isn't as dichotomous
as you make it out to be. A set of powerful non-commodity hardware, like Blue
Waters, may be the right approach for some scenarios.

"Without the complete picture"

I disagree. The new essay gives a different picture. A lot of pictures fit
well in the frame of Hadoop, for the reasons given in the new essay.

In this case the pictured scenario is chess data analysis. In this case it
looks like a single node is faster. However, it's not _better_ for the
original original task, which was to experiment with learning Hadoop. ("I was
skeptical of using Hadoop for the task, but I can understand his goal of
learning and having fun with mrjob and EMR.") As the original original spec no
longer exists, I cannot tell if fault-tolerance was one of its requirements. As
it was a personal learning project, no doubt it wasn't.

So this new essay is adding details that weren't relevant to the original
task, then complaining that the details weren't addressed. This is poor form.

You ask "how can you say when you need one, ten or 100" and the easy answer is
to go ahead and implement it on one node. If it takes an hour to write a
single node solution would take, and it's fast enough for your foreseeable
needs, then you're a lot better off than spending a day to try out a Hadoop
solution. Plus, without the single node solution it's hard to know if you have
a good baseline.
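
For the chess example that baseline really is small. Here's a hypothetical
single-node sketch in Python (the original essay used a shell pipeline; the
`pgn/*.pgn` layout is an assumption about how the data is stored):

```python
# Hypothetical single-node baseline: tally white wins, black wins, and draws
# from the [Result ...] tags in a directory of PGN files.
import glob

counts = {"1-0": 0, "0-1": 0, "1/2-1/2": 0}

for path in glob.glob("pgn/*.pgn"):        # assumed data layout
    with open(path, errors="ignore") as f:
        for line in f:
            if line.startswith("[Result "):
                for result in counts:
                    if result in line:
                        counts[result] += 1
                        break

print(counts)
```

If something like that is fast enough, the scaling question answers itself; if
it isn't, at least you know what the cluster has to beat.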

