

Gearman: Bringing the Power of Map/Reduce to Everyday Applications - yarapavan
http://www.gearman.org/

======
jimmybot
Here's an interesting alternative to Hadoop using two HN favorites, Erlang and
Python: <http://discoproject.org/>

Seems simple enough, though I haven't used it myself.

------
joshu
Gearman is a batch queueing system. They aren't really the same thing.

------
prodigal_erik
I wonder how they plan to distinguish themselves from Hadoop.

~~~
sho
It's not Java? It's under rapid, active development? It's embeddable?

I think it looks very interesting. I've learned to always pay attention to
what Danga is doing - they have a knack for taking a task, cutting out all the
bloat and crap and yak-shaving, and delivering just what you need; see
memcached, Perlbal, MogileFS.

And Hadoop is an utter pig. It might work great, fine, but it is way, WAY too
complex IMO. It doesn't have that elegance and efficiency that is the hallmark
of excellent software. Plus did I mention it's in Java? Hope you like XML.

Disclaimer: This is just my opinion and I have not used Hadoop extensively,
although I do work with the M/R pattern elsewhere. It's just my impression of
the software from not-very-thorough investigations.

~~~
russss
Gearman is a framework for quickly offloading computation of small jobs. It's
especially useful for running background jobs from a web site, like sending
e-mail or resizing images.
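
For concreteness, here's roughly what that looks like with the python-gearman
bindings - a minimal sketch, not a drop-in recipe: the task name, payload,
and port are assumptions, and it presumes a gearmand server is already
running.

    import gearman  # python-gearman client bindings

    # --- worker process: register a function under a task name, then wait ---
    def resize_image(worker, job):
        # job.data is the raw payload the client sent; real code would open
        # and resize the file here -- this stub just echoes the path back
        print('resizing %s' % job.data)
        return job.data

    worker = gearman.GearmanWorker(['localhost:4730'])
    worker.register_task('resize_image', resize_image)
    # worker.work()  # blocks forever, pulling jobs off the queue

    # --- web app process: fire-and-forget, returns to the request at once ---
    client = gearman.GearmanClient(['localhost:4730'])
    client.submit_job('resize_image', '/uploads/photo_123.jpg', background=True)

The web request never waits on the resize; a worker picks the job up whenever
it has capacity.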

Hadoop is a clone of Google's distributed computing framework and is
necessarily more complex than Gearman. Hadoop will let you reliably distribute
single computing tasks over thousands of nodes. It's under rapid development.
If you think it's too complex, you're probably not operating at the right
scale.

It's apples and oranges.

~~~
sho
Agreed. But the thing is - how many people _are_ running at Hadoop-appropriate
scale? You say thousands of nodes - so, top 50 websites plus some mining
companies/research institutions? That is probably an exaggeration - perhaps
Hadoop is useful even past 5 or so computers - but the point remains, it's a
big, heavy pig of a thing with high barriers to entry. It is Big Serious
Enterprise Software for Serious Business Only.

Hadoop core is 41M of compiled code, for christ's sake. Zipped!

This is Hacker News, not large-scale enterprise news. Hadoop has way too much
overhead just to play around with. A smaller, leaner, cut-down alternative is
most welcome.

~~~
sunkencity
For some interesting thoughts on map/reduce frameworks, check out this
interview with Jonathan Ellis on the Cassandra project, which is Facebook's
version of a distributed data store.

<http://techzinglive.com/?p=75>

From what I remember: Google built its parallelizable stuff around its already
excellent distributed filesystem. In Hadoop they are essentially cloning the
filesystem and the rest of the stack from scratch, which is a pretty hard
thing to do, and a leaner approach would be better.

~~~
sho
Thanks, will check it out.

Yeah, I appreciate what Hadoop is trying to do. I am just not sure they are
going about it in the right way. I don't like their technology choices and do
not get a good feeling from the few times I've looked at their issue tracking
system (bureaucracy).

I mean look, MogileFS is also a distributed FS. It works. It's very good,
actually. It's in production in a lot of places.

 _MogileFS is 268KB of Perl._

Gearman is basically a distributed job broker, just like Hadoop. There are
some complexities, but it is not rocket science. They basically do the same
thing.

 _Gearman is 560KB of C, the original was 12KB of Perl._

One of these solutions is literally 100 times larger than the other. I don't
know about you, but for me... that is not a good sign. I do not see what
Hadoop can do that the Danga tools cannot. Hell, I have a hard time seeing
what Hadoop can do that a couple hundred lines of Python can't. Sure, it might
not be enough to operate like Google does, but 90% of the time, quick and
dirty gets it done.
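
To show what I mean by quick and dirty, here's a toy word count in the
map/reduce shape using nothing but the standard library - one machine, a
process pool instead of a cluster, no fault tolerance, purely illustrative:

    from multiprocessing import Pool
    from collections import defaultdict

    def map_phase(line):
        # emit (key, value) pairs, one per word
        return [(word, 1) for word in line.split()]

    def reduce_phase(item):
        # sum the values collected for one key
        key, values = item
        return (key, sum(values))

    if __name__ == '__main__':
        lines = ["the quick brown fox", "the lazy dog", "the fox"]
        pool = Pool()

        # map: run the mapper over every input split in parallel
        mapped = pool.map(map_phase, lines)

        # shuffle: group intermediate values by key
        groups = defaultdict(list)
        for pairs in mapped:
            for key, value in pairs:
                groups[key].append(value)

        # reduce: combine the grouped values in parallel
        print(dict(pool.map(reduce_phase, groups.items())))

Obviously that's not distributed and not fault tolerant - which is exactly
the stuff Hadoop sells you - but for a lot of jobs this shape is all you
need.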

~~~
prodigal_erik
As far as I can tell, right now Gearman is only a load balancer for small
requests using a nonstandard RPC protocol. It's easy to be smaller when you
don't yet have most of Hadoop's major features:

* use the whole cluster for large jobs, consuming all the data right where it's stored and combining output from every machine you own (what's the point of remote work on just one box?)

* retry partial failures on different replicas of the same data (without losing successful work)

* ship intermediate values between workers (without this you don't even have map/reduce - see the streaming sketch after this list)

* roll new code and data beforehand (and clean it up after)

* avoid job starvation
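
To make the intermediate-values point concrete: with Hadoop Streaming the
framework itself does the shuffle. Your mapper writes tab-separated key/value
lines to stdout, Hadoop sorts and groups them across the cluster, and your
reducer reads them back already grouped by key. A minimal sketch of the two
scripts (file names and the exact job invocation depend on your install):

    # mapper.py -- reads raw input lines on stdin, emits "word<TAB>1" pairs
    import sys
    for line in sys.stdin:
        for word in line.split():
            print('%s\t1' % word)

    # reducer.py -- stdin arrives sorted by key, so count until the key changes
    import sys
    current, count = None, 0
    for line in sys.stdin:
        key, value = line.rstrip('\n').split('\t')
        if key != current:
            if current is not None:
                print('%s\t%d' % (current, count))
            current, count = key, 0
        count += int(value)
    if current is not None:
        print('%s\t%d' % (current, count))

Everything between the mapper's stdout and the reducer's stdin - partitioning,
sorting, moving data between machines, retrying failed pieces - is the part
Gearman doesn't do for you.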

