
Hadoop: When grownups do open source - sanj
http://www.theregister.co.uk/2008/08/11/hadoop_dziuba/
======
orib
The Register: When children do journalism.

He may have some points in there, but any real content is drowned by the noise
of the baseless insults, ad-hominem attacks, and poor writing. The angry rant
style can be made to work, but it needs to be backed up by actual skill at
writing, and it _definitely_ doesn't substitute for content.

This article makes me sad.

~~~
tdavis
Read it for entertainment purposes only and it becomes... better.

_Along with the data processing framework, Doug Cutting also included a fault
tolerant, replicated, distributed file system with Hadoop just because fuck
you._ -- I laughed!

Also, it makes fun of Ruby so another point for it.

It's unlikely this article is accurate or really serves any purpose whatsoever
other than immature linkbaity sort of content, but that's fine with me because
I never put it in the category of "real journalism."

~~~
cstejerean
The article was definitely entertaining. I definitely laughed at

"Starling is the Ruby-based messaging system that runs Twitter's backend. Yes,
Twitter, the nonprofit web service known widely for its downtime, dropped its
disaster-producing shitpile on the world."

------
paul
This guy is so annoying. You might think that the non-success of his own
startup (<http://www.techcrunch.com/2008/07/19/pressflip-is-a-belly-flop/>)
would make him slightly more humble, but apparently not.

~~~
brlewis
Maybe he decided to focus on his most sellable skill.

~~~
paul
Yes, apparently those that can, do, and those that cannot, talk trash.

------
gcv
And people complain about Zed Shaw's expletive-filled writing style? At least
Zed has a sense of humor and blasts just about everything in sight. As for
this article, let's see. Hero worship? Check. Ad hominem attacks? Check.
Uninformed comments ("RubyForge? What... is that?")? Check.

~~~
ajross
Yeah, this is disappointing. Mostly because Hadoop is, in fact, a pretty nifty
piece of software. But it's a cluster thing, and individual hackers don't do
clusters for some pretty fundamental economic reasons. So to Dziuba, that
means that the Hadoop people at Yahoo and elsewhere are "grownups". Sigh.

~~~
scott_s
I thought that was an odd distinction. For some reason, he equates open-source
with the Web 2.0, Ruby crowd. The first things that come to my mind are Linux
and gcc, which are two of the most important software projects on the planet.

~~~
pojo
No, he doesn't. He's clearly drawn a distinction between Web 2.0 open source
and the "rest" of open source. His entire point is to belittle and laugh at
Web 2.0 open source attempts.

------
bporterfield
Does anyone on HN have any first-hand experience configuring and using Hadoop?
If so, how was the process - was it straightforward or complex, did it require
lots of peeking into the source, any major pitfalls, etc - and, are you
satisfied with how it's currently working and with your ability to add new
boxes easily?

~~~
strlen
Hi,

I've set up a Hadoop cluster for a start-up I am working at. The initial
reason? The amount of incoming log data was getting too large to process in
time on a single machine, and having had a single graduate course in
distributed systems, I knew a) what is required to go from a single script to
a system distributed across nodes and b) that I am not in a position (alone)
to create this system in a time window that will meet business needs.

As a result, I saw Hadoop as offering a way to: implement a distributed
process whereby records are processed in parallel by multiple machines
without having to worry about implementing locking and message passing;
distribute disk seek times across a cluster of nodes; and pool together disk
space, with the ability to add disk space by simply adding new commodity
machines to the cluster.
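
To make the "records processed in parallel without locking" idea concrete, here is a minimal single-machine sketch of the map/shuffle/reduce shape. It is not Hadoop's API: the log format, the function names, and the use of a thread pool as a stand-in for cluster nodes are all assumptions made for illustration.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical log format (an assumption): "<ip> <status> <bytes>"
LOG_LINES = [
    "10.0.0.1 200 512",
    "10.0.0.2 404 128",
    "10.0.0.1 200 2048",
    "10.0.0.3 500 64",
    "10.0.0.2 200 256",
]


def map_record(line):
    """Map step: turn one record into (key, value) pairs.

    Each call touches only its own record, so no locking is needed.
    """
    _ip, status, size = line.split()
    return [(status, int(size))]


def shuffle(pairs):
    """Group values by key; a real framework does this between the phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_key(key, values):
    """Reduce step: combine every value seen for one key."""
    return key, sum(values)


def run_job(lines, workers=4):
    # The thread pool only stands in for worker machines; it shows the
    # structure of the job, not real cluster parallelism.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_record, lines))
    pairs = [pair for chunk in mapped for pair in chunk]
    grouped = shuffle(pairs)
    return dict(reduce_key(key, values) for key, values in grouped.items())


BYTES_BY_STATUS = run_job(LOG_LINES)
print(BYTES_BY_STATUS)
```

Because the map function never shares state between records, the same job gives the same answer with one worker or many, which is the property that lets a framework spread the work across machines.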

Hadoop fits if you have these specific needs: a) you need to rapidly and
robustly process more data than a single machine can handle (think of it this
way -- if you do data processing with a Perl script on a single machine, will
the data simply be too old by the time the Perl script is done?); b) you need
the ability to add storage for new data without powering machines down; c)
your data is non-relational (i.e. you aren't exporting data from MySQL and
then re-importing it again - if your data can be stored effectively inside a
MySQL cluster and queried in a relational manner, you don't need to use
Hadoop).

Now, if you still think you need Hadoop (in most cases, the answer is "you
don't"):

Getting a minimal Hadoop setup that is useful right away is actually almost
trivial.

However, as your dataset, the number of jobs, and the size of the cluster
grow, along with the expectation that Hadoop will also be a reliable data
store, you will run into plateaus that you'll need to overcome.

To answer your questions directly:

- Getting it going is easy, trivial. Do it first on your desktop (directly if
  you run Linux, otherwise under VMware). The process is straightforward.
- Peeking into the source is not required to get it going.
- Pitfalls:
  * Use enough memory; the JVM is a memory hog. Always make sure you have
    sufficient swap (4x your memory).
  * Find the optimal size of input data for your task (if you are using
    compressed files for input).

(Yes, I will write these learnings up eventually).

~~~
bporterfield
Great reply. My current thinking for the project I'm working on is that I
don't need Hadoop...yet. That said, if it's trivial to implement on a single
machine it might be wise to just begin the project in Hadoop so that I don't
have to make the switch later. It's always a balance between 'premature
optimization is the root of all evil' and 'choosing the right tools for the
right job'.

I'll give it a shot on my desktop to start - thanks for the tips!

------
gfunk911
Wow. That might be the most unprofessional thing I've ever read.

~~~
william42
I, on the other hand, have read other articles from The Register. This is par
for the course.

------
jon_dahl
Heh - I'm working on a MapReduce blog post as we speak (see
<http://tinyurl.com/6oylnr>). So I suppose I'm prone to his attack.

I actually think he makes one good point, that virtually no one actually needs
to use MapReduce, let alone implement it, let alone write a blog tutorial on
how to implement it. That said, virtually no one needs to write a compiler
either, and yet understanding compilers (or even writing one) is not a waste.

Also, his comparison of a message queue (Starling) to MapReduce almost makes
you wonder if he really understands MapReduce. Starling has nothing to do with
MapReduce, and MapReduce is completely unrelated to Twitter's scaling
problems.

~~~
nickb
And Starling is not the reason for Twitter's scaling problems. It's actually
very, very simple and very useful, and it works as advertised.

------
sanj
The question I had is whether Dziuba built something interesting. Zed has.

~~~
william42
He built a search engine called Pressflip which has a neat idea (say which
searches are good or bad) but fails on the execution.

------
geuis
Parts of the article made me laugh, especially the Twitter as drum major quip
(me being an old band/drumcorps member). Not knowing much about hadoop or map
reduce, I gather he doesn't really have a clue based on his dismissal of
Twitter as a failure. Last time I checked, they're going strong.

~~~
sah
I hope someday I manage to build something that people laugh at as much as
they're laughing at Twitter.

------
icey
This is a troll of epic proportions.

Well, not really epic. Mini-series proportion is probably more apt.

~~~
dfranke
I gathered that much just from the title. If you trained a Bayesian filter to
recognize trolling, "grownups" would probably be one of the strongest troll
tokens. adequacy.org even subtitled itself "news for grownups".

------
thomasmallen
I'm pretty much done with that e-rag.

------
tlrobinson
The only thing I got from this article is reaffirmation that Ted Dziuba is a
huge Hadoop fanboy and still hates Twitter.

------
danielrhodes
Amazing article, only because it called out so many types of people
simultaneously.

~~~
william42
It's like Zed Shaw without the factual accuracy.

------
auston
"...but rather as an act of mercy for his keyboard"

-Classic.

------
th0ma5
heh... "Along with the data processing framework, Doug Cutting also included a
fault tolerant, replicated, distributed file system with Hadoop just because
fuck you"

