

Map Reduce for the People - rjurney
http://blog.weatherby.net/2009/04/map-reduce-for-the-people.html

======
kurtosis
In my experience the biggest barrier to dealing with huge data volumes isn't
the cost of supercomputer time - it's the time it takes to

(a) curate and prepare the data

(b) develop a model to analyze it, and adapt that model to whatever platform
I'm working with, e.g. Hadoop or MPI

(c) get the results into a presentable form.

Compared to the hours of labor involved in these three activities,
supercomputer time isn't really that expensive. It can be leased if you know
where to look.

~~~
dmix
> if you know where to look.

Wasn't that the point of the article? Supercomputing is now accessible to "the
people" so part of the problem is close to being solved.

~~~
kurtosis
Yeah, no doubt cheaper cycles are a good thing. So help me out here - do you
know of any supercomputing (MPI or Hadoop) apps where the bottleneck was
finding time on a computer? I think development cost is still the bottleneck.

In bio and geo, anyone with a grant can get access with little difficulty. I
guess 'people with grants' covers most of the people here.

~~~
rjurney
Could you help me out here - which is easier to get started developing with -
MPI or Hadoop?

It's not just about cheap processor time, it's about ease of use. Hadoop is a
platform that does much of the hard part for you, giving you an enormous head
start on solving problems in parallel, because all you really have to think
about is the algorithm you're executing. Thinking in MapReduce is still hard,
but it's a whole lot easier than thinking in MPI, and it takes a lot less
time to get real computations happening on your data.
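
To make that concrete, here's roughly what the canonical word count job looks
like as a pair of Hadoop Streaming scripts in Python - a minimal sketch,
assuming a Streaming setup, with illustrative file names:

    #!/usr/bin/env python
    # mapper.py: emit a (word, 1) pair for every word on stdin.
    import sys

    for line in sys.stdin:
        for word in line.split():
            print("%s\t1" % word)

    #!/usr/bin/env python
    # reducer.py: Hadoop sorts mapper output by key, so all counts for
    # a given word arrive together; accumulate and emit the totals.
    import sys

    current, count = None, 0
    for line in sys.stdin:
        word, n = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print("%s\t%d" % (current, count))
            current, count = word, 0
        count += int(n)
    if current is not None:
        print("%s\t%d" % (current, count))

You'd launch it with the streaming jar (its path varies by install),
something like:

    hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
        -input corpus/ -output counts/ \
        -mapper mapper.py -reducer reducer.py \
        -file mapper.py -file reducer.py

Neither script mentions machines, partitioning, or message passing - the
framework handles all of that, which is the whole point.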

The cloud computing time could be MORE expensive than supercomputer time. That
doesn't matter. The point is that this stuff is now accessible to a much wider
group than MPI was, and they're learning the skills to crunch gobs of data.

It's pretty clear that you're a 'smart kid' and that both MPI and MapReduce
are pretty easy for you. That's not the case for most people. You're talking
about hard scientists; I'm talking about normal developers.

~~~
kurtosis
You have a good point, but for what it's worth, here's the practical example
that is the basis of my opinion on this. I had a very large corpus of text
and I wanted to construct a word association graph. This involves multiplying
together lots of sparse 10^5 x 10^9 matrices that are too large to fit in a
single machine's memory, so the computation has to run out of core. Getting
together 6 computers and setting up Hadoop and HDFS took me about 2 days,
starting from nothing.

Figuring out how to do all of these out-of-core sparse matrix manipulations
honestly took me about a week of tinkering before I had anything worth even
trying. I'm certainly not an expert at this stuff, so maybe a better coder
could have done it a lot faster, but that's my problem.
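
For concreteness, the textbook way to phrase one of these multiplies in
MapReduce - a sketch of the general technique, not exactly the code I ended
up with - is two passes over the nonzero entries: join A and B on the shared
inner index j, then sum the partial products for each output cell. In Python:

    from collections import defaultdict

    # Pass 1 map: key every nonzero by the shared inner index j, so a
    # reducer sees each a_ij alongside every b_jk it must multiply.
    def map1(record):
        tag, row, col, val = record    # ("A", i, j, a) or ("B", j, k, b)
        if tag == "A":
            yield col, ("A", row, val)
        else:
            yield row, ("B", col, val)

    # Pass 1 reduce: emit one partial product per (a_ij, b_jk) pair.
    # (Real systems block the matrices so no single j floods a reducer.)
    def reduce1(j, values):
        a = [(i, v) for tag, i, v in values if tag == "A"]
        b = [(k, v) for tag, k, v in values if tag == "B"]
        for i, av in a:
            for k, bv in b:
                yield (i, k), av * bv

    # Pass 2 reduce: sum the partial products for each cell (i, k).
    def reduce2(cell, partials):
        yield cell, sum(partials)

    # Tiny in-memory driver that fakes the shuffle, just to show the
    # data flow; on a cluster the framework does this grouping.
    def multiply(nonzeros):
        by_j, by_cell = defaultdict(list), defaultdict(list)
        for rec in nonzeros:
            for j, v in map1(rec):
                by_j[j].append(v)
        for j, vals in by_j.items():
            for cell, p in reduce1(j, vals):
                by_cell[cell].append(p)
        for cell, ps in by_cell.items():
            yield from reduce2(cell, ps)

    # A = [[1, 2]], B = [[3], [4]]  =>  A*B = [[1*3 + 2*4]] = [[11]]
    nz = [("A", 0, 0, 1), ("A", 0, 1, 2), ("B", 0, 0, 3), ("B", 1, 0, 4)]
    print(list(multiply(nz)))          # -> [((0, 0), 11)]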

What would be a real advance in bringing massive-scale data analysis to "the
people" would be something like R or MATLAB on top of Hadoop that makes doing
things like this completely automagic.
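
Purely to illustrate the idea - the hadoop_matrix module, its load(), and
the overloaded * below are all invented for this example, not a real
library - the 'automagic' version of my word association job might read:

    # Hypothetical sketch - hadoop_matrix does not exist. The point is
    # that the two explicit passes above collapse into a few lines that
    # a framework could plan as MapReduce jobs behind the scenes.
    import hadoop_matrix as hm

    A = hm.load("hdfs:///corpus/term_doc")  # sparse, out of core
    C = A * A.T                             # word association graph
    C.save("hdfs:///corpus/word_assoc")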

~~~
rjurney
Regarding the real advance: that stuff is the very 'revolution' that's coming
that I talk about :) VisiCalc kinda sucked, but Excel is pretty good. It's
still very early, but the way forward has become clear.

------
bcl
I think this is the leading edge of the next revolution. MapReduce, Hadoop,
and cloud CPU cycles are the building blocks, but we still need the tools
that make it easier for scientists to operate on their datasets without
having to be hard-core coders. I wouldn't be surprised to see Matlab, or
some offering from Wolfram, take on that role.

~~~
rjurney
The examples were from science because they struck me as the most powerful,
but I think this will go well beyond science to many problems. It started with
search, after all.

