

The map-reduce cargo cult - sllrpr
http://blog.locut.us/2012/04/16/the-map-reduce-cargo-cult/

======
gfodor
Uh, no. The reason you copy your data to EC2 is so you can perform arbitrary
transformations on it in a scalable way. I can dump 10GB of data to EC2 and
then use 1000 machines to perform some processing, during which there are many
terabytes of intermediate data, and an output which is less than a few
kilobytes. (For example, computing the 10 most popular search queries and
their click-through rates over the last year.)
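A minimal sketch of that kind of job, simulated in plain Python rather than on
a real cluster, with made-up (query, clicked) log records standing in for a
year of search logs:

```python
from collections import defaultdict
import heapq

# Hypothetical log records: (query, clicked) pairs.
logs = [
    ("weather", True), ("news", False), ("weather", False),
    ("news", True), ("maps", True), ("weather", True),
]

# Map phase: emit (query, (impressions, clicks)) for each record.
mapped = [(q, (1, 1 if c else 0)) for q, c in logs]

# Shuffle/reduce phase: sum impressions and clicks per query.
totals = defaultdict(lambda: [0, 0])
for query, (imps, clicks) in mapped:
    totals[query][0] += imps
    totals[query][1] += clicks

# Final pass: top 2 queries by impressions, with click-through rate.
top = heapq.nlargest(2, totals.items(), key=lambda kv: kv[1][0])
result = [(q, imps, clicks / imps) for q, (imps, clicks) in top]
print(result)
```

On a real cluster the intermediate keyed records live on the cluster's disks
between phases, which is where the scratch space goes.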

A tip for the author: if you notice a number of your peers doing something,
perhaps you should give them the benefit of the doubt and inquire further.

~~~
sllrpr
If the best way you can think of to determine the 10 most popular search
queries and their click-through rates involves terabytes of intermediate data
for only 10GB of input data, then you're doing something very wrong.

~~~
gfodor
The point is that small inputs can have large intermediate scratch datasets
and even smaller outputs.

Edit: Even more to the point, being able to do scalable transformations like
computing top-N CTR on a lot of data with little regard to available
computing/network/disk resources is the reason why you would copy your input
data to EC2 for processing. If the author has a point to make, he failed to do
so beyond making himself look like someone who enjoys labeling things he
doesn't understand as a "cargo cult."

~~~
sllrpr
> The point is that small inputs can have large intermediate scratch datasets
> and even smaller outputs.

Perhaps in some rare circumstances (although not the one you cited); however,
most people use map-reduce for aggregation of one form or another, which
doesn't require vast amounts of intermediate data unless you are being
deliberately inefficient.

> Even more to the point, being able to do scalable transformations like
> computing top-N CTR on a lot of data with little regard to available
> computing/network/disk resources is the reason why you would copy your input
> data to EC2 for processing.

Actually you'd copy it to S3 for processing, and then it would need to be
downloaded into EC2 (unless you want to leave your EC2 instances running,
which you won't unless you have a large number of shares in Amazon). It's hard
to imagine situations where it is faster to move the data across Amazon's LAN,
than to simply process it on the machine it's already on.

> If the author has a point to make he failed to do so beyond making himself
> look like someone who enjoys labeling things he doesn't understand as a
> "cargo cult."

The author looks like someone pointing out that the original purpose of map-
reduce is that you do your computations where your data is, and that moving
your data so that you can do map-reduce on it misses the point. The author is
correct.

You might have a stronger argument if you could show some common non-contrived
situations where there would be a relatively small amount of input data but
vast amounts of intermediate data. You haven't yet.

~~~
gfodor
Collaborative filtering.

I use S3 and EC2 interchangeably when it comes to EMR, which is what I presume
the author is referring to. Most EMR jobs consume and write their data to S3
and use a temporary HDFS cluster for scratch. By and large, scratch data ends
up being much, much larger than the original inputs, if for no other reason
than what is needed during the shuffle/sort stage. (I am assuming we are
talking about non-trivial map-reduce jobs here, not word counters, where you
have many reduce steps.) It goes without saying that there are many
applications where user-created functions will generate more data than they
consume (combinatorics, etc.)

Data locality is but one reason to use map-reduce. In practice, EMR lets you
draw upon elastic computing resources to process data however you like. It
provides developer and cluster isolation, as well as linearly scalable I/O
from S3. The author sounds like someone who may have read the academic papers
and a few books but hasn't used these tools in practice.

~~~
sllrpr
> Collaborative filtering.

What collaborative filtering algorithm are you using that requires terabytes
of intermediate storage for gigabytes of input data?

I'm familiar with most approaches to CF (SVD, gradient descent, etc) and I
can't think of any that require large amounts of intermediate storage.

> By and large, scratch data ends up being much, much larger than the original
> inputs, if for no other reason than what is needed during the shuffle/sort
> stage

I can't think of a single practical situation where you couldn't do your
sorting online as you progress through the data. Again, the overhead of moving
the data to and from S3 would be greater than that of processing the data
locally (unless Amazon's LAN is faster than a SATA bus, which is unlikely).
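For instance, a top-N aggregation can be done in one streaming pass with a
counter and a bounded selection - a sketch over a hypothetical query stream:

```python
import heapq
from collections import Counter

# One streaming pass over the input: the only intermediate state is a
# counter plus the final N-element selection -- no external sort needed.
def top_n_queries(records, n):
    counts = Counter()
    for query in records:
        counts[query] += 1
    return heapq.nlargest(n, counts.items(), key=lambda kv: kv[1])

print(top_n_queries(["a", "b", "a", "c", "a", "b"], 2))
# [('a', 3), ('b', 2)]
```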

> The author sounds like someone who may have read the academic papers and a
> few books but hasn't used these tools in practice.

You keep attacking the author in various ad hominem ways, yet you still
haven't provided a single uncontrived example of the small-input-data,
large-intermediate-data scenario that your argument relies upon.

~~~
gfodor
My argument does _not_ rely upon it; it was one example of _several_ reasons
why running map-reduce jobs on the AWS cloud has nothing to do with the amount
of input data you are moving around. I am not going to go into even more
detail about specific jobs I run daily that generate a large amount of
intermediate data, because unless I paste the source code in this thread and
write a paper on it, I assume you won't believe me that there are, in the
space of "all map-reduce jobs", jobs that can generate more data than they
input.

If you write a trivial map-reduce job using Cascading that has 10 reducers,
and each reduce step shuffles the data on a different grouping key, you will
find that Hadoop alone is generating more data than you input. But again, this
isn't the point. The point is that the author is calling anyone using AWS for
map-reduce a "cargo cult" based upon an academic argument that the sole
purpose of map-reduce is to move computation to your data, hence if you copy
your data you are missing the point. In practice, the cost of uploading your
data to S3 is a footnote compared to the computational flexibility and use
cases that become possible once you are able to run arbitrary transformations
on that data via EMR. You keep ignoring my main point and focusing on my
simplistic examples, reading way more into them than was intended.

~~~
sllrpr
> it was one example of several reasons why running map-reduce jobs on the AWS
> cloud has nothing to do with the amount of input data you are moving around

It would be an example if you had backed up your assertion that collaborative
filtering required large amounts of intermediate data, but apparently you are
unwilling or, more likely, unable to do this.

> I am not going to go into even more detail about specific jobs I run daily
> that generate a large amount of intermediate data, because unless I paste
> the source code in this thread and write a paper on it, I assume you won't
> believe me that there are, in the space of "all map-reduce jobs", jobs that
> can generate more data than they input

Even more detail? You haven't given me any detail! You've yet to give me a
single example of a practical situation where a task involves much larger
amounts of intermediate data than its input data. I'm asking you to back up
your argument; I'm not asking for access to your source code.

> If you write a trivial map-reduce job using Cascading that has 10 reducers,
> and each reduce step shuffles the data on a different grouping key, you will
> find that Hadoop alone is generating more data than you input

If it's so trivial, why can't you give me a single practical use-case?

> In practice, the cost of uploading your data to S3 is a footnote compared to
> the computational flexibility and use cases that become possible once you
> are able to run arbitrary transformations on that data via EMR

Yes, apparently so many use-cases that you can't provide a single example of
one!

~~~
gfodor
Apparently I am either horrible at explaining myself or you are being
deliberately obtuse. The argument about intermediate data size is sufficient,
but not necessary, to show that the author has no point to make.

First, let's show that I can write a job that produces more data than it
inputs. I have a map of user to score, and want to compute pairwise summed
scores for every user pair. This clearly produces O(n^2) outputs. Computing
pairwise scores is a common algorithm for recommender systems. (I realize in
practice you generally do not compute the entire space because it will be too
slow. However, your output will be closer to n^2 than n, i.e., it will be much
larger than your input.)
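A toy version of that pairwise job, with hypothetical user scores and plain
Python standing in for an actual map-reduce job:

```python
from itertools import combinations

# Hypothetical map of user -> score.
scores = {"u1": 3, "u2": 5, "u3": 2, "u4": 7}

# Pairwise summed scores: n inputs yield n*(n-1)/2 output records,
# so the output grows as O(n^2) relative to the input.
pairs = {(a, b): scores[a] + scores[b]
         for a, b in combinations(sorted(scores), 2)}

print(len(scores), len(pairs))  # 4 inputs -> 6 pair records
```

With a million users, even a 1% sample of the pair space dwarfs the input.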

Now, this is a complicated example. A simpler example is "I want to compute
aggregations on all the fields in my log file." If you have N fields, Hadoop
is going to sort the data N times. I.e., you will be producing lots of
intermediate data, almost certainly more than the input size, just by using
map-reduce (the code doing this merge sort is not your code, it's Hadoop's).
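To make the blow-up concrete, here is a toy version of that per-field
aggregation, with hypothetical log rows and the shuffle simulated in plain
Python:

```python
from collections import defaultdict

# Hypothetical log rows with three fields each.
rows = [
    {"country": "us", "browser": "ff", "status": "200"},
    {"country": "de", "browser": "ff", "status": "404"},
]

# Map phase: one keyed record per field per row, so the shuffled
# intermediate data is N_fields times the input before any reduce runs.
mapped = [((field, value), 1) for row in rows for field, value in row.items()]

# Reduce phase: count per (field, value) group.
counts = defaultdict(int)
for key, one in mapped:
    counts[key] += one

print(len(mapped))  # 6 intermediate records from 2 input rows
```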

But again, the point you keep missing, and conveniently ignore when you quote
my posts, is that the point of map-reduce on AWS is not data locality
(obviously) but downstream flexibility. I can run 100 jobs, in parallel, on
10k machines, and output much more data than I input, without running a
cluster of my own, and I get to pay by the hour. I am isolated from other devs
and spin the machines down when finished. If you buy the argument that this is
a useful feature (as any EMR customer would attest to), then this too is a
separate, more pragmatic reason why the author has no idea what he is talking
about when he says all EMR customers are a cargo cult.

~~~
sllrpr
> First, let's show that I can write a job that produces more data than it
> inputs. I have a map of user to score, and want to compute pairwise summed
> scores for every user pair. This clearly produces O(n^2) outputs. Computing
> pairwise scores is a common algorithm for recommender systems. (I realize in
> practice you generally do not compute the entire space because it will be
> too slow. However, your output will be closer to n^2 than n, i.e., it will
> be much larger than your input.)

But that's exactly it - you admit yourself that this is a contrived example!
For all but trivial numbers of users, N^2 storage requirements would be
impractical for any system; no programmer in their right mind would take such
an approach.

And this is the point that you seem to be missing: the question is not whether
you can come up with some contrived scenario where there is vastly more
intermediate data than input data, but whether you can come up with a
realistic, practical example - and apparently you cannot.

> A simpler example is "I want to compute aggregations on all the fields my
> log file."

In this case the amount of intermediate data is no greater than the amount of
output data.

> But again, the point you keep missing, and conveniently ignore when you
> quote my posts, is that the point of map-reduce on AWS is not data locality
> (obviously) but downstream flexibility. I can run 100 jobs, in parallel, on
> 10k machines, and output much more data than I input, without running a
> cluster of my own, and I get to pay by the hour. I am isolated from other
> devs and spin the machines down when finished.

These are benefits of EC2, not of map-reduce. The critique was specific to
map-reduce, not of the entire principle of on-demand computing.

> If you buy the argument that this is a useful feature (as any EMR customer
> would attest to)

Strawman. You're attempting to conflate the general benefits of on-demand
computing with the specific benefits of EMR. The author was criticizing EMR in
particular, not the entire principle of on-demand computing.

> then this too is a separate, more pragmatic reason why the author has no
> idea what he is talking about when he says all EMR customers are a cargo
> cult

And yet you still haven't been able to cite any realistic, non-contrived
examples of where EMR is the appropriate tool for the job.

~~~
gfodor
Ok, it's clear now that you have horrible reading comprehension and/or have
never used Hadoop in practice. If you perform an aggregation on 10 fields in a
log file, a common pattern, you are going to be generating more intermediate
data than your log files during the shuffle. (Hadoop is the one generating it,
in this case.) If you generate a dataset that has pairwise scores for users,
even if you are only sampling 1% of them, then you are generating _way_ more
data than your input. (You are generating this data in this example, not
Hadoop, which is why I included it.) Say N=1M. You want to compute the top 1%
of pairwise scores (pruned via some locality-sensitive hashing). Guess what,
you are computing N^2 * 0.01 = 10B records. Perfectly tractable for a big-data
system. You seem too busy yelling at me to even read what I am writing, since
I stated this clearly: "(I realize in practice you generally do not compute
the entire space because it will be too slow. However, your output will be
closer to n^2 than n, i.e., it will be much larger than your input.)"

You say: "These are benefits of EC2, not of map-reduce. The critique was
specific to map-reduce, not of the entire principle of on-demand computing."
This is patently false. The author's criticism is that people using EMR are a
cargo cult and do not understand the purpose of map-reduce; i.e., they should
not be using EMR, and he is so clever for realizing the folly of their ways.
My claim is that the author does not understand the purpose of EMR. By
definition, EMR brings with it the benefits of EC2. I don't see how this is a
strawman.

At least I am sure now that it's not my communication skills: I illustrated
two perfectly fine examples and you glossed right over them without
understanding. I've already spent enough time on this, arguing in good faith,
and at this point I can at least be confident that you have no interest in
understanding what I am saying and instead want to childishly prove me wrong
on the internet.

