

Cloud MapReduce - a fast and lean alternative to Hadoop on AWS - helwr
http://code.google.com/p/cloudmapreduce/

======
petewarden
Summary - implements MapReduce using Amazon's native cloud offerings like S3,
SQS and SimpleDB instead of Hadoop's reliance on traditional OS filesystems
and services.

Interesting idea. I'm wary of the reliability issues, though; I'm doing a lot
with SimpleDB and there are plenty of landmines.

Here's their technical paper:

<http://sites.google.com/site/huanliu/cloudmapreduce.pdf>

------
vicaya
The performance evaluation vs Hadoop looks bogus to me. S3, SQS and SimpleDB
run on real (non-virtualized) hardware. You need a lot more Hadoop nodes to
make it comparable. At least throughput per AWS dollar should be reported.

The ~100k-file inverted index test is completely unfair to Hadoop: launching
100k map tasks, one per small file, is ridiculous, as it's basically measuring
JVM startup time. People/crawlers typically pack all these pages into large
map/sequence files, and Hadoop would then automatically launch one map task
per chunk (default 128MB).
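To put rough numbers on that point, here's a back-of-the-envelope sketch. The file size, page count, and per-task JVM startup cost are assumptions for illustration, not figures from the paper:

```python
import math

# Assumed workload: 100k small crawled pages of ~50 KB each.
NUM_FILES = 100_000
AVG_FILE_BYTES = 50 * 1024
# One split per chunk when the pages are packed into a SequenceFile
# (using the 128 MB chunk size mentioned above).
CHUNK_BYTES = 128 * 1024 * 1024
# Assumed per-task JVM startup cost, in seconds.
JVM_STARTUP_SEC = 1.0

def map_tasks_per_file():
    # Naive setup: one map task per small file.
    return NUM_FILES

def map_tasks_packed():
    # Packed setup: one map task per 128 MB chunk of the combined data.
    total_bytes = NUM_FILES * AVG_FILE_BYTES
    return math.ceil(total_bytes / CHUNK_BYTES)

naive = map_tasks_per_file()    # 100,000 tasks
packed = map_tasks_packed()     # a few dozen tasks
startup_hours = naive * JVM_STARTUP_SEC / 3600
print(naive, packed, round(startup_hours, 1))
```

Under these assumptions the naive run pays for 100,000 JVM startups (on the order of a full day of wasted CPU time spread across the cluster) while the packed run launches only a few dozen tasks, so the benchmark is dominated by scheduling overhead rather than the actual indexing work.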

Fetching data to EC2 for computation is also a step backward. It cannot scale
to large data, as the cluster will become switch-bound much earlier than the
"kosher" map-reduce, where data locality is observed.

In any case, it's neither fast nor lean (it's basically Java glue code over
S3/SQS/SimpleDB, and seems to have a larger overall carbon footprint than
Hadoop).

