

Social Graph Analysis using Elastic MapReduce and PyPy - jordanmessina
http://postneo.com/2011/05/04/social-graph-analysis-using-elastic-mapreduce-and-pypy

======
yummyfajitas
It's usually more efficient to use distcp to copy the data to ephemeral
storage:

    # copy the input from S3 onto the cluster's ephemeral storage
    hadoop distcp s3n://<id>:<secret>@bucket/input_data /input_data
    # run the job against the local copy
    hadoop jar myjar.jar org.myorganization.module.jobname /input_data /output_data
    # ....other steps....
    # copy the results back out to S3
    hadoop distcp /output_data/ s3n://<id>:<secret>@bucket/output_data

Doing it this way, all spills/intermediate steps live on (very fast) ephemeral
storage. Additionally, the distcp command is pretty fast since it is a
distributed copy - all nodes will saturate their connections on the copy.

Also, the author's advice to "make processing faster, decompress it, split it
in to lots of smaller files" is probably not the optimal way to do it. Hadoop
isn't great at handling lots of small files.

http://www.cloudera.com/blog/2009/02/the-small-files-problem/

I'm guessing that he gets a speedup because he is using raw s3:// as a
filesystem rather than s3n://. If I understand the Hadoop S3 connector
correctly, this means Hadoop will use a single connection to pull a single
file, so a single 10GB file will max out one node's network connection while
the rest go unused. Splitting the file up into 1-million-line blocks lets
multiple nodes pull simultaneously. But storing files using s3n:// allows this
anyway, and results in large files being split into 64MB blocks. It also lets
you store files larger than 5GB, or whatever the S3 file size limit is.
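
For example, you can point a streaming step straight at a prefix of large
files and let Hadoop handle the splitting. A rough sketch with boto (the
bucket names, script paths, and instance count below are all made up):

    # rough sketch: run an EMR streaming step directly against large
    # files under an s3n:// prefix (all names below are hypothetical)
    from boto.emr.connection import EmrConnection
    from boto.emr.step import StreamingStep

    conn = EmrConnection('<aws-access-key>', '<aws-secret-key>')

    step = StreamingStep(
        name='count followers',
        mapper='s3n://my-bucket/scripts/mapper.py',
        reducer='s3n://my-bucket/scripts/reducer.py',
        input='s3n://my-bucket/input/',    # a few large files, no pre-splitting
        output='s3n://my-bucket/output/',
    )

    jobflow_id = conn.run_jobflow(
        name='social graph analysis',
        log_uri='s3n://my-bucket/logs/',
        steps=[step],
        num_instances=8,
    )
    print(jobflow_id)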

Now, a question I'd really like to know the answer to: what advantage is there
to using EMR vs. Whirr?

[edit: adjusted language - upon rereading it seemed harsh, which was not my
intent.]

~~~
mcroydon
Thanks for the tip, I didn't think to copy the data to ephemeral storage like
that. That'll probably speed things up a lot.

I ended up splitting the data into a relatively small number (~200) of ~30MB
gzipped files in order to initially saturate the mappers and speed things up.
If that's not necessary after moving to ephemeral storage, that's fine by me!
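
The split itself is just something along these lines (a rough sketch; the
file names and lines-per-chunk number are made up):

    # sketch: break one big edge-list file into gzipped chunks so every
    # mapper has work from the start (file names and chunk size are
    # hypothetical)
    import gzip
    import itertools

    LINES_PER_CHUNK = 1000000  # tune until each gzipped chunk lands near ~30MB

    with open('edges.txt', 'rb') as infile:
        for i in itertools.count():
            chunk = list(itertools.islice(infile, LINES_PER_CHUNK))
            if not chunk:
                break
            with gzip.open('edges-%03d.txt.gz' % i, 'wb') as out:
                out.writelines(chunk)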

~~~
yummyfajitas
It's not necessary once your files live on ephemeral storage, but it would be
necessary if you want the distcp operation to be fast. But again, the S3 block
filesystem won't have this problem.

------
aksbhat
I have tried using both Hadoop (a 55-node cluster at Cornell) and a single AWS
High-Memory Double Extra Large instance with 32GB of memory.

I have found that since Twitter's social graph is small enough to fit in
memory, a single instance with a huge amount of RAM is much more efficient,
especially when your algorithm iterates over nodes in the network.
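
For instance, a pass of something like label propagation is just a tight loop
over an in-memory adjacency list (a sketch; the edge-list file name and format
are assumed, and the graph is treated as undirected for simplicity):

    # sketch of why in-memory wins for iterative algorithms: a few
    # label-propagation sweeps over an adjacency list that fits in RAM
    from collections import defaultdict, Counter

    adj = defaultdict(list)
    with open('edges.txt') as f:           # lines like "follower followee"
        for line in f:
            a, b = line.split()
            adj[a].append(b)
            adj[b].append(a)

    labels = {node: node for node in adj}  # every node starts in its own community

    for _ in range(10):                    # a handful of sweeps over all nodes
        for node, neighbors in adj.items():
            counts = Counter(labels[n] for n in neighbors)
            labels[node] = counts.most_common(1)[0][0]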

You can read about it here:

Hadoop-based results: www.akshaybhat.com/LPMR/

Results using a single High-Memory AWS instance:
www.akshaybhat.com/LPMR/GRAPHLAB

Even the startup Hunch has taken a similar approach and uses a single machine
with a large amount of memory rather than a Hadoop cluster.

~~~
mcroydon
Good call. The dataset was large enough that it didn't feel silly to use
something like MapReduce, but the same thought has been in the back of my head
the whole time.

------
mcroydon
With 8 instances and the number of files the input was split into, there were
definitely both map and reduce tasks waiting for a runner. I don't know
exactly how much, but I'm pretty sure I'm paying a heavy IO tax by using S3.

------
kingkilr
Only 1/3 faster? Unacceptable! I wonder just how IO-bound it is since it's not
using HDFS; it'd be nice to know what speedup we're getting on the CPU-bound
portion of the task.

