

Parallel upload to Amazon S3 with python, boto and multiprocessing - rch
http://bcbio.wordpress.com/2011/04/10/parallel-upload-to-amazon-s3-with-python-boto-and-multiprocessing/

======
dlsspy
What a terrible way to achieve network concurrency. You can't possibly think
that each upload is using a "core" (as described in the article). That would
imply that a two core web server could only serve two files at the same time.
Sending a block of data from one file descriptor to another is not a CPU
intensive task.

Just use twisted or similar and run as many as you want concurrently with a
single thread.
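
[For reference: the single-threaded concurrency dlsspy describes can be sketched with stdlib asyncio rather than twisted. `upload_part` is a stand-in coroutine, not a real txAWS or boto call; `asyncio.sleep` models network latency.]

```python
import asyncio

async def upload_part(part_id: int, data: bytes) -> int:
    # Stand-in for a non-blocking HTTP PUT; asyncio.sleep models
    # network latency without doing any real I/O.
    await asyncio.sleep(0.01)
    return len(data)

async def upload_all(parts):
    # Every transfer is in flight at once on ONE thread -- no extra
    # processes or cores are needed for an I/O-bound task.
    return await asyncio.gather(
        *(upload_part(i, p) for i, p in enumerate(parts))
    )

parts = [b"x" * 1024] * 8
sizes = asyncio.run(upload_all(parts))
print(sum(sizes))  # 8192 bytes "uploaded" concurrently
```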

~~~
rch
I see what you're saying, but the process he describes does work. Don't focus
on the word 'cores', just think clients.

E.g. 'If you have a 10GB text file, it can be faster to split the file across
multiple connections when pushing it to S3.' That sounds OK, yeah?
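
[For what it's worth, the splitting itself is just arithmetic over byte offsets. A minimal sketch (`chunk_ranges` is an illustrative helper, not part of boto) that computes one byte range per connection:]

```python
def chunk_ranges(file_size: int, n_parts: int):
    """Byte ranges (start, end) covering the whole file, one per connection."""
    base, rem = divmod(file_size, n_parts)
    ranges, start = [], 0
    for i in range(n_parts):
        # Spread the remainder over the first `rem` parts so sizes
        # differ by at most one byte.
        size = base + (1 if i < rem else 0)
        ranges.append((start, start + size))
        start += size
    return ranges

print(chunk_ranges(10, 3))  # [(0, 4), (4, 7), (7, 10)]
```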

~~~
dlsspy
Yes, but it's a very expensive way to do it -- and the article is about how to
do it this way.

This way incurs 10GB of read and 10GB of write IO splitting the file in
subprocesses, and then runs a separate process for every HTTP upload of the
individual pieces.

I'd be surprised if it took more (or even as much) code to do it with a single
thread on the original file with as many connections as you wanted using
txAWS.

It's the difference between "it works" and "this is a good solution you should
use as a model for your own apps."
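
[The 10GB-read/10GB-write objection is about writing split copies to disk; workers can instead seek into the original file and read only their own byte range. A sketch under that assumption, using stdlib threads since the task is I/O-bound (`read_range` is illustrative, not an API from boto or txAWS):]

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def read_range(path, start, end):
    # Each worker opens the SAME file and seeks to its own offset,
    # so no split copies are ever written to disk.
    with open(path, "rb") as f:
        f.seek(start)
        return f.read(end - start)  # a real worker would PUT this to S3

# Demo: a 100-byte stand-in for the big file.
fd, path = tempfile.mkstemp()
os.write(fd, b"a" * 100)
os.close(fd)

jobs = [(path, 0, 50), (path, 50, 100)]
with ThreadPoolExecutor(max_workers=2) as pool:
    chunks = list(pool.map(lambda j: read_range(*j), jobs))
print(sum(len(c) for c in chunks))  # 100
os.remove(path)
```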

~~~
chapmanb
Thanks for the feedback. I'd be very happy if this inspired better solutions.
My goal was to solve a problem that was slowing down my work, and I reached
for multiprocessing from my toolbox since that's my first choice to run work
in parallel. Since I didn't find an existing implementation, I decided to
share it. It would be great to learn from some alternative approaches.

~~~
dlsspy
I hope that didn't sound like an attack. I'm glad you solved your problem and
communicated it. I did see that while twisted does have multipart file upload
in a branch on lp, it's not actually merged in yet.

The problem is just one of language and toolkit abstractions. Too many python
APIs block unnecessarily, with no way out. With a non-blocking version of that
API, it would be obvious how to run uploads in parallel (as it is with the
twisted one).

This is part of the reason for node.js' rise in popularity -- or at least its
reason for existence. Node took the other extreme, where nothing blocks. Its
community is now coming up with ways to simulate blocking APIs, in efforts
much like what you've done to simulate non-blocking APIs in a world full of
blocking APIs.

------
rch
from the author's comments: "It’s way faster. Like, way faster. Seriously, I
didn’t try to benchmark because it will depend on your network, but it helped
me with uploads that were going painfully slow (~12 hours -> ~3 hours, using 8
cores)."

:)

