
Pushing the Limits of Amazon S3 Upload Performance - sorenbs
http://improve.dk/archive/2011/11/07/pushing-the-limits-of-amazon-s3-upload-performance.aspx
======
aaronjg
I have seen some people get performance gains from using UDP. TCP tends to
back off rather aggressively after even a single dropped packet, and the
connection is slow to get back to speed. With UDP, the uploader can just
stream packets with sequential ids, and the receiver can respond with a
negative ack if it misses a packet in the series.

UDT (UDP-based Data Transfer, open source): <http://udt.sourceforge.net>

Bandwidth Challenge:
<http://www.hpcwire.com/hpcwire/2009-12-08/open_cloud_testbed_wins_bandwidth_challenge_at_sc09.html>

Aspera: <http://www.asperasoft.com/>
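
A toy sketch of that NAK pattern over raw sockets (illustrative only; the
port and framing are made up, and real tools like UDT layer congestion
control on top, which this omits):

    import socket
    import struct

    PORT = 9999  # hypothetical port

    def send(chunks, addr=("127.0.0.1", PORT)):
        # Stream every chunk tagged with a sequential id, then retransmit
        # any id the receiver NAKs until the NAKs stop arriving.
        chunks = list(chunks)
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.settimeout(0.5)
        for seq, chunk in enumerate(chunks):
            sock.sendto(struct.pack("!I", seq) + chunk, addr)
        try:
            while True:
                nak, _ = sock.recvfrom(4)
                (seq,) = struct.unpack("!I", nak)
                sock.sendto(struct.pack("!I", seq) + chunks[seq], addr)
        except socket.timeout:
            pass  # no NAKs for a while: assume the receiver is done

    def receive(expected, port=PORT):
        # Collect packets by id; whenever a gap appears in the sequence,
        # send back a negative ack naming the missing id. (A lost *final*
        # packet needs a timeout, which this toy skips.)
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.bind(("", port))
        got, next_seq = {}, 0
        while len(got) < expected:
            pkt, addr = sock.recvfrom(65536)
            (seq,) = struct.unpack("!I", pkt[:4])
            got[seq] = pkt[4:]
            for missing in range(next_seq, seq):
                if missing not in got:
                    sock.sendto(struct.pack("!I", missing), addr)
            next_seq = max(next_seq, seq + 1)
        return b"".join(got[i] for i in range(expected))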

~~~
SpikeGronim
TCP backs off aggressively to avoid congestive collapse of the network. It is
backing off in order to preserve shared infrastructure. If everybody used your
UDP trick then the Internet would literally collapse. This is detailed at
<http://en.wikipedia.org/wiki/Network_congestion#Congestive_collapse>.
You should respect the rules of the road.

~~~
jon_dahl
Isn't this ultimately up to the congestion control algorithm that is used
on top of UDP? Aspera and UDT, for example, have tunable congestion control
that can be more or less greedy.

I'm not a protocol hacker, so hopefully one will weigh in here. But I do
remember some Sky Is Falling discussion over uTorrent's use of UDP for
transfer, and the uTorrent line was always that they were implementing UDP
transfer in a way that played nice with TCP.

~~~
sern
uTorrent's UDP congestion control algorithm (LEDBAT) goes further than playing
nice with TCP. Unlike TCP, which only responds to packet loss, LEDBAT also
responds to delay. This makes it yield remarkably quickly to anything else
that might use the link.
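
The core of that delay response fits in a few lines. A toy version of the
window update (constants are illustrative; RFC 6817 has the real algorithm):

    TARGET = 0.100  # target queuing delay in seconds (RFC 6817 caps it at 100 ms)
    GAIN = 1.0

    def update_cwnd(cwnd, base_delay, current_delay):
        # Queuing delay is how far the latest RTT sample sits above the
        # lowest delay ever observed on the path.
        queuing_delay = current_delay - base_delay
        off_target = (TARGET - queuing_delay) / TARGET
        # Below target: grow gently. Above target: shrink immediately,
        # which is why LEDBAT yields so quickly to competing traffic.
        return max(1.0, cwnd + GAIN * off_target / cwnd)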

------
twp
Timely! I'm just in the process of uploading 32 million objects to S3 :-)

The parallel upload code I'm using is written in Python with the
multiprocessing and boto libraries, and is here:

<http://github.com/twpayne/s3-parallel-put>

It has some nice features, like reading values directly from uncompressed tar
files - this means your disk heads will scan linearly rather than seeking
around. It can also gzip and set Content-Encoding, restart interrupted
transfers from its own log files, and do an MD5 sum check to avoid putting
keys that are already set.
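
The core pattern is roughly this (a simplified sketch using boto 2.x;
bucket and key names are placeholders, and this leaves out the tar reading,
gzip, logging, and MD5 checks):

    from multiprocessing import Pool
    import boto

    BUCKET = "my-bucket"  # placeholder

    def put_file(args):
        key_name, path = args
        # Each worker process opens its own connection; boto connections
        # shouldn't be shared across processes.
        bucket = boto.connect_s3().get_bucket(BUCKET)
        bucket.new_key(key_name).set_contents_from_filename(path)
        return key_name

    if __name__ == "__main__":
        jobs = [("photos/1.jpg", "/data/1.jpg"), ("photos/2.jpg", "/data/2.jpg")]
        pool = Pool(processes=32)  # uploads are I/O-bound, so oversubscribe cores
        for done in pool.imap_unordered(put_file, jobs):
            print(done)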

Comments and feedback welcome.

------
adpowers
Sadly no mention of multipart upload. If you use S3's multipart upload feature
you can get similar performance for one file by uploading multiple parts of
the file at once. I saw a substantial throughput improvement when I
implemented some multipart upload code, even when uploading from EC2.
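
The pattern looks roughly like this (a hedged sketch with boto 2.x; paths
and sizes are placeholders, parts must be at least 5 MB except the last,
and each thread gets its own connection because boto connections aren't
thread-safe):

    import math
    import os
    import threading
    from io import BytesIO

    import boto
    import boto.s3.multipart

    BUCKET = "my-bucket"       # placeholder
    PATH = "/data/big.bin"     # placeholder
    PART_SIZE = 50 * 1024 * 1024

    def upload_part(mp_id, key_name, part_num, offset, size):
        # Rebuild a MultiPartUpload handle on a fresh connection so this
        # thread can upload its slice of the file independently.
        b = boto.connect_s3().get_bucket(BUCKET)
        mp = boto.s3.multipart.MultiPartUpload(b)
        mp.id, mp.key_name = mp_id, key_name
        with open(PATH, "rb") as f:
            f.seek(offset)
            mp.upload_part_from_file(BytesIO(f.read(size)), part_num)

    bucket = boto.connect_s3().get_bucket(BUCKET)
    mp = bucket.initiate_multipart_upload(os.path.basename(PATH))
    total = os.path.getsize(PATH)
    threads = []
    for i in range(int(math.ceil(total / float(PART_SIZE)))):
        offset = i * PART_SIZE
        size = min(PART_SIZE, total - offset)
        t = threading.Thread(target=upload_part,
                             args=(mp.id, mp.key_name, i + 1, offset, size))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()
    mp.complete_upload()  # S3 stitches the parts into one object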

~~~
orcadk
Totally true. However, I had to limit the scope of this post somewhere, and in
this case, I focused on multithreaded upload of smaller objects, as mentioned
in the first paragraph. We do have some larger object uploads, though they are
a minority. I may do a separate followup post on high performance single file
uploads through multipart uploading.

------
ww520
Wow, that's amazing. I didn't realize you could utilize more than 10 Mbps in
EC2.

OTOH, do people ever delete objects from S3? It seems like people only ever
upload to it.

