

Amazon S3: Multipart Upload - abraham
http://aws.typepad.com/aws/2010/11/amazon-s3-multipart-upload.html

======
jluxenberg
Not to be confused with the "multipart/form-data" encoding type for form-based
file uploads. This is not an implementation of that protocol for S3.

~~~
chunkbot
Because we _love_ it when corporations come up with their own proprietary
implementations of existing open protocols with similar if not superior
functionality.

~~~
JeremyBanks
Apples and oranges. multipart/form-data is used for sending a set of form
information, possibly including files, all together. This announcement is that
S3 will now allow you to upload a file in pieces so that you don't lose
everything when an upload is interrupted.

------
bgentry
Based on the docs, it appears that this also allows you to upload segments of
files without knowing the final number of segments or the final file size.

This will be pretty damn useful for piping the output of some process
generating a large file (e.g., video transcoding) and beginning the upload
before the file has been fully created.
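
That streaming idea can be sketched in Python. The part-chunking generator below is runnable as-is; the S3 calls themselves are left as comments (boto3-style names, an assumption, since they need real credentials and a live bucket):

```python
import io

# S3 multipart parts must be at least 5 MB, except for the final part.
PART_SIZE = 5 * 1024 * 1024

def iter_parts(stream, part_size=PART_SIZE):
    """Yield (part_number, chunk) pairs from a stream of unknown total size."""
    part_number = 1
    while True:
        chunk = stream.read(part_size)
        if not chunk:
            break
        yield part_number, chunk
        part_number += 1

# Against real S3, each chunk would be sent roughly like this:
#   upload = s3.create_multipart_upload(Bucket=bucket, Key=key)
#   for n, chunk in iter_parts(pipe):
#       s3.upload_part(Bucket=bucket, Key=key, PartNumber=n,
#                      UploadId=upload["UploadId"], Body=chunk)
#   s3.complete_multipart_upload(...)

# Local demonstration with a 12 MB in-memory "file":
parts = list(iter_parts(io.BytesIO(b"x" * (12 * 1024 * 1024))))
print(len(parts))          # 3 parts: 5 MB + 5 MB + 2 MB
print(len(parts[-1][1]))   # 2097152
```

The point is that the producer never needs to know the total size up front, which is what makes piping transcoder output straight into the upload possible.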

~~~
cperciva
_This will be pretty damn useful for piping the output of some process
generating a large file (e.g., video transcoding) and beginning the upload
before the file has been fully created._

Even better: You can split a video into pieces, transcode each part on a
different EC2 node, and upload the parts directly from those respective nodes.

~~~
bgentry
Won't work with all codecs, but the same concept can be applied in a lot of
areas!

------
neilc
_Limitations of the TCP/IP protocol make it very difficult for a single
application to saturate a network connection._

I'm just curious, but exactly which "limitations" are those? I can believe
that parallel connections help in practice (especially when fetching small
objects), but for large objects, I find it surprising you can't get reasonably
close to saturating a single network connection with a modern TCP stack (e.g.,
using TCP window scaling).

~~~
tav
It's pretty much impossible to saturate even a LAN connection with a single
TCP connection. There are a number of issues at play here — RTT (Round Trip
Time, i.e. ping/latency), window sizes, packet loss and initcwnd (TCP's
initial window).

The combination of the limitations imposed by the speed of light and TCP's
windowing system means that you are buggered transferring large files over
high-latency TCP connections. I haven't checked their figures, but here's a
TCP rate calculator I just found which lets you tune the different parameters:
<http://osn.fx.net.nz/LFN/>

_The greater the delay, the bigger the impact. For example if we take a
standard Windows XP machine and plug in the values for a standard Gigabit LAN
(typically .2ms latency between hosts) we get a maximum speed of 700Mbit/sec,
but if we try it between two hosts, one of them in the USA (typically around
120ms) the maximum transfer rate falls to 1.17 Mbit/sec._
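
Those quoted figures drop straight out of TCP's steady-state cap of one receive window per round trip. A quick check (assuming XP's default receive window of 17,520 bytes):

```python
# Max steady-state TCP throughput is bounded by window_size / RTT.
WINDOW_BITS = 17_520 * 8   # Windows XP default receive window, in bits

def max_mbit(rtt_seconds):
    """Throughput ceiling in Mbit/s for a given round-trip time."""
    return WINDOW_BITS / rtt_seconds / 1e6

print(round(max_mbit(0.0002), 1))  # 700.8  (0.2 ms LAN)
print(round(max_mbit(0.120), 2))   # 1.17   (120 ms transcontinental link)
```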

There are a number of attempts trying to fix TCP's failings in this regard.
For starters, see this presentation by the Google/SPDY guys making a case for
changing TCP Slow Start:
<http://www.chromium.org/spdy/An_Argument_For_Changing_TCP_Slow_Start.pdf>
— here's the IETF Draft for increasing TCP's Initial Window
<http://tools.ietf.org/html/draft-hkchu-tcpm-initcwnd>

And, more radically, see the uTP work by the Bittorrent folk who are trying to
create a better alternative to TCP instead of simply fixing it —
<http://en.wikipedia.org/wiki/Micro_Transport_Protocol> and
<https://github.com/bittorrent/libutp> (source code).

Anyways, sorry for not going into too much detail (it's 4am), but hope I was
able to clear things up a little.

~~~
neilc
_There are a number of issues at play here — RTT (Round Trip Time, i.e.
ping/latency), window sizes, packet loss and initcwnd (TCP's initial window)._

Initial window size: not relevant AFAICS; I'm not talking about connection
startup behavior.

RTT, Window size: if the bandwidth-delay product is large, obviously you need
a large window size (>>65K). Thankfully, recent TCP stacks support TCP window
scaling.
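
Concretely, the window needed to keep a fast, high-latency pipe full is the bandwidth-delay product; the 1 Gbit/s and 120 ms values below are illustrative:

```python
def required_window_bytes(bandwidth_bps, rtt_seconds):
    """Bandwidth-delay product: bytes in flight needed to fill the pipe."""
    return bandwidth_bps / 8 * rtt_seconds

# Saturating 1 Gbit/s over a 120 ms path:
bdp = int(required_window_bytes(1e9, 0.120))
print(bdp)            # 15000000 bytes
print(bdp > 65_535)   # True: far beyond the unscaled 16-bit window field
```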

Packet loss: you need relatively large buffers (by the standards of
traditional TCP) and a sane scheme for recovering from packet loss (e.g.,
SACK), but I don't see why this is a show stopper on modern TCP stacks.

I'm not super familiar with the SPDY work, but from what I recall, it
primarily addresses connection startup behavior, rather than steady-state
behavior.

~~~
nkurz
In theory you should be right, and yet in practice it seems to be a problem.
Here's a recent exchange that offers some real-life numbers:
<http://serverfault.com/questions/111813/what-is-the-best-way-to-transfer-a-single-large-file-over-a-high-speed-high-late>

------
shib71
This is very good - uploading large files is a PITA. Now all we need is a
Flash client we can add to a website, and we'll have a reliable way for
website users to upload huge files.

~~~
cperciva
Did you really just use the words "Flash client" and "reliable" in the same
sentence?

~~~
shib71
As fashionable as it is to deride Flash, the vast majority find it quite
stable. I believe a good Flash developer could produce a solid upload client
for this service. I also believe that every developer on HN would use such a
client without batting an eye.

~~~
cperciva
_I also believe that every developer on HN would use such a client without
batting an eye._

Well, except maybe people on ipads....

~~~
shib71
Granted. Mobile devices are handicapped in terms of embeddable functionality,
but that isn't limited to Flash. Java applets, the only obvious alternative
for this use case, are in the same boat. I think that if mobile users want to
upload >100MB files over the air it's fair to make them use an app.

------
dholowiski
Am I the only one who thinks it's strange that the AWS blog is hosted on
typepad?

~~~
jeffbarr
Perhaps, but I started the blog in November of 2004, long before EC2, S3, or
any of the other services had been released. It was a clean and simple way to
get a blog up and running and I've never had a compelling reason to go through
the trouble to move it to another host.

Here's the first post that I wrote for the AWS blog:
<http://aws.typepad.com/aws/2004/11/welcome.html>

------
bshep
I did not find this in the description for the service, but I'm wondering what
happens if you have a crash or power failure while doing a multi-part upload
and don't have the 'upload-id' stored anywhere.

First of all, the storage for the data already uploaded is reserved and there
is no way to release it, since you cannot abort the upload without the 'id'.

Second of all, there doesn't seem to be a way to list active multi-part
uploads; you can only list the status of an upload for which you have the 'id'.

Any ideas?

~~~
throw_away
this doc:
<http://docs.amazonwebservices.com/AmazonS3/latest/dev/iearchitecture.html>
says that there is an operation to list all in-progress multi-part uploads as
well as list the parts uploaded for a given multi-part upload.
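
That recovery flow can be sketched roughly as follows. The call names mirror the S3 ListMultipartUploads and AbortMultipartUpload operations (boto3-style spelling), but the client here is a stub with made-up data so the logic runs without AWS credentials:

```python
# Stub standing in for an S3 client, so the cleanup logic below is runnable.
class StubS3Client:
    def __init__(self, uploads):
        self._uploads = uploads   # list of (key, upload_id) pairs
        self.aborted = []

    def list_multipart_uploads(self, Bucket):
        return {"Uploads": [{"Key": k, "UploadId": uid}
                            for k, uid in self._uploads]}

    def abort_multipart_upload(self, Bucket, Key, UploadId):
        self.aborted.append((Key, UploadId))

def abort_all_uploads(client, bucket):
    """List every in-progress multipart upload in a bucket and abort it."""
    resp = client.list_multipart_uploads(Bucket=bucket)
    uploads = resp.get("Uploads", [])
    for u in uploads:
        client.abort_multipart_upload(Bucket=bucket, Key=u["Key"],
                                      UploadId=u["UploadId"])
    return len(uploads)

s3 = StubS3Client([("logs/big.gz", "abc123")])
print(abort_all_uploads(s3, "my-bucket"))  # 1
```

So even with the 'upload-id' lost, the orphaned parts can be found and the reserved storage released.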

------
blantonl
What is an acceptable use case for this new feature?

~~~
jasonkester
Here's how I'm going to use it:

I run a service that processes S3 and CloudFront logs for people. Each S3
bucket generates between 200 and 1000 logfiles every day that need to be
combined to form a full day's weblog.

Part of what my service does is re-upload that combined logfile back to the
bucket in question, and since for large sites it can be upwards of 200 MB
_zipped_, it'd be nice to be able to upload it in little 5 MB chunks that can
be resent if anything goes wrong.

~~~
andraz
Exactly the same use case here... :) Maybe Amazon should go a step further
and enable people to get logs aggregated by a chosen time unit (hour, day,
week), so we won't need to do a round trip just to join files.

