
Tus - an Open Source File Upload Protocol - pow-tac
http://www.tus.io/
======
lucaspiller
This is a pretty cool idea. I absolutely detest how many apps fail to handle
network degradation properly, so anything resumable is great in my opinion.

I have a question though... regarding the resuming you do the HEAD request to
see what has been uploaded:

> HTTP/1.1 200 Ok

> Content-Length: 100

> Content-Type: image/jpg

> Content-Disposition: attachment; filename="cat.jpg"

> Range: bytes=0-69

Is it possible that the data that is already there could be corrupt?

I'm also wondering how things like proxies deal with this. A lot of mobile
networks have nasty transparent caching proxies in their network. Also when
uploading a file through Nginx (when the upload works correctly) it won't send
anything to the backend until it has the complete data, is this the same if
the connection cut half way through?

~~~
felixge
> Is it possible that the data that is already there could be corrupt?

TCP error correction should take care of this for the most part, but we're
actively discussing adding support for stronger guarantees:
[https://github.com/tus/tus-resumable-upload-
protocol/issues/...](https://github.com/tus/tus-resumable-upload-
protocol/issues/7)

> I'm also wondering how things like proxies deal with this. A lot of mobile
> networks have nasty transparent caching proxies in their network.

That's a good question. An HTTP proxy could always cause issues, but most
proxies should leave POST/PUT/HEAD requests untouched. That being said, we
won't freeze the protocol until we've had a chance to try it against a variety
of mobile networks, which is why we're already starting to implement an initial
iOS client over here: <https://github.com/tus/tus-ios-client> (not ready yet,
but keep an eye on it).

> Also when uploading a file through Nginx (when the upload works correctly)
> it won't send anything to the backend until it has the complete data, is
> this the same if the connection cut half way through?

NGINX is somewhat unsuitable as a proxy for file uploads due to the buffering
you mentioned. Ideally people will implement the tus protocol as an NGINX
add-on like this one: <http://www.grid.net.ru/nginx/resumable_uploads.en.html>

Meanwhile clients are free to choose a small chunk size for individual PUT
requests (e.g. 1 MB), which will allow them to still have resumability (in 1
MB intervals) without changing their architecture.
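To make that concrete, here's a rough sketch (in Python, purely for
illustration - the 1 MB figure and the inclusive byte ranges follow the
discussion above, not a finalized spec) of how a client could split a file
into fixed-size chunks and derive the range each PUT would cover:

```python
# Sketch: split an upload into fixed-size chunks and compute the byte
# range each PUT request would cover. Ranges are inclusive on both ends,
# matching the bytes=first-last style used elsewhere in this thread.

CHUNK_SIZE = 1024 * 1024  # 1 MB, as suggested above

def chunk_ranges(file_size, chunk_size=CHUNK_SIZE):
    """Yield (first_byte, last_byte) pairs, inclusive on both ends."""
    offset = 0
    while offset < file_size:
        last = min(offset + chunk_size, file_size) - 1
        yield (offset, last)
        offset = last + 1

# A 2.5 MB file resolves to three PUTs:
size = int(2.5 * 1024 * 1024)
for first, last in chunk_ranges(size):
    print(f"Content-Range: bytes {first}-{last}/{size}")
```

If any single PUT fails, only that chunk's bytes need to be resent, which is
where the "resumability in 1 MB intervals" comes from.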

Last but not least, we'll implement the tus protocol for our commercial
uploading service, transloadit.com.

So I'm reasonably optimistic that NGINX won't be a major hurdle for the
adoption of the protocol.

~~~
rogerbinns
> TCP error correction should take care of this for most parts ...

Oh no it doesn't! We have an analytics service receiving HTTP POSTs from
browsers all over the world as JSON. There is an astonishing amount of single
bit errors going on. Usually the initial 20 bytes are okay, but after that we
see all sorts of patterns, including a bit flip every 8 bytes or so. Note that
these will have been received at Google's App Engine servers with the correct
checksum. I believe much of the corruption is caused by intermediary devices
(e.g. routers or boxes performing NAT) that mangle the data and then
recalculate the checksum, putting a good checksum on what is now corrupted
data.

For that service we have to use HTTP (grumble grumble IE grumble). For our
regular stuff we use HTTPS, where we do still see the problem, but it is
considerably rarer. In that case the cause is most likely the client device
having problems (e.g. RAM bit flips, cosmic rays, overclocked/overheated CPUs,
etc.)

All else being equal I'd recommend you add a layer of checksums as a helpful
sanity check. Using SSL also does that for you, but it sees the data late.
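A minimal sketch of what such a checksum layer could look like - the
header idea and names here are hypothetical, and the actual design is what's
being discussed in the tus issue linked above:

```python
import hashlib

def chunk_digest(chunk: bytes) -> str:
    """Hex SHA-256 of a chunk; a client could send this alongside the
    bytes (e.g. in a hypothetical checksum header), and the server
    would reject the chunk on mismatch instead of storing bad data."""
    return hashlib.sha256(chunk).hexdigest()

def server_accepts(chunk: bytes, claimed_digest: str) -> bool:
    # Recompute on the server side; a single flipped bit in transit
    # (even one that survives the TCP checksum) changes the digest.
    return hashlib.sha256(chunk).hexdigest() == claimed_digest

data = b"some chunk of upload data"
digest = chunk_digest(data)
corrupted = bytes([data[0] ^ 0x01]) + data[1:]  # flip one bit

print(server_accepts(data, digest))       # True
print(server_accepts(corrupted, digest))  # False
```

This catches exactly the failure mode described above: corruption introduced
after the original TCP checksum was computed.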

------
nanoman
I've been a user of their product <https://transloadit.com/> for about a year
now for video encoding and am very happy with it. The API is elegant, the
whole service is reliable and fast.

These guys know what they're doing.

~~~
felixge
Thanks for the kind words! Tus.io is the beginning of a larger commitment to
open source here at Transloadit, which will also translate into many new
features (resumable uploads being the next one).

~~~
karterk
Felix, given that the rest of your stack is on Node, what made you choose Go?

~~~
felixge
Node.js was a good choice for Transloadit when we started (2009). By good I
mean better than PHP, which we tried using for Transloadit before.

I started playing with Go late last year, and so far I'm under the impression
that it's easier to write reliable software with it than with node.js.
Callbacks and exception handling are a huge PITA in node.js, and the community
has chosen to refuse improvements that would help with some of the issues
(promises).

Go is also an incredible joy to work with given the modern nature of the
standard library, static typing, gofmt, built-in testing, and many other small
things that the Go team has done right.

That being said, tus.io is not a Go project. Our first server implementation
(tusd) is written in Go, but we're working on support for other platforms
like node.js as well.

Generally speaking, node.js will continue to be part of our toolkit at
Transloadit (for the quick & dirty), but I suspect that we'll use Go for the
more critical parts we work on going forward.

------
gabipurcaru
Shameless plug, but HTML5 + CORS + S3 can enable resumable file uploads. I've
written a library that uploads to S3, and can resume uploads (think internet
going down for a while, force-closing the tab, etc.):
<https://github.com/cinely/mule-uploader>. There's a demo available; I
suggest you test it with bigger files (>100MB).

~~~
felixge
Your library looks great - thanks for releasing it!

S3 is an incredible offering, but since I'm working on tus.io, I'll focus on
what's wrong with it : )

- Multipart chunks need to be 5 MB at least. An interrupted part cannot be
resumed. This kills the mobile use case.

- Throughput to S3 is bad from outside of EC2; uploads often start at very
slow speeds and won't reach the capacity of the connection in many cases.

- S3 does not let you stream/access an upload in progress easily, so you
can't start to transcode a video while it's still uploading.

- The S3 API is the opposite of RESTful.

- S3 is a proprietary service, their protocols are not intended/documented
for adoption, and IMO they don't deserve great people like you making free
contributions to their ecosystem.

edit: I'm not trying to say S3 isn't a good choice for many people. But our
goal is to bring resumable file uploads to every iOS, jQuery, WordPress,
Drupal, Rails, etc. application in the world, and S3 is not the right starting
point for that.

~~~
gabipurcaru
Thanks for your insight, you make good points. It seems to me you don't
really like S3, though - can you elaborate on why?

~~~
felixge
I love S3, and we use it all over the place at transloadit.

I realized my comment sounded overly negative, so I added a clarification to
my comment: Our goal is to bring resumable file uploads to the entire planet,
S3 or any other proprietary protocol should not be the base for that.

~~~
derefr
What's your opinion on Swift (<http://www.openstack.org/software/openstack-
storage/>)? It's basically an open standard for "an object store with
compatibility to, and the same guarantees of, S3." Used in, for example,
Rackspace Cloud Files.

------
j4_james
I think some of your responses aren't quite right. For example, in response to
the first PUT, you have:

      HTTP/1.1 200 Ok
      Range: bytes=0-99
      Content-Length: 0

But the Range header surely can't be used here, since it's a request header
and this is a response. A Content-Range header wouldn't be any more
appropriate, since you're not actually returning any content (of any amount).
Do you really need this info in the response anyway? The sender knows what
they sent, and either it was entirely successful (a 2xx response) or it
wasn't.

Also, if you're going to return a zero-length 200 response, you might as well
use 204 No Content instead.

Then, when resuming an upload, you send a HEAD that returns the following:

      HTTP/1.1 200 Ok
      Content-Length: 100
      Content-Type: image/jpg
      Content-Disposition: attachment; filename="cat.jpg"
      Range: bytes=0-69

Again, you can't use the Range request header in a response. And the Content-
Length should surely be 70, since that's how much content would be returned if
this was a GET request. You could possibly include a Content-Range of 0-69/100
if the server wanted to communicate the expected file size, but I'm not
convinced that's necessary and seems something of an abuse of that header.
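For illustration, here's how a client might parse such a Content-Range value
to find where to resume, assuming the `first-last/total` syntax from RFC 2616
section 14.16 (a sketch, not part of the protocol):

```python
import re

def parse_content_range(value: str):
    """Parse 'bytes first-last/total' (RFC 2616 section 14.16) into
    (first, last, total). Returns None if the value doesn't match."""
    m = re.fullmatch(r"bytes (\d+)-(\d+)/(\d+)", value)
    if not m:
        return None
    return tuple(int(g) for g in m.groups())

# A server reporting bytes 0-69 of a 100-byte upload:
first, last, total = parse_content_range("bytes 0-69/100")
resume_offset = last + 1  # the client resumes its PUT at byte 70
print(resume_offset, total)  # 70 100
```

Whether the server should communicate the expected total this way is exactly
the open question above.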

Finally, the response to the resumed PUT has the same problems as the first
PUT response. It should probably just be a 204 No Content response - no
Content-Length or Range headers required.

~~~
felixge
> But the Range header surely can't be used here, since it's a request header
> and this is a response.

We're currently discussing how to interpret RFC 2616 (HTTP/1.1) for this here:
[https://github.com/tus/tus-resumable-upload-
protocol/issues/...](https://github.com/tus/tus-resumable-upload-
protocol/issues/2)

If you have a better suggestion than using the Range header that will still
allow clients to send multiple file chunks in parallel, I'd be very interested
in it!

> Do you really need this info in the response anyway? The sender knows what
> they sent, and either it was entirely successful (a 2xx response) or it
> wasn't.

We don't need it for the PUT request, but we do need it for HEAD. Adding it to
PUT is redundant, but simplifies the logic for clients who choose to upload
multiple chunks in parallel.

> Also, if you're going to return a zero-length 200 response, you might as
> well use 204 No Content instead.

Good point, I'll look into that: [https://github.com/tus/tus-resumable-
upload-protocol/issues/...](https://github.com/tus/tus-resumable-upload-
protocol/issues/12)

> And the Content-Length should surely be 70, since that's how much content
> would be returned if this was a GET request.

It's 100. We haven't specified GET requests yet, but a server could stream an
upload in this case until all bytes have been received.

Anyway, this is awesome feedback - thank you so much!

~~~
j4_james
> We're currently discussing how to interpret RFC 2616 (HTTP/1.1) for this
> here: [https://github.com/tus/tus-resumable-upload-
> protocol/issues/...](https://github.com/tus/tus-resumable-upload-
> protocol/issues/2)

I've followed up with further comments there.

> If you have a better suggestion than using the Range header that will still
> allow clients to send multiple file chunks in parallel, I'd be very
> interested in it!

I don't see a way to support parallel transfers using only existing HTTP
headers (without violating the HTTP spec). I would suggest maybe proposing a
new header in the HTTPbis WG. For example, something like Available-Ranges
that returns a ranges-specifier indicating the set of ranges that are
available.

This could possibly be returned as part of a 416 response when attempting to
GET a file that isn't entirely available yet. A HEAD request would thus return
the same thing.

> It's 100. We haven't specified GET requests yet, but a server could stream
> an upload in this case until all bytes have been received.

The reason I brought up a GET request is because "the metainformation
contained in the HTTP headers in response to a HEAD request SHOULD be
identical to the information sent in response to a GET request." (section 9.4
of RFC2616)

If you haven't got all 100 bytes yet, your GET request can't return a Content-
Length of 100, thus your HEAD request shouldn't be returning 100 either.

I would have thought you would return whatever content you had available
(hence the 70 bytes), but if you want to support parallel transfers, then a
416 error response indicating the available ranges might make more sense.

------
andyking
Call me a prude, but I saw the F-word and hit the back button.

~~~
felixge
Point taken. We'll try to strike a better balance between emotionalizing the
subject and not being too exclusive when we iterate on the text again.

<https://github.com/tus/tus.io/issues/14>

~~~
PommeDeTerre
I don't think it's an issue with "emotionalizing the subject", nor one of
people getting offended, but rather one of professionalism.

That sort of language, in a context like that, reminds me of a Zed-style rant.
It makes it harder for me to take it seriously, you know? The whole project
ends up coming off as an amateur effort, even if that may not be the case.

~~~
felixge
I don't consider professionalism and the word "fuck" to be mutually exclusive,
but at the end of the day we'll focus on what attracts people. Our current
choice of words clearly fails at this goal, so we'll consider replacing it.

~~~
lttlrck
It clearly fails but you'll only consider changing it? That's more worrying
than choosing the word in the first place...

------
icebraining
Seems fine as a best practice for using HTTP for file uploads. I find the
requirement for the server to have fixed URLs for uploading limiting, but
then again, I'm one of those HATEOAS freaks.

~~~
felixge
The URLs don't have to be fixed. I'll clarify this in the docs:
[https://github.com/tus/tus-resumable-upload-
protocol/issues/...](https://github.com/tus/tus-resumable-upload-
protocol/issues/11)

Also: The protocol is still under heavy development, so please post any
additional ideas, issues, patches or feedback you may have!

------
tsuraan
I'm confused about what information the HEAD request gives after a chunk has
failed. Suppose a client concurrently uploads chunks 1, 2, 3, 4 and 5; chunks
2 and 4 fail, and the rest work. What information does the HEAD give to tell
the client that it needs to re-send the data that was in chunks 2 and 4?
Wouldn't it make more sense for the client to store the success of each of its
chunk uploads?

I'm also not seeing how the client indicates that the upload is complete. It
could be done server-side, by just detecting when a file has no more holes in
it, but that seems hacky. Holes can also be useful; suppose I make a 32GB
.vmdk file (non-sparse) and put 2GB of data on it. If the server can support
holes, then I can upload (and the server only has to store) about 2GB of data;
if the server can't support holes, then I'll have to upload a bit more data
(assuming compression), and the server will have to store a lot more data. If
there were some final message the client could submit to the resource saying
"I'm done, commit it!", I think the protocol would be a bit more complete.

~~~
j4_james
Let's say the whole file is 1000 bytes and each of the five chunks is 200
bytes. If 2 and 4 have failed, then the HEAD would return with a Range header
like this:

      Range: bytes=0-199,400-599,800-999

The client would then know it had to resend bytes 200-399 and 600-799 (namely
parts 2 and 4). If the chunks only partially failed (say 100 bytes of each
chunk was received), the HEAD might even return something like this:

      Range: bytes=0-299,400-699,800-999

So now the client knows it only has to resend bytes 300-399 and 700-799 (only
the last 100 bytes of chunks 2 and 4).

Technically the Range header isn't valid in an HTTP response (something they
are aware of), but conceptually I think the idea works fairly well.
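The client-side gap computation is straightforward; a small Python sketch of
it, using the 1000-byte example above:

```python
def missing_ranges(received, total_size):
    """Given a sorted list of received (first, last) byte ranges
    (inclusive, non-overlapping), return the ranges still missing."""
    missing = []
    expected = 0
    for first, last in received:
        if first > expected:
            missing.append((expected, first - 1))
        expected = last + 1
    if expected < total_size:
        missing.append((expected, total_size - 1))
    return missing

# 'Range: bytes=0-199,400-599,800-999' on a 1000-byte file:
print(missing_ranges([(0, 199), (400, 599), (800, 999)], 1000))
# [(200, 399), (600, 799)] - i.e. chunks 2 and 4 need to be resent
```

The partially-received case works the same way: the second example header
above yields the gaps (300, 399) and (700, 799).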

------
nnnnni
As usual, relevant xkcd:

<http://xkcd.com/927/>

~~~
felixge
: ) - hopefully that's not the case here, afaik our protocol is the first
proposal in this space meant for general adoption. All prior art is tied to
specific libraries and services.

~~~
nnnnni
I hope so too =-)

It looked nice from my quick skim, it just brought that classic xkcd to mind!

If this can be put into any page/service, it'd be a huge contribution.

------
raimue
It seems like the handling of concurrent access has been neglected in this
protocol. What if multiple clients try to resume uploading the same file?

~~~
shabble
Some of the ideas from <http://www.w3.org/1999/04/Editing/> might come in
handy there. What realistic usage scenarios are there for multiple clients
concurrently uploading (parts of) the same file?

