Tus - an Open Source File Upload Protocol (tus.io)
110 points by pow-tac on Apr 17, 2013 | 46 comments



This is a pretty cool idea. I absolutely detest how many apps fail to handle network degradation properly, so anything resumable is great in my opinion.

I have a question though. When resuming, you do a HEAD request to see what has already been uploaded:

> HTTP/1.1 200 Ok

> Content-Length: 100

> Content-Type: image/jpg

> Content-Disposition: attachment; filename="cat.jpg"

> Range: bytes=0-69

Is it possible that the data that is already there could be corrupt?

I'm also wondering how things like proxies deal with this. A lot of mobile networks have nasty transparent caching proxies. Also, when uploading a file through Nginx (when the upload works correctly), it won't send anything to the backend until it has the complete data. Is this the same if the connection cuts out halfway through?


> Is it possible that the data that is already there could be corrupt?

TCP error correction should take care of this for the most part, but we're actively discussing adding support for stronger guarantees: https://github.com/tus/tus-resumable-upload-protocol/issues/...

> I'm also wondering how things like proxies deal with this. A lot of mobile networks have nasty transparent caching proxies in their network.

That's a good question. An HTTP proxy could always cause issues, but most proxies should leave POST/PUT/HEAD requests untouched. That being said, we won't freeze the protocol until we've had a chance to try it against a variety of mobile networks, which is why we're already starting to implement an initial iOS client over here: https://github.com/tus/tus-ios-client (not ready yet, but keep an eye on it).

> Also when uploading a file through Nginx (when the upload works correctly) it won't send anything to the backend until it has the complete data, is this the same if the connection cut half way through?

NGINX is somewhat unsuitable as a proxy for file uploads due to the buffering you mentioned. Ideally people will implement the tus protocol as an NGINX add-on like this one: http://www.grid.net.ru/nginx/resumable_uploads.en.html

Meanwhile, clients are free to choose a small chunk size for individual PUT requests (e.g. 1 MB), which will allow them to retain resumability (in 1 MB intervals) without changing their architecture.
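To make that concrete, here is a minimal client-side sketch of the chunking strategy in Go. The endpoint URL is made up, and sending each chunk as a PUT with a Content-Range header is one plausible framing of the draft, not official client code:

  package main

  import (
    "bytes"
    "fmt"
    "net/http"
    "os"
  )

  const chunkSize = 1 << 20 // 1 MB

  // uploadChunks sends data in 1 MB PUTs so an interruption costs at
  // most one chunk of re-upload.
  func uploadChunks(url string, data []byte) error {
    total := len(data)
    for off := 0; off < total; off += chunkSize {
      end := off + chunkSize
      if end > total {
        end = total
      }
      req, err := http.NewRequest("PUT", url, bytes.NewReader(data[off:end]))
      if err != nil {
        return err
      }
      // Tell the server which byte range this chunk covers.
      req.Header.Set("Content-Range", fmt.Sprintf("bytes %d-%d/%d", off, end-1, total))
      resp, err := http.DefaultClient.Do(req)
      if err != nil {
        return err
      }
      resp.Body.Close()
      if resp.StatusCode >= 300 {
        return fmt.Errorf("chunk %d-%d: %s", off, end-1, resp.Status)
      }
    }
    return nil
  }

  func main() {
    data, err := os.ReadFile("cat.jpg")
    if err != nil {
      panic(err)
    }
    if err := uploadChunks("http://example.com/files/my-upload", data); err != nil {
      panic(err)
    }
  }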

Last but not least, we'll implement the tus protocol for our commercial uploading service, transloadit.com.

So I'm reasonably optimistic that NGINX won't be a major hurdle for the adoption of the protocol.


> TCP error correction should take care of this for most parts ...

Oh no it doesn't! We have an analytics service receiving HTTP POSTs from browsers all over the world as JSON. There is an astonishing amount of single-bit errors going on. Usually the initial 20 bytes are okay, but after that we see all sorts of patterns, including a bit flip every 8 bytes or so. Note that these will have been received at Google's App Engine servers with a correct checksum. I believe much of the corruption is caused by intermediary devices (e.g. routers or boxes performing NAT) that recalculate the checksum, putting a good checksum on what is now corrupted data.

For that service we have to use HTTP (grumble grumble IE grumble). For our regular stuff we use HTTPS, where we do still see the problem, but it is considerably rarer. In that case the cause is most likely the client device having problems (e.g. RAM bit flips, cosmic rays, overclocked/overheated CPUs, etc.)

All else being equal, I'd recommend you add a layer of checksums as a helpful sanity check. Using SSL also does that for you, but it sees the data late.
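As a sketch of what such a checksum layer could look like, assume a hypothetical per-chunk header (say, Upload-Checksum; nothing like it is in the current draft) carrying a CRC-32 that the server verifies before accepting the chunk:

  package main

  import (
    "fmt"
    "hash/crc32"
  )

  func main() {
    chunk := []byte("some chunk payload")
    // Client side: compute a CRC-32 and send it alongside the chunk,
    // e.g. in the hypothetical Upload-Checksum header.
    sent := crc32.ChecksumIEEE(chunk)

    // Simulate a single bit flip in transit, as described above.
    chunk[10] ^= 0x01

    // Server side: recompute and reject the chunk on mismatch.
    if crc32.ChecksumIEEE(chunk) != sent {
      fmt.Println("corrupt chunk, ask the client to resend it")
    }
  }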


Thanks for the quick response. Everything you say makes sense. Best of luck with this, I look forward to seeing where it goes!


TCP means that it shouldn't be corrupt. I would like the feature of having an MD5 hash of the data sent so far, though; with it, clients would be able to validate that the data up to N bytes is most likely not corrupt.


The server could send a Content-MD5 header, which the client could compare to the md5sum of its own bytes 0 to 69.
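A sketch of that comparison in Go, assuming the server sends Content-MD5 as the base64 of the binary digest (per RFC 1864); note this header is a suggestion in this thread, not part of the draft:

  package main

  import (
    "crypto/md5"
    "encoding/base64"
    "fmt"
  )

  // prefixMatches reports whether the server's Content-MD5 value matches
  // the MD5 of our own first n bytes. For the example above, n would be
  // 70 (bytes 0-69).
  func prefixMatches(contentMD5 string, local []byte, n int) bool {
    sum := md5.Sum(local[:n])
    return contentMD5 == base64.StdEncoding.EncodeToString(sum[:])
  }

  func main() {
    local := []byte("...stand-in for the 70 bytes we uploaded...")
    header := "..." // value of the Content-MD5 response header
    fmt.Println(prefixMatches(header, local, len(local)))
  }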


Why use hashing when you only need checksumming?


Depending on the use case, man in the middle attacks could be a good reason for hashing.


A man in the middle could just as easily change the hash header, and MD5 isn't cryptographically secure anyway. It is still a very good checksum.


I've been a user of their product https://transloadit.com/ for about a year now for video encoding and am very happy with it. The API is elegant, the whole service is reliable and fast.

These guys know what they're doing.


Thanks for the kind words! Tus.io is the beginning of a larger commitment to open source here at Transloadit, which will also translate into many new features (resumable uploads being the next one).


Felix, given that the rest of your stack is on Node, what made you choose Go?


Node.js was a good choice for Transloadit when we started (2009). By good I mean better than PHP, which we had tried using for Transloadit before.

I started playing with Go late last year, and so far I'm under the impression that it's easier to write reliable software with it than with node.js. Callbacks and exception handling are a huge PITA in node.js, and the community has chosen to refuse improvements that would help with some of the issues (promises).

Go is also an incredible joy to work with given the modern nature of the standard library, static typing, gofmt, built-in testing, and many other small things that the Go team has done right.

That being said, tus.io is not a Go project. Our first server implementation (tusd) is written in Go, but we're working on support for other platforms like node.js as well.

Generally speaking, node.js will continue to be part of our toolkit at Transloadit (for the quick & dirty), but I suspect that we'll use Go for the more critical parts we work on going forward.


Shameless plug, but HTML5 + CORS + S3 can enable resumable file uploads. I've written a library that uploads to S3, and can resume uploads (think internet going down for a while, force-closing the tab, etc.): https://github.com/cinely/mule-uploader . There's a demo available, I suggest you test it with bigger files (>100MB)


Your library looks great - thanks for releasing it!

S3 is an incredible offering, but since I'm working on tus.io, I'll focus on what's wrong with it : )

- Multipart chunks need to be at least 5 MB. An interrupted part cannot be resumed. This kills the mobile use case.

- Throughput to S3 is bad from outside of EC2; uploads often start at very slow speeds and in many cases won't reach the capacity of the connection.

- S3 does not let you stream/access an upload in progress easily, so you can't start to transcode a video while it's still uploading.

- The S3 API is the opposite of RESTful.

- S3 is a proprietary service, their protocols are not intended/documented for adoption, and IMO they don't deserve great people like you making free contributions to their ecosystem.

edit: I'm not trying to say S3 isn't a good choice for many people. But our goal is to bring resumable file uploads to every iOS, jQuery, Wordpress, Drupal, Rails, etc. application in the world - S3 is not the right starting point for that.


Thanks for your insight, you make good points. It seems to me you don't really like S3; can you elaborate on why?


I love S3, and we use it all over the place at transloadit.

I realized my comment sounded overly negative, so I added a clarification to my comment: our goal is to bring resumable file uploads to the entire planet, and S3 or any other proprietary protocol should not be the basis for that.


What's your opinion on Swift (http://www.openstack.org/software/openstack-storage/)? It's basically an open standard for "an object store with compatibility to, and the same guarantees of, S3." Used in, for example, Rackspace Cloud Files.


Throughput to S3 is ok if you initiate a multipart upload and then do several parallel chunk uploads.


I think some of your responses aren't quite right. For example, in response to the first PUT, you have:

  HTTP/1.1 200 Ok
  Range: bytes=0-99
  Content-Length: 0
But the Range header surely can't be used here, since it's a request header and this is a response. A Content-Range header wouldn't be any more appropriate, since you're not actually returning any content (of any amount). Do you really need this info in the response anyway? The sender knows what they sent, and either it was entirely successful (a 2xx response) or it wasn't.

Also, if you're going to return a zero-length 200 response, you might as well use 204 No Content instead.

Then, when resuming an upload, you send a HEAD that returns the following:

  HTTP/1.1 200 Ok
  Content-Length: 100
  Content-Type: image/jpg
  Content-Disposition: attachment; filename="cat.jpg"
  Range: bytes=0-69
Again, you can't use the Range request header in a response. And the Content-Length should surely be 70, since that's how much content would be returned if this was a GET request. You could possibly include a Content-Range of 0-69/100 if the server wanted to communicate the expected file size, but I'm not convinced that's necessary and seems something of an abuse of that header.

Finally, the response to the resumed PUT has the same problems as the first PUT response. It should probably just be a 204 No Content response - no Content-Length or Range headers required.


> But the Range header surely can't be used here, since it's a request header and this is a response.

We're currently discussing how to interpret RFC 2616 (http 1.1) for this here: https://github.com/tus/tus-resumable-upload-protocol/issues/...

If you have a better suggestion than using the Range header that will still allow clients to send multiple file chunks in parallel, I'd be very interested in it!

> Do you really need this info in the response anyway? The sender knows what they sent, and either it was entirely successful (a 2xx response) or it wasn't.

We don't need it for the PUT request, but we do need it for HEAD. Adding it to PUT is redundant, but simplifies the logic for clients who choose to upload multiple chunks in parallel.

> Also, if you're going to return a zero-length 200 response, you might as well use 204 No Content instead.

Good point, I'll investigate on that: https://github.com/tus/tus-resumable-upload-protocol/issues/...

> And the Content-Length should surely be 70, since that's how much content would be returned if this was a GET request.

It's 100. We haven't specified GET requests yet, but a server could stream an upload in this case until all bytes have been received.

Anyway, this is awesome feedback - thank you so much!


> We're currently discussing how to interpret RFC 2616 (http 1.1) for this here: https://github.com/tus/tus-resumable-upload-protocol/issues/...

I've followed up with further comments there.

> If you have a better suggestions than using the Range header that will still allow clients to send multiple file chunks in parallel, I'd be very interested in it!

I don't see a way to support parallel transfers using only existing HTTP headers (without violating the HTTP spec). I would suggest proposing a new header in the HTTPbis WG, for example something like Available-Ranges, which returns a ranges-specifier indicating the set of ranges that are available.

This could possibly be returned as part of a 416 response when attempting to GET a file that isn't entirely available yet. A HEAD request would thus return the same thing.
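For concreteness, such a response might look like this (the Available-Ranges header is purely hypothetical, as proposed above):

  HTTP/1.1 416 Requested Range Not Satisfiable
  Available-Ranges: bytes=0-199,400-599,800-999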

> It's 100. We haven't specified GET requests yet, but a server could stream an upload in this case until all bytes have been received.

The reason I brought up a GET request is because "the metainformation contained in the HTTP headers in response to a HEAD request SHOULD be identical to the information sent in response to a GET request." (section 9.4 of RFC2616)

If you haven't got all 100 bytes yet, your GET request can't return a Content-Length of 100, thus your HEAD request shouldn't be returning 100 either.

I would have thought you would return whatever content you had available (hence the 70 bytes), but if you want to support parallel transfers, then a 416 error response indicating the available ranges might make more sense.


Call me a prude, but I saw the F-word and hit the back button.


Is there any reason why you don't use a profanity filter, like this one: https://chrome.google.com/webstore/detail/simple-profanity-f... ?

It is an issue where you object to a certain subset of language: language that is used to express strong or passionate feelings about things. I have never given swearing much thought, and I don't understand what makes you object to it.

I guess I have always felt that the less people have to censor themselves, the better. The Wikipedia article on the subject [http://en.wikipedia.org/wiki/Profanity] is not as informative as I'd hoped.


Point taken. We'll try to strike a better balance between emotionalizing the subject and being too exclusive when we next iterate on the text.

https://github.com/tus/tus.io/issues/14


I don't think it's an issue with "emotionalizing the subject", nor one of people getting offended, but rather one of professionalism.

That sort of language, in a context like that, reminds me of a Zed-style rant. It makes it harder for me to take it seriously, you know? The whole project ends up coming off as an amateur effort, even if that may not be the case.


I don't consider professionalism and the word "fuck" to be mutually exclusive, but at the end of the day we'll focus on what attracts people. Our current choice of words clearly fails at this goal, so we'll consider replacing it.


It clearly fails but you'll only consider changing it? That's more worrying than choosing the word in the first place...


I didn't notice it until I read your comment.


Same here. I think in that context it's fine. I didn't even notice it until I saw this comment here and went back to look.


You mean "fork"?


Prude.


I don't get it...


The little "about" text on the landing page has the word "fucking" in it.


I got exactly the same reaction.


Seems fine as a best practice for using HTTP for file uploads. I feel the requirement for the server to have fixed URLs for uploading to be limiting, but then again, I'm one of those HATEOAS freaks.


The urls don't have to be fixed. I'll clarify this in the docs: https://github.com/tus/tus-resumable-upload-protocol/issues/...

Also: The protocol is still under heavy development, so please post any additional ideas, issues, patches or feedback you may have!


I'm confused on what information the HEAD request gives after a chunk has failed. Suppose a client concurrently uploads chunks 1, 2, 3, 4 and 5; chunks 2 and 4 fail, and the rest work. What information does the HEAD give to tell the client that it needs to re-send the data that was in chunks 2 and 4? Wouldn't it make more sense for the client to store the success of each of its chunk uploads?

I'm also not seeing how the client indicates that the upload is complete. It could be done server-side, by just detecting when a file has no more holes in it, but that seems hacky. Holes can also be useful; suppose I make a 32GB .vmdk file (non-sparse) and put 2GB of data on it. If the server can support holes, then I can upload (and the server only has to store) about 2GB of data; if the server can't support holes, then I'll have to upload a bit more data (assuming compression), and the server will have to store a lot more data. If there were some final message the client could submit to the resource saying "I'm done, commit it!", I think the protocol would be a bit more complete.


Let's say the whole file is 1000 bytes and each of the five chunks is 200 bytes. If 2 and 4 have failed, then the HEAD would return a Range header like this:

  Range: bytes=0-199,400-599,800-999
The client would then know it had to resend bytes 200-399 and 600-799 (namely parts 2 and 4). If the chunks only partially failed (say 100 bytes of each chunk was received), the HEAD might even return something like this:

  Range: bytes=0-299,400-699,800-999
So now the client knows it only has to resend bytes 300-399 and 700-799 (only the last 100 bytes of chunks 2 and 4).

Technically the Range header isn't valid in an HTTP response (something they are aware of), but conceptually I think the idea works fairly well.
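Here's a sketch of the client-side bookkeeping this implies, in Go; parsing the header is omitted, and the range list is assumed to be sorted and non-overlapping:

  package main

  import "fmt"

  // byteRange is an inclusive range like "0-199".
  type byteRange struct{ start, end int64 }

  // missing returns the gaps not covered by received, for a file of
  // total bytes.
  func missing(received []byteRange, total int64) []byteRange {
    var gaps []byteRange
    var next int64
    for _, r := range received {
      if r.start > next {
        gaps = append(gaps, byteRange{next, r.start - 1})
      }
      next = r.end + 1
    }
    if next < total {
      gaps = append(gaps, byteRange{next, total - 1})
    }
    return gaps
  }

  func main() {
    // Range: bytes=0-299,400-699,800-999 from the example above.
    got := []byteRange{{0, 299}, {400, 699}, {800, 999}}
    fmt.Println(missing(got, 1000)) // [{300 399} {700 799}]
  }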


Disclaimer: I'm a friend of the main author and have been peripherally involved (mostly watching from the sidelines) in this project.

Valid points, IMO. Those sound like use cases that might not have been contemplated originally (the idea for the spec grew out of the author's work here: https://transloadit.com).

That said, the spec, and the code around it are both still very much evolving, and are welcoming of input. You can join us on GitHub or in #tusio on Freenode.


As usual, relevant xkcd:

http://xkcd.com/927/


: ) - hopefully that's not the case here; afaik our protocol is the first proposal in this space meant for general adoption. All prior art is tied to specific libraries and services.


I hope so too =-)

It looked nice from my quick skim, it just brought that classic xkcd to mind!

If this can be put into any page/service, it'd be a huge contribution.


It seems like the handling of concurrent access has been neglected in this protocol. What if multiple clients try to resume uploading the same file?


Some of the ideas from http://www.w3.org/1999/04/Editing/ might come in handy there. What realistic usage scenarios are there for multiple clients concurrently uploading (parts of) the same file?


So far we have this:

> Servers MUST handle overlapping PUT requests in an idempotent fashion given that the overlapping data is identical. Otherwise the behavior is undefined.

Does that address your concern?
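One simple way a server could satisfy that requirement is to write every chunk at its absolute offset in the target file, so replaying identical bytes is a no-op. A minimal sketch (the file path and function name are illustrative, not from tusd):

  package main

  import "os"

  // writeChunk persists an uploaded chunk at its absolute offset.
  // Replaying identical data at the same offset leaves the file
  // unchanged, which makes overlapping identical PUTs idempotent.
  func writeChunk(path string, offset int64, chunk []byte) error {
    f, err := os.OpenFile(path, os.O_WRONLY|os.O_CREATE, 0644)
    if err != nil {
      return err
    }
    defer f.Close()
    _, err = f.WriteAt(chunk, offset)
    return err
  }

  func main() {
    // Two clients replaying the same 200-byte chunk at offset 400
    // leave the file in the same state as a single upload would.
    chunk := make([]byte, 200)
    for i := 0; i < 2; i++ {
      if err := writeChunk("upload.bin", 400, chunk); err != nil {
        panic(err)
      }
    }
  }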



