Ask HN: Why does S3 still not support append?
14 points by whatnotests on Dec 17, 2015 | 27 comments
Is there some technical reason why it's simply infeasible? Is there some fundamental architectural decision laid down years ago that prevents "append" operations?

If Google's equivalent of S3 supports append, why doesn't S3?

Google Cloud Storage does not support append.

One reason these things don't support append is at some point they need to choose the "version" of the object. Usually this is done when an object upload has completed.

If they allow arbitrary appends to objects, then they would have a hard time assigning any type of ordering to them, as the concept of an object being "complete" would be thrown out the window.

(EDIT: and what does it mean to have a GET on an object, if you don't know the latest version to return?)

I think something like this could be implemented, but it would probably be an entirely different product that supported some specific traditional file operations (rename, ftruncate, link, etc) but had different scaling properties.

The ability to append to a block already existed with block blobs, which are a bag of data blocks and an ordered list of block identifiers: you could just create a new block, then commit a new list with its identifier at the end.

The real benefit of the new append blob is that you have a one-request append (instead of read list, upload block, commit list).

Also, append blobs (like block blobs) are limited to 50000 append operations.
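
To make the three-step block-list append described above concrete, here is a toy in-memory model. It is only a sketch: the `BlockBlob` class and the `append` helper are hypothetical and are not the Azure SDK. The method names loosely mirror the Put Block, Get Block List, and Put Block List operations.

```python
from dataclasses import dataclass, field


@dataclass
class BlockBlob:
    """Toy model: a bag of uploaded blocks plus a committed, ordered id list."""
    blocks: dict[str, bytes] = field(default_factory=dict)  # block id -> data
    committed: list[str] = field(default_factory=list)      # ordered block ids

    def put_block(self, block_id: str, data: bytes) -> None:
        # Uploading a block does not change the readable content yet.
        self.blocks[block_id] = data

    def get_block_list(self) -> list[str]:
        return list(self.committed)

    def put_block_list(self, block_ids: list[str]) -> None:
        # Committing a list is what makes the new content visible.
        missing = [b for b in block_ids if b not in self.blocks]
        if missing:
            raise KeyError(f"unknown block ids: {missing}")
        self.committed = list(block_ids)

    def read(self) -> bytes:
        return b"".join(self.blocks[b] for b in self.committed)


def append(blob: BlockBlob, block_id: str, data: bytes) -> None:
    """Three-step append: read list, upload block, commit list."""
    current = blob.get_block_list()
    blob.put_block(block_id, data)
    blob.put_block_list(current + [block_id])
```

Two writers that both read the list before either commits would lose one of the appends in this model, which is the kind of race a one-request append avoids.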

S3 is a key/value store. Appends don't make sense in that context. If you think of it as a key/value store, then a lot of their constraints start to make more sense.

Google Cloud Storage does not support append. Their docs: "... you cannot make incremental changes to objects, such as append operations or truncate operations."

Oops my mistake. I coulda swore.

Still wishing.

Azure recently added it for logging scenarios. https://azure.microsoft.com/en-us/blog/azure-storage-release...

Azure already had append support for block blobs (i.e. you could upload new data to an existing blob without having to upload the entire blob). But unlike S3, Azure Storage tends to favor features (and consistency) over performance.

Well answered by other comments here, the whole "eventually consistent blob" vs "directed graph of mutation operations" problem. FWIW, this is a good distributed systems interview question :-)

You can really do either with S3. It's just an eventually consistent, immutable KV store.

A) Keep a consistent manifest of chunk range to keys. B) Keep an ordered list of keys that represent the DAG.

In case A, you'll be able to assemble your blob in parallel even.
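
A rough sketch of option A, assuming the manifest lives somewhere with consistent reads (a database, say) and the chunks live in an immutable key/value store. The `store` dict, the `manifest` list, and both function names are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for an immutable key/value store such as S3.
store: dict[str, bytes] = {}

# Manifest of (start_offset, key) entries in blob order. Assumed to be kept
# in a separate, consistent store; here it is just a list.
manifest: list[tuple[int, str]] = []


def append_chunk(key: str, data: bytes) -> None:
    """Write a new immutable chunk, then record its byte range in the manifest."""
    if key in store:
        raise ValueError(f"chunk {key!r} already exists; chunks are immutable")
    if manifest:
        last_start, last_key = manifest[-1]
        start = last_start + len(store[last_key])
    else:
        start = 0
    store[key] = data
    manifest.append((start, key))


def assemble() -> bytes:
    """Fetch all chunks in parallel and join them in manifest order."""
    keys = [key for _, key in manifest]
    with ThreadPoolExecutor(max_workers=8) as pool:
        parts = list(pool.map(store.__getitem__, keys))  # map preserves order
    return b"".join(parts)
```

Because the manifest records offsets, a reader could also fetch only the chunks that overlap a requested byte range instead of the whole blob.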

S3 is eventually consistent. Appending to an eventually consistent file is going to get very messy, very fast: what happens when an append reaches a replica node before an earlier one does?

If you're happy with out-of-order appends, just use a container file format like Parquet, where appends are actually additional file creations.

After a decently large RAID failure, I needed to gzip as many large files as possible and send them over to S3 as quickly as I could, given the risk of another failure. The script would gzip each file and then sync it up to S3, each in its own backgrounded process. If two large files were sent at the same time, both uploads would die, then /leave incomplete files/.

After leaving that running overnight, all of the files appeared to be uploaded... until the owner of the company needed to use them.

I'm still not sure if that's an exceptional use case, but it left a pretty bad taste in my mouth about S3 ever since.

It sounds like you were missing the "Content-MD5" header on your PUT requests. As I recall, S3 will return an HTTP error response if the complete object does not match the Content-MD5 the client sends. The other issue with the HTTP protocol is that the request body doesn't have a mandatory delimiter. The client/server can't really distinguish between a terminated TCP connection and a complete HTTP body without the optional Content-Length/Content-MD5 headers. It really sounds like one or more of your large files were timing out somewhere and the checksum wasn't sent.
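
For reference, a minimal sketch of computing that header value. Content-MD5 is the base64 encoding of the raw 16-byte MD5 digest, not the hex string. The helper names `content_md5` and `put_headers` are invented for this example.

```python
import base64
import hashlib


def content_md5(body: bytes) -> str:
    """Return the base64-encoded MD5 digest expected in a Content-MD5 header."""
    return base64.b64encode(hashlib.md5(body).digest()).decode("ascii")


def put_headers(body: bytes) -> dict[str, str]:
    """Headers that let the server detect a truncated or corrupted body."""
    return {
        "Content-Length": str(len(body)),
        "Content-MD5": content_md5(body),
    }
```

If you use boto3, `put_object` takes the same value through its `ContentMD5` parameter.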

Because reconciling 2 separate appends to 2 separate nodes which have different copies of the data would be a huge mess.

S3 is more of a simple key-value store than a full filesystem (and for good reason). I suspect the reason their docs push the filesystem metaphor so much is because filesystems are more familiar to many people, and most filesystem semantics can be implemented using a key-value store. In that sense, there is no update() or append() in S3, just a simple set().

Also, because AWS doesn't provide a decent networked file system where multiple instances can mount the same volume simultaneously in read/write mode. S3 is as close as one can get, in many cases.

You can use our ObjectiveFS[1] if you want a networked file system where multiple instances can simultaneously mount it read/write. It is backed by S3 and gives you a standard POSIX interface.

[1] https://objectivefs.com

Thanks, interesting. Can you comment on how it compares to s3fs? I've used s3fs and it can sometimes be buggy (as in, to the point where files get clobbered and corrupted) and slow (especially in listing directories with a large number of files). Does ObjectiveFS solve these issues, and do you have any reliability statistics?

What about EFS? (still in beta, essentially NFS) https://aws.amazon.com/efs/

I mostly miss a MoveObject operation to rename files myself, but I guess they are keeping things simple and scalable etc. on their end and requiring us to work around it with the existing lower level operations.

You can use a server-side copy followed by a delete to rename an object reasonably efficiently. I guess that is what you mean by using existing lower-level operations, but if not, you might find that helpful!
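
A minimal sketch of that copy-then-delete rename, assuming `client` is a boto3 S3 client (or anything exposing the same `copy_object` and `delete_object` calls). The function name `rename_object` is made up.

```python
def rename_object(client, bucket: str, old_key: str, new_key: str) -> None:
    """Rename by server-side copy followed by delete. This is not atomic."""
    # The copy happens inside S3, so the data is not downloaded. A single
    # copy_object call is limited to 5 GB; larger objects need a multipart copy.
    client.copy_object(
        Bucket=bucket,
        Key=new_key,
        CopySource={"Bucket": bucket, "Key": old_key},
    )
    # Only reached if the copy call returned without raising.
    client.delete_object(Bucket=bucket, Key=old_key)
```

Between the two calls both keys exist, so readers may briefly see the object under either name.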

I don't think it has a traditional filesystem. It probably just writes all puts sequentially as fast as possible and stores the location and then replicates. The easiest way to append would be to read the object, append, and then write to a new object. If they did that internally there would be no transfer out and no revenue although they could probably charge for the internal expense. Another reason is that people would probably think that appends are no big deal and try to append continuously to multi-gigabyte files. If this is the case then it is best to let the client handle appends where costs are out in the open.
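
Here is what that client-side read, append, write cycle might look like. This is a sketch, assuming `client` is a boto3 S3 client; `append_by_rewrite` and its `dest_key` parameter are hypothetical names.

```python
from typing import Optional


def append_by_rewrite(client, bucket: str, key: str, extra: bytes,
                      dest_key: Optional[str] = None) -> int:
    """Client-side 'append': download, concatenate, upload the whole object.

    Writes to dest_key if given, otherwise overwrites key. Returns the new
    size in bytes. Concurrent writers can silently overwrite each other.
    """
    current = client.get_object(Bucket=bucket, Key=key)["Body"].read()
    updated = current + extra
    client.put_object(Bucket=bucket, Key=dest_key or key, Body=updated)
    return len(updated)
```

Every call transfers the full object in both directions, so appending small records to a multi-gigabyte object this way gets expensive quickly, which is the cost concern raised above.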

All excellent points, especially about cost.

I've considered "faking" the append functionality by making a new file per append action, then performing a periodic compaction.
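
A toy version of that idea, using a dict as a stand-in for the bucket and assuming a single writer. The key layout (`<prefix>/part-<sequence>`) and all the names here are invented for the sketch.

```python
import itertools

# Hypothetical stand-in for a bucket: key -> object bytes.
bucket: dict[str, bytes] = {}

# Single-writer sequence counter; zero-padded so keys sort lexicographically.
_sequence = itertools.count()


def _part_keys(prefix: str) -> list[str]:
    return sorted(k for k in bucket if k.startswith(prefix + "/"))


def append_part(prefix: str, data: bytes) -> str:
    """Each 'append' is just a new object under the prefix."""
    key = f"{prefix}/part-{next(_sequence):012d}"
    bucket[key] = data
    return key


def read_all(prefix: str) -> bytes:
    """The logical file is the concatenation of all parts in key order."""
    return b"".join(bucket[k] for k in _part_keys(prefix))


def compact(prefix: str) -> None:
    """Merge all parts into the lowest key, then delete the rest.

    Not atomic: a reader between the two steps would see duplicated data.
    """
    keys = _part_keys(prefix)
    if len(keys) < 2:
        return
    bucket[keys[0]] = b"".join(bucket[k] for k in keys)
    for key in keys[1:]:
        del bucket[key]
```

Writing the merged data under the lowest key keeps it ordered ahead of any parts appended after compaction.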

Even compaction-via-combine-and-delete-old is clunky.

    aws s3 combine --target s3://bucket-name/output-file.txt \
      s3://bucket-name/input-file-1.txt \
      ... \

I, for one, would pay extra for that.

The lack of read-after-delete consistency makes this tricky.


I've seen "eventually" consistent mean up to 24hrs in the face of problems. Several minutes seems common when versioning/bucket replication is enabled.

I can second that. Personally I'd like to see them support symbolic links so version controlling and rolling deployments of static websites becomes a little easier.

This is actually pretty kewl. You could potentially branch the current CLI from the GitHub repo and add that functionality in. Ideally the flow would work something like the following:

  1. Start a multipart object upload
  2. Issue "Upload Part - Copy" requests for each part of an object ( http://docs.aws.amazon.com/AmazonS3/latest/API/mpUploadUploadPartCopy.html )
  3. Complete the multipart object upload
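
The three steps above could be sketched roughly like this, assuming `client` is a boto3 S3 client and all objects are in one bucket. The function name `combine_objects` is made up, and the abort-on-error handling is an addition of this sketch.

```python
def combine_objects(client, bucket: str, source_keys: list[str],
                    target_key: str) -> None:
    """Concatenate existing objects server-side using Upload Part - Copy."""
    # 1. Start a multipart object upload.
    upload_id = client.create_multipart_upload(
        Bucket=bucket, Key=target_key
    )["UploadId"]
    parts = []
    try:
        # 2. Copy each source object in as one part. S3 requires every part
        #    except the last to be at least 5 MiB, and allows at most 10,000 parts.
        for number, key in enumerate(source_keys, start=1):
            result = client.upload_part_copy(
                Bucket=bucket,
                Key=target_key,
                UploadId=upload_id,
                PartNumber=number,
                CopySource={"Bucket": bucket, "Key": key},
            )
            parts.append(
                {"PartNumber": number, "ETag": result["CopyPartResult"]["ETag"]}
            )
        # 3. Complete the multipart object upload.
        client.complete_multipart_upload(
            Bucket=bucket,
            Key=target_key,
            UploadId=upload_id,
            MultipartUpload={"Parts": parts},
        )
    except Exception:
        # Abort so the already-copied parts do not linger and accrue storage cost.
        client.abort_multipart_upload(
            Bucket=bucket, Key=target_key, UploadId=upload_id
        )
        raise
```

The minimum part size means this only works directly when the inputs (other than the last) are already 5 MiB or larger.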

Alternative flow:

  1. Enable bucket versioning
  2. Download the parsed S3 objects
  3. Start a multipart object upload to S3 with the specified target object as the object name
  4. Reupload the parsed S3 objects as parts of a single multipart object upload
  5. Delete the previous parsed objects once the multipart object upload is complete (a delete marker should be added to the top of the version stack, but the previously stored version should still be available if you specify its handle/version id).

Edit: changed formatting

You mean appending things on the end of files or what? If so, probably because it's trivial to work around by storing the data in new files - and large data where this would be valuable should be broken down into pieces for n different reasons anyway.

Why do people who ask questions fall slightly short of providing enough information to meaningfully answer them?
