Good. The first S in S3 stands for Simple, even though nobody in software seems to appreciate simplicity anymore.
Adding features makes documentation more complicated, makes the tech harder to learn, makes libraries bigger, likely harms performance a bit, increases bug surface area, etc.
When it gets too out of hand, people will paper it over with a new, simpler abstraction layer, and the process starts again, only with a layer of garbage spaghetti underneath.
Show your age and be proud, Simple Storage Service.
Multi-region is pretty essential IMO. You can’t get the same cost-effectiveness by building your own multi-region S3, stapling multiple buckets together. The basic premise of S3 is that you get a key-value store interface, and S3 handles error correction and distributing the chunks among multiple physical machines on the cheap. Multi-region is the same product, at a larger scale. The complicated part is that somebody has to pay for the network bandwidth (whereas in a single region, it’s so cheap it’s unmetered).
The CAS thing is also pretty essential. Everybody wants to build some simple storage system on top of S3, and without CAS it’s pretty damn hard to have any kind of consistency guarantees, even for very simple systems… you end up having to build something outside S3 to manage consistency.
An easy-to-understand use case is backups. Suppose you are using S3 as a backend for a backup system which deduplicates backups. You want to make backups from multiple locations, you want to deduplicate because there’s a lot of duplicate data floating around (maybe you have terabytes of video files getting copied, ML model weights, something else big that you copy around), and you want to expire old backups. You can almost build this on top of plain S3, and the only reason you can’t is that it’s unsafe to expire old data in such a system while any backup is writing (because the other backup may add a new reference to data, racing against the expiration / garbage collection process).
A simple CAS gives you a lot of tools to solve this. The alternative to CAS is doing something kinda silly, like running a DynamoDB table as a layer of indirection.
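Here’s a rough sketch of the kind of thing I mean, using GCS’s generation preconditions (which are exactly this sort of CAS); the bucket and manifest names are made up. Backup writers funnel reference additions through a shared manifest, and the precondition means a racing writer can never silently lose an update; the GC process can use the same trick before it deletes anything.

```python
from google.cloud import storage
from google.api_core import exceptions

client = storage.Client()
bucket = client.bucket("backup-bucket")  # hypothetical bucket name

def add_reference(chunk_id: str, manifest_name: str = "refs/manifest.txt") -> None:
    """Append a chunk reference to a shared manifest via compare-and-swap."""
    blob = bucket.blob(manifest_name)
    while True:
        blob.reload()                                  # current contents + generation
        manifest = blob.download_as_text()
        try:
            # Succeeds only if nobody else wrote the manifest since we read it.
            blob.upload_from_string(manifest + chunk_id + "\n",
                                    if_generation_match=blob.generation)
            return
        except exceptions.PreconditionFailed:
            continue                                   # lost the race; re-read and retry
```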
Neither of these things add much complexity to S3.
(I think append is less useful and potentially a lot more complicated, both in terms of its API implications and in terms of the underlying complexity. If I want “append”, then I can use multipart uploads, or just upload multiple objects and reassemble them on the client side.)
Essential for whom? There are lots of other storage solutions if S3 is too simple: distributed databases, or fire up an EC2 cluster and install anything you want on it.
The multi-region thing should be pretty apparent. It’s part of the core S3 design to provide distributed storage. Multi-region is distributing it over a larger area. If you want to implement multi-region storage yourself, you can do it on S3 and pay a high cost for duplicated data, or you can try to implement your own S3 alternative.
For CAS, one example is backup jobs. You can run backup jobs to S3, but there are some safety issues if you want deduplication and you want to expire old data.
> if S3 is too simple
CAS isn’t some kind of super complicated, technical thing.
It would be nice if S3 had this small, incremental additional feature. That’s all. It would mean that some people don’t need to fire up DynamoDB just to do something you can already do in, say, GCS.
Sure… there’s always people out there who have a shorter list of requirements than you do. Someone else out there doesn’t need it to be cost-efficient, so maybe that’s not “essential”?
A currently running backup process can create a new reference to an object which is more than a week old. Meanwhile, the garbage collection process can be deleting that object, but the deletion operation hasn’t finished yet. CAS gives you a lot of options to do this safely.
What if you also have a week between marking/moving a file to start deletion, and the final removal?
If a backup and a GC race then a file can get both referenced and marked at the same time, but then a future GC will see the references and put the file back into a normal state. Assume other operations can still find the file while it's marked.
Are there benefits to CAS for this situation other than resolving faster?
Sure, you could probably use that kind of delay. I have personally seen GC systems that take more than a week to mark which files need to be deleted but this is admittedly unlikely (the system in question was massive).
> Are there benefits to CAS for this situation other than resolving faster?
I think this kind of thing comes up a lot, where you’d find it convenient to have a CAS update for your file. Like, maybe you should be using a database, but you’re already using S3, and having one or two CAS operations would mean that you can stick with S3.
Sometimes, the alternative is a little ugly. Like, “I’m going to create a DynamoDB table, and it’s only going to contain one row.”
What I’d really love, even more, is to have some kind of distributed lock service on AWS. Something like Zookeeper or Etcd as a SaaS product, where it’s cheap just to get a couple distributed locks. Feels like a gap in cloud offerings to me, but I can understand why it’s missing.
You can use S3 for this, no? (Admittedly it's clunkier than a service with SDKs.)
LIST, GET, and PUT are strongly consistent; the file name is the lock name; write the owner ID and expiry timestamp in the file, and periodically extend the lock expiry (heartbeat). If another process finds an expired lock, it deletes the file.
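A rough boto3 rendering of that scheme (bucket and key names are placeholders). Note the caveat from upthread: without a conditional PUT, the read-back check is only best-effort, since two racers can still interleave between the PUT and the GET.

```python
import json
import time

import boto3

s3 = boto3.client("s3")
BUCKET, LOCK_KEY = "my-bucket", "locks/nightly-job"    # hypothetical names
TTL_SECONDS = 60

def try_acquire(owner: str) -> bool:
    try:
        body = json.loads(s3.get_object(Bucket=BUCKET, Key=LOCK_KEY)["Body"].read())
        if body["expires"] > time.time():
            return False                               # somebody holds a live lock
        s3.delete_object(Bucket=BUCKET, Key=LOCK_KEY)  # expired lock: clear it
    except s3.exceptions.NoSuchKey:
        pass                                           # no lock object yet
    # Write our claim, then read back to check we were the last writer.
    # The holder should re-PUT periodically (heartbeat) to extend the expiry.
    s3.put_object(Bucket=BUCKET, Key=LOCK_KEY,
                  Body=json.dumps({"owner": owner, "expires": time.time() + TTL_SECONDS}))
    current = json.loads(s3.get_object(Bucket=BUCKET, Key=LOCK_KEY)["Body"].read())
    return current["owner"] == owner
```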
Oh I'm sure there's lots of systems where CAS is very useful.
It's just that a backup tends to have mostly immutable files sitting around, so it becomes more niche. It's awkward to do a lock but you don't need a lot of locking.
> When it gets too out of hand, people will paper it over with a new, simpler abstraction layer, and the process starts again, only with a layer of garbage spaghetti underneath.
I'm pretty happy that there are S3 compatible stores that you can host yourself, that aren't insanely complex.
There was also Zenko, but I don't think they gained a lot of traction for the most part: https://www.zenko.io/
Of course, many will prefer hosted/managed solutions and that's perfectly fine, but at least when you run software yourself, you have more control over it and can, for the most part, also make the judgement on how hard it is to operate and keep operational (e.g. similar to what you'd experience when running PostgreSQL/MariaDB/MySQL or trying to run Oracle).
That said, my needs (both in regards to features and scaling) are pretty basic, so it's okay to pay any of the vendors for something a bit more advanced and scalable.
SeaweedFS has worked really nicely for our small use case for 2 years. Some docs polish wouldn't hurt, but reading the source isn't hard even though I don't know golang.
This. There's always a small group of people pushing feature requests (ahem, scope creep) into services that were never designed for those things. Unfortunately, those people win a lot of the time. E.g., see all the simple JS frameworks that were initially meant to solve relatively simple problems, only for them to become bloated and be replaced by something else that promised simplicity.
The simple things are simple. Others are possible to do. I don't think that image is representative. "I want to go for a walk, but there's a whole world to choose the path from" - you can still go around the block; the world is not in the way. (There are simple, generic read-only and read-write policies available.)
If you don't care about IAM permissions / authorization, just give them all permissions. That's simple, but probably not secure. If you follow "best practice" you'll spend half your time dealing with granular IAM roles, permissions, security groups, and a ton of other stuff.
"S3 is that simple, here's an example using an otherwise generic HTTP library specifically altered to deal with AWS's tiresome boilerplate complexity".
I don't like putting words in other people's mouths, but that really does seem like a fair paraphrasing of your comment.
S3 was/is optimised for a specific bunch of use cases.
Because EC2's hypervisor was (when it was launched) lacking features (no hot swap, no shared block storage, no host movements, no online backup/clone, no live recovery, no highly available host failover), S3 had to step in to pick up some of the slack that proper block or file storage would have taken.
For better or for worse, people adopted the ephemeral style of computing and used S3 as the state store.
S3 got away with it because it was the only practical object store in town.
The biggest drawback it still has (and will likely always have) is that you can't write parts to a file: it's either replace or nothing.
> I super doubt these limitations are there because of any EC2 requirements.
I think you are misreading me, but to be fair, I was being vague about timelines.
My main thrust is that S3 is dominant because EC2 was/is lacking. S3 is optimised for uptime and consistency, which means that it's brilliant at mostly static file hosting. It will work for more dynamic state-type stuff, it's just not really designed for it (see https://xeiaso.net/blog/anything-message-queue/)
Making your own object store that is fast, durable, and available, and has some other feature on top, is really, really hard to do at scale. It's far easier to put up with S3 than to make your own.
The design of Azure's blob-store abstraction is great progress toward the ideal form of an object-store: a serverless managed store for all types of data buffers, where handles can be efficiently transformed/exchanged between different "states of matter" — immutable objects, vs append-only streamable logs, vs multi-reader multi-writer random-access disks; and where all three of these states of data-buffer can be consumed interchangeably by a consumer through the same set of APIs — where these APIs might work most efficiently for data buffers that are in the right state, but they still work for data buffers that aren't.
Sadly, Azure's implementation of its blob-store is kind of underwhelming — especially for any kind of infrastructure-level use-cases.
For example, while there is a change feed for blob events, akin to S3 lifecycle event notifications, it stops exactly where S3's API stops; so there is no event generated by an append to an appendable blob, nor a write to a page in a page blob or a block in a block blob. (And even if there were, they make no guarantees of the change feed being linearized — saying that changes to some resources might arrive out-of-order or not at all; and that if you want a linearized change feed, you need to read it out of a log-multiplexer, which puts a several-minute delay and multi-minute step-granularity on reads from it.)
As such, you can't use Append Blobs as the storage layer for a Kafka-alike; and nor can you use Page Blobs as the transport to enable an embedded LMDB-alike to be network-replicated. (Or rather, you can, but in both cases you won't receive timely notifications that new data has been added / that pages have been invalidated at the origin, so unless you're operating with zero caching, your cache will end up stale and your state from successive reads will end up incoherent.)
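For concreteness, a short azure-storage-blob sketch (connection string and names are placeholders): the appends themselves are easy, but with no per-append event in the change feed, a consumer is left polling for growth and reading the new byte range itself.

```python
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<connection-string>")  # placeholder
blob = service.get_blob_client(container="logs", blob="segment-000.log")

# Writer: Append Blobs support cheap server-side appends.
blob.create_append_blob()
blob.append_block(b"event-1\n")
blob.append_block(b"event-2\n")

# Reader: no notification fires for those appends, so poll for size changes
# and fetch only the newly appended range.
last_offset = 0
size = blob.get_blob_properties().size
if size > last_offset:
    new_bytes = blob.download_blob(offset=last_offset, length=size - last_offset).readall()
    last_offset = size
```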
In my 1 year of experience with Azure, 3 years ago, I saw this same story play out across every product line. Layers 1-3 look amazing, then layers 4-6 are total head-scratcher deal-breakers. Very frustrating.
Not if you want speed. All of those objects and working out the history takes time. The API is not speedy when it comes to key:value lookups. At that point you're better off with a local object-mapping cache. But now you've created a half-arsed file system.
Multipart upload allows you to upload a single object as a set of parts. Each part is a contiguous portion of the object's data. You can upload these object parts independently and in any order. If transmission of any part fails, you can retransmit that part without affecting other parts. After all parts of your object are uploaded, Amazon S3 assembles these parts and creates the object. In general, when your object size reaches 100 MB, you should consider using multipart uploads instead of uploading the object in a single operation.
Multipart uploads are not the same as append; an object being multipart-uploaded doesn't exist in the bucket namespace until the upload is finalized, and once that is done it cannot be modified without overwriting.
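For reference, a minimal multipart upload with boto3 (bucket/key names are placeholders and the chunking helper is just for illustration). The object only materializes when the final complete call succeeds, which is why this behaves nothing like append.

```python
import boto3

s3 = boto3.client("s3")
BUCKET, KEY = "my-bucket", "big-object.bin"        # hypothetical names

def read_chunks(path: str, size: int = 8 * 1024 * 1024):
    # Parts must be at least 5 MiB, except the last one.
    with open(path, "rb") as f:
        while chunk := f.read(size):
            yield chunk

upload = s3.create_multipart_upload(Bucket=BUCKET, Key=KEY)
parts = []
# Parts can go up independently, in any order, and be retried individually.
for number, chunk in enumerate(read_chunks("big-object.bin"), start=1):
    resp = s3.upload_part(Bucket=BUCKET, Key=KEY, UploadId=upload["UploadId"],
                          PartNumber=number, Body=chunk)
    parts.append({"PartNumber": number, "ETag": resp["ETag"]})

# Until this call, nothing is visible in the bucket namespace; afterwards the
# assembled object can only be replaced wholesale, not appended to.
s3.complete_multipart_upload(Bucket=BUCKET, Key=KEY, UploadId=upload["UploadId"],
                             MultipartUpload={"Parts": parts})
```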
I mean, kinda, but you're still doing a copy-on-write; it's just that (from my understanding, so I could easily be wrong) you are telling S3 to copy the parts of the file you don't upload.
If you are spooling through a file and changing a few chars at a time, you'll be generating a boatload of writes (and/or reading from the wrong file, too).
This is a really smart comment, and I generally agree with your conclusions (disc: I was early at AWS).
Adding to the thread: EBS was kind of created to make up for EC2's ephemeral storage issue, but in 2024 I'm pretty sure AWS would design S3 + EBS + EC2 in a very different way.
Yes. I don't really understand why they don't call it an ETag.
Maybe because you're more likely to have the same generation for different paths, because it's not a hash? An ETag also isn't guaranteed to be a hash, though, so idk.
Partial updates could be useful for certain kinds of (mostly binary) files, but block storage is going to handle that much better in general than object storage. The concurrency and consistency guarantees are quite different between object and block storage. Making partial updates atomic would be quite difficult in general in S3, though simple preconditions like compare-and-swap (which is sorely needed anyway) might be sufficient to make it possible for certain use cases.
The paradigm can be flipped now for these distributed storage systems.
The blocks of a filesystem can now be objects, replicated or erasure-coded - like Ceph running filesystems on top of its low-level object storage protocol, which is done on raw disks, not a filesystem.
This can't be done for something like Minio just running on your filesystem, but if you're building the storage system from the ground up it can.
We will see more and more of these products appearing.
Vast Data is another interesting one. Global deduplication and compression for your data, with an S3, block, or NFS interface.
Storing differential backups for thousands or millions of VMs?
You'll only store the data from base Ubuntu image once.
I was referring more to the list of hypervisor features not implemented:
>> Because EC2's hypervisor was (when it was launched) lacking features (no hot swap, not shared block storage, no host movements, no online backup/clone, no live recovery, no Highly available host failover ) S3 had to step in to pick up some of the slack that proper block or file storage would have taken.
I don't want any of that nonsense in my compute layer or an application (at scale) that relies on shared block storage or host movements or live recovery.
One of the important premises of object storage is that if your PutObject or multipart upload succeed, the entire object is atomically replaced. It is eventually consistent, so you may not immediately retrieve the just-uploaded object with GetObject, but you should see the new version eventually, and never see part of one version mixed with part of another. This should natively support compare-and-swap: "hey if the existing etag is what I expect, apply my change, otherwise ignore my change and tell me so". This has nothing to do with DynamoDB and is not reimplementing its feature set. It is just a natural extension of how the service already works (from an API consumer perspective, not necessarily an implementation perspective).
Transparent HA means that I can fail services over to other regions without having to get the programmers to think about it. Most of the busywork at scale is managing state or, more correctly, recovering state from broken machines.
If I can make something else do that reliably, rather than engineer it myself, that's a win.
So much of the work of standing up a cluster (be it k8s or something else) is getting to the point where you can arbitrarily kill a datastore and it self heal.
If you're talking about S3 partial updates, it's about cost and/or performance. If you're dealing with megabyte chunks, and you want to flip a few bytes across hundreds of thousands of them, that's going to eat into transfer costs.
Sure, you could chunk up the files even smaller, but then you run into access latency (S3 ain't that fast).
I was referring to the notion that "failings" in the hypervisor layer like "hot swap, shared block storage, host movements, online backup/clone, live recovery, Highly available host failover" are a problem. At scale, I don't want my application to rely on any of that magic.
Reliability is always your problem not something to be punted to another layer of the stack that lets you pretend stuff doesn't go wrong.
Yup, which is why relying on devs to engineer it is a pain in the arse. Having online migration is such a useful tool for avoiding accidental overloads when doing maintenance; it's also a great tool to have when testing config changes.
Currently I work at a place that has its own container spec and scheduler. This makes sense because we have literal millions of machines to manage, but that's an edge case.
For something like a global newspaper (where I used to work) it would be massive overkill. We spent far too long making K8s act like a mainframe, when we could have bought one 20 times over and still had change for a good party every week. Or just used hosted databases and liberal caches.
Oh sure -- for piddly enterprise nonsense, having some VM yeeting magic to HA a thing that's not HA is .... yeah, I guess. Ideally in combination with tested backups for when the HA magic corrupts instead of protects, but such is life.
But that's not "at scale" that's just some great plains accounting app that's been dragged from one pickle jar to another.
In 2016 we had a 36k cluster. There was something like 2 PB of fast online storage, 48 PB of nearline, and two massive tape libraries for backup/interchange.
The cluster was ephemeral and could be reprovisioned automatically by netboot. However, the DNS/DHCP + auth servers were on the critical path, so we dumped them on a VMware cluster to make sure we could keep them as close to 100% available as possible. Yes, they were replicated, but they were also running on separate HA clusters with mirrored storage. This meant that if we lost both of them, we could within a few minutes run them directly from a snapshot, or, if it was a catasrafuck, reload the config from git.
Now, we could have made our own DNS+DHCP server, and/or Kerberos/LDAP/Active Directory, but that costs money and wasn't worth the time. Plus the risk of running your own with a small crew (fewer than 10 infra people) was way too high.
VMware was almost mainframe level of uptime, if you did it right.
I want somewhere reasonably priced to keep my files where I can confidently know they'll be there a decade from now.
For that purpose, I think S3's age is a killer feature. I won't be surprised when we see the HN post "google cloud storage has been sent to the graveyard"
I put all my "decade from now" files on three separate 500GB USB drives, each storing the same data, and each drive from different manufacturers/dates. Two drives stored in a safe at my house, and one in a safe at my brother's house. I used to store in S3, and it was more convenient, but I just felt weird about storing really important files on someone else's cloud -- what if I don't login to the AWS account for 9 years, and then when I need my files, I find out AWS kindly deleted/disabled my account and I didn't know about it? That situation is my biggest concern.
> what if I don't login to the AWS account for 9 years, and then when I need my files, I find out AWS kindly deleted/disabled my account and I didn't know about it? That situation is my biggest concern.
Understandable concern, but AWS being targeted at enterprise gives me confidence they don't do any funny stuff like that. Oh, and of course, they bill you monthly, so ideally you'd have a tell if something goes wrong.
I also store my files locally on a USB drive in alternative locations. I conceptually trust S3, but maybe actions speak louder.
Local vs cloud backups each have their own tradeoffs. I use both.
For cloud backup I use Arq backing up daily to AWS (no affiliation with Arq other than being a happy customer). You get client side encryption and the daily backup directly mitigates your concern, if there’s an AWS account issue you will know immediately and can fix it. For my storage amount and use it only costs about $2 a month.
I don’t know. I’ve been using this setup for 7+ years, daily backups, and never had this problem.
I have about 80 GB or so of data being backed up. The daily backups upload only new files plus files that are changed. The largest monthly AWS bill I ever got was $6. The next month it went back to the usual ~$2 range.
Family photos/videos and genealogy documents from multiple generations of family make up the bulk of the data I hold dear.
While a scanned copy of something like a birth certificate or deed might not often work in place of a physical copy they're nice to have on hand. Plus they're a part of a family history.
I see. None of those things (in my experience, no house title ever) have sufficed as a copy. So if you need it and your house has burnt down, it's a wait until they're reissued.
You have to pay to store stuff on S3 (or Glacier) monthly, so I don't think they care whether you log in - as long as the credit card on file allows billing for 9 years.
Their killing of Google Domains, selling it to Squarespace, and then Squarespace selling to PE was the most incredible degradation of service I've seen in a while!
It was such a bizarre choice, I don't understand. Why wouldn't they want new businesses to come to them first? Naming a company often involves getting a domain and then paying for hosting, email, and/or other services. This all makes me think gsuite will be next on the chopping block.
> This all makes me think gsuite will be next on the chopping block.
GSuite is the most profitable part of GCP, and it's totally propping that division up. It'll be the last to go - it's a lot easier to recognize bad management.
Domains probably was expensive to maintain or understaffed for some reason, and some exec knew a guy at Squarespace.
I don't see how Domains going away would suggest GSuite is near the chopping block. I doubt many major customers of GSuite even bothered with Domains. It was such a small and bare-bones business, pretty unrelated to the rest of GSuite.
Several years ago a startup owner I was friends with was doing IoT stuff. If they'd ended up choosing the Google option for their device comms they'd have been in a bad place from this. Pretty sure they went with the AWS IoT stuff though.
At this point, I wonder if there's anyone left on Earth who Google haven't screwed over in some significant way?
The mere fact that abstractions like S3 even exist still boggles my mind. Infinitely scalable, indefinitely persistent, inexpensive, super-high reliability, software-addressable storage accessible from (almost) anywhere on the planet. I'm sure tfa's critique is valid, but also... we have miraculous tools.
Until recently, S3 had an eventual-consistency model (a given file/object would be internally consistent, but different readers would see creations/deletions/etc in different orders). It's favoring availability over consistency.
Using file operations for mutexes makes sense in Unix because of the filesystem semantics there but it makes less sense in a distributed object store.
S3 has had read-after-write strong consistency since 2020, so... yeah, I guess that's still 'recently', given that S3 existed for 14 years prior to that with eventual consistency.
CAS (or If-Match, If-None-Match) is something that is atomic but localized to a single key. You don’t have to provide any additional consistency guarantees beyond what S3 already provides. You are providing a couple of new atomic operations with the same radius as existing operations—everything is already atomic on a single key; you’re just adding conditions to those operations.
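In plain HTTP terms (RFC 7232 preconditions), against a hypothetical object URL, the pattern being asked for is just:

```python
import requests

url = "https://objects.example.com/bucket/state.json"   # hypothetical endpoint

resp = requests.get(url)
etag = resp.headers["ETag"]
new_body = resp.content + b"\n"                          # stand-in for your real update

# Apply the write only if nobody changed the object since we read it.
put = requests.put(url, data=new_body, headers={"If-Match": etag})
if put.status_code == 412:                               # Precondition Failed
    pass                                                  # lost the race: re-read and retry
```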
You could argue that the reticence to add features to your service, due to the sheer complexity, the time commitment, or a flat-out unwillingness to upset the apple cart, is a sign of age.
I would argue it doesn't have much to do with age, but rather with market dominance. If you are the top dog you don't need to fight for users by implementing various features.
I really wish S3 had the ability to rename / move files without having to copy the data all over again. It seems like something that should not be necessary, given that that information is just metadata.
Even if there is some technical reason why the data needs copying, S3 could at least pretend that the file is in the new place until it’s actually there.
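For context, the workaround today is a server-side copy followed by a delete; a boto3 sketch (the bucket name is a placeholder; objects over 5 GB need a multipart copy via upload_part_copy instead):

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "my-bucket"                                      # hypothetical name

def rename(src_key: str, dst_key: str) -> None:
    # The data is rewritten under the new key even though, conceptually,
    # only metadata changes; you pay for the copy and briefly store it twice.
    s3.copy_object(Bucket=BUCKET, Key=dst_key,
                   CopySource={"Bucket": BUCKET, "Key": src_key})
    s3.delete_object(Bucket=BUCKET, Key=src_key)
```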
> S3 doesn’t have dual-region or multi-region buckets
This is true, but S3 does support replication (including deletion markers), and even 2-way replication, between two regions. Definitely not the same thing as a dual-region bucket, but it can satisfy many use cases where dual-region bucket would be used otherwise.
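For example, a minimal replication rule via boto3 looks roughly like this (bucket names and the role ARN are placeholders; both buckets need versioning enabled, and a mirrored rule on the other bucket gives you the 2-way setup):

```python
import boto3

s3 = boto3.client("s3")

s3.put_bucket_replication(
    Bucket="source-bucket-us-east-1",                             # placeholder
    ReplicationConfiguration={
        "Role": "arn:aws:iam::123456789012:role/s3-replication",  # placeholder
        "Rules": [{
            "ID": "replicate-everything",
            "Status": "Enabled",
            "Priority": 1,
            "Filter": {"Prefix": ""},                             # whole bucket
            "DeleteMarkerReplication": {"Status": "Enabled"},
            "Destination": {"Bucket": "arn:aws:s3:::replica-bucket-eu-west-1"},
        }],
    },
)
```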
One two-region bucket would be cheaper than two one-region buckets. I’ve done analyses of similar systems and calculated the costs necessary.
You pay for durability and availability in the form of disk overhead, CPU, and network. Each encoding scheme has some expected cost. If your overhead for one region is $X per gigabyte, then generally speaking, the overhead for two regions is going to be less than $2X—each region can get away with a cheaper encoding, because the copy stored in the other region already makes the data more durable and available.
Append would let you build a lot of other systems. I mean, the only functional difference between S3 and GFS is the append operation. Google built Bigtable, Megastore, and who knows what more over GFS. You can't do the same with S3 (without having to implement the append somewhere else yourself).
GFS? Google hasn’t used GFS for, like, fifteen years.
You can totally build stuff like Megastore or Bigtable (or Spanner) on top of S3. You use a log-structured merge tree. That’s how these systems work in the first place. In the log-structured merge tree, you have a set of files containing your data but you don’t modify them. Instead, you write new files containing the changes (the log). Eventually you compact them by writing a complete copy and deleting the old versions.
This works just fine on S3, and there are even some key-value stores built on top of S3 that work this way. Colossus is cheaper for short-lived data.
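A deliberately naive, single-writer sketch of that pattern with boto3 (names are placeholders; a real system would keep a local index, paginate listings, and coordinate compaction, which is exactly where CAS would help):

```python
import json
import time

import boto3

s3 = boto3.client("s3")
BUCKET, PREFIX = "kv-bucket", "segments/"          # hypothetical names

def write_batch(updates: dict) -> None:
    # Each batch of changes becomes a new immutable segment object (the log).
    key = f"{PREFIX}{time.time_ns()}.json"
    s3.put_object(Bucket=BUCKET, Key=key, Body=json.dumps(updates))

def read(key: str):
    # Newest segment wins; a real system keeps a local index / bloom filters
    # instead of listing and fetching segments on every read.
    listing = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX).get("Contents", [])
    for obj in sorted(listing, key=lambda o: o["Key"], reverse=True):
        data = json.loads(s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read())
        if key in data:
            return data[key]
    return None

def compact() -> None:
    # Merge all segments into one and delete the originals. Assumes a single
    # compactor and no pagination; making concurrent compactors safe is where
    # a conditional write (CAS) would come in.
    listing = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX).get("Contents", [])
    merged: dict = {}
    for obj in sorted(listing, key=lambda o: o["Key"]):            # oldest first
        merged.update(json.loads(
            s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read()))
    s3.put_object(Bucket=BUCKET, Key=f"{PREFIX}{time.time_ns()}.json",
                  Body=json.dumps(merged))
    for obj in listing:
        s3.delete_object(Bucket=BUCKET, Key=obj["Key"])
```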
> you write new files containing the changes (the log)
People are asking for create-if-not-exists specifically to be able to add objects to an ordered log, without needing a separate service for coordination.
S3 cannot be used for this. GCS, for example, can.
Yeah—I know I didn’t mention that. You can build this type of system on top of S3 + CAS.
The “append” feature isn’t necessary for functionality, it just improves the cost / performance.
The idea of running your own database on top of S3 is, well, it’s gonna be janky. It’s not ideal. You do end up seeing databases running on top of S3 (I’ve seen some), and sometimes it even makes sense.
I think you missed the point. GFS supported append operations and was created around 24 years ago. S3 still hasn't caught up with this particular feature. Although they clearly implemented it, as you can do a long multipart upload and S3 will join the file for you.
At a high level, yes, you can implement systems like Megastore or Bigtable over S3. However, there are many details you must take into account. You cannot simply wave away the complexity and potential failure scenarios.
For starters, how are you going to create the newest SST?
If you keep it in memory or on disk, it must be replicated to prevent data loss if a machine fails. This approach could lead to losing the most recent changes. Additionally, you end up with a hybrid system that needs to read data from multiple sources, which adds complexity. If you essentially reimplement the system, why use S3 at all?
What if the data volume gets too low and you end up writing many small, expensive files?
Using something like Kinesis for batching might work, but the data won't be visible for N minutes.
Merging partial tables also requires maintaining an external index to track availability. Transactions would be helpful, but how do you handle failures?
And we haven't even mentioned managing garbage collection. It would require an external lock or reference count system.
Maybe I am thinking more broadly when I imagine what it means to implement something like Spanner on top of S3.
We know that Spanner on top of S3 is not going to give you the same price/performance as building Spanner on top of Colossus while giving you the same semantics. You either relax the semantics a little bit, you pay out the nose for a lot of little files, or you find a durable place outside S3 to store the newest data.
> At a high level, yes, you can implement systems like Megastore or Bigtable over S3. However, there are many details you must take into account. You cannot simply wave away the complexity and potential failure scenarios.
Most of the complexity is the same whether you implement Megastore on top of S3 or on top of GFS. You can’t handwave it in either scenario.
> If you essentially reimplement the system, why use S3 at all?
It’s highly durable, highly available, and cheap (under certain usage scenarios).
GFS is not available for anyone to use, inside or outside Google. Its successor, Colossus, is not available outside Google. They’re just not available.
I haven't used S3: does compacting work within their system, like with an API call, or do you have to download all the chunks, upload the concatenated result, and then delete the chunks?
S3 does simple, reliable object storage well. If anything, its draconian pricing model of charging exorbitant prices for bandwidth is showing its age, and is why we've moved to Cloudflare R2 for its zero egress fees.
Being able to reuse the s3 command-line tools and existing S3 libraries has made the migration painless, so I'm thankful they've created a de facto standard that other S3-compatible object storage providers can implement.
Interestingly, DynamoDB is cheaper than S3 though, compared by number of requests. DynamoDB costs $1.25 per million write request units and $0.25 per million read request units. While S3 is $5 per million PUT requests and $0.4 per million GET requests.
That's a good point: only very small data is cheaper in DynamoDB than S3. Also, adding global secondary indexes tends to add cost, since writes are charged for each of them.
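A quick back-of-the-envelope with the request prices quoted above shows where the break-even sits (storage pricing, which also differs a lot, is ignored here):

```python
import math

S3_PUT = 5.00 / 1e6          # $ per PUT, regardless of object size
DDB_WRITE_UNIT = 1.25 / 1e6  # $ per write request unit; one unit covers up to 1 KB

def ddb_write_cost(item_kb: float) -> float:
    # Larger items consume proportionally more write units (item cap is 400 KB).
    return math.ceil(item_kb) * DDB_WRITE_UNIT

for kb in (1, 4, 100, 400):
    print(f"{kb:>3} KB  DynamoDB ${ddb_write_cost(kb):.8f}  S3 ${S3_PUT:.8f}")
# 1 KB: DynamoDB is ~4x cheaper; 4 KB: break-even; 100 KB: S3 is ~25x cheaper per write.
```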
This is perfectly fine. S3 is simple and stable, and that's its selling point. There's no competing with that history and proven stability, as opposed to chasing shiny features.
Does it? The x-amz-copy-source-if-match etc headers seem to be talking about the source, not the target. I see nothing there to express "fail if target object exists already".
Azure Storage Accounts have major problems with key partitioning that you have to be aware of at the application level that S3 has no problem with. Additionally, Azure Storage accounts have bandwidth throttle limits that can force you to shard across multiple accounts which is pretty painful. Azure SDKs outside of C# and Java are also not well supported in my experience.
EDIT: This was based on my experience from ~2 years ago. The Blob storage team has reached out to me via email and let me know that both the key partitioning and throttling issues have been fixed since then.
I would consider the documented performance targets [0] for a standard Azure Blob account to be very good. We're talking 60 Gbps in/120 Gbps out, with 20,000 requests per second as the default request rate.
From what I can tell, the S3 request rate is about 9,000 requests per second [1] split between reads and writes for a single partition. From my perspective it really just depends on what you're trying to build but I don't see the performance of Azure Storage as being an issue in any way for a typical application.
Partitioning will also depend heavily on what kind of application you're building, but the documentation does point out that load balancing will kick in once it starts to see a lot of traffic on a partition [2]. Since you have to use partitioning for S3 in order to get better performance, I don't really see how that's a point against Azure.
As for SDKs [3] I have no idea how good support is, but they all have commits within the last day.
That’s 3,500 PUT and 5,500 GET requests per second per PREFIX.
S3 does a bad job of describing what a prefix is, but for most[1] practical intents you can consider the entire key of an object the prefix.
[1] The actual behaviour treats prefixes like partitions, but it’s completely automated and as long as you don’t expect an instant scale up to very large request rates, S3’s performance is basically unlimited. There are no per account hard or soft limits that need increasing or limit scalability.
One feature that I definitely miss is a strongly consistent putIfAbsent API. A lot of big data table formats like Delta.io would benefit so much from it; right now you need to work around it by connecting to DynamoDB :/
Assuming you want to write to dest_path with put-if-absent semantics, here's a sketch:
- Write to a unique temporary path (call this temp_path)
- Commit a record to DynamoDB with put-if-absent semantics on the key (dest_path) and the value (temp_path, "incomplete")
- Three possible outcomes:
1. Your write succeeded; proceed to issue a CopyObject call from temp_path to dest_path and if successful mark as complete in DynamoDB.
2. The row already existed in DynamoDB as "complete" -> do nothing
3. The row already existed in DynamoDB as "incomplete" -> another write has been committed, but is not complete; attempt to repair it by issuing a CopyObject for the path in DynamoDB. On success, mark as complete in DynamoDB. ("repair" step)
Reads could also hit DynamoDB before S3 in order to perform the "repair" step if applicable.
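A rough boto3 rendering of that sketch (table and bucket names are placeholders; the repair path for case 3 is left out):

```python
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
ddb = boto3.client("dynamodb")
BUCKET, TABLE = "my-bucket", "s3-commits"                 # hypothetical names

def put_if_absent(temp_path: str, dest_path: str) -> bool:
    try:
        # Claim dest_path first; only one writer can win this conditional put.
        ddb.put_item(TableName=TABLE,
                     Item={"dest_path": {"S": dest_path},
                           "temp_path": {"S": temp_path},
                           "state": {"S": "incomplete"}},
                     ConditionExpression="attribute_not_exists(dest_path)")
    except ClientError as e:
        if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False        # someone else committed first (caller may repair)
        raise
    # Winner copies the staged object into place, then marks the row complete.
    s3.copy_object(Bucket=BUCKET, Key=dest_path,
                   CopySource={"Bucket": BUCKET, "Key": temp_path})
    ddb.update_item(TableName=TABLE,
                    Key={"dest_path": {"S": dest_path}},
                    UpdateExpression="SET #s = :c",
                    ExpressionAttributeNames={"#s": "state"},   # "state" is reserved
                    ExpressionAttributeValues={":c": {"S": "complete"}})
    return True
```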
When I have a problem figuring something out with S3, I slide into my shared slack channel with my AWS account rep and solutions engineers and ask them for help, and they help. Is this a feature other cloud storage providers have?
(But this rarely happens with S3, because it's so simple.)
It feels like the S3 team has managed to avoid chasing the marginal user with features. A great example of restraint. It does raise the question of whether the system could be improved in other areas, like optimizations to reduce price.
> By embracing DynamoDB as your metadata layer, systems stand to gain a lot.
I just implemented a "posix-like" filesystem on top of it. Which means large object offload to s3 is not a problem or even an "ugly abstraction." In fact it looks quite natural once you get down to the layer that does this.
You also get something like regular file locks which extend to s3 objects, and if you're using cognito, you can simplify your permissions management and run it all through the file system as well.
Amazon's profit margins on data transfer are out of this world, like probably > 80%. Data transfer costs are hard and confusing to predict (especially with different regions, etc), so it makes sense they'd gouge on it more than storage.
I like Cloudflare, but this is also what makes me skeptical. There's no such thing as a free lunch, and I've had the rug pulled out from underneath me many times before. I wish they at least charged a sustainable amount for egress, because I feel like it's coming eventually.
Supposedly it’s not applicable to R2, but their Enterprise egress pricing is around the $0.05/GB rate. They won’t contact you though until you start to get close to the $5,000/month usage level, which itself can be a problem as your fee goes from $0/month to $5,000/month.
R2 still has per-request fees which, given PUTs are still $4.50/million and GETs $0.36/million (only about 10% off S3’s), they still have significant margin there to cover egress for objects of reasonable size.
One other benefit of paying up front is that I think you are less restricted on S3/CloudFront in what you can host. I think Cloudflare has some restrictions (video files?) to keep costs more sane. Otherwise, you could spin up a YouTube clone extremely cheaply.