AWS in 2025: Stuff you think you know that's now wrong

simonw · 2025-08-20T16:18:18 1755706698

S3: "Block Public Access is now enabled by default on new buckets."

On the one hand, this is obviously the right decision. The number of giant data breeches caused by incorrectly configured S3 buckets is enormous.

But... every year or so I find myself wanting to create an S3 bucket with public read access to I can serve files out of it. And every time I need to do that I find something has changed and my old recipe doesn't work any more and I have to figure it out again from scratch!

sylens · 2025-08-20T18:10:54 1755713454

The thing to keep in mind with the "Block Public Access" setting is that is a redundancy built in to save people from making really big mistakes.

Even if you have a terrible and permissive bucket policy or ACLs (legacy but still around) configured for the S3 bucket, if you have Block Public Access turned on - it won't matter. It still won't allow public access to the objects within.

If you turn it off but you have a well scoped and ironclad bucket policy - you're still good! The bucket policy will dictate who, if anyone, has access. Of course, you have to make sure nobody inadvertantly modifies that bucket policy over time, or adds an IAM role with access, or modifies the trust policy for an existing IAM role that has access, and so on.

simonw · 2025-08-20T19:20:26 1755717626

I think this is the key of why I find it confusing: I need a very clear diagram showing which rules override which other rules.

saghm · 2025-08-20T23:41:24 1755733284

My understanding is that there isn't actually any "overriding" in the sense of two rules conflicting and one of them having to "win" and take effect. I think it's more that an enabled rule always is in effect, but it might overlap with another rule, in which case removing one of them still won't remove the restrictions on the area of overlap. It's possible I'm reading too much into your choice of words, but it does sound like there's a chance that the confusion is stemming from an incorrect assumption of how various permissions interact.

That being said, there's certain a lot more that could into making a system like that easier for developers. One thing that springs to mind is tooling that can describe what rules are currently in effect that limit (or grant, depending on the model) permissions for something. That would make it more clear when there are overlapping rules that affect the permissions of something, which in turn would make it much more clear why something is still not accessible from a given context despite one of the rules being removed.

jagged-chisel · 2025-08-21T11:40:37 1755776437

If one rule explicitly restricts access and another explicitly grants access, which one is in effect? Do restrictions override grants? Does a grant to GroupOne override a restriction to GroupAlpha when the authenticated use in is both groups? Do rules set by GodAdmin override rules set by AngelAdmin?

saghm · 2025-08-21T15:21:10 1755789670

It's possible I'm making the exact mistake that the article describes and relying on outdated information, but my understanding is that pretty much all of the rules are actually permissions rather than restrictions. "Block public access" is an unfortunate exception to this, and I suspect that it's probably just a poorly named inversion of an "allow public access" permission. You're 100% right that modeling permissions like this requires having everything in the same "direction", i.e. either all permissions or all restrictions.

After thinking about this sort of thing a lot when designing a system for something sort of similar to this (at a much smaller scale, but with the intent to define it in a way that could be extended to define new types of rules for a given set of resources), I feel pretty strongly that the best way for a system like this to work from the protectives of security, ease of implementation, and intuitiveness for users are all aligned in requiring every rule to explicitly be defined as a permission rather than representing any of them as restrictions (both in how they're presented to the user and how they're modeled under the hood). With this model, veryifing whether an action is allowed can be implemented by mapping an action to the set of accesses (or mutations, as the case may be) it would perform, and then checking that each of them has a rule present that allows it. This makes it much easier to figure out whether something is allowed or not, and there's plenty of room for quality of life things to help users understand the system (e.g. being able to easily show a user what rules pertain to a given resource with essentially the same lookup that you'd need to do when verifying an action in it). My sense is that this is actually not far from how AWS permissions are implemented under the hood, but they completely fail at the user-facing side of this by making it much harder than it needs to be to discover where to define the rules for something (and by extension, where to find the rules currently in effect for it).

luluthefirst · 2025-08-21T13:51:08 1755784268

They don't really override each other but they act like stacked barriers, like a garage door blocking access to an open or closed car. Access is granted if every relevant layer allows it.

cedws · 2025-08-21T03:37:20 1755747440

I’m getting deja vu, didn’t they already do this like 10 years ago because people kept leaving their buckets wide open?

awongh · 2025-08-20T23:22:32 1755732152

This is exactly what I use LLMs for. To just read the docs for me and pull out the base level demo code that's buried in all the AWS documentation.

Once I have that I can also ask it for the custom tweaks I need.

dcminter · 2025-08-21T06:46:33 1755758793

This could not possibly go wrong...

You're braver than me if you're willing to trust the LLM here - fine if you're ready to properly review all the relevant docs once you have code in hand, but there are some very expensive risks otherwise.

awongh · 2025-08-21T10:29:41 1755772181

This is LLM as semantic search- so it's way way easier to start from the basic example code and google to confirm that it's correct than it is to read the docs from scratch and piece together the basic example code. Especially for things like configurations and permissions.

dcminter · 2025-08-21T11:27:36 1755775656

Sure, if you do that second part of verifying it. If you just get the LLM to spit it out then yolo it into production it is going to make you sad at some point.

simianwords · 2025-08-21T07:01:03 1755759663

There’s nothing brave in this. It generally works the way it should and even if it doesn’t - you just go back to see what went wrong.

I take code from stack overflow all the time and there’s like a 90% chance it can work. What’s the difference here?

jcattle · 2025-08-21T07:52:08 1755762728

However on AWS the difference between "generally working the way it should and not working the way it should" can be a 30,000$ cloud bill racked up in a few hours with EC2 going full speed ahead mining bitcoin.

simianwords · 2025-08-21T07:56:56 1755763016

For those high stakes cases maybe you can be more careful. You can still use an LLM to search and get references to the appropriate place and do your own verification.

But for low stakes LLM works just fine - not everything is going to blow up to a 30,000 bill.

In fact I'll take the complete opposite stance - verifying your design with an LLM will help you _save_ money more often than not. It knows things you don't and has awareness of concepts that you might have not even read about.

dcminter · 2025-08-21T07:41:54 1755762114

Well, the "accidentally making the S3 bucket public" scenario would be a good one. If you review carefully with full understanding of what e.g. all your policies are doing then great, no problem.

If you don't do that will you necessarily notice that you accidentally leaked customer data to the world?

The problem isn't the LLM it's assuming its output is correct just the same as assuming Stack Overflow answers are correct without verifying/understanding them.

simianwords · 2025-08-21T07:52:17 1755762737

I agree but its about the extent. I'm willing to accept the risk of ocassionally making S3 public but getting things done much faster, much like I don't meticulously read documentation when I can get the answer from stackoverflow.

If you are comparing with stackoverflow then I guess we are on the same page - most people are fine with taking stuff from stackoverflow and it doesn't count as "brave".

dcminter · 2025-08-21T08:09:45 1755763785

I think anyone who just copies and pastes from SO is indeed "brave" for pretty much exactly the same reason.

> I'm willing to accept the risk of ocassionally making S3 public

This is definitely where we diverge. I'm generally working with stuff that legally cannot be exposed - with hefty compliance fines on the horizon if we fuck up.

simianwords · 2025-08-21T08:39:18 1755765558

That's fair - I would definitely use stackoverflow liberally and dive into documentation when situation demands it.

awongh · 2025-08-21T10:31:55 1755772315

The thing is that you can now ask the LLM for links and you can ask it to break down why it thinks a piece of code, for example, protects the bucket from being public. Things that are easy to verify against the actual docs.

I feel like this workflow is still less time, easier and less error prone than digging out the exact right syntax from the AWS docs.

jiggawatts · 2025-08-21T10:49:50 1755773390

Back when GPT4 was the new hotness, I dumped the markdown text from the Azure documentation GitHub repo into a vector index and wrapped a chatbot around it. That way, I got answers based on the latest documentation instead of a year-old LLM model's fuzzy memory.

I now have the daunting challenge of deploying an Azure Kubernetes cluster with... shudder... Windows Server containers on top. There's a mile-long list of deprecations and missing features that were fixed just "last week" (or whatever). That is just too much work to keep up with for mere humans.

I'm thinking of doing the same kind of customised chatbot but with a scheduled daily script that pulls the latest doco commits, and the Azure blogs, and the open GitHub issue tickets in the relevant projects and dumps all of that directly into the chat context.

I'm going to roll up my sleeves next week and actually do that.

Then, then, I'm going to ask the wizard in the machine how to make this madness work.

Pray for me.

andrewmcwatters · 2025-08-20T17:37:35 1755711455

This sort of thing drives me nuts in interviews, when people are like, are you familiar with such-and-such technology?

Yeah, what month?

tester756 · 2025-08-20T20:08:09 1755720489

If you're aware of changes, then explain that there were changes over time, that's it

reactordev · 2025-08-21T02:32:13 1755743533

You say this, someone challenges you, now you're on the defensive during an interview and everyone has a bad taste in their mouth. Yeah, that's how it goes.

pas · 2025-08-21T07:38:55 1755761935

That's just the taste of iron from the blood after the duel. But this is completely normal after a formal challenge! Companies want real cyberwarriors, and the old (lame) rockstar ninjas that they hired 10 years ago are very prone to issuing these.

reactordev · 2025-08-21T14:58:38 1755788318

I don’t want to go to war, I just want a quiet house in the mountains and a career that allows me to think about things.

andrewmcwatters · 2025-08-21T16:56:27 1755795387

Amen.

andrewmcwatters · 2025-08-20T21:31:08 1755725468

You seem to be lacking the experience of what actually happens in interviews.

reactordev · 2025-08-21T02:30:59 1755743459

They'll teach you how for $250 and a certification test...

crinkly · 2025-08-20T17:23:31 1755710611

I just stick CloudFront in front of those buckets. You don't need to expose the bucket at all then and can point it at a canonical hostname in your DNS.

hnlmorg · 2025-08-20T18:14:30 1755713670

That’s definitely the “correct” way of doing things if you’re writing infra professionally. But I do also get that more casual users might prefer not to incur the additional costs nor complexity of having CloudFront in front. Though at that point, one could reasonably ask if S3 is the right choice for causal users.

gchamonlive · 2025-08-20T18:52:08 1755715928

S3 + cloudfront is also incredibly popular so you can just find recipes for automating that in any technology you want, Terraform, ansible, plain bash scripts, Cloudformation (god forbid)

gigatexal · 2025-08-20T18:59:37 1755716377

Yeah holy crap why is cloud formation so terrible?

gchamonlive · 2025-08-20T19:34:56 1755718496

It's designed to be a declarative DSL, but then you have to do all sorts of filters and maps in any group of resources and suddenly you are programming in yaml with both hands tied behind your back

gigatexal · 2025-08-20T21:48:31 1755726511

Yeah it’s just terrible. If Amazon knew what was good they’d just replace it with almost anything else. Heck just got all in on terraform and call it a day.

mdaniel · 2025-08-21T04:24:11 1755750251

This may be heresy in an AWS thread, but as a concept Bicep actually isn't terrible: https://github.com/Azure/bicep/blob/v0.37.4/src/Bicep.Cli.E2...

It does compile down to Azure Resource Manager's json DSL, so in that way close to Troposphere I guess, only both sides are official and not just some rando project that happens to emit yaml/json

The implementation, of course, is ... very Azure, so I don't mean to praise using it, merely that it's a better idea than rawdogging json

hnlmorg · 2025-08-21T07:49:08 1755762548

I’ve heard so many bad things about bicep on Azure that I’m not convinced it’s an upgrade over TF.

The syntax does look nicer but sadly that’s just a superficial improvement.

hnlmorg · 2025-08-21T07:44:38 1755762278

They do contribute to the AWS provider for Terraform.

Also that have CDK which is a framework for writing IaC in Java/TypeScript, Go, Python, etc.

gigatexal · 2025-08-21T21:58:12 1755813492

Meh. The CDK doesn’t look terrible. It’s still not ideal. But even if this compiles to a mess of CF it’s still better than writing CF by hand and that’s only because CF is so bad to begin with.

https://dev.to/kelvinskell/getting-started-with-aws-cdk-in-p...

mdaniel · 2025-08-21T04:28:46 1755750526

As for "go all in on terraform," I pray to all that is holy every night that terraform rots in the hell that spawned it. And that's not even getting into the rug pull parts, I mean the very idea of

1. I need a goddamn CLI to run it (versus giving someone a URL they can load in their tenant and have running resources afterward)

1. the goddamn CLI mandates live cloud credentials, but then stright-up never uses them to check a goddamn thing it intends to do to my cloud control plane

You may say "running 'plan' does" and I can offer 50+ examples clearly demonstrating that it does not catch the most facepalm of bugs

1. related to that, having a state file that believes it knows what exists in the world is just ludicrous and pain made manifest

1. a tool that thinks nuking things is an appropriate fix ... whew. Although I guess in our new LLM world, saying such things makes me the old person who should get onboard the "nothing matters" train

and the language is a dumpster, imho

gchamonlive · 2025-08-21T13:56:45 1755784605

All Terraform does is build a DAG, compare it with the current state file and pass the changes down to the provider so it can translate to the correct sequence of interactions with the upstream API. Most of your criticism boils down to limitations of the cloud provider API and/or Terraform provider quality. It won't check for naming collision for instance, it assumes you know what you are doing.

Regarding HCL, I respect their decision to keep the language minimal, and for all it's worth you can go very, very far with the language expressions and using modules to abstract some logic, but I think it's a fair criticism for the language not to support custom functions and higher level abstractions.

hnlmorg · 2025-08-21T08:04:34 1755763474

There's a lot wrong with Terraform but I don't think you're being at all fair with your specific critisims here:

> 1. I need a goddamn CLI to run it (versus giving someone a URL they can load in their tenant and have running resources afterward)

CloudFormation is the only IaC that supports "running as a URL" and that's only because it's an AWS native solution. And CloudFormation is a hell of a lot more painful to write and slower to iterate on. So you're not any better off for using CF.

What usually happens with TF is you'd build a deploy pipeline. Thus you can test via the CLI then deploy via CI/CD. So you're not limited to just the CLI. But personally, I don't see the CLI as a limitation.

> the goddamn CLI mandates live cloud credentials, but then stright-up never uses them to check a goddamn thing it intends to do to my cloud control plane

All IaC requires live cloud credentials. It would be impossible for them to work without live credentials ;)

Terraform does do a lot of checking. I do agree there is a lot that the plan misses though. That's definitely frustrating. But it's a side effect of cloud vendors having arbitrary conditions that are hard to define and forever changing. You run into the same problem with any tool you'd use to provision. Heck, even manually deploying stuff from the web console sometimes takes a couple of tweaks to get right.

> 1. related to that, having a state file that believes it knows what exists in the world is just ludicrous and pain made manifest

This is a very strange complaint. Having a state file is the bare minimum any IaC NEEDS for it to be considered a viable option. If you don't like IaC tracking state then you're really little better off than managing resources manually.

> a tool that thinks nuking things is an appropriate fix ... whew.

This is grossly unfair. Terraform only destroys resources when:

1. you remove those resources from the source. Which is sensible because you're telling Terraform you no longer want those resources

2. when you make a change that AWS doesn't support doing on live resources. Thus the limitation isn't Terraform, it is AWS

In either scenario, the destroy is explicit in the plan and expected behaviour.

mdaniel · 2025-08-21T14:40:42 1755787242

> CloudFormation is the only IaC that supports "running as a URL"

Incorrect, ARM does too, they even have a much nicer icon for one click "Deploy to Azure" <https://learn.microsoft.com/en-us/azure/azure-resource-manag...> and as a concrete example (or whole repo of them): <https://github.com/Azure/azure-quickstart-templates/tree/2db...>

> All IaC requires live cloud credentials. It would be impossible for them to work without live credentials ;)

Did you read the rest of the sentence? I said it's the worst of both worlds: I can't run "plan" without live creds, but then it doesn't use them to check jack shit. Also, to circle back to our CF and Bicep discussion, no, I don't need cloud creds to write code for those stacks - I need only creds to apply them

I don't need a state file for CF nor Bicep. Mysterious about that, huh?

hnlmorg · 2025-08-21T18:57:36 1755802656

> Incorrect, ARM does too, they even have a much nicer icon for one click "Deploy to Azure"

That’s Azure, not AWS. My point was to have “one click” HTTP installs you need native integration with the cloud vendor. For Azure it’s the clusterfuck that is Bicep. For AWS it’s the clusterfuck that is CF

> I don't need a state file for CF nor Bicep.

CF does have a state file, it’s just hidden from view.

And bicep is shit precisely because it doesn’t track state. In fact the lack of a state file is the main complain against bicep and thus the biggest thing holding it back from wider adoption — despite being endorsed by Microsoft Azure.

SvenL · 2025-08-21T04:50:50 1755751850

Amen, and I would add to that list “no, just because you use terraform doesn’t mean you can simply switch between cloud providers”.

hnlmorg · 2025-08-21T08:29:24 1755764964

Is there any IaC solutions where you can “simply switch between cloud providers”?

This isn’t a limitation of TF, it’s an intended consequence of cloud vendor lock in

mdaniel · 2025-08-21T14:55:03 1755788103

I believe the usual uninformed thinking is "terraform exists outside of AWS, so I can move off of AWS" versus "we have used CF or Bicep, now we're stuck" kind of deal

Which is to say both of you are correct, but OP was highlighting the improper expectations of "if we write in TF, sure it sucks balls but we can then just pivot to $other_cloud" not realizing it's untrue and now you've used a rusty paintbrush as a screwdriver

hnlmorg · 2025-08-21T19:53:36 1755806016

I don’t think that expectation exists with anyone with even the slightest understanding of IaC and systems.

But maybe I’ve just been blessed to work with people who aren’t complete idiots?

stogot · 2025-08-21T02:22:00 1755742920

Isn’t that what CDK was for?

SteveNuts · 2025-08-20T19:12:01 1755717121

Last time I tried to use CF, the third party IAC tools were faster to release new features than the functionality of CF itself. (Like Terraform would support some S3 bucket feature when creating a bucket, but CF did not).

I'm not sure if that's changed recently, I've stopped using it.

tkjef · 2025-08-21T14:09:30 1755785370

I have been on the terraform side for 7 years-ish.

eksctl just really impressed me with its eks management, specifically managed node groups & cluster add-ons, over terraform.

that uses cloudformation under the hood. so i gave it a try, and it’s awesome. combine with github actions and you have your IAC automation.

nice web interface for others to check stacks status, events for debugging and associated resources that were created.

oh, ever destroy some legacy complex (or not that complex) aws shit in terraform? it’s not going to be smooth. site to site connections, network interfaces, subnets, peering connections, associated resources… oh, my.

so far cloudformation has been good at destroying, but i haven’t tested that with massive legacy infra yet.

but i am happily converted tf>cf.

and will happily use both alongside each other as needed.

dragonwriter · 2025-08-21T00:53:44 1755737624

Because its an old early IaC language, but it works and lots depends on it, so instead of dumping or retooling it, AWS keeps it around as a compilation target, while pushing other solutions (years ago, the SAM transform on top of it, more recently CDK) as the main thing for people to actually use directly.

baby_souffle · 2025-08-21T00:48:14 1755737294

> Yeah holy crap why is cloud formation so terrible?

I can't confirm it, but I suspect that it was always meant to be a sales tool.

Every AWS announcement blog has a "just copy this JSON blob, and paste it $here to get your own copy of the toy demo we used to demonstrate in this announcement blog" vibe to it.

damieng · 2025-08-20T21:36:32 1755725792

I'd argue putting CloudFront on top of S3 is less complex than getting the permissions and static sharing setup right on S3 itself.

hnlmorg · 2025-08-21T08:12:36 1755763956

I do get where you're coming from, but I don't agree. With the CF+S3 combo you now need to choose which sharing mode to work with S3 (there are several different ways you can link CF to S3). Then you have the wider configuration of CF to manage too. And that's before you account for any caching issues you might run into when debugging your site.

If you know what you're doing, as it sounds like you and I do, then all of this is very easy to get set up (but then aren't most things easy when you already know how? hehe). However we are talking about people who aren't comfortable with vanilla S3, so throwing another service into the mix isn't going to make things easier for them.

tayo42 · 2025-08-20T20:14:07 1755720847

>S3 is the right choice for causal users.

It's so simple for storing and serving a static website.

Are there good and cheap alternatives?

MaKey · 2025-08-20T20:27:21 1755721641

Yeah, your classic web hoster. Just today I uploaded a static website to one via FTP.

fodkodrasz · 2025-08-20T20:49:06 1755722946

Really? If I remember correctly: My Static website served from S3 + CF + R53 by about 0.67$ / mo, 0.5 being R53 from that, 0.16 being CF, S3 being 0.01 for my page.

BTW: Is GitHub Page still free for custom domains? (I don't know the EULA)

daydream · 2025-08-20T22:22:14 1755728534

GitHub Pages are still free but commercial websites are forbidden.

crinkly · 2025-08-20T18:49:03 1755715743

It's actually incredibly cheap. I think our software distribution costs, in the account I run, are around $2.00 a month. That's pushing out several thousand MSI packages a day.

hnlmorg · 2025-08-21T08:19:50 1755764390

S3 is actually quite expensive compared to the competition for both storage costs and egress costs. At a previous start-up, we had terrabytes of data on S3 and it was our second largest cost (after GPUs) and by some margin.

For small scale stuff, S3s storage and egress charges are unlikely to be impactful. But it doesn’t mean they’re cheap relative to the competition.

There are also ways you can reduce S3 costs, but then you're trading the costs received from AWS with the costs of hiring competent DevOps. Either way, you pay.

oblio · 2025-08-20T21:56:16 1755726976

With CloudFront?

dcminter · 2025-08-21T11:50:07 1755777007

Not always that simple - for example if you want to automatically load /foo/index.html when the browser requests /foo/ you'll need to either use the web serving feature of S3 (bucket can't be private) or set up some lambda at edge or similar fiddly shenanigans.

herpderperator · 2025-08-20T19:24:58 1755717898

For the sake of understanding, can you explain why putting CloudFront in front of the buckets helps?

bhattisatish · 2025-08-20T20:32:59 1755721979

Cloudfront allows you to map your S3 with both

- signed url's in case you want a session base files download

- default public files, for e.g. a static site.

You can also map a domain (sub-domain) to Cloudfront with a CNAME record and serve the files via your own domain.

Cloudfront distributions are also CDN based. This way you serve files local to the users location, thus increasing the speed of your site.

For lower to mid range traffic, cloudfront with s3 is cheaper as the network cost of cloudfront is cheaper. But for large network traffic, cloudfront cost can balloon very fast. But in those scenarios S3 costs are prohibitive too!

SOLAR_FIELDS · 2025-08-20T16:19:08 1755706748

I honestly don't mind that you have to jump through hurdles to make your bucket publically available and that it's annoying. That to me seems like a feature, not a bug

dghlsakjg · 2025-08-20T16:39:43 1755707983

I think the OPs objection is not that hurdles exist but that they move them every time you try and run the track.

simonw · 2025-08-20T16:36:50 1755707810

Sure... but last time I needed to jump through those hurdles I lost nearly an hour to them!

I'm still not sure I know how to do it if I need to again.

viccis · 2025-08-20T20:29:21 1755721761

>In EC2, you can now change security groups and IAM roles without shutting the instance down to do it.

Hasn't it been this way for many years?

>Spot instances used to be much more of a bidding war / marketplace.

Yeah because there's no bidding any more at all, which is great because you don't get those super high spikes as availability drops and only the ones who bid super high to ensure they wouldn't be priced out are able to get them.

>You don’t have to randomize the first part of your object keys to ensure they get spread around and avoid hotspots.

This one was a nightmare and it took ages to convince some of my more pig headed coworkers in the past that they didn't need to do it any more. The funniest part is that they were storing their data as millions and millions of 10-100kb files, so the S3 backend scaling wasn't the thing bottlenecking performance anyway!

>Originally Lambda had a 5 minute timeout and didn’t support container images. Now you can run them for up to 15 minutes, use Docker images, use shared storage with EFS, give them up to 10GB of RAM (for which CPU scales accordingly and invisibly), and give /tmp up to 10GB of storage instead of just half a gig.

This was/is killer. It used to be such a pain to have to manage pyarrow's package size if I wanted a Python Lambda function that used it. One thing I'll add that took me an embarrassingly long time to realize is that your Python global scope is actually persisted, not just the /tmp directory.

indigodaddy · 2025-08-21T15:57:25 1755791845

Re: SG, yeah I wasnt doing any cloud stuff when that was the case. Never had to restart anything for an SG change and this must be at least 5-6 years..

jp57 · 2025-08-20T21:09:49 1755724189

> Glacier restores are also no longer painfully slow.

I had a theory (based on no evidence I'm aware of except knowing how Amazon operates) that the original Glacier service operated out of an Amazon fulfillment center somewhere. When you put it a request for your data, a picker would go to a shelf, pick up some removable media, take it back, and slot it into a drive in a rack.

This, BTW, is how tape backups on timesharing machines used to work once upon a time. You'd put in a request for a tape and the operator in the machine room would have to go get it from a shelf and mount it on the tape drive.

danudey · 2025-08-20T22:47:06 1755730026

The most likely explanation is that they used a tape robot, such as the one seen here:

https://www.reddit.com/r/DataHoarder/comments/12um0ga/the_ro...

Which is basically exactly what you described but the picker is a robot.

Data requests go into a queue; when your request comes up, the robot looks up the data you requested, finds the tape and the offset, fetches the tape and inserts it into the drive, fast-forwards it to the offset, reads the file to temporary storage, rewinds the tape, ejects it, and puts it back. The latency of offline storage is in fetching/replacing the casette and in forwarding/rewinding the tape, plus waiting for an available drive.

Realistically, the systems probably fetch the next request from the queue, look up the tape it's on, and then process every request from that tape so they're not swapping the same tape in and out twenty times for twenty requests.

philistine · 2025-08-21T00:21:44 1755735704

I've read very definitive discussions on here that Glacier never used tape. It has always been powered off hard disks.

UltraSane · 2025-08-21T01:05:15 1755738315

For truly write once read never data tape is the optimal storage method. It is exactly what the LTO standard was designed to do and it does it very well. You can be confident that you will be able to read every bit of data from a 30 year old tape, probably even 50 years old. It has the lowest bit error rate of any technology I am aware of. LTO-9 is better than 1 uncorrectable bit error in 10^20 user bits, which is 1 bit error in 12.5 exabytes. There is also the substantial advantage that tapes on a shelf are completely immune to ransomware. As a sysadmin I get that warm fuzzy feeling when critical data is backed up on a good LTO tape library.

dabiged · 2025-08-21T01:15:45 1755738945

As someone who does tape recovery on very very old tape I largely concur with this with a couple of caveats.

1. Do not encrypt your tapes if you want the data back in 30/50 years. We have had so many companies lose encryption keys and turn their tapes into paperweights because the company they bought out 17 years ago had poor key management.

2. The typical failure case on tape is physical damage not bit errors. This can be via blunt force trauma (i.e. dropping, or sometimes crushing) or via poor storage (i.e. mould/mildew).

3. Not all tape formats are created equal. I have seen far higher failure rates on tape formats that are repeatedly accessed, updated, ejected, than your old style write once, read none pattern.

count · 2025-08-21T01:11:59 1755738719

Call it bad luck, but I’ve never had a fully successful restore. Drives eat tapes, drives are damaged and write bad data, robot arms die or malfunction. Tapes have NEVER worked for me. SANs and remote disk though, rock solid.

That said, I don’t miss any of that stuff, gimme S3 any day :)

UltraSane · 2025-08-21T06:17:08 1755757028

You do realized that that isn't normal at all? LTO tape is still used by thousands of companies to backup many exabytes of data. I know it once saved Google from permanent loss of gmail data from a bug. You should really get a refund for your tape drives.

meepmorp · 2025-08-21T04:09:44 1755749384

Aren't LTO formats only backward compatible with the immediate prior version?

UltraSane · 2025-08-21T06:20:18 1755757218

They can write to one version back and read two version back. for really long term data storage you have to also store the read/write hardware.

Dylan16807 · 2025-08-21T06:54:57 1755759297

By the time it's hard to get a compatible LTO drive, I'd be very suspicious of a mothballed drive working either. If you want reliable long term storage you're going to have to update it every couple decades.

Twirrim · 2025-08-21T00:22:32 1755735752

I can't talk about it, but I've yet to see an accurate guess at how Glacier was originally designed. I think I'm in safe territory to say Glacier operated out of the same data centers as every other AWS service.

It's been a long time, and features launched since I left make clear some changes have happened, but I'll still tread a little carefully (though no one probably cares there anymore):

One of the most crucial things to do in all walks of engineering and product management is to learn how to manage the customer expectations. If you say customers can only upload 10 images, and then allow them to upload 12, they will come to expect that you will always let them upload 12. Sometimes it's really valuable to manage expectations so that you give yourself space for future changes that you may want to make. It's a lot easier to go from supporting 10 images to 20, than the reverse.

donavanm · 2025-08-21T07:05:05 1755759905

Im like 90% sure ive seen folks (unofficially) disclose the original storage and API decisions over the years, in roughly accurate terms. Personally I think the multi dimensional striping/erasure code ideas are way more interesting than the “its just a tape library” speculation/arguments. That and the real lessons learned around product differentiation as supporting technologies converge.

kelnos · 2025-08-21T03:54:42 1755748482

> I can't talk about it, but I've yet to see an accurate guess at how Glacier was originally designed.

It feels odd that this is some sort of secret. Why can't you talk about it?

Twirrim · 2025-08-21T04:01:18 1755748878

I signed NDAs. I wish Glacier was more open about their history, because it's honestly interesting, and they have a number of notable innovations in how they approach things.

Dylan16807 · 2025-08-21T07:02:11 1755759731

Well assuming your NDA is a reasonable length I hope you talk about it later.

(And if Amazon is making unreasonable length NDAs I hope they lose a lot of money over it.)

mh- · 2025-08-21T02:47:45 1755744465

..oh. That's clever. Thanks for posting this.

jp57 · 2025-08-21T21:50:37 1755813037

I think folks have missed what I think would have been clever about the implentation I (apparently) dreamt up. It's not that "it's just a tape library", it's that it would have used the existing FC and picker infrastructure that Amazon had already built, with some racks containing drives for removable media. I was thinking that it would not have been some special facility purely for Glacier, but rather one or more regular FCs would just have had some shelves with Glacier media (not necessarily tapes).

Then the existing pickers would get special instructions on their handhelds: Go get item number NNNN from Row/shelf/bin X/Y/Z and take it to [machine-M] and slot it in, etc.

browningstreet · 2025-08-20T21:27:51 1755725271

Yeah, but they've been robotic for decades since.

christina97 · 2025-08-20T23:14:07 1755731647

They would definitely be using rubies robots given how uniform hard drives are. The only reason warehouses still have humans is that heterogeneity (different sizes, different textures, different squishiness, etc).

general1726 · 2025-08-20T19:17:15 1755717435

I think there is more of us who kind of degenerated from doing it the AWS way - API Gateway, serverless lambdas mess around with IAM roles until it works, ... - to - Give me EC2 / LightSail VPS instance maybe an S3 bucket let's set domain through Route53 and go away with the rest of your orchestrion AWS.

no_wizard · 2025-08-20T22:17:21 1755728241

At what point is AWS worth using over other compute competitors when you’re using them as a storage bucket + VPS. They’re wholly more expensive at that point. Why not go with a more traditional but rock solid VPS provider?

I have the opposite philosophy for what it’s worth: if we are going to pay for AWS I want to use it correctly, but maximally. So for instance if I can offload N thing to Amazon and it’s appropriate to do so, it’s preferable. Step Functions, lambda, DynamoDB etc, over time, have come to supplant their alternatives and its overall more efficient and cost effective.

That said, I strongly believe developers don’t do enough consideration as to how to maximize vendor usage in an optimal way

benterix · 2025-08-21T10:09:12 1755770952

> That said, I strongly believe developers don’t do enough consideration as to how to maximize vendor usage in an optimal way

Because it's not straightforward. 1) You need to have general knowledge of AWS services and their strong and weak points to be able to choose the optimal one for the task, 2) you need to have good knowledge of the chosen service (like DynamoDB or Step Functions) to be able to use it optimally; being mediocre at it is often not enough, 3) local testing is often a challenge or plain impossible, you often have to do all testing on a dev account on AWS infra.

nitwit005 · 2025-08-20T22:43:04 1755729784

Your management will frequently be strangely happier to waste money on AWS, unfortunately.

Truly a marketing success.

wiether · 2025-08-21T06:31:35 1755757895

I agree that using them as a VPS provider is a mistake.

If you don't use the E(lasticity) of EC2, you're burning cash.

For prod workloads, if you can go from 1 to 10 instances during an average day, that's interesting. If you have 3 instances running 24/7/365, go somewhere else.

For dev workloads, being able to spin instances in a matter of seconds is a bliss. I installed the wrong version of a package on my instance? I just terminate it, wait for the auto-scaling group to pop a fresh new one a start again. No need to waste my time trying to clean my mess on the previous instance.

You speak about Step Functions as an efficient and cost effective service from AWS, and I must admit that it's one that I avoid as much as I can... Given the absolute mess that it is to setup/maintain, and that you completely lock yourself in AWS with this, I never pick it to do anything. I'd rather have a containerized workflow engine running on ECS, even though I miss on the few nice features that SF offers within AWS.

The approach I try to have is:

- business logic should be cloud agnostic

- infra should swallow all the provider's pills it needs to be as efficient as possible

no_wizard · 2025-08-21T13:22:39 1755782559

>business logic should be cloud agnostic

In practice I found this to be more burden than it’s worth. I have yet to work somewhere that is on Azure, GCP or AWS and actually switch between clouds. I am sure it happens, but is it really that common?

I instead think of these platforms as a marriage, you’re going to settle in one and do your best to never divorce

wiether · 2025-08-21T18:37:48 1755801468

Part of my job is to do migrations for customers, so, to me at least, it's not uncommon.

Using all the bells and whistles of a provider and being locked-in is one thing. But the other big issue is that, as service providers, they can (and some of them did more often than not) stop providing some services or changing them in a way that forces you to make big changes in your app to keep it running on this service.

Whereas, if you build your app in a agnostic way, they can stop or change what they want, you either don't rely on those services heavily enough for the changes required to be huge, or you can just deploy elsewhere, with another flavor of the same service.

Let's say you have a legacy Java app that works only with a library that is not maintained. If you don't want to bear the cost of rewriting with a new and maintained library, you can keep the app running, knowing the risks and taking the necessary steps to protect you against it.

Whereas if your app relies heavily on DynamoDB's API and they decide to drop the service completely, the only way to keep the app running is to rewrite everything for a similar service, or to find a service relying on the same API elsewhere.

austinshea · 2025-08-21T02:34:06 1755743646

Most work isn’t greenfield.

AWS can be used in a different, cost effective, way.

It can be used as a middle-ground capable of serving the existing business, while building towards a cloud agnostic future.

The good AWS services (s3, ec2, acm, ssm, r53, RDS, metadata, IAM, and E/A/NLBs) are actually good, even if they are a concern in terms of tracking their billing changes.

If you architect with these primitives, you are not beholden to any cloud provider, and can cut over traffic to a non AWS provider as soon as you’re done with your work.

mdaniel · 2025-08-21T15:05:39 1755788739

Of that list, watch out since IAM != IAM != IAM, so "cloud agnostic" is that famous 80/20 split

Spivak · 2025-08-21T00:00:18 1755734418

Because the compartmentalization of business duties means that devs are fighting uphill against the wind to sign a deal with a new vendor for something. It's business bikeshedding, as soon as you open the door to a new vendor everyone, especially finance, has opinions and you might end up stuck with a vendor you didn't want. Or you can use the pre-approved money furnace and just ship.

calmbonsai · 2025-08-20T19:22:49 1755717769

There are entire industries that have largely de-volved their clouds primarily for footprint flexibility (not all AWS services are in all regions) and billing consistency.

regularfry · 2025-08-20T20:19:27 1755721167

Honestly just having to manage IAM is such a time-suck that the way I've explained it to people is that we've traded the time we used to spend administering systems for time spent just managing permissions, and IAM is so obtuse that it comes out as a net loss.

There's a sweet spot somewhere in between raw VPSes and insanely detailed least-privilege serverless setups that I'm trying to revert to. Fargate isn't unmanageable as a candidate, not sure it's The One yet but I'm going to try moving more workloads to it to find out.

PaulDavisThe1st · 2025-08-20T20:38:30 1755722310

Can no longer login to my AWS account, because I never set up MFA.

Want to set up MFA ... login required to request device.

Yes, I know, they warned us far ahead of time. But not being able to request one of their MFA devices without a login is ... sucky.

austinshea · 2025-08-21T02:25:37 1755743137

Talk to support: https://support.aws.amazon.com/#/contacts/one-support?formId...

PaulDavisThe1st · 2025-08-21T04:51:11 1755751871

I did. That's just an AI, which says this:

> I understand your situation is a bit unique, where you are unable to log in to your AWS account without an MFA device, but you also can't order an MFA device without being able to log in. This is a scenario that is not directly covered in our standard operating procedures.

The best course of action would be for you to contact AWS Support directly. They will be able to review your specific case and provide guidance on how to obtain an MFA device to regain access to your account. The support team may have alternative options or processes they can walk you through to resolve this issue.

Please submit a support request, and one of our agents will be happy to assist you further. You can access the support request form here: https://console.aws.amazon.com/support/home

That last URL? You need to login to use it ...

berlesi · 2025-08-20T21:23:12 1755724992

Looks like something that you could solve easily through their support, no?

raffraffraff · 2025-08-20T21:37:34 1755725854

Support don't talk to you unless you pay for support

danudey · 2025-08-20T22:49:54 1755730194

Easy, just log into your account and pay them for support.

austinshea · 2025-08-21T02:25:18 1755743118

Yes, they do.

https://support.aws.amazon.com/#/contacts/one-support?formId...

SOLAR_FIELDS · 2025-08-20T16:15:03 1755706503

You know what's still stupid? That if you have an S3 bucket in the same region as your VPC that you will get billed on your NAT Gateway to send data out to the public internet and right back in to the same datacenter. There is simply no reason to not default that behavior to opt out vs opt in (via a VPC endpoint) beyond AWS profiting off of people's lack of knowledge in this realm. The amount of people who would want the current opt-in behavior is... if not zero, infinitesimally small.

solatic · 2025-08-20T17:23:14 1755710594

It's a design that is secure by default. If you have no NAT gateway and no VPC Gateway Endpoint for S3 (and no other means of Internet egress) then workloads cannot access S3. Networking should be closed by default, and it is. If the user sets up things they don't understand (like NAT gateways), that's on them. Managed NAT gateways are not the only option for Internet egress and users are responsible for the networks they build on top of AWS's primitives (and yes, it is indeed important to remember that they are primitives, this is an IaaS, not a PaaS).

bilalq · 2025-08-20T17:32:30 1755711150

Fine for when you have no NAT gateway and have a subnet with truly no egress allowed. But if you're adding a NAT gateway, it's crazy that you need to setup the gateway endpoint for S3/DDB separately. And even crazier that you have to pay for private links per AWS service endpoint.

solatic · 2025-08-20T19:33:52 1755718432

There's very real differences between NAT gateways and VPC Gateway Endpoints.

NAT gateways are not purely hands-off, you can attach additional IP addresses to NAT gateways to help them scale to supporting more instances behind the NAT gateway, which is a fundamental part of how NAT gateways work in network architectures, because of the limit on the number of ports that can be opened through a single IP address. When you use a VPC Gateway Endpoint then it doesn't use up ports or IP addresses attached to a NAT gateway at all. And what about metering? If you pay per GB for traffic passing through the NAT gateway, but I guess not for traffic to an implicit built-in S3 gateway, so do you expect AWS to show you different meters for billed and not-billed traffic, but performance still depends on the sum total of the traffic (S3 and Internet egress) passing through it? How is that not confusing?

It's also besides the point that not all NAT gateways are used for Internet egress, indeed there are many enterprise networks where there are nested layers of private networks where NAT gateways help deal with overlapping private IP CIDR ranges. In such cases, having some kind of implicit built-in S3 gateway violates assumptions about how network traffic is controlled and routed, since the assumption is for the traffic to be completely private. So even if it was supported, it would need to be disabled by default (for secure defaults), and you're right back at the equivalent situation you have today, where the VPC Gateway Endpoint is a separate resource to be configured.

Not to mention that VPC Gateway Endpoints allow you to define policy on the gateway describing what may pass through, e.g. permitting read-only traffic through the endpoint but not writes. Not sure how you expect that to work with NAT gateways. This is something that AWS and Azure have very similar implementatoons for that work really well, whereas GCP only permits configuring such controls at the Organization level (!)

They are just completely different networking tools for completely different purposes. I expect closed-by-default secure defaults. I expect AWS to expose the power of different networking implements to me because these are low-level building blocks. Because they are low-level building blocks, I expect for there to be footguns and for the user to be held responsible for correct configuration.

bilalq · 2025-08-20T21:14:17 1755724457

My objections here are in terms of how this manifests in billing. Especially when you consider the highway robbery rates for internet egress.

solatic · 2025-08-21T06:34:08 1755758048

Again, you are dealing with low-level primitives. You can provision an EC2 VM with multiple GPUs at high cost and use it to host nginx. That is not a correct configuration. There are much cheaper ways available to you. It's ridiculous to imply that AWS shouldn't send you a higher bill because you didn't use the GPUs or that AWS shouldn't offer instances with GPUs because they are more expensive. You, the user, are responsible for building a correct configuration with the low-level primitives that have been made available to you! If it's too much then feel free to move up the stack and host your workloads on a PaaS instead.

Dylan16807 · 2025-08-21T07:12:50 1755760370

It being low level is not an excuse for systems that lead people down the wrong path.

And the traffic never even reaches the public internet. There's a mismatch between what the billing is supposedly for and what it's actually applied to.

> do you expect AWS to show you different meters for billed and not-billed traffic, but performance still depends on the sum total of the traffic (S3 and Internet egress) passing through it?

Yes.

> How is that not confusing?

That's how network ports work. They only go so fast, and you can be charged based on destination. I don't see the issue.

> It's also besides the point that not all NAT gateways are used for Internet egress

Okay, if two NAT gateways talk to each other it also should not have egress fees.

> some kind of implicit built-in S3 gateway violates assumptions

So don't do that. Checking if the traffic will leave the datacenter doesn't need such a thing.

otterley · 2025-08-20T17:20:24 1755710424

This is the intended use case for S3 VPC Gateway Endpoints, which are free of charge.

https://docs.aws.amazon.com/vpc/latest/privatelink/vpc-endpo...

(Disclaimer: I work for AWS, opinions are my own.)

darkwater · 2025-08-20T17:36:18 1755711378

I think they know it. They are complaining it's not enabled by default (and so do I).

otterley · 2025-08-20T18:14:27 1755713667

AWS VPCs are secure by default, which means no traffic traverses their boundaries unless you intentionally enable it.

There are many IaC libraries, including the standard CloudFormation VPC template and CDK VPC class, that can create them automatically if you so choose. I suspect the same is also true of commonly-used Terraform templates.

conradludgate · 2025-08-21T05:06:43 1755752803

I've been testing our PrivateLink connectivity at work in the past few weeks. This means I've been creating and destroying a bunch of VPCs to test the functionality. The flow in the AWS console when you select the "VPC and more" wizard does have an S3 Gateway enabled by default

hylaride · 2025-08-20T19:52:32 1755719552

As others have pointed out, this is by design. If VPCs have access to AWS resources (such as S3, DynamoDB, etc), an otherwise locked down VPC can still have data leaks to those services, including to other AWS accounts.

It's a convenience VS security argument, though the documentation could be better (including via AWS recommended settings if it sees you using S3).

SOLAR_FIELDS · 2025-08-20T17:36:14 1755711374

The problem is that the default behavior for this is opt-in, rather than opt-out. No one prefers opt-in. So why is it opt-in?

icedchai · 2025-08-20T19:54:07 1755719647

If it were opt-out someone would accidentally leave it on and eventually realize that entire systems had been accidentally "backed up" and exfiltrated to S3.

SOLAR_FIELDS · 2025-08-20T20:00:52 1755720052

What? The same is possible whether it's opt-in or opt-out. It's just that if you have the gateway as opt-out you wouldn't also have this problem AND a massive AWS bill. You would just have this problem.

icedchai · 2025-08-20T22:32:07 1755729127

No, with opt-in the VPC subnet is secure by default. Someone has to explicitly allow access to S3 (or anything else.)

Spivak · 2025-08-20T20:29:03 1755721743

The bad situation is if you created a VPC with no internet access but the hypothetical automatic VPC endpoint still let instances access S3. Then a compromised instance has a vector for data exfiltration.

otterley · 2025-08-20T18:27:12 1755714432

AWS VPCs are secure by default, which means no traffic traverses their boundaries unless you intentionally enable it.

SOLAR_FIELDS · 2025-08-20T19:51:48 1755719508

"The door is locked, so instead of suggesting to the end user that they should unlock the door with this key that we know how to give the end user deterministically, we instead tell them to drive across town and back on our toll roads and collect money from it"

This has been a common gotcha for over a decade now: https://www.lastweekinaws.com/blog/the-aws-managed-nat-gatew...

otterley · 2025-08-20T21:19:26 1755724766

Speaking solely on my own behalf: I don't know a single person at AWS (and I know a lot of them) who wants to mislead customers into spending more money than they need to. I remember a time before Gateway Endpoints existed, and customers (including me at the time) were spending tons of money passing traffic through pricey NAT Gateways to S3. S3 Gateway Endpoints saved them money.

SOLAR_FIELDS · 2025-08-20T21:53:31 1755726811

Clearly you guys are aware of the problem though. I mean, every time this thing happens there's probably a ticket. I've personally filed one myself years ago when it happened to me. So why has the behavior not changed? You don't have to give up security to remove this footgun, it's possible to remove it and still make it an opt-in action for security purposes.

hinkley · 2025-08-20T18:16:36 1755713796

Your job depends upon you misunderstanding the problem.

afandian · 2025-08-20T16:24:42 1755707082

Having experienced the joy of setting up VPC, subnets and PrivateLink endpoints the whole thing just seems absurd.

They spent the effort of branding private VPC endpoints "PrivateLink". Maybe it took some engineering effort on their part, but it should be the default out of the box, and an entirely unremarkable feature.

In fact, I think if you have private subnets, the only way to use S3 etc is Private Link (correct me if I'm wrong).

It's just baffling.

time0ut · 2025-08-20T17:22:46 1755710566

You can provision gateway endpoints for S3 and DynamoDB. They are free and considered best practice. They are opt-in though, but easy to enable.

mdaniel · 2025-08-20T21:30:55 1755725455

And ECR, which I would guess impacts more folks than DynamoDB https://docs.aws.amazon.com/AmazonECR/latest/userguide/vpc-e...

And, as as added benefit, they distinguish between "just pull" and "pull and push" which is nice

afandian · 2025-08-21T08:20:51 1755764451

True, I forgot that. But depending on services you still have to have some Gateway and some Interface endpoints.

dmart · 2025-08-20T16:28:37 1755707317

VPC endpoints in general should be free and enabled by default. That you need to pay extra to reach AWS' own API endpoints from your VPC feels egregious.

otterley · 2025-08-20T17:22:19 1755710539

Gateway endpoints are free. Network endpoints (which are basically AWS-managed ENIs that can tunnel through VPC boundaries) are not free.

S3 can use either, and we recommend establishing VPC Gateway endpoints by default whenever you need S3 access.

(Disclaimer: I work for AWS, opinions are my own.)

Hikikomori · 2025-08-20T18:00:30 1755712830

Why don't you have gateway endpoints for all your APIs?

donavanm · 2025-08-21T07:27:03 1755761223

The original private endpoints implementation required meaningful work from the service teams (ec2 networking, s3, & ddb). It also changed how the "front end" API servers handled requests and how their infrastructure was deployed (at the time?). The newer LB/ENI style privatelink abstracts away _most_ of that "per service" implementation effort at the cost of more per-request/connection work fromthe virtual network. Hence why theres more support from other services, and it includes a cost.

count · 2025-08-21T01:20:37 1755739237

The service teams don’t talk to each other…

mdaniel · 2025-08-21T04:36:03 1755750963

I think that is by design https://konghq.com/blog/enterprise/api-mandate

tux3 · 2025-08-20T16:25:32 1755707132

That is price segmentation. People who are price insensitive will not invest the time to fix it

People who are probably shouldn't be on aws - but they usually have to for unrelated reasons, and they will work to reduce their bill.

nbngeorcjhe · 2025-08-20T16:55:56 1755708956

> People who are price insensitive will not invest the time to fix it

This just sounds like a polite way of saying "we're taking peoples' money in exchange for nothing of value, and we can get away with it because they don't know any better".

robertlagrant · 2025-08-20T17:31:40 1755711100

It's more like: we made loads of stuff super cheap but here's where we make some money because it scales with use.

amenghra · 2025-08-20T17:14:22 1755710062

Price segmentation happens all the time in pretty much every industry.

hinkley · 2025-08-20T18:21:14 1755714074

There’s an entire Pandora’s box of shitty things that happen in pretty much every industry. I don’t think you want to use that defense.

happytoexplain · 2025-08-21T18:06:13 1755799573

Must be un-criticizable then?

happytoexplain · 2025-08-20T17:10:43 1755709843

>People who are price insensitive will not invest the time to fix it

Hideous.

kbolino · 2025-08-20T17:15:17 1755710117

The problem is that VPC endpoints aren't free.

They should be, of course, at least when the destination is an AWS service in the same region.

[edit: I'm speaking about interface endpoints, but S3 and DynamoDB can use gateway endpoints, which are free to the same region]

otterley · 2025-08-20T17:21:19 1755710479

Gateway endpoints are free. Network endpoints (which are basically AWS-managed ENIs that can tunnel through VPC boundaries) are not free.

S3 can use either, and we recommend establishing VPC Gateway endpoints by default whenever you need S3 access.

(Disclaimer: I work for AWS, opinions are my own.)

JoshTriplett · 2025-08-20T19:29:54 1755718194

That's fascinating! I hadn't found that in the documentation; everything seems to steer people towards PrivateLink, not gateway endpoints.

Would you recommend using VPC Gateway even on a public VPC that has an Internet gateway (note: not a NAT gateway)? Or only on a private VPC or one with a NAT gateway?

otterley · 2025-08-20T19:36:31 1755718591

I recommend S3 Gateways for all VPCs that need to access S3, even those that already have routes to the Internet. Plus they eliminate the need for NAT Gateway traversal for requests that originate from private subnets.

JoshTriplett · 2025-08-21T01:02:45 1755738165

> I recommend S3 Gateways for all VPCs that need to access S3, even those that already have routes to the Internet.

Fascinating. What's the advantage of doing that?

donavanm · 2025-08-21T07:22:28 1755760948

It's a much more direct/efficient connection from the EC2 instance to the S3 storage servers through the virtual network layer. It reduces the network path/length through the AWS network _and_ removes the number of virtual network functions/servers (ala "LB") that your connections will traverse.

JoshTriplett · 2025-08-21T17:51:22 1755798682

That's helpful to know, thank you! I'll take a look at that and see if it improves S3 performance.

paulddraper · 2025-08-20T21:14:16 1755724456

> everything seems to steer people towards PrivateLink, not gateway endpoints

Gateway endpoints only work for some things.

Hikikomori · 2025-08-20T21:38:59 1755725939

Privatelink endpoints can be of type gateway or interface. Only gateway is free and only S3 and dynamodb supports it.

kbolino · 2025-08-20T17:34:47 1755711287

Fair point, and valid for S3 (the topic at hand) and DynamoDB.

Other AWS services, though, don't support gateway endpoints.

mdaniel · 2025-08-20T21:37:38 1755725858

https://docs.aws.amazon.com/AmazonECR/latest/userguide/vpc-e...

~~I get the impression there are several others, too, but that one is of especial interest to me~~ Wowzers, they really are much better now:

  aws --region us-east-1 ec2 describe-vpc-endpoint-services | jq '.ServiceNames|length'
  459

If you're saying "other services should offer VPC Endpoints," I am 100% on-board. One should never have to traverse the Internet to contact any AWS control plane

watermelon0 · 2025-08-21T05:08:38 1755752918

Those are VPC endpoints, not gateway endpoints.

kbolino · 2025-08-21T12:52:31 1755780751

Both interface endpoints and gateway endpoints are also called VPC endpoints. The former get distinct IP addresses in your VPC subnets while the latter get distinct entries in your VPC routing tables. They are even created with the same API call: https://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_C...

paulddraper · 2025-08-20T17:17:28 1755710248

Well yeah that's the point....why route through the public internet.

kbolino · 2025-08-20T17:22:47 1755710567

I doubt the traffic ever actually leaves AWS. Assuming it does make it all the way out to their edge routers, the destination ASN will still be one of their own. Not that the pricing will reflect this, of course.

The other problem with (interface) VPC endpoints is that they eat up IP addresses. Every service/region permutation needs a separate IP address drawn from your subnets. Immaterial if you're using IPv6, but can be quite limiting if you're using IPv4.

immibis · 2025-08-20T22:12:09 1755727929

Sounds like a good reason to use IPv6.

kbolino · 2025-08-21T00:13:56 1755735236

There were still a couple of services/features that choked on IPv6 last time I looked (1.5-2 years ago) but it works with most things and they do seem to be making progress on the others.

paulddraper · 2025-08-21T04:09:52 1755749392

I don’t know any software that doesn’t work with IPv6 as of 2021.

Just some internet services that haven’t upgraded. (But fixed by NAT.)

kbolino · 2025-08-21T13:37:41 1755783461

There's a matrix provided by Amazon themselves listing the level of IPv6 support in their various services: https://docs.aws.amazon.com/vpc/latest/userguide/aws-ipv6-su...

torginus · 2025-08-20T19:44:45 1755719085

If you had an ALB inside the VPC that routed the requests to something that lives inside the VPC, which called the AWS PutObject api on the bucket, would that still be the case?

immibis · 2025-08-20T17:59:16 1755712756

A company making revenue is not stupid.

JCM9 · 2025-08-20T19:35:43 1755718543

Some good stuff here. I wish AWS would just focus on these boring, but ultimately important, things that they’re good at instead of all the current distractions trying to play catch up on “AI.” AWS leadership missed the boat there big time, but that’s OK.

Ultimately AWS doesn’t have the right leadership or talent to be good at GenAI, but they do (or at least used to) have decent core engineers. I’d like to see them get back to basics and focus there. Right now leadership seems panicked about GenAI and is just throwing random stuff at the wall desperately trying to get something to stick. Thats really annoying to customers.

anon7000 · 2025-08-21T08:53:16 1755766396

They continue to have large teams working on core stuff. It’s just that they’re working at such a low level (like high perf virtualized networking on their custom network cards) that most people don’t hear about it or care that much.

PartiallyTyped · 2025-08-21T05:18:28 1755753508

Leadership is looking to provide infrastructure for anyone to just pick a model and get on doing things without the hassle of setting things up.

raffraffraff · 2025-08-20T21:37:58 1755725878

> Availability Zones used to be randomized between accounts (my us-east-1a was your us-east-1c)

WTH?

zbentley · 2025-08-20T22:09:06 1755727746

It was for spreading load out. If someone was managing resources in a bunch of accounts and always defaulted to, say, 1b, AWS randomized what AZs corresponded to what datacenter segments to avoid hot spots.

The canonical AZ naming was provided because, I bet, they realized that the users who needed canonical AZ identifiers were rarely the same users that were causing hot spots via always picking the same AZ.

Twirrim · 2025-08-21T00:38:27 1755736707

Almost everyone went with 1a, every time. It causes significant issues for all sorts of reasons, especially considering the latency target for network connections between data centres in an AD

mpyne · 2025-08-20T22:03:59 1755727439

Presumably it would help ensure that everyone selecting us-east-1a in their base configs didn't actually all land in the same AZ.

skywhopper · 2025-08-20T21:47:18 1755726438

They did this to stop people from overloading us-east-1a.

It was fine, until there started to be ways of wiring up networks between accounts (eg PrivateLink endpoint services) and you had to figure out which AZ was which so you could be sure you were mapping to the the same AZs in each account.

I built a whole methodology for mapping this out across dozens of AWS accounts, and built lookup tables for our internal infrastructure… and then AWS added the zone ID to AZ metadata so that we could just look it up directly instead.

slashdev · 2025-08-20T21:43:53 1755726233

Yeah this one drove me crazy

aaronblohowiak · 2025-08-20T16:20:54 1755706854

>VPC peering used to be annoying; now there are better options like Transit Gateway, VPC sharing between accounts, resource sharing between accounts, and Cloud WAN.

TGW is... twice as expensive as vpc peering?

klysm · 2025-08-20T18:32:24 1755714744

VPC sharing is the sleeper here. You can do cross account networking all in the same VPC and skip all the expensive stuff.

aaronblohowiak · 2025-08-20T18:43:06 1755715386

as long as your VPCs aren't too big, yea.

klysm · 2025-08-21T02:56:14 1755744974

If you are exhausting an entire VPC I’d be pretty impressed!

aaronblohowiak · 2025-08-21T03:41:42 1755747702

Used to work at nflx, biiiiig headaches

Hikikomori · 2025-08-20T20:34:43 1755722083

Shared vpcs can get pretty big. Even if you approach the NAU limit you can use privatelink or TGW to have more large shared vpcs.

alFReD-NSH · 2025-08-20T16:47:03 1755708423

And vpc sharing is free. Cost and architecture are tied.

Hikikomori · 2025-08-20T16:49:15 1755708555

More than twice as same AZ is free with peering. But if you're big enough you can get better deals on cost.

But unlike peering TGW traffic flows through an additional compute layer so it has additional cost.

gurjeet · 2025-08-20T16:55:48 1755708948

It would've been nice if each of those claims in the article also linked to either the relevant announcement or to the documentation. If I'm interested in any of these headline items, I'd like to learn more.

mdaniel · 2025-08-20T21:42:57 1755726177

I don't believe AWS offers permalinks, so it would only help until they rolled over the next documentation release :-(

They actually used to have the upstream docs in GitHub, and that was super nice for giving permalinks but also building the docs locally in a non-pdf-single-file setup. Pour one out, I guess

chisleu · 2025-08-20T17:03:57 1755709437

> You don’t have to randomize the first part of your object keys to ensure they get spread around and avoid hotspots.

As of when? According to internal support, this is still required as of 1.5 years ago.

arpinum · 2025-08-21T01:42:35 1755740555

I think there is some nuance needed here. If you ask support to partition your bucket then they will be a bit annoying if you ask for specific partition points and the first part of the prefix is not randomised. They tried to push me to refactor the bucket first to randomise the beginning of the prefix, but eventually they did it.

The auto partitioning is different. It can isolate hot prefixes on its own and can intelligently pick the partition points. Problem is the process is slow and you can be throttled for more than a day before it kicks in.

laurent_du · 2025-08-20T19:19:41 1755717581

He's not talking about the prefix, just the beginning of the object key.

viccis · 2025-08-20T20:30:54 1755721854

The prefix is not separate from the object key. It's part of it. There's no randomization that needs to be done on either anymore.

csours · 2025-08-20T16:51:11 1755708671

Strictly off topic:

Everything you know is wrong.

Weird Al. https://www.youtube.com/watch?v=W8tRDv9fZ_c

Firesign Theatre. https://www.youtube.com/watch?v=dAcHfymgh4Y

albert_e · 2025-08-21T13:39:30 1755783570

A "Catch me up" on AWS (and for that matter other large platforms) would be very useful for many folks.

Ideally it should be a stream of important updates that can be interactively filtered by time-range. For example, if I have not been actively consuming AWS updates firehose for last 18 months, I should be able to "summarize" that length of updates.

Why this is not already a feature of "What's New" section of AWS and other platforms -- I dont know. Waiting to be built -- either by OEM or by the Community.

ash_091 · 2025-08-21T13:49:15 1755784155

I played a lot of DOTA2 in the past and I've often thought that big tech could learn something from Valve's patch notes. Especially in the context of process changes, stuff you should know, etc. Expecting folk to read a series of lengthy emails/blog posts to stay up to date is unrealistic.

beaviskhan · 2025-08-20T20:25:12 1755721512

Also S3 related: the bucket owner can now be configured as the object owner no matter where the object originated. In the past this was exceedingly painful if you wanted to allow one account contribute objects to a bucket in another account. You could do the initial contribution, but the contributor always owned the object, and you couldn't delegate access to a third account.

abullinan · 2025-08-21T01:54:35 1755741275

This article was a relief. I’m always a tiny bit worried Amazon will change some thing drastically and I’ll have to migrate. I’ve had an ec2 instance running since 2013. It requires effectively zero maintenance. So I am glad there were no surprises in this article. Thanks OP.

biimugan · 2025-08-21T00:46:03 1755737163

> You don’t have to randomize the first part of your object keys to ensure they get spread around and avoid hotspots.

From my understanding, I don't think this is completely accurate. But, to be fair, AWS doesn't really document this very well.

From my (informal) conversations with AWS engineers a few months ago, it works approximately like this (modulo some details I'm sure the engineers didn't really want to share):

S3 requests scale based on something called a 'partition'. Partitions form automatically based on the smallest common prefixes among objects in your bucket, and how many requests objects with that prefix receive. And the bucket starts out with a single partition.

So as an example, if you have a bucket with objects "2025-08-20/foo.txt" and "2025-08-19/foo.txt", the smallest common prefix is "2" (or maybe it considers the root as the generator partition, I don't actually know). (As a reminder, a / in an object key has no special significance in S3 -- it's just another character. There are no "sub-directories"). Therefore a partition forms based on that prefix. You start with a single partition.

Now if the object "2025-08-20/foo.txt" suddenly receives a ton of requests, what you'll see happen is S3 throttle those requests for approximately 30-60 minutes. That's the amount of time it takes for a new partition to form. In this case, the smallest common prefix for "2025-08-20/foo.txt" is "2025-08-2". So a 2nd partition forms for that prefix. (Again, the details here may not be fully accurate, but this is the example conveyed to me). Once the partition forms, you're good to go.

But the key issue here with the above situation is you have to wait for that warm up time. So if you have some workload generating or reading a ton of small objects, that workload may get throttled for a non-trivial amount of time until partitions can form. If the workload is sensitive to multi-minute latency, then that's basically an outage condition.

The way around this is that you can submit an AWS support ticket and have them pre-generate partitions for you before your workload actually goes live. Or you could simulate load to generate the partitions. But obviously, neither of these is ideal. Ideally, you should just really not try and store billions of tiny objects and expect unlimited scalability and no latency. For example, you could use some kind of caching layer in front of S3.

kentm · 2025-08-21T01:31:50 1755739910

Yep, this is still a thing. In the past year I’ve been throttled due to hot partitions. They’ve improved the partitioning so you hit it less, but if you scale too fast you will get limited.

Hit it when building an iceberg Lakehouse using pre existing data. Using object prefixes fixed the issue.

cmcarthur · 2025-08-21T00:58:28 1755737908

This is my understanding too, and this is particularly problematic for workloads that are read/write heavy on very recent data. When partitioning by a date or by an auto-incrementing id, you still run into the same issue.

Ex: your prefix is /id=12345. S3, under the hood, generates partitions named `/id=` and `/id=1`. Now, your id rolls over to `/id=20000`. All read/write activity on `/id=2xxxx` falls back to the original partition. Now, on rollover, you end up with read contention.

For any high-throughput workloads with unevenly distributed reads, you are best off using some element of randomness, or some evenly distributed partition key, at the root of your path.

ElijahLynn · 2025-08-20T21:21:27 1755724887

That. Was a decent investment of my time as a devops engineer. Right to the point. I learned things.