Separate from my sibling comment about AWS SSM, I also believe that if you can't tell a Node is sick from the metrics or log egress coming off it, that's a deployment bug. I'm firmly in the "Cattle" camp, and am getting closer and closer to the "Reverse Uptime" camp - made easier by ASGs' relatively new "Maximum Instance Lifetime" setting, which makes it basically one click to get on board that train.
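For reference, a minimal sketch of flipping that setting on with boto3; the ASG name and the 14-day lifetime are placeholder assumptions:

    # Cap instance age on an existing Auto Scaling group so nodes get
    # recycled automatically instead of accumulating uptime.
    import boto3

    autoscaling = boto3.client("autoscaling")

    autoscaling.update_auto_scaling_group(
        AutoScalingGroupName="my-app-asg",        # placeholder ASG name
        MaxInstanceLifetime=14 * 24 * 60 * 60,    # 14 days, in seconds (minimum is 1 day)
    )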
Even as I type all these answers out, I'm super cognizant that there's not one hammer for all nails, and I am for sure guilty of yanking Nodes out of the ASG to figure out what the hell has gone wrong with them. But I try very, very hard not to place my Nodes in a precarious situation to begin with, so that such extreme troubleshooting is a minor-severity incident and not Situation Normal.
If accidentally nuking a single node while debugging causes issues, you have bigger problems. Especially if you are running Kubernetes, any node should be able to fall off the earth at any time without issues.
I agree that you should set a maximum lifetime for a node on the order of a few weeks.
I also agree that you shouldn’t be giving randos access to production infra, but at the end of the day there need to be some people at the company who have the keys to the kingdom, because you don’t know what you don’t know, and you need to be able to deal with unexpected faults or outages in the telemetry and logging systems.
I once bootstrapped an entire datacenter with tens of thousands of nodes from an SSH terminal after an abrupt power failure. It turns out infrastructure has lots of circular dependencies, and we had to break them manually.
Passive metrics/logs won't let you debug all the issues. At some point you either need a system for taking automatic memory dumps and submitting BPF scripts to live nodes... or you need SSH access to do that yourself.
There's a thousand ways to do it without SSH. It can be built into the app itself. It can be a special authenticated route to a suid script. It can be built into the current orchestration system. It can be pull-based, using a queue for system monitoring commands. It can be part of the existing monitoring agent. It can be run through AWS SSM. There's really no reason it has to be SSH.
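As one concrete example, here's roughly what the SSM route looks like with boto3. The instance ID and commands are placeholders, and this assumes the SSM agent and IAM permissions are already in place:

    # Run an ad-hoc diagnostic command on an instance via SSM Run Command,
    # no SSH daemon or key distribution involved.
    import boto3

    ssm = boto3.client("ssm")

    response = ssm.send_command(
        InstanceIds=["i-0123456789abcdef0"],           # placeholder instance ID
        DocumentName="AWS-RunShellScript",             # stock SSM document
        Parameters={"commands": ["df -h", "uptime"]},  # placeholder diagnostics
        Comment="ad-hoc node health check",            # shows up in the audit trail
    )
    command_id = response["Command"]["CommandId"]

    # Fetch the output once the command has finished (in practice, poll or
    # use a waiter rather than calling this immediately).
    output = ssm.get_command_invocation(
        CommandId=command_id,
        InstanceId="i-0123456789abcdef0",
    )
    print(output["Status"], output["StandardOutputContent"])

Every invocation is recorded by SSM, which is what makes it friendlier to review than an opaque shell session.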
And even with SSH you can have special keys authorised to run only specific commands, so a service account would be better than a personal one in that case.
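That's the command= forced-command option in authorized_keys. A minimal sketch of installing such a restricted key for a service account; the account name, key, and allowed command are placeholder assumptions:

    # Append an authorized_keys entry that forces a single command and strips
    # the usual escape hatches (port/agent forwarding, PTY allocation).
    from pathlib import Path

    PUBLIC_KEY = "ssh-ed25519 AAAA...example svc-diagnostics"   # placeholder key
    ALLOWED_COMMAND = "/usr/local/bin/collect-diagnostics.sh"   # placeholder script

    entry = (
        f'command="{ALLOWED_COMMAND}",no-port-forwarding,'
        f"no-agent-forwarding,no-X11-forwarding,no-pty {PUBLIC_KEY}\n"
    )

    authorized_keys = Path("/home/svc-diagnostics/.ssh/authorized_keys")
    with authorized_keys.open("a") as f:
        f.write(entry)

No matter what the connecting client asks for, sshd runs only the forced command, which keeps the key useful for one diagnostic task and nothing else.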
This seems like it’s conceding the point since SSM also allows you to run commands on nodes - I use it interchangeably with SSH to have Ansible manage legacy servers. Maybe what you’re trying to say is that it shouldn’t be routine and that there should be more of a review process so it’s not just a random unrestricted shell session? I think that’s less controversial, and especially when combined with some kind of “taint” mode where your access to a server triggers a rebuild after the dust has settled.
Yes, you nailed it with "it shouldn't be routine," and there for sure should be a review process. My primary concern with the audit logs actually isn't security; it's lowering the cowboy factor in the software lifecycle.
> combined with some kind of “taint” mode where your access to a server triggers a rebuild after the dust has settled.
Oh, I love that idea: thanks for bringing it to my attention. I'll for sure incorporate that into my process going forward
The first time I heard of it, it was a very simple idea: they had a wrapper around the command that installed SSH keys on an EC2 instance, and the wrapper also set a delete-after tag that CloudCustodian queried.
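Not their code, but a rough sketch of that pattern: a hypothetical "taint on access" helper that tags the instance with an expiry whenever a human is granted access, so a scheduled cleanup policy (CloudCustodian or otherwise) can recycle it later. The tag name and grace period are assumptions:

    # Hypothetical wrapper step: tag the instance with a delete-after
    # timestamp whenever someone is given access, so a cleanup policy can
    # terminate and replace it after the dust has settled.
    import datetime

    import boto3

    def taint_instance(instance_id: str, grace_hours: int = 24) -> None:
        ec2 = boto3.client("ec2")
        delete_after = (
            datetime.datetime.now(datetime.timezone.utc)
            + datetime.timedelta(hours=grace_hours)
        )
        ec2.create_tags(
            Resources=[instance_id],
            Tags=[{"Key": "delete-after",             # assumed tag name
                   "Value": delete_after.isoformat()}],
        )

    # e.g. call this right after installing a temporary SSH key:
    # taint_instance("i-0123456789abcdef0")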