Separate from my sibling comment about AWS SSM, I also believe that if you can't tell a Node is sick from the metrics or log egress coming off it, that's a deployment bug. I'm firmly in the "Cattle" camp, and am getting closer and closer to the "Reverse Uptime" camp - made easier by ASGs' relatively new "Maximum Instance Lifetime" setting, which makes it basically one click to get on board that train.
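For reference, a minimal sketch of flipping that setting on with boto3; the ASG name and the 14-day lifetime are placeholder assumptions:

    # Cap instance age on an existing Auto Scaling group so nodes get
    # recycled automatically instead of accumulating uptime.
    import boto3

    autoscaling = boto3.client("autoscaling")

    autoscaling.update_auto_scaling_group(
        AutoScalingGroupName="my-app-asg",        # placeholder ASG name
        MaxInstanceLifetime=14 * 24 * 60 * 60,    # 14 days, in seconds (minimum is 1 day)
    )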
Even as I type all these answers out, I'm super cognizant that there's not one hammer for all nails, and I am for sure guilty of yanking Nodes out of the ASG to figure out what the hell has gone wrong with them. But I try very, very hard not to place my Nodes in a precarious situation to begin with, so that such extreme troubleshooting is a minor-severity incident and not Situation Normal.
If accidentally nuking a single node while debugging causes issues, you have bigger problems. Especially if you are running Kubernetes, any node should be able to fall off the earth at any time without issues.
I agree that you should set a maximum lifetime for a node on the order of a few weeks.
I also agree that you shouldn’t be giving randos access to production infra, but at the end of the day there need to be some people at the company who have the keys to the kingdom, because you don’t know what you don’t know, and you need to be able to deal with unexpected faults or outages in the telemetry and logging systems.
I once bootstrapped an entire datacenter with tens of thousands of nodes from an SSH terminal after an abrupt power failure. It turns out infrastructure has lots of circular dependencies, and we had to break them manually.
Passive metrics/logs won't let you debug all the issues. At some point you either need a system for taking automatic memory dumps and submitting BPF scripts to live nodes... or you need SSH access to do that yourself.
There's a thousand ways to do it without SSH. It can be built into the app itself. It can be a special authenticated route to a suid script. It can be built into the current orchestration system. It can be pull-based, using a queue for system monitoring commands. It can be part of the existing monitoring agent. It can be run through AWS SSM. There's really no reason it has to be SSH.
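As one concrete example, here's roughly what the SSM route looks like with boto3. The instance ID and commands are placeholders, and this assumes the SSM agent and IAM permissions are already in place:

    # Run an ad-hoc diagnostic command on an instance via SSM Run Command,
    # no SSH daemon or key distribution involved.
    import boto3

    ssm = boto3.client("ssm")

    response = ssm.send_command(
        InstanceIds=["i-0123456789abcdef0"],           # placeholder instance ID
        DocumentName="AWS-RunShellScript",             # stock SSM document
        Parameters={"commands": ["df -h", "uptime"]},  # placeholder diagnostics
        Comment="ad-hoc node health check",            # shows up in the audit trail
    )
    command_id = response["Command"]["CommandId"]

    # Fetch the output once the command has finished (in practice, poll or
    # use a waiter rather than calling this immediately).
    output = ssm.get_command_invocation(
        CommandId=command_id,
        InstanceId="i-0123456789abcdef0",
    )
    print(output["Status"], output["StandardOutputContent"])

Every invocation is recorded by SSM, which is what makes it friendlier to review than an opaque shell session.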
And even with SSH you can have special keys authorised to run only specific commands, so a service account would be better than a personal one in that case.
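That's the command= forced-command option in authorized_keys. A minimal sketch of installing such a restricted key for a service account; the account name, key, and allowed command are placeholder assumptions:

    # Append an authorized_keys entry that forces a single command and strips
    # the usual escape hatches (port/agent forwarding, PTY allocation).
    from pathlib import Path

    PUBLIC_KEY = "ssh-ed25519 AAAA...example svc-diagnostics"   # placeholder key
    ALLOWED_COMMAND = "/usr/local/bin/collect-diagnostics.sh"   # placeholder script

    entry = (
        f'command="{ALLOWED_COMMAND}",no-port-forwarding,'
        f"no-agent-forwarding,no-X11-forwarding,no-pty {PUBLIC_KEY}\n"
    )

    authorized_keys = Path("/home/svc-diagnostics/.ssh/authorized_keys")
    with authorized_keys.open("a") as f:
        f.write(entry)

No matter what the connecting client asks for, sshd runs only the forced command, which keeps the key useful for one diagnostic task and nothing else.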
This seems like it’s conceding the point since SSM also allows you to run commands on nodes - I use it interchangeably with SSH to have Ansible manage legacy servers. Maybe what you’re trying to say is that it shouldn’t be routine and that there should be more of a review process so it’s not just a random unrestricted shell session? I think that’s less controversial, and especially when combined with some kind of “taint” mode where your access to a server triggers a rebuild after the dust has settled.
Yes, you nailed it with "it shouldn't be routine," and there for sure should be a review process. My primary concern with the audit logs actually isn't security; it's lowering the cowboy factor in the software lifecycle.
> combined with some kind of “taint” mode where your access to a server triggers a rebuild after the dust has settled.
Oh, I love that idea: thanks for bringing it to my attention. I'll for sure incorporate that into my process going forward
The first time I heard of it, it was a very simple idea: they had a wrapper around the command that installed SSH keys on an EC2 instance, and the wrapper also set a delete-after tag that CloudCustodian queried.
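Not their code, but a rough sketch of that pattern: a hypothetical "taint on access" helper that tags the instance with an expiry whenever a human is granted access, so a scheduled cleanup policy (CloudCustodian or otherwise) can recycle it later. The tag name and grace period are assumptions:

    # Hypothetical wrapper step: tag the instance with a delete-after
    # timestamp whenever someone is given access, so a cleanup policy can
    # terminate and replace it after the dust has settled.
    import datetime

    import boto3

    def taint_instance(instance_id: str, grace_hours: int = 24) -> None:
        ec2 = boto3.client("ec2")
        delete_after = (
            datetime.datetime.now(datetime.timezone.utc)
            + datetime.timedelta(hours=grace_hours)
        )
        ec2.create_tags(
            Resources=[instance_id],
            Tags=[{"Key": "delete-after",             # assumed tag name
                   "Value": delete_after.isoformat()}],
        )

    # e.g. call this right after installing a temporary SSH key:
    # taint_instance("i-0123456789abcdef0")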