The ops folks I work with bandy around the same idea you're getting at here, that engineers should not have access to the production systems they maintain. I'll ask you the question I ask them: when the system has a production outage¹, how am I supposed to debug it effectively? To do that, I need to be able to introspect the state of the system, and that pretty much necessitates running arbitrary commands.
Even if I'm stripped of such abilities… I write the code. I can just change the code to surface whatever data I need, and redeploy. That can be incredibly inefficient, as deploying often resets the very state I seek to get at, so I sometimes have to wait for it to recur naturally if I don't have a solid means of reproducing it.
You debug it via tooling, instrumentation and logs. I realize if you're accustomed to sudoing on prod when troubleshooting, this sounds crazy. Trust me, it works fine; better, in fact. Far fewer weird things happen in well-controlled environments.
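Concretely, "instrumentation" here mostly means emitting enough structured data up front that the log and metrics tooling can answer your questions later. A minimal Python sketch (the event and field names are made up for illustration):

    import json
    import logging
    import sys
    import time

    logger = logging.getLogger("orders")
    logger.addHandler(logging.StreamHandler(sys.stdout))
    logger.setLevel(logging.INFO)

    def log_event(event, **fields):
        # One JSON object per line, so the log shipper (Splunk, ELK, ...)
        # can index every field without custom parsing.
        logger.info(json.dumps({"ts": time.time(), "event": event, **fields}))

    # Hypothetical call site: capture the identifiers you'll want to search on.
    log_event("payment_failed", order_id="o-123", gateway_error=402)

Nothing clever, but if this is in place before the incident, "what happened to order o-123?" becomes a search query instead of an SSH session.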
One of the technical things it does is deliberately segregate developers from any kind of production access.
This is because (from historical experience) a "bad apple" developer can do amazingly fraudulent things, and have a more than reasonable chance of covering them up. o_O
We have a platform as a service team which maintains our PaaS infrastructure (think an internal version of Heroku), and they are the only ones who can SSH into any production systems (<50 engineers, I'd guess).
Engineers write the code (a green build and peer review are mandatory before merge, which protects against bad actors... but that's just good engineering practice anyway!), build their own deployment pipelines on Bamboo or Bitbucket Pipelines to push code & assets up to a Docker registry, and ultimately the deployment pipelines make some API calls to deploy the Docker images. Engineers are also responsible for keeping those services running; most products (such as Jira, Confluence, Bitbucket, etc.) also have a dedicated SRE team focused on improving the reliability of the services which support that product.
The vast majority (95%) of our production issues are troubleshot by looking at Datadog metrics (pulled from CloudWatch, and services publish a great deal of their own metrics too) and Splunk (our services log a lot, we log all traffic, and the host systems also ship their logs off). Fixes are usually to do an automated rollback (part of the PaaS), turn off a feature flag to disable the offending code, redeploy the existing code to clear a transient issue (knowing we'll identify a proper fix in the post-incident review), or in rare cases, roll forward by merging a patch & deploying that (~30 mins turnaround time - but this happens <5% of the time). Good test coverage, fast builds (~5 mins on avg), fast deploys, and automated smoke tests before a new deploy goes live all help a lot in preventing issues in the first place.
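To illustrate the feature-flag lever: the useful property is that the risky code path checks a flag at runtime, so it can be switched off without a build or deploy. A rough Python sketch with made-up names (a real setup would typically read from a flag service and cache the value rather than use an env var):

    import os

    def flag_enabled(name, default=False):
        # Kill-switch style flag: flipping the value takes effect on the
        # next read, no build or deploy required.
        value = os.environ.get("FEATURE_" + name.upper())
        if value is None:
            return default
        return value.lower() in ("1", "true", "on")

    def new_pricing_path():  # hypothetical code path guarded by the flag
        return "v2"

    def old_pricing_path():  # hypothetical fallback path
        return "v1"

    result = new_pricing_path() if flag_enabled("NEW_PRICING") else old_pricing_path()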
It's not perfect, but it works a lot better than you might expect.
Asking because this is one of the conceptual things I've been trying to figure out.
Still currently using non-Docker deployments of production stuff for our public services. Have looked at Docker a few times, but for deployment to the public internet, where being accessible to clients on both IPv4 and IPv6 is mandatory, it just doesn't seem to suit.
Docker (swarm) doesn't seem to do IPv6 at all, and the general networking approach in non-swarm docker seems insecure as hell for public services + it also seems to change arbitrarily between versions. For a (currently) 1 man setup, it seems like a bad use of time to have to keep on top of. ;)
Maybe using Nginx on the public services, reverse proxying to non-publicly-accessible Docker container hosts, would be the right approach instead?
Btw - asparck.com (as per your profile info) doesn't seem to be online?
The actual PaaS relies pretty heavily on AWS CloudFormation - it predates Swarm, Mesosphere, Kube, etc. So when we deploy a new version of a service, it's really "deploy an auto scaling group of EC2 instances across several AZs fronted by an ELB, then an automatic DNS change makes the new stack of EC2 instances live once the new stack is validated as working". The upside of the one-service-per-EC2-image approach is no multi-tenancy interference - the downsides are cost and that it takes a bit longer to deploy. There's a project underway to switch compute to being Kube-based though, so that's promising.
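To give a rough sense of the shape of that (this isn't our actual tooling, and the stack, zone and record names below are invented), the create-new-stack-then-swap-DNS pattern in boto3 looks something like:

    import boto3

    cfn = boto3.client("cloudformation")
    r53 = boto3.client("route53")

    # 1. Stand up a brand-new stack (ASG + ELB) alongside the old one.
    cfn.create_stack(
        StackName="myservice-v42",
        TemplateURL="https://example-bucket.s3.amazonaws.com/myservice.yaml",
        Parameters=[{"ParameterKey": "ImageTag", "ParameterValue": "v42"}],
    )
    cfn.get_waiter("stack_create_complete").wait(StackName="myservice-v42")

    # 2. Once the new stack passes validation (smoke tests etc.), point DNS at
    #    its load balancer so it takes traffic and the old stack can be retired.
    r53.change_resource_record_sets(
        HostedZoneId="ZEXAMPLE123",
        ChangeBatch={"Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "myservice.example.com.",
                "Type": "CNAME",
                "TTL": 60,
                "ResourceRecords": [{"Value": "myservice-v42-elb.us-east-1.elb.amazonaws.com"}],
            },
        }]},
    )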
All this is apples and oranges though - solutions for big companies don't make sense for a 1-person shop. I still have side projects being deployed via Ansible and git pull instead of Docker, because it hasn't been worth the ROI to upgrade to the latter.
Re asparck - yeah, it was my personal site but I struggled to find the time to work on it. In the end I decided it was better to have it offline than terribly out of date, but hopefully I'll resurrect it some day.
Problems with this are as follows (real, not imagined):
1. AWS CloudFormation scripts - who writes them? If dev does, sysads can't change them.
2. Does dev have the security mindset to maintain configurations in IaaS things like CloudFormation? Who reviews things like NACLs, Security Groups, VPCs, and the like?
3. Scripts - how big or impactful does a script need to be before it's a sysad's job to write rather than a dev's?
4. Oncall - normally a sysad's job, but when you implement strong gates between dev and sysad, you need oncall devs.
Note that I'm not saying devs should have access to every production machine; I'm only saying that access should be granted to devs for what they are responsible for maintaining.
Sure, one can write custom tooling to promote chosen pieces of information out of the system into other, managed systems. E.g., piping logs out to ELK. And we do this. But it often is not sufficient, and production incidents might end up involving information that I did not think to capture at the time I wrote the code.
Certain queries might fail, only on particular data. That data may or may not be logged, and root-causing the failure will necessitate figuring out what that data is.
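One partial mitigation (partial being rather the point) is to log the offending inputs at the failure site, so at least the data ends up in the managed system. A quick Python sketch, assuming a DB-API style connection whose cursors work as context managers:

    import logging

    log = logging.getLogger("db")

    def run_query(conn, sql, params):
        try:
            with conn.cursor() as cur:
                cur.execute(sql, params)
                return cur.fetchall()
        except Exception:
            # Capture the exact statement and parameters that failed, so the
            # "which data broke it?" question can be answered from the logs.
            log.exception("query failed sql=%r params=%r", sql, params)
            raise

But that only covers the failure sites you thought to wrap ahead of time, which is exactly the problem.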
And it may not be possible to add it to the code at the time of the incident; yes, later one might come back and capture that information in a more formal channel or tool now that one has the benefit of hindsight, but at the time of the outage, priority number one is always to restore the service to a functioning state. Deploying in the middle of that might be risky, or simply make the issue worse, particularly when you do not know what the issue is. (And since we're discussing introspecting the system, the part of the outage where you don't yet know what is wrong is almost always the part we're talking about.)
This is what I always felt was the more appropriate use of the term "tech debt". You are literally borrowing against tech built by others that likely did not have the same requirements as you.
Is it convenient? Yeah. But it breeds bad choices.
I only have so much time, and very little of it is budgeted towards things like pushing information into managed systems. I do that when and where I can, but I do not get (and frankly have never gotten) enough support from management or ops teams to build the tooling/infrastructure that would let me introspect the system sufficiently during issues w/o direct access to the system itself.
The only place where I really disagree on principle (that is, everything else you propose is theoretically possible given way more time & money than I have) is unexpected, unanticipated outages, which IMO should be the majority of your outages. Nearly all of our production issues are actual, unforeseen, novel issues with the code; most of them are one-offs, too, as the code is subsequently fixed to prevent recurrence.
But right at the moment it happens, we generally have no idea why something is wrong, and I really don't see a way to figure that out w/o direct access. We surface what we can: e.g., we send logs to Kibana, metrics to Prom/Grafana. But that requires us to have the foresight to send that information, and we do not always get that right; we'd need to be clairvoyant for that. What we don't capture in managed systems requires direct access.
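For what it's worth, the up-front instrumentation in question looks roughly like this (a sketch using prometheus_client; the metric names and the charge() function are invented) - and it only ever answers the questions you thought to encode in it:

    from prometheus_client import Counter, Histogram, start_http_server

    REQUESTS = Counter("payment_requests_total",
                       "Payment attempts by outcome", ["outcome"])
    LATENCY = Histogram("payment_latency_seconds",
                        "Time spent talking to the payment gateway")

    @LATENCY.time()
    def charge(order_id):
        # ... call the gateway ...
        REQUESTS.labels(outcome="ok").inc()

    if __name__ == "__main__":
        start_http_server(8000)  # exposes /metrics for Prometheus to scrape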
I'm not really disagreeing. There will be "break glass" situations. I just think these situations should be few and far between, and we should be working to make them fewer and farther. Consider, when was the last time you needed physical access to a machine? Used to, folks fought to keep that ability, too.
High-quality testing of your systems, engineers putting in the work to write playbooks that cover the vast majority of production incidents, and designing in metrics that aid debuggability all mean engineers get woken up less often for trivial things.
This isn't impossible. It's not even difficult or complex. It is time-consuming, and definitely requires a shift in mindset on the part of engineers.
For any incident that happens, I'm going to — if at all possible — fix it in code s.t. it doesn't happen again, ever. There is no playbook: the bug is fixed, outright.
That only leaves novel incidents, for which a playbook cannot exist by definition. Had I thought to write a playbook, I would have just fixed the code.
(I am not saying that playbooks can't exist in isolated cases, either, but in the general case of "system is no longer functioning according to specification", you cannot write a playbook for every unknown, since you quite simply can't predict the myriad of ways a system might fail.)
These points are valid, but there are situations that don't fall into either category: a hardware failure, a third-party service failure, or a common process that is mis-applied and needs to be reversed. There are plenty of scenarios that are neither bugs nor novel events, and playbooks can be authored for them. There is non-trivial value in making it easy for an operational team to handle such events, particularly when events that do recur have their handling codified.
You are absolutely correct that many events either will not recur or cannot be anticipated. But there is also value in recognizing that there are events outside those categories that can be anticipated and planned for.
Otherwise what's the point of automated testing? Just fix any bugs when they show up and never write tests!
That is a point, but it wasn't at all _my_ point. By "what is available" I was referring to the commands that are installed inside the container, which allow potential breakout of the container once the container is compromised.
FWIW, there is a breaking point with teams that don't restrict access to the production environment: once too many people have access, it becomes unmanageable.