
> where people working in an environment where it's normal to have every command available on a production system (really?)

Yes, really.

The ops folks I work with bandy about the same idea you're getting at here: that engineers should not have access to the production systems they maintain. I'll ask you the question I ask them: when the system has a production outage¹, how am I supposed to debug it effectively? To do that, I need to be able to introspect the state of the system, and that pretty much necessitates running arbitrary commands.

Even if I'm stripped of such abilities… I write the code. I can just change the code to surface whatever data I need, and redeploy. That can be incredibly inefficient, though: deploying often resets the very state I'm trying to get at, so if I don't have a solid means of reproducing the issue, I sometimes have to wait for it to recur naturally.

I think you'll find that at larger companies, or companies with a need for taking their security posture seriously, that is the norm. Even before we grew, only a couple engineers had prod access.

You debug it via tooling, instrumentation and logs. I realize if you're accustomed to sudoing on prod when troubleshooting, this sounds crazy. Trust me, it works fine; better, in fact. Far fewer weird things happen in well-controlled environments.

Yeah, I've occasionally wondered how (if) the DevOps approach is implemented at places which have to be SOX compliant.


One of the technical things SOX compliance does is segregate (on purpose) developers from any kind of production access.

This is because (from historical experience) a "bad apple" developer can do amazingly fraudulent things, and have a more than reasonable chance of covering it up. o_O

Engineering manager (& previously a developer) at a ~2k developer SOX-compliant company chiming in here.

We have a platform as a service team which maintains our PaaS infrastructure (think an internal version of Heroku), and they are the only ones who can SSH into any production systems (<50 engineers, I'd guess).

Engineers write the code (a mandatory green build and peer review are required before merge to protect against bad actors... but that's just good engineering practice too!), build their own deployment pipelines on Bamboo or Bitbucket Pipelines to push code & assets up to a docker registry, and ultimately the deployment pipelines make some API calls to deploy docker images. Engineers are also responsible for keeping those services running; most products (such as Jira, Confluence, Bitbucket, etc) also have a dedicated SRE team focused on improving the reliability of the services which support that product.
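A pipeline like that might look roughly like this sketch of a bitbucket-pipelines.yml - the registry host, service name, and deploy endpoint here are all invented for illustration, not the actual setup:

```yaml
# Hypothetical bitbucket-pipelines.yml; registry.internal.example,
# my-service, and the PaaS deploy URL are made-up placeholders.
pipelines:
  branches:
    master:
      - step:
          name: Build and push image
          script:
            - docker build -t registry.internal.example/my-service:$BITBUCKET_COMMIT .
            - docker push registry.internal.example/my-service:$BITBUCKET_COMMIT
      - step:
          name: Deploy via PaaS API
          script:
            # The PaaS takes over from here: rollout, health checks, rollback.
            - curl -fsS -X POST "https://paas.internal.example/api/deploy" -H "Authorization: Bearer $PAAS_TOKEN" -d "service=my-service&version=$BITBUCKET_COMMIT"
```

The key point is that the engineer's pipeline only ever talks to an API; it never SSHes anywhere.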

The vast majority (95%) of our production issues are troubleshot by looking at Datadog metrics (from CloudWatch; services publish a great deal of metrics too) and Splunk (our services log a lot, we log all traffic, and the host systems also ship their logs off). Fixes are usually to do an automated rollback (part of the PaaS), turn off a feature flag to disable the offending code, redeploy the existing code to fix a transient issue (knowing we'll identify a proper fix in the post-incident review), or in rare cases, roll forward by merging a patch & deploying that (~30 mins turnaround time - but this happens <5% of the time). Good test coverage, fast builds (~5 mins on avg), fast deploys, and automated smoke tests before a new deploy goes live all help a lot in preventing issues in the first place.

It's not perfect, but it works a lot better than you might expect.

Interesting. With the deployed docker images, is that only for internal applications, or do you do external (public facing) applications as well?

Asking because it's one of the conceptual things I've been trying to figure out.

Still currently using non-Docker deployment of production stuff for our public services. Have looked at Docker a few times, but for deployment to the public internet, where being accessible to clients on both IPv4 and IPv6 is mandatory, it just doesn't seem to suit.

Docker (swarm) doesn't seem to do IPv6 at all, and the general networking approach in non-swarm docker seems insecure as hell for public services + it also seems to change arbitrarily between versions. For a (currently) 1 man setup, it seems like a bad use of time to have to keep on top of. ;)

Maybe using Nginx on the public services, reverse proxying to not-publicly-accessible docker container hosts would be the right approach instead?
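That's a pretty common shape: nginx on the public host terminates both IPv4 and IPv6, and proxies to container hosts that are only reachable on a private network, so Docker's own networking never faces the internet. A minimal sketch (the backend address and server name are made up):

```nginx
# Public-facing nginx: listens on IPv4 and IPv6.
server {
    listen 80;
    listen [::]:80;
    server_name example.com;

    location / {
        # 10.0.0.5 is a hypothetical container host on the private
        # network; its published Docker port is never exposed publicly.
        proxy_pass http://10.0.0.5:8080;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}
```

This sidesteps Docker's IPv6 story entirely, since clients only ever talk to nginx.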

Btw - asparck.com (as per your profile info) doesn't seem to be online?

Internal and external apps are both deployed on the same PaaS. Only difference is that internal apps aren't reachable "outside the VPN"; when it comes to building your service, it's an extra line of yaml in your service config. There's a network engineering team who works with the PaaS team to make that happen - it's definitely a nice luxury of a big company that you don't need to worry about setting up VPCs etc yourself.

The actual PaaS relies pretty heavily on AWS CloudFormation - it predates swarm, mesosphere, kube, etc. So when we deploy a new version of a service, it's really "deploy an auto scaling group of EC2 instances across several AZs fronted by an ELB, then an automatic DNS change makes the new stack of EC2 instances live once it's validated as working". The upside of the one-service-per-EC2-image approach is no multi-tenancy interference - the downsides are cost, and deploys take a bit longer. There's a project underway to switch compute to being Kube-based though, so that's promising.
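The flow described above is roughly this shape, sketched as pseudocode with the aws CLI - every stack, zone, and file name here is invented:

```shell
# Sketch only: names are placeholders, and the real PaaS automates all of this.
# 1. Stand up a fresh stack (new ASG + ELB) for the new version.
aws cloudformation create-stack --stack-name my-service-v42 \
  --template-body file://service.yaml \
  --parameters ParameterKey=Version,ParameterValue=v42
aws cloudformation wait stack-create-complete --stack-name my-service-v42

# 2. Validate the new stack before it takes any traffic.
curl -fsS https://my-service-v42-elb.internal.example/healthcheck

# 3. Cut DNS over to the new stack's ELB; the old stack is torn down later,
#    which is also what makes rollback cheap (point DNS back).
aws route53 change-resource-record-sets --hosted-zone-id Z123EXAMPLE \
  --change-batch file://point-at-v42.json
```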

All this is apples and oranges though - solutions for big companies don't make sense for a 1-person shop. I still have side projects being deployed via ansible and git pull instead of Docker, because the switch hasn't been worth the ROI yet.

Re asparck - yeah, it was my personal site but I struggled to find the time to work on it. In the end I decided it was better to have it offline than terribly out of date, but hopefully I'll resurrect it some day.

I work in an org that has strong partitions between devs and sysads. Dev can make and test code. They have no access to prod. Sysads can configure and execute what devs make. They have read-only access to the source repos.

Problems with this are as follows (real, not imagined):

1. AWS CloudFormation scripts - who makes them? If dev does, sysads can't change them.

2. Does dev have the security mindset to maintain configurations in IaaS tooling like CloudFormation? Who reviews things like NACLs, Security Groups, VPCs, and the like?

3. Scripts - how big does a script need to be, or what impact does it need to have, before it's a sysad's job to write rather than a dev's?

4. On-call - normally a sysad's job, but when you implement strong gates between dev and sysad, you need on-call devs too.

Thanks, that all falls in line with roughly the kind of problems I'd expect. :)

I work on a SOX compliant project, it works pretty much as described. As a developer I have no access to prod, relying on other teams to make system changes and creating a lengthy mandated paper trail for the SOX audit team to look over. Not to say there aren't headaches with the approach, but thankfully at a big enough organization it's become a relatively smooth process.

Tooling requires access. Are you saying no dev should ever `strace` a process? (That requires not only access but - presuming my UID != the service's UID - sudo, too.)
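Concretely, the kind of one-off introspection I mean looks like this (the service name is a placeholder; these commands have to run on the production host itself):

```shell
# Attach to the running service; sudo is needed because the process
# runs as the service user, not as me.
pid=$(pgrep -o my-service)            # "my-service" is a placeholder
sudo strace -f -p "$pid" -e trace=network -s 256

# Or just snapshot what the process has open right now:
sudo ls -l /proc/"$pid"/fd
```

None of this can be replicated by a metrics dashboard after the fact.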

Note that I'm not saying devs should have access to every production machine; I'm only saying that access should be granted to devs for what they are responsible for maintaining.

Sure, one can write custom tooling to promote chosen pieces of information out of the system into other, managed systems. E.g., piping logs out to ELK. And we do this. But it often is not sufficient, and production incidents might end up involving information that I did not think to capture at the time I wrote the code.

Certain queries might fail, only on particular data. That data may or may not be logged, and root-causing the failure will necessitate figuring out what that data is.

And it may not be possible to add it to the code at the time of the incident; yes, later one might come back and capture that information in a more formal channel or tool now that one has the benefit of hindsight, but at the time of the outage, priority number one is always to restore the service to a functioning state. Deploying in the middle of that might be risky, or simply make the issue worse, particularly when you do not know what the issue is. (Which, since we're discussing introspecting the system, I think is almost always the part of the outage where you don't yet know what is wrong.)

Your system should already surface the information necessary for you to do your job. Or you aren't doing your job.

This is what I always felt was the more appropriate use of the term tech debt. You are literally borrowing against tech built by others who likely did not have the same requirements as you.

Is it convenient? Yeah. But it breeds bad choices.

> Your system should already surface the information necessary for you to do your job. Or you aren't doing your job.

I only have so much time, and very little of it is budgeted towards things like pushing information into managed systems. I do that when and where I can, but I do not get - and frankly never have gotten - sufficient support from management or ops teams to build the tooling/infrastructure that would let me introspect the system during issues w/o direct access to it.

The only place where I really disagree on principle (everything else you propose is theoretically possible, given way more time & money than I have) is unexpected, unanticipated outages - which, IMO, should be the majority of your outages. Nearly all of our production issues are actual, unforeseen, novel issues with the code; most of them are one-offs, too, as the code is subsequently fixed to prevent recurrence.

But right at the moment it happens, we generally have no idea why something is wrong, and I really don't see a way to figure that out w/o direct access. We surface what we can: e.g., we send logs to Kibana, metrics to Prom/Grafana. But that requires us to have the foresight to send that information, and we do not always get that right; we'd need to be clairvoyant for that. What we don't capture in managed systems requires direct access.
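On the "surface what we can" side, even something as small as emitting one JSON object per log line makes the Kibana end much easier to query. A naive shell sketch (the field names are just an example, and it assumes the message contains no quote characters):

```shell
# Emit one JSON object per line so the log shipper can index fields
# without extra parsing. Naive: assumes $* contains no '"' characters.
log_json() {
  level=$1; shift
  printf '{"ts":"%s","level":"%s","msg":"%s"}\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$level" "$*"
}

log_json INFO  "payment accepted"
log_json ERROR "upstream timeout"
```

But this only helps for the fields you thought to emit; it's exactly the foresight problem again.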

Apologies for the slow response.

I'm not really disagreeing. There will be "break glass" situations. I just think these situations should be few and far between, and we should be working to make them fewer and farther. Consider, when was the last time you needed physical access to a machine? Used to, folks fought to keep that ability, too.

> I'll ask you the question I ask them: when the system has a production outage¹, how am I supposed to debug it, effectively?

High-quality testing of your systems, leading to engineers generating playbooks that cover the vast majority of production incidents, could be one approach that some might consider. Designing in metrics that aid debuggability could even be possible in some scenarios! Taken together, this can mean engineers get woken up less often for trivial things.

This isn't impossible. It's not even difficult or complex. It is time-consuming, and definitely requires a shift in mindset on the part of engineers.

> leading to engineers working on generating playbooks that cover the vast majority of production incidents

For any incident that happens, I'm going to — if at all possible — fix it in code so that it doesn't happen again, ever. There is no playbook: the bug is fixed, outright.

That only leaves novel incidents, for which a playbook cannot exist by definition. Had I thought to write a playbook, I would have just fixed the code.

(I am not saying that playbooks can't exist in isolated cases, either, but in the general case of "system is no longer functioning according to specification", you cannot write a playbook for every unknown, since you quite simply can't predict the myriad of ways a system might fail.)

You're right! For bugs, they should be fixed and never recur. There is no playbook for this. For novel one-offs, they also cannot be anticipated, and thus cannot be planned for.

These points are very wise and correct. Yet, is it possible that situations might occur that don't fall into these situations? For instance, a hardware failure or a third-party service failure, or a common process is mis-applied and needs to be reversed. There could be a vast number of potential scenarios that are neither bugs nor novel events for which playbooks could be authored. There is non-trivial value to be gained in making it easy for an operational team to handle such events, particularly when events that do recur have their handling codified.

You are, of course, absolutely correct to note that many events either will not recur or cannot be anticipated. Yet, might there also be value to be gained by recognizing that there are events outside these categories that can be anticipated and planned for?

You're ignoring regressions/maintainability.

Otherwise what's the point of automated testing? Just fix any bugs when they show up and never write tests!

> The ops folks I work with banter around the same idea that you're getting at here, that engineers should not have access to the production system they maintain

that is a point, but it wasn't at all _my_ point. by "what is available" I was referring to the commands that are installed inside the container, which allow potential breakout of the container once it's compromised.

fwiw there is a breaking point with teams that don't restrict access to the production environment. once too many people have access it becomes unmanageable.
