And, you know, developers having time to get better at developing instead of spending it learning all the tooling and gaining all the extra experience on top of what you already have to know to be an effective developer. Specialization is not always a bad thing. When everybody does everything, you end up with a bunch of people who do most workflows only a couple of times a year and lack the routine to handle situations where the extra automation you might have built to help them is insufficient. That just adds a lot of stress for your employees. The primary upside is that you save headcount while things are going well.
The old world was one where operations did not allow devs any access to production. So you get a ticket from a user, but you have no access to logs or databases or whatever, just a screenshot from the user. Performance testing is only allowed on dev, with a dataset of 2 heavily anonymized records. Deploying to prod required devs to specify paths and server names, but devs were not allowed to know what servers ran the application or what paths were available. Detailed deployment plans are 'optimized' by ops to remove 'unnecessary' steps, then for some strange reason the new version crashes 5 seconds after release, which is promptly declared a 'programmer error'. Preproduction and production were generally configured by different people, so plenty of differences between the environments exist. Never-ending migration projects mean your application might accidentally partially disappear, with only user tickets as a signal that something happened. Monitoring is ignored by ops, as it produces more alerts than they can handle. I wonder why; I never got one.
So now I as a dev also do ops. I added some monitoring, and every few days I manually check basic health parameters. Problems generally get solved without users even noticing, and I can schedule time for this and work uninterrupted after that. I know what I prefer.
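For the curious, "basic health parameters" really just means a small script I run by hand or from a reminder; a rough sketch of the idea (the endpoint, disk threshold, and checks are made-up examples, not my real setup):

    # Rough health-check sketch; the URL, threshold, and checks are
    # hypothetical examples, not a real production setup.
    import shutil
    import urllib.request

    HEALTH_URL = "https://example.internal/healthz"  # hypothetical endpoint

    def app_responds() -> bool:
        with urllib.request.urlopen(HEALTH_URL, timeout=5) as resp:
            return resp.status == 200

    def disk_ok(threshold: float = 0.9) -> bool:
        usage = shutil.disk_usage("/")
        return usage.used / usage.total < threshold

    def run_checks() -> None:
        for name, check in [("app responds", app_responds), ("disk usage", disk_ok)]:
            try:
                status = "OK" if check() else "FAIL"
            except Exception as exc:
                status = f"FAIL ({exc})"
            print(f"{status:>4}  {name}")

    if __name__ == "__main__":
        run_checks()

The value isn't the script itself, it's that the checks are written down and take a couple of minutes to run.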
I think the ideal might be to get broad experience and then specialise a little later.
I think this system works reasonably well to keep everyone aware of production issues.
Having a platform engineering team, if your company can initially afford to bootstrap such a thing, is a huge productivity accelerator for your developers, but even with that in place, I’d say it’s still healthy for the dev to be the first port of call for an app outage.
Yes, you could follow a rule book, or a deployment framework, but you'll be better at what you do if you know why you do it.
Not disagreeing with you but want to point out that it’s not straightforward and often requires much work to make ops easier depending on your situation.
It’s sort of a delicate line of course, because you’re right on some level. People can’t know everything. So now our split is that developers need to be capable of working with things like Azure AD, but they don’t need to know how to manage it. They need to be capable of creating databases and handling the security on them, but they don’t need to know how we operate our actual database clusters. And so on.
On the flip side some of our operations staff need to know how to program in Python to be able to handle disasters if something happens to our Azure services or our many back-ends. They also need to know how to develop our software robots.
But I think the biggest challenge most people face isn’t really on the operations vs development front. It’s on the new development vs old development improvement front. This is likely not as big an issue at a place like Netflix where they don’t service a gazillion different business needs, but in broader enterprise businesses it quickly becomes a resource issue, because nobody is really going to have updated knowledge on 400 different systems at all times. And you can’t have people sitting around waiting for update requests on systems that may not produce one in a year.
This move to make the dev team all-doing isn't really working as efficiently as the cost reduction made it appear.
As I was once told, you just need to be one chapter ahead of the customer.
The worst is when full-cycle developers (and less often, the managers) jump ship to another team, org, or company after it is clear what they've built is a big ball of unmanageable, painful mud.
When you've got an SRE org, there are stricter guard rails in place to ensure you don't end up building a Frankenstein monster.
For the same reason the Cloud makes sense, Serverless makes sense (offload what you can onto world-class experts who do know what they're doing)... SRE orgs would provide better ROI than own-what-you-build teams over the long run. That is, anyway, my armchair take.
And I can't think of anything more annoying than being on-call for work outside regular work hours.
So I think it's awesome that some people want to do this work. I hope they do it full-time, and that lets me do my thing full-time.
Obviously many developers like the mixed model or accept it. I'm curious whether I'm an outlier or whether lots of others feel the same.
Unless your company has a very strong relationship between ops and devs, the ops side gets stuck supporting whatever the developer side creates and the developer side is incentivized to jam as many features into code as quickly as possible.
It's very expensive and customer-impacting when you find issues in production versus preventing them in the first place with code that's a little more "runnable" (like testable code, I'm also convinced there's "runnable" or "operable" code).
The really productive companies that deliver fast are also way more robust.
> like testable code, I'm also convinced there's "runnable" or "operable" code
1000% this. For me, this has been a huge focus in my past startup lives. Software and infrastructure are very related. Infra design impacts software design and vice versa. By working closely together, it's possible to try to get the best of both worlds.
As a VP Eng, I expended a lot of political capital with my "why aren't you building more features" CEO boss so that my dev team could focus on significantly enhancing the stability and observability of our platform, including ensuring our log messages were informative and at the appropriate level. i.e. CRITICAL = Fix Now, ERROR = someone should check into this at earliest convenience, INFO = here's how things are generally running. Stability saved everyone time and effort, and observability allowed (and empowered) the ops team to efficiently triage, and let Engineers quickly understand what's going on and how to fix it when things went wrong. The net result was multiple nightly on-call alerts for Engineers dwindling to "I can't remember the last one" and ops folks were able to focus far more on infra than triaging software alerts.
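To make the convention concrete, here's a rough sketch of how those levels map onto standard Python logging (the messages are made-up examples, not our actual code):

    # Sketch of the log-level convention described above; the messages
    # are invented examples, not real application code.
    import logging

    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
    )
    log = logging.getLogger("billing")

    log.info("nightly invoice run started")          # how things are generally running
    log.error("invoice render failed, will retry")   # check into this at earliest convenience
    log.critical("payment gateway unreachable")      # fix now; this is what pages someone

The particular library doesn't matter; what mattered was that each level mapped to a specific action for whoever was on call.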
It was a battle and I had to fight to make it happen, and even have my teams "ninja" a bunch of the work via an 18 month inside-out refactor of the codebase, but the result was 200% worth it. The cherry on top was the two full days of grilling by tech leaders during due diligence for our acquisition. They dug deep into the operational aspects of the business and came away very impressed, and that operational maturity (this was a company that highly valued such things) was highlighted as one of two key factors that swayed the decision to acquire.
My experience was support during business hours as part of my job as a software engineer. I used to hate hearing the phone ring, and it was rarely if ever a problem directly related to what I was developing. I felt like it was just a huge disruption to my work.
> So I think it's awesome that some people want to do this work. I hope they do it full-time, and that lets me do my thing full-time.
If you're lucky and your company lets you pick whether you want to apply for a SWE role or an SRE role (like Google), then it will work out ok. But a lot of companies just used "DevOps" as an excuse to get rid of their ops teams and dump the work onto the dev teams.
Some places take it a step further and get quite toxic, and have a culture where there's a complete lack of sympathy for engineers who have a heavy oncall load. The idea is that their code is shit so they deserve to be woken up in the middle of the night, as some sort of punishment.
I mean, isn’t that the case to some extent? It certainly incentivizes you to do better.
I can see a problem when you don’t get the resources to do better though.
A dedicated Ops team is nice when, at the application level, it's limited to very basic tasks like just restarting the service. If the services are not under-provisioned and have properly set up process/app orchestration in place, this is actually a non-task... Under these conditions I don't mind being on-call every now and then, or even all the time if done right.
But a very important aspect for me is that, if that happens, the developers give the correct priority to fixing the bug that woke me up at night.
So, it's fine if you only want to work during the day. But if I get woken up at night or on the weekend, you'd better start fixing the bug that caused it as soon as it's day again.
I guess my point is more that maybe that separation of roles actually also reflects the differences in personality and how different people work.
Edit: I'm talking about being on-call outside of business hours. I have no problem being on-call 9-5.
And when it comes to big systemic problems with security/stability that plague the IT industry, it's usually this back catalog that's the real cause and the problem that gets the least attention.
Everyone wants to sell you the solution to problems where there are 10,000 servers for a single set of related apps managed by 1,000 developers. Think companies like Netflix, where there are teams of people dedicated to specific parts of DevOps.
Nobody has good solutions for legacy but still functional apps developed by some guy named Bob that was a contractor for 6 months and he's gone to some other gig.
For example: Take Splunk, or Azure Log Analytics, or any similar tool. They work great if you get 100K hits per day. The graphs are beautifully smooth and the wealth of data makes it trivial to extract all sorts of insights. Approaches like A/B testing can accurately detect changes as small as 1% in many cases.
But what do you do with the site that gets 5 hits per week, but is still important because what it does for each one of those hits is worth on the order of a thousand dollars? The tooling is a total letdown, basically nothing out there can help with apps like this. Trying to retrofit some legacy app into Kubernetes or whatever would cost more than the app makes in revenue. Leaving it alone is just as bad, because eventually you have to upgrade the server it lives on and the load balancer in front of it. At which point it will break. Or it might not. How would you know? You only get 5 hits a week...
Log files and a few shell scripts can go an awful long way, from simple single-user apps to complex corporate environments with dozens of servers. They require almost zero upfront investment and you can build on them as needed.
You won't get fancy A/B testing (without a lot of work) or a pretty UI with smooth graphs, but grep can happily tear through 100GB worth of log files and get you some of that good stuff you want from Azure Log Analytics.
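To give a feel for how little is needed, here's a rough Python equivalent of that grep pass (the log path, the "ERROR" marker, and the assumption that each line starts with a YYYY-MM-DD date are illustrative, not a standard):

    # Count error lines per day across plain log files.
    # Path and log format are assumptions about a hypothetical app.
    import collections
    import glob

    counts = collections.Counter()
    for path in glob.glob("/var/log/myapp/*.log"):
        with open(path, errors="replace") as f:
            for line in f:
                if "ERROR" in line:
                    counts[line[:10]] += 1  # assumes lines start with YYYY-MM-DD

    for day in sorted(counts):
        print(day, counts[day])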
As an old developer, this is very much how we operated before CI/CD was a thing in the early 2000s. I remember we were responsible for developing, monitoring and support. In our team we each had one week where we did support and monitoring as the primary task.
What if you have, say, a DB2 cluster with HADR running, do you think you can monitor it and fix something if needed? Do you know the ins and outs of shared-drive configuration using GFS? Kubernetes configuration details, if something goes wrong, should every developer have sufficient knowledge to troubleshoot it? How about an A10 load balancer, what does the average developer know about that, can the average developer fix a failed virtual IP configuration or firewall rules on the A10? Solving Red Hat Satellite cache issues?
I haven't even mentioned handling some complicated network setup with third-party connections through a proprietary VPN solution, AWS cloud deployments using the multitude of services that AWS offers, or obscure stuff like IBM zSeries or Unisys mainframes, etc.
A developer can be trained in all kinds of technologies, but there are truly a lot of them; it's a huge body of knowledge, and learning it will take a lot of time and money.
In that regard, if a developer sees their application degrading due to an issue with the underlying platform, they'd engage the developers that built that platform (and ideally those developers have built a system to detect problems before the "end user" system starts degrading).
In the same vein, it's not expected that every developer understands kernel and hardware development--there are abstractions and boundaries in place that separate responsibility. The recurring theme is that the person who configured/coded/built something is responsible for operating it.
There is no way to master language ecosystem X, especially large ones like Java, .NET (C#, F#, VB, C++/CLI), C++20, multiple OS stacks (Windows, GNU/Linux flavours, macOS, Android, iOS, ....), and then put Amazon, Azure, IBM Cloud, .... on top.
In practice, if one doesn't silo into something, we all end up jack of all trades, master of none, jumping from whatever comes out of the IT spring collection of the year into the next.
Sorry, but no. All the reasons that exist for developers doing on-call rotation are reasonable, yes, I admit that, but my reason to not do on-call rotation is also reasonable: I do not want to give my employer more than 40h/week. I just do not want more money in exchange for my free time. If your company doesn't provide that, then "alright, thank you" and I will keep looking. Now, if all companies start to require (paid) on-call rotation for developers: that would be a very sad tech scenario (at least for people like me).
I said that DevOps means different things to different people:
To some: it's the evolution of "Operations" to integrate them into the team so that operational issues are automated out and developers can get greater velocity.
To others: it's a sysadmin who can use CI/CD and fumble through scripting something.
To the majority: it's a developer who can fumble through installing a server or package, and this is jedberg's definition too, and one he argues is canonical.
My opinion is that specialising is important.
To this effect: it's _far_ too much cognitive load to understand how your OS works, your application, its framework, the cloud provider, the logging tools, SLI/SLOs and error budgets, on-call policies and alerting mechanisms, application level security, authentication of systems, authorization of systems, SSO, etc;etc;etc;etc;
Learning all of those various things and weighing them in your head every time you make a decision is fatiguing. It took me 10 years to get "great" at Ops and become a decent scripter; it's possible that Ops is easier than programming, but from what I understand it's not easier, it's just different.
10 years ago it was Nagios, then Zabbix, then some combination of Graphite and co, now it's Prometheus, maybe Thanos, or is it InfluxDB+Kapacitor?
That's just alerting, there's new and changing solutions for logging, message queues, databases, IdP tooling (keycloak, vs FreeIPA vs AD), distributed tracing systems, heck even load balancers (GCE Ingress vs Nginx vs traefik vs etc;etc;etc).
Hell, understanding how Kubernetes works is basically a full time job otherwise "black magic is happening" and debugging becomes a nightmare.
Frameworks change or get replaced too, but it doesn't feel like it's close to the same rate as infrastructure software.
I remember at college one of the professors said to me: "It's impossible to know everything about computers, because as you're learning: new things come out, and after you've learned something: it will change. The more you learn, the more you have to keep up to date until you can no longer keep up."
I think maybe people don't think like that.
I think we need to have people of different disciplines.
Who is in charge when the latest update of Chrome breaks your clients' websites? Wearing both hats certainly helped us fix stuff more easily, for example by inserting headers through HAProxy, even though our regular job is programming.
What about performance optimization? Can that slow SQL query be fixed by tuning the MySQL server, or maybe by just rewriting the code?
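When we do put the ops hat on, answering that usually starts with just looking at the query plan before touching any server knobs. A minimal sketch, assuming a MySQL database reachable via PyMySQL; the connection details, table, and query are made up:

    # First step in "tune the server or rewrite the code?": inspect the plan.
    # Host, credentials, table, and query are hypothetical examples.
    import pymysql

    conn = pymysql.connect(host="db.example.internal", user="app",
                           password="secret", database="shop")
    with conn.cursor() as cur:
        cur.execute("EXPLAIN SELECT * FROM orders WHERE customer_id = %s", (42,))
        for row in cur.fetchall():
            print(row)  # a full table scan usually points at a missing index,
                        # i.e. a schema/code fix rather than server tuning
    conn.close()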
With that said, I can also see a potential downside to this development model.
Once a new feature is stable enough, the number of people required to support it has to reduce.
Wondering how Netflix solved this?
As a result, we’re incentivized to reduce on call noise and have the system automate remediation as much as possible.
Granted, our company does have centralized teams that build tooling. So that does seem to be a prerequisite to having full cycle developers.
If you’re building in the cloud, I think there’s less need to be a SysAdmin these days (I used to be one), and Azure makes it pretty easy to have automatic updates, firewall rules, etc. We rarely touch our hosting environment (Service Fabric on VMSS).
The on-call sucks at times, but I’d never trade it for how fast we’re able to ship.
Devs should be producing easy-to-operate code, so their oncall should not be burdensome, and there should be tight cycles to drive it to near zero.
previous discussion, 2 years ago, 20 upvotes, https://news.ycombinator.com/item?id=19481912
Developers already have an incredible amount to learn without making them devops too.
This is the sort of thing that non-tech management would love, though.
It’s a great slogan.
Essentially the sign of mismanagement dressed up as innovation.