Hacker News
Full cycle developers – Operate what you build (2018) (netflixtechblog.com)
124 points by samdung 11 days ago | 59 comments

>The primary upside of having a separate ops team was less developer interrupts when things were going well.

And, you know, developers having time to get better at developing instead of spending their time learning all the tooling you need to understand and the experience you need to gain in addition to what you already have to know to be an effective developer. Specialization is not always a bad thing. By having everybody do everything you now just have a bunch of people that do most workflows only a couple of times a year and lack the routine to handle situations where the extra automation you might have built to help them is insufficient. This just adds a lot of stress for your employees. The primary upside is that you save headcount while things are going well.

As a developer, I have fewer interruptions now that I can partially do my own operations.

The old world was one where operations did not allow devs any access to production. So you get a ticket from a user, but you have no access to logs or databases or whatever, just a screenshot from the user. Performance testing is only allowed on dev, with a dataset of 2 heavily anonymized records. Deploying to prod required devs to specify paths and server names, but devs were not allowed to know what servers ran the application or what paths were available. Detailed deployment plans are 'optimized' by ops to remove 'unnecessary' steps, then for some strange reason the new version crashes 5 seconds after release, which is promptly declared a 'programmer error'. Preproduction and production were generally configured by different people, so plenty of differences exist between the environments. Neverending migration projects mean your application might accidentally partially disappear, with only user tickets as a signal that something happened. Monitoring is ignored by ops, as it produces more alerts than they can handle. I wonder why; I never got one.

So now I as a dev also do ops. I added some monitoring, and every few days I manually check basic health parameters. Problems generally get solved without users even noticing, and I can schedule time for this and work uninterrupted after that. I know what I prefer.
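
To make the "manually check basic health parameters" part concrete, here's a minimal sketch of that kind of scheduled check; the metric names and thresholds are invented for illustration:

```python
# Minimal sketch of a periodic health check: compare a few basic
# health parameters against thresholds and report anything that needs
# attention before users notice. The parameter names and thresholds
# below are made up for illustration.

THRESHOLDS = {
    "disk_used_pct": 80,    # warn when the disk is over 80% full
    "error_rate_pct": 1,    # warn when more than 1% of requests fail
    "p95_latency_ms": 500,  # warn when 95th-percentile latency exceeds 500 ms
}

def check_health(metrics: dict) -> list:
    """Return a list of warnings for metrics that exceed their threshold."""
    warnings = []
    for name, limit in THRESHOLDS.items():
        value = metrics.get(name)
        if value is not None and value > limit:
            warnings.append(f"{name}={value} exceeds threshold {limit}")
    return warnings

# Example run, with values that would come from your monitoring system:
print(check_health({"disk_used_pct": 92, "error_rate_pct": 0.2, "p95_latency_ms": 310}))
```

In practice the numbers would come from whatever monitoring you added; the point is that a few minutes of scheduled checking replaces a stream of interrupts.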

I think the most important part of what a developer should learn from ops is the daily struggles. Observability, performance characteristics under load, debuggability, stability, etc. If you just build shit and throw it over a hedge for an ops team to manage, you'll never learn.

As someone currently leading a small team that is responsible for absolutely everything technical at our company (from development to ops to IT and even UX), I do agree that it's a good learning experience. On the other hand it sure does take a lot of time out of the day. And I think all the context switching definitely leads me to be less productive than I would otherwise be. Of note for me: while development work often requires uninterrupted time and deep concentration, ops and customer support requests are often urgent and have to be dealt with right away.

I think the ideal might be to get broad experience and then specialise a little later.

What we do is shared ops duty between the DEVs, including the lead. E.g. in a team of 5, each takes a working day a week. On-call for off hours issues is organized separately.
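
A rotation like that is simple enough to sketch in a few lines; the names here are made up:

```python
# Sketch of the weekday ops rotation described above: in a team of
# five, each developer (lead included) covers one working day.

from itertools import cycle

WEEKDAYS = ("Mon", "Tue", "Wed", "Thu", "Fri")

def ops_rota(team: list) -> dict:
    """Assign one ops day per person, cycling if the team is smaller than 5."""
    people = cycle(team)
    return {day: next(people) for day in WEEKDAYS}

print(ops_rota(["alice", "bob", "carol", "dan", "erin"]))
```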

I think this system works reasonably well to keep everyone aware of production issues.

I think what's happening here is we're conflating "developer" with "product designer." And often, yeah, they're the same - but if you're in a big enough team that you have one person doing the designing and another person doing the software architecture and another doing coding, it's the designer hat which needs to be at the coal face watching the users use the product and understanding their pain points.

I think you’ve gone off to an extreme view. There’s a huge middle ground between “oh, it uses electricity? Yeah you want to speak to the developers if that’s not working” and “do not disturb, coding.”

Having a platform engineering team, if your company can afford to bootstrap such a thing initially, is a huge productivity accelerator for your developers. But even with that in place, I’d say it’s still healthy for the dev to be the first port of call for an app outage.

It of course depends on the complexity of your product. For sufficiently simple things, a single developer can handle everything themselves. But it doesn't take much to reach a point where a single developer doesn't even understand all the software, let alone all the infrastructure around it. Sysadmining is hard; it's a tall order to expect someone you hired because they were good at inverting binary trees to be a competent sysadmin.

It seems you already know how hard ops can be, so you might already know that a little development effort can often go a long way toward alleviating the ops tasks, which is a win for everyone. I'm not sure every developer knows, or understands.

Yes, you could follow a rule book, or a deployment framework, but you'll be better at what you do if you know why you do it.

I think it's a big win if developers and operators are the same team - sit next to each other, do the same teambuilding exercises, have responsibility for the same set of products, pair-program together where appropriate. Slice your teams vertically rather than horizontally. I'm not convinced that it's worth going all the way to having every developer be an operator and vice versa.

Having worn both hats, Dev and Ops, multiple times, I think this is the right approach.

Imo, a good way to teach developers how they can help other roles is by having a shadow program where developers work together with a different specialized role for a couple of days but don't have to carry the responsibility. That improves cross team communication a lot, while being much less stressful for your employees.

As a sysadmin who became a developer, I don't think I agree that it needs to be hard to operate software these days. We have tools that simplify, standardize, and aggregate a lot of the things that used to make systems administration much more complicated, such that operating the software you wrote very rarely actually requires a full set of systems administration knowledge. With good platforms, it's absolutely feasible to operate your software without needing the full knowledge of a sysadmin.

It really depends on what you’re working with. For instance, using modern tools and frameworks from the get go does make operating much easier. But if you’re stuck with an older tool chain and can’t get the org to commit to refactoring it then you unfortunately either have to pay the operational cost or invent something that makes ops better.

Not disagreeing with you but want to point out that it’s not straightforward and often requires much work to make ops easier depending on your situation.

The joke at one place I worked was developers were good at dealing with all devices. Except the phones!

I think the mixture is more important. I don’t necessarily need my developers to know a whole lot about what our networking personnel do. But I do need them to know how to configure a trusted partner in our ADFS, how to create and manage databases and how to operate things in Azure. We used to think differently, we used to do much more specialisation, but because operations needed to also code, they eventually became developers who could also do operations, which meant that some of them simply ended up replacing some of the developers who, by comparison, simply took up too many man-hours to get things into production.

It’s sort of a delicate line of course, because you’re right on some level. People can’t know everything. So now our split is that developers need to be capable of working with things like Azure AD, but they don’t need to know how to manage it. They need to be capable of creating databases and how to handle the security on them, but they don’t need to know how we operate our actual database clusters. And so on.

On the flip side some of our operations staff need to know how to program in python to be able to handle disasters if something happens to our Azure Services or many back-ends. They also need to know how to develop our software robots.

But I think the biggest challenge most people face isn’t really on the operations vs development front. It’s on the new development vs old development improvement front. This is likely not as big an issue at a place like Netflix where they don’t service a gazillion different business needs, but in more broader enterprise businesses it quickly becomes a resource issue, because nobody is really going to have updated knowledge on 400 different systems at all times. And you can’t have people sitting around waiting for update requests on systems that may not produce one in a year.

Yeah, I liked the idea a lot when working in a small startup, as the ops team we removed was way slower than us. But now that I work in a company making actual money and relying on actual regulatory deadlines, I find myself wishing they'd reinstate the ops team they also removed, to stop us from having to learn all the details of the gigantic infrastructure that can fail, from routers on another continent to Red Hat bugs to hedge fund client complaints...

This move to make the dev team all-doing isn't really working as efficiently as the cost reduction made it appear.

It never does. Suits want a "generic employee" who can work on any problem. Makes planning way easier.

Aka consultant.

As I was once told, you just need to be one chapter ahead of the customer.

> This just adds a lot of stress for your employees.

The worst is when full-cycle developers (and less often, the managers) jump ship to another team, org, or company after it is clear what they've built is a big ball of unmanageable, painful mud.

Building that is certainly not limited to full-cycle developers. At the very least, there is 1 person that understands the entire thing.

I meant to highlight that full-cycle developers can and do turn away from owning the thing.

When you've got an SRE org, there are stricter guard rails in place to ensure you don't end up building a Frankenstein's monster.

For the same reason the Cloud makes sense, Serverless makes sense (offload what you can on to world-class experts who do know what they're doing)... SRE orgs would provide for better RoI than own-what-you-build teams, over the long run. That is, anyway, my armchair take.

I don't know how prevalent my opinion is amongst developers, but the reason I will never take another job with a support aspect is not because of some detailed assessment of the software lifecycle. It's because I don't like it.

And I can't think of anything more annoying than being on-call for work outside regular work hours.

So I think it's awesome that some people want to do this work. I hope they do it full-time, and that lets me do my thing full-time.

Obviously many developers like the mixed model or accept it. I'm curious about whether I'm an outlier or lots of others feel the same?

Now imagine being on call for someone else's code! That's ops.

Unless your company has a very strong relationship between ops and devs, the ops side gets stuck supporting whatever the developer side creates and the developer side is incentivized to jam as many features into code as quickly as possible.

It's very expensive and customer-impactful when you find issues in production versus preventing them in the first place with code that's a little more "runnable" (like testable code; I'm also convinced there's "runnable" or "operable" code).

"Move fast and break things" is the mantra of many of the startups out there. If only companies realise that pushing features to production like there is no tomorrow is NOT the way to go.

That's just the ops side of the fence being talked about here. The other side is having endless gating, change advisory boards, batching up of work, gruesome change approval processes etc. that ops protect themselves with against development, which now grinds to a halt. Not only will this make the business slow to a halt; ironically, it will also produce more errors.

The really productive companies that deliver fast, are also way more robust.

>Unless your company has a very strong relationship between ops and devs, the ops side gets stuck supporting whatever the developer side creates and the developer side is incentivized to jam as many features into code as quickly as possible.

>like testable code, I'm also convinced there's "runnable" or "operable" code

1000% this. For me, this has been a huge focus in my past startup lives. Software and infrastructure are very related. Infra design impacts software design and vice versa. By working closely together, it's possible to try to get the best of both worlds.

As a VP Eng, I expended a lot of political capital with my "why aren't you building more features" CEO boss so that my dev team could focus on significantly enhancing the stability and observability of our platform, including ensuring our log messages were informative and at the appropriate level. i.e. CRITICAL = Fix Now, ERROR = someone should check into this at earliest convenience, INFO = here's how things are generally running. Stability saved everyone time and effort, and observability allowed (and empowered) the ops team to efficiently triage, and let Engineers quickly understand what's going on and how to fix it when things went wrong. The net result was multiple nightly on-call alerts for Engineers dwindling to "I can't remember the last one" and ops folks were able to focus far more on infra than triaging software alerts.
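
The severity convention described here maps directly onto standard logging levels. A hedged sketch using Python's stdlib logging module (the logger name and messages are invented):

```python
# Sketch of the log-level convention described above: CRITICAL means
# fix now, ERROR means someone should check at earliest convenience,
# INFO just records normal operation. Logger name and messages are
# invented for illustration.

import logging

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(name)s: %(message)s")
logger = logging.getLogger("payments")

logger.info("processed 1.2k payments in the last hour")              # normal running
logger.error("retry queue growing; check at earliest convenience")   # look at soon
logger.critical("payment provider unreachable; fix now")             # page someone

# An alerting pipeline can then route purely on severity, because the
# levels are ordered numerically:
assert logging.CRITICAL > logging.ERROR > logging.INFO
```

The benefit the comment describes falls out of the discipline, not the tooling: if CRITICAL reliably means "fix now", ops can page on that level alone and ignore everything below it at 3am.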

It was a battle and I had to fight to make it happen, and even have my teams "ninja" a bunch of the work via an 18-month inside-out refactor of the codebase, but the result was 200% worth it. The cherry on top was the two full days of grilling by tech leaders during due diligence for our acquisition. They dug deep into the operational aspects of the business and came away very impressed, and that operational maturity (this was a company that highly valued such things) was highlighted as one of two key factors that swayed the decision to acquire.

Yup. In fact, I transitioned out of development in part because of the prevalence of this. I have learned the hard way that my work/life boundary is absolutely critical to my mental health and wellbeing. Even if it's relatively infrequent, any period of being on-call is incredibly stressful to me. I am willing to work some periods of long hours (within reason), but I am not willing to be in a position where I can be called or paged at night or on a weekend. No job is worth that to me. Hats off to the folks who can tolerate it. We need those people! But I am definitely not one of them.

Does having a separate on-call really shield the devs from support? I assume in most cases on-call would need to escalate or reach out to the devs for issues, and for that someone needs to be reachable by phone off hours.

I mean I guess it's pretty dependent on the company and product, I just know if I see "on-call" or "support" in a job ad, I pass. Perhaps I'm just screening myself out of certain business areas.

My experience was support during business hours as part of my job as a software engineer. I used to hate hearing the phone ring, and it was rarely if ever a problem directly related to what I was developing. I felt like it was just a huge disruption to my work.

You're not an outlier, many people (myself included) feel the same way.

> So I think it's awesome that some people want to do this work. I hope they do it full-time, and that lets me do my thing full-time.

If you're lucky and your company lets you pick whether you want to apply for a SWE role or an SRE role (like Google), then it will work out ok. But a lot of companies just used "DevOps" as an excuse to get rid of their ops teams and dump the work onto the dev teams.

Some places take it a step further and get quite toxic, and have a culture where there's a complete lack of sympathy for engineers who have a heavy oncall load. The idea is that their code is shit so they deserve to be woken up in the middle of the night, as some sort of punishment.

> The idea is that their code is shit so they deserve to be woken up in the middle of the night, as some sort of punishment.

I mean, isn’t that the case to some extent? It certainly incentivizes you to do better.

I can see a problem when you don’t get the resources to do better though.

For me it's similar-ish. My approach is this: I'm a developer, but I want to be involved in all deployment-related steps and ideally deploy the app from day 1 myself. The latter is usually underestimated and helps keep the timeline. On the other hand, I learn early on how the app behaves at runtime in a real environment and can take precautions so that maintenance at the application level is almost only needed when deploying.

A dedicated ops team is nice when its involvement at the application level is limited to very basic tasks like just restarting the service. If the services are not under-provisioned and have properly set up process/app orchestration in place, this is actually a non-task... Under these conditions I don't mind being on-call every now and then, or even all the time if it's done right.

I understand your attitude, and sort of agree with it. If I take a job with on call duty, it also means I will be woken up at night because of "your" mistakes (in code, testing etc.) and that is fine.

But a very important aspect for me is that, if that happens, the developers give the correct priority to fixing the bug that woke me up at night.

So, it's fine you only want to work during the day. But if I get woken up at night or in the weekend, you better start fixing the bug that caused it as soon as it's day again.

I mean sure, it's not like I want to refuse all responsibility. And I'm all for paying the person who does the on-call in line with doing what I consider a tough duty BTW.

I guess my point is more that maybe that separation of roles actually also reflects the differences in personality and how different people work.

Feel the same. I do not like to be on-call. In the first stages of a job interview I always ask if on-call is required, and if so I politely decline.

Edit: I'm talking about being on-call outside of business hours. I have no problem being on-call 9-5.

The problem here, as with most of the devops literature, is that it's only really applicable to the rare use case where an entire company's worth of resources is put behind a single website/app deployment. The far more normal practice is having small implementation and operations teams manage a large back catalog of projects/deployments that are supposed to run stably with little need for day-to-day improvements/interventions, and which are often based on products that were oversold by some consultant/salesdrone 5 years ago as virtually maintenance free.

And when it comes to the big systemic problems with security/stability that plague the IT industry, it's usually this back catalog that's the real cause, and it's the problem that gets the least attention.

Oh god, this. This is the real problem!

Everyone wants to sell you the solution to problems where there are 10,000 servers for a single set of related apps managed by 1,000 developers. Think companies like Netflix, where there are teams of people dedicated to specific parts of DevOps.

Nobody has good solutions for legacy but still functional apps developed by some guy named Bob that was a contractor for 6 months and he's gone to some other gig.

For example: Take Splunk, or Azure Log Analytics, or any similar tool. They work great if you get 100K hits per day. The graphs are beautifully smooth and the wealth of data makes it trivial to extract all sorts of insights. Approaches like A/B testing can accurately detect changes as small as 1% in many cases.

But what do you do with the site that gets 5 hits per week, but is still important because what it does for each one of those hits is worth on the order of a thousand dollars? The tooling is a total letdown; basically nothing out there can help with apps like this. Trying to retrofit some legacy app into Kubernetes or whatever would cost more than the app makes in revenue. Leaving it alone is just as bad, because eventually you have to upgrade the server it lives on and the load balancer in front of it. At which point it will break. Or it might not. How would you know? You only get 5 hits a week...

> But what do you do with the site that gets 5 hits per week, but is still important because what it does for each one of those hits is worth on the order of a thousand dollars?

Log files and a few shell scripts can go an awfully long way, from simple single-user apps to complex corporate environments with dozens of servers. They require almost zero upfront investment and you can build on them as needed.

You won't get fancy A/B testing (without a lot of work) or a pretty UI with smooth graphs, but grep can happily tear through 100GB worth of log files and get you some of that good stuff you want from Azure Log Analytics.
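
The same grep-style pass is only a few lines in a scripting language, too. A sketch, assuming an access-log format where the status code is the second-to-last field:

```python
# A few lines standing in for the grep-over-logs approach: count total
# hits and 5xx responses in a plain-text access log. The log format
# (status code as second-to-last field, as in common log format) is an
# assumption for illustration.

from collections import Counter

def summarize(log_lines):
    """Return (total hits, 5xx error count) from access-log lines."""
    statuses = Counter()
    for line in log_lines:
        fields = line.split()
        if len(fields) >= 2:
            statuses[fields[-2]] += 1
    total = sum(statuses.values())
    errors = sum(n for s, n in statuses.items() if s.startswith("5"))
    return total, errors

sample = [
    '10.0.0.1 - - [01/Jan/2021] "GET /quote HTTP/1.1" 200 512',
    '10.0.0.2 - - [02/Jan/2021] "POST /order HTTP/1.1" 500 87',
]
print(summarize(sample))  # (2, 1)
```

At 5 hits a week, this plus a cron job that emails the summary covers most of what a heavyweight analytics platform would be bought for.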

By departmentalizing you get silos and reduced efficiency across teams; by going the other way and having full cycle developers you reduce the silo effect and align incentives. Again, the problem is that you need people who know a lot, the mental load is very hard, and not everybody is interested in or able to handle that.

As an old developer, this is very much how we operated before CI/CD was a thing in the early 2000s. I remember we were responsible for development, monitoring and support. In our team we each had one week where support and monitoring were the primary task.

I think this all depends on the complexity of the operational tasks. If you have hosted MySQL and a couple of servers, fine, a developer can handle monitoring this.

What if you have, say, a DB2 cluster with HADR running? Do you think you can monitor it and fix something if needed? Do you know the ins and outs of shared drive configuration using GFS? Kubernetes configuration details: if something goes wrong, should every developer have sufficient knowledge to troubleshoot it? How about an A10 load balancer: what does the average developer know about that? Can the average developer fix a failed virtual IP configuration or firewall rules on an A10? Solve Red Hat Satellite cache issues?

And that's without even mentioning some complicated network setup with third-party connections through a proprietary VPN solution, AWS cloud deployments using the multitude of services that AWS offers, or obscure stuff like IBM zSeries or Unisys mainframes, etc.

Developers can be trained in all kinds of technologies, but there are truly a lot of them; this is a huge body of knowledge, and learning it will take a lot of time and money.

I think you're conflating platform engineering and operations. A DB2 cluster with HADR is another application that some engineering team has stood up. A consumer of the database isn't responsible for the implementation; they're responsible for how their code interacts with and uses the system another (engineering) team built.

In that regard, if a developer sees their application degrading due to an issue with the underlying platform, they'd engage the developers that built that platform (and ideally those developers have built a system to detect problems before the "end user" system starts degrading)

In the same vein, it's not expected every developer understands kernel and hardware development--there are abstractions and boundaries in place that separate responsibility. The recurring theme is the person that configured/coded/built is responsible for operating it

The world has changed a lot since 2000.

There is no way to master language ecosystem X, especially large ones like Java, .NET (C#, F#, VB, C++/CLI), C++20, multiple OS stacks (Windows, GNU/Linux flavours, macOS, Android, iOS, ....) and then put Amazon, Azure, IBM Cloud, .... on top.

In practice, if one doesn't silo into something, we all end up jack of all trades, master of none, jumping from whatever comes out of the IT Spring collection of the year into the next.

On the other hand, if you started in 2000 (or before) it seems like everyone else knows very little indeed.

> We mitigate this by having an on-call rotation where developers take turns handling the deployment + operations + support responsibilities

Sorry, but no. All the reasons that exist for developers doing on-call rotation are reasonable, yes, I admit that, but my reason to not do on-call rotation is also reasonable: I do not want to give my employer more than 40h/week. I just do not want more money in exchange for my free time. If your company doesn't provide that, then "alright, thank you" and I will keep looking. Now, if all companies start to require (paid) on-call rotation for developers: that would be a very sad tech scenario (at least for people like me).

Charge them a significantly higher base pay if they want on-call included in your list of responsibilities, or make sure your contract mentions exemption from on-call activities when you change employers next time. In other words, make them see on-call in salary budgets.

So, I had this argument/conversation with jedberg (who incidentally also works at Netflix).

I said that DevOps means different things to different people.

To some: it's the evolution of "Operations", integrating them into the team so that operational issues are automated away and developers can get greater velocity.

To others: it's a sysadmin who can use CI/CD and fumble through scripting something.

To the majority: it's a developer who can fumble through installing a server or package. This is jedberg's definition too, and one he argues is canonical.

My opinion is that specialising is important.


To this effect: it's _far_ too much cognitive load to understand how your OS works, your application, its framework, the cloud provider, the logging tools, SLIs/SLOs and error budgets, on-call policies and alerting mechanisms, application-level security, authentication of systems, authorization of systems, SSO, etc.

Because learning all of those various things and weighing them in your head every time you make a decision is fatiguing. It took me 10 years to get "great" at Ops and become a decent scripter; it's possible that Ops is easier than programming, but from what I understand it's not easier, it's just different.

With Development (exception: JavaScript) you learn primitives and they rarely change. With Ops, the tools and landscape change _frequently_.

10 years ago it was Nagios, then Zabbix, then some combination of Graphite and co; now it's Prometheus, maybe Thanos, or is it InfluxDB + Kapacitor?

That's just alerting; there are new and changing solutions for logging, message queues, databases, IdP tooling (Keycloak vs FreeIPA vs AD), distributed tracing systems, heck, even load balancers (GCE Ingress vs Nginx vs Traefik, etc.).

Hell, understanding how Kubernetes works is basically a full time job otherwise "black magic is happening" and debugging becomes a nightmare.

Frameworks change or get replaced too, but it doesn't feel like it's close to the same rate as infrastructure software.

I remember at college one of the professors said to me: "It's impossible to know everything about computers, because as you're learning: new things come out, and after you've learned something: it will change. The more you learn, the more you have to keep up to date until you can no longer keep up."

I think maybe people don't think like that.

I think we need to have people of different disciplines.

If you are in the business of developing and running web applications there's a lot overlap between dev and ops in terms of knowledge. If you have to troubleshoot something you need to know HTTP Headers, CORS and related topics.

Who is in charge when the latest update of Chrome breaks your clients' websites? Wearing both hats certainly helped us fix stuff more easily by inserting headers through HAProxy, although our regular job is programming.
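
For the CORS case specifically, the core check a browser applies is small enough to illustrate. A simplified, hypothetical sketch (real CORS also involves methods, credentials, and preflight requests):

```python
# Hypothetical sketch of the header debugging described above: given a
# response's headers, decide whether a browser page served from
# `origin` would be allowed to read the response under the basic CORS
# rule. Deliberately simplified; real CORS also covers methods,
# credentials, and preflight.

def cors_allows(headers: dict, origin: str) -> bool:
    allowed = headers.get("Access-Control-Allow-Origin")
    return allowed == "*" or allowed == origin

print(cors_allows({"Access-Control-Allow-Origin": "*"}, "https://example.com"))  # True
print(cors_allows({"Content-Type": "text/html"}, "https://example.com"))         # False
```

Knowing this rule is exactly the dev/ops overlap the comment describes: the fix (inserting the header at the load balancer) is an ops change, but knowing which header to insert is developer knowledge.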

What about performance optimization? Can that slow SQL query be fixed through tuning the MySQL server or maybe just rewrite the code?
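
That "tune the server or rewrite the code" question often starts with reading the query plan. A toy illustration with SQLite (schema and data invented), showing how adding an index turns a full table scan into an index search:

```python
# Toy illustration of the "tune the database or rewrite the code"
# question: EXPLAIN QUERY PLAN shows whether a query scans the whole
# table or uses an index. Schema and data are invented; SQLite stands
# in for MySQL, where EXPLAIN plays the same role.

import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL)")
db.executemany("INSERT INTO orders (customer, total) VALUES (?, ?)",
               [("acme", 10.0), ("globex", 20.0)] * 100)

def plan(query):
    """Return SQLite's query plan for `query` as one string."""
    return " ".join(row[3] for row in db.execute("EXPLAIN QUERY PLAN " + query))

before = plan("SELECT * FROM orders WHERE customer = 'acme'")
db.execute("CREATE INDEX idx_orders_customer ON orders (customer)")
after = plan("SELECT * FROM orders WHERE customer = 'acme'")

print(before)  # e.g. a SCAN over the orders table
print(after)   # e.g. a SEARCH using idx_orders_customer
```

Sometimes the plan shows a missing index (a server-side fix); sometimes it shows the query itself is doing needless work, and rewriting the code is the right answer.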

As an older full-stack/full-cycle developer for many years, this article made sense - and gave quite a bit of hope. If you can't dog-food your product, IMHO developers need to be as close to the coal face of their 'product' as possible. Per the article, assigning a team to a feature has allowed Netflix to get better coding outcomes.

With that said, I can also see a potential downside to this development model. Once a new feature is stable enough, the number of people required to support it has to reduce. Wondering how Netflix solved this?

100% agree with this. I work on a service deployed in Azure processing PBs of data each day. We’re a team of about 8 engineers, and we have an on-call rotation.

As a result, we’re incentivized to reduce on call noise and have the system automate remediation as much as possible.

Granted, our company does have centralized teams that build tooling. So that does seem to be a prerequisite to having full cycle developers.

If you’re building in the cloud, I think there’s less need to be a sysadmin these days (I used to be one), and Azure makes it pretty easy to have automatic updates, firewall rules, etc. We rarely touch our hosting environment (Service Fabric on VMSS).

The on-call sucks at times, but I’d never trade it for how fast we’re able to ship.

"Eat your own dog food" is one of my guiding principles and it's very powerful.

"sure, but I'll eat it from my own plate" aka developers get to choose the tooling they will use and ops have no objections to it.

I've always been a bit confused that this wasn't what DevOps was supposed to mean from the start.

Devs should be producing easy-to-operate code and their oncall should therefore not be burdensome and there should be tight cycles to drive it to near zero.

Development and operations are two different mindsets not typically found within the same person.

so many different URLs, damn Medium

previous discussion, 2 years ago, 20 upvotes, https://news.ycombinator.com/item?id=19481912

Hope this term doesn't hit recruiters dictionaries.

Always be thinking about what you can get out of your employer. They’re always thinking about what else they can get out of you.

This used to be called jack of all trades, master of none.

This is not a good idea.

Developers already have an incredible amount to learn without making them devops too.

This is the sort of thing that non-tech management would love, though.

It’s a great slogan.

Essentially the sign of mismanagement dressed up as innovation.
