The one piece of advice I’ve soured on recently is “avoid in-house solutions.” The opposite problem is provisioning far too many new tools, rushing the rollout with only the most basic integration, and calling it a victory. Worse still is opting in to all of the “advanced configuration” simply because it’s there, or worse yet, not following the documented methods of configuring the tool in the first place. You end up not knowing which tool is meant to do what, a rushed rollout means you may be fighting bugs in the integration itself, tools sometimes overlap and you’re just expected to know which one is right, and so on.
My point is not “build PagerDuty etc. in-house”; it’s more the following:
1. Use as few external solutions as possible
2. Do not overconfigure the tools you do use, and do not stray from the documented paths for as long as possible (this is usually a testament to the quality of the tool itself)
3. If this is at a company, on a team, you HAVE to go the extra mile and do the full rollout. If it replaces something, it’s on YOU to go do the replacing, and it’s on YOU to let people know what you are doing and what they should expect going forward
If you strictly apply discipline here, you should become very good at a small number of tools over time.
Jsonnet as a templating language works very well with k8s YAMLs.
Pulumi is also phenomenal for having infrastructure as code versioned in the repo. Pulumi and TypeScript work very well for type-checked templates. K8s YAMLs also officially have TS types, so that works pretty well too.
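For example, a hand-rolled sketch of what the type checking buys you (real projects would import the official k8s type definitions, e.g. from @kubernetes/client-node, rather than declaring their own as I do here):

```typescript
// Minimal, hand-rolled slice of the k8s Deployment shape. Typos in field
// names or wrong value types fail at compile time instead of at apply time.
interface Container {
  name: string;
  image: string;
}

interface Deployment {
  apiVersion: "apps/v1";
  kind: "Deployment";
  metadata: { name: string; labels?: Record<string, string> };
  spec: {
    replicas: number;
    selector: { matchLabels: Record<string, string> };
    template: {
      metadata: { labels: Record<string, string> };
      spec: { containers: Container[] };
    };
  };
}

// "replicas: '3'" or a misspelled "kidn:" would be rejected by tsc here.
const webDeployment: Deployment = {
  apiVersion: "apps/v1",
  kind: "Deployment",
  metadata: { name: "web", labels: { app: "web" } },
  spec: {
    replicas: 3,
    selector: { matchLabels: { app: "web" } },
    template: {
      metadata: { labels: { app: "web" } },
      spec: { containers: [{ name: "web", image: "nginx:1.25" }] },
    },
  },
};

// JSON is valid YAML, so this can be piped straight to `kubectl apply -f -`.
console.log(JSON.stringify(webDeployment, null, 2));
```

The same idea scales up: keep the manifests as typed values in the repo and let the compiler catch the schema mistakes that would otherwise only surface at deploy time.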
The other points really only hold true for things that are allowed to break. As someone who writes code that should never, ever break, and gets support tickets even for one-in-a-million breakages, testing is absolutely paramount. I can see how certain products, like a web-app, could follow these principles and benefit from not being over-tested, but it's definitely not an absolute.
Of course, my code does still break. The author is right that this is still an absolute. But the blast radius of the breakage and the frequency are both quite low because of the measures we take to prevent them.
Regarding the point about ditching QA and testing in production: both are possible. You can make QA a shard of your production environment (there are many different ways to implement this based on your tech). That is the ideal way to structure it so that it's actually pretty representative of a real environment without requiring lots of duct tape.
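A rough sketch of what “QA as a shard of production” can look like at the routing layer (the header name, shard ids, and addresses are all made up for illustration):

```typescript
// Requests carrying a test-tenant marker are routed to the QA shard,
// which runs the same stack on the same infra as the real shards.
// Everything else hashes stably onto the production shards.
interface Shard {
  id: string;
  hosts: string[];
}

const SHARDS: Shard[] = [
  { id: "prod-1", hosts: ["10.0.1.10", "10.0.1.11"] },
  { id: "prod-2", hosts: ["10.0.2.10", "10.0.2.11"] },
  { id: "qa", hosts: ["10.0.9.10"] }, // test tenants only
];

function pickShard(tenantId: string, headers: Record<string, string>): Shard {
  if (headers["x-test-tenant"] === "true") {
    return SHARDS.find((s) => s.id === "qa")!;
  }
  // Simple stable hash of the tenant id over the real shards.
  const prod = SHARDS.filter((s) => s.id !== "qa");
  let h = 0;
  for (const c of tenantId) h = (h * 31 + c.charCodeAt(0)) >>> 0;
  return prod[h % prod.length];
}
```

Because the QA shard is deployed, monitored, and upgraded exactly like the production shards, it stays representative for free.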
Better put: you should test to acceptable limits.
A nuclear reactor or medical software should be tested with any and all possible and impossible scenarios in mind.
Some forum or chat software? Not so much.
And obviously there are 50 million shades in between :)
I've made chat software work when DNS is broken. Other people say that's an impossible situation.
But that also means you need IPs that will last as long as your client does. Which means you need a defined lifetime for your client. And you need some mechanism to validate that the IPs still point at your servers and haven’t been taken over by someone else (mutual authentication).
It sounds like implementing this scheme without a method to rotate IP addresses would be a mistake. But maybe I'm missing something.
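One way to sketch the “validate the IPs are still your servers” part is public-key pinning: the client ships with fingerprints of keys it trusts and refuses anything else. All names here are illustrative, and the fingerprint is a placeholder value:

```typescript
// Sketch of the "is this IP still ours?" check via key pinning. Even if
// an IP is later handed to someone else, they can't present a key matching
// a pinned fingerprint without holding the private key.
import { createHash } from "node:crypto";

const PINNED_FINGERPRINTS = new Set([
  // Placeholder: in reality, the sha256 of the server's DER-encoded
  // public key, baked into the client at build time.
  "9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08",
]);

function fingerprint(publicKeyDer: Buffer): string {
  return createHash("sha256").update(publicKeyDer).digest("hex");
}

function validateServer(publicKeyDer: Buffer): boolean {
  return PINNED_FINGERPRINTS.has(fingerprint(publicKeyDer));
}
```

Rotation then means shipping new pinned keys before the old ones retire, which is exactly why the client needs a defined lifetime.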
Thinking about this article through the lens of an application where the cost of failure is quite low and time to fix is quite fast, the article starts making sense.
I agree, and I found the claim that QA somehow made quality worse especially absurd and detached from reality. I mean, the author went as far as presenting the brilliant solution to his problem by suggesting... automating manual testing? Where did he come up with this stuff? For a couple of decades now, manual tests have been reserved either for tests that are not possible to automate, or for exploratory testing. It's a suggestion in the vein of "let them eat cake," because it shows total detachment from the real world and especially from how things are actually done in the industry.
And how exactly can QA degrade quality if all they do is check whether the work done by developers does indeed work and complies with requirements? It's mind-boggling. QA doesn't change code. QA sees the mess you make before it hits the customer.
The absurdity of this mess reaches a point where the author complains that a developer's failure to update tests to reflect his changes is somehow the fault of the people tasked with running the tests? I mean, come on.
All in all, I was left with the impression that the author is totally oblivious to the abhorrent software development practices he has been introducing and living with, and, to further compound the problems he has created for himself and for others, he's shifting the blame away from himself to everyone around him.
He misspoke when he said manual QA makes the quality worse. But the data do suggest that tossing your code over the wall to pass through a manual QA gate before being scheduled for a release doesn't actually improve quality, compared with shipping every merged PR to production (provided it passes the automated pipeline) and doing a staged rollout to production users via a decent feature-flag system.
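For the staged-rollout part, the usual trick (sketched here, not tied to any particular feature-flag product) is to hash each user into a stable bucket and compare against the current rollout percentage:

```typescript
// Deterministically bucket each user into [0, 100) by hashing their id
// together with the flag name, then enable the flag for users whose
// bucket falls under the rollout percentage. Names are illustrative.
import { createHash } from "node:crypto";

function bucket(flagName: string, userId: string): number {
  const digest = createHash("sha256")
    .update(`${flagName}:${userId}`)
    .digest();
  // First 4 bytes -> unsigned integer -> [0, 100)
  return digest.readUInt32BE(0) % 100;
}

function isEnabled(
  flagName: string,
  userId: string,
  rolloutPercent: number,
): boolean {
  return bucket(flagName, userId) < rolloutPercent;
}
```

A user's bucket is stable, so raising the percentage from 5 to 50 keeps the original 5% enabled and adds more users on top, with no flapping between requests.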
Continuous Delivery has been a thing for quite a while it seems but I think it's still relatively early in terms of adoption. There are a number of conference talks from recent years about testing in production. I don't live and work in Silicon Valley and most of the companies where I am aren't doing it, but I know of one that is and it sounds pretty incredible from what I've heard.
One thing I have seen far too many times is that a strong QA team can lead to developers not feeling responsible for making sure things work themselves. This often leads to worse quality.
This is of course a social issue rather than a technical one, and I'm sure many developers do not suffer from this mindset. But it's been there to some degree at almost every company I have consulted for.
Surely you have seen this mindset yourself?
Though, if the nature of your QA teams led to worse quality, you might want to find a term other than "strong" to describe them. Because "strong" implies you want more of that, which isn't what you meant at all. Perhaps "dominant"?
From my experience, I've run into a few challenges with some of these claims.
For 1: If engineers always operate their code, you will end up with an organization structure built around "these developers work on (and operate) this code." That might be what you want, but it's often not what I want. There are going to be some parts of the code base that don't change very much, and I'd rather not have a specific team perpetually taxed with operating them (instead of building new business value). Likewise, there are some engineers who are much more productive at building new things (and not just in the crappy "now it's your problem" way). I don't want those engineers to have a larger share of operations because they are more productive at development. Support-focused roles have value.
For 5: I have never seen a maintainable automated test suite. I'm sure they exist somewhere, just not in my experience, and I've never worked with anyone who knew how to ensure that developers create one. This means that, with purely-automated testing, your development costs can become dominated by "fixing the tests." My preference is to automate the most important things and to have humans, with judgement, test behaviors based on test cases written in English.
Your current product may be a free TV series reminder service where you can go full YOLO. Then feel free to follow OP's advice. Otherwise, I recommend against it.
I work in one of these industries -- I don't see much safety-critical engineering content on HN, and I think it's fair enough that it's glossed over. I agree that you can't move fast and break things when those things can hurt, or physically are, people.
I have no idea what the code culture looks like at Boeing, but it's possible that if engineers had more skin in the game, the 737 MAX problem wouldn't have happened. When you depend on a QA team, it's easy to throw code that compiles over the fence and say it's someone else's problem, now.
This article does not address safety critical software engineering since some of the points would not work in safety critical software:
"QA Gates Make Quality Worse" - In safety critical software development it would be crazy to negligent to not have QA gates. For example, if the safety critical software is running a critical aircraft control system would you feel comfortable getting on the airplane if there were no QA gates?
"Things Will Always Break" - When things break in safety critical software, it is Boeing 737 Max-bad or worse. People die. Things catch fire and explode. You can not just "roll back" to the previous version of the software and restart the server because often the damage is already done and there is nothing left to restart.
Fail-safe doesn't mean "no failures" or "no breakages". It means that when they happen (and they will), the system fails in a way that is safe.
QA "gates" in terms of the "throw it over the wall to QA and they will test it in all possible scenarios to ensure quality" is different to QA "gates" that are "throw all of the failure modes that have been identified as dangerous and ensure that the system fails in a safe manner."
That can be part of a CI/CD pipeline in exactly the same way as in the web/cloud world, including generating all of the necessary documentation and evidence for ISO etc standards.
Someone whose job it is?
The people who designed your car don’t have to change the oil in it.
The more I think about it, the more I realise it’s just motivated reasoning because people like doing it. The logical conclusion to this line of thinking is that the CEO just does everything themselves. Otherwise, how do they really know if they’re effective or not?
We have a subsystem that is currently emitting millions of logs per hour; it's eating up most of the available compute. In a separate incident, it racked up a bill of a few thousand dollars by making millions of API calls that all failed.
It clearly has issues. But I'm not the primary dev: I have no familiarity with the code base, and I have little idea what it is doing (and yes, I've asked). Since I'm not a dev of the code (and have no time to become one; our agile sprint planning will never allow time for that), I'm not able to add to the code the instrumentation I'd need to answer "why is it eating up most of the compute?"
> The people who designed your car don’t have to change the oil in it.
No, but when the car fails to operate as designed, those people need to figure out the why. Also, a mechanic has an understanding of how the car is built, and how it functions. In software, the only people that have that are the devs.
It looks like you identified the root causes of the problem here: the fact that you're not the dev doesn't have to be a problem. It's the fact that it's under-documented & has bad metrics.
While everyone understands that we will always have bugs and issues (at least while we keep working in the current paradigm for software development), having good designs, documentation and metrics is attainable. It just has to be prioritized by management.
> "our agile sprint planning will never allow time for that"
Sounds like those who call the shots either don't understand the cost of not doing these things, or believe that it's more cost effective not to do them.
But as oncall, that's not your job, is it?
Being oncall means you are the primary point of contact for your team regarding any issue involving how your product reaches the public. You take the lead identifying issues and finding ways to mitigate how problems impact users. Yet, that doesn't mean you should be attaching debuggers to running processes and adding breakpoints here and there. You are expected to avoid downtime, meet service levels, and coordinate with all teams to fix operational issues and increase code quality.
If you're not the primary dev and you stumble on an issue, you are expected to file a ticket and bring it to the attention of anyone who is in a position to address the issue.
I have no familiarity with the code base, I have little idea what it is doing (and yes, I've asked)
It sounds like you’re being asked to do a job without the tools you need to do it (i.e. supporting documentation, a runbook etc.). I obviously don’t know the circumstances, but the organisation needs to resolve those issues so you can do an effective job.
when the car fails to operate as designed, those people need to figure out why
I agree completely. The team responsible for the codebase should be fixing the bugs.
The downside of this “skin in the game” approach is that you skip specifying things properly and end up with systems that only your devs can configure and upgrade.
If the people who designed the car don't care about the oil changers' requirements because "it's not their job", then a 2am call is absolutely the best kind of feedback they should get, even if they are not on call.
If they do care, that means they already got that feedback and/or they listen.
So the feedback from production should be there, and it should be ongoing.
Note how the designers were notified of the flaw without having to be there when it was discovered, because that was someone else’s job.
If you can’t ever cooperate with anyone else because of incentive structures, then your organisation can only have one individual, the CEO, who has to do everything themselves.
The devil is in the "Almost". I agree 99.9% with the examples the author chose (db, k8s -- if you even really need it).
But a lot of third party code is rife with gotchas -- often excellent as development scaffolding but risky in deployment. npm is notorious in this regard.
What does "buy" buy you? 1 - hopefully a lot of people are using the code in question so it will have fewer bugs. 2 - hopefully time to deployment
But in exchange you may be using a piece of software that might be robust in domain X but not in yours. You're signing on to a piece of code that may be excessively general for you, or inversely make assumptions that shoehorn you down an inconvenient path.
And of course now you're on someone else's config and development timeline.
So it's a balance like any other.
Sometimes, all you needed was 50 lines of bash script.
> Engineers should operate their code.
Are engineers the customers? What matters is the customer experience. Getting engineers directly involved risks lowering the priority of customer concerns.
> Buy ... beats build
It depends on how much control you need. Sure it can be quicker and cheaper overall, but if available systems are not actually a match it is only a matter of time before some customization or a custom system is needed. Try to find and commit to the right balance as soon as possible.
> Make deploys easy
This really depends on the kind of system and its usage. In fault intolerant systems it makes sense to take deploys as slowly and carefully as possible because the risk of problems even from changes that are considered fixes is not tolerable.
> Trust the People Closest to the Knives
In the larger picture this is marketing who must focus on knowing the customer and their needs.
> QA Gates Make Quality Worse
Some interactions are difficult to automate. It often makes more sense to slow things down to get the quality right than to go fast and increase risk.
> Boring Technology is Great.
Depending on your goals and context. Does boring sell? Are your customers requesting boring?
> Simple always wins
That is similar to the boring rule above, but with the added complication that simple is subjective and context dependent. Perhaps some complex existing mechanism can be adapted, so is that really simple or not?
> Non-Production Environments Have Diminishing Returns
But they are essential returns. Just because the engineers can get the thing to work on their desks doesn't mean it is likely to work for customers.
> Things Will Always Break
This is pure hand-waving. In some contexts, providing services that do not break is the key feature. Many markets will reject this modern view that moving fast is fine because it's okay to break things. Some customers are tolerant and will pay extra for special functionality, but others are intolerant and would prefer robust functionality over optimizing for engineer convenience.
Is Marketing responsible for UX research in your organization?
> Depending on your goals and context. Does boring sell? Are your customers requesting boring?
I'd argue that your customers do not give a flying crap whether or not you're using GraphQL or a plain old relational database behind the scenes, as long as their experience isn't affected. Unless you are a vendor of a non-boring alternative to one of the aforementioned boring tools, boring is absolutely superior...until it doesn't fulfill the needs of your platform. But by then, your company will have reached the point where you can put dozens of engineers on that single issue, and come up with the new flavour of "boring" standard.
> But they are essential returns. If the engineers can get the thing to work on their desks then it isn't likely to work for customers.
I think one of the core tenets of this approach is to make local development as close as possible to production. Staging servers do not always capture production issues either, and IMO their main use is for internal collaboration and demoing, not finding out about issues. If you can replicate an issue locally, you do not need to test on staging. If you can't replicate an issue locally, it's very unlikely that the issue would arise on staging either.
Of course, if your codebase has vastly different runtimes in local and in prod, all of the above does not apply.
Stop right there. The minimum is 10-12 engineers to have a sane on-call schedule, and they have to be distributed across the world (time zones).
With 6 people you're going to be on call almost every week. In practice that's only 4 people on the rota, because 1-2 people aren't participating (the latest joiners aren't yet trained, among other reasons). Then when issues actually happen, they get escalated to the person on the rota and then to you/the team (it's rare that one person can fix or debug much alone), so you're forced to work even when off the rota.
>they are getting escalated to the person on rota and then escalated to you/team (it's rare than one man can fix/debug much alone)
This is probably true at first, but over time a team can build up knowledge with better runbooks. When I first joined my team's on-call, my tech lead was very clear that if I wasn't sure what to do, I should just page her or the subject matter expert on the team without hesitating. For each time this happened, we went over the response in a postmortem and added instructions about how to diagnose and fix to the runbook, so the next person to get paged for a similar reason could follow those instead.
You're effectively saying that the tech lead and manager are on call 24/7, because they can be paged at any time. Are they okay with that? Can they actually do something about the pages/runbooks? (In many organizations it's not that simple.)
What's the average tenure in tech? Something like a year (and you can imagine it gets shorter with a bad on-call). You're constantly getting new joiners (and leavers). It's not as simple as having 6 people at the ready at all times.
In practice the team starts from zero and has to ramp up to 6 members. How easy is it to recruit 6 dev/ops/SRE people? Not very, it takes a lot of time, and they leave. Outside of a few large organizations, teams/departments might never reach the size where on-call is bearable.
Why would this be? If you're on the rotation, you have the same number of shifts as anyone else. If you can't take your shift you have to swap with someone and take their shift instead.
> getting escalations while you were not officially on call
This usually means that the training of those on the rotation is inadequate.
It's possible for on-call to be awful, but it doesn't have to be. The important part is to make those who have the power to change the sources of pain feel that pain.
For some smaller teams, they may have fewer people. We try to staff up those teams or otherwise work to reduce the on call burden.
I'm reading a lot of "the ecosystem is so big that what you need already exists somewhere", and it's so consumerist that it grates on my inner maker's mindset to the point my teeth resonate and hurt.
And no, I find that a lot of times what I need doesn't exist.
So we've built and maintained our own system with our own hardware, in a datacenter in LA with our own dedicated 10Gb Cogent connection. As a result, we're literally 1/10 to 1/20th of the monthly cost, and it only takes two (old) Dell 720s to run, with less than $10K in total capital (network gear, etc.). (We would have spent 5x that in just one month using a third-party service, so the "capital" was paid back almost instantly.)
Many such cases.
Someone needs to have built blocks X,Y,Z for them to be available for purchase, but the people building X,Y,Z aren't trying to build A. They have their own goal, which is building X,Y,Z, which probably could be done using blocks U,V,W.
It's not "consumerist" so much as "informed about the state of the art." The opposite is generally "I'm a genius engineer who can reinvent something better than what exists without looking at what exists." (You probably can't).
The prices for a good buy make no sense to an existing business (without a VC infusion). RENT beats build.
> System A needs building blocks X,Y,Z to work
When an American company buys a product with assurances that certain qualities exist (even for a purchase as small as $10k), roughly 95% of the time it will fall short. You will have to build some hack of a solution (which the seller is usually happy to point out) or just ignore the missing feature. There's usually very little you can cannibalize from a purchased product's source code in a full rewrite.
There was one exception in 25+ cases in my career. My company A was purchasing another European company B for over $100m, with a full audit clause beforehand. My team was flown out to the UK and worked on company B's technology for a couple of weeks. It operated exactly as expected, with the capabilities they presented (as the audit assured), when we brought the code back to the developers in the US... with branding changes, etc.
> 5. QA Gates Make Quality Worse
> Secondly, the teams doing QA often lack context and are under time pressure. They may end up testing “effects” instead of “intents”.
A lot of this is finger-pointing. If you don't tell QA what or why, the tests can't be written to capture it. If the tests can't be maintained, you have a failing QA department, not a barrier to quality from the mere existence of the gates.
> 7. Simple Always Wins
Then you don't need to BUY it, just BUILD it.
> 8. Non-Production Environments Have Diminishing Returns
You want at least 3, maybe 4.
- Dev (should be a local Docker env)
- Managed test (for QA to always use)
- (Optional) Ephemeral environments for testing deploys, scaling, etc.
- Prod
>You want at least 3, maybe 4.
After experiencing a project where delivery was slow as a dog (lots of queueing on multiple rounds of manual QA, plus environmental differences causing a class of bugs that only showed up in production anyway, further delaying other work as fixes had to go back through the same long lead times), I did some soul-searching and tried to understand it better. I know we could tune the current setup a little better and eke out some marginal gains. But after reading Accelerate and comparing our dev practices with a local company doing Continuous Delivery as outlined in this blog post, I really feel like the grass might actually be significantly greener on the other side.
"5. QA Gates Make Quality Worse"
Okay. What's the alternative?
The only sensible narrative I've seen is Michael Bryzek's "Production - Designing for Testability"
I was briefly an SQA Manager during the shrink-wrap era. None of the post-PMI "Agile" narratives about QA and test have ever been plausible.
And most teams don't even have tester roles any more, much less QA roles. The minor sanity checking is done by the poor BAs ("business analysts") and middle-management PHBs freaking out about "someone said the website's down!!".
I wasn't able to persuade any one in my last org to even glance at the Bryzek Method (for lack of a better term). I'll have to sneak it into my next org.
> You’ll never prioritize work on non-prod like you will on prod, because customers don’t directly touch non-prod. Eventually, you’ll be scrambling to keep this popsicle sticks and duct tape environment up and running so you can test changes in it, lying to yourself, pretending it bears any resemblance to production.
Error budgets are a good solution to this problem. If your change-qualification process (be that a QA team, staging, pre-prod, whatever) plus your release process is very good, then you probably aren't burning through your error budget.
But if it turns out your process isn't good enough, you'll get feedback in the form of running out of error budget. Then you can spend the rest of the quarter working on your non-production environments, so that next quarter you can move fast and break an acceptable amount of things.
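The arithmetic behind this is simple enough to sketch (the function name and shape here are illustrative, not from any particular SRE toolkit):

```typescript
// Back-of-the-envelope error-budget math. With a 99.9% SLO, 0.1% of
// requests per quarter are "allowed" to fail; once actual failures exceed
// that, feature work pauses and reliability work takes over.
function errorBudgetRemaining(
  sloTarget: number, // e.g. 0.999 for a 99.9% SLO
  totalRequests: number,
  failedRequests: number,
): number {
  // Rounding avoids floating-point dust in (1 - sloTarget).
  const allowedFailures = Math.round(totalRequests * (1 - sloTarget));
  return allowedFailures - failedRequests; // negative => budget blown
}

// 10M requests this quarter at 99.9% -> 10,000 allowed failures.
console.log(errorBudgetRemaining(0.999, 10_000_000, 4_000)); // 6000
```

The point is that the budget converts "is our process good enough?" from an argument into a number you can check each quarter.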
But there are a couple of big problems that these points raise:
1. Why don't engineers design for production, and how do you get them to? I find most devs just do not want to care about how their apps actually work. They don't care about the product or user experience. Getting them to care is hard.
2. Deployment is not a problem - an ongoing site reliability process is the problem. Anyone can ship some code to a server accessible on the internet. But what do you do when it breaks? Or even worse - what do you do when it's just getting incrementally slower? Deployment is just one tiny part of getting your code to production in such a way that it continues to be a functioning product. Site Reliability is really Product Reliability - and that's something devs need to learn about.
3. The company never wants to pay for anything, yet they insist on hiring people to build stuff by arbitrary deadlines that can't be met. How can we fix this? Beats me.
4. A person manually checking for quality is basically a relic of old managers who had no idea how to get quality other than paying someone to care about it, and who don't know how to get those people to do the right thing, which is working with devs to write tests.
5. Simple things are the hardest to make, and they definitely take the longest to get right. I would start with easy and try to work my way up to simple, simplicity being complexity reduced to its most essential parts. I think all refactors should always be toward simplicity, and they should happen often.
6. The reason that building or running systems can be so difficult or error-prone is human communication problems. Look for communication problems and solve them, and you will magically see fewer errors, more frequent deploys, and happier customers. Yes, this is kind of obvious, but it's amazing how often communication problems are both known and ignored because "we're too busy because we've got to do X other thing".
Also, I do think it's a little dishonest not to point out that one of the recommended products is made by the author's employer.
I don't think so. The gist of the reasoning is that it's better to have more numerous small failures that are fixed quickly than rare but catastrophic failures. I think that's true even when you have an SLA, which is probably why it's broadly in line with the principles in Google's Site Reliability Engineering book
Others have made the point though that this all goes out of the window when talking about safety-critical engineering (Therac-25, 737Max etc)
We took this a step further. Our developers are actually disallowed from running builds & deploying the software to QA or customer environments. We built a tool that exposes all of this through a web dashboard that our project managers use. Building and deploying our software to a customer's production environment requires ~20 seconds of interaction in this web tool and then the process completes automatically.
This works out so much better for us because project managers can have total control over what gets included in each build and handle all coordination with each customer. The direct ownership of issue status labels + build/deploy gives comprehensive control over all inbound+deliverable concerns.
We also have feature flags that we use to decouple concerns like "this build has to wait because of experimental feature XYZ". Developers communicate feature flags up to the project team in the relevant issues so that they understand how to control that functionality if it goes out in a build. Effectively, we never merge code to master that would break production. If it's something deemed risky, we put a FF around it just in case.
Note that feature flags can also be a huge PITA if you keep them around too long. Clean these up as soon as humanly possible.
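One hedged sketch of "clean these up": give each flag an owner and an expiry date, and make expired flags fail loudly outside production so someone has to delete them (all names here are made up for illustration):

```typescript
// Each flag records who owns it and when it should be gone by. Looking up
// an expired flag throws outside production, failing CI and local runs
// until the owner deletes the flag and the dead code behind it.
interface FlagDef {
  enabled: boolean;
  owner: string;
  expires: string; // ISO date the flag should be cleaned up by
}

const FLAGS: Record<string, FlagDef> = {
  "new-billing-path": { enabled: false, owner: "payments", expires: "2099-01-01" },
};

function flagEnabled(
  name: string,
  now: Date = new Date(),
  env = "development",
): boolean {
  const def = FLAGS[name];
  if (!def) throw new Error(`unknown flag: ${name}`);
  if (env !== "production" && now > new Date(def.expires)) {
    throw new Error(`flag "${name}" expired; owned by ${def.owner}, clean it up`);
  }
  return def.enabled;
}
```

Production deliberately keeps serving the last known value, so an overdue flag is a build-time nuisance rather than an outage.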
Wouldn't want to work there. Worst case scenario I leave. Best case scenario I become totally disinvested in the project/company because I can't work anyway so why care about anything? Take the salary and do as little as possible, bet every project is de facto late anyway.
By the way, what happens when the manager is away for a day or on holiday? Nobody can deploy anything?
I consider unlimited free access to a QA environment a requirement to develop (critical) software. Local development does not reflect the production environment. Testing/Mocking is not representative of a real database or any dependency the software relies on.
Do you have QA people to test the software integration in QA or is it the developer/manager who's expected to QA after the release?
A tip I'd add is to drive direct touching of production to zero. Ideally this is instrumented in your tooling with who did it, the reason, and an audit log of the actions taken. It's fairly common to see development teams slowly overwhelmed by non-development activities because they don't properly root-cause problems, or because of things like doing "one-off" SQL surgery to fix a customer's issue.
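A minimal sketch of what that instrumentation could look like (the AuditEntry shape and runAudited helper are assumptions for illustration, not a real tool):

```typescript
// Every production touch goes through a wrapper that records who, why,
// and what was run before the action executes. The log then doubles as
// the data for root-causing "why do we keep doing one-off surgery here?".
interface AuditEntry {
  who: string;
  reason: string;
  action: string;
  at: string;
}

const auditLog: AuditEntry[] = [];

function runAudited<T>(
  who: string,
  reason: string,
  action: string,
  fn: () => T,
): T {
  auditLog.push({ who, reason, action, at: new Date().toISOString() });
  return fn();
}

// e.g. runAudited("alice", "INC-1234: stuck order", "recount totals", () => ...)
```

Once every manual touch leaves an entry, the recurring ones stand out and can be turned into proper fixes or self-service tooling.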
But I think there are plenty of cases where it makes sense to build rather than buy, even when tools exist. 1) It builds a muscle for getting things done. 2) It offers a way to learn new things and to try new things. 3) It gives you understanding of and better control over the solution; if the SaaS goes out of service or out of business, it can create a stressful scramble to migrate at the last minute.
So I would replace 'almost always' with 'often'.
Pretty much anything that doesn't require a fleet of servers with many 9's of uptime.
The list goes on.
Writing distributable applications (desktop apps), high-performance ones (data analysis tools), or a combination of both (video games) comes with a ton of equally complicated problems.
There is a lot you take for granted in the server world that is simply not true anywhere else.
- The control you have over the hardware (CPU/RAM/disk) and OS environment where your service runs. You can very easily throw more resources at a problem; if there is a memory leak you can kill and restart your daemon; you can specify the exact combination of dependencies your application has, down to the patch level; you can update and change your application at whim 10 times a day. None of this is easy, or even possible, outside the web-service context.
- Typical performance challenges are more horizontal than vertical, i.e. supporting more users is a bigger concern than per-user or per-API-call performance. Most web services are CRUD applications of some sort, and while there are performance challenges for a single computation, the path to fix or mitigate them is not difficult to see. In the systems-programming world, performance and concurrency for a single-user application are a very, very different beast; you will end up doing a lot more math and algorithms than in the web-service world.
> teams should optimize for getting code to production as quickly as possible
Not if you are developing software for embedded systems or cars or trains or factories or space satellites.
> if you’re not on-call for your code, who is?
How long should an engineer have been working on a 10-year-old codebase before declaring it "my code"?
Nah... thanks but no thanks. I value my free time and I have 0 desire to allow work to intrude more into it.
And no, I don’t want more money for it either. But I might consider time... every week I’m on call = an extra week’s annual leave.
> build vs buy
Build the thing you’re selling and the thing you’re good at. Carpenters don’t build saws. Chefs don’t build knives or ovens.
In my podcast about running web apps in production, I talked to 50+ different people deploying 50+ different apps with a bunch of different tech stacks, and when I asked them for their best tips, by far the most common answer was to keep things simple and boring.
The idea of introducing innovation tokens to situationally introduce new tech was also mentioned in a bunch of episodes. I was surprised at how many people knew about that concept. It was new to me the first time I heard it on the show and I've been building and deploying stuff for 20 years.
A full list of the 50+ best tips (and other takeaways from comparing notes between 50+ unique projects / deployments) can be found in this blog post: https://nickjanetakis.com/blog/talking-with-52-devs-about-bu...
Funny thing with this: CD, or even just frequent ad-hoc deployments, can hide problems. When you do a code freeze over, say, the week before Thanksgiving, and you also stop deployments, you're changing how the system runs, potentially leading to an issue at a bad time that the team isn't experienced in dealing with.
You can obviously configure CD to continue to redeploy the same build during code freezes. I'm just not sure if people remember to do it.
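One way to express that is to make the freeze gate block only *new* builds, while scheduled runs keep redeploying the artifact already in production, so the deploy path itself stays exercised. A minimal sketch of that policy (all names and windows hypothetical):

```python
from datetime import date

def deploy_action(today, freeze_windows, new_build_available):
    """Decide what a scheduled CD run should do.

    During a freeze, keep redeploying the build already in production
    so the pipeline itself stays exercised; outside a freeze, ship
    whatever is newest.
    """
    frozen = any(start <= today <= end for start, end in freeze_windows)
    if frozen:
        return "redeploy-current-build"
    return "deploy-new-build" if new_build_available else "redeploy-current-build"

# Hypothetical Thanksgiving-week freeze: same artifact keeps going out.
freeze = [(date(2023, 11, 20), date(2023, 11, 26))]
print(deploy_action(date(2023, 11, 22), freeze, new_build_available=True))
# prints "redeploy-current-build"
```

The point is that the freeze check sits in front of the *build promotion* step, not in front of the deploy step itself.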
I suppose you can say "QA environments suck", but you can also say "Make your QA environments not suck by investing time keeping them very close to production" (i.e. same OS, same timezone, same stack, very similar DB, minimal mocking).
But I agree that most "staging" environments are an unnecessary extra step that will only rarely catch something legit.
First, about my impression from the title: it's not only the code put into production that matters: it's the experience and history of all the code that was decommissioned from production because of issues, or the code that almost made it to production but didn't because of some critical issue found at the last minute.
Maybe I'm the exception, but I often leave that code, commented out, in the sources - and I keep adding more and more!
I know it's not proper in this day and age (git, documentation, etc. should be the place it goes) but that's the only place where I'm 100% positive a pair of eyeballs WILL see it FOR SURE.
It's a way to avoid institutional knowledge loss when working in teams, but also a way to avoid forgetting what you did when you work on projects spanning multiple years.
Now for the points raised, (2) is bad: what you buy you don't understand when or how it may fail. Bitten once, twice shy...
So yes, I also go for manual QC (5) and a staging environment (8), actually with production split four ways: two halves running different versions of the code (current and previous), each with its own backup, because for what I do, (9) is unacceptable: if there's a break in 24/7 operation, the business closes.
Consequently, for (3), deployments are deliberately made NOT EASY and are manual. It doesn't add much extra friction, because code reaching a production server will at least have been reviewed by eyeballs forced to read what failed before (in comments), whose owners will then have had to (5) quality-control themselves to avoid feeling too sure of themselves, after which the code will have to prove its worth in (8) a staging environment for a few weeks.
Then, if something bad happens, it's back to the design board and eyeballing. If not, the code is just "good enough" to be deployed on half the fleet, replacing the oldest "known good" version, first on the backup servers, then on the main servers.
And... that's how it stays until the next main version.
If some unforeseen problem is discovered, the previous "known good" version is still available on half the fleet. If a server has a hardware problem, the backup server with the N-1 version of the code is the final thing that remains between business as usual and "the 24/7 contract is broken, clients leave, the business closes".
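The rollout discipline described above reduces to a simple ordering rule: always upgrade the half running the oldest known-good version, backups before mains, so the other half keeps an intact N-1 fallback. A sketch of that rule (the fleet layout and names are hypothetical, not this commenter's actual setup):

```python
def rollout_order(fleet, new_version):
    """Return the servers to upgrade to new_version, in order.

    Only the half running the oldest version is touched, backups
    before main servers, so a known-good N-1 version always
    survives on the other half of the fleet.
    """
    oldest = min(server["version"] for server in fleet)
    candidates = [s for s in fleet if s["version"] == oldest]
    # sort key is False for backups, True for mains: backups go first
    return sorted(candidates, key=lambda s: s["role"] != "backup")

fleet = [
    {"name": "main-a",   "role": "main",   "version": 1},
    {"name": "backup-a", "role": "backup", "version": 1},
    {"name": "main-b",   "role": "main",   "version": 2},
    {"name": "backup-b", "role": "backup", "version": 2},
]
for server in rollout_order(fleet, new_version=3):
    print(server["name"])
# prints backup-a then main-a; the half on version 2 is left untouched
```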
I sell a data feed with an SLA guaranteeing 24/7 operation and latency parameters. I've been going for 3 years with only 1 failure that came close to interrupting production... but didn't. Each lesson was dearly learned.
OK, that's fine, but in that case we can shut down the production system outside of business hours so that our work-life balance isn't affected. Oh? We can't shut down the production system outside of business hours? So we need to be on call continuously, 24/7, meaning we can't ever be off the grid or unavailable? That sounds like we're expected to give up our personal lives at a moment's notice? Interesting, hmm.
It's hardly some horrendous controversial idea, nor unique to software engineering.
At several places I worked (and others I asked about in job interviews), the general amount which companies get away with (and employees find bearable) seems to be <= 5 weeks of on-call per employee per year.
And obviously you're being paid to be on standby, and then paid for your overtime should an incident occur.
Devs think they're hot stuff, when in reality we're probably one of the most abused professions out there. (I'm talking about regular devs, not people who were born in wealth/went to good schools etc)
At my former employer I was on an on-call rotation; I'm obviously not now that it's a 'former' employer, so the building analogy doesn't really hold up. (And it's not just leaving the company: e.g. my former colleagues now working on something else at the same company aren't on call for the software they wrote but are no longer responsible for.)
The article did not mention anything about pay or compensation for oncall.
The best I get in job interviews is usually a mention that there is a rota every X period. Then I have to poke the interviewers, trying to guess what it's like without coming across as too negative: "When was the last time you worked on a weekend?" "When was the last time you were woken in the middle of the night?"
The issue is who pays and when?
You can pay that cost upfront - for example the JPL/NASA SDLC. This will ensure you won't get woken at odd hours, but the massive upfront cost is something most businesses won't pay.
You can sling code without tests and fix it in prod, hoping speed will help you find product market fit.
Pretty much everyone sits somewhere between the two. This article just describes one point on the spectrum where the author feels best practice lies - but to be honest the trade-offs vary across this spectrum.
Probably the right way to think of this is "the total cost of making this software NASA-level is 10X, and the revenue from such perfectly working software would be 20X (with no loss due to downtime)".
As such, if you ask me not to code to NASA standards, I and my team will incur a personal cost of 5X in being woken up, stressful release days, etc.
Therefore you will compensate me with payments of 5-10X.
This discussion is much easier with a Union involved
Ok - there is a spectrum of reliability - let's say that NASA produces the most reliable code anywhere, and that it has a very high cost to produce code like that. At the other end of the spectrum is some guy slinging PHP code out without any testing, hoping that it will turn into the next unicorn.
If we asked both ends of the spectrum to write code to solve the same business problem (a pet food delivery app), then the guy slinging PHP will get woken up at 2am regularly because the server is always crashing. The NASA guy will never get woken up, but the app will probably reach the market a year after the first one.
So the business has to choose a trade off - sling code and get lots of 2 am wake up calls or wait and possibly lose market share to a competitor.
Now there was a famous example of a Reddit co-founder who slept next to his laptop and just rebooted the server every two hours until they discovered Supervisor (the Python process manager). That seems ok - the business (the co-founder) was making the trade-off and exploiting the worker (the same co-founder). The worker was happy to take the job because they were likely to get paid if it all worked out (and it did).
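What a process supervisor automates is essentially that reboot loop: watch the worker, restart it when it dies, give up after too many failures. A toy sketch (with a plain callable standing in for a real forked process, which Supervisor would actually manage):

```python
import time

def supervise(start_worker, max_restarts=5, backoff_seconds=0.0):
    """Keep restarting a worker when it dies, up to max_restarts.

    start_worker is any callable that runs until the 'process' exits
    cleanly (returns) or crashes (raises). Returns the number of
    restarts that were needed before a clean exit.
    """
    restarts = 0
    while True:
        try:
            start_worker()
            return restarts  # worker exited cleanly
        except Exception:
            restarts += 1
            if restarts > max_restarts:
                raise  # give up, as a supervisor does after startretries
            time.sleep(backoff_seconds)

# A worker that crashes twice, then runs fine.
attempts = {"n": 0}
def flaky_worker():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("server crashed")

print(supervise(flaky_worker))
# prints 2: the worker was restarted twice before exiting cleanly
```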
The issue comes when the worker on call is not making the business judgement. How much should they demand in payment?
If they have a healthy equity payment in a growth company, that might work just like the above founder. Otherwise the payment needs to come out of the money not spent.
So I guess my argument is that there is a fixed cost to reliable software for the business - it should either pay for highly reliable software, or pay the saved cash to the code slinger each time the server goes down.
This will change the trade off mathematics.
I did 24/7 support solo and with just 1-2 other devs for years on a global system, and never again will I do DevOps in such a small team with 24/7 on-call requirements. The cost of maintaining features and systems varies, so having great enterprise support can be a non-issue or a constant headache that you have little to no control over (e.g. a system that takes a dependency on external data you can't control).
On top of pushing features out constantly, maintaining quality, and automating everything you can, a startup can easily fall into building systems their staff have trouble maintaining without significantly impacting output, as well as impacting the mental health of their devs. I think the problem is that it is hard to see these costs up front, as you can build systems these days on cloud providers where most of the time things will come back on their own without intervention - but it obviously depends what impact being offline for 5 minutes vs 3 hours has on the business.
To me this implies an on-call rotation where you know your expectations. Not "we need to be on call continuously, 24/7, meaning we can't ever be off the grid or unavailable". Many other industries have the idea of being on-call, and they are "expected to give up our personal lives at a moment's notice" when they know they are on-call. (For example, my brother in law is a surgery tech; he's had to take off during family outings more than once)
Also, if this happens often enough that it's a serious problem, this says a lot about the quality of the code you own.
- Developer compensation
- Training and career development
- Staffing properly i.e. not under-staffing
- Giving devs proper slack time between tasks and not over-burdening them with projects
- Letting developers own the stack not just in name only but truly own the technical decisions made in the stack without micro-management, including choice of language, platform, etc.
Without all those factors, it's a red herring to point to the code quality. The code quality is just the final output of all of the above decisions.
Your manager may fancy themselves a latter-day Cortés, but you don't need to play their mind games (most of them based on misunderstood readings of an unsettled science) to create an effective and high-functioning organisation.
At the end of the day all this 'you need to be on call for your code' is purely a business money-saving ploy. We are an industry full of suckers, I guess, because we fell for the 'plausible-sounding' explanation hook, line, and sinker.
Nah, it makes you a better serf. Are you working at Amazon and getting paid 400k/year? Sure, do whatever. but regular devs making 70k shouldn't put up with this bullshit.
The well-established solution to 24/7 availability is to operate a shift pattern.
No manager or employer would ever buy that shit because it rounds in the direction of less work though.
Gene Ray understood this.
The worst that can happen is that the company is down a few hours overnight. Issues can be investigated and fixed during office hours.
I'd wager that most companies don't have global customers and don't need 24/7 coverage.
I think this is a great example of why disagreements arise on HN: different world experiences and base assumptions. For many companies, being unavailable for that window of time would be catastrophic. We had one client that suffered about an hour of downtime (it turned out to be their issue). They accounted that hour at 5 million dollars lost.
The whole "you build it you run it" movement is an attempt to fix dev teams just not giving a fuck about quality of code they put out, especially from a reliability point of view.
Why is the opposite better?
Probably this approach is more scalable, especially in big companies where you can have operations teams on call for a myriad of projects.
I personally believe that this does not guarantee a better service.
It's exactly like properly/cleverly documenting your code/project: not only for others, now or in a few years, but also for yourself later on.
It's having common rules across teams to get more reliability out of the whole company.
You build it, you run it. Fine. Until the point when you can't anymore (because... reasons - it just happens). In any activity you want to sustain, you always have to have backups (in people and in processes), instead of relying on yourself or your team alone.
A whole business treats that as being as important as its disaster recovery processes (which is not necessarily something you focus on early, but you eventually do).