The one piece of advice I’ve soured on recently is “avoid in-house solutions.” The opposite problem is provisioning far too many new tools, rushing the rollout with only the most basic integration, and calling it a victory. Worse still is opting in to all of the “advanced configuration” simply because it’s there, or worse yet, not following the documented methods of configuring the tool in the first place. You end up not knowing which tool is meant to do what, a rushed rollout means you may be fighting bugs in the integration itself, tools sometimes overlap and you’re just expected to know which one is right, and so on.
My point is not “build PagerDuty etc. in-house”; it’s more the following:
1. Use as few external solutions as possible
2. Do not overconfigure the tools you do use, and do not stray from the documented paths for as long as possible (this is usually a testament to the quality of the tool itself)
3. If this is at a company, on a team, you HAVE to go the extra mile and do the full rollout. If it replaces something, it’s on YOU to go do the replacing, and it’s on YOU to let people know what you are doing and what they should expect going forward
If you strictly apply discipline here, you should become very good at a small number of tools over time.
Jsonnet as a templating language works very well with k8s YAMLs.
Pulumi is also phenomenal for having infrastructure as code versioned in the repo. Pulumi and TypeScript work very well for type-checked templates. K8s YAMLs also officially have TS types, so that works pretty well too.
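For example, a hand-rolled sketch of what the type checking buys you (real projects would import the official k8s type definitions, e.g. from @kubernetes/client-node, rather than declaring their own as I do here):

```typescript
// Minimal, hand-rolled slice of the k8s Deployment shape. Typos in field
// names or wrong value types fail at compile time instead of at apply time.
interface Container {
  name: string;
  image: string;
}

interface Deployment {
  apiVersion: "apps/v1";
  kind: "Deployment";
  metadata: { name: string; labels?: Record<string, string> };
  spec: {
    replicas: number;
    selector: { matchLabels: Record<string, string> };
    template: {
      metadata: { labels: Record<string, string> };
      spec: { containers: Container[] };
    };
  };
}

// "replicas: '3'" or a misspelled "kidn:" would be rejected by tsc here.
const webDeployment: Deployment = {
  apiVersion: "apps/v1",
  kind: "Deployment",
  metadata: { name: "web", labels: { app: "web" } },
  spec: {
    replicas: 3,
    selector: { matchLabels: { app: "web" } },
    template: {
      metadata: { labels: { app: "web" } },
      spec: { containers: [{ name: "web", image: "nginx:1.25" }] },
    },
  },
};

// JSON is valid YAML, so this can be piped straight to `kubectl apply -f -`.
console.log(JSON.stringify(webDeployment, null, 2));
```

The same idea scales up: keep the manifests as typed values in the repo and let the compiler catch the schema mistakes that would otherwise only surface at deploy time.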
The other points really only hold true for things that are allowed to break. As someone who writes code that should never, ever break, and gets support tickets even for one-in-a-million breakages, testing is absolutely paramount. I can see how certain products, like a web-app, could follow these principles and benefit from not being over-tested, but it's definitely not an absolute.
Of course, my code does still break. The author is right that this is still an absolute. But the blast radius of the breakage and the frequency are both quite low because of the measures we take to prevent them.
Regarding the point about ditching QA and testing in production: both are possible. You can make QA a shard of your production environment (there are many different ways to implement this based on your tech). That is the ideal way to structure it so that it's actually pretty representative of a real environment without requiring lots of duct tape.
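A rough sketch of what “QA as a shard of production” can look like at the routing layer (the header name, shard ids, and addresses are all made up for illustration):

```typescript
// Requests carrying a test-tenant marker are routed to the QA shard,
// which runs the same stack on the same infra as the real shards.
// Everything else hashes stably onto the production shards.
interface Shard {
  id: string;
  hosts: string[];
}

const SHARDS: Shard[] = [
  { id: "prod-1", hosts: ["10.0.1.10", "10.0.1.11"] },
  { id: "prod-2", hosts: ["10.0.2.10", "10.0.2.11"] },
  { id: "qa", hosts: ["10.0.9.10"] }, // test tenants only
];

function pickShard(tenantId: string, headers: Record<string, string>): Shard {
  if (headers["x-test-tenant"] === "true") {
    return SHARDS.find((s) => s.id === "qa")!;
  }
  // Simple stable hash of the tenant id over the real shards.
  const prod = SHARDS.filter((s) => s.id !== "qa");
  let h = 0;
  for (const c of tenantId) h = (h * 31 + c.charCodeAt(0)) >>> 0;
  return prod[h % prod.length];
}
```

Because the QA shard is deployed, monitored, and upgraded exactly like the production shards, it stays representative for free.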
Better put: you should test to acceptable limits.
A nuclear reactor or medical software should be tested with any and all possible and impossible scenarios in mind.
Some forum or chat software? Not so much.
And obviously there are 50 million shades in between :)
I've made chat software work when DNS is broken. Other people say that's an impossible situation.
But that also means you need IPs that will last as long as your client does. Which means you need a defined lifetime for your client. And you need some mechanism to validate that the IPs still point at your servers and haven’t been taken over by someone else (mutual authentication).
It sounds like implementing this scheme without a method to rotate IP addresses would be a mistake. But maybe I'm missing something.
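One way to sketch the “validate the IPs are still your servers” part is public-key pinning: the client ships with fingerprints of keys it trusts and refuses anything else. All names here are illustrative, and the fingerprint is a placeholder value:

```typescript
// Sketch of the "is this IP still ours?" check via key pinning. Even if
// an IP is later handed to someone else, they can't present a key matching
// a pinned fingerprint without holding the private key.
import { createHash } from "node:crypto";

const PINNED_FINGERPRINTS = new Set([
  // Placeholder: in reality, the sha256 of the server's DER-encoded
  // public key, baked into the client at build time.
  "9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08",
]);

function fingerprint(publicKeyDer: Buffer): string {
  return createHash("sha256").update(publicKeyDer).digest("hex");
}

function validateServer(publicKeyDer: Buffer): boolean {
  return PINNED_FINGERPRINTS.has(fingerprint(publicKeyDer));
}
```

Rotation then means shipping new pinned keys before the old ones retire, which is exactly why the client needs a defined lifetime.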
Thinking about this article through the lens of an application where the cost of failure is quite low and time to fix is quite fast, the article starts making sense.
I agree, and I found the claim that QA somehow made quality worse especially absurd and detached from reality. I mean, the author went as far as presenting the brilliant solution to his problem by suggesting... automating manual testing? Where did he come up with this stuff? For a couple of decades now, manual tests have been reserved either for tests that are not possible to automate, or for exploratory testing. It's a suggestion in the vein of "let them eat cake," because it shows total detachment from the real world and especially from how things are actually done in the industry.
And how exactly can QA degrade quality if all they do is check whether the work done by developers does indeed work and complies with requirements? It's mind-boggling. QA doesn't change code. QA sees the mess you make before it hits the customer.
The absurdity of this mess reaches a point where the author complains that a developer's failure to update tests to reflect his changes is somehow the fault of the people tasked with running the tests? I mean, come on.
All in all, I was left with the impression that the author is totally oblivious to the abhorrent software development practices he has been introducing and living with, and, to further compound the problems he has created for himself and for others, he's shifting the blame away from himself to everyone around him.
He misspoke when he said manual QA makes the quality worse. But the data do suggest that tossing your code over the wall to pass through a manual QA gate before being scheduled for a release doesn't actually improve quality, compared with shipping every merged PR to production (provided it passes the automated pipeline) and doing a staged rollout to production users via a decent feature-flag system.
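For the staged-rollout part, the usual trick (sketched here, not tied to any particular feature-flag product) is to hash each user into a stable bucket and compare against the current rollout percentage:

```typescript
// Deterministically bucket each user into [0, 100) by hashing their id
// together with the flag name, then enable the flag for users whose
// bucket falls under the rollout percentage. Names are illustrative.
import { createHash } from "node:crypto";

function bucket(flagName: string, userId: string): number {
  const digest = createHash("sha256")
    .update(`${flagName}:${userId}`)
    .digest();
  // First 4 bytes -> unsigned integer -> [0, 100)
  return digest.readUInt32BE(0) % 100;
}

function isEnabled(
  flagName: string,
  userId: string,
  rolloutPercent: number,
): boolean {
  return bucket(flagName, userId) < rolloutPercent;
}
```

A user's bucket is stable, so raising the percentage from 5 to 50 keeps the original 5% enabled and adds more users on top, with no flapping between requests.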
Continuous Delivery has been a thing for quite a while it seems but I think it's still relatively early in terms of adoption. There are a number of conference talks from recent years about testing in production. I don't live and work in Silicon Valley and most of the companies where I am aren't doing it, but I know of one that is and it sounds pretty incredible from what I've heard.
One thing I have seen far too many times is that a strong QA team can lead to developers not feeling responsible for making sure things work themselves. This often leads to worse quality.
This is of course a social issue rather than a technical one, and I'm sure many developers do not suffer from this mindset. But it's been there to some degree at almost every company I have consulted for.
Surely you have seen this mindset yourself?
Though, if the nature of your QA teams led to worse quality, you might want to find a term other than "strong" to describe them. Because "strong" implies you want more of that, which isn't what you meant at all. Perhaps "dominant"?
From my experience, I've run into a few challenges with some of these claims.
For 1: If engineers always operate their code, you will end up with an organization structure built around "these developers work on (and operate) this code." That might be what you want, but it's often not what I want. There are going to be some parts of the code base that don't change very much, and I'd rather not have a specific team perpetually taxed with operating them (instead of building new business value). Likewise, there are some engineers who are much more productive at building new things (and not just in the crappy "now it's your problem" way). I don't want those engineers to have a larger share of operations because they are more productive at development. Support-focused roles have value.
For 5: I have never seen a maintainable automated test suite. I'm sure they exist somewhere, just not in my experience, and I've never worked with anyone who knew how to ensure that developers create one. This means that, with purely-automated testing, your development costs can become dominated by "fixing the tests." My preference is to automate the most important things and to have humans, with judgement, test behaviors based on test cases written in English.
Your current product may be a free TV series reminder service where you can go full YOLO. Then feel free to follow OP's advice. Otherwise, I recommend against it.
I work in one of these industries -- I don't see much safety-critical engineering content on HN, and I think it's fair enough that it's glossed over. I agree that you can't move fast and break things when those things can hurt, or physically are, people.
I have no idea what the code culture looks like at Boeing, but it's possible that if engineers had more skin in the game, the 737 MAX problem wouldn't have happened. When you depend on a QA team, it's easy to throw code that compiles over the fence and say it's someone else's problem, now.
This article does not address safety critical software engineering since some of the points would not work in safety critical software:
"QA Gates Make Quality Worse" - In safety critical software development it would be crazy to negligent to not have QA gates. For example, if the safety critical software is running a critical aircraft control system would you feel comfortable getting on the airplane if there were no QA gates?
"Things Will Always Break" - When things break in safety critical software, it is Boeing 737 Max-bad or worse. People die. Things catch fire and explode. You can not just "roll back" to the previous version of the software and restart the server because often the damage is already done and there is nothing left to restart.
Fail-safe doesn't mean "no failures" or "no breakages". It means that when they happen (and they will), the system fails in a way that is safe.
QA "gates" in terms of the "throw it over the wall to QA and they will test it in all possible scenarios to ensure quality" is different to QA "gates" that are "throw all of the failure modes that have been identified as dangerous and ensure that the system fails in a safe manner."
That can be part of a CI/CD pipeline in exactly the same way as in the web/cloud world, including generating all of the necessary documentation and evidence for ISO etc standards.
Someone whose job it is?
The people who designed your car don’t have to change the oil in it.
The more I think about it, the more I realise it’s just motivated reasoning because people like doing it. The logical conclusion to this line of thinking is that the CEO just does everything themselves. Otherwise, how do they really know if they’re effective or not?
We have a subsystem that is currently emitting millions of logs per hour; it's eating up most of the available compute. In a separate incident, it racked up a bill of a few thousand dollars by making millions of API calls that all failed.
It clearly has issues. But I'm not the primary dev: I have no familiarity with the code base, and I have little idea what it is doing (and yes, I've asked). Since I'm not a dev of the code (and have no time to become one; our agile sprint planning will never allow time for that), I'm not able to add to the code the instrumentation I'd need to answer "why is it eating up most of the compute?"
> The people who designed your car don’t have to change the oil in it.
No, but when the car fails to operate as designed, those people need to figure out the why. Also, a mechanic has an understanding of how the car is built, and how it functions. In software, the only people that have that are the devs.
It looks like you identified the root causes of the problem here: the fact that you're not the dev doesn't have to be a problem. It's the fact that it's under-documented & has bad metrics.
While everyone understands that we will always have bugs and issues (at least while we keep working in the current paradigm for software development), having good designs, documentation and metrics is attainable. It just has to be prioritized by management.
> "our agile sprint planning will never allow time for that"
Sounds like those who call the shots either don't understand the cost of not doing these things, or believe that it's more cost effective not to do them.
But as oncall, that's not your job, is it?
Being oncall means you are the primary point of contact for your team regarding any issue involving how your product reaches the public. You take the lead identifying issues and finding ways to mitigate how problems impact users. Yet, that doesn't mean you should be attaching debuggers to running processes and adding breakpoints here and there. You are expected to avoid downtime, meet service levels, and coordinate with all teams to fix operational issues and increase code quality.
If you're not the primary dev and you stumble on an issue, you are expected to file a ticket and bring it to the attention of anyone who is in a position to address the issue.
I have no familiarity with the code base, I have little idea what it is doing (and yes, I've asked)
It sounds like you’re being asked to do a job without the tools you need to do it (i.e. supporting documentation, a runbook etc.). I obviously don’t know the circumstances, but the organisation needs to resolve those issues so you can do an effective job.
when the car fails to operate as designed, those people need to figure out why
I agree completely. The team responsible for the codebase should be fixing the bugs.
The downside of this “skin in the game” approach is that you skip specifying things properly and end up with systems that only your devs can configure and upgrade.
If the people who designed the car don't care about the oil changers' requirements because "it's not their job", then a 2am call is absolutely the best kind of feedback they should get, even if they are not on call.
If they do care, that means they already got that feedback and/or they listen.
So the feedback from production should be there, and it should be ongoing.
Note how the designers were notified of the flaw without having to be there when it was discovered, because that was someone else’s job.
If you can’t ever cooperate with anyone else because of incentive structures, then your organisation can only have one individual, the CEO, who has to do everything themselves.
The devil is in the "Almost". I agree 99.9% with the examples the author chose (db, k8s -- if you even really need it).
But a lot of third party code is rife with gotchas -- often excellent as development scaffolding but risky in deployment. npm is notorious in this regard.
What does "buy" buy you? 1 - hopefully a lot of people are using the code in question so it will have fewer bugs. 2 - hopefully time to deployment
But in exchange you may be using a piece of software that might be robust in domain X but not in yours. You're signing on to a piece of code that may be excessively general for you, or inversely make assumptions that shoehorn you down an inconvenient path.
And of course now you're on someone else's config and development timeline.
So it's a balance like any other.
Sometimes, all you needed was 50 lines of bash script.
> Engineers should operate their code.
Are engineers the customers? What matters is the customer experience. Getting engineers directly involved risks lowering the priority of customer concerns.
> Buy ... beats build
It depends on how much control you need. Sure it can be quicker and cheaper overall, but if available systems are not actually a match it is only a matter of time before some customization or a custom system is needed. Try to find and commit to the right balance as soon as possible.
> Make deploys easy
This really depends on the kind of system and its usage. In fault intolerant systems it makes sense to take deploys as slowly and carefully as possible because the risk of problems even from changes that are considered fixes is not tolerable.
> Trust the People Closest to the Knives
In the larger picture this is marketing who must focus on knowing the customer and their needs.
> QA Gates Make Quality Worse
Some interactions are difficult to automate. It often makes more sense to slow things down to get the quality right than to go fast and increase risk.
> Boring Technology is Great.
Depending on your goals and context. Does boring sell? Are your customers requesting boring?
> Simple always wins
That is similar to the boring rule above, but with the added complication that simple is subjective and context dependent. Perhaps some complex existing mechanism can be adapted, so is that really simple or not?
> Non-Production Environments Have Diminishing Returns
But they are essential returns. Just because the engineers can get the thing to work on their desks doesn't mean it is likely to work for customers.
> Things Will Always Break
This is pure hand-waving. In some contexts, providing services that do not break is the key feature. Many markets will reject this modern view that moving fast is fine because it's okay to break things. Some customers are tolerant and will pay extra for special functionality, but others are intolerant and would prefer robust functionality over optimizing for engineer convenience.
Is Marketing responsible for UX research in your organization?
> Depending on your goals and context. Does boring sell? Are your customers requesting boring?
I'd argue that your customers do not give a flying crap whether or not you're using GraphQL or a plain old relational database behind the scenes, as long as their experience isn't affected. Unless you are a vendor of a non-boring alternative to one of the aforementioned boring tools, boring is absolutely superior...until it doesn't fulfill the needs of your platform. But by then, your company will have reached the point where you can put dozens of engineers on that single issue, and come up with the new flavour of "boring" standard.
> But they are essential returns. If the engineers can get the thing to work on their desks then it isn't likely to work for customers.
I think one of the core tenets of this approach is to make local development as close as possible to production. Staging servers do not always capture production issues either, and IMO their main use is for internal collaboration and demoing, not finding out about issues. If you can replicate an issue locally, you do not need to test on staging. If you can't replicate an issue locally, it's very unlikely that the issue would arise on staging either.
Of course, if your codebase has vastly different runtimes in local and in prod, all of the above does not apply.
Stop right there. The minimum is 10-12 engineers to have a sane on-call schedule, and they have to be distributed across the world (time zones).
With 6 people you're going to be on call almost every week. In practice that's only 4 people on the rota, because 1-2 people aren't participating (the latest joiners aren't yet trained, among other reasons). Then when issues actually happen, they get escalated to the person on the rota and then to you/the team (it's rare that one person can fix or debug much alone), so you're forced to work even when off the rota.
>they are getting escalated to the person on rota and then escalated to you/team (it's rare than one man can fix/debug much alone)
This is probably true at first, but over time a team can build up knowledge with better runbooks. When I first joined my team's on-call, my tech lead was very clear that if I wasn't sure what to do, I should just page her or the subject matter expert on the team without hesitating. For each time this happened, we went over the response in a postmortem and added instructions about how to diagnose and fix to the runbook, so the next person to get paged for a similar reason could follow those instead.
You're effectively saying that the tech lead and manager are on call 24/7, because they can be paged at any time. Are they okay with that? Can they actually do something about the pages/runbooks? (In many organizations it's not that simple.)
What's the average tenure in tech? Something like a year (and you can imagine it gets shorter with a bad on-call). You're constantly getting new joiners (and leavers). It's not as simple as having 6 people at the ready at all times.
In practice the team starts from zero and has to ramp up to 6 members. How easy is it to recruit 6 dev/ops/SRE people? Not very, it takes a lot of time, and they leave. Outside of a few large organizations, teams/departments might never reach the size where on-call is bearable.
Why would this be? If you're on the rotation, you have the same number of shifts as anyone else. If you can't take your shift you have to swap with someone and take their shift instead.
> getting escalations while you were not officially on call
This usually means that the training of those on the rotation is inadequate.
It's possible for on-call to be awful, but it doesn't have to be. The important part is to make those who have the power to change the sources of pain feel that pain.
For some smaller teams, they may have fewer people. We try to staff up those teams or otherwise work to reduce the on call burden.
I'm reading a lot of "the ecosystem is so big that what you need already exists somewhere", and it's so consumerist that it grates on my inner maker's mindset to the point my teeth resonate and hurt.
And no, I find that a lot of times what I need doesn't exist.
So we've built and maintained our own system with our own hardware, in a datacenter in LA with our own dedicated 10Gb Cogent connection. As a result, we're literally 1/10 to 1/20th of the monthly cost, and it only takes two (old) Dell 720s to run, with less than $10K in total capital (network gear, etc.). (We would have spent 5x that in just one month using a third-party service, so the "capital" was paid back almost instantly.)
Many such cases.
Someone needs to have built blocks X,Y,Z for them to be available for purchase, but the people building X,Y,Z aren't trying to build A. They have their own goal, which is building X,Y,Z, which probably could be done using blocks U,V,W.
It's not "consumerist" so much as "informed about the state of the art." The opposite is generally "I'm a genius engineer who can reinvent something better than what exists without looking at what exists." (You probably can't).
The prices for a good buy make no sense to an existing business (without a VC infusion). RENT beats build.
> System A needs building blocks X,Y,Z to work
When an American company buys a product with assurances that certain qualities exist (even for a purchase as small as $10k), roughly 95% of the time it will fall short. You will have to build some hack of a solution (which the seller is usually happy to point out) or just ignore the missing feature. There's usually very little you can cannibalize from a purchased product's source code in a full rewrite.
There was one exception in 25+ cases in my career. My company A was purchasing another European company B for over $100m, with a full audit clause beforehand. My team was flown out to the UK and worked on company B's technology for a couple of weeks. It operated exactly as expected, with the capabilities they presented (as the audit assured), when we brought the code back to the developers in the US... with branding changes, etc.
> 5. QA Gates Make Quality Worse
> Secondly, the teams doing QA often lack context and are under time pressure. They may end up testing “effects” instead of “intents”.
A lot of this is finger-pointing. If you don't tell QA what or why, the tests can't be written to capture it. If the tests can't be maintained, you have a failing QA department, not a barrier to quality from the mere existence of the gates.
> 7. Simple Always Wins
Then you don't need to BUY it, just BUILD it.
> 8. Non-Production Environments Have Diminishing Returns
You want at least 3, maybe 4.
- Dev (should be a local Docker env)
- Managed test (for QA to always use)
- (Optional) Ephemeral environments for testing deploys, scaling, etc.
- Prod
>You want at least 3, maybe 4.
After experiencing a project where delivery was slow as a dog (lots of queueing on multiple rounds of manual QA, plus environmental differences causing a class of bugs that only showed up in production anyway, further delaying other work as fixes had to go back through the same long lead times), I did some soul-searching and tried to understand it better. I know we could tune the current setup a little better and eke out some marginal gains. But after reading Accelerate and comparing our dev practices with a local company doing Continuous Delivery as outlined in this blog post, I really feel like the grass might actually be significantly greener on the other side.
"5. QA Gates Make Quality Worse"
Okay. What's the alternative?
The only sensible narrative I've seen is Michael Bryzek's "Production - Designing for Testability"
I was briefly an SQA Manager during the shrink-wrap era. None of the post-PMI "Agile" narratives about QA and test have ever been plausible.
And most teams don't even have tester roles any more, much less QA roles. The minor sanity checking is done by the poor BAs ("business analysts") and middle-management PHBs freaking out about "someone said the website's down!!".
I wasn't able to persuade any one in my last org to even glance at the Bryzek Method (for lack of a better term). I'll have to sneak it into my next org.
> You’ll never prioritize work on non-prod like you will on prod, because customers don’t directly touch non-prod. Eventually, you’ll be scrambling to keep this popsicle sticks and duct tape environment up and running so you can test changes in it, lying to yourself, pretending it bears any resemblance to production.
Error budgets are a good solution to this problem. If your change-qualification process (be that a QA team, staging, pre-prod, whatever) plus your release process is very good, then you probably aren't burning through your error budget.
But if it turns out your process isn't good enough, you'll get feedback in the form of running out of error budget. Then you can spend the rest of the quarter working on your non-production environments, so that next quarter you can move fast and break an acceptable amount of things.
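The arithmetic behind this is simple enough to sketch (the function name and shape here are illustrative, not from any particular SRE toolkit):

```typescript
// Back-of-the-envelope error-budget math. With a 99.9% SLO, 0.1% of
// requests per quarter are "allowed" to fail; once actual failures exceed
// that, feature work pauses and reliability work takes over.
function errorBudgetRemaining(
  sloTarget: number, // e.g. 0.999 for a 99.9% SLO
  totalRequests: number,
  failedRequests: number,
): number {
  // Rounding avoids floating-point dust in (1 - sloTarget).
  const allowedFailures = Math.round(totalRequests * (1 - sloTarget));
  return allowedFailures - failedRequests; // negative => budget blown
}

// 10M requests this quarter at 99.9% -> 10,000 allowed failures.
console.log(errorBudgetRemaining(0.999, 10_000_000, 4_000)); // 6000
```

The point is that the budget converts "is our process good enough?" from an argument into a number you can check each quarter.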
But there are a couple of big problems that these points raise:
1. Why don't engineers design for production, and how do you get them to? I find most devs just do not want to care about how their apps actually work. They don't care about the product or user experience. Getting them to care is hard.
2. Deployment is not a problem - an ongoing site reliability process is the problem. Anyone can ship some code to a server accessible on the internet. But what do you do when it breaks? Or even worse - what do you do when it's just getting incrementally slower? Deployment is just one tiny part of getting your code to production in such a way that it continues to be a functioning product. Site Reliability is really Product Reliability - and that's something devs need to learn about.
3. The company never wants to pay for anything, yet they insist on hiring people to build stuff by arbitrary deadlines that can't be met. How can we fix this? Beats me.
4. A person manually checking for quality is basically a relic of old managers who had no idea how to get quality other than paying someone to care about it, and who don't know how to get those people to do the right thing, which is working with devs to write tests.
5. Simple things are the hardest to make, and they definitely take the longest to get right. I would start with easy and try to work my way up to simple, simplicity being complexity reduced to its most essential parts. I think all refactors should always be toward simplicity, and they should happen often.
6. The reason that building or running systems can be so difficult or error-prone is human communication problems. Look for communication problems and solve them, and you will magically see fewer errors, more frequent deploys, and happier customers. Yes, this is kind of obvious, but it's amazing how often communication problems are both known and ignored because "we're too busy because we've got to do X other thing".
Also, I do think it's a little dishonest not to point out that one of the recommended products is made by the author's employer.
I don't think so. The gist of the reasoning is that it's better to have more numerous small failures that are fixed quickly than rare but catastrophic failures. I think that's true even when you have an SLA, which is probably why it's broadly in line with the principles in Google's Site Reliability Engineering book
Others have made the point though that this all goes out of the window when talking about safety-critical engineering (Therac-25, 737Max etc)
We took this a step further. Our developers are actually disallowed from running builds & deploying the software to QA or customer environments. We built a tool that exposes all of this through a web dashboard that our project managers use. Building and deploying our software to a customer's production environment requires ~20 seconds of interaction in this web tool and then the process completes automatically.
This works out so much better for us because project managers can have total control over what gets included in each build and handle all coordination with each customer. The direct ownership of issue status labels + build/deploy gives comprehensive control over all inbound+deliverable concerns.
We also have feature flags that we use to decouple concerns like "this build has to wait because of experimental feature XYZ". Developers communicate feature flags up to the project team in the relevant issues so that they understand how to control that functionality if it goes out in a build. Effectively, we never merge code to master that would break production. If it's something deemed risky, we put a FF around it just in case.
Note that feature flags can also be a huge PITA if you keep them around too long. Clean these up as soon as humanly possible.
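One hedged sketch of "clean these up": give each flag an owner and an expiry date, and make expired flags fail loudly outside production so someone has to delete them (all names here are made up for illustration):

```typescript
// Each flag records who owns it and when it should be gone by. Looking up
// an expired flag throws outside production, failing CI and local runs
// until the owner deletes the flag and the dead code behind it.
interface FlagDef {
  enabled: boolean;
  owner: string;
  expires: string; // ISO date the flag should be cleaned up by
}

const FLAGS: Record<string, FlagDef> = {
  "new-billing-path": { enabled: false, owner: "payments", expires: "2099-01-01" },
};

function flagEnabled(
  name: string,
  now: Date = new Date(),
  env = "development",
): boolean {
  const def = FLAGS[name];
  if (!def) throw new Error(`unknown flag: ${name}`);
  if (env !== "production" && now > new Date(def.expires)) {
    throw new Error(`flag "${name}" expired; owned by ${def.owner}, clean it up`);
  }
  return def.enabled;
}
```

Production deliberately keeps serving the last known value, so an overdue flag is a build-time nuisance rather than an outage.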
Wouldn't want to work there. Worst case scenario I leave. Best case scenario I become totally disinvested in the project/company because I can't work anyway so why care about anything? Take the salary and do as little as possible, bet every project is de facto late anyway.
By the way, what happens when the manager is away for a day or on holiday? Nobody can deploy anything?
I consider unlimited free access to a QA environment a requirement to develop (critical) software. Local development does not reflect the production environment. Testing/Mocking is not representative of a real database or any dependency the software relies on.
Do you have QA people to test the software integration in QA or is it the developer/manager who's expected to QA after the release?
A tip I'd add is to drive direct touching of production to zero. Ideally this is instrumented in your tooling with who did it, the reason, and an audit log of the actions taken. It's fairly common to see development teams slowly overwhelmed by non-development activities because they don't properly root-cause problems, or because of things like doing "one-off" SQL surgery to fix a customer's issue.
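A minimal sketch of what that instrumentation could look like (the AuditEntry shape and runAudited helper are assumptions for illustration, not a real tool):

```typescript
// Every production touch goes through a wrapper that records who, why,
// and what was run before the action executes. The log then doubles as
// the data for root-causing "why do we keep doing one-off surgery here?".
interface AuditEntry {
  who: string;
  reason: string;
  action: string;
  at: string;
}

const auditLog: AuditEntry[] = [];

function runAudited<T>(
  who: string,
  reason: string,
  action: string,
  fn: () => T,
): T {
  auditLog.push({ who, reason, action, at: new Date().toISOString() });
  return fn();
}

// e.g. runAudited("alice", "INC-1234: stuck order", "recount totals", () => ...)
```

Once every manual touch leaves an entry, the recurring ones stand out and can be turned into proper fixes or self-service tooling.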
But I think there are plenty of cases where it makes sense to build rather than buy, even when tools exist. 1) It builds a muscle for getting things done. 2) It offers a way to learn new things and to try new things. 3) It gives you understanding of and better control over the solution; if the SaaS goes out of service or out of business, it can create a stressful scramble to migrate at the last minute.
So I would replace 'almost always' with 'often'.
Pretty much anything that doesn't require a fleet of servers with many 9's of uptime.
The list goes on.
Writing distributable applications (desktop apps), high-performance ones (data analysis tools), or a combination of both (video games) comes with a ton of equally complicated problems.
There is a lot you take for granted in the server world that is simply not true anywhere else.
- The control you have over the hardware (CPU/RAM/disk) and OS environment where your service runs. You can very easily throw more resources at a problem; if there is a memory leak you can kill and restart your daemon; you can specify the exact combination of dependencies your application has, down to the patch level; you can update and change your application at whim 10 times a day. None of this is easy, or even possible, outside the web-service context.
- Typical performance challenges are more horizontal than vertical, i.e. supporting more users is a bigger concern than per-user or per-API-call performance. Most web services are CRUD applications of some sort, and while there are performance challenges for a single computation, the path to fix or mitigate them is not difficult to see. In the systems-programming world, performance and concurrency for a single-user application are a very, very different beast; you will end up doing a lot more math and algorithms than in the web-service world.
> teams should optimize for getting code to production as quickly as possible
Not if you are developing software for embedded systems or cars or trains or factories or space satellites.
> if you’re not on-call for your code, who is?
How long should an engineer have been working on a 10-year-old codebase before declaring it "my code"?
Nah... thanks but no thanks. I value my free time and I have 0 desire to allow work to intrude more into it.
And no, I don’t want more money for it either. But I might consider time... every week I’m on call = an extra week’s annual leave.
> build vs buy
Build the thing you’re selling and the thing you’re good at. Carpenters don’t build saws. Chefs don’t build knives or ovens.
In my podcast about running web apps in production, I talked to 50+ different people deploying 50+ different apps with a bunch of different tech stacks, and when I asked them for their best tips, by far the most common answer was to keep things simple and boring.
The idea of introducing innovation tokens to situationally introduce new tech was also mentioned in a bunch of episodes. I was surprised at how many people knew about that concept. It was new to me the first time I heard it on the show and I've been building and deploying stuff for 20 years.
A full list of the 50+ best tips (and other takeaways from comparing notes between 50+ unique projects / deployments) can be found in this blog post: https://nickjanetakis.com/blog/talking-with-52-devs-about-bu...
Funny thing with this: CD, or even just frequent ad-hoc deployments, can hide problems. When you do a code freeze over, say, the week before Thanksgiving, and you also stop deployments, you're changing how the system runs, potentially leading to an issue at a bad time that the team isn't experienced in dealing with.
You can obviously configure CD to continue to redeploy the same build during code freezes. I'm just not sure if people remember to do it.
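One way to express that is to make the freeze gate block only *new* builds, while scheduled runs keep redeploying the artifact already in production, so the deploy path itself stays exercised. A minimal sketch of that policy (all names and windows hypothetical):

```python
from datetime import date

def deploy_action(today, freeze_windows, new_build_available):
    """Decide what a scheduled CD run should do.

    During a freeze, keep redeploying the build already in production
    so the pipeline itself stays exercised; outside a freeze, ship
    whatever is newest.
    """
    frozen = any(start <= today <= end for start, end in freeze_windows)
    if frozen:
        return "redeploy-current-build"
    return "deploy-new-build" if new_build_available else "redeploy-current-build"

# Hypothetical Thanksgiving-week freeze: same artifact keeps going out.
freeze = [(date(2023, 11, 20), date(2023, 11, 26))]
print(deploy_action(date(2023, 11, 22), freeze, new_build_available=True))
# prints "redeploy-current-build"
```

The point is that the freeze check sits in front of the *build promotion* step, not in front of the deploy step itself.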
I suppose you can say "QA environments suck", but you can also say "Make your QA environments not suck by investing time keeping them very close to production" (i.e. same OS, same timezone, same stack, very similar DB, minimal mocking).
But I agree that most "staging" environments are an unnecessary extra step that will only rarely catch something legit.
First, about my impression from the title: it's not only the code put into production that matters: it's the experience and history of all the code that was decommissioned from production because of issues, or the code that almost made it to production but didn't because of some critical issue found at the last minute.
Maybe I'm the exception, but I often leave that code, commented out, in the sources - and I keep adding more and more!
I know it's not proper in this day and age (git, documentation, etc. should be the place it goes) but that's the only place where I'm 100% positive a pair of eyeballs WILL see it FOR SURE.
It's a way to avoid institutional knowledge loss when working in teams, but also a way to avoid forgetting what you did when you work on projects spanning multiple years.
Now for the points raised, (2) is bad: what you buy you don't understand when or how it may fail. Bitten once, twice shy...
So yes, I also go for manual QC (5) and a staging environment (8), actually with production split four ways: two halves running different versions of the code (current and previous), each with its own backup, because for what I do, (9) is unacceptable: if there's a break in 24/7 operation, the business closes.
Consequently, for (3), deployments are deliberately made NOT EASY and are manual. It doesn't add much extra friction, because code reaching a production server will at least have been reviewed by eyeballs forced to read what failed before (in comments), whose owners will then have had to (5) quality-control themselves to avoid feeling too sure of themselves, after which the code will have to prove its worth in (8) a staging environment for a few weeks.
Then, if something bad happens, it's back to the design board and eyeballing. If not, the code is just "good enough" to be deployed on half the fleet, replacing the oldest "known good" version, first on the backup servers, then on the main servers.
And... that's how it stays until the next main version.
If some unforeseen problem is discovered, the previous "known good" version is still available on half the fleet. If a server has a hardware problem, the backup server with the N-1 version of the code is the final thing that remains between business as usual and "the 24/7 contract is broken, clients leave, the business closes".
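The rollout discipline described above reduces to a simple ordering rule: always upgrade the half running the oldest known-good version, backups before mains, so the other half keeps an intact N-1 fallback. A sketch of that rule (the fleet layout and names are hypothetical, not this commenter's actual setup):

```python
def rollout_order(fleet, new_version):
    """Return the servers to upgrade to new_version, in order.

    Only the half running the oldest version is touched, backups
    before main servers, so a known-good N-1 version always
    survives on the other half of the fleet.
    """
    oldest = min(server["version"] for server in fleet)
    candidates = [s for s in fleet if s["version"] == oldest]
    # sort key is False for backups, True for mains: backups go first
    return sorted(candidates, key=lambda s: s["role"] != "backup")

fleet = [
    {"name": "main-a",   "role": "main",   "version": 1},
    {"name": "backup-a", "role": "backup", "version": 1},
    {"name": "main-b",   "role": "main",   "version": 2},
    {"name": "backup-b", "role": "backup", "version": 2},
]
for server in rollout_order(fleet, new_version=3):
    print(server["name"])
# prints backup-a then main-a; the half on version 2 is left untouched
```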
I sell a data feed with an SLA guaranteeing 24/7 operation and latency parameters. I've been going for 3 years with only 1 failure that came close to interrupting production... but didn't. Each lesson was dearly learned.
OK, that's fine, but in that case we can shut down the production system outside of business hours so that our work-life balance isn't affected. Oh? We can't shut down the production system outside of business hours? So we need to be on call continuously, 24/7, meaning we can't ever be off the grid or unavailable? That sounds like we're expected to give up our personal lives at a moment's notice? Interesting, hmm.
It's hardly some horrendous controversial idea, nor unique to software engineering.
At several places I worked (and others I asked about in job interviews), the general amount which companies get away with (and employees find bearable) seems to be <= 5 weeks of on-call per employee per year.
And obviously you're being paid to be on standby, and then paid for your overtime should an incident occur.
Devs think they're hot stuff, when in reality we're probably one of the most abused professions out there. (I'm talking about regular devs, not people who were born in wealth/went to good schools etc)
At my former employer I was on an on-call rotation; I'm obviously not now that it's a 'former' employer, so the building analogy doesn't really hold up. (And it's not just leaving the company: e.g. my former colleagues now working on something else at the same company aren't on call for the software they wrote but are no longer responsible for.)
The article did not mention anything about pay or compensation for oncall.
The best I get in job interviews is usually a mention that there is a rota every X period. Then I have to poke the interviewers, trying to guess what it's like without coming across as too negative: "When was the last time you worked on a weekend?" "When was the last time you were woken in the middle of the night?"
The issue is who pays and when?
You can pay that cost upfront - for example the JPL/NASA SDLC. This will ensure you won't get woken at odd hours, but the massive upfront cost is something most businesses won't pay.
You can sling code without tests and fix it in prod, hoping speed will help you find product market fit.
Pretty much everyone sits somewhere between the two. This article just describes one point on the spectrum where the author feels best practice lies - but to be honest the trade-offs vary across this spectrum.
Probably the right way to think of this is "the total cost of making this software NASA-level is 10X, and the revenue from such perfectly working software would be 20X (with no loss due to downtime)".
As such, if you ask me not to code to NASA standards, I and my team will incur a personal cost of 5X in being woken up, stressful release days, etc.
Therefore you will compensate me with payments of 5-10X.
This discussion is much easier with a Union involved
Ok - there is a spectrum of reliability - let's say that NASA produces the most reliable code anywhere, and that it has a very high cost to produce code like that. At the other end of the spectrum is some guy slinging PHP code out without any testing, hoping that it will turn into the next unicorn.
If we asked both ends of the spectrum to write code to solve the same business problem (a pet food delivery app), then the guy slinging PHP will get woken up at 2am regularly because the server is always crashing. The NASA guy will never get woken up, but the app will probably reach the market a year after the first one.
So the business has to choose a trade off - sling code and get lots of 2 am wake up calls or wait and possibly lose market share to a competitor.
Now there was a famous example of a Reddit co-founder who slept next to his laptop and just rebooted the server every two hours until they discovered Supervisor (the Python process manager). That seems ok - the business (the co-founder) was making the trade-off and exploiting the worker (the same co-founder). The worker was happy to take the job because they were likely to get paid if it all worked out (and it did).
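What a process supervisor automates is essentially that reboot loop: watch the worker, restart it when it dies, give up after too many failures. A toy sketch (with a plain callable standing in for a real forked process, which Supervisor would actually manage):

```python
import time

def supervise(start_worker, max_restarts=5, backoff_seconds=0.0):
    """Keep restarting a worker when it dies, up to max_restarts.

    start_worker is any callable that runs until the 'process' exits
    cleanly (returns) or crashes (raises). Returns the number of
    restarts that were needed before a clean exit.
    """
    restarts = 0
    while True:
        try:
            start_worker()
            return restarts  # worker exited cleanly
        except Exception:
            restarts += 1
            if restarts > max_restarts:
                raise  # give up, as a supervisor does after startretries
            time.sleep(backoff_seconds)

# A worker that crashes twice, then runs fine.
attempts = {"n": 0}
def flaky_worker():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("server crashed")

print(supervise(flaky_worker))
# prints 2: the worker was restarted twice before exiting cleanly
```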
The issue comes when the worker on call is not making the business judgement. How much should they demand in payment?
If they have a healthy equity payment in a growth company, that might work just like the above founder. Otherwise the payment needs to come out of the money not spent.
So I guess my argument is that there is a fixed cost to reliable software for the business - it should either pay for highly reliable software, or pay the saved cash to the code slinger each time the server goes down.
This will change the trade off mathematics.
I did 24/7 support solo and with just 1-2 other devs for years on a global system, and never again will I do DevOps in such a small team with 24/7 on-call requirements. The cost of maintaining features and systems varies, so having great enterprise support can be a non-issue or a constant headache that you have little to no control over (e.g. a system that takes a dependency on external data you can't control).
On top of pushing features out constantly, maintaining quality, and automating everything you can, a startup can easily fall into building systems their staff have trouble maintaining without significantly impacting output, as well as impacting the mental health of their devs. I think the problem is that it is hard to see these costs up front, as you can build systems these days on cloud providers where most of the time things will come back on their own without intervention - but it obviously depends what impact being offline for 5 minutes vs 3 hours has on the business.
To me this implies an on-call rotation where you know your expectations. Not "we need to be on call continuously, 24/7, meaning we can't ever be off the grid or unavailable". Many other industries have the idea of being on-call, and they are "expected to give up our personal lives at a moment's notice" when they know they are on-call. (For example, my brother in law is a surgery tech; he's had to take off during family outings more than once)
Also, if this happens often enough that it's a serious problem, this says a lot about the quality of the code you own.
- Developer compensation
- Training and career development
- Staffing properly i.e. not under-staffing
- Giving devs proper slack time between tasks and not over-burdening them with projects
- Letting developers own the stack not just in name only but truly own the technical decisions made in the stack without micro-management, including choice of language, platform, etc.
Without all those factors, it's a red herring to point to the code quality. The code quality is just the final output of all of the above decisions.
Your manager may fancy themselves a latter-day Cortés, but you don't need to play their mind games (most of them based on misunderstood readings of an unsettled science) to create an effective and high-functioning organisation.
At the end of the day all this 'you need to be on call for your code' is purely a business money-saving ploy. We are an industry full of suckers, I guess, because we fell for the 'plausible-sounding' explanation hook, line, and sinker.
Nah, it makes you a better serf. Are you working at Amazon and getting paid 400k/year? Sure, do whatever. but regular devs making 70k shouldn't put up with this bullshit.
The well-established solution to 24/7 availability is to operate a shift pattern.
No manager or employer would ever buy that shit because it rounds in the direction of less work though.
Gene Ray understood this.
The worst that can happen is that the company is down a few hours overnight. Issues can be investigated and fixed during office hours.
I'd wager that most companies don't have global customers and don't need 24/7 coverage.
I think this is a great example of why disagreements arise on HN: different world experiences and base assumptions. For many companies, being unavailable for that window of time would be catastrophic. We had one client that suffered about an hour of downtime (it turned out to be their issue). They accounted that hour at 5 million dollars lost.
The whole "you build it you run it" movement is an attempt to fix dev teams just not giving a fuck about quality of code they put out, especially from a reliability point of view.
Why is the opposite better?
Probably this approach is more scalable, especially in big companies where you can have operations teams on call for a myriad of projects.
I personally believe that this does not guarantee a better service.
It's exactly like properly/cleverly documenting your code/project: not only for others, now or in a few years, but also for yourself later on.
It's having common rules across teams to get more reliability out of the whole company.
You build it, you run it. Fine. Until the point when you can't anymore (because... reasons - it just happens). In any activity you want to sustain, you always have to have backups (in people and in processes), instead of relying on yourself or your team alone.
A whole business treats that as being as important as its disaster recovery processes (which is not necessarily something you focus on early, but you eventually do).