I understand the sentiment, but I agree only with points 2, 3 and 6 of this article; the rest is, imho, actually dangerous in many non-startup cases.
Example: simple is always better IF you can apply it, but a lot of the companies and people you work with do not do "simple". A lot of companies still have SOAP, CORBA or in-house protocols, and you will have to work with them. So you can shout from the rafters that simple wins; you will just not get the project. That can be a deliberate decision, but I do not see many people who finally got into a bank/insurer/manufacturer/... go 'well, your tech is not simple by my definition, so I will look elsewhere'.
It is a nice utopia and maybe it will happen when all legacy gets phased out in 50-100 years.
Today's code will become tomorrow's legacy.
- that it’s possible for the team developing the product to deploy or monitor it (example cases where it isn’t: most things that aren’t web based, such as desktop software; most things embedded into hardware that might not even exist yet; etc.)
- that if you can deliver continuously, customers will actually accept that you do. Customers may want big-bang releases every two years and reject the idea of the software changing in the slightest in between.
- not validating a deployment for a long time before it reaches customers is also only OK if the impact of a mistake is that you deploy a fix. If the next release window is a year away and/or if people are harmed by a faulty product then you probably want it painstakingly manually tested.
My point is: if you are a team developing and operating a product that is a web site/app/service, and you are free to choose if and when to deploy, then most of the article is indeed good advice. But then you are also facing the simplest edge case among software deployment scenarios.
In these cases, you can have a pre-production embedded (in the sense of "embedded journalism") field test, where the developers come out to the production line and/or testing field to iterate on the software together with other departments + the final customers.
IIRC this is done often in military weapons testing—you'll often find the software engineer of a new UAV autonavigation system at the testing range for that system, doing live pair-debugging with the field operator.
Yes. The assumption that you are working on a web based service is so core to this piece that it doesn't seem any more necessary to say "this doesn't work for desktop" than it would be to say "this doesn't work without internet".
Given that you are delivering software on the web, your customers are going to get changes to it and like it, because their other option is to run systems on the internet with known exploits. Customers who don't want changes host their own instance.
And if your next release is a year away, you have no way to roll back the release, and you have no manual validation, then you aren't following this advice to begin with, and you have an appallingly broken process.
> If the next release window is a year away and/or if people are harmed by a faulty product then you probably want it painstakingly manually tested.
No, your manual testing should only be for the things which are difficult to test automatically. But I think you should _always_ strive for extensive automated testing. Even with hardware which doesn't exist yet, mocks are perfect for that.
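For example, a minimal sketch of what that can look like in Python (the thruster hardware here is entirely hypothetical; the point is that the control logic is exercised against a mock instead of the not-yet-built device):

    # Hypothetical example: the real hardware doesn't exist yet, so the
    # driver is replaced with a mock and the control logic is tested alone.
    import unittest
    from unittest import mock

    class FlightComputer:
        def __init__(self, thruster):
            self.thruster = thruster

        def hold_altitude(self, current_m, target_m):
            # Fire the thruster only when we are below the target altitude.
            if current_m < target_m:
                self.thruster.fire(duration_ms=100)

    class HoldAltitudeTest(unittest.TestCase):
        def test_fires_when_below_target(self):
            thruster = mock.Mock()
            FlightComputer(thruster).hold_altitude(current_m=90, target_m=100)
            thruster.fire.assert_called_once_with(duration_ms=100)

        def test_does_not_fire_when_at_target(self):
            thruster = mock.Mock()
            FlightComputer(thruster).hold_altitude(current_m=100, target_m=100)
            thruster.fire.assert_not_called()

    if __name__ == "__main__":
        unittest.main()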
It’s also possible to extend the argument to the relation between “compiler verification” and “test verification”. That is: don’t spend time writing tests for things a compiler could catch.
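A contrived Python sketch of that idea, using type hints checked by mypy as the "compiler": once the signature encodes the constraint, a test whose only purpose is to verify that passing a string blows up duplicates work the type checker already does for free.

    # mypy rejects the bad call below at check time, so there is no need for
    # a test that only asserts a string argument fails.
    def apply_discount(price_cents: int, percent: float) -> int:
        return round(price_cents * (1 - percent / 100))

    total = apply_discount(1999, 10)      # fine
    bogus = apply_discount("1999", 10)    # mypy: incompatible type "str"; expected "int"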
I do disagree with:
> Environments like staging or pre-prod are a fucking lie.
You need an environment that runs on production settings but isn't production. Setting up an environment that ideally has read-only access to production data has kept a huge number of bugs from reaching customers, at least IME.
There are just so many classes of bugs that are easily caught by some sort of pre-prod environment, including stupid things like "I marked this dependency as development-only but actually it's needed in production as well".
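A contrived sketch of that last class of bug, assuming a Python service that uses the (real) `tabulate` package: on a dev machine, where dev dependencies are installed, the import always works, and only a prod-like install surfaces the failure.

    # report.py -- ships to production.
    # `tabulate` was listed as a dev-only dependency (it started life in a
    # test helper), so a production install doesn't include it: this module
    # raises ModuleNotFoundError on startup, but only in an environment that
    # was installed with production dependencies alone.
    from tabulate import tabulate

    def render_report(rows):
        return tabulate(rows, headers="keys")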
Development environments are frequently so far removed from production environments that some intermediate step between the two is almost always helpful enough to me that the extra work involved in maintaining that staging environment is well worth it.
It's not the same as production obviously, but it's a LOT closer than development.
> Setting up an environment that ideally has read-only access to production data has kept a huge number of bugs from reaching customers, at least IME.
That's an anecdote, not a reason. Also, just because you've done it that way doesn't mean it has to be done that way, like you asserted.
> There's just so many classes of bugs that are easily caught by some sort of pre-prod environment
Also does not support the claim that you need a pre-prod env.
> Development environments
Whoa, there! You're sneaking yet another kind of environment into the conversation? Maybe not. This is unclear, given the many different ways that people do work.
> not the same as production obviously, but it's a LOT closer
You seem to want something like production. There is nothing more like production than production.
If you're set up to do A/B tests or deploys with canaries or give potential customers test accounts you're probably able to start testing in production in a sane, contained way.
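A minimal sketch of the "contained" part (the checkout handlers are hypothetical): bucket users deterministically so only a small, stable fraction hits the new code path, and everyone else keeps the old one.

    import hashlib

    CANARY_PERCENT = 1  # expose the new path to roughly 1% of users

    def in_canary(user_id: str) -> bool:
        # Stable hash so a given user consistently gets the same experience.
        bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
        return bucket < CANARY_PERCENT

    def new_checkout_flow(payload: dict) -> dict:   # hypothetical new path
        return {"flow": "new", **payload}

    def old_checkout_flow(payload: dict) -> dict:   # hypothetical current path
        return {"flow": "old", **payload}

    def handle_request(user_id: str, payload: dict) -> dict:
        if in_canary(user_id):
            return new_checkout_flow(payload)
        return old_checkout_flow(payload)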
You seem to be assuming 1. some sort of large-horizontal-scale production system with multiple customers, where the impact of a failure can be minimized by limiting the number of users exposed to new features, and 2. that there's no type of bug in the code that would potentially take down production as a whole.
What if your production system is, say, a bank's ACH reconciliation logic? A medical device? A car? The live server for a popular MMORPG? A telephone backbone switch? A television or radio broadcast station?
In these cases, your software isn't a service with multiple distinct customers that each make requests to it, where you can test your new code on one customer in a thousand; your software is just running and doing something—one, unified something per instance of the system (though that process may track multiple customers)—and if the code is wrong, then the whole system the software operates will fail.
How do you test software for such systems?
Usually by having a "production simulation" whose failure won't kill people or cost a million dollars in lost revenue.
Currently I work on systems to prepare and validate birth and death certificates for the state, counties, hospitals, et al., and this whole "throw it against the wall and see what sticks" methodology doesn't fly. Nor would it have worked when I was preparing and presenting investment account information 5 years ago, nor at the job 10 years ago processing lawsuit and insurance claim cases and legal bills. Nor at any place I have worked in the last 30 years, for that matter.
Basically you're outsourcing QA to your customer. Some systems may afford this, others not.
You've just described any software ever used.
The unstated assumption of "staging" fans is that bad test coverage is a universal, naturally-occurring condition. At the companies where I've worked that had good tests, they did not have staging. At companies where I thought the test coverage was poor, they did have it.
I think a manual QA team is very valuable. Sure, the tests pass, but what if the UI is confusing or disorienting? QA can be user advocates in a way a unit test can't be. I work in games, so maybe it's just a squishier design philosophy, but you can't unit test fun.
I also don't understand the worry about other environments. If you're automating deployments how is another environment added work? Shouldn't it be just as easy to deploy to?
There is definitely value in having both automated testing for repetitive stuff, AND, humans touching stuff to spot unspecified insanity.
Regarding the 'buy vs build' point, I think buying software is one of the most risky things that you can do. Since it costs money you cannot then say 'oh well, I guess it just did not pan out, let us just not use it'. Now you are kind of married to the software. And some of the worst software out there is paid for. E.g., Jira vs. Redmine. This is actually a bit ironic considering the fact that I am writing software in my job that is sold... Oh well, it is actually sold as part of a piece of hardware, so it is not really software as such...
Regarding the last point, failure can be made uncommon if a relatively safe route to production is available: a language that verifies the correct use of types, automated tests that verify the correctness of the code, a testing environment that one attempts to keep close to what production is like, and so on. Getting a call that production is not working is the event that I am trying to prevent by all means possible, and I think research would be able to show that people who get fewer calls (not just about production failing, but fewer calls regarding whatever subject in general) live longer and happier lives.
It is usually way more costly and risky to develop your own. It's many hours spent on what is a separate product from your actual product, and you're way more married to it: you've just spent money, time and energy developing a custom homegrown solution. What are the chances you'll go "oh well, I guess it just did not pan out, let us just not use it"? Very, very low.
So you end up spending more money and a significant amount of time/energy for a product that's probably subpar because there's no reason you'd do better than companies that are focused on this product.
I think buying software is one of the least risky things you can do, you know exactly how much money you have at risk and you usually know pretty well what you're buying. You don't know how much money/time/energy it will take to make your own solution, and you don't know what result you'll get.
One is the difference between optimizing for MTBF and MTTR (respectively, mean time between failures and mean time to repair). Quality gates improve the former but make the latter worse.
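For context, steady-state availability is roughly MTBF / (MTBF + MTTR), so halving the repair time buys you about as much uptime as doubling the time between failures does.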
I think optimizing for MTTR (and also minimizing blast radius) is much more effective in the long term even in preventing bugs. For many reasons, but big among them is that quality gates can only ever catch the bugs you expect; it isn't until you ship to real people that you catch the bugs that you didn't expect. But the value of optimizing for fast turnaround isn't just avoiding bugs. It's increasing value delivery and organizational learning ability.
The other is that I think this grows out of an important cultural difference: the balance between blame for failure and reward for improvement. Organizations that are blame-focused are much less effective at innovation and value delivery. But they're also less effective at actual safety. 
To me, the attitude in, "Getting a call that production is not working is the event that I am trying to prevent by all means possible," sounds like it's adaptive in a blame-avoidance environment, but not in actual improvement. Yes, we should definitely use lots of automated tests and all sorts of other quality-improvement practices. And let's definitely work to minimize the impact of bugs. But we must not be afraid of production issues, because those are how we learn what we've missed.
For those unfamiliar, I recommend Dekker's "The Field Guide to Understanding Human Error": https://www.amazon.com/Field-Guide-Understanding-Human-Error...
The blame-vs-reward issue sounds to me rather orthogonal to the one we are discussing here. If the house crumbles, one can choose to blame or not blame the one who built it, but independently of that, in that situation it is quite clear that it is not the time to attach pretty pictures to the walls. I.e., it certainly is not the time to do any improvement, let alone reward anyone for it. First the walls have to be reliable, and then we can attach pictures to them. The question of what percentage of my time I am busy repairing failures vs. what percentage I can spend writing new stuff seems more important to me than MTBF vs. MTTR.
I have to grant you that underneath what I write there is some fear going on, but it is not the fear of blame. It is the fear of finding myself in a situation that I do not want to find myself in: the thing is not working in production, I have no idea what caused it, no way to reproduce it, and I will just have to make an educated guess at how to fix it. Note that all of the stuff that was written to provide quality gates is often also very helpful for reproducing customer issues in the lab. This way the quality gates can decrease MTTR by a very large amount.
I think the quality gates mentioned in the article are the ones where you have a human approving a deployment. If you have an issue in production and you solve it you should definitely add an automated test to make sure the same issue doesn’t reappear. That automated test should then work as a gate preventing deployment if the test fails.
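Something like this, as a rough sketch (`deploy.sh` and the `tests/regressions` suite are placeholders for whatever your pipeline actually uses):

    import subprocess
    import sys

    def main() -> int:
        # Run the regression suite; any failure blocks the deployment.
        tests = subprocess.run(["pytest", "tests/regressions", "-q"])
        if tests.returncode != 0:
            print("Regression tests failed; refusing to deploy.", file=sys.stderr)
            return tests.returncode
        # Placeholder deploy step; in practice this is your CD pipeline.
        return subprocess.run(["./deploy.sh", "production"]).returncode

    if __name__ == "__main__":
        sys.exit(main())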
And I think the issue of blame is very much related to what you say drives this: fear. Fear is the wrong mindset with which to approach quality. Much more effective are things like bravery, curiosity, and resolve. I think if you dig in on why you experience fear, you'll find it relates to blame and experiences related to blame culture. That's how it was for me.
If you really want to know why bugs occur in production and how to keep them from happening again, the solution isn't to create a bunch of non-production environments that you hope will catch the kinds of bugs you expect. The solution is a better foundation (unit tests, acceptance tests, load tests), better monitoring (so you catch bugs sooner), and better operating of the app (including observability and replayability).
Then you say that e.g., bravery is better than fear. Well, there is fear right there inside bravery. I would be inclined to make up the equation bravery = fear + resolve.
And why are you pitting replayability against what I am saying? Replayability is a very good example of what I was talking about the whole time. I have written an application in the past that could replay its own log file. That worked very well to reproduce issues. I would do that again if the situation arose. Many of these replayed logs would afterwards become automated tests. The author of the original article would be against it, though. The replaying is not done in the production environment, so it is bad, apparently.
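Roughly this pattern, assuming the app logs one JSON-encoded request/response pair per line and exposes some `handle(request)` entry point (names hypothetical):

    import json

    def replay_log(log_path: str, handle) -> list:
        # Re-run every logged request through the app and collect divergences.
        mismatches = []
        with open(log_path) as log:
            for line in log:
                entry = json.loads(line)
                result = handle(entry["request"])
                # A divergence from the logged response usually pinpoints the
                # change that introduced the bug.
                if result != entry.get("response"):
                    mismatches.append((entry, result))
        return mismatches

The interesting mismatches then get checked in as fixtures for the automated suite, which is exactly the "replayed logs become tests" step.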
And I'm saying that the things I listed are good ways to get quality while not having QA environments and QA steps in the process.
I also don't know where you get the notion that all debugging has to be done in production. If one can do it there, great. But if not, developers still have machines. He's pretty clearly against things like QA and pre-prod environments, not developers running the code they're working on.
So it seems to me you're mainly upset at things that I don't see in his article.
Once you start relying on it a bit too much, it can hurt really badly if you are not able to fix issues by yourself, or if they decide to change the price later on. The worst is when a company that you were paying for software goes out of business. You just have to start again.
In dev you can break almost anything, no biggie. In stage, if you break something, great, just don't deploy it to prod. If you break something in prod, well ... you may end up going below your SLA and may legit lose money and your customers' trust.
Don’t YOLO into prod. Build reliable shit.
- Infrastructure as code and schemas as code make it easier to keep environmental parity, because everything can be rolled back/forwards/reset with easy source control and CD operations. Visual environment diffing and drift detection can make this even easier.
- Make your stage and prod into a blue-green situation, where if stage is ready to go, you flip users onto it (see the sketch after this list). I can guarantee your stage and prod will both be respected as prod then. Failing that, just add load/stress tests to stage to make it more prod-like.
- Non-prod environments and attention are not necessarily debt, but they are expensive insurance premiums. You should only pay those premiums if you need the insurance. It's about risk management.
- As time passes, the people who wrote a specific part of a system don't know it anymore, so having them babysit 'their' code in production has diminishing returns. On the other hand, having a systems quality team with a broad mandate to bugfix, put in preventative measures, reduce technical debt, improve observability and establish good patterns for developers to do these things can enable those things to actually happen, when just telling devs who are busy making features that they should happen often doesn't make them happen. Also, there are devs who enjoy creating new things, and others who love troubleshooting and metrics.
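A toy sketch of the blue-green flip from the second bullet; the file-based live target is a stand-in for whatever your load balancer or router actually exposes:

    # Hypothetical blue-green switcher: "blue" and "green" are two identical
    # stacks; going live is repointing traffic at the freshly validated one,
    # while the old stack stays warm for instant rollback.
    LIVE_TARGET_FILE = "/etc/lb/live_target"   # hypothetical router config

    def read_live_target() -> str:
        with open(LIVE_TARGET_FILE) as f:
            return f.read().strip()            # "blue" or "green"

    def flip_traffic() -> str:
        new_live = "green" if read_live_target() == "blue" else "blue"
        with open(LIVE_TARGET_FILE, "w") as f:
            f.write(new_live)
        return new_live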
I understand the deploy-often-and-rollback-if-there-is-a-problem strategy, but certain things like DB migrations and config changes are difficult to roll back, so doing a dry run in a staging environment seems like a good thing...
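FWIW, writing the reverse migration up front at least turns "difficult to roll back" into "scripted to roll back", even if data-destroying changes still deserve the staging dry run. A sketch, with `run_sql` standing in for whatever your migration tool or DB driver provides:

    # Hypothetical paired up/down migration.
    UP = "ALTER TABLE orders ADD COLUMN shipped_at TIMESTAMP NULL;"
    DOWN = "ALTER TABLE orders DROP COLUMN shipped_at;"

    def migrate(run_sql, direction: str = "up") -> None:
        # The rollback path is written and reviewed together with the change,
        # so reverting in prod is the same one-step operation as deploying.
        run_sql(UP if direction == "up" else DOWN)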
For some of our customers, we operate 2 environments which are both effectively production. The only real difference between them is the users who have access. Normal production allows all expected users. "Pre" production allows only 2-3 specific users who understand the intent of this environment and the potential damage they might cause. In these ideal cases, we go: local development -> internal QA -> pre production -> production actual. These customers do not actually have a dedicated testing or staging environment. Everyone who has seen this process in action loves it. The level of confidence in an update going from pre production to production is pretty much absolute at this point.
The amount of frustration this has eliminated is staggering, at least in the cases where our customers allowed us to practice it. For many there is still that ancient fear that if we haven't tested for a few hours in staging, the world will end. For others, weeks of bullshit ceremony can be summarily dismissed in favor of actually meeting the business needs directly and with courage. Hiding in staging is ultimately cowardice. You don't want to deal with the bugs you know will be found in production, so you keep it there as long as possible. And then, when it does finally go to production, it's inevitably a complete shitshow because you've been making months' worth of changes built upon layers of assumptions that have never been validated against reality.
This all said, there are definitely specific ecosystems in which the traditional model of test/staging/prod works out well, but I find these to be rare in practice. Most of the time, production is hooked up to real-world consequences that can never be fully replicated in a staging or test environment. We've built some incredibly elaborate simulators and still cannot 100% prove that code passing on these will succeed in production against the real deal.
This wasn't cheap; they paid Oracle somewhere between 50k€ and 200k€ a year just for the database for this environment, but they considered it worth it. (They were also in a pretty tightly regulated vertical.)
My main takeaway is that I don't think there is a one-size-fits-all answer to the question of how many and what environments you need. IME having at least one "buffer" between dev and prod is a good thing, but I'm not sure to what extent my experience generalizes.
1. No, the engineers should not by default be on call; the owners of the product are the first call line. If they're not engineers, or if they're engineers but don't have enough time to deal with all incidents (in short, if they need to delegate), they had better be willing to pay very generously for the extra hours of on-call duty.
2. No, hosted is not better than open source, for both philosophical and operational reasons: mostly, you become subject to the whims of the provider. A good compromise is hosted open source solutions, which at least take you halfway to a migration, if the need for one comes up.
That aside, I very much agree on everything else.
Without such a preflight box, or automated incremental rollouts, you are kind of doing a Hail Mary, since you are exposing all users immediately to a system that has not been verified in production before going live.
Strongly disagree with that. Well, maybe it is a good idea when you are overfunded by VCs, the cost of money is close to zero, and you don't want to master what you are working on, but in all other cases this is wrong. You shouldn't rebuild everything from scratch, but creating a company is not the same as playing with LEGO.
And this is the same argument as saying you should have everything in AWS, because if you self-host you will have to hire a devops engineer.
I've made the build-vs-buy decision many times in my career. I don't necessarily regret /all/ of those times, but the general lesson I've learned time and time again is that you're going to end up investing WAY too much time maintaining your special version of X when you should have spent that time solving problems unique to your business model.
If not, you're wasting your money in a different way, by not focusing on the things that really bring in revenue, or by paying salaries to people to maintain it.