
I think you're missing a couple things here.

One is the difference between optimizing for MTBF and MTTR (respectively, mean time between failures and mean time to repair). Quality gates improve the former but make the latter worse.

I think optimizing for MTTR (and also minimizing blast radius) is much more effective in the long term, even at preventing bugs. For many reasons, but chief among them is that quality gates can only ever catch the bugs you expect; it isn't until you ship to real people that you catch the bugs you didn't expect. But the value of optimizing for fast turnaround isn't just avoiding bugs. It's increasing value delivery and organizational learning ability.

The other is that I think this grows out of an important cultural difference: the balance between blame for failure and reward for improvement. Organizations that are blame-focused are much less effective at innovation and value delivery. But they're also less effective at actual safety. [1]

To me, the attitude in, "Getting a call that production is not working is the event that I am trying to prevent by all means possible," sounds like it's adaptive in a blame-avoidance environment, but not in actual improvement. Yes, we should definitely use lots of automated tests and all sorts of other quality-improvement practices. And let's definitely work to minimize the impact of bugs. But we must not be afraid of production issues, because those are how we learn what we've missed.

[1] For those unfamiliar, I recommend Dekker's "Field Guide to Human Error": https://www.amazon.com/Field-Guide-Understanding-Human-Error...




One can talk about MTBF and MTTR, but not all failures are created equal, so not every attempt to do statistics about them makes sense. The main class of failure I worry about with respect to MTTR is the very same observable problem you solved last week occurring again because of a missing quality gate. To the customer this looks like last week's problem was never solved at all, despite promises to the contrary. If the customer is calculating MTTR, they would say the TTR for this event is at least a week, and I could not blame them for saying so. Since getting the same bug twice is worse than getting two different ones, it is actually quite valuable that quality gates defend against known bugs.
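
To make that arithmetic concrete, here is a rough Python sketch (dates and durations are invented) of how a recurring bug inflates the customer-perceived TTR even though each individual repair was fast:

    from datetime import datetime

    # Invented incident log for one observable bug: a fix shipped after the
    # first report, but the same symptom resurfaced a week later.
    incidents = [
        (datetime(2023, 5, 1, 9, 0), datetime(2023, 5, 1, 14, 0)),   # report -> fix (5 h)
        (datetime(2023, 5, 8, 10, 0), datetime(2023, 5, 8, 11, 0)),  # recurrence -> fix (1 h)
    ]

    # Internal bookkeeping: two separate incidents, each repaired quickly.
    internal_ttr_hours = [(fixed - reported).total_seconds() / 3600
                          for reported, fixed in incidents]
    print("internal TTRs (h):", internal_ttr_hours)            # [5.0, 1.0]

    # The customer's view: the problem reported on May 1 was not really gone
    # until May 8, so the perceived time to repair spans the whole week.
    perceived_ttr_hours = (incidents[-1][1] - incidents[0][0]).total_seconds() / 3600
    print("customer-perceived TTR (h):", perceived_ttr_hours)   # 170.0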

The blame-vs-reward issue sounds rather orthogonal to the one we are discussing here. If the house crumbles, one can choose to blame or not blame the one who built it, but independently of that, it is quite clear that this is not the time to hang pretty pictures on the walls. That is, it certainly is not the time for improvements, let alone for rewarding anyone for them. First the walls have to be reliable; then we can hang pictures on them. The question of what percentage of my time I spend repairing failures versus what percentage I spend writing new things seems more important to me than MTBF vs. MTTR.

I have to grant you that there is some fear underneath what I write, but it is not fear of blame. It is the fear of finding myself in a situation I do not want to be in: the thing is not working in production, I have no idea what caused it, no way to reproduce it, and I will just have to make an educated guess at how to fix it. Note that the code written to provide quality gates is often also very helpful for reproducing customer issues in the lab. In that way quality gates can decrease MTTR by a very large amount.


> The main class of failure I worry about with respect to MTTR is the very same observable problem you solved last week occurring again because of a missing quality gate. To the customer this looks like last week's problem was never solved at all, despite promises to the contrary.

I think the quality gates mentioned in the article are the ones where a human approves a deployment. If you have an issue in production and you solve it, you should definitely add an automated test to make sure the same issue doesn't reappear. That automated test should then work as a gate, preventing deployment if it fails.
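
For example, a production incident can be pinned down with a small regression test that the CI pipeline runs before every deployment. A minimal pytest sketch (the discount bug and all names are invented; the function under test is inlined here so the example runs on its own):

    import pytest


    def apply_discount(total: float, percent: float) -> float:
        """Stand-in for the application code that had the production bug."""
        if not 0 <= percent <= 100:
            raise ValueError("percent must be between 0 and 100")
        return round(total * (1 - percent / 100), 2)


    def test_full_discount_does_not_go_negative():
        # The exact scenario the customer reported last week.
        assert apply_discount(total=50.0, percent=100) == 0.0


    def test_out_of_range_discount_is_rejected():
        with pytest.raises(ValueError):
            apply_discount(total=50.0, percent=150)

Wired into CI so that a red test blocks the deploy, this is exactly the kind of automated gate that keeps last week's bug from coming back, without a human approval step.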


I can't say I read every word of the article, but it says so many strange and wrong things that I would not give it the benefit of the doubt of the sort "but surely it can't really be saying that, right?"


Problems don't occur due to a lack of quality gates. Quality gates are one way to fix problems, but are far from the only way. And, IMHO, far from the best way.

And I think the issue of blame is very much related to what you say drives this: fear. Fear is the wrong mindset with which to approach quality. Much more effective are things like bravery, curiosity, and resolve. I think if you dig in on why you experience fear, you'll find it relates to blame and experiences with blame culture. That's how it was for me.

If you really want to know why bugs occur in production and how to keep them from happening again, the solution isn't to create a bunch of non-production environments that you hope will catch the kinds of bugs you expect. The solution is a better foundation (unit tests, acceptance tests, load tests), better monitoring (so you catch bugs sooner), and better operation of the app (including observability and replayability).
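
As a rough illustration of the "better monitoring" part (the window size, threshold, and class name are all invented), even something as small as a sliding-window error-rate check can page you before the customer does:

    import time
    from collections import deque


    class ErrorRateMonitor:
        """Tracks request outcomes and flags when the error rate spikes."""

        def __init__(self, window_seconds: int = 300, threshold: float = 0.05):
            self.window_seconds = window_seconds
            self.threshold = threshold
            self.events = deque()  # (timestamp, is_error)

        def record(self, is_error: bool) -> None:
            now = time.time()
            self.events.append((now, is_error))
            # Drop events that have fallen out of the sliding window.
            while self.events and self.events[0][0] < now - self.window_seconds:
                self.events.popleft()

        def error_rate(self) -> float:
            if not self.events:
                return 0.0
            errors = sum(1 for _, is_error in self.events if is_error)
            return errors / len(self.events)

        def should_alert(self) -> bool:
            # Hook this up to whatever paging/alerting system you use.
            return self.error_rate() > self.threshold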


I am sorry, but what you are saying really does not make much sense to me. You say quality gates are bad and that we should instead have unit tests, acceptance tests, and so on. But unit tests and acceptance tests are examples of quality gates. And note that the original article is down even on unit tests, because they are not the production environment.

Then you say that e.g., bravery is better than fear. Well, there is fear right there inside bravery. I would be inclined to make up the equation bravery = fear + resolve.

And why are you pitting replayability against what I am saying? Replayability is a very good example of what I have been talking about the whole time. I once wrote an application that could replay its own log file. That worked very well for reproducing issues, and I would do it again if the situation arose. Many of those replayed logs would afterwards become automated tests. The author of the original article would be against it, though: the replaying is not done in the production environment, so apparently it is bad.
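
Roughly, the replay idea looks like this (the command format and the names are invented; the real handler was of course the application's own):

    import json


    def handle(command: dict, state: dict) -> None:
        """Stand-in for the application's real command handler."""
        if command["op"] == "set":
            state[command["key"]] = command["value"]
        elif command["op"] == "delete":
            state.pop(command["key"], None)


    def record(command: dict, log_path: str) -> None:
        # Called in production: append every command to the log as one JSON line.
        with open(log_path, "a") as log:
            log.write(json.dumps(command) + "\n")


    def replay(log_path: str) -> dict:
        # Called in the lab: rebuild the state by feeding the log back through
        # the same handler, e.g. under a debugger, to reproduce a customer issue.
        state: dict = {}
        with open(log_path) as log:
            for line in log:
                handle(json.loads(line), state)
        return state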


I don't believe the original article is down on unit tests. He's very clearly down on manual tests and tests that are part of a human-controlled QA step. But he also says, "If you have manual tests, automate them and build them into your CI pipeline (if they do deliver value)." So he is in favor of automated tests being part of a CI pipeline.

And I'm saying that the things I listed are good ways to get quality while not having QA environments and QA steps in the process.

I also don't know where you get the notion that all debugging has to be done in production. If one can do it there, great. But if not, developers still have machines. He's pretty clearly against things like QA and pre-prod environments, not developers running the code they're working on.

So it seems to me you're mainly upset at things that I don't see in his article.



