Testing on production (marcochiappetta.medium.com)
178 points by chpmrc 10 months ago | 95 comments



An interesting perspective I once heard from an information security expert is that there's a difference between risks and 'things that can go wrong'. Something is only an actual risk if it hurts the bottom line. In particular, quite a few things that can go wrong don't carry that much risk, and conversely something that is hard but not impossible to go wrong may carry a huge amount of risk.

The trick with this perspective is that after identifying the real risks you can then link the risks and possible mitigations by looking at all 'things' and identifying the ways in which they might fail (and how this may be prevented from happening). This way you can easily identify which mitigations are helping prevent risks and which risks are not sufficiently mitigated. It's a fair bit of work, but it's not complicated and often gives useful insights.

What this article basically does is note that you should first assess what risks a failed deployment carries, and it correctly states that in quite a few cases this risk is low, and therefore the mitigations (of which there can be many) may not be necessary and may in fact be doing harm without actually sufficiently preventing any risk.


This was an interesting bit of math I did when I joined a startup. It's pretty counter-intuitive to think about how very large numbers can increase the importance of small ones.

Say the company you work for is worth $10,000,000, and that you're hosted on GCP. Now take your best guess: what do you think the likelihood is of e.g. a fire or earthquake or something occurring in all relevant Google infrastructure simultaneously*, basically ushering in the end of all of your infrastructure, data, and backups? Frame that in a number of years. Is this kind of event something that may happen once in a thousand years? Once in ten thousand years? Let's say this is the sort of thing that might happen once in ten thousand years -- that's a long time!

Then the cost of this particular risk to your company is $1000 / year.

This kind of math isn't just a toy. When you have questions like "would maintaining actual physical backups in a safe somewhere outside of GCP be worth it?", you now have a framework to answer them ("if it would cost less than $1000 per year, then yes").
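
To make the arithmetic concrete, here's a minimal sketch in Python using the hypothetical numbers above (the valuation, the once-in-10,000-years guess, and the mitigation cost are all illustrative):

    # Expected annual cost of a hypothetical company-ending event.
    company_value = 10_000_000      # hypothetical valuation in dollars
    event_period_years = 10_000     # "once in ten thousand years" guess
    expected_annual_cost = company_value / event_period_years
    print(expected_annual_cost)     # 1000.0 dollars per year
    # A mitigation (e.g. physical off-GCP backups) is worth a look if it
    # costs less than that expected annual loss.
    mitigation_cost_per_year = 800  # made-up number
    print(mitigation_cost_per_year < expected_annual_cost)  # True -> worth it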

--

* or substitute in your favorite company-ending event.


This is similar, but one of the benefits of thinking about 'things that mustn't happen' and relating them to 'things that can go wrong and how to prevent them' is that it avoids talking about expected damage.

This avoids two nasty problems with trying to express risk as an expected value.

The first is that it is hard to express all kinds of probabilities and damages numerically: not all kinds of damages convert easily to money, and some probabilities are hard to guess (you quickly get uncertain probabilities, but expected values just flatten those into an average again). Even without those issues, pinning a number on it can lead to lots of discussion (good if you want discussion, not so good if you want to get shit done).

The second is that you easily fall into the trap of assuming everything has an average and that the law of large numbers applies. While physics kind of helps you there by putting hard limits on the maximum amount of damage possible, you may end up in a situation where all the nasty stuff is in the long, improbable tail. A good example is earthquakes: magnitude increases tenfold for every point on the Richter scale, but frequency also only decreases tenfold, so what then is the average?

And something that's not really a big problem, but worth thinking about: some of these eventualities may very well cause you damage but are beyond your sphere of influence. Sure, you should try to avoid going bankrupt if someone knocks over a server rack, but if all Google data centres go down over an entire continent you've got bigger fish to fry. So limiting attention to the things you can actually do something about is a helpful way to stay focused.


While I agree with your math, I am curious how much your physical backups would be worth if something so catastrophic occurred that all of Google/AWS/Azure cloud services were destroyed. Whether that be an act of war, a massive solar flare, etc., I am curious if it would even matter anymore that you had those backups.


Similarly:

"We are spending 50$ per month just for one test in our code. We could cut it down to 10$ if we wanted."

"How many hours would it take to reduce the spend? If it's more than a couple of hours for a senior engineer, then it's not worth it."

We kept spending money on this inefficient test and it was the right choice.


Reciprocally, when low wages are available in manufacturing, companies are less likely to use automated processes because labor is so cheap: it can cost less to e.g. have two buckets/wheelbarrows between two parts of a factory line with one person to swap them rather than use a conveyor belt. Getting 100% automation would allow factories to come back to the US, but getting that last 20% at a competitive cost is difficult. See "America’s largest tool company couldn’t make a wrench in America" (wsj.com).[0]

[0] https://news.ycombinator.com/item?id=36828861


Very well said.

I had a similar conversation as a new-ish fractional CTO last year. One team was working on a new CRM product that was effectively alpha-level software used only internally. The team had become terrified of shipping and breaking something and was horrifically risk-averse. For a new release that the team was going to delay again at the last minute, I got the CEO on the release call and asked him what would happen if the release completely failed and it took us an entire day to get the product working again. He replied “Not a big deal. The users would just write stuff down like they do today and key it in tomorrow. It’s not like this has enough features to be critical or anything”.

The team was completely stunned. It goes without saying we did the release, found a small mistake, fixed it, and life went on.

Teams really do have to understand who their users are and the criticality of the software.


This story perfectly aligns with the arguments in this article. I'll add it as a note if you don't mind.


Please do :-)


When I was doing disaster recovery we measured three things: 1) the likelihood of a particular scenario, 2) the magnitude of the impact if the scenario unfolded, and 3) the level of effort expected to recover. It was the combination of these three things that prioritized decision making.
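
A toy sketch of how those three factors might be combined into a ranking; the multiplicative score and the example scenarios are my own illustration, not the actual method used:

    # Hypothetical scenarios scored 1-5 on each of the three factors.
    scenarios = {
        "network outage":        {"likelihood": 5, "impact": 3, "recovery_effort": 2},
        "main office destroyed": {"likelihood": 1, "impact": 5, "recovery_effort": 5},
    }
    def priority(s):                 # one simple way to combine the factors
        return s["likelihood"] * s["impact"] * s["recovery_effort"]
    for name in sorted(scenarios, key=lambda n: priority(scenarios[n]), reverse=True):
        print(name, priority(scenarios[name]))   # the network outage ranks first here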

We were working towards a business continuity plan which can include incidents like your main office and operations being destroyed and having to quickly relocate all services to 3rd-parties using off-site backups with minimal staff. While that was a worst-case, a primary focus was just getting a notification site up and running in the event of a network outage because that was vastly more frequent and had high visibility.

It was a very interesting project and I learned quite a bit about how to think comprehensively about the solutions we provided.


> may in fact be doing harm without actually sufficiently preventing any risk.

The biggest practical impediment to increasing velocity of delivery that I encounter is trying to convey this. People can visualize and estimate the risk and impact of a deployment gone wrong, but have a hard time estimating the impact of processes that slow down delivery. Therefore they overindex on heavy and "safe" processes (which often don't increase safety) at the cost of speed of iteration.

I'm not sure how to define this asymmetry, maybe some variation of loss aversion.


>> Something is only an actual risk if it hurts the bottom-line.

Sometimes not even that. We have seen many huge breakdowns in recent years that did hit bottom lines, but didn't impact stock prices. Perhaps, at least for a publicly-traded company, the only real risks are those that might impact stock prices. That might include things that hurt other companies if doing so might result in money leaving the entire sector.


It's a really useful perspective in real-life scenarios when you're not developing critical software. Of course a baseline of risk avoidance is always important, but businesses/customers/users are most of the time ready to handle some risks, like downtime, bugs, delays, etc. SWEs and developers are the more risk-averse of the two parties, which leads to us over-valuing the importance of robustness and stability.

For example, it's way easier/faster to implement observability and some sort of rollback of bad versions than to try and prevent every possible way an app could crash and trigger a bunch of problems. What's going to happen if the app crashes is pretty simple: customers will be mad (CS/Marketing/PR can handle them), you'll notice the downtime quickly and roll back (or maybe even roll back automatically!). Then you'll be in a perfect position to handle what went wrong: systems will be back in a known stable state and all the stress of trying to fix something in a live production system will be gone.
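
As a rough illustration of the "roll back automatically" part, here's a sketch that watches a health endpoint after a deploy and rolls back on repeated failures. The endpoint URL, thresholds, and the deploy/rollback callables are all hypothetical, not from the comment or the article:

    import time
    import urllib.request
    HEALTH_URL = "https://example.com/healthz"   # hypothetical health check
    def healthy(url=HEALTH_URL, timeout=5):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.status == 200
        except Exception:
            return False
    def deploy_with_auto_rollback(deploy, rollback, checks=10, interval=30):
        """Run deploy(), watch the health check, call rollback() on 3 straight failures."""
        deploy()
        failures = 0
        for _ in range(checks):
            time.sleep(interval)
            failures = 0 if healthy() else failures + 1
            if failures >= 3:
                rollback()
                return False
        return True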


Of course, there is no magic bullet. Some problems aren't solved by rolling back services (e.g. a thundering herd of clients, caused by re-deploying an old build, overloading your database).


Yes of course, my fake situation was assuming a pretty boring case of failure with an easy out (rollback). The underlying principle is that most of the time trying to preempt every situation is way more work than being conscious of them and giving yourself and your team(s) reasonable tools to mitigate them :)


Risk = likelihood * severity


That's also an approach, but it may lead to endless discussions about how likely something is. It's easier to tell what the worst possible consequence is (what this article calls criticality). After that it's fairly straightforward to figure out if you're doing enough to prevent this scenario from being realized (which is more like coverage = risk * mitigations).


Sometimes one has to include detectability as well.


Severity should include detectability. If you never detect an issue, it's not an issue because nobody sees it.


Usually it is a separate factor, at least as far as P/D-FMEAs are concerned. Quick and dirty, sure, it can be included in severity. Personally, I prefer the increased transparency and granularity of having detectability as a different factor.


To me, that’s a subcomponent of severity


> something that is hard but not impossible to go wrong may carry huge amounts of risk.

I think it's the definition of the black swan theory[1].

[1]: https://en.wikipedia.org/wiki/Black_swan_theory


Fantastic take. Thank you.


Or, one could also just follow some standard processes all the time, instead of developing an individual approach every single time. Those standards should contain mitigations for most of the common risks, and rules to apply for the rest.

And one of those standards, and no I don't give a shit about developer experience, software or otherwise, should be that you never ever test on production. As soon as you work on real products for real customers you'd better start behaving like a professional. Child's play is over as soon as someone is paying you to do stuff.


Every enterprise I’ve ever worked for, including corporate giants running the backbones of modern finance, ran tests on production. Post-deploy smoke tests, failover tests, small-group alphas, test migrations/rollbacks, etc.

Not being confident in your test plan is a sign of immaturity, not maturity, because at some point you are going to need to validate how something behaves in production.

There are a wide range of processes, procedures and software architectures to get you to the point of being confident that your production testing is doing more good than harm for your customers, but in an environment where you can deploy new software you are going to do some testing in production.


There are many reasons to test in production! The most recent TiP project I worked on was running a regression suite against production to notify us of outages before users did (in an irregularly used but critical system, such that you can’t just go by user logs alone).


I have a dumb question as a non-SWE who is curious about software engineering.

I've heard "feature flags" are popular these days, and I understand that that's where you commit code for a new way of doing things but hide it behind a flag so you don't have to turn it on right away.

Now, if I want to test in prod, couldn't I just make the flag for my new feature turn on if I log in on a special developer test account? And if everything goes well, I change the condition to apply to everyone?


Yes.

As long as your code makes sure it takes account of that flag everywhere that it is used. Otherwise your new feature could "leak" into the system for everyone else.
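
A minimal sketch of that gating, with made-up names (the test-account set and the global switch are purely illustrative, not any particular library's API):

    TEST_ACCOUNTS = {"dev-test-account"}     # who sees the feature today
    ROLLED_OUT_TO_EVERYONE = False           # flip once verified in production
    def new_feature_enabled(user_id: str) -> bool:
        return ROLLED_OUT_TO_EVERYONE or user_id in TEST_ACCOUNTS
    def render_page(user_id: str) -> str:
        # Every code path the feature touches must go through the same check,
        # otherwise the new behaviour "leaks" to everyone, as noted above.
        if new_feature_enabled(user_id):
            return "new checkout page"       # the path being tested in prod
        return "old checkout page"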

Plus, as systems grow in complexity, there's always a danger that features step on each other. We'd like to think that everything we write is nicely isolated and separated from the rest of the system, but it never works that way - plus we're just a group of squishy humans who make mistakes. There will be times when having Features A and C switched on, with B switched off, produces some weird interactions that don't happen if A, B and C are switched on together.


Feature flags sound great, but a company I’ve been consulting for has been using them to their own detriment. Seems like many bugs are due to a (production!) user not having the right combinations of flags enabled.

There ends up being code to deal with what happens when various combinations of flags are on/off, and that code doesn’t get tested much.

And teams spend a lot of time just removing flags.

This isn’t a safety-critical app - I really think they’d do better dropping the flags, and just deploying what they want when it’s ready.


I'm going to go further and say that Feature Flags are a nightmare and should be avoided. Because instead of just being used to stage roll-out, they get used to configure different environments for different customers.

You not only waste time with "Remove feature flag X" stories if all customers end up with the feature, you also slow down the response time of some categories of bugs, because you end up having to stop and check the combination of feature flags to reproduce a bug.

And if you end up with a feature that isn't popular except with one customer, not only are you now stuck supporting "Legacy feature Y", you're actually stuck supporting "Optional legacy feature Y", which is worse.

Maybe I'm ranting about "misuse of feature flags", but I don't like to pontificate about how things ought to be, but how in my experience they actually are.


Yes, it really depends on the type of environment / app you are running too. If your app is stateful, uses lots of data, etc.. then feature flags can cause a lot of issues with inadvertent upgrades that have to be rolled back manually in things like user data.

Or you can have infinite permutations of feature flags if you don't flip them on for everyone quickly enough, and it becomes hard to test the if [a & (not b) & c & (not d)] behavior vs. the if [a & b & (not c) & d] behavior and... you end up with too many combinations to cover well with testing.


They can be very very very nice if you have a lengthy (or perhaps just unpredictable) build/deploy process. And/or if you have lots of teams working independently on the same monolith.

Suppose you have daily production builds. You are rolling out Feature XYZ. You would like to enable it in prod, but you would like to monitor it closely and may need to turn it off again. Feature flags allow that.

Ultimately what's being achieved is a decoupling of configuration and deployment.
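
A hedged sketch of that decoupling: the flag is read from runtime configuration (an environment variable here; a flag service such as LaunchDarkly plays the same role), so turning Feature XYZ off doesn't have to wait for the next daily build. All names are illustrative:

    import os
    def feature_xyz_enabled() -> bool:
        # Looked up at request time, not baked in at build time, so flipping
        # FEATURE_XYZ in the environment/flag service changes behaviour
        # without a redeploy.
        return os.environ.get("FEATURE_XYZ", "off") == "on"
    def handle_request() -> str:
        return "XYZ behaviour" if feature_xyz_enabled() else "old behaviour"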

    Maybe I'm ranting about "misuse of feature flags", but 
    I don't like to pontificate about how things ought to be, 
    but how in my experience they actually are. 
Similarly, I might just be making excuses for bad build/deploy processes. =)

At my last job we relied heavily on feature flags via Launch Darkly. I will admit: it was somewhat of a band-aid for the fact that our build process was way too slow and flaky, and that we had too many teams working on an overstuffed monolith.


I also use feature flags when I'm 100% sure stakeholders or PMs will somehow find fault with a certain feature after it's deployed, even though they're the ones who specified it, approved it and tested it in a staging environment.

Not exactly the thing that we should be using Feature Flags for, but it saved my ass several times.

On the other hand: this removes some of the accountability that non-technical folks have over software. This can be detrimental in the long term.


I have also found that for UIs the best thing to do is have a staged rollout approach.

Internals / Friendly users / Less friendly users / VIPs. The blast radius & intensity of explosion are smaller for the earlier groups.

The groups themselves need not be fixed. If you have a stakeholder/group that demanded the new features, they can be in an early wave. Inevitably they may be the ones to find defects in it, so the sooner the better.


> they get used to configure different environments for different customers

That is not a feature flag, that is a customer configuration option. They are different things and should not be treated in the same way.

Sure, it is possible for a feature flag to behave like a configuration option but they have different lifecycles and different audiences and so should not be confused. Of course it is easy to say that but harder in practice to maintain those differences.


> Seems like many bugs are due to a (production!) user not having the right combinations of flags enabled.

In my experience, feature flags work best if you aim to remove them as quickly as possible. They can be useful to allow continual deployment, and even for limited beta programs, but if you're using them to enable mature features for the whole customer base, they're no longer feature flags.


We've been using feature flags extensively lately. A step that helps for this issue is having all merged code deploy automatically to our QA environment first. We have automated tests which run there regularly, as well as it being the environment most people use for testing, which increases the likelihood that issues like this will become evident quickly.

Definitely doesn't do anything like completely obviate the issue though.


You've described the ideal use case - a single feature flag, short lived, to let select users test one isolated piece of functionality until it's made generally available. Feature flags used in this way are wonderful.

But there are numerous ways to use feature flags incorrectly - typically once you have multiple long-lived flags that interact with each other, you've lost the thread. You no longer have one single application, you have n_flags ^ 2 applications that all behave in subtlety different ways depending on the interaction of the flags.

There's no way around it - you have to test all branches of your code somehow. "Just let the users find the bugs" doesn't work in this case since each user can only test their unique combination of flags. I've regularly seen default and QA tester flag configurations work great, only to have a particular combination fail for customers.

The only solution is setting up a full integration test for every combination of flags. If that sounds tedious (and it is), the solution is to avoid feature flags, not to avoid testing them!
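
For what it's worth, here's a sketch of what "every combination" looks like in practice, using pytest and a made-up system under test; the number of combinations grows quickly as flags are added, which is exactly the pain point:

    # Sketch: exhaustively parametrize a test over all on/off flag combinations.
    # FLAGS and checkout() are made-up stand-ins, not anything from the thread.
    from itertools import product
    import pytest
    FLAGS = ["new_pricing", "fast_checkout", "beta_ui"]
    def checkout(flags: dict) -> str:
        return "ok"                  # stand-in for the real system under test
    @pytest.mark.parametrize(
        "combo",
        [dict(zip(FLAGS, values)) for values in product([False, True], repeat=len(FLAGS))],
    )
    def test_checkout_under_every_flag_combination(combo):
        assert checkout(combo) == "ok"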


> The only solution is setting up a full integration test for every combination of flags.

I've long been wondering whether there are tools that help with that. Something like measuring a test suite's code coverage, but for feature toggle permutations. Either you test those permutations explicitly or you rule them out explicitly.


Long lived feature flags are totally fine, they're more like operational flags than anything. The Fowler article is pretty good at classifying them. Depending on the type of flag (longevity/dynamism) the design will vary. https://martinfowler.com/articles/feature-toggles.html


An essential property of a feature flag is that it is short-lived, existing only for the duration of the roll-out of the feature. In the language of your linked article, feature flags are 1-to-1 with "release toggles" and not really any other kind of toggle.


2^nflags actually. Which is a much bigger number.


Yes, thank you for the correction! Though the point still stands - keep your nflags <= 2 and you can reasonably test it.


The solution is to remove your feature flags after you are done with them.


The problem is when you use feature flags for customer-bespoke reasons or to enable paid features. Then they’re always there and have to be tested in combinations which sucks.


Yeah, those things are called "user settings". If you need them, you need them, but pretending they are feature flags and trying to port the flags development methods into your settings will lead to nothing but tears.


Echoing sibling comments, feature flags are about managing the deployment of new product capabilities, and should always be short-lived. They're not an appropriate choice for any kind of long-lived capability, like anything that's per-customer, or paid vs. non-paid, or etc. Using feature flags for those kinds of things is a classic design mistake.


It's so closely related and so often mistaken: feature flags and tenant features are two completely different things.


Yes, that's the general idea - and it works pretty well.

It can also be a huge PITA. The fallacy is that a "feature" is an isolated chunk of code. You just wrap that in a thing that says "if feature is on, do the code!". But in reality, a single feature often touches numerous different code points, potentially across multiple codebases and services/APIs. So you have to intertwine that feature flag all over the place. Then write tests that test for each scenario (do the right thing when the feature is off, do the right thing when the feature is on). Then you have to remember to go back and clean up all that code when the feature is on for everyone and stabilized.

It's a good tool, but it's not an easy tool like a lot of folks think it is.


In web development there is often a single place you can put a feature flag though.

For example maybe the feature flag just shows/hides a new button on the UI. The rest of the code like the new backend endpoint and the new database column are "live" (not behind any flags) and just invisible to a regular user since they will never hit that code without the button.

As far as "remembering" to clean up the feature flag, teams I've been on have added a ticket for cleaning up the feature flag(s) as part of the project, so this work doesn't get lost in the shuffle. (And also to make visible to Product and other teams that there is some work there to clean up)


This is pretty common at larger scales, and is also often done on a per-tenant or per-account basis.

For example, the Microsoft Azure public cloud has a hierarchy of tenant -> subscription -> resource group -> resource.

It's possible to have feature flags at all four levels, but the most common one I see is rolling deployments where they pick customer subscriptions at random, and deploy to those in batches.

This means you can have a scenario where your tenant (company) is only partially enabled for a feature, with some departments having subscriptions with the feature on, but others don't have it yet.

This can be both good and bad. The blast radius of a bad update is minimised, but the users affected don't care how many other users aren't affected! Similarly, inconsistencies like the one above are frustrating. Even simple things like demonstrating a feature for someone else can result in accidental gaslighting where you swear up and down that they just need to "click here" and they can't find the button...


The training aspect of feature flags is a huge pain point.

Not to mention it looks really awkward when an account manager has forgotten to enable some great new feature for you.


Yes. That’s how we typically do it in our shop. Though we do test it during development. Then when we think it’s ready, we have the product owner (or whoever ordered it) “play around with it” on a test setup. Before we let select users “test it in production”.

I’m not a fan of this article in general, however; a lot of what it talks about is an anti-pattern in my book. Take the bit about microservices as an example. They are excellent in small teams, even when you only have 2-5 developers. The author isn’t wrong as such, it’s just that the author seems to misunderstand why Conway’s law points toward service architectures. Because even when you have 2-5 developers, the teams that actually “own” the various things you build in your organisation might make up hundreds of people. In which case you’re still going to avoid a lot of complexity by using a service architecture even if your developers sort of work on the same things.


You’re describing QC. The reason that’s not sufficient is because your test user might not meet the conditions that trigger a bug. Trivial example: a bug that only shows up for users using RTL languages. A test suite allows you to test edge cases like that. Another shortfall of QC is that it doesn’t provide future assurance. A test suite makes sure the feature keeps working in the future when changes that interact with it are introduced.


Yes. Also, feature flags don't have to be on/off, they can be set to a % of requests or users, enabling a progressive rollout period.
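
One common way to implement the percentage variant is bucketing on a stable hash of the user id, so a given user stays in (or out of) the rollout as the percentage ramps up. A minimal sketch with illustrative names:

    import hashlib
    def in_rollout(user_id: str, feature: str, percent: float) -> bool:
        # Hash user id + feature name into a stable bucket in [0, 100).
        # Raising `percent` only ever adds users; nobody flip-flops.
        digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
        bucket = (int(digest[:8], 16) % 10_000) / 100.0
        return bucket < percent
    # Example: roll a hypothetical "new_checkout" feature out to 5% of users.
    print(in_rollout("user-42", "new_checkout", 5.0))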


Yes, this is a relatively common practice. There’s of course still the chance you make a mistake setting up the feature flag and bring down production/expose the feature to users who shouldn’t have access.


The risk is context dependent. It could be a great idea or it could be the end of the company.

Classic story: https://dougseven.com/2014/04/17/knightmare-a-devops-caution...


Feature flags are just code, like the rest of the software. You can program any feature with it, including auto-enabling it given appropriate circumstances (e.g., the user is logged in to a developer account). Of course, that doesn't work for features available without requiring an account.


Yes, feature flags are often able to be applied globally, or per customer. However, feature flags add complexity (littering your business logic with feature flag checks), so many small non-feature changes wouldn’t use them.


I guess you can implement them however works for you and your team. I have personally implemented them in various ways: depending on the date, on the client IP address, on an env var, on a logged user id, etc etc.


A note of caution re: flags from an oracle dev: https://news.ycombinator.com/item?id=18442941


I enjoyed the entire article except this part:

> Unfortunately there is no easy way to distinguish between people who are good and need a paycheck from people who just need a paycheck. But you sure as hell don’t want the latter in your team.

If you can't tell them apart, then the distinction is unimportant. So if among the group of people who need paychecks, good is indistinguishable from non-good, the comment serves no purpose other than needless elitism.


There is something I've started to notice as I've been working as a platform-layer consultant for the past few years. Many of the companies I've worked with don't have anyone in their company with any meaningful level of experience or expertise in environment administration, security, really anything ops-related. I see this especially when I start trying to hand off work I've done into maintenance phase operations and they don't have any kind of operations team to take over, but my contract sure as shit doesn't say I'm going to come in at 3 AM on a Sunday morning and I never will. So they may try to identify someone in the company to train up or they may try to hire, but the core problem they face isn't that it's impossible in principle to tell good apart from bad. The problem is it's impossible for them to tell the difference because they have no one in their company even qualified to conduct such an interview.


It's implied that it isn't easy to distinguish them during interviews. After they join your team, it's very easy to distinguish them.


I've re-read the developer experience section, and I can't see where that implication is established. In that context, the paragraph stands out as an abrupt diversion from the main theme of the section, and undermines the argument of the entire piece. The section defines developer dissonance, and asserts that it's possible to overcome it with reasoned and sensible questioning. If it's possible to overcome dissonance with reasoned questioning, a hiring interview should be a prime opportunity to roll out some reasoned questions and head off dissonance before it enters the organisation in the first place.


I can't see how making that statement undermines the rest of the argument. It would help if you could clarify that relationship.

And I'm not sure I understand why, if you can't distinguish them, the distinction is unimportant. It's hard to distinguish an edible mushroom from a poisonous one and yet making that distinction makes a huge difference.

Interviews are definitely a limited tool to do so btw, this is only something that you realize over time. It's also very easy to play an interviewer if the interviewee's soft skills are better than the interviewer's (which happens often in this industry).


this made me chuckle

> If GitHub makes a mistake it can affect thousands of businesses but they’ll likely shrug and their DevOps team will just post “GitHub is down, nothing we can do” on some Slack channel.

Gonna try and read the rest of this on the lunch break as it was surprisingly meaty for a clickbait title ;)


I love the style:

> That’s a terrible mistake and in the long run will be the cause of cost overruns, unmet deadlines, increased churn and overall bad vibes. And nobody wants bad vibes.


Good article, but it's a bit binary on the notion of incident. For the same company, it can be very serious to have a global 1h outage, but not so serious to have the internal admin interface down for 1h. This allows for a more fine-grained assessment of the validation required to push to prod: the "checks" only have to test the critical parts of the application. Dev exp starts deteriorating when the non-critical parts are over-tested.


Yes criticality is multidimensional, this was a simplification for the sake of brevity. Will add a note. Thank you!


Love this article. So many great points that I deeply agree with but have never really put into words, and all written in such an engaging style.


Thank you so much!


Just keep the environments separate, but similar. What works in the test environment should work in production.

Of course, there are always exceptions to this rule. Adapt and modify the code as needed.

We keep three environments at work: Dev, Test and Prod. However, dev environments are sometimes neglected and some features land in Test only.

So, use Dev as a development playground. Use Test to test the changes made in Dev. If the change is approved in Test, it will go in Prod environment.


Everybody has got a test environment. Some also have a production environment.


>> If Tesla makes a mistake in their autopilot software, people might die.

In this case, a good "Testing on Production" rule would be to not let customers test your software, period.

There's plenty of land and resources to construct towns and cities that simulate real-life commutes very accurately.

In the case of self-driving (or even autopilot), you're not really testing a feature, you're researching a new product; the difference is vast.


> Shipping confidence: We can define “shipping confidence” as the feeling a mentally sane developer has when they know their code is about to be deployed to production (whether it can be updated over the air or not).

A bug which must be fixed in production is much more expensive than a bug fixed during development.

People here complain when you bash Microsoft, but their philosophy was (and still is) to let the users test the product.


Double negation is hard... :) (yes and no should be switched)

> Ask yourself a question: do you have any reason to think that your engineers will not do a good job? If the answer is no: why are they still there? If the answer is yes: let them do their damn job.


We all test in production but some people are in denial and refuse to accept it.


> The TL;DR is that some (“best”) practices are contextual and understanding when to use them is ultimately what gives us the title of “engineers”.

So well put. Just today I implemented a feature and kept asking myself if I should be extending the component (leaning more towards OOP) or just adding an additional argument to said component. The latter would have stuck more with the current style, but I also realized there's no obvious better way; extending made sense, and I realized that understanding the nuance and standing up for those design decisions is what I am here to do :)

thanks for putting that in fewer words


"Everybody has a testing environment. Some people are lucky enough enough to have a totally separate environment to run production in"


I code at the interface between ops teams (on the business side) of companies and dev teams (on the IT side).

One of the things I've realized is that in most unregulated companies (read: non-healthcare/financial) the business side of the house is used to having little or no lower lifecycle.

If they want to make a process change, they make it on production work.

Granted, they have change control approvals, etc. etc., but the whole dev-test-prod cycle looks extremely different for them, because you can't do certain things without lower environments.


This hasn't been my experience. I think it depends on how business-critical the application is.

I worked at a home remodeling company. Revenue was several million dollars a day. App handled sales, scheduling, logistics, everything. Breaking production was a big deal, it cost us millions per day and created logjams.

I would think that most online applications are the same. Even if a simple online web shop goes down you are costing money.

What kinds of experiences have you had where testing in production was the norm?

    because you can't do certain things without lower environments. 
I agree that this is something many shops REALLY struggle with.

One of the most challenging things is exporting or creating some kind of realistic data set for local development use. I think 99% of companies struggle with this.


Well, the lucky part is the important one. What it boils down to is that such a system is either self-contained or has rigidly defined outside interfaces. For anything that deals with a physical reality outside of the pure computational realm this tends to be impossible. You are not going to build an entire warehouse to serve as the physical part of a testing environment, and even if you did it would not be really useful, because the thing will be different from the production one due to who knows what tolerances involved in building physical things. In the same vein, if you interact with external services you can either mock them or use whatever testing environment the communication partner provides; in both cases it is bound to not behave the same way as the actual production environment.


Ha! Love it.


That should honestly be the norm at any larger company.

Even at startups, the added initial costs yield more long-term benefits with higher-quality products.


I don’t understand why people redefine words just to make their point. It can be confusing at best and at worst change the meaning of words when it becomes viral. “Smart” people means smart people. It shouldn’t be used to mean junior devs who are trying too hard to prove themselves and over-engineer or choose the wrong approach. So many words have changed their original meaning because someone decided to write a viral post and redefine words to make a point.


Oh what an accomplishment it would be, to be able to change the meaning of the word "smart" with a single article!

(Don't take it too seriously, like I said this is mostly a brain dump, I'm sure there's a lot of stuff that can be improved)


I like your usage of "smart" in the article.

I see this challenge a lot in the industry. The young engineers truly are smart, even brilliant, but lack wisdom and experience.


I completely agree. "Smart" isn't used sarcastically here. It's an adjective that most young devs would (rightfully) like to be referred to as. But I see experienced devs as less interested in looking/being "smart" (or clever or whatever word you want to use) and just getting things done in a way that allows the org to make money and get rid of BS (unrelated to the above) as much as possible.

Maybe there's a better way to outline this difference.


Oh right, the "original meaning" of "smart"... so you must mean "pain or ache"? I really don't see how that's relevant to the article.

Words change, they always have, they always will. Get over it.

And anyway, the article's usage is consistent with the well-established phrase "smart guy", within which the word "smart" carries a sarcastic and derisive tone.


> Words change, they always have, they always will. Get over it.

While this is true, I think it is helpful to communication to resist changes to language. This isn't the same thing as opposing change entirely, but language needs to have a certain stability and common understanding to maximize its usefulness.


> I think it is helpful to communication to resist changes to language

Your opinion is wrong. The most widely spoken languages, in every historical period, are the most adaptable. Adaptability is the single most important factor in a language's ability to survive, in a useful/usable/used state, and always has been.


One of the reasons that English has been so successful as a global business language is its ability to be flexible and splodgable and still make sense.


Welp! For some reason someone at HN decided to change the title and bump this down to the 11th position (atm). Not sure what I did wrong here but it feels pretty crappy...

@dang any chance you could help here? :(


For the sake of transparency it was explained to me that the title was too "link baity" and that the comment section was a bit too heated. I appreciate the explanation and I agree this kind of moderation is, unfortunately, required to keep things civil and constructive.



