
Production-Oriented Development - jgrodziski
https://medium.com/@paulosman/production-oriented-development-8ae05f8cc7ea
======
tluyben2
Right. Commenting on a very specific part of this with an anecdote. I did some
integration with a ‘neo bank’ a few years ago where the CTO said testing and
staging envs are a lie. I vehemently disagree(d) but they were paying to only
test on production. I guess you can guess what happened (and not because of us
as we spent 10k of our own money to build a simulator of their env; I have
some kind of pride); testing was extremely expensive (it is a bank, so
production testing and having bugs actually loses money... also they could
not really test millions of transactions because of that, so there were
bugs, many bugs, in their system...), violated rules, and the company died
and got bought for scraps.

I understand the sentiment, and I agree with 2, 3 and 6 of this article;
the rest is, imho, actually dangerous in many non-startup cases.

Example: simple is always better IF you can apply it, but a lot of
companies and people you work with do not do simple. A lot of companies
still have SOAP,
CORBA or in house protocols and you will have to work with it. So you can
shout from the rafters that simple wins; you will not get the project. That
can be a decision but I do not see many people who finally got into a
bank/insurer/manufacturer/... go ‘well, your tech is not simple in my
definition so I will look elsewhere’.

It is a nice utopia and maybe it will happen when all legacy gets phased out
in 50-100 years.

~~~
veeralpatel979
Thanks for your comment! But I don't think all legacy will ever get phased
out.

Today's code will become tomorrow's legacy.

~~~
tluyben2
I agree; I meant 50-100 yrs with a bit of a wink, as no one can reason
about that period of time in software, obviously.

------
alkonaut
This article makes a lot of assumptions that only hold true in a very specific
set of circumstances:

- that it’s possible for the team developing the product to deploy or monitor
it (example cases where it isn’t: most things that aren’t web based such as
desktop, most things embedded into hardware that might not yet exist etc.)

- that _if_ you can deliver continuously, customers actually _accept_ that
you do. Customers may want Big Bang releases every two years and reject the
idea of the software changing the slightest in between.

- not validating a deployment for a long time before it reaches customers
is also only OK if the impact of mistakes is that you deploy a fix. If the
next release window is a year away and/or if people are harmed by a faulty
product, then you probably want it painstakingly manually tested.

My point is: if you are a team developing and operating a product that is
a web site/app/service and you are free to choose if and when to deploy,
then most of the article is indeed good advice. But then you are also
facing the simplest case among software deployment scenarios.

~~~
lazyasciiart
> This article makes a lot of assumptions that only hold true in a very
> specific set of circumstances:

Yes. The assumption that you are working on a web based service is so core to
this piece that it doesn't seem any more necessary to say "this doesn't work
for desktop" than it would be to say "this doesn't work without internet".

_Given_ that you are delivering software on the web, your customers are going
to get changes to it and like it, because their other option is to run systems
on the internet with known exploits. Customers who don't want changes host
their own instance.

And if your next release is a year away and you have no way to roll back the
release, but you have no manual validation - then you aren't following this
advice to begin with, and you have an appallingly broken process.

~~~
alkonaut
Absolutely agree; my complaint wasn't that the advice was bad but that it
lacks the qualifier "what follows is good advice for this small subset of
scenarios". I really dislike the phenomenon that "software development"
without a qualifier has begun to imply web app/service development.

------
wgerard
Cool article; I enjoyed the summary of relevant knowledge that's been
passed around various circles.

I do disagree with:

> Environments like staging or pre-prod are a fucking lie.

You need an environment that runs on production settings but isn't production.
Setting up an environment that ideally has read-only access to production
data has stopped a huge number of bugs from reaching customers, at least
IME.

There are just so many classes of bugs that are easily caught by some sort
of pre-prod environment, including stupid things like "I marked this
dependency as development-only but it's actually needed in production as
well".
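One minimal way to get that read-only guarantee at the connection level,
sketched here with SQLite's URI mode (the path is a placeholder; with e.g.
Postgres you'd instead give the staging environment a role that only has
SELECT grants):

```python
import sqlite3

def open_prod_readonly(path: str) -> sqlite3.Connection:
    """Open the production database read-only, so a staging environment
    can query real data but cannot mutate it. Any write attempted on
    this connection raises sqlite3.OperationalError."""
    return sqlite3.connect(f"file:{path}?mode=ro", uri=True)
```

This gives staging realistic data shapes without the risk that a buggy
deploy writes garbage back into prod.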

Development environments are frequently so far removed from production
environments that some sort of intermediary between development and
production is almost always helpful; to me, the extra work involved in
maintaining that staging environment is well worth it.

It's not the same as production obviously, but it's a LOT closer than
development.

~~~
drewcoo
> You need an environment that runs on production settings but isn't
> production.

Why?

> Setting up an environment that ideally has read-only access to production
> data has saved a huge number of bugs from reaching customers, at least IME.

That's an anecdote, not a reason. Also, just because you've done it that way
doesn't mean it has to be done that way, like you asserted.

> There's just so many classes of bugs that are easily caught by some sort of
> pre-prod environment

Also does not support the claim that you need a pre-prod env.

> Development environments

Whoa, there! You're sneaking yet another kind of environment into the
conversation? Maybe not. This is unclear, given the many different ways that
people do work.

> not the same as production obviously, but it's a LOT closer

You seem to want something like production. There is nothing more like
production than production.

If you're set up to do A/B tests or deploys with canaries or give potential
customers test accounts you're probably able to start testing in production in
a sane, contained way.
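Percentage-based canary routing like that can be a few lines; a hedged
Python sketch (the function names and the 5% threshold are illustrative,
not from the article):

```python
import hashlib

def in_canary(user_id: str, percent: float) -> bool:
    """Deterministically place `percent`% of users in the canary cohort.

    Hashing the user ID (rather than sampling randomly per request)
    keeps each user on the same code path across requests.
    """
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    bucket = int(digest, 16) % 10_000   # stable bucket in 0..9999
    return bucket < percent * 100       # e.g. 5% -> buckets 0..499

def handle_request(user_id: str) -> str:
    if in_canary(user_id, percent=5.0):
        return "new code path"          # canary build
    return "old code path"              # stable build
```

If the canary cohort starts throwing errors, you drop the percentage back
to zero and only a sliver of traffic ever saw the bad build.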

~~~
derefr
> If you're set up to do A/B tests or deploys with canaries or give potential
> customers test accounts you're probably able to start testing in production
> in a sane, contained way.

You seem to be assuming 1. some sort of large-horizontal-scale production
system with multiple customers, where the impact of a failure can be minimized
by minimizing the number of users exposed to new features, and where 2.
there's no type of bug in the code that would potentially take down production
as a whole.

What if your production system is, say, a bank's ACH reconciliation logic? A
medical device? A car? The live server for a popular MMORPG? A telephone
backbone switch? A television or radio broadcast station?

In these cases, your software isn't a _service_ with multiple distinct
_customers_ that each make requests to it, where you can test your new code on
one customer in a thousand; your software is just _running_ and _doing_
something— _one, unified_ something per instance of the system (though that
process may _track_ multiple customers)—and if the code is wrong, then the
whole system the software operates will fail.

How do you test software for such systems?

Usually by having a "production simulation" whose failure won't kill people or
cost a million dollars in lost revenue.

~~~
Roboprog
Thank you for contrasting life building the latest social media platform from
what many of the rest of us do.

Currently I work on systems to prepare and validate birth and death
certificates for the state, counties, hospitals, et al, and this whole “throw
it against the wall and see what sticks” methodology doesn’t fly. Nor would it
have worked when preparing and presenting investment account information 5
years ago, nor the job 10 years ago processing lawsuit and insurance claim
cases and legal bills. Nor any place that I at least have ever worked in the
last 30 years.

~~~
TruffleMuffin
Agree, and I work at the latest 'social media platform' type end. We have
many customers, and I can assure the author of the post: when those
customers pay for enterprise licensing and their system is broken by an
obvious bug, 'we didn't do any testing beforehand because staging is a
lie' doesn't fare well as an excuse for anything. In fact, you just look
like an unprepared and immature muppet.

------
jayd16
Some good points but some controversial ones.

I think a manual QA team is very valuable. Sure, the tests pass, but what
if the UI is confusing or disorienting? QA can be user advocates in a way
a unit test can't. I work in games, so maybe it's just a squishier design
philosophy, but you can't unit test fun.

I also don't understand the worry about other environments. If you're
automating deployments how is another environment added work? Shouldn't it be
just as easy to deploy to?

~~~
dodobirdlord
I think the valuable purpose you are describing of QA is better achieved by
having a UX team earlier in the pipeline.

~~~
jayd16
I do support constant deployments to the QA environment (also a no-no
apparently). That can keep the QA team involved at all times. I wouldn't
suggest waiting on large changes before having QA do a pass.

~~~
stallmanite
Literally constant? As in whilst attempting to replicate a bug the software
could change out from underneath the tester? Would that complicate things or
am I misunderstanding something about the process?

~~~
jayd16
Hmm you're right. We CI to a dev environment that QA pulls in at their
discretion.

~~~
hawaiianbrah
Do you mean you CD to a dev environment?

~~~
jayd16
yes

------
cjfd
I think I disagree with this about 100%. Sure, production is what it is all
about in the end. But how do you know the letters you just typed are going to
be any good in production? They might just crash and burn there. That is why
we need all those quality gates. The sooner, and the farther removed from
production, that you discover a problem, the easier it is to fix.

Regarding 'buy vs build', I think buying software is one of the riskiest
things you can do. Since it costs money, you cannot then say 'o well, i
guess it just did not pan out, let us just not use it'. Now you are kind
of married to the software. And some of the worst software out there is
paid for, e.g. Jira vs. Redmine. This is actually a bit ironic considering
that I am writing software in my job that is sold... Well, it is actually
sold as part of a piece of hardware, so it is not really software as such.

Regarding the last point: failure can be made uncommon if a relatively
safe route to production is available, starting with a language that
verifies the correct use of types, automated tests that verify the
correctness of code, a testing environment that one attempts to keep close
to production, and so on. Getting a call that production is not working is
the event I am trying to prevent by all means possible, and I think
research would show that people who get fewer calls (not just about
production failing, but fewer calls on any subject) live longer and
happier lives.

~~~
williamdclt
> Regarding the 'buy vs build' I think buying software is one of the most
> risky things that you can do. Since it cost money you cannot then say 'o
> well, i guess it just did not pan out, let us just not use it'. Now you are
> kind of married to the software.

It is usually _way_ more costly and risky to develop your own. It's many
hours spent on what is a separate product from your actual product, and
you're far more married to it: you've just spent money, time and energy
developing a custom homegrown solution. What are the chances you'll go "o
well, i guess it just did not pan out, let us just not use it"? Very, very
low.

So you end up spending more money and a significant amount of time/energy for
a product that's probably subpar because there's no reason you'd do better
than companies that are focused on this product.

I think buying software is one of the _least_ risky things you can do, you
know exactly how much money you have at risk and you usually know pretty well
what you're buying. You don't know how much money/time/energy it will take to
make your own solution, and you don't know what result you'll get.

~~~
TruffleMuffin
Regarding your last point: you weigh buying software over building it when
you know how much it costs to buy and maintain, and have a strong grasp on
how much time, money and energy it costs to build it yourself. That is how
you make an informed judgement. Sure there is risk, but if you're burning
15K a year on a build server and you can build it yourself for 5K and run
it for 1K a year, then the math doesn't lie about what choice you _should_
make.
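A quick sanity check of that arithmetic (using the figures above; a real
comparison would also price in maintenance effort and opportunity cost,
which is exactly what the other side of this thread warns about):

```python
def break_even_years(buy_per_year: float, build_once: float,
                     run_per_year: float) -> float:
    """Years until building becomes cheaper than buying.

    Solves: build_once + run_per_year * t = buy_per_year * t
    """
    return build_once / (buy_per_year - run_per_year)

# 15K/yr to buy vs. 5K to build plus 1K/yr to run:
years = break_even_years(15_000, 5_000, 1_000)
print(years)  # ~0.36 years, i.e. building pays off in about 4-5 months
```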

------
nojvek
He lost me at “non production environments are bullshit”.

In dev you can break almost anything, no biggie. In stage if you break
something, great just don’t deploy it to prod. If you break something in prod,
well ... you may end up going below SLA and may legit lose money and your
customers trust.

Don’t YOLO into prod. Build reliable shit.

------
maxwellg
Non-production environments are useful for more than testing application code.
Changing underlying infrastructure (upgrading a database, networking
shenanigans, messing around with ELB or Nginx settings) requires testing too.
Having the same traffic / data shape in pre-prod is not as important.

------
sadness2
Radical! Some counter-points, though.

- Infrastructure as code and schemas as code make it easier to keep
environmental parity, because everything can be rolled back/forwards/reset
with easy source control and CD operations. Visual environment diffing and
drift detection can make this even easier.

- Make your stage and prod into a blue-green situation, where if stage is
ready to go, you flip users onto it. I can guarantee your stage and prod
will both be respected as prod then. Failing that, just add load/stress
tests to stage to make it more prod-like.

- Non-prod environments and attention are not necessarily debt, but they
are expensive insurance premiums. You should only pay those premiums if
you need the insurance. It's about risk management.

- As time passes, the people who wrote a specific part of a system no
longer know it, so having them babysit 'their' code in production has
diminishing returns. On the other hand, a systems quality team with a
broad mandate to bugfix, put in preventative measures, reduce technical
debt, improve observability and establish good patterns for developers to
do these things can enable these things to actually happen, when just
telling devs who are busy making features that they should happen often
doesn't make them happen. Also, some devs enjoy creating new things, while
others love troubleshooting and metrics.
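The blue/green flip in the second bullet is ultimately a single mutable
pointer between two fully deployed environments. A toy Python sketch (the
class and URLs are made up for illustration):

```python
class BlueGreenRouter:
    """Minimal blue-green switch: both environments stay deployed, and
    'live' is just a pointer that can be flipped, or flipped back."""

    def __init__(self, blue_url: str, green_url: str):
        self.envs = {"blue": blue_url, "green": green_url}
        self.live = "blue"

    def live_url(self) -> str:
        return self.envs[self.live]

    def flip(self) -> str:
        """Promote the idle environment; the previously live one
        becomes the instant-rollback target."""
        self.live = "green" if self.live == "blue" else "blue"
        return self.live
```

Because the idle side takes real traffic the moment you flip, it gets
treated with production-level respect, which is the point of the bullet.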

------
ivan_ah
What exactly would be the disadvantage of running something in a staging
environment before running it "for real" in production? I'm assuming the
staging environment is an exact clone of production (except reduced in
size: fewer app servers + a smaller DB instance)?

I understand the deploy-often-and-rollback-if-there-is-a-problem strategy, but
certain things like DB migrations and config changes are difficult to
rollback, so doing a dry run in a staging environment seems like a good
thing...
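For exactly those hard-to-roll-back migrations, one cheap middle ground in
databases whose DDL is transactional (SQLite and Postgres, for example) is
to rehearse the migration inside a transaction that is always rolled back.
A minimal sketch, with illustrative names:

```python
import sqlite3

def dry_run_migration(db_path: str, statements: list[str]) -> None:
    """Rehearse a migration against real data, then always roll it back.

    Surfaces syntax errors and constraint violations without committing
    anything; a staging dry run in miniature.
    """
    conn = sqlite3.connect(db_path)
    conn.isolation_level = None      # take manual control of transactions
    try:
        conn.execute("BEGIN")
        for stmt in statements:
            conn.execute(stmt)
    finally:
        conn.rollback()              # never commit: this is only a rehearsal
        conn.close()
```

It won't catch everything (locking behavior and runtime on the full
dataset differ), but it kills the "migration fails halfway" class of bug.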

------
capkutay
It's funny: in my career I've observed similar development styles, but I
always just thought of this as great/good developers versus
average/mediocre developers. The A+ coders would always make their code
very easy to access, deploy, debug, and read from a user standpoint. The
mediocre guys would wait for someone else to hit a landmine before fixing
something that was obviously wrong.

------
bob1029
While I do not agree with everything presented in this article (especially
item #2), I definitely share the overall sentiment.

For some of our customers, we operate 2 environments which are both
effectively production. The only real difference between these is the users
who have access. Normal production allows all expected users. "Pre" production
allows only 2-3 specific users who understand the intent of this environment
and the potential damage they might cause. In these ideal cases, we go: local
development -> internal QA -> pre production -> production actual. These
customers do not actually have a dedicated testing or staging environment.
Everyone who has seen this process in action loves it. The level of
confidence in an update going from pre-production to production is pretty
much absolute at this point.

The amount of frustration this has eliminated is staggering, at least in
the cases where our customers allowed us to practice it. For many
there is still that ancient fear that if we haven't tested for a few hours in
staging that the world will end. For others, weeks of bullshit ceremony can be
summarily dismissed in favor of actually meeting the business needs directly
and with courage. Hiding in staging is ultimately cowardice. You don't want to
deal with the bugs you know will be found in production, so you keep it there
as long as possible. And then, when it does finally go to production, it's
inevitably a complete shitshow because you've been making months worth of
changes built upon layers of assumptions that have never been validated
against reality.

This all said, there are definitely specific ecosystems in which the
traditional model of test/staging/prod works out well, but I find these to be
rare in practice. Most of the time, production is hooked up to real-world
consequences that can never be fully replicated in a staging or test
environment. We've built some incredibly elaborate simulators and still cannot
100% prove that code passing on these will succeed in production against the
real deal.

~~~
perlgeek
I've worked with a customer who also had a post-production environment. They
used it for the sole purpose of being able to replicate problems and do root-
cause analysis in case things went horribly wrong. Then they took a snapshot
of prod, synced it to post-prod, hotfixed prod as fast as possible, and then
did their detailed analysis in post-prod.

This wasn't cheap; they paid Oracle somewhere between 50k€ and 200k€ a
year just for the database for this environment, but they considered it
worth it. (They were also in a pretty tightly regulated vertical.)

My main takeaway is that I don't think there is a one-size-fits-all answer
to the question of how many and what environments you need. IME having at
least one "buffer" between dev and prod is a good thing, but I'm not sure
to what extent my experience generalizes.

------
luord
I agree with most of the points, but I have serious caveats on the first two.

1. No, the engineers should not by default be on call; the owners of the
product are the first call line. If they're not engineers or if they're
engineers but don't have enough time to deal with all incidents–in short, if
they need to delegate–they better be willing to pay _very_ generously for the
extra hours of on call duty.

2. No, hosted is not better than open source, both for philosophical and
operational reasons: mostly, you become subject to the whims of the provider.
A good compromise is _hosted open source_ solutions, which at least takes you
half way to a migration, if the need for one comes up.

That aside, I very much agree on everything else.

------
gfodor
I agree with most of this but the point about QA gating deploys should be
amended. A 5 minute integration test on a pre-flight box in the production
environment by the deploying engineer is a form of QA, and can catch a lot of
issues. It shouldn’t be considered anti pattern. Manually verifying critical
paths in production before putting them live is about the best thing you can
do to ensure no push results in catastrophic breakage.

Without such a preflight box, or automated incremental rollouts, you are kind
of doing a Hail Mary, since you are exposing all users immediately to a system
that has not been verified in production before going live.
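A preflight check like that can be a short script. A sketch in Python,
where the base URL and critical paths are hypothetical placeholders; a
real check would hit your own routes and assert on response bodies, not
just status codes:

```python
import urllib.request

# Hypothetical critical paths for the deployed service.
CRITICAL_PATHS = ["/health", "/login", "/checkout"]

def preflight(base_url: str) -> list[str]:
    """Hit each critical path on the preflight box; return any failures."""
    failures = []
    for path in CRITICAL_PATHS:
        try:
            with urllib.request.urlopen(base_url + path, timeout=5) as resp:
                if resp.status >= 400:
                    failures.append(f"{path}: HTTP {resp.status}")
        except Exception as exc:
            failures.append(f"{path}: {exc}")
    return failures
```

Wire it into the deploy script so a non-empty failure list blocks the
promotion to live.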

------
0x445442
I agree with most everything said in the article, but with a big
condition: if I as an engineer am responsible for everything the author
says I should be responsible for, then I want total control of the tech
stack and runtime environment.

------
VHRanger
Why would you ever share the medium version with all the added crud when the
author has a version of the post on his personal blog?

------
zentropia
I'm sorry, but I'm tired of the Medium paywall to the point that I don't
want to read anything there.

~~~
GordonS
Also published at: [https://paulosman.me/2019/12/30/production-oriented-
developm...](https://paulosman.me/2019/12/30/production-oriented-
development.html)

------
banq
This is just Agile or DevOps again.

------
polote
> 2. Buy Almost Always Beats Build

Strongly disagree with that. Maybe it is a good idea when you are
over-funded by VCs, where the cost of money is zero and you don't want to
master what you are working on, but in all other cases this is wrong. You
shouldn't rebuild everything from scratch, but creating a company is not
the same as playing with LEGO.

And this is the same argument as saying you should have everything in AWS
because if you self-host you will have to hire a devops engineer.

~~~
simonw
Could you expand more on why you disagree with this? Do you believe the
opposite - that "Build Almost Always Beats Buy"?

I've made the build-vs-buy decision many times in my career. I don't
necessarily regret /all/ of those times, but the general lesson I've learned
time and time again is that you're going to end up investing WAY too much time
maintaining your special version of X when you should have spent that time
solving problems unique to your business model.

