
All of us test in production all the time (2019) - jrvarela56
https://increment.com/testing/i-test-in-production/
======
siliconc0w
So it's a fair point that the fidelity of non-prod environments is inherently
limited and that you _still_ need a bunch of other stuff like canaries,
automated canary analysis, automated rollbacks, zone fault-tolerance, feature
flags, chaos engineering, and server- and client-side instrumentation, but
this is generally 'next-level' stuff when most shops aren't even getting the
basics right.

The overwhelming majority of non-prod environments and release processes I've
seen have basic problems and would catch a lot more issues with additional
engineering investment. E.g., releases aren't automated, self-contained, or
idempotent; unit test coverage is bad; integration tests are non-existent;
the data used isn't reflective of reality; there is no performance or load
testing; and downstream or upstream systems have poorly defined interfaces
and are excessively mocked or just not considered.

~~~
faizshah
Wholeheartedly agree to everything you’ve said.

I have worked at a lot of places where automated testing is almost non-
existent. Instead they use staging/testing environments and manual testing.

For all these projects they frequently have downtime or errors in prod where
something was uncaught. It takes ages to release a fix because they have to
carefully research how a change affects other code instead of running a test
suite.

They are convinced that unit tests are unnecessary extra work for their
project, but they don't realize they are losing time by having to manually
test and research changes. I think a lot of the focus on fancy testing
practices has led some programmers to think testing is a load of extra work
when things will fail in prod anyway. We need to push back and say automated
tests are a way to codify assumptions about the code and help speed up the
prototype-build-test-release loop.

Side-note: self documenting code is a myth but I digress.

~~~
anon9001
> They are convinced that unit tests are unnecessary extra work for their
> project but they don’t realize they are losing time by having to manually
> test and research changes.

If people are finding tests unnecessary extra work, maybe they're right for
their project.

Most jobs are just "ship feature asap, try not to break anything, repeat".
Most managers don't care about code quality as much as speed of delivery. Most
coders don't have much "skin in the game" and can always get another job when
the project fails.

Writing tests is a sign that you know what you're building, you have a
methodology for building it, and you have a commitment to keeping it working
in the future. Those simply aren't the requirements for most programming work,
and it can be a disservice to both the coders and the management to push for
testing that won't help achieve the business goals. If a business needs to
iterate fast and reliably, write tests. If a business needs to iterate fast
and is ok with breaking things, skip the tests and have a rollback plan
instead.

~~~
faizshah
Sorry, I have to completely disagree with this. This is a horrible, but often
used, excuse.

In my experience these coders are doing exactly what you would do in a unit
test but instead they are doing it manually by adding printlns and testing
various values on the command line, in a repl, a notebook, or a main. Instead
of manually adding these things every time you want to test your work, if you
just added some basic unit tests it would speed up your delivery. Not only
that but you can't "forget to test that" because you're forced to write down
the test cases instead of deleting them when you're done.

If you add up all the time you spent adding printlns to your code, testing
various input values in a main method, researching inside your codebase to
see what could fail, debugging your ETL because you forgot about an uncaught
error or a different kind of null in a column that only came up after it
processed 3TB of data (real example), and so on, I guarantee you would save
time by implementing some basic automated tests instead of testing in prod or
staging.

------
simonw
It seems likely that some of the comments here are reacting to the headline as
opposed to reading the article - which is one of my favourite pieces of
writing on the subject of observability and responsible maintenance of large
scale systems.

Lots of great quotable bits in here. Here's just one for people who didn't
make it to the bottom:

"There’s a lot of daylight between just throwing your code over the wall and
waiting to get paged and having alert eyes on your code as it’s shipped,
watching your instrumentation, and actively flexing the new features."

~~~
dang
What would be a more accurate and neutral title?

~~~
Shoop
Maybe "Once you deploy, you aren’t testing code anymore, you’re testing
systems"? Trying to lift something out of the article instead of writing a new
title.

~~~
dang
I saw that too, but I think it's a bit too obscure as a title.

------
inopinatus
Apropos of this in the context of hiring, one of the biggest green flags I'll
attach to someone's CV is any combination of infrastructure/operations and
application development at the same job.

Since the 1990s I've worked around ISP, hosting, and cloud firms. Many have a
core of general purpose people that can't help themselves but have one foot in
both graves (we call this DevOps or SRE now, but those are new labels for a
long-standing viewpoint). This often correlates to someone who embodies the
mindset in the article, viz. that they will gladly and actively defend
any ditch we dig together. They always get on my interview list (the green
flag). Very often the interviews are a wide-ranging, free-flowing, and in-
depth discussion of multilateral & cross-functional technology/process
interactions.

Many of the people I've met with this combination will progress, either
immediately or eventually, to become very effective CTOs, tech co-founders, or
the highest levels of IC at larger tech firms.

Interestingly, and I say this purely anecdotally because I am not actually
qualified to make the diagnosis, some of them also appear to me to have an
attentional difference, or present from an unprivileged background, and may
not have followed a standard educational path as a result of either. Which is
to say that I usually delete "must have bachelor's degree" from any JD that
HR asks me to authorize.

~~~
ficklepickle
Wow, you just described me. Except I'm about 4 years into my career. Glad to
hear that I have a type. It is oddly reassuring in a manner that I can't
currently articulate.

Most places I apply I just don't hear back from. Places I have worked had no
idea how to use me effectively. I've applied for very few jobs because of this
(on the spectrum, have trouble with rejection). I do a lot of free or
underpaid work or just do something that fascinates me. The poverty sucks but
I love that I get to pursue so many interesting things.

I have faith that I will find somewhere that is a good fit one day. In the
meantime, I get to explore my interests and grow. Thanks for this post, it
gave my hope a much needed boost.

~~~
xelxebar
Your self-description aligns very much with my own. Feel like we might get
along well. Would love to chat. Shoot me a PM if you're willing.

Cheers, mate. There definitely is a place for people like us in the world.

------
aunty_helen
>Nobody invests in their “test in prod” tooling.

Firstly, what is logging then?

How is this not tooling to ensure things are running smoothly? `less +F`
anyone?

Secondly, if you're running an aws/azure/gcp based server you now have a
ridiculous amount of tooling for production diag, analytics and tracing.

~~~
mgkimsal
"Firstly, what is logging then?"

"Secondly, if you're running an aws/azure/gcp based server you now have a
ridiculous amount of tooling for production diag, analytics and tracing."

This presumes that you, the person expected to fix problem X, actually have
access to the logs and the servers where problems are happening. I've been at
multiple projects/companies where this simply Isn't Allowed(tm).

"You wrote this code, you need to fix it! It's broken!"

"Let me get on the server and take a look at the logs to see what's going on".

"That violates our security policy! You can't do that!"

I was tasked with 'finding a problem' and was told to look in the logs. We
knew what _day_ the problem happened, but I couldn't get anyone to confirm
whether the log files were in UTC or something else (turns out it was
something else).

HOWEVER, I never actually got the log files themselves. I didn't have direct
access (that takes about a week to go through the chain of command to
approve), so someone just sent me small snippets from where they thought the
problem might be.

So even orgs that tick all their checkboxes of stats/analytics/logging...
sometimes seem to forget that access to the collected info is a requirement
too.

~~~
aunty_helen
Feel your pain <3 It's been a few years now, but I have worked at Big Corp
Inc. before.

I know what it's like to ask someone in charge of 8k people to personally
approve my server permission escalation even though he's never met me, doesn't
know anything about the IT systems, has never visited the site I work at, and
has probably never even been in the country I reside in.

Anyway, a good thing on this front now is that (speaking from Azure
experience) you can dump logs directly to blob storage (S3, or whatever
Google calls their data storage). Then you only need permission for that.

As for the non-UTC servers... -_- but if it's a problem, you can always append
the tz info to the log date format.
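
For what it's worth, a minimal sketch of that with Python's standard logging
module (just an illustration; the same idea applies to whatever logging stack
is in use): the `%z` directive appends the numeric UTC offset to every
timestamp.

```python
import logging

# Include the UTC offset (%z) in every timestamp so readers of the log
# don't have to guess which timezone the machine was set to.
logging.basicConfig(
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
    datefmt="%Y-%m-%dT%H:%M:%S%z",
    level=logging.INFO,
)

logging.getLogger("app").info("request handled")
# e.g. 2019-05-28T14:03:11-0700 INFO app: request handled
```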

~~~
jagraff
The problem the above user had was that the logs were already written without
tz info, so they had no way to know. Assuming the timestamps were in the tz
of the machine producing the logs, you could probably figure out the tz from
the machine's IP address (assuming that's in the logs as well...).

~~~
mgkimsal
What I learned was that the logs are timestamped to whatever machine they're
on, wherever it is. The servers in London - GMT. The servers on the west
coast - Pacific time. Servers in DC - Eastern time. Not ideal.

~~~
jagraff
Right - assuming the server IP, or at least some other network identifying
factor, is in the log, you could write some sort of regex to parse the logs
and identify the correct time. Of course, that depends on someone being able
to actually parse the logs with your regex
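
Roughly the kind of thing I mean (a sketch; the host-to-offset map and the
log format are hypothetical, and it ignores DST):

```python
import re
from datetime import datetime, timezone, timedelta

# Hypothetical mapping from IP prefix to that datacenter's local UTC offset.
HOST_OFFSETS = {
    "10.1.": timedelta(hours=0),    # London boxes (GMT)
    "10.2.": timedelta(hours=-8),   # west coast (Pacific, ignoring DST)
    "10.3.": timedelta(hours=-5),   # DC (Eastern, ignoring DST)
}

LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (?P<ip>\d+\.\d+\.\d+\.\d+)"
)

def to_utc(line):
    """Parse a log line and return its timestamp normalised to UTC, or None."""
    m = LINE_RE.match(line)
    if not m:
        return None
    naive = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S")
    for prefix, offset in HOST_OFFSETS.items():
        if m.group("ip").startswith(prefix):
            return (naive - offset).replace(tzinfo=timezone.utc)
    return None

print(to_utc("2019-05-28 09:15:00 10.2.4.7 GET /health 200"))
# 2019-05-28 17:15:00+00:00
```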

~~~
perl4ever
I once found a solution to determining the timezone when it wasn't available -
a field with only a date was unnecessarily and incorrectly converted to UT,
meaning that the zone could be derived from the offset from midnight and
applied to other fields.

If enough things are screwed up, you can possibly find the solution to one
issue in another.
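
A toy version of that trick, assuming the date-only field really was pushed
through a local-to-UTC conversion (so local midnight became some non-midnight
UTC time):

```python
from datetime import datetime, timedelta

# A date-only field ("2019-05-01", i.e. local midnight) that was wrongly
# converted as if it were a local timestamp shows up as, say, 07:00 UTC:
mangled = datetime(2019, 5, 1, 7, 0, 0)

# The offset from midnight is exactly the offset that was applied, so this
# machine was at UTC-7, and the same correction can be applied to the other,
# legitimately-timestamped fields in the same records.
offset = timedelta(hours=mangled.hour, minutes=mangled.minute)
print(offset)  # 7:00:00 -> the records came from a UTC-7 machine
```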

------
t-writescode
Would a better word be "validating" in production?

I try to validate all of my changes in production, see them run, see them
output expected values, etc.

I try very hard to test everything I can prior to production; but, then I read
loglines, watch graphs or run the code myself, all to validate that all the
interconnected pieces are working together.

------
bluedino
Most places that I've worked that "test in prod" do so for one reason: it all
comes down to laziness.

The first scenario is they have never set up a test environment in the first
place. They're either too lazy to do so, or too lazy to look into how to do
it. Often confused with being 'too busy to do it'.

The second scenario is that they do have a testing environment; however, for
some reason, it's broken. Some change was made, usually in a hurry to meet a
deadline, and it's been broken ever since. This one is usually a result of
the fix being too difficult, because the test environment was haphazardly
thrown together in the first place.

~~~
sagichmal
> The first scenario is they have never set up a test environment in the first
> place. They're either too lazy to do so, or too lazy to look into how to do
> it. Often confused with being 'too busy to do it'.

One lesson of modern architectures (i.e. anything more recent than the
LiveJournal-style Web/App/DB 3-tier stack) is that it is literally impossible
to create and maintain a test environment that has enough similarity to prod
to be useful.

~~~
UweSchmidt
As QA I think testing, test environments, and test resources should be
first-class concepts in software. Building whatever "modern" architecture
twice and simulating traffic is not easy, but it is possible, at least to an
extent. You need to get that extra license for any piece of software used for
testing during procurement, and provide a means to create test resources,
typically test users, as needed. Sounds trivial, but it is often so
complicated (banks, insurers...) that it should have been considered early
during design.

So, I wouldn't use the word "lazy", but we could do better.

~~~
sagichmal
Modern architectures look like this:
[https://github.com/donnemartin/system-design-primer](https://github.com/donnemartin/system-design-primer)

Building this out twice is either actually impossible, or so cost prohibitive
as to be practically infeasible.

~~~
eitland
> Modern architectures look like this:

A lot of modern architecture is not like that.

And even the diagram there is something that should be possible to replicate
if the organization values it.

Today it doesn't even need to be too expensive, as it can be deployed with
Terraform and torn down an hour later when the full system tests are
finished, going back to running the integration tests.

~~~
sagichmal
It's not possible to replicate the CDN, or DNS, or the message queue if it's
hosted, or the database if it's big enough, etc. The differences between what
you're able to create in a staging environment and what exists in prod are
significant enough that testing against the staging environment doesn't build
much more confidence than testing with mocks.

Testing in prod means accepting this reality, avoiding the largely
unproductive toil of building and maintaining a staging env, and making the
production system observable, resilient, and operationally agile enough that
you can deploy changes that can trigger unpredictable and emergent system
behaviors while managing risk appropriately.

------
mickotron
Once upon a time I was a tester. Now I am a sys admin/DevOps engineer (what's
in a name, really?).

You would not believe the number of bugs I would find in both vendor code and
internal code. Bugs that require dynamic execution and context-specific
scenarios to reveal themselves. Then you had misinterpretations of
requirements, which is not something the developers would find, considering
they wrote the code and believed it met the requirements.

I worked for the police force of my region as a software tester for all their
critical systems. I once developed a test tool that created packets to mimic
mobile phone calls to police dispatch. I tested every combination of the data
spec, including the very last one, where there was a critical bug. The vendor
code had implemented the earlier version of the spec, not the latest. The
vendor's test tool created the wrong packets as it targeted the wrong spec.
Mine caught the bug because it created the correct packets.

As much as I agree with the author's ideas of familiarity with Production,
proper unbiased and independent testing is still something this industry
needs, even if it's unfashionable.

------
tehlike
Most large companies encourage testing in prod. It's called a split test.
Clearly a precursor to that is staging, and a precursor to that is
integration tests and whatnot. But none of that catches what you can with an
a/b test and great monitoring.

~~~
bszupnick
Exactly what I was thinking, but I think that a/b test "feature" is mostly
framed (and thus used?) for BI purposes as opposed to "let's see if this
breaks".

Don't get me wrong, where I work they definitely are open to making mistakes
in production. We roll out "risky" new features with split tests and I think
we use that logic well so this notion obviously exists, but I'm not sure how
widespread it is.

~~~
tehlike
I have seen Google and Facebook first hand. The practice is widespread, but
the acceptance of "let's test it in prod" as a real stage in software
development varies. Some people look at you like you are a crazy person who
should not be trusted; some people acknowledge it and know it to be true.

Feature testing is one thing, but you can also a/b test a whole binary
between the new and old versions to catch interesting bugs.

~~~
wikibob
Do you mean acceptance at FB/G? Or acceptance at companies with less mature
engineering?

~~~
tehlike
Both. Big companies are not homogeneous, so the practices also vary among orgs.

------
csours
Well no, but actually yes.

Kind of like how you don't get requirements from talking to customers or "the
business". You get requirements from demoing. If your first demo is your prod
deploy, then guess what, you've just deployed a proof of concept. It's a de
facto thing that just happens even if you don't plan on it. In fact it happens
BECAUSE you don't plan on it.

More to this point: there are many different kinds of tests and what we are
most commonly referring to is "regression tests". There are lots of bugs that
are not due to regressions.

~~~
simonw
"You get requirements from demoing."

I really like that - what an insightful observation, neatly condensed into a
few words.

------
throwawaw
Headline is unfortunate, and then the first part of the essay is spent
justifying/walking back the title. The good part starts a few paragraphs down,
at "We conduct experiments in risk management every single day".

------
vinceguidry
A lot of this boils down to the constant churn in frameworks, instrumentation,
programming languages. Developers don't have time to master one way of doing
things; it's just a constant lava layer of crap.

I took the time to learn how to build and maintain Ruby on Rails systems,
hoping it would be the ticket to a fun, manageable career. Any project I
worked on was an island of sustainable, fast development where the team
finished the sprint's work in a few days. I knew where the bottlenecks were
should the project need to scale, and scaling issues never took down the site
completely.

Only to throw all that expertise away when it was just decided to replatform
one day. Because I guess stable just wasn't good enough. After it happened
twice, well, might as well go into devops. Y'all are gonna need it.

~~~
tluyben2
I had the same issue with Rails as I have with Node: the churn is detrimental
to long-standing projects. Most projects I do run for years and even decades;
try running a security update on Rails code that is 10 years old. It is a
nightmare. I did not have issues like that with PHP, ASP.NET, or Spring.

~~~
vinceguidry
I loathe NodeJS with the passion of a thousand suns. When Rails projects
churn, at least you can rely on the stability of Ruby. With Node, the number
of times I had to work around inconsistencies in the framework and even the
language was maddening. Babel is a horrible, horrible thing, and Javascript
was never meant to run on a server.

Maybe it's better now. But I won't ever do serious work again with the stack
so I'll never find out.

~~~
rybosworld
Node is getting better each year imo. I've been pretty happy with ES6.

What makes you say Babel is horrible? Not disagreeing but I haven't heard
anyone say that before.

~~~
ratww
People don't say it directly, but all the constant complaining about the bloat
and complexity of the JS ecosystem and the size of the node_modules folder
boils
down mostly to two tools: Babel and Webpack.

~~~
pydry
When people make fun of leftpad they are not talking about babel and webpack.

The bloat is a lot about the lack of a standard library in JS. Or rather, too
many of the damn things. That and an inconsistent and unpredictable type
system.

~~~
sli
> The bloat is a lot about the lack of a standard library in JS.

Which is why someone created core-js, which a bunch of libraries use but, in
my experience, never update. Every React project will have multiple libraries
complaining that core-js 2 is deprecated.

So basically even the "fix" has all those problems.

> That and an inconsistent and unpredictable type system.

Agreed and honestly, I don't think TypeScript's solution for third party types
is any better, either. I recently ended up in a situation where `some-library`
and `@types/some-library` got out of sync and I simply had to try versions of
`@types/some-library` until it would compile again. This happened because
typedef versions are not pinned to library versions in any way; they are
versioned just like any other library, and to the best of my knowledge you
can't simply look up which @types release you need for a given library
version. You can only enumerate released versions and hope the latest of each
works fine. Was there a breaking change that the typedefs
haven't accounted for yet? Well guess what, you can't update that library,
yet, even if there were security fixes. Your code won't compile.

This means you could potentially have `some-library` at version 1.2.1 and
`@types/some-library` at version 1.1.0. There's no relationship there. This
happens the moment there's a revision in a library that doesn't require a
revision in the typedefs.

Sorry to ramble, I'm just really unhappy with the state of our two major ES
languages.

------
ianamartin
I love articles like this because it's so easy to just add that company to a
list of places to never ever work.

I did read the whole article, btw. It's an absolute clickbait title that the
author doesn't really mean, and after the article spends a lot of time
defusing the clickbait title it really boils down to, "This is hard, so I
give up."

It's true that many--if not most--companies operate this way without ever
acknowledging it. And that's bad. It's also true that systems are harder to
test than code. But it's not deep fucking magic. Look at the work aphyr does.
Look at the testing work that the FoundationDB team did to prove their
system's guarantees. Look at the work that security and devops people do every
day. She is right that it is hard to test systems. So what? We don't get paid
as much as we do because it's easy.

In a certain environment, it is truly impossible to test a system. That's when
you have a dev culture that refuses to actually design knowable systems. A
much better approach for the article would be to address exactly why systems
are so hard to test rather than just saying fuck it. Everything she cites in
her list of things that are hard to test is absolutely testable, if you have
a knowable system. The real problem here is that agile/scrum/Xtreme
programming practices inevitably and by principle do not result in knowable,
testable systems. When you have 30+ agile teams on their own sprint cycles and
product managers leaning on them to ship features and figure the rest out
later, there can be no other result than a fragile, broken, unknowable,
untestable system.

But the answer to that isn't "Everybody else is doing it so why can't I." The
answer isn't to "embrace it." The answer isn't "This is hard, fuck it." The
answer is most definitely not to make individual engineers pay the price of
being on call because a company's culture and process are totally and
completely hosed.

The answer is to address the problems in your company that caused this
situation in the first place. The answer is to get your head out of the
feature cult and the velocity war and reset your priorities. Systems aren't
hard because your engineers suck. They're hard because companies suck. Systems
are hard because in most places, no one is allowed to spend more than a couple
minutes _thinking about the systems._

Agile culture after your early startup cycle is a lot like being a 40-year-old
guy who's 30 lbs overweight. How did this happen? How did I get here? I was
just taking life one thing at a time and getting shit done. Now nothing works
quite as well as it used to, it's harder to find dates, and everything just
sort of hurts. Would anyone in their right mind just say, "Embrace it! Most
40-year-old tech dudes look about like you and are in the same situation! It's
fine!" No. Of course not. You have to realize that your priorities have been
totally broken for the last 15-20 years of your life, that you really weren't
getting shit done, and you have to take some responsibility for your diet and
get off your ass and exercise.

That's what companies have to do. They won't, of course. But they have to,
otherwise they'll die young deaths. The author is totally correct when she
recognizes a terrible symptom of unhealthy companies. But her treatment is
hopelessly and tragically wrong.

~~~
joshuamorton
This works great if you're building something with a tightly controlled API.

If, however, your configuration space grows to an even middling size, it is
no longer feasible to do much of this validation across the configuration
space. A good example is any system where the user can customize system
aspects. Do you run all of your integration tests across the full
configuration space?
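
A toy illustration of how fast that space grows (the knobs here are made up;
the point is only the combinatorics):

```python
from itertools import product

# Four modest, user-customisable settings...
auth_modes = ["oauth", "saml", "api_key"]
storage_backends = ["postgres", "mysql", "s3"]
locales = ["en", "de", "ja", "fr"]
feature_flags = [
    frozenset(),
    frozenset({"new_ui"}),
    frozenset({"new_ui", "beta_search"}),
]

# ...already yield more configurations than most teams will ever run
# integration tests against, before accounting for versions or data shape.
configs = list(product(auth_modes, storage_backends, locales, feature_flags))
print(len(configs))  # 108
```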

Additionally managing configuration skew between a dev and prod environment is
not simple. Simply claiming that there should be no skew doesn't work. Often
you want the prod and dev environments to run as different users, and you
certainly want them to have different acls (your dev environment should not
have access to your production database).

So you now have to, across your configuration space, validate that only the
things that are "supposed" to be different differ, and that the things that
aren't don't. Which maybe works for a while, but your prod configuration may
also differ across parts of prod if, for example, a change is being canaried
or incrementally deployed.

I've spent a non-trivial amount of effort on trying to solve the one problem
of configuration skew between dev and prod for _one_ real system. It's
ultimately not worth it. The effort expended to "fix" that would be more work
than not. And I mean that in the long term: the effort to maintain and follow
the rules that such a system would impose is more effort than dealing with the
annoyances of unintended skews.

Systems are hard because systems are hard. There's no good company that
doesn't test/experiment in production. All of them do.

~~~
ianamartin
I didn't say that any of this was simple or trivial. Again, it depends on
your priorities and your values as well as your company culture. In fact, I
specifically said that testing systems is hard and provided examples of how
hard systems are to test. Do you think that Cassandra is a tightly controlled
API with a small configuration space?

You seem to feel like close enough is good enough. And that's the cause of the
problem I'm trying to address here. Does it really matter if you don't get a
notification when someone messages you on Facebook? Or if you get two
notifications? Is that particular problem worth testing every possible Kafka
configuration? I _think_ what you are saying is no, it doesn't matter.

But I'm arguing a different point. I'm not arguing about whether the
testability of any individual feature is important. For obvious reasons: some
features really just aren't that important. But not being able to do that, and
actively choosing not to understand that system is a symptom of a far deeper
problem. When a company makes the choice you have just described, the company
has decided to accept that they can't, won't, and will never fully understand
their own systems. It's often not a conscious decision, it's a decision made
by habit, policy, and culture, which is what's so subversive about it. People
don't make big-picture decisions to intentionally have a system that is
unknowable/untestable. People make small decisions just like the ones you are
talking about that make systems that way. And it's the practice of letting
lots of disconnected people make the small decisions of what does and doesn't
matter, what is and isn't worth it that destroys systems.

Systems are hard, and I agree with that, but systems are made even more so by
bad process.

The getting-old analogy didn't seem to resonate with you, which is fine. But let
me ask you a question about a system.

You have a database. It gets backed up every night. Or maybe every hour. Your
job is to take snapshots and store them because that's what you're supposed to
do. Yeah, I know, that should be or can be automated. Whatever.

The big picture system and purpose is that you are supposed to be able to
recover from a hardware failure/data loss. But that's not your problem. Your
problem is that you have to back up the database manually every day. The data
team only tests restoring backups from dev to dev instead of prod to dev.
Because reasons. Because it's hard.
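
To make it concrete, the gap closes with something like this (hypothetical
helper and queries; the point is that a backup only counts as tested once a
prod snapshot has actually been restored somewhere and sanity-checked):

```python
def verify_restored_backup(conn, minimum_rows=1):
    """Sanity-check a database that was just restored from a prod snapshot.

    `conn` is assumed to be a DB-API connection to the restored copy.
    """
    checks = {
        "users": "SELECT count(*) FROM users",
        "recent_orders": (
            "SELECT count(*) FROM orders "
            "WHERE created_at > now() - interval '7 days'"
        ),
    }
    failures = []
    cur = conn.cursor()
    for name, query in checks.items():
        cur.execute(query)
        (count,) = cur.fetchone()
        if count < minimum_rows:
            failures.append(f"{name}: only {count} rows after restore")
    return failures  # empty list means the restore looks plausible
```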

That type of backup system checks all the boxes you're supposed to check when
you get audited. Or at least enough to get through it. But when you really
need to understand the system, it fails for all kinds of reasons and people
are sitting around looking at each other saying, "well I did what I was
supposed to do."

Individuals sitting around making isolated, disconnected decisions like the
ones you're talking about (i.e., it just isn't worth it; it's not feasible;
it's hard) compound in organizations and create the kinds of systems _you_
don't want to deal with. You're making your own hell here. You seemed to have
missed that key point in my earlier comment.

Laziness is a good trait in an individual programmer. But laziness is the
absolute death of an organization. Agile is really just distributed,
organizational laziness. That's what creates horrible, unknowable systems.

Conflating test/experiment with what the original article claimed to be
talking about (and then later walked back) is borderline disingenuous. No one
is talking about A/B testing or intentional experiments.

The article is talking about rolling the dice in production deployments and
claiming that's fine and something to be proud of. It isn't fine, and it's not
something to be proud of. She's the CEO. She should fix her company instead of
being proud of how bad it is.

A lot of what we're talking about here is a matter of perspective. And that is
the problem I'm taking to task both with you and with the article.

~~~
joshuamorton
> I think what you are saying is no, it doesn't matter.

No I'm not saying that. I'm saying that the best way to prevent that isn't
always to have a staging environment that mirrors production as well as you
can.

> Individuals sitting around making isolated, disconnected decisions like the
> ones you're talking about (i.e., it just isn't worth it; it's not feasible;
> it's hard) compound in organizations and create the kinds of systems you
> don't want to deal with. You're making your own hell here. You seemed to
> have missed that key point in my earlier comment.

No, this was an intentional decision _by the organization_: after significant
effort expended by the organization, the conclusion of the people asked to
investigate the problem was that solutions would not be feasible and would
not improve things, and that the organization shouldn't continue to invest
time in solving the problem this way. You're acting like these decisions are
always made in a vacuum. They're not. Often smart organizations investigate
and make decisions at the level of leadership.

> Conflating test/experiment with what the original article claimed to be
> talking about (and then later walked back) is borderline disingenuous. No
> one is talking about A/B testing or intentional experiments.

Are you sure?

FTA:

> We conduct experiments in risk management every single day, often
> unconsciously. Every time you decide to merge to master or deploy to prod,
> you’re taking a risk.

> A healthy culture of experimentation and testing in production pulls
> together all three.

Canarying is just testing in production, but you have processes and
"guardrails" (quoting the article) to make sure that it is done safely by
default.
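
As a rough sketch of what such a guardrail can look like (thresholds and
metric plumbing are hypothetical; real automated canary analysis compares
many metrics statistically):

```python
def canary_healthy(canary_error_rate: float, baseline_error_rate: float,
                   max_ratio: float = 1.5, min_abs: float = 0.001) -> bool:
    """Return False (i.e. roll back) if the canary is meaningfully worse
    than the baseline serving the same traffic mix."""
    if canary_error_rate <= min_abs:
        return True  # too few errors to matter either way
    return canary_error_rate <= baseline_error_rate * max_ratio

# e.g. keep promoting the canary only while it stays within 1.5x of baseline
assert canary_healthy(0.002, 0.002)
assert not canary_healthy(0.010, 0.002)
```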

For the record, I work primarily on reliability and release/experiment, so
I'm well aware that being explicit about your decisions is _vital_, as is
knowing the tradeoffs involved. That's why pretending that you don't test in
prod is a bad idea, because you almost assuredly do. That's what the article
is saying.

Edit: As for Cassandra, it looks like they have system bugs caught in
production, so I'm not sure what your point is:
[https://issues.apache.org/jira/projects/CASSANDRA/issues/CASSANDRA-13810?filter=allopenissues](https://issues.apache.org/jira/projects/CASSANDRA/issues/CASSANDRA-13810?filter=allopenissues)

------
danpozmanter
"Engineers should be on call for their own code." \- Would you rather work
someplace you are expected to be on call 24/7, or a company that doesn't
require that?

It isn't the norm, and it isn't competitive. It's just more "always on"
culture in the workplace - and that's not healthy. A company should understand
workers need real breaks - and being on call is not a real break.

~~~
jsmith12673
Want to jump in here - I have worked at a company where engineers are not on
call for their code, and it was a living nightmare.

_You_ might not be on call for your code, but _somebody_ will be. Often some
poor SRE/ops person that has absolutely no idea what the app is doing/or why
it's failing in production.

Not being on-call makes engineers complicit. I've seen it all, known memory
leaks shipped into production, apps where half the endpoints couldn't even be
compiled, code dumping the production redis at 1AM ... and every time the
pain just fell on deaf ears.

If your code is what wakes you up in the middle of the night, you have:

- Incentive to fix/mitigate as soon as possible.
- No blame game to play. Either the error was made by you or by someone on
  your team. It doesn't have to go up 3 rungs on the ladder and then back
  down again.

I don't think the author was suggesting that everyone should always be on
call, just that you _must_ be responsible for your own code in production.

~~~
ratww
I'm always happy to help some poor SRE in the middle of the night, and I once
even drove to the office on a rainy Sunday, in the middle of my vacation, to
access IP-restricted stuff because a support intern messaged me on Instagram.

...but with that said: I'm glad I only worked in countries where work is
properly regulated and "on call" means "I'm getting fucking paid every cent
for each hour I _must_ answer that goddamn phone". Which in practice means
there's no PagerDuty.

The unpaid on-call culture is bullshit. The company can either pay me or go
fuck itself.

~~~
jsmith12673
I unfortunately work in a place where on-call is unpaid. I'm an SRE stuck in
the 90s.

The policy states that only the Operations team gets paid for on-call,
because I guess in the old days they would be the ones expected to deal with
production.

Fast forward to today, and the Operations folks are a small team managing 2
datacentres, and all on-call rotations between SREs and developers are
considered unofficial and therefore not eligible to be paid.

One of our Sr. Managers tried to take this up the chain, but then got
reprimanded for putting developers on-call.

------
m1117
Test in prod is terrible. Write unit and integration tests, test on staging,
and then monitor Prod.

~~~
site-packages1
Sorry to make the assumption that you didn't read the article, but it really
sounds like you didn't and are just making a comment about the headline. I
would recommend reading the article.

~~~
m1117
Yes, I think the headline must reflect the content of the article; otherwise
it confuses me, makes me write false comments, and wastes my time.

------
29athrowaway
I usually try to make all my development environments the same. Make them
differ only in terms of infrastructure, not code.

The more code differences you have between environments, the less valuable
pre-production testing becomes.
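
Concretely, that means one code path everywhere and the per-environment
values pushed into configuration (a sketch with made-up settings):

```python
import os

# Same code everywhere; only infrastructure-level settings differ per env.
ENVIRONMENTS = {
    "dev":     {"db_url": "postgres://localhost/app_dev", "replicas": 1},
    "staging": {"db_url": "postgres://staging-db/app",    "replicas": 2},
    "prod":    {"db_url": "postgres://prod-db/app",       "replicas": 12},
}

def load_config():
    env = os.environ.get("APP_ENV", "dev")
    return ENVIRONMENTS[env]

config = load_config()
# Application logic below this point is identical in every environment.
```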

------
classified
The author apparently didn’t get the sarcasm in the "I test in prod" meme.

~~~
sagichmal
The author originated the meme. It is not meant to be interpreted as
sarcastic.

~~~
classified
So much the worse.

~~~
sagichmal
On the contrary, it's the best strategy for reducing risk in modern web
service architectures.

------
0xdeadbeefbabe
> We’re a startup. Startups don’t tend to fail because they moved too fast.
> They tend to fail because they obsess over trivialities that don’t actually
> provide business value. It was important that we reach a reasonable level of
> confidence, handle errors, and have multiple levels of fail-safes (i.e.,
> backups).

Do established companies, or stopdown, fail because they moved too fast? Can
we dismiss Charity's astonishingly good advice over perceived startupiness?
That's dumb.

------
hn_throwaway_99
This article is inane to me. Someone decided to write a lengthy blog post on
the deliberate and obvious misinterpretation of a meme.

DevOps and Observability are well described disciplines. I don't know anyone
who has uttered the meme "I don't always test, but when I do, I test in
production" who meant that monitoring, observing and reviewing prod behavior
is bad. They invariably meant that they were rushed to release things in prod
before they were adequately verified to ensure customers don't have crappy
experiences.

~~~
dnautics
The opinion of the article is that the catchphrase has become an excuse to
build poor observability and in-prod testing tools, and we should be better
about that.

~~~
hn_throwaway_99
> The opinion of the article is that the catchphrase has become an excuse to
> build poor observability and in-prod testing tools, and we should be better
> about that.

Is that a problem that actually exists, though? I've certainly never
encountered it, as the types of teams that are strict about good pre-release
testing are also the types of teams that are big on observability.

~~~
dnautics
Haha, that's why I said "opinion". If you're curious about my opinion: I
think there are some good points, and observability tools could be better. I
think "throw it over the fence to ops" is a slow-boiling problem coming down
the pike. Are observability tools abjectly poor, though? I don't think so.

