
Spark joy by running fewer tests
https://engineering.shopify.com/blogs/engineering/spark-joy-by-running-fewer-tests
======
mdoms
> Unfortunately, one can’t fully eradicate intermittently failing tests,

Oh boy do I disagree with this. I have a zero tolerance policy for flaky
tests. In code bases where I have the authority, I immediately remove flaky
tests and notify the relevant teams. If you let flaky tests fester - or worse,
hack together test re-runners and flaky reporters - they will erode trust in
your test suite. "Just re-run the build" will become a common refrain, and
hours are wasted performing re-runs instead of tracking down the actual
problems causing the red tests.

~~~
quicklime
I generally agree, but there's one thing that bothers me about the practice:

Before disabling the test, does anyone even consider the possibility that
maybe it's not the test that's flaky, but that the software under test is
itself buggy in a non-deterministic way?

I don't have a good answer for how to deal with this. It would be an enormous
amount of effort to go through every flaky result and reason about whether the
problem is with the test or the code under test, especially in larger
codebases. Maybe the heuristic of assuming that it's the test's fault is good
enough, and this is what I've done in every team I've worked in.

But it does bother me.

~~~
smarterclayton
There’s one particular domain where this question runs deep - tests that
verify parts of distributed systems. Note: please do not extrapolate any
problem described below outside of distributed systems.

For instance in Kubernetes we run lots of tests that verify parts of the
system actually working together, not just mocked. A fair amount of the time,
once tests are in and soaked, the test flake is actually the canary that says
things like “no, the kernel regressed networking AGAIN” or “no, your retry
failure means that etcd lost quorum - which shouldn’t have happened, what on
earth is going on, oh, someone suddenly started hotlooping and now the api
server is performing 10k writes a second”.

At some point, testing the system in isolation is not enough because all
failures are interactions of multiple components. Your flakes are the evidence
that the gigantic stack of software known as “modern computing run from HEAD
lol” is full of bugs.

It’s definitely a culture change - once anyone anywhere thinks “oh it’s just
flaky” they stop treating it like signal. Once they treat it like noise, it’s
very hard to unwind.

Today, when I look at one of those test suites and see a flake, I assume that
everything in open source is broken, again, and pull out the shovel, because
75% of the time it is.

~~~
smarterclayton
Of course, to be able to see the signal, you have to be ruthless to noise.
Deflake first, disable if you must, but never stop until it goes green and
then be ruthless about keeping it that way.

------
erulabs
As a long-time DevOps engineer and now founder - my perspective on tests has
really been on a rollercoaster. In a past life, I’d regularly be the guy
rejecting deployments and asking for additional tests - barking at developers
who ignored failures, lecturing about the sanity-saving features of a good
integration test.

These days? Well the headlong rush to release features and fixes is a real
thing. Ditching some tests in favor of manual verification is a good example
of YC’s advice “do things that don’t scale”. I add tests when the existential
fear notches up - but not a whole lot before that.

Like with almost all topics in software development - the longer I’m in the
field - the less intense my opinions become. The right answer: “eh, tests
would be nice here! Let’s get to that once the customer is happy!”

~~~
commandlinefan
> Ditching some tests in favor of manual verification

I think that's the biggest problem with (unit) testing: there's too much
"either test absolutely everything, or don't (unit) test anything at all"
thinking. You can save yourself a lot of hassle if you just stick to writing
tests for the things that are relatively straightforward to test: stay away
from anything that requires an external system or anything that runs in
parallel (multithreading, for example), and leave that stuff for manual
testing.

A big benefit of having reasonable code coverage that I don't often see
discussed is that if you have a unit test for each function in the system, you
can also _run_ any function without recompiling and redeploying the entire
app: when you've isolated the source of a problem, this capability can be a
godsend in allowing you to iterate quickly to narrow down the cause. I
invariably find myself stuck working with systems for which unit tests were
never developed, so there's no way to test a change besides recompiling the
whole thing, redeploying a whole environment, waiting for the whole thing to
start up, firing up the UI, navigating to the problem spot, and finally
testing what I was after in the first place.
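
For example, here's what that quick-iteration loop can look like in a
hypothetical Ruby project (made-up file and method names, minitest assumed):

      # test/price_calculator_test.rb
      require "minitest/autorun"
      require_relative "../lib/price_calculator"

      class PriceCalculatorTest < Minitest::Test
        def test_bulk_discount_applies_over_ten_items
          # 11 * 10 = 110, minus a 10% bulk discount = 99
          assert_equal 99, PriceCalculator.total(unit_price: 10, quantity: 11, discount: 0.1)
        end
      end

      # While chasing a bug, re-run just this one test in a second or two:
      #   ruby -Itest test/price_calculator_test.rb -n test_bulk_discount_applies_over_ten_items

No recompiling, no redeploying, no clicking through the UI - just the one
function under the one input you care about.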

~~~
cellularmitosis
> "You can save yourself a lot of hassle if you just stick to writing tests
> for the things that are relatively straightforward to test"

Yup. Focus most of your effort on testing your pure functions. If you've
followed functional core / imperative shell, typically what's left in the
shell isn't very interesting from a testing perspective.

And if the majority of your codebase is still in the imperative shell, well,
time to get to work :)

[https://www.destroyallsoftware.com/talks/boundaries](https://www.destroyallsoftware.com/talks/boundaries)

------
hideo
I wonder if Marie Kondo-ing your tests is a good idea in general. This article
reminded me a lot of "Write tests. Not too many. Mostly integration"
[https://kentcdodds.com/blog/write-tests/](https://kentcdodds.com/blog/write-tests/)
and other such articles for limiting testing, like this test diamond article
[http://toddlittleweb.com/wordpress/2014/06/23/the-testing-diamond-and-the-pyramid-2/](http://toddlittleweb.com/wordpress/2014/06/23/the-testing-diamond-and-the-pyramid-2/)

I've seen far too many unit tests at this point that just assumed too much to
be meaningful, and conversations about unit testing quickly devolve into "no
true Scotsman" style arguments about what is truly a unit and what isn't.

~~~
sidlls
Kent's article isn't so much about limiting test coverage as about preferring
one kind (integration) over another (unit). He's right. Tons of
unit testing in application codebases is a huge, stinking code smell in my
experience. Usually it means the code is overly abstract (interfaces
everywhere for "mockability") or unnecessarily complex (e.g. conditional
expressions factored out into functions called in one site just so they can be
tested separately) or both.

------
bigmanwalter
When it comes to testing, I now follow the advice of Gary Bernhardt's
presentation, Functional Core, Imperative Shell:
[https://www.destroyallsoftware.com/screencasts/catalog/functional-core-imperative-shell](https://www.destroyallsoftware.com/screencasts/catalog/functional-core-imperative-shell)

The idea is to move all the logic of your app into pure functions that can be
tested on the order of milliseconds. When you refactor your code to allow for
this, everything just makes more sense. Tests can be run more often and you
are far more confident about the behaviour of your code.

~~~
8192kjshad09-
Maybe I just don't get it, but this design seems impractical for systems that
use the network a lot.

I work on an app whose basic structure is

1. Receive data from a webhook (not controlled by me)
2. Make some HTTP requests based on data from 1
3. Send data to another HTTP API

The core of the logic is in step 2. How can I possibly have a functional core
here? Step 2's behavior completely depends on the HTTP responses.

~~~
jzoch
The idea is that your server handler mimics those steps and contains _all_ the
IO you need (aka external dependencies). The receiving and sending of the
http requests + any other I/O you need should be immediately visible in the
top level method (in a simple model). Everything below is functional and pure.
Take a request, use pure functions to transform and reshape the data, pass
that back up the call-stack to the top level method. Do some I/O operation
against the database, go into functional land to shape it a bit, bring it back
up, combine with other data in a pure way and then finally send the response.
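
A rough sketch of that shape in Ruby - every name here (WebhookHandler,
ExtractIds, Enrich, the example.test URLs) is made up for illustration, not
taken from the article:

      require "json"
      require "net/http"
      require "uri"

      # Imperative shell: every bit of I/O is visible in this one top-level method.
      class WebhookHandler
        def call(webhook_payload)
          ids      = ExtractIds.call(webhook_payload)                               # pure
          raw      = ids.map { |id| http_get("https://example.test/things/#{id}") } # I/O
          enriched = Enrich.call(webhook_payload, raw)                              # pure
          http_post("https://example.test/api", JSON.generate(enriched))            # I/O
        end

        private

        def http_get(url)
          Net::HTTP.get(URI(url))
        end

        def http_post(url, body)
          Net::HTTP.post(URI(url), body, "Content-Type" => "application/json")
        end
      end

      # Functional core: plain data in, plain data out - testable in milliseconds.
      module ExtractIds
        def self.call(payload)
          payload.fetch("items").map { |item| item.fetch("id") }
        end
      end

      module Enrich
        def self.call(payload, responses)
          { "source" => payload["source"], "items" => responses.map { |r| JSON.parse(r) } }
        end
      end

Tests for ExtractIds and Enrich need no HTTP stubbing at all; only the thin
shell ever touches the network, and it stays small enough to cover with a
couple of integration tests.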

------
jwalton
I wonder if they considered using something like ptrace to track which .rb
files a given test suite loads? This would probably be orders of magnitude
faster than Rotoscope or Tracepoint. You'd get a much "coarser" data set,
since you wouldn't know which individual tests called which modules (unless
you run each test case in its own Ruby runtime). On the upside, you'd be able
to watch JSON and YAML files.

~~~
byroot
> I wonder if they considered using something like ptrace to track which .rb
> files a given test suite loads?

It's not really possible, as we eager load the entire application on CI. And
even if we didn't, to trace dependencies like this we'd need to boot the
application in lazy-load mode for each test suite; that alone would be slower
than running the entire test suite.

------
umaar
When it comes to browser automation tests, everywhere I've worked has suffered
from intermittently failing tests. When I was at the Ministry of Justice (UK),
I configured CircleCI to run tests hundreds of times [1] (over a number of
days) through cron jobs. This allowed me to reflect on all test results, and
find out what failed most often and eventually solve those root causes. This
strategy worked well.

Interestingly enough, just today I posted a GitHub thread [2] and asked the
community to 'thumbs up' the video course they'd like for me to create. "Learn
Browser Automation" is currently the highest voted. If it's the one I end up
making, a huge focus will absolutely be on: How to reduce test flakiness with
headless browsers.

Words of advice: avoiding sleep() and other brittle bits of code will help.
But in addition, run your tests frequently to catch the flakiness out early.
Invest in tooling which helps you diagnose failing tests (screenshots,
devtools trace dumps). Configure something like VNC so you can remotely
connect to the machine running the test.
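
As a small illustration of the sleep() point (sketched here with Ruby's
Capybara DSL, which is an assumption on my part - the same idea applies to
Selenium, Puppeteer and friends):

      # Brittle: guesses how long the async update takes, flakes when it guesses wrong.
      click_button "Save"
      sleep 5
      expect(page).to have_css(".flash-success")

      # Better: Capybara's matchers retry until the element appears (or the wait
      # time elapses), so the test waits only as long as it actually needs to.
      click_button "Save"
      expect(page).to have_css(".flash-success", wait: 10)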

[1] [https://github.com/ministryofjustice/fb-automated-tests/blob/7c9cee58db902419abf5449aaef6e91e575502d9/.circleci/config.yml#L52](https://github.com/ministryofjustice/fb-automated-tests/blob/7c9cee58db902419abf5449aaef6e91e575502d9/.circleci/config.yml#L52)

[2] [https://github.com/umaar/dev-tips-tracker/issues/33](https://github.com/umaar/dev-tips-tracker/issues/33)

------
m12k
I think this shows how counting on horizontal scaling to handle inefficiency
only works for a while, and will eventually introduce its own set of
complexities - and dealing with those might be more trouble than coding things
more efficiently in the first place. Then again, maybe it's worth it for a
huge company like Shopify, because they save work for the feature devs by
adding extra load on the test infrastructure devs, effectively increasing
developer parallelism.

Finally I wonder if they could have done more to speed up their tests? I'm
maintaining a Rails codebase too, and I cut test time down by two thirds by
rewriting tests with efficiency in mind - e.g. by ignoring dogma like 'each
test needs to run completely independently of previous ones' (if I verify the
state between tests, do I really need to pay the performance penalty of a
database wipe?) The test-prof gem has a great tool, let_it_be, that allows you
to designate db state that should not get wiped between tests. That, and
focusing on more unit tests and fewer integration tests, has really gone a long
way toward speeding things up again for me.
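
For anyone who hasn't seen it, the test-prof usage looks roughly like this
(RSpec and factory_bot assumed; Author/Post are made-up models):

      # spec/rails_helper.rb
      require "test_prof/recipes/rspec/let_it_be"

      # spec/models/post_spec.rb
      RSpec.describe Post do
        # Created once for the whole example group instead of before every
        # example, then rolled back when the group finishes.
        let_it_be(:author) { create(:author) }
        let_it_be(:post)   { create(:post, author: author) }

        it "belongs to its author" do
          expect(post.author).to eq(author)
        end
      end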

------
lazyant
> Before the feature was rolled out, many developers were skeptical that it
> would work. Frankly, I was one of them.

Kudos for the Skepticism section. Every "how we fixed problem X at company Y"
post should have this section, especially when written by someone who opposed
the solution. The challenges and battle stories tend to be the most interesting
part, at least for me.

------
wwright
Interesting that they don't talk about Bazel. Isn't skipping tests like this
one of its biggest selling points, particularly for monorepo users?

~~~
spankalee
Exactly. The first thing they should dispense with is the idea that they
should run _all_ the tests in the monorepo. That's what doesn't scale.

Run only the affected tests and the overhead is now proportional to the
potential impact of the change.

~~~
wwright
That seemed to be what they were describing, but with dynamic dependency
detection via introspection.

------
Seb-C
My experience is that most of the time, randomly failing tests are actually
failing because they were made to be random.

Sure, adding Faker to your test/mock data may help you to find rare edge
cases. The problem is that those edge cases will be triggered in the future,
in another context and by another developer. So not only will it probably not
get fixed, it will be a waste of time and an annoyance for someone. The same
goes for options like the `random` setting in Jasmine (which runs your unit
tests in a different random order each time).

So now I only have tests with static, predictable data.
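
A small illustration of the difference (hypothetical User model; the last line
reflects the faker gem's documented way of pinning its RNG, as far as I'm
aware):

      require "faker"

      # Random: a different email on every run, so a failure may only reproduce
      # on someone else's machine, months later.
      user = User.new(email: Faker::Internet.email)

      # Static: the same input on every run, so a failure reproduces immediately.
      user = User.new(email: "test-user@example.com")

      # If you keep Faker around anyway, pinning its seed at least makes runs repeatable:
      Faker::Config.random = Random.new(42)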

Having tests that depend on the state of the previous one (or are affected by
it) is quite common too, and not always easy to fix properly.

Once this is removed, the remaining random failures are usually authentic and
helpful.

------
flukus
And here's a Makefile implementation for a statically typed language:

    
    
      #one .testresult target per compiled test binary (assumes *.test files in the current directory)
      ALL_TEST_RESULTS := $(patsubst %.test,%.testresult,$(wildcard *.test))
    
      #run a test and capture its output in a .testresult file when the test binary changes
      %.testresult: %.test
        $< > $@ 2>&1
    
      #"make test" depends on an up-to-date .testresult file for each test
      test: $(ALL_TEST_RESULTS)
        cat $(ALL_TEST_RESULTS)
    

Sure, the dynamic typing introduces many problems, but surely you could have
something a bit more halfway like "%.testresult: %.test $(DEPENDENCIES)
$(METAPROGRAMMING_MAGIC_DEPENDENCIES)", so that only the hopefully rare changes
to the core files require the full test suite to be rerun. It seems like dynamic
typing is the core of their problem, though, and this is a crazy complicated
solution to try and work around that.

~~~
jolmg
> $< 2&1> $@

I'm not really well versed in Makefiles, but that seems to pass the argument
"2" to the testing program, launching it in the background with "&", and then
doing a redirection for the stdout of an empty command to the target file with
"1>".

I think you meant

    
    
      $< > $@ 2>&1
    

which could also be

    
    
      $< &> $@
    

I can't see how your example relates to static or dynamic typing, though.

~~~
flukus
Fixed; the former is correct, and the latter I think is a bashism that may not
work. Unfortunately I plucked that from a project in a very broken state.

> I can't see how your example relates to static or dynamic typing, though.

Because with static typing you know the object/file graph at compile time, and
only affected tests need to be run. In that example the .test files will only
be built when their dependencies change, and a .testresult will only be created
if its .test is rebuilt; everything is incremental, and "make test" won't
execute any tests if there are no changes that could affect them.

~~~
joshuamorton
You know this in dynamic languages too, based on imports.

So if you have the build graph represented for your dynamic language, you get
the same result.

This is just a poor (poor, poor) buggy implementation of Bazel.
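
That said, "based on imports" can be made concrete even in a toy way. A rough
Ruby sketch (it only follows require_relative and ignores load paths,
autoloading and metaprogramming, which is exactly why the real tools are
harder):

      require "set"

      # Files directly required (via require_relative) by a given file.
      def requires_of(file)
        return [] unless File.exist?(file)
        File.readlines(file).filter_map do |line|
          if (m = line.match(/^\s*require_relative\s+["'](.+?)["']/))
            File.expand_path(m[1] + ".rb", File.dirname(file))
          end
        end
      end

      # Is this file, or anything it transitively requires, in the changed set?
      def affected?(file, changed, seen = Set.new)
        return false unless seen.add?(file)
        changed.include?(file) ||
          requires_of(file).any? { |dep| affected?(dep, changed, seen) }
      end

      changed = ARGV.map { |path| File.expand_path(path) }.to_set
      Dir["test/**/*_test.rb"].each do |test_file|
        puts test_file if affected?(File.expand_path(test_file), changed)
      end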

------
davewritescode
The approach they use to apply this to a large Ruby project is interesting, but
this type of strategy has been in use forever and, at least to me, seems
fairly obvious.

Running _all_ the tests with every build is always a bad idea. A better
approach that doesn't require fancy dynamic analysis is to organize tests in a
way that makes it clear what's likely to break, and to make sure you're
constantly running your test suite in QA environments.

Making a change to a module should force you to run that module's test suite.
Interactions between modules can be tested all day in a loop and monitored
before deploying to production.

------
overgard
I wonder if the tests are even serving a purpose at that point? If they can't
reliably answer the question "did I accidentally break something" in a
reasonable period of time what's the point?

------
hospadar
We just did pretty much the exact same thing with a large in-house ETL
application. We can do great static analysis of the dependencies of different
jobs and the coverage of various tests. Most PRs are now running a tiny
fraction of the test suite (thousands of tests total) in minutes instead of an
hour. We run all test cases before deploying master, just in case we missed
something.

~~~
yawaramin
Was there internal resistance to this, and if so what was it?

------
simon_000666
> has over 150,000 tests

> takes about 30-40 min to run on hundreds of docker containers in parallel

This seems like it might be a signal that now could be the time to start
splitting services out in an SOA fashion, each with its own test suite and
some contract-driven tests. Having to run that many tests on each commit is
definitely a smell that something architecturally fundamental is wrong...

------
darksaints
This sort of thing is why I always scoff at complaints about slow compile times
for statically, strongly typed languages (e.g. Scala, Haskell, OCaml, Rust, etc.).

Sure, these languages definitely compile more slowly than some other compiled
languages, and an interpreted language has no compile step at all. But if you factor in
testing, the type systems of those languages can easily remove a massive
amount of testing that other less rigorous languages would either a) not test
at all, or b) test at a significant cost. I'd be willing to bet that if
Shopify were using a strongly typed language, 50-90% of their testing would be
completely redundant, because the type system already takes care of it.

That isn't to say that there are not reasons to use dynamically typed
languages...just that if you are building production systems in strongly typed
languages, compile time is almost completely irrelevant as a factor in
productivity, regardless of how much slower they compile in comparison to an
alternative.

------
mehrdadn
Does anyone have thoughts on whether test suites suffer from Goodhart's law?
Sometimes I feel like they only work well if people assume they don't exist
and commit accordingly.

~~~
jwalton
My dad was a huge proponent of this back when he was in software: don't tell
your developers which tests failed, only what percentage of tests failed.

A software test is, in many ways, like an exam you might write in university;
just like an exam can't possibly cover 100% of what you're supposed to know, a
test (especially an integration test for a large and complex system) can't
possibly cover 100% of the conditions the system is supposed to operate under.
An exam is a good way to measure if you know a subject though, and similarly a
test suite is a good way to check that the quality of a system meets a certain
bar.

Once you write the exam, if I go back and say "Here are all the questions you
got wrong. Go study up and write the same exam tomorrow," though, it very much
ceases to be a good measure of whether or not you know the subject matter. You
can now "cheat the system" by studying only the parts of the subject that the
exam covers.

Similarly, once your integration tests are failing, if someone tells you which
tests are failing and how, you're going to go back and fix only what you need
to get the tests passing. At this point, the tests stop being a good
indication of code quality - 100% of the tests are passing, but you can't say
that 100% of the defects have been removed, so the tests are, in a sense, now
kind of worthless. They might stop a limited number of future defects getting
in, but they're not doing the arguably much more important job of telling you
what your overall quality level is.

If instead, when you submit a commit, I say "5% of the tests are now failing"
and nothing else, you have to go look for defects in your code, and you're
probably going to find a lot of defects on your own before you even get to the
5% that the tests are complaining about.

~~~
mdoms
This sounds like a fun game to play with a team of developers who have no time
constraints. In every organisation I've worked with you would get a very stern
talking to for behaving like this.

My tests are specifically designed to show you where the defect is, so you can
solve the immediate problem and get back to work. I don't expect every
developer who triggered a failing test to perform a full analysis of the code
base and resolve every other defect. That would be nice, if we had the time.

~~~
jwalton
I'll prefix by saying that this is exactly how I write my tests, too. But,
let's do a little critical thinking here and ask "Why do we write integration
tests?" If the goal is to improve software quality, I'm afraid I have some
disappointing results for you.

Back when my dad was working at a huge software company (BigCorp, let's say),
he looked at how much effort the manual test team spent over a two week
period, and how many defects they found. Then he did the same over the next
two week period. Now, logically, in that second period, some of the earlier
defects had been found and fixed, so it should now be harder to find defects,
so the total defect count per unit of test effort should be lower, right?
Armed with two data points, he did a regression and worked out how many
defects they'd find if they did an infinite amount of testing; effectively the
undiscovered defect count left in the product. The number he got was
astoundingly huge - no one believed this was possible. So, he went over to the
plotter and plotted out a giant effort/defect count curve, and then every two
weeks he'd put a pin in his plot to show reality vs. his prediction, and for
months and months until he got tired of doing it, he was pretty dead on. And
he didn't just do this for one project, he did it for lots of projects, across
lots of different teams of various sizes.
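
For what it's worth, one simple model that produces that kind of extrapolation
(my reconstruction with made-up numbers, not necessarily the regression he
ran): assume each two-week pass finds a roughly constant fraction of what the
previous pass found, so the per-period counts decay geometrically.

      period_1 = 100                # defects found in the first two weeks
      period_2 = 90                 # defects found in the next two weeks

      r = period_2.to_f / period_1  # each pass finds ~90% as many as the last
      total = period_1 / (1 - r)    # geometric series: 100 + 90 + 81 + ...
      remaining = total - (period_1 + period_2)

      puts total.round              # => 1000 defects expected in total
      puts remaining.round          # => 810 still undiscovered - the "astoundingly huge" part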

On all of these projects, all the manual testing they could possibly hope to
accomplish if they had their entire testing staff spend 100 years testing
would have reduced the overall defect count by a tiny tiny fraction of a
percent. So testing and then fixing bugs found in tests (at least in all the
projects at BigCorp) didn't really have a huge impact on software quality.

And this should not really be a surprise; if you were manufacturing cars, you
might test the power seats on every 100th car, and use this to figure out what
the quality of power seats in your cars is. You might discover that only 95% of
power seats are working, and you might think that's unacceptable. If you do,
though, you're not going to "solve" the problem by fixing all the broken power
seats you test; you're going to go figure out where in your manufacturing
process/supply chain things are going wrong and fix the problem there, and
improve your quality. The testing is a measure of your process.

Software is not so different - by the time it gets to integration testing, the
code has been written. The level of quality of the code has largely already
been set at this point - all the defects that are going to be introduced have
already been introduced. The quality level is dependent upon your process and
the skill level of your developers. So testing some arbitrary fraction of the
lines-of-code is going to find problems in some percentage of those lines-of-
code, but fixing those particular problems? Is this going to have a huge
impact on quality?

~~~
krab
> you're not going to "solve" the problem by fixing all the broken power seats
> you test; you're going to go figure out where in your manufacturing
> process/supply chain things are going wrong

I think the point is that examining the failing seats will lead you to the
points in the process that should be fixed. Therefore you can fix the issue more
efficiently than going through the whole process. It's the same with knowing the
details of failing software tests.

Imagine I hide the information about seats and tell you 10% of the finished
cars have "some" defect. Where would you even start looking in the factory?

