Spark joy by running fewer tests (shopify.com)
147 points by caution 8 months ago | 99 comments

> Unfortunately, one can’t fully eradicate intermittently failing tests,

Oh boy do I disagree with this. I have a zero tolerance policy for flaky tests. In code bases where I have the authority, I immediately remove flaky tests and notify the relevant teams. If you let flaky tests fester - or worse, hack together test re-runners and flaky reporters - they will erode trust in your test suite. "Just re-run the build" will become a common refrain, and hours are wasted performing re-runs instead of tracking down the actual problems causing the red tests.

I generally agree, but there's one thing that bothers me about the practice:

Before disabling the test, does anyone even consider the possibility that maybe it's not the test that's flaky, but that the software under test is itself buggy in a non-deterministic way?

I don't have a good answer for how to deal with this. It would be an enormous amount of effort to go through every flaky result and reason about whether the problem is with the test or the code under test, especially in larger codebases. Maybe the heuristic of assuming that it's the test's fault is good enough, and this is what I've done in every team I've worked in.

But it does bother me.

There’s one particular domain where this question runs deep - tests that verify parts of distributed systems. Note: please do not extrapolate any problem described below outside of distributed systems.

For instance in Kubernetes we run lots of tests that verify parts of the system actually working together, not just mocked. A fair amount of the time, once tests are in and soaked, the test flake is actually the canary that says things like “no, the kernel regressed networking AGAIN” or “no, your retry failure means that etcd lost quorum - which shouldn’t have happened, what on earth is going on, oh, someone suddenly started hotlooping and now the api server is performing 10k writes a second”.

At some point, testing the system in isolation is not enough because all failures are interactions of multiple components. Your flakes are the evidence that the gigantic stack of software known as “modern computing run from HEAD lol” is full of bugs.

It’s definitely a culture change - once anyone anywhere thinks “oh it’s just flaky” they stop treating it like signal. Once they treat it like noise, it’s very hard to unwind.

Today, when I look at one of those test suites and see a flake, I assume that everything in open source is broken, again, and pull out the shovel, because 75% of the time it is.

Of course, to be able to see the signal, you have to be ruthless to noise. Deflake first, disable if you must, but never stop until it goes green and then be ruthless about keeping it that way.

One of the culture changes I've had to try to push was to test for the failure rate of a distributed system rather than the absolute number of failures: don't test for "Was the number of failures less than n", test for "Did the success rate fall below 99.99%".
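To make the success-rate idea concrete, here is a minimal sketch. The name `meetsSuccessRate` and the 99.99% default are illustrative assumptions, not taken from any real test framework.

```javascript
// Hypothetical helper: judge a batch of requests by success rate,
// not by an absolute failure count.
function meetsSuccessRate(successes, total, minRate = 0.9999) {
  if (total === 0) return true; // nothing ran, so nothing failed
  return successes / total >= minRate;
}

// 5 failures out of 100,000 requests is a 99.995% success rate: passes.
console.log(meetsSuccessRate(99995, 100000)); // true
// The same 5 failures out of 1,000 requests is 99.5%: fails.
console.log(meetsSuccessRate(995, 1000)); // false
```

The same five failures pass or fail depending on the volume, which is the point: an absolute threshold like "fewer than n failures" has to be retuned every time the load changes.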

I totally understand. In our setup, flaky tests happen when testing parts of the software that depend on the Hadoop infrastructure. Sometimes you can’t fully control some pieces and as a result those tests can be flaky.

However I want to highlight how frustrating it is when your tests fail when you wait for the build to finish...

> It would be an enormous amount of effort to go through every flaky result and reason about whether the problem is with the test or the code under test, especially in larger codebases.

IMO a "flaky test" is one for which the investigation has been performed and the design of the test assessed to be "flaky." i.e. not a test that passes sometimes and not others, but one whose design cannot reliably expect a given result or one whose implementation can be impacted by factors outside of the control of the test suite. Timeouts are a classic source of flaky tests.

Once a test is determined to be flaky it should be evicted from the test suite, or at least demoted from the "reliable/frequently run" rank to a lower one that can tolerate some hand holding.

I'm not sure if it's exhaustive, but any test with a dynamic dependency is one I would likely classify as "flaky", like anything hitting another system. Draw your own boundaries as fit around what "dynamic" means (ex: some might say static files are OK, others not).

Also, I'm talking every-commit, unit-level testing for context.

This is tricky but when you reframe it it's clear the heuristic you mention is the correct one. The value of a test is giving signal to a developer that the code they wrote is correct. If a test is flaky due to the test or the code, the signal the test provides has now become negative. The parent mentions filing it on the team that owns the test. This sounds correct and the onus is then on that team. At larger companies this process is automated. You may enjoy the flake detection section in this blog post https://dropbox.tech/infrastructure/athena-our-automated-bui...

I have found two cases that occur enough to not assume that the test is wrong:

1. latency/performance - the test system in a well-controlled environment shouldn't have so much latency variance anyway (project was an 'instant' messenger)

2. maybe 30% of the time the test (run frequently in CI) is identifying a race condition that doesn't show up when run ad-hoc by a dev

I've always been curious about this. My fiction is that if someone spent the time they'd find a few sources of most of the flakiness and at some point the flakiness would fall drastically. Like maybe there are 1000 flaky tests but fixing 20 bugs would clear up 900 of those flaky tests.

No idea if it's true. My experience though is that programmer A, who has no domain knowledge of the issue, marks a test as flaky and files a bug. That bug is completely ignored, and there's a giant pile of flaky tests and people wondering why stuff broke, and it turns out it's because 1000s of tests are being ignored as flaky.

We had exactly this at an old job. A flaky browser test that turned out to be due to some misuse of thread local storage, so occasionally users would see data for another user.

Once we tracked that down, we took flaky tests much more seriously.

When thread-local and request-local aren't the same thing...

This is why your CI system should record statistics on which tests failed and when.

A test with a pattern of failing intermittently over its entire lifetime is most likely a flaky test. Disabling the test modifies a bunch of metrics, like skipped tests and code coverage (woe be unto you if some idiot set it up so you can't commit code that lowers test coverage, instead of just warning you).

There are much bigger and more likely sources of regression in the code than a new concurrency bug in code covered by a test that already has a concurrency bug in it. Like someone modifying a test to match the regression they just created.

If there is a concurrency or randomness bug in the untested code, that will probably surface when trying to write a robust test to replace the old one. If not by the author, then by the person the author asks for help when they can't figure it out.

> A test with a pattern of failing intermittently over its entire lifetime is most likely a flaky test.

I'm not GP, but I think they meant something more radical.

In CI, if a test is seen to fail and then succeed on the same git commit for the first time, that's it. One time is enough to classify it as "flaky". No more lifetime, no more patterns of interesting behavior. The test is removed and thrown into backlog.
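A rough sketch of that rule, assuming CI stores one record per test run. The record shape `{ test, commit, passed }` and the function name are made up for illustration, and unlike the description above this version ignores the order of the two results.

```javascript
// Each run record is an assumed shape: { test, commit, passed }.
function findFlakyTests(runs) {
  const outcomes = new Map(); // "test@commit" -> Set of observed results
  for (const { test, commit, passed } of runs) {
    const key = `${test}@${commit}`;
    if (!outcomes.has(key)) outcomes.set(key, new Set());
    outcomes.get(key).add(passed);
  }
  const flaky = new Set();
  for (const [key, results] of outcomes) {
    // Both true and false seen for the same commit: the code did not
    // change, so the result should not have changed either.
    if (results.size === 2) flaky.add(key.slice(0, key.lastIndexOf("@")));
  }
  return [...flaky].sort();
}

const runs = [
  { test: "checkout", commit: "abc123", passed: false },
  { test: "checkout", commit: "abc123", passed: true },
  { test: "login", commit: "abc123", passed: true },
  { test: "login", commit: "def456", passed: false },
];
console.log(findFlakyTests(runs)); // [ 'checkout' ]
```

Note that `login` is not reported: it failed on a different commit, so that could be a genuine regression rather than flakiness.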

I might paint with a broader brush. If your commit breaks master, we revert the whole change and you try again.

When the old stuff breaks, there's a bunch of other decisions built on top of it and there is no clear path to unwind the changes. Between HEAD and HEAD-4? Yank it.

If you have a problem with testing infrastructure/code, that policy will keep disabling random tests until either you're left with nothing or you realise that radical moves are rarely practical. (With a side of distrust from any team you send the work to.)

Well, if it's a third-party software, that's not ideal but "ok" for realistic levels of ok.

But yeah, your tests shouldn't be non-deterministic unless there's a very good reason for it (and to be honest I can't think of any).

For added resiliency, try changing the order of the tests and/or not running some tests sometimes
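One way to shuffle test order without making failures impossible to reproduce is to drive the shuffle from a logged seed. This is only a sketch: the generator below is a toy, and `shuffleTests` is a hypothetical helper, not part of any test runner.

```javascript
// Toy linear congruential generator, good enough for shuffling tests
// reproducibly. Not suitable for anything security-related.
function makeRng(seed) {
  let state = seed >>> 0;
  return () => {
    state = (state * 1664525 + 1013904223) % 4294967296;
    return state / 4294967296;
  };
}

// Fisher-Yates shuffle driven by the seeded generator.
function shuffleTests(tests, seed) {
  const rng = makeRng(seed);
  const order = [...tests];
  for (let i = order.length - 1; i > 0; i--) {
    const j = Math.floor(rng() * (i + 1));
    [order[i], order[j]] = [order[j], order[i]];
  }
  return order;
}

const tests = ["a", "b", "c", "d", "e"];
const seed = 42; // log this so a failing order can be replayed
console.log(seed, shuffleTests(tests, seed));
```

Re-running with the same seed gives the same order, so an order-dependent failure can be replayed instead of becoming one more intermittent mystery.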

That being said, my obligatory test rant is "don't write micro-tests" (like "test if method returns" then "test if method returns the right value" etc) and don't set-up everything in the setup function (that is, setup only what you need).

> Before disabling the test, does anyone even consider the possibility that maybe it's not the test that's flaky, but that the software under test is itself buggy in a non-deterministic way?

I work on the team that built the system in the article. When we disable tests we ask our developers to investigate the root cause. Most of the time it is related to application code, so fixing an intermittently failing test is rarely just about the test itself; the failure is often a result of multiple factors.

The key realisation, I think, is that you've already got that problem. A flaky test is one that's failed to do its job. That's why you can't just disable it, you need to back that up with rework to make sure the underlying problem is addressed.

In talking to people about this problem, a common misconception I hear is that "well it only fails 1% of the time (so why should I care?)"

Developers don't think in percentages. Our experiences are based on how frequently something happens, not the percent of the times it happens. In a CI build, where we are aggressively building many times a day, your one bad test is going to fail every week or two.

Now add two dozen more tests that do the same, and now builds are failing 2x a day on average and more frequently during crunch times and every few months the build will fail 4, 5 times in a row. And for some reason this happens right when someone actually needs a build now because they broke something and everyone is waiting for them to fix it. Talk about stressful.

Every day is "all the time" to most people, and 5 builds in a row might be "hours" (because you have to keep babysitting it). These are serious broken windows and that is how you should treat them.
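The arithmetic behind those numbers can be checked with a few lines. This sketch assumes the flaky tests fail independently and that there are 10 builds a day; both are my assumptions, since the comment only says "many times a day".

```javascript
// Probability that a build fails when it contains `flakyTests` independent
// tests that each fail spuriously with probability `failRate`.
function buildFailureProbability(flakyTests, failRate) {
  return 1 - Math.pow(1 - failRate, flakyTests);
}

const buildsPerDay = 10; // assumed, the comment only says "many times a day"

const one = buildFailureProbability(1, 0.01);
const twentyFive = buildFailureProbability(25, 0.01);

// One flaky test: about one red build every ten working days.
console.log((one * buildsPerDay).toFixed(2), "spurious failures per day");
// Twenty-five flaky tests: roughly 22% of builds fail, about two per day.
console.log((twentyFive * buildsPerDay).toFixed(2), "spurious failures per day");
```

So a "1% flake" that feels harmless in isolation turns into a couple of red builds every day once a few dozen of them accumulate.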

The worst thing is you end up normalizing failure - nobody even bothers checking whether it's a "real" failure until it's failed a few times in a row. At that point you might have introduced other test failures which nobody noticed, or someone might have built on top of the broken code, making a fix harder, and so on and so on.

Indeed. This pattern though is often the point at which an intervention is possible. Almost like it has to happen repeatedly before you can get consensus to stop it.

I've worked on the project in the article so I can give some perspective on our handling of intermittently failing tests.

Generally, I agree with your sentiment. Intermittently failing tests erode trust in the codebase. In our case there are practical challenges with the size of our test base. Not everybody has context on every part of the codebase. This makes it really hard for people to decide whether a test is intermittently failing and whether to remove it. The other issue is that at the size of our test base, we are introducing more intermittently failing tests than we can eradicate effectively. To combat this problem we built a test-onboarding system that checks tests for common issues. Once a test enters the codebase, we disable it if it fails and then passes on the same commit too many times, so that an individual developer doesn't have to make the decision.

I totally agree with you. We have a monorepo with maybe a hundred thousand tests, and we have a nightly build that runs against all of them, and publishes the results in an email to all engineering. 3/5 days we come out clean. 2/5 days we discover and fix a flaky test or broken build target.

If there’s an intensity of flakiness that can withstand the gaze of a couple hundred engineers for more than an hour or so, we haven’t found one yet.

Edit: looks like our numbers are quite close to Shopify’s. We don’t need anything close to hundreds of docker containers running in parallel. But we’re mostly a python shop, maybe their being a ruby shop slows things down here.

Ruby is not really slowing us down, the bigger issue is that we have a lot of integration tests and few isolated unit tests. So we are massively parallelizing our builds so we can get p50 build times of 23 minutes when running the whole test suite. The time mentioned in the article is a little bit closer to the p95 time.

> If there’s an intensity of flakiness that can withstand the gaze of a couple hundred engineers for more than an hour or so, we haven’t found one yet.

That is a good approach and we also used to disable tests that would fail on full master builds. When it comes to fixing intermittently failing tests we are now storing every test run so we can analyze our test suite. This way we are able to find the slowest and flakiest tests in the codebase, disable them, and notify the owners of the code to investigate.
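As an illustration of what that kind of analysis might look like (not the actual system described above), here is a sketch that aggregates stored runs per test. The record shape `{ test, passed, durationMs }` is an assumption.

```javascript
// Assumed record shape: { test, passed, durationMs }.
function summarize(runs) {
  const byTest = new Map();
  for (const { test, passed, durationMs } of runs) {
    const s = byTest.get(test) ?? { test, runs: 0, failures: 0, totalMs: 0 };
    s.runs += 1;
    if (!passed) s.failures += 1;
    s.totalMs += durationMs;
    byTest.set(test, s);
  }
  return [...byTest.values()].map((s) => ({
    test: s.test,
    failureRate: s.failures / s.runs,
    meanMs: s.totalMs / s.runs,
  }));
}

const history = [
  { test: "search", passed: true, durationMs: 40 },
  { test: "search", passed: false, durationMs: 60 },
  { test: "export", passed: true, durationMs: 900 },
  { test: "export", passed: true, durationMs: 1100 },
];
const stats = summarize(history);
const flakiest = [...stats].sort((a, b) => b.failureRate - a.failureRate)[0];
const slowest = [...stats].sort((a, b) => b.meanMs - a.meanMs)[0];
console.log(flakiest.test, slowest.test); // search export
```

A raw failure rate does not distinguish a flaky test from one that is genuinely broken, so in practice you would probably combine it with a same-commit pass/fail check before disabling anything.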

That’s interesting and definitely helps make the numbers make more sense! :)

In your experience, do you find high value in having your tests be 80/20 (rough numbers) integration/unit vs the other way around?

I guess it depends on the programming style, but here it’s the other way around. Some minority of engineers here clamor for end to end tests every once in a while, but we’ve rarely found the difficulty of establishing the test harness to be worth the marginal gains in reliability. We’re also wary of the temptation to write e2e tests to test unit-test-level correctness, because of the implications on test performance, like you describe.

On our side, the closest thing we have to ubiquitous e2e testing is that we spin up and spin down one ephemeral database instance per test module (could be used by many test functions).

How does your zero tolerance policy work when you can't find a reason the test fails? I.e. you have critical functionality you test, but it fails one test run in 1000. You can spend 2 days on the issue and not find a solution.

Sinking more time into it doesn't really help anyone much and won't offset any time saving in a realistic future. On the other hand, if you actually remove the test, you risk messing up a critical transaction.

If you have a critical function failing testing, you need to figure out the root cause, whether it takes 2 days or 2 months. 2 days is nothing - I've spent more than that on UI polish bugs. If you can't dedicate 5x that time to fixing critical functionality, something is desperately wrong.

Viraptor is right. Your "solution" is far from universally acceptable. If you came to me as a developer telling me that you spent 2 months debugging a test that fails randomly once a week and you didn't do much else, I would probably show you the door as a team lead. We need to put things into perspective.

I hate flaky tests just as much as the next dude but seriously, you need to factor in everything - deadlines, features, CI/CD, metrics etc, while still having a constant time frame.

So let's not get into such polar opinions, it's just not realistic at all. Life is never that ordered.

If you came to me with that attitude, I would fire you immediately.

This is, per definition, critical functionality, and we can't figure out whether it is working or not, and you are saying you don't care.

> If you came to me as a developer telling me that you spent 2 months debugging a test that fails randomly once a week and you didn't do much else, I would probably show you the door as a team lead. We need to put things into perspective.

This is a pretty polar opinion right here - it's not a flaky test that fails once a week, it's unreliable critical functionality. If I were _your_ lead and you told me you fired someone because they spent two months on something, I'd probably fire you for not knowing what your team are doing for two months.

> This is a pretty polar opinion right here - it's not a flaky test that fails once a week, it's unreliable critical functionality.

It may be unreliable critical functionality, or unreliable test. In some cases it's obvious which one it is, in some it isn't. If you know it's the test, it becomes much less beneficial to spend days on it.

I didn't say fire, did I? I said "show you the door", like, get out of my office :P

And seriously, if you fire someone for having a strong opinion, you should remind yourself that even little kids tolerate far worse behavior.

If it's acceptable for your critical functionality to fail 0.1% of the time in production, then sure, leave it in.


In the majority of cases, in my experience, flaky tests have nothing to do with the software; they are about faulty test infrastructure.

There's a difference between a flakey test and a test failing due to a randomly failing feature. They're not always the same thing. A failing test does not necessarily mean the feature doesn't work.

The biggest thing is anything time-based: the effort to mock out or fake time is normally way larger than just dealing with a test that flakes from time to time.

That being said, I make a concerted effort to pass time in as a parameter rather than randomly calling now() in the middle of my code, but most code bases aren't greenfield.
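For anyone unsure what passing time in looks like in practice, here is a small sketch. The `session` shape and the function name are invented for the example.

```javascript
// The clock is a parameter with a real default, so production callers
// do not change, while tests can pass in a fixed time.
function isExpired(session, now = () => Date.now()) {
  return now() >= session.expiresAtMs;
}

const session = { expiresAtMs: 1000 };
console.log(isExpired(session, () => 999));  // false
console.log(isExpired(session, () => 1000)); // true
```

Because the test controls the clock, the boundary case at exactly `expiresAtMs` can be checked deterministically, with no sleeps and no dependence on how loaded the CI machine is.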

It is my personal opinion that in that first statement you are severely underestimating the time and product quality that is eventually lost because of a flaky test that isn't dealt with quickly.

We just implemented a rule where tests taking > 100ms to execute are considered a failure. You can experiment with the parameters or limits but this sort of rule has paid huge dividends.

The majority of the tests it has disqualified sucked and didn't really "test" anything. They also tended to be the most brittle.

You don't write tests that hit APIs and databases when you have strict time limits and this eliminates a lot of "flaky" tests.

Testing budgets are important, but disqualifying a test because it's "slow" is really silly.

1. Tests that hit APIs and databases are important because when you mock out the database and the API you're writing a test that asserts your code integrates correctly with what you think the database or API should be doing.

2. Some systems are highly asynchronous and thus very difficult to test individual cases in under 100 ms.

If I have a suite of 100 tests that completes in 10 seconds because everything runs concurrently, that's a lot more important than limiting an individual test to some arbitrary timespan.

Again, make a budget for tests and meet it instead of making hard rules about how fast or slow a given test should be.
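A sketch of what budgeting the suite rather than each test could look like. The function and option names are hypothetical, and it assumes your runner can report per-test durations and the overall wall-clock time.

```javascript
// Assumed input: per-test durations plus the measured wall-clock time of
// the whole run (shorter than the sum when tests run concurrently).
function checkBudget(durationsMs, wallClockMs, { suiteBudgetMs, warnPerTestMs }) {
  const slow = Object.entries(durationsMs)
    .filter(([, ms]) => ms > warnPerTestMs)
    .map(([name]) => name);
  return {
    ok: wallClockMs <= suiteBudgetMs, // only the suite budget fails the build
    slow,                             // slow tests are reported, not failed
  };
}

const result = checkBudget(
  { parse: 3, render: 250, roundTrip: 40 },
  400,
  { suiteBudgetMs: 10000, warnPerTestMs: 100 }
);
console.log(result); // { ok: true, slow: [ 'render' ] }
```

The slow test still gets surfaced so someone can look at it, but a busy build runner adding a few milliseconds to one test does not turn the build red.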

Obviously the context matters, but putting time limits on some tests can really cause headaches.

We've had problems where builds failed when many jobs ran together because build runner contention meant that jobs took just a bit longer than normal, and of course the re-run would work because contention is lower. The function still behaves as expected, just a bit slower; is that worth failing a test and hurting productivity?

The obvious exception is benchmarking performance-critical code, but run-of-the-mill functional tests should maybe have an unreasonably high time limit because it hurts less?

Maybe an area you want to carve out an exception for is benchmarking tests, i.e. tests that ensure that a function runs faster/slower than a certain speed. In this case, you probably need more than 100ms of testing for accurate results.

For example, password checking functions should be slow, and extremely hot/performance sensitive functions should be fast.

Of course, you should have fairly wide margins to avoid flakes.

Are there exceptions? Our codebase has some functions related to generating secure cryptographic keys and certificate chains, and that test suite takes many seconds to run since it actually stresses the CPU.

If you can't hit your database in CI within 100ms or talking to it makes your tests flaky, you got problems.

I assume the parent commenter meant unit tests must take < 100ms. Unit tests should be mocking databases in-memory. So latency to databases in unit tests should be less than a microsecond.

It's okay (and expected) if integration and end-to-end tests take more than 100ms.

My point isn't that it's OK to take more than 100ms for an end-to-end test. My point is there is no reason why you shouldn't be able to reliably run tests that hit databases, multiple times, within a 100ms deadline and without adding flakiness.

It's 2020, computers are fast. You can do tens of thousands of database transactions in a second. Starting up postgres takes a fraction of a second and consumes negligible resources, especially if you treat your DB files as a build artifact that already includes the schema migrations.

Spinning up a local instance of postgres defeats the whole purpose of an end-to-end test. The purpose of an end-to-end test is to test the whole system end-to-end; this includes network, other servers, etc.

Each of these can break, which is why e2e tests are inherently more flaky than unit tests. And network latency and transfer speeds mean that taking more than 100ms may happen often.

Another example of an e2e test that takes a long time are screenshot diffs. Spinning up Selenium and rendering a webpage might take more than 100ms in rare instances. iOS/Android tests are in the same vein where reloading an app takes > 100ms much of the time.

> Spinning up a local instance of postgres defeats the whole purpose of an end-to-end test.

So run some end to end test in a fully featured staging-style environment as well? I'm not disagreeing at all that having some environment that closely resembles production for very late phases as well is nice to have. I'm arguing against nonsense like mocking out database calls because of weird superstitions.

> And network latency and transfer speeds mean that taking more than 100ms may happen often.

The operative word being "may". If your staging server in Timbuktu needs to talk to your other staging server on the South Pole you will blow your 100ms window all the time. If it all runs within a single good DC, network latency ought to be ~0.1ms roundtrip and you can shovel around Gigabytes per second.

Tests that hit dynamic data stores like DBs don't fail very often due to network speed constraints compared to configuration, system updates, security, schema changes, and a myriad of other time-consuming hard to diagnose problems. It's easier to just prevent these situations.

In the abstract I'm not sure what a test that talks to the database is actually testing.

How do you effectively develop if you can't reliably talk to the DB in dev due to configuration, system updates, security, schema changes, and a myriad of other time-consuming hard to diagnose problems?

> In the abstract I'm not sure what a test that talks to the database is actually testing.

Well, presumably the reason that you have a DB in the first place is that your application requires a DB for some of its functionality. So how do you make sure this functionality works if you don't talk to the DB?

Don't get me wrong -- I love that you refuse to accept ridiculously slow tests and enforce a 100ms deadline; I wish everything I was working on had the same approach. But unless you have some specific requirements (like talking to a proprietary cloud DB), you should maybe investigate if you might not have to accept slow and flaky DBs in dev or CI either. My experience is you don't.

Demote it to a group of flaky tests that don't block pushes. Or, demote them and block the push when a flaky test fails 100% of the time.

Then what’s the point of them?

That gives me an idea: what if you had a pool of flaky/failing tests, but it was limited in size, and if you add a test to that pool that takes it over the maximum size, you must either fix it, or evict another flaky test by fixing it.

You could use the pool capacity and size as a proxy for technical debt. Starting a new, large project? You can increase the pool size temporarily until the project reaches mvp. Periodically the size of the test pool should be evaluated, tests fixed and the pool size shrunk.

Over time the overall pool size should shrink or at least stabilise.
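The capped pool is easy to prototype. This is just a sketch of the idea, with a made-up `FlakyPool` class, to show where the pressure to fix things comes from.

```javascript
class FlakyPool {
  constructor(capacity) {
    this.capacity = capacity;
    this.tests = new Set();
  }
  // Quarantine a test, refusing once the pool is full.
  add(test) {
    if (this.tests.has(test)) return;
    if (this.tests.size >= this.capacity) {
      throw new Error(`flaky pool is full (${this.capacity}): fix a test first`);
    }
    this.tests.add(test);
  }
  // Call this once a quarantined test has been fixed.
  release(test) {
    this.tests.delete(test);
  }
}

const pool = new FlakyPool(2);
pool.add("checkout");
pool.add("search");
try {
  pool.add("export");
} catch (err) {
  console.log(err.message);
}
pool.release("search"); // fixed, so there is room again
pool.add("export");
console.log([...pool.tests]); // [ 'checkout', 'export' ]
```

Raising or shrinking `capacity` is then an explicit, reviewable decision, which is what lets the number act as a proxy for technical debt.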

I imagine without any discipline though you just get a consistent set of tests that no-one has the bandwidth to go and fix or make unflaky. I wonder if flakiness of tests correlates with discipline to paying down tech debt and improving testing

They still detect real failures. You want to know when they go from failing 10% of the time to 100%. If your team wants, block pushes at 100% failure.
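As a sketch of that policy, a quarantined test could block pushes only once it has failed every one of its recent runs. The window size and the function name are assumptions for illustration.

```javascript
// `recent` is an assumed list of booleans, true meaning the run passed.
function shouldBlockPush(recent, window = 10) {
  const lastRuns = recent.slice(-window);
  if (lastRuns.length < window) return false; // not enough evidence yet
  return lastRuns.every((passed) => !passed);
}

console.log(shouldBlockPush([true, false, false, true, false], 5)); // false
console.log(shouldBlockPush([true, false, false, false, false, false], 5)); // true
```

An intermittent failure leaves the push alone, while a test that has gone from sometimes failing to always failing gets treated as a real break.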

But how is it a reliable indicator of any sort of failure if it doesn’t always fail? Doesn’t it erode trust?

As a long time DevOps engineer and now founder - my perspective on tests has really gone thru a rollercoaster. In the past life, I’d regularly be the guy rejecting deployments and asking for additional tests - barking at developers who ignored failures, lecturing on about the sanity saving features of a good integration test.

These days? Well the headlong rush to release features and fixes is a real thing. Ditching some tests in favor of manual verification is a good example of YC’s advice “do things that don’t scale”. I add tests when the existential fear notches up - but not a whole lot before that.

Like with almost all topics in software development - the longer I’m in the field - the less intense my opinions become. The right answer: “eh, tests would be nice here! Let’s get to that once the customer is happy!”

> Ditching some tests in favor of manual verification

I think that's the biggest problem with (unit) testing: there's too much "either test absolutely everything, or don't (unit) test anything at all" thinking. You can save yourself a lot of hassle if you just stick to writing tests for the things that are relatively straightforward to test: stay away from anything that requires an external system or anything that runs in parallel (multithreading, for example), and leave that stuff for manual testing.

A big benefit of having reasonable code coverage that I don't often see discussed is that if you have a unit test for each function in the system, you can also run any function without recompiling and redeploying the entire app: when you've isolated the source of a problem, this capability can be a godsend in allowing you to iterate quickly to narrow down the cause. I invariably find myself stuck working with systems for which unit tests were never developed, so there's no way to test a change besides recompiling the whole thing, redeploying a whole environment, waiting for the whole thing to start up, firing up the UI, navigating to the problem spot, and finally testing what I was after in the first place.

> "You can save yourself a lot of hassle if you just stick to writing tests for the things that are relatively straightforward to test"

Yup. Focus most of your effort on testing your pure functions. If you've followed functional core / imperative shell, typically what's left in the shell isn't very interesting from a testing perspective.

And if the majority of your codebase is still in the imperative shell, well, time to get to work :)


Some people view testing as a goal in and of itself rather than serving to make the product better. I think a lot of people start off their careers in a similar place to you (testing is needed because testing is good) and slowly move towards a more realistic or big-picture view of the role of testing.

I wonder if Marie Kondo-ing your tests is a good idea in general. This article reminded me a lot of "Write tests. Not too many. Mostly integration" https://kentcdodds.com/blog/write-tests/ and other such articles for limiting testing, like this test diamond article http://toddlittleweb.com/wordpress/2014/06/23/the-testing-di...

I've seen far too many unit tests at this point that just assumed too much to be meaningful, and conversations about unit testing quickly devolve into "no true scotsman" style arguments about what is truly a unit and what isn't.

Kent's article isn't so much about limiting test coverage as preferring one kind (integration) over another (unit). He's right. Tons of unit testing in application codebases is a huge, stinking code smell in my experience. Usually it means the code is overly abstract (interfaces everywhere for "mockability") or unnecessarily complex (e.g. conditional expressions factored out into functions called in one site just so they can be tested separately) or both.

A unit test is a test that fails in a way that immediately points to the failure cause.

When it comes to testing, I now follow the advice of Gary Bernhardt's presentation, Functional Core, Imperative Shell: https://www.destroyallsoftware.com/screencasts/catalog/funct...

The idea is to move all the logic of your app into pure functions that can be tested on the order of milliseconds. When you refactor your code to allow for this, everything just makes more sense. Tests can be run more often and you are far more confident about the behaviour of your code.

It's excellent advice, but you still need to test that the wiring of all those functions works as well.

As an example, PhotoStructure has over 4,000 tests which test isolated functions: classical "unit" tests, which run under a minute. There are only a few hundred integration tests, but those take a couple minutes. All of these tests are run on every commit across all supported OSes using GitLab's CI and self-hosted runners.

Maybe I just don't get it, but this design seems impractical for systems that use the network a lot.

I work on an app whose basic structure is

1. Receive data from a webhook (not controlled by me)

2. Make some HTTP requests based on data from 1

3. Send data to another HTTP API

The core of the logic is in step 2. How can I possibly have a functional core here? Step 2's behavior completely depends on the HTTP responses.

The idea is that your server handler mimics those steps and contains all the IO you need (aka external dependencies). The receiving and sending of the http requests + any other I/O you need should be immediately visible in the top level method (in a simple model). Everything below is functional and pure. Take a request, use pure functions to transform and reshape the data, pass that back up the call-stack to the top level method. Do some I/O operation against the database, go into functional land to shape it a bit, bring it back up, combine with other data in a pure way and then finally send the response.
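For the three-step webhook app, that split might look roughly like this. Everything here is hypothetical: the payload fields, the URLs, and the injected `httpGet` / `httpPost` functions are stand-ins, not a real API.

```javascript
// Pure: decide which requests step 2 needs, given the webhook payload.
function planRequests(event) {
  return event.itemIds.map((id) => `/items/${id}`);
}

// Pure: shape the fetched responses into the outgoing payload.
function buildOutgoing(event, responses) {
  const total = responses.reduce((sum, r) => sum + r.price, 0);
  return { orderId: event.orderId, total };
}

// Imperative shell: the only place that performs I/O. `httpGet` and
// `httpPost` are injected, hypothetical functions returning promises.
async function handleWebhook(event, { httpGet, httpPost }) {
  const urls = planRequests(event);
  const responses = await Promise.all(urls.map((url) => httpGet(url)));
  return httpPost("/orders", buildOutgoing(event, responses));
}

// The pure parts can be tested with plain data and no network.
const webhookEvent = { orderId: 7, itemIds: [1, 2] };
console.log(planRequests(webhookEvent)); // [ '/items/1', '/items/2' ]
console.log(buildOutgoing(webhookEvent, [{ price: 5 }, { price: 10 }]));
// { orderId: 7, total: 15 }
```

If later requests depend on earlier responses, the shell simply alternates: call a pure function to decide the next request, do the I/O, then hand the response to the next pure function. The shell stays thin enough that a handful of integration tests can cover it.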

Pardon my pseudo-JavaScript, but given an event handler like this:

    const handler = (
        message,
        data,
        req = defaultHttpClient // stand-in for your real HTTP client
    ) => {
        if (message === "a")
            return req("/api/a", data)
        if (message === "b")
            return {label: "b", value: req("/api/b", data)}
    }

By supplying the HTTP client using the default argument on line 4, the client can be replaced when testing with a mock handler:

    test("handler requests /api/a", t =>
        t.deepEqual(
            handler("a", {b: "c"}, (url, body) => [url, body]),
            ["/api/a", {b: "c"}]))

    test("handler requests /api/b", t =>
        t.deepEqual(
            handler("b", {b: "c"}, (url, body) => [url, body]),
            {label: "b", value: ["/api/b", {b: "c"}]}))

You could also write test cases that handle multiple steps, complex transformations, and so on.

It’s really nice being able to specify plainly the inputs and expected outputs of your functions and testing against that! It may not guarantee that everything behaves as the real system but it will catch 100% of your logic errors.

Hardcore test-all-the-things people would likely tell you to break whatever is in step 2 up into functions such that the webhook data are passed as parameters into a function, that all the HTTP request/response work is encapsulated into a separate function, and that the function under test should make calls to these.

Really hardcore ones might even suggest having separate functions for all these things, each of which can be tested as a single unit.

These people are, in my view, making a categorical error (that more testing of this sort means higher confidence and better quality code).

I wonder if they considered using something like ptrace to track which .rb files a given test suite loads? This would probably be orders of magnitude faster than Rotoscope or TracePoint. You'd get a much "coarser" data set, since you wouldn't know which individual tests called which modules (unless you run each test case in its own Ruby runtime). On the upside, you'd be able to watch JSON and YAML files.

> I wonder if they considered using something like ptrace to track which .rb files a given test suite loads?

It's not really possible, as we eager load the entire application on CI. And even if we didn't, to trace dependencies like this we'd need to boot the application in lazy-load mode for each test suite; that alone would be slower than running the entire test suite.

When it comes to browser automation tests, everywhere I've worked has suffered from intermittently failing tests. When I was at the Ministry of Justice (UK), I configured CircleCI to run tests hundreds of times [1] (over a number of days) through cron jobs. This allowed me to reflect on all test results, find out what failed most often, and eventually solve those root causes. This strategy worked well.
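The "run it many times, then see what fails most often" analysis can be sketched in a few lines of JavaScript. The input shape (an array of runs, each mapping a test name to `true` for pass or `false` for fail) and the name `rankByFailureRate` are assumptions for illustration, not the format CircleCI produces.

```javascript
// Rank tests by how often they failed across many runs of the same commit.
// `runs` is an array of objects like { login: true, checkout: false }.
const rankByFailureRate = (runs) => {
    const stats = new Map(); // test name -> { runs, failures }
    for (const run of runs) {
        for (const [name, passed] of Object.entries(run)) {
            const s = stats.get(name) || { runs: 0, failures: 0 };
            s.runs += 1;
            if (!passed) s.failures += 1;
            stats.set(name, s);
        }
    }
    return [...stats.entries()]
        .map(([name, s]) => ({ name, failureRate: s.failures / s.runs }))
        // Tests that always pass are not interesting here.
        .filter((t) => t.failureRate > 0)
        .sort((a, b) => b.failureRate - a.failureRate);
};

// Hypothetical results from four runs of the same commit.
const ranked = rankByFailureRate([
    { login: true, checkout: false, search: true },
    { login: true, checkout: true, search: true },
    { login: false, checkout: false, search: true },
    { login: true, checkout: false, search: true },
]);
console.log(ranked);
```

With this made-up data, `checkout` comes first (failed 3 of 4 runs), then `login` (1 of 4), and `search` is dropped because it never failed.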

Interestingly enough, just today I posted a GitHub thread [2] and asked the community to 'thumbs up' the video course they'd like for me to create. "Learn Browser Automation" is currently the highest voted. If it's the one I end up making, a huge focus will absolutely be on: How to reduce test flakiness with headless browsers.

Words of advice: avoiding sleep() and other brittle bits of code will help. But in addition, run your tests frequently to catch flakiness early. Invest in tooling that helps you diagnose failing tests (screenshots, devtools trace dumps). Configure something like VNC so you can remotely connect to the machine running the test.
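As a sketch of the "avoid sleep()" advice: rather than pausing for a fixed time and hoping the page is ready, poll for the condition and fail with a clear error after a deadline. `waitFor` and its options are hypothetical names written for this example; browser automation libraries generally ship their own waiting helpers, which are preferable when available.

```javascript
// Poll `condition` until it returns a truthy value or `timeoutMs` elapses.
const waitFor = (condition, { timeoutMs = 5000, intervalMs = 50 } = {}) =>
    new Promise((resolve, reject) => {
        const deadline = Date.now() + timeoutMs;
        const check = () => {
            let result;
            try {
                result = condition();
            } catch (err) {
                return reject(err);
            }
            if (result) return resolve(result);
            if (Date.now() >= deadline)
                return reject(new Error(`condition not met within ${timeoutMs} ms`));
            setTimeout(check, intervalMs);
        };
        check();
    });

// Usage with a stand-in for "the element has appeared".
let ready = false;
setTimeout(() => { ready = true; }, 100);
waitFor(() => ready, { timeoutMs: 1000 }).then(() => console.log("ready"));
```

The test proceeds as soon as the condition holds, so it is both faster than a generous sleep and less brittle than a short one.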

[1] https://github.com/ministryofjustice/fb-automated-tests/blob...

[2] https://github.com/umaar/dev-tips-tracker/issues/33

I think this shows how counting on horizontal scaling to handle inefficiency only works for a while, and will eventually introduce its own set of complexities - and dealing with those might be more trouble than coding things more efficiently in the first place. Then again, maybe it's worth it for a huge company like Shopify, because they save work for the feature devs by adding extra load on the test infrastructure devs, effectively increasing developer parallelism.

Finally, I wonder if they could have done more to speed up their tests? I'm maintaining a Rails codebase too, and I cut test time down by two thirds by rewriting tests with efficiency in mind - e.g. by ignoring dogma like 'each test needs to run completely independently of previous ones' (if I verify the state between tests, do I really need to pay the performance penalty of a database wipe?) The test-prof gem has a great tool, let_it_be, that allows you to designate db state that should not get wiped between tests. That and focusing on more unit tests and fewer integration tests has really gone a long way toward speeding things up again for me.

> Before the feature was rolled out, many developers were skeptical that it would work. Frankly, I was one of them.

Kudos for the Skepticism section. Every "how we fixed problem X at company Y" should have this section, and esp. written by someone who opposed the solution. The challenging and battle stories tend to be the most interesting part, at least for me.

Interesting that they don't talk about Bazel. Isn't skipping tests like this one of its biggest selling points, particularly for monorepo users?

Exactly. The first thing they should dispense with is the idea that they should run _all_ the tests in the monorepo. That's what doesn't scale.

Run only the affected tests and the overhead is now more proportional to the potential impact of the change.

That seemed to be what they were describing, but with dynamic dependency detection via introspection.

Does Shopify use Bazel? I would have guessed not. In any case, the more natural fit for them would be Nix, since it should be good at eliminating redundant builds, and they recently went all-in on it.

AFAIK Bazel and Nix are orthogonal and actually go well together. Nix is great for packaging and environments, but is anemic at actual build tooling (a big one: no incremental builds within a package). Bazel does very little to manage the environment or packages, but it has very granular and efficient build support.

True, but they both require a rather large learning and complexity budget. It would be difficult to introduce both into a shop at the same time.

My experience is that most of the time, randomly failing tests are actually failing because they were made to be random.

Sure, adding Faker to your test/mock data may help you to find rare edge cases. The problem is that those edge cases will be triggered in the future, in another context and by another developer. So not only will it probably not get fixed, it will also be a waste of time and an annoyance for someone. The same goes for options like the `random` setting in Jasmine (which runs your unit tests in a different random order each time).

So now I only have tests with static, predictable data.

Having tests that depend on the state of the previous one (or are affected by it) is quite common too. And not always easy to fix properly.

Once this is removed, the remaining random failures are usually authentic and helpful.

And here's a makefile implementation with a statically typed language:

  #run test and output to a .testresult file when the test is modified
  %.testresult: %.test
    $< > $@ 2>&1

  #depends on a .testresult file for each test
Sure, the dynamic typing introduces many problems, but surely you could have something a bit more halfway, like "%.testresult: %.test $(DEPENDENCIES) $(METAPROGRAMMING_MAGIC_DEPENDENCIES)", so that only the hopefully rare changes to the core files require the full test suite to be rerun. Dynamic typing seems to be the core of their problem, though, and this is a crazy complicated solution to try to work around that.

> $< 2&1> $@

I'm not really well versed in Makefiles, but that seems to pass the argument "2" to the testing program, launching it in the background with "&", and then doing a redirection for the stdout of an empty command to the target file with "1>".

I think you meant

  $< > $@ 2>&1
which could also be

  $< &> $@
I can't see how your example relates to static or dynamic typing, though.

Fixed, the former is correct, the latter I think is a bashism that may not work. Unfortunately I plucked that from a project in a very broken state.

> I can't see how your example relates to static or dynamic typing, though.

Because with static typing you know the object/file graph at compile time and only affected tests need to be run. In that example the .test files will only be built when their dependencies change and .testresult will only be created if .test is built, everything is incremental and "make test" won't execute any tests if there are no changes that could affect them.

You know this in dynamic languages too, based on imports.

So if you have the build graph represented for your dynamic language, you get the same result.
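To make "only affected tests need to be run" concrete, here is a sketch of selecting tests from an import graph. The graph, the file names, and `affectedTests` are all made up for illustration; a real tool would derive the graph from the build system or by parsing imports, and dynamic loading or metaprogramming can hide edges from it.

```javascript
// imports[file] = files that `file` depends on directly (hypothetical project).
const imports = {
    "cart.test.js": ["cart.js"],
    "checkout.test.js": ["checkout.js"],
    "search.test.js": ["search.js"],
    "checkout.js": ["cart.js", "tax.js"],
    "cart.js": ["money.js"],
    "search.js": [],
    "tax.js": ["money.js"],
    "money.js": [],
};

// Return the test files that transitively depend on any changed file.
const affectedTests = (graph, changed) => {
    // Invert the graph: dependents[file] = files that import `file`.
    const dependents = {};
    for (const [file, deps] of Object.entries(graph)) {
        for (const dep of deps) {
            if (!dependents[dep]) dependents[dep] = [];
            dependents[dep].push(file);
        }
    }
    // Walk outward from the changed files (breadth-first).
    const seen = new Set(changed);
    const queue = [...changed];
    while (queue.length > 0) {
        const file = queue.shift();
        for (const parent of dependents[file] || []) {
            if (!seen.has(parent)) {
                seen.add(parent);
                queue.push(parent);
            }
        }
    }
    return [...seen].filter((f) => f.endsWith(".test.js")).sort();
};

console.log(affectedTests(imports, ["tax.js"]));
console.log(affectedTests(imports, ["money.js"]));
```

In this toy graph a change to `tax.js` selects only `checkout.test.js`, while a change to `money.js` selects both `cart.test.js` and `checkout.test.js`; `search.test.js` is never run for either change.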

This is just a poor (poor, poor) buggy implementation of Bazel.

The approach they use to apply this to a large Ruby project is interesting but this type of strategy has been in use since forever and at least to me, seems fairly obvious.

Running all the tests with every build is always a bad idea. A better approach that doesn't require fancy dynamic analysis is to organize tests in a way that it's clear what's likely to break and to make sure you're constantly running your test suite in QA environments.

Making a change to a module should force you to run that module's test suite. Interactions between modules can be tested all day in a loop and monitored before deploying to production.

I wonder if the tests are even serving a purpose at that point? If they can't reliably answer the question "did I accidentally break something" in a reasonable period of time what's the point?

We just did pretty much the exact same thing with a large in-house ETL application. We can do great static analysis of the dependencies of different jobs and the coverage of various tests. Most PRs now run a tiny fraction of the test suite (thousands of tests total) in minutes instead of an hour. We still run all test cases before deploying master, just in case we missed something.

Was there internal resistance to this, and if so what was it?

> has over 150,000 tests

> takes about 30-40 min to run on hundreds of docker containers in parallel

This seems like it might be a signal that now could be the time to start splitting services out in an SOA fashion and having their own test suites and some contract driven tests? Having to run that many tests on each commit is definitely a smell that something architecturally fundamental is wrong...

This sort of thing is why I always scoff at the idea of slow compile times for static strongly typed languages (eg. Scala, Haskell, OCaml, Rust, etc).

Sure, it definitely compiles slower than some other compiled language, and there is no compilation for an interpreted language. But if you factor in testing, the type systems of those languages can easily remove a massive amount of testing that other less rigorous languages would either a) not test at all, or b) test at a significant cost. I'd be willing to bet that if shopify were using a strongly typed language, 50-90% of their testing would be completely redundant, because the type system already takes care of it.

That isn't to say that there are not reasons to use dynamically typed languages...just that if you are building production systems in strongly typed languages, compile time is almost completely irrelevant as a factor in productivity, regardless of how much slower they compile in comparison to an alternative.

Does anyone have thoughts on whether test suites suffer from Goodhart's law? Sometimes I feel like they only work well if people assume they don't exist and commit accordingly.

My dad was a huge proponent of this back when he was in software; don't tell your developers what tests failed, only what percentage of tests failed.

A software test is, in many ways, like an exam you might write in university; just like an exam can't possibly cover 100% of what you're supposed to know, a test (especially an integration test for a large and complex system) can't possibly cover 100% of the conditions the system is supposed to operate under. An exam is a good way to measure if you know a subject though, and similarly a test suite is a good way to check that the quality of a system meets a certain bar.

Once you write the exam, if I go back and say "Here are all the questions you got wrong. Go study up and write the same exam tomorrow," though, it very much ceases to be a good measure of whether or not you know the subject matter. You can now "cheat the system" by studying only the parts of the subject that the exam covers.

Similarly, once your integration tests are failing, if someone tells you which tests are failing and how, you're going to go back and fix only what you need to get the tests passing. At this point, the tests stop being a good indication of code quality - 100% of the tests are passing, but you can't say that 100% of the defects have been removed, so the tests are, in a sense, now kind of worthless. They might stop a limited number of future defects getting in, but they're not doing an arguably much more important job of telling you what your overall quality level is.

If instead, when you submit a commit, I say "5% of the tests are now failing" and nothing else, you have to go look for defects in your code, and you're probably going to find a lot of defects on your own before you even get to the 5% that the tests are complaining about.

This sounds like a fun game to play with a team of developers who have no time constraints. In every organisation I've worked with you would get a very stern talking to for behaving like this.

My tests are specifically designed to show you where the defect is, so you can solve the immediate problem and get back to work. I don't expect every developer who triggered a failing test to perform a full analysis of the code base and resolve every other defect. That would be nice, if we had the time.

I'll prefix by saying that this is exactly how I write my tests, too. But, let's do a little critical thinking here and ask "Why do we write integration tests?" If the goal is to improve software quality, I'm afraid I have some disappointing results for you.

Back when my dad was working at a huge software company (BigCorp, let's say), he looked at how much effort the manual test team spent over a two week period, and how many defects they found. Then he did the same over the next two week period. Now, logically, in that second period, some of the earlier defects had been found and fixed, so it should now be harder to find defects, so the total defect count per unit of test effort should be lower, right? Armed with two data points, he did a regression and worked out how many defects they'd find if they did an infinite amount of testing; effectively the undiscovered defect count left in the product. The number he got was astoundingly huge - no one believed this was possible. So, he went over to the plotter and plotted out a giant effort/defect count curve, and then every two weeks he'd put a pin in his plot to show reality vs. his prediction, and for months and months until he got tired of doing it, he was pretty dead on. And he didn't just do this for one project, he did it for lots of projects, across lots of different teams of various sizes.
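The comment doesn't say which model the regression used. One simple model that can be fitted from two data points is geometric decay: if each period of equal test effort finds a constant fraction `r` of what the previous period found, the total ever found is the sum of a geometric series, `d1 / (1 - r)`. The sketch below implements that assumed model with made-up numbers; it is not a reconstruction of the original analysis.

```javascript
// Estimate total discoverable defects from two equal-effort test periods,
// assuming discoveries shrink by a constant ratio r = d2 / d1 each period.
// Total = d1 + d1*r + d1*r^2 + ... = d1 / (1 - r) = d1^2 / (d1 - d2).
const estimateTotalDefects = (d1, d2) => {
    if (!(d1 > 0) || !(d2 >= 0)) throw new RangeError("need d1 > 0 and d2 >= 0");
    if (d2 >= d1) return Infinity; // no decay observed: the model cannot bound the total
    return (d1 * d1) / (d1 - d2);
};

// Hypothetical counts: 100 defects found in period 1, 98 in period 2.
const total = estimateTotalDefects(100, 98);
const remaining = total - (100 + 98);
console.log({ total, remaining });
```

With these invented counts the model predicts 5000 defects in total, of which 4802 are still undiscovered after the first two periods. A small drop between periods implies a very large remaining count, which is the shape of the result the comment describes.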

On all of these projects, all the manual testing they could possibly hope to accomplish if they had their entire testing staff spend 100 years testing would have reduced the overall defect count by a tiny tiny fraction of a percent. So testing and then fixing bugs found in tests (at least in all the projects at BigCorp) didn't really have a huge impact on software quality.

And this should not really be a surprise; if you were manufacturing cars, you might test the power seats on every 100th car, and use this to figure out what the quality of power seats in your cars is. You might discover that only 95% of power seats are working, and you might think that's unacceptable. If you do, though, you're not going to "solve" the problem by fixing all the broken power seats you test; you're going to go figure out where in your manufacturing process/supply chain things are going wrong and fix the problem there, and improve your quality. The testing is a measure of your process.

Software is not so different - by the time it gets to integration testing, the code has been written. The level of quality of the code has largely already been set at this point - all the defects that are going to be introduced have already been introduced. The quality level is dependent upon your process and the skill level of your developers. So testing some arbitrary fraction of the lines-of-code is going to find problems in some percentage of those lines-of-code, but fixing those particular problems? Is this going to have a huge impact on quality?

> you're not going to "solve" the problem by fixing all the broken power seats you test; you're going to go figure out where in your manufacturing process/supply chain things are going wrong

I think the point is that examining the failing seats will lead you to the points in the process that should be fixed. Therefore you can fix the issue more efficiently than by going through the whole process. The same goes for knowing the details of failing software tests.

Imagine I hide the information about seats and tell you 10% of the finished cars have "some" defect. Where would you even start looking in the factory?

> You can now "cheat the system" by studying only the parts of the subject that the exam covers.

You can't consistently cheat the system if the exam randomly covers every topic studied over the year.

Therefore if your unit tests are not dumb but clearly defined for edge-cases (or at least defined for random points) then you can't cheat them consistently.

> don't tell your developers what tests failed, only what percentage of tests failed.

Puts me in mind of that famous exchange in the list of bad fault reports:

Bug: "Something is broken in the dashboard" Engineer response: "Fixed something in the dashboard"

Seems like your dad's strategy might invite this kind of anti-response ;-)
