
Running three hours of Ruby tests in under three minutes - nelhage
https://stripe.com/blog/distributed-ruby-testing
======
dankohn1
We're not nearly at Stripe's scale, but my startup (Spreemo) has achieved
pretty amazing parallelism using the commercial SaaS CircleCI. We have 3907
expects across 372 RSpec and Cucumber files. Our tests complete in ~14 minutes
when run across 8 containers.

One of the great strengths of CircleCI is that they auto-discover our test
types, calculate how long each file takes to run, and then auto-allocate the
files in future runs to try to equalize the run times across containers. The
only effort required on our part was splitting up our slowest test file when
we found it was taking longer to complete than a combination of files on the
other machines.
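A rough sketch of that kind of timing-based allocation: greedily assign the
slowest files first to whichever container currently has the least work. The
file names and timings below are invented, and this is only a guess at the
general technique, not CircleCI's actual algorithm.

```ruby
# Hypothetical per-file timings (in seconds) recorded from a previous run
timings = { "users_spec.rb" => 300, "billing_spec.rb" => 220,
            "search_spec.rb" => 90, "login_spec.rb" => 45 }

containers = Array.new(2) { { files: [], total: 0 } }

# Greedy bin-packing: slowest files first, each to the least-loaded container
timings.sort_by { |_, secs| -secs }.each do |file, secs|
  target = containers.min_by { |c| c[:total] }
  target[:files] << file
  target[:total] += secs
end

containers.map { |c| c[:total] }  # totals come out roughly balanced
```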

I also like that I can run pronto
[https://github.com/mmozuras/pronto](https://github.com/mmozuras/pronto) to
post Rubocop, Rails Best Practices, and Brakeman errors as comments on Github.

~~~
andreasklinger
i like pronto's approach

we simply added the linters/code analysis to the CI itself

reasoning: we try to have as little as possible "code style" discussion in PRs

------
clayallsopp
I'm super curious how Stripe approaches end-to-end testing (like
Selenium/browser testing, but maybe something more bespoke too)

My understanding is that they have a large external dependency (my term: "the
money system"), and running integration tests against it might be tricky or
even undependable. Do they have a mock banking infrastructure they integrate
against?

~~~
nelhage
This is a great question, and it's definitely a problem we have.

We don't have a single answer we use for every system we work on, but we
employ a few common patterns, ranging from just keeping hard-coded strings
containing the expected output, up to and including implementing our own fake
versions of external infrastructure. We have, for example, our own faked
ISO-8583 [1] authorization service, which some of our tests run against to get
a degree of end-to-end testing.

Back-testing is also incredibly valuable: We have repositories of every
conversation or transaction we've ever exchanged with the banking networks,
and when making changes to parsers or interpreters, we can compare their
output against the old version on all of that historical data.
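The shape of that back-test is simple: replay a corpus of historical messages
through the old and new implementations and flag any divergence. The parsers
and corpus below are stand-ins, not Stripe's actual code.

```ruby
# Two versions of a parser; the "new" one is the refactor under test
old_parser = ->(msg) { msg.split("|") }
new_parser = ->(msg) { msg.strip.split("|") }

# Hypothetical corpus of historical messages
corpus = ["0100|4242|USD", "0110|4242|00"]

# Any message where the two versions disagree is a regression to investigate
diffs = corpus.reject { |msg| old_parser.call(msg) == new_parser.call(msg) }
diffs.empty?  # the change is safe only if no historical message diverges
```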

[1]
[https://en.wikipedia.org/wiki/ISO_8583](https://en.wikipedia.org/wiki/ISO_8583)

~~~
heywire
>Back-testing is also incredibly valuable: We have repositories of every
conversation or transaction we've ever exchanged with the banking networks,
and when making changes to parsers or interpreters, we can compare their
output against the old version on all of that historical data.

Are you referring to test data or actual live transaction data? The latter
would seem like a huge liability and target for hackers.

~~~
nelhage
Live data, but they're stored redacted, and/or with sensitive data (e.g.
credit card numbers) replaced with opaque tokens that reference an encrypted
store that's carefully access-controlled.

~~~
ersii
Do you have any policy or decision on how long you plan to store that data?
What I'm wondering is: are the transactions currently "stored indefinitely"?
(I'm referring to both data stores, the tokenized and the encrypted one.)

------
com2kid
I am tired of this technology having to be re-invented time and time again.

The best I ever saw was an internal tool at Microsoft. It could run tests on
devices (Windows Mobile phones, but it really didn't care), had a nice
reservation and pool system and a nice USB-->Ethernet-->USB system that let
you route any device to any of the test benches.

This was great because it was a heterogeneous pool of devices, with different
sets of tests that executed appropriately.

The test recovery was the best I've ever seen. The back end was wonky as
anything: every single function returned a BOOL indicating whether it had run
correctly, and every function call was wrapped in an IF statement. That was
silly, but the end result was that every layer of the app could be restarted
independently. After enough failures, either a device would be automatically
removed from the pool and its tests rerun on another device, or a host machine
could be pulled out and the test package sent down to another host machine.

The nice part was the simplicity of this. All similar tools I've used since
have involved really stupid setup and configuration steps, with some sort of
crappy UI that was hard to use en masse.

In comparison, this test system just took a path to a set of source files on a
machine, plus the compilation and execution command lines; if the program
returned 0 the test was marked as a pass, and if it returned anything else it
was marked as a fail.

All of this (except for copying the source files over) was done through an
AJAX Web UI back in 2006 or so.

Everything I've used since then has either been watching people poorly
reimplement this system (frequently with worse error recovery) or just
downright inferior tools.

(For reference, a full test pass was ~3 million tests over about 2 days, and
there were opportunities for improvement; network bandwidth alone was a huge
bottleneck.)

All that said, the test system in the link sounds pretty sweet.

~~~
speedkills
I agree. We already have projects like [http://test-load-
balancer.github.io](http://test-load-balancer.github.io), but I have a feeling
I will see five more posts on HN in the next year about re-inventing this
wheel, and yet not a single contribution to existing solutions like tlb.

It must be a little depressing to build a really useful product you know many
people need, give it away only hoping people will use it and be happy, then
find out everyone would rather build their own.

But we do like to build things; it is in our nature. Plus, what looks better
on your resume: 1) I migrated my team's test suite to test load balancer in
two days, saving hours every test run. 2) I contributed improvements to the
open source test load balancer project. 3) I designed and implemented my own
distributed test load balancing tool!

~~~
com2kid
> 3) I designed and implemented my own distributed test load balancing tool!

This, so many times over.

It doesn't help that when interviewing, I kinda-sorta want to know that the
people I hire are capable of understanding systems from the ground up. The
best way to demonstrate that is to go and build a system from the ground up...

TLB looks cool; nice to see that such a tool exists, at least within one
ecosystem!

------
ryanong
If you want to implement this locally without using mini-test, check out
test-queue by Aman Gupta at GitHub.

[https://github.com/tmm1/test-queue](https://github.com/tmm1/test-queue)

One thing that really sped up our test suite was by creating an NGINX proxy
that served up all the static files instead of making rails do it. This saved
us about 10 minutes off our 30 minute tests.

~~~
tmm1
test-queue supports minitest too, and follows the same basic design outlined
in this article: a central queue sorted by slowest tests first, with test
runners forked off either locally or on other machines to consume off the
queue.

We use test-queue in a dozen different projects at GitHub and most CI runs
finish in ~30s. The largest suite is for the github/github rails app which
runs 30 minutes of tests in 90s.
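That slowest-first queue design can be sketched in a few lines. The version
below uses threads and an in-process queue purely for brevity (test-queue
itself forks worker processes and coordinates over sockets); the file names
and timings are made up.

```ruby
# Timings recorded on a previous run; sorting slowest-first lets the long
# files start as early as possible so they can't straggle at the end.
timings = { "a_spec.rb" => 12.0, "b_spec.rb" => 0.5, "c_spec.rb" => 3.0 }

queue = Queue.new
timings.sort_by { |_, secs| -secs }.each { |file, _| queue << file }

ran = Queue.new
workers = 2.times.map do
  Thread.new do
    loop do
      file = queue.pop(true) rescue break  # non-blocking pop; stop when drained
      ran << file                          # stand-in for executing the file
    end
  end
end
workers.each(&:join)
```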

~~~
ryanong
Thanks for the awesome work you have done on test-queue. We have really
appreciated it at SchoolKeep.

------
sytse
Very cool stuff. For reference at GitLab we use a less impressive and simpler
solution. We split the jobs off in [https://gitlab.com/gitlab-org/gitlab-
ce/blob/master/.gitlab-...](https://gitlab.com/gitlab-org/gitlab-
ce/blob/master/.gitlab-...](https://gitlab.com/gitlab-org/gitlab-
ce/blob/master/.gitlab-ci.yml) These jobs are run by separate runners, which
brought our time down from 1+ hours to 23 minutes
[https://ci.gitlab.com/projects/1/refs/respect_filters/commit...](https://ci.gitlab.com/projects/1/refs/respect_filters/commits/e58e75aa8860c4c1530ebe7ad1e4bf557fa1e848)

------
teacup50
How much cheaper (in time, code, effort, complexity) would it be if:

\- Their language runtime supported thread-based concurrency, which would
drastically reduce implementation complexity _and_ actual per-task overhead,
thus improving machine usage efficiency AND eliminating the process-tree
management concerns that introduce a requirement for things like Docker.

\- Their language runtime was AOT or JIT compiled, simply making _everything_
faster to a degree that test execution could be reasonably performed on _one_
(potentially large) machine.

\- They used a language with a decent type system, significantly reducing the
number of tests that had to be both written and run?

~~~
100k
Only if the early engineers could write in a language that they were as
productive in as Ruby. Getting Stripe launched was the key thing Stripe needed
to accomplish. Everything else follows from that.

~~~
teacup50
There are certainly enough languages to choose from.

~~~
100k
Sure there are tons of languages, with different strengths and weaknesses. The
part that is important is that the early engineers need to be productive in
the language.

Choose boring (to you) technology.

~~~
teacup50
If the technology that's boring to "early engineers" has significantly more
weaknesses than the alternatives, then, simply: the wrong people were hired.

~~~
100k
Stripe is worth $5 billion. I don't think they hired the wrong people.

~~~
teacup50
Ex post facto justification of complex interdependent decisions and
consequences by virtue of market valuation?

That's just dumb survival bias endemic to the YC camp.

It's not something you should actually base business and technical decisions
on.

------
yjgyhj
One thing I've noticed since coding with immutable data structures & functions
(rather than mutable OOP programs) is how tests run really fast, and are easy
to run in parallel.

I/O only happens in a few functions, and most other code just takes data in ->
transforms it -> returns data out. This means only a few functions need to
'wait' on something outside themselves to finish, and there are far fewer
delays in the code.

This is coding in Clojure for me, but you can do it in any language that has
functions (preferably with efficient persistent data structures, like the
tree-based PersistentVector in Clojure).

~~~
schneems
Immutable data structures give you easy parallelism, however there's a hidden
runtime cost: you have to allocate way more objects. For example, I was able
to save a ton of object allocations here: [https://github.com/mime-types/ruby-
mime-types/pull/93](https://github.com/mime-types/ruby-mime-types/pull/93)
mostly by mutating. For tasks that are not easily parallelizable it may be
slower to use immutable structures.

I mostly only ever hear about how fast FP languages are, so maybe they use
some tricks to avoid allocations somehow. I would be interested in hearing
more about it.

~~~
yjgyhj
Yes, mutating something in place in memory is more efficient, because you
don't need to allocate new memory.

I don't know all too much about other functional languages, as I learned perl
-> ruby -> javascript -> a little bit of C & Java & Go -> now Clojure. But I
find Clojure's collection data structures interesting. The vector (a
collection like a list or array) looks immutable, but under the hood it's a
tree. When you append to a vector, you seemingly get a new vector returned.

In reality, you added the new element as a node in a tree, then just adjusted
pointers so that the new and old versions share almost all of the pointers &
allocations. With simple arrays or lists, you would allocate every element
anew.

I don't know if I can explain it properly. I found this blog post very
interesting and easier to follow: [http://hypirion.com/musings/understanding-persistent-
vector-...](http://hypirion.com/musings/understanding-persistent-vector-pt-1)
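The sharing idea shows up even in a much simpler persistent structure than
Clojure's vector trees. A toy cons-list sketch in Ruby: "adding" allocates one
node and shares the entire tail with the old version, rather than copying.

```ruby
# Minimal persistent linked list node
Node = Struct.new(:head, :tail)

old_list = Node.new(2, Node.new(1, nil))
new_list = Node.new(3, old_list)  # "new" version; old_list is untouched

new_list.tail.equal?(old_list)  # both versions share the very same nodes
```

Clojure's PersistentVector applies the same principle to wide trees, so
appends copy only a small path of nodes instead of the whole collection.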

Personally, I find most things in business easily parallelisable. You mostly
decouple the I/O parts of something from the logic parts. But yeah, I still
have much to learn. Thanks for the link! Interesting :)

~~~
schneems
That's cool. In Ruby, Hash merges are really expensive. I played around with
a non-mutating data structure that holds references to two hashes and behaves
as if they had been merged, instead of allocating. It was a fun thought
experiment but wasn't 100% API backwards compatible, as mutations got ugly.
Thanks for the link.
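A hypothetical sketch of that merged-view idea (not the actual experiment): a
read-only wrapper over two hashes that answers lookups as if they had been
merged, without ever allocating the merged hash.

```ruby
# A view over a base hash and an overlay; lookups check the overlay first,
# which is what Hash#merge semantics give the overlay's keys.
class MergedView
  def initialize(base, overlay)
    @base, @overlay = base, overlay
  end

  def [](key)
    @overlay.key?(key) ? @overlay[key] : @base[key]
  end
end

view = MergedView.new({ a: 1, b: 2 }, { b: 3 })
view[:a]  # => 1
view[:b]  # => 3, same answer as { a: 1, b: 2 }.merge(b: 3)[:b]
```

The ugliness shows up exactly where the parent comment says: mutation. Writes
have to pick a hash to land in, which is where API compatibility breaks down.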

------
jtchang
Love this. Sometimes testing can be a huge pain in the ass. I know more than
one project I work on where just getting the tests to run is a lot of effort
in itself.

There is something to be said for code quality and having tests run in under
a few seconds. The ideal situation is when you can have a barrage of tests run
as fast as you are making changes to code. If we ever got to the point of
instant feedback that didn't suck, I think we'd change a lot about how we
think about tests.

------
sigil
_We opted for an alternate, dynamic approach, which allocates work in real-
time using a work queue. We manage all coordination between workers using an
nsqd instance... In order to get maximum parallel performance out of our build
servers, we run tests in separate processes, allowing each process to make
maximum use of the machine 's CPU and I/O capability. (We run builds on
Amazon's c4.8xlarge instances, which give us 36 cores each.)_

This made me long for a unit test framework as simple as:

    
    
        $ make -j36 test
    

Where you've got something like the following:

    
    
        $ find tests/
    
          tests/bin/A
          tests/bin/B
          ...
          tests/input/A
          tests/input/B
          ...
          tests/expected/A
          tests/expected/B
          ...
          tests/output/
    
        $ cat Makefile
    
          test : $(shell find tests/bin -type f | sed -e 's@/bin/@/output/@')
          
          tests/output/% : tests/bin/% tests/input/% tests/expected/%
                  @ printf "testing [%s] ... " $@
                  @ sh -c 'exec $$0 < $$1' $^ > $@
                  @ # ...runs tests/bin/% < tests/input/% > tests/output/%
                  @ sh -c 'exec cmp -s $$3 $$0' $@ $^ && echo pass || echo fail
                  @ # ...runs cmp -s tests/expected/% tests/output/%
         
          clean :
                  rm -f tests/output/*
    

You get test parallelism and efficient use of compute resources "for free"
(well, from make -j, because it already has a job queue implementation
internally). This setup closely resembles the "rts" unit test approach you'll
find in a number of djb-derivative projects.

The defining obstacle for Stripe seems to be Ruby interpreter startup time,
though. I'm not sure how to elegantly handle preforked execution in a
Makefile-based approach. Drop me a line if you have ideas or have tackled this
in the past; I've got a couple of projects stalled out on it.

------
atonse
On a previous project, I had built a shell script that essentially created n
mysql databases and just distributed the test files under n rails processes.

We were able to run tests that took an hour in about 3 minutes. It was good
enough for us. Nothing sophisticated for evenly balancing the test files, but
it was pretty good for 1-2 days of work.

------
vkjv
"This second round of forking provides a layer of isolation between tests: If
a test makes changes to global state, running the test inside a throwaway
process will clean everything up once that process exits."

But, then how do you catch bugs where shared mutable state is not compatible
with multiple changes?

~~~
praxulus
You write tests specifically for testing multiple changes. You shouldn't be
testing changes to global state by seeing how multiple supposedly independent
tests interact.

~~~
ambicapter
Is there any value in designing tests in such a way that they test multiple
things at once while still being able to isolate which specific thing is
responsible for breaking? I'm thinking of something like JMP.

------
arturhoo
Congratulations on what looks like a very challenging task. I'm assuming a
portion of those tests hit a database. How have you dealt with that? I assume
that a single instance, even on a powerful bare-metal server, could be a
roadblock in this situation. A few insights on the Docker/containerization
part of it would also be nice!

~~~
nelhage
Our test-running infrastructure spins up a pool of database instances on each
worker machine, one for each worker process. The test spin-up and teardown
code handles schema management, hooking into our DB access layer to create and
clean up database tables only if they're used by a given test.
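The one-database-per-worker shape looks roughly like this (the fork/pipe
plumbing and the URL naming scheme are hypothetical, not Stripe's actual
setup):

```ruby
# Each forked worker points at its own database so parallel tests never
# share data. Children report the URL they would use back over a pipe.
reader, writer = IO.pipe

pids = 4.times.map do |i|
  fork do
    reader.close
    db_url = "postgres://localhost/app_test_#{i}"  # hypothetical naming scheme
    # ...create the schema, run this worker's tests against db_url, tear down...
    writer.puts db_url
    exit!(0)
  end
end

writer.close
urls = reader.read.split("\n")
pids.each { |pid| Process.wait(pid) }
```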

------
Ono-Sendai
This is an interesting and possibly overlooked problem with using slow
languages like Ruby: your unit tests take forever to run (unless you spend a
lot of engineering effort on making them run faster, in which case they may
run somewhat acceptably fast).

~~~
Aqua_Geek
This isn't just a problem with Ruby. Our ObjC test suite for a project I work
on takes about 10 min to run, too.

~~~
Ono-Sendai
Our main product has approximately 5400 unit tests, over ~150 files. The test
suite runs on a single computer in about 14s. This is one of the advantages of
using a fast language (C++) with multithreading :)

~~~
aianus
Not exactly a fair comparison, because you have to compile the changed files
first and link, which I'm sure takes longer than 14s.

On the other hand, using ruby I can have it continuously run the tests for the
specific feature I'm working on without the long building step.

~~~
Ono-Sendai
Compiling a changed file and linking takes about 5s.

~~~
mappu
Is that the best case for changing a single cpp file, or the worst case for
changing a header file shared across many translation units?

~~~
Ono-Sendai
That's changing a cpp file. Changing a widely used header file could take 1-2
mins for a full or nearly full rebuild.

------
raverbashing
I guess a lot of problems come from the stupidly brain-dead way people usually
write tests (because it's the "recommended TDD way").

Things like using the same setup function for every test, and setting
up/tearing down for every test regardless of dependencies.
Also tests like

    
    
        def test1():
          do_a() 
          check_condition_X()
    

then

    
    
        def test2():
          do_a() 
          check_condition_Y()
    

Or

    
    
        def test1():
          do_a() 
          check_condition_X()
    
        def test2():
          do_a()
          do_b()
          check_condition_Y()
    

When it could all have been consolidated into one test.

Then people wonder why it takes so much time.

It also helps if you can skip database setup for tests that don't need it.
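As a sketch in Ruby, the consolidation suggested above might look like the
following; `do_a`, `do_b` and the checks are the placeholder names from the
pseudocode, stubbed out here so the example runs.

```ruby
# Stubs standing in for the pseudocode above
def do_a; @a_done = true; end
def do_b; @b_done = true; end
def check_condition_X; raise "X failed" unless @a_done; end
def check_condition_Y; raise "Y failed" unless @b_done; end

# One consolidated test: do_a runs once instead of once per test
def test_a_then_b
  do_a
  check_condition_X  # formerly test1's assertion
  do_b
  check_condition_Y  # formerly test2's assertion
end

test_a_then_b
```

The trade-off is the one the reply below raises: a combined test gives you a
less precise failure signal than two narrowly named tests.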

~~~
aianus
The time it saves me when I see 'test2' failed instead of 'test_enormous:137'
failed is worth more than the marginal computation required.

These are embarrassingly parallel problems, we just need better tools to fully
saturate every core on every node in the test cluster.

~~~
hinkley
My last project had a mean run time of <9ms per test. We were not at all
worried about parallelization. Nobody even mentioned it until we hit 1100
(eleven hundred) tests, and we ended up optimizing the build phase to reduce
the code/build/test cycle time instead.

------
falsedan
Oh hey, we have the same sort of system here. It's 60,000 Python tests which
take ~28 hours if run serially, but we keep it around 30-40 minutes. We wrote
a UI & scheduler & artifact distribution system (which we're probably going to
replace with S3). We run selenium & unit tests as well as the integration
tests.

We've noticed that starting and stopping a ton of docker containers in rapid
succession really hoses dockerd, also that Jenkins' API is a lot slower than
we expected for mostly-read-only operations.

Have you considered mesos?

~~~
badmadrad
Have you considered another containerization solution, like LXD? I feel like
this kind of testing fits the "container hypervisor" use case, which is what
LXD is designed for.

~~~
falsedan
We tried Docker, then had to drop back to running the tests outside of a
container (some old technical decisions in the project under test made it hard
to run in one). Things have improved since then, and we're close to running in
containers again.

Each executor gets a non-shared prod-like environment thanks to a handful of
docker containers. The same setup is used for dev, so switching the testing
environment to LXC would mean switching devs as well.

------
cthyon
Not sure if this has already been answered, but would Stripe's methods only
work with unit tests, where tests are not dependent on each other?

How would one go about building a similar distributed testing setup for end-
to-end tests where a sequence of tests has to be run in a particular order?
Finding the optimal ordering/distribution of tests between workloads would
certainly be more complicated. Maybe it could be calculated with
directed-graph algorithms?

~~~
matthewmacleod
_How would one go about building a similar distributed testing setup for end-
to-end tests where a sequence of tests have to be run in particular order._

I reckon that would be solving the wrong problem. End-to-end tests should be
independent of each other, and tests should never be dependent on the order in
which they are run. End-to-end tests might be longer as a result, but managing
the complexity of test dependencies will quickly cripple any system that uses
this approach, I imagine.

------
hinkley
_needle scratching on record_

They have an average of 9 assertions per test case. I think I may see part of
their problem.

~~~
junto
I'm not sure if you are talking from a performance perspective or a conceptual
perspective, but this provides a useful discussion on multiple assertions:

[http://programmers.stackexchange.com/questions/7823/is-it-
ok...](http://programmers.stackexchange.com/questions/7823/is-it-ok-to-have-
multiple-asserts-in-a-single-unit-test)

My 2 cents is that multiple assertions are legitimate, as long as they prove a
singular assumption. Hence (as per the test on that page), this is a valid use
of multiple assertions:

    
    
      [Test]
      public void ValueIsInRange()
      {
        int value = GetValueToTest();
    
        Assert.That(value, Is.GreaterThan(10), "value is too small");
        Assert.That(value, Is.LessThan(100), "value is too large");
      }

~~~
hinkley
[Edit] Thanks for the link. I have a whole bunch of comments in there to
upvote. Guess my evening is planned :) [Edit]

I would also like to point out that ranges, like the one in your example, are
almost always a symptom of an unstable test to begin with. I'd want to know
why you're providing a range. Does the test blow up if another server is
running tests at the same time? Let's fix that so the tests actually fail when
there is a failure.

Now, there are lots of matchers that misbehave for corner-case inputs, and an
assert like "make sure there's text, that it's a number, and that the number
equals 10" may be necessary in order to prove that "10" appears, especially
when you invert the test and say "make sure the number isn't 10". And in this
case I would say: write us a better matcher, so that everyone can benefit from
you figuring out how to do this.

This should go without saying, but I feel I have to repeat it every time
there's an audience:

Green is not the end goal of testing. Red when there is an actual problem is
the end goal of testing. Anything else is a very expensive way to consume
resources.

------
chinathrow
Any reason why a financial infrastructure provider like Stripe would run CI
tests on someone else's infrastructure? Isn't that a no-go from a security
point of view? Or, how do you trust the hosted CI company not to look at your
code?

~~~
patio11
_how do you trust the hosted CI company not to look at your code?_

Contracts, not firewalls, make the world go round.

~~~
scrollaway
To be fair, a contract does not guarantee the security practices of the
company you are contracting with, which means your code is only as safe as
_their_ weakest link.

~~~
NeutronBoy
Which is why contracts include things like right-to-audit, so you can verify
for yourself.

------
meesterdude
I wrote a rubygem called cloudspeq
([http://github.com/meesterdude/cloudspeq](http://github.com/meesterdude/cloudspeq))
that distributes Rails RSpec specs across a bunch of DigitalOcean machines to
reduce test execution time for slow test suites in dev.

One of the things I did that may be of interest is breaking up spec files
themselves to help reduce hotspots (or dedicating a machine to one
specifically).

Not as complex or as robust as what they did, but it works!

------
grandalf
It's interesting to imagine, for a test suite that would take three hours, how
much of the execution time is state management vs algorithm execution.

~~~
MrBra
No, they aren't going to switch to a pure functional language.

------
jwatte
[http://engineering.imvu.com/2011/01/19/buildbot-and-
intermit...](http://engineering.imvu.com/2011/01/19/buildbot-and-intermittent-
tests/)

~~~
jwatte
Also, since 2011, we have added features and platforms under test, yet
decreased test run time to < 4 minutes. So, yay progress!

------
rubiquity
Does this mean each process has its own database or are you able to use
transactions with the selenium/capybara tests?

------
throwaway832975
Pull-based load balancing is a generally underrated technique.

------
smegel
> Initially, we experimented with using Ruby's threads instead of multiple
> processes

Why, to be cool? Tests are a classic case of things that should be run in
isolation: you don't want tests interfering with each other or crashing the
whole test suite. Using separate processes would have been the sensible
approach to start with.

------
edoloughlin
Was anyone else expecting the article to be about replacing Ruby with a
compiled language?

~~~
werdnapk
No.

