Running three hours of Ruby tests in under three minutes

dankohn1 · on Aug 13, 2015

We're not nearly at Stripe's scale, but my startup (Spreemo) has achieved pretty amazing parallelism using the commercial SaaS CircleCI. We have 3907 expects across 372 RSpec and Cucumber files. Our tests complete in ~14 minutes when run across 8 containers.

One of the great strengths of CircleCI is that it auto-discovers our test types, calculates how long each file takes to run, and then auto-allocates the files in future runs to try to equalize the run times across containers. The only effort required from us was splitting up our slowest test file when we found that it was taking longer to complete than a combination of files on the other machines.
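CircleCI's actual allocation algorithm isn't described here, so the following is only a minimal sketch of one common way to equalize run times: sort files slowest first and always hand the next file to the least-loaded container. The file names and timings are made up for illustration.

```python
import heapq

def allocate(file_times, containers):
    """Assign test files to containers, slowest first, always to the least-loaded one."""
    # Heap of (total_seconds, container_index); the lightest container pops first.
    heap = [(0.0, i) for i in range(containers)]
    heapq.heapify(heap)
    buckets = [[] for _ in range(containers)]
    ordered = sorted(file_times.items(), key=lambda kv: kv[1], reverse=True)
    for name, seconds in ordered:
        load, idx = heapq.heappop(heap)
        buckets[idx].append(name)
        heapq.heappush(heap, (load + seconds, idx))
    return buckets

# Hypothetical timings (seconds) recorded from a previous run.
timings = {
    "checkout_spec.rb": 300,
    "login_spec.rb": 120,
    "cart_spec.rb": 110,
    "search.feature": 90,
}
buckets = allocate(timings, 2)
```

With these timings one container gets only checkout_spec.rb (300 s) and the other gets the remaining three files (320 s). No allocation can finish faster than the slowest single file, which is why splitting that file helps.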

I also like that I can run pronto https://github.com/mmozuras/pronto to post Rubocop, Rails Best Practices, and Brakeman errors as comments on Github.

andreasklinger · on Aug 13, 2015

i like pronto's approach

we simply added the linters/code analysis to the CI itself

reasoning: we try to have as little as possible "code style" discussion in PRs

clayallsopp · on Aug 13, 2015

I'm super curious how Stripe approaches end-to-end testing (like Selenium/browser testing, but maybe something more bespoke too)

My understanding is that they have a large external dependency (my term: "the money system"), and running integration tests against it might be tricky or even undependable. Do they have a mock banking infrastructure they integrate against?

nelhage · on Aug 13, 2015

This is a great question, and it's definitely a problem we have.

We don't have a single answer we use for every system we work on, but we employ a few common patterns, ranging from just keeping hard-coded strings containing the expected output, up to and including implementing our own fake versions of external infrastructure. We have, for example, our own faked ISO-8583 [1] authorization service, which some of our tests run against to get a degree of end-to-end testing.

Back-testing is also incredibly valuable: We have repositories of every conversation or transaction we've ever exchanged with the banking networks, and when making changes to parsers or interpreters, we can compare their output against the old version on all of that historical data.

[1] https://en.wikipedia.org/wiki/ISO_8583
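The back-testing idea can be sketched roughly as follows. This is not Stripe's code: the message format is a made-up "key=value;key=value" string rather than real ISO 8583, and both parser versions are hypothetical.

```python
def backtest(old_parser, new_parser, recorded_messages):
    """Return the recorded messages on which the two parser versions disagree."""
    mismatches = []
    for raw in recorded_messages:
        old_out, new_out = old_parser(raw), new_parser(raw)
        if old_out != new_out:
            mismatches.append((raw, old_out, new_out))
    return mismatches

def parse_v1(raw):
    # Hypothetical original parser.
    return dict(field.split("=") for field in raw.split(";"))

def parse_v2(raw):
    # Hypothetical rewrite: tolerates a trailing ';' and surrounding whitespace.
    pairs = (field.split("=", 1) for field in raw.split(";") if field)
    return {key.strip(): value.strip() for key, value in pairs}

# Stand-in for the repository of historical (redacted) messages.
history = ["mti=0100;amount=1000", "mti=0110;code=00"]
regressions = backtest(parse_v1, parse_v2, history)
```

An empty list means the new parser agrees with the old one on everything recorded so far; any entry is a concrete input to investigate.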

heywire · on Aug 13, 2015

>Back-testing is also incredibly valuable: We have repositories of every conversation or transaction we've ever exchanged with the banking networks, and when making changes to parsers or interpreters, we can compare their output against the old version on all of that historical data.

Are you referring to test data or actual live transaction data? The latter would seem like a huge liability and target for hackers.

nelhage · on Aug 13, 2015

Live data, but they're stored redacted, and/or with sensitive data (e.g. credit card numbers) replaced with opaque tokens that reference an encrypted store that's carefully access-controlled.
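As a rough illustration of the token replacement described above (not Stripe's implementation): the vault below is a plain in-memory dict standing in for the encrypted, access-controlled store, and the `ref_` token prefix and the card-number pattern are assumptions made for the example.

```python
import re
import secrets

class TokenVault:
    """Stand-in for the access-controlled store; a real one would encrypt at rest."""

    def __init__(self):
        self._values = {}

    def tokenize(self, value):
        token = "ref_" + secrets.token_hex(8)  # opaque, carries no card data
        self._values[token] = value
        return token

    def reveal(self, token):
        return self._values[token]

# Simplistic assumption: any run of 13 to 19 digits is a card number.
CARD_NUMBER = re.compile(r"\b\d{13,19}\b")

def redact(message, vault):
    """Replace every card-like number in message with a token from the vault."""
    return CARD_NUMBER.sub(lambda match: vault.tokenize(match.group()), message)

vault = TokenVault()
stored = redact("pan=4242424242424242;amount=1000", vault)
```

Only `stored` would go into the back-testing repository; recovering the original number requires access to the vault.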

ersii · on Aug 14, 2015

Do you have any policy or decision made on how long you plan on storing that data? What I'm wondering, are the transactions currently "stored indefinitely"? (I'm referring to both data stores. The tokenized and the encrypted one)

kevan · on Aug 14, 2015

> ranging from just keeping hard-coded strings containing the expected output, up to and including implementing our own fake versions of external infrastructure.

This sounds very familiar, we rely on external credit systems pretty heavily. We started by mocking service responses and including the response XML in our unit tests. Now we have a service simulator that returns expected values and has record/playback capability. It's not ideal and responses get outdated occasionally but we haven't found a more elegant way to handle it yet.
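A record/playback simulator of this kind might look something like the sketch below. The class name, the string payloads, and the `live_call` hook are all hypothetical; a real simulator would also need some way to expire outdated recordings.

```python
class ServiceSimulator:
    """Replay recorded responses; optionally record new ones from a live service."""

    def __init__(self, live_call=None):
        self.live_call = live_call  # callable hitting the real service, or None
        self.recordings = {}

    def request(self, payload):
        if payload in self.recordings:
            return self.recordings[payload]  # playback
        if self.live_call is None:
            raise KeyError(f"no recording for {payload!r}")
        response = self.live_call(payload)  # record mode
        self.recordings[payload] = response
        return response

# Hypothetical live service that counts how often it is really called.
calls = []

def fake_credit_bureau(payload):
    calls.append(payload)
    return "<score>700</score>"

simulator = ServiceSimulator(live_call=fake_credit_bureau)
first = simulator.request("<lookup id='42'/>")
second = simulator.request("<lookup id='42'/>")
```

The second request is served from the recording, so the external system is contacted only once.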

ngoede · on Aug 13, 2015

What percentage of the tests are full system, integration, and unit tests?

sanderjd · on Aug 13, 2015

I'm very curious about that as well. I worked on a big project that had a (perhaps analogous) large external dependency on networks of embedded devices in homes and businesses, and integration testing it was …difficult. I'd love to hear how Stripe solves that problem.

crdoconnor · on Aug 13, 2015

Could you not create mock embedded devices?

sanderjd · on Aug 13, 2015

That's basically what we did, but more like mocking things at the network communication boundary. But for an integration test, it was often unsatisfactory, because of things behaving differently than they do in the real world. We also had a suite of automated tests that communicated with real devices in a lab, which were much better, but extremely hard to maintain. My general experience was: tenable, but tricky. So I'd love to hear how others have handled similar things.

crdoconnor · on Aug 14, 2015

Was it not possible to build upon the mock device to make it run more realistically?

I pretty much do this habitually now:

1. Get report of bug in production or on staging.

2. Write test to reproduce the bug.

3. 2/3 of the time get stuck because the environment isn't capable of mimicking the bug.

4. Build upon the integration testing environment to make it capable of mimicking the bug.

I find the counter intuitive part of integration testing is that step 4 ends up being where most of the work is required and far too many people just don't do it because they feel it's not a worthwhile investment.

I actually ended up writing an open source framework to handle a lot of the boilerplate (which no other frameworks do AFAIK). Especially making mock devices easier to write (see http://hitchtest.com/ and check out HitchSMTP).

sanderjd · on Aug 14, 2015

> I find the counter intuitive part of integration testing is that step 4 ends up being where most of the work is required and far too many people just don't do it because they feel it's not a worthwhile investment.

That's all I'm saying: tenable, but difficult (read: expensive). Frankly, I'm not convinced it is a worthwhile investment. Hence, my interest in how others have approached a similar problem.

Hitch looks pretty nifty, but I'm not sold on the yml/jinja2 approach. I grew to loathe Cucumber, and this approach seems similar. If you can't convince your non-technical staff to write tests in this language (which, in my experience, you can't), then you're better off writing the honest-to-god code that programmers are comfortable with (and can more easily modularize and refactor). YMMV I suppose!

crdoconnor · on Aug 15, 2015

I don't like Cucumber either. I dumped it on a previous project and just wrote code instead, which I found to be easier too. In many ways this project was born out of the frustrations I felt.

YAML is different. It has much clearer syntax and the method mapping is super easy and can even easily handle more complex data structures being passed in in the steps (lists, dicts, lists of dicts, etc.) which Cucumber either couldn't do, or required tortuous syntax and horrible regexps to do.

I did it this way mainly to adhere to the rule of least power and to enforce separation of concerns between execution code and test scenarios. Readability by non-programmers is just a nice side effect.

I suppose one day I might make a GUI to generate the YAML - maybe then non-technical staff might write tests, but probably not before.

andreasklinger · on Aug 13, 2015

Not a Stripe member but i would assume that anything that involves intense security auditing, PCI, etc would be separate codebases that rarely change.

(eg cc handling could anon the CCs in a service before they reach the main app)

The integration with 3rd parties is a separate issue that exists no matter if it is banks or not - i would guess they abstracted that as well as services or libs and decide case by case.

com2kid · on Aug 13, 2015

I am tired of this technology having to be re-invented time and time again.

The best I ever saw was an internal tool at Microsoft. It could run tests on devices (Windows Mobile phones, but it really didn't care), had a nice reservation and pool system and a nice USB-->Ethernet-->USB system that let you route any device to any of the test benches.

This was great because it was a heterogeneous pool of devices, with different sets of tests that executed appropriately.

The test recovery was the best I've ever seen. The back end was wonky as anything: every single function returned a BOOL indicating whether it had run correctly or not, and every function call was wrapped in an IF statement. That was silly, but the end result was that every layer of the app could be restarted independently, and after so many failures either a device would be auto removed from the pool and the tests rerun on another device, or a host machine could be pulled out, and the test package sent down to another host machine.

The nice part was the simplicity of this. All similar tools I've used since have involved really stupid setup and configuration steps with some sort of crappy UI that was hard to use en-masse.

In comparison, this test system just took a path to a set of source files on a machine, the compilation and execution command line, and then if the program returned 0 the test was marked as pass, if it returned anything else it was marked as fail.
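The pass/fail contract described above is small enough to sketch in a few lines. This is an illustration only, not the Microsoft tool: the compile step is left out, and `sys.executable` stands in for a compiled test binary.

```python
import subprocess
import sys

def run_test(command):
    """Run one test command; exit status 0 means pass, anything else means fail."""
    result = subprocess.run(
        command,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return "pass" if result.returncode == 0 else "fail"

# Hypothetical test commands that simply exit with a chosen status.
passing = [sys.executable, "-c", "raise SystemExit(0)"]
failing = [sys.executable, "-c", "raise SystemExit(3)"]
```

Because the only interface is a command line and an exit status, the test itself can be written in any language.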

All of this (except for copying the source files over) was done through an AJAX Web UI back in 2006 or so.

Everything I've used since then has either been watching people poorly reimplementing this system (frequently with not as good error recovery) or just downright inferior tools.

(For reference a full test pass was ~3 million tests over about 2 days, and there were opportunities for improvement, network bandwidth alone was a huge bottleneck)

All that said, the test system in the link sounds pretty sweet.

speedkills · on Aug 14, 2015

I agree. We already have projects like http://test-load-balancer.github.io but I have a feeling I will see five more posts on HN in the next year about re-inventing this wheel and yet not see a single contribution to existing solutions like tlb.

It must be a little depressing to build a really useful product you know many people need, give it away only hoping people will use it and be happy, then find out everyone would rather build their own.

But we do like to build things, it is in our nature. Plus, what looks better on your resume: 1) I migrated my team's test suite to using test load balancer in two days, saving hours every test run. 2) I contributed improvements to the open source test load balancer project. 3) I designed and implemented my own distributed test load balancing tool!

com2kid · on Aug 14, 2015

> 3) I designed and implemented my own distributed test load balancing tool!

This, so many times over.

It doesn't help that when interviewing, I kinda-sorta want to know that the people I hire are capable of understanding systems from the ground up. The best way to demonstrate that is to go and build a system from the ground up...

TLB looks cool, nice to see that such a tool exists at least within one eco-system!

hueving · on Aug 13, 2015

>I am tired of this technology having to be re-invented time and time again.

So did Microsoft open source this? If not, quit complaining. Just because you saw a massive software engineering company doing something better doesn't mean everyone else who doesn't have access to it sucks for not reaching parity.

com2kid · on Aug 13, 2015

> So did Microsoft open source this? If not, quit complaining.

Microsoft alone has re-invented this at least a half dozen times. At least one version of it, more limited in some regards and more powerful in others, is sold as part of Visual Studio.

Of course the VS one is both much more "enterprisey" and less flexible in numerous ways.

(That said it does have nice charts.)

The industry as a whole though keeps remaking test frameworks again and again.

I admit that a custom made framework to solve a team's problems is going to be easier to use than an infinitely configurable framework that is designed to solve everyone's problems, Microsoft used to have that tool as well, and it was widely disliked for how little it did out of the box and how much work it required to get it up and running. (Also in its early days it had serious scaling problems, and its configuration + use required a lot of mental gymnastics)

I'm just annoyed that we haven't found a nice simple compromise solution, or at least created some fundamental building blocks.

On top of that so few testing systems pay attention to the user interface, if it takes me 5 minutes to add a single test, damned if I am going to be adding 50 tests.

Lots of test systems go with simple annotations, but then the instant I want something more powerful I am boned. MSTest was restricted like this for years, finally in VS2013 they made it much more extensible, but there is minimal C++ support. Other ecosystems are not a lot better, developers are really good at creating test systems that run on their local dev box, zippity do-da.

Then again I have spent most of my developer life in the devices area, which means test results need to at the very least get sent across the wire to a host machine of some type (depending on the intelligence of one's device under test).

I want my devs to be able to annotate a source file, have IPC code generated on both sides (device, and PC side library), and then have the test auto added to my test management system.

Bah humbug, I think I'll just write a parsing system with Perl and RegExs.

The manually adding tests to the test system part still sucks though. (There is an API for it, but again, mental gymnastics create a barrier to entry).

ryanong · on Aug 13, 2015

If you want to implement this locally without using mini-test, check out test-queue by Aman Gupta on GitHub.

https://github.com/tmm1/test-queue

One thing that really sped up our test suite was by creating an NGINX proxy that served up all the static files instead of making rails do it. This saved us about 10 minutes off our 30 minute tests.

tmm1 · on Aug 13, 2015

test-queue supports minitest too, and follows the same basic design outlined in this article: a central queue sorted by slowest tests first, with test runners forked off either locally or on other machines to consume off the queue.
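The queue design can be sketched in miniature as below. This is not test-queue's code: test-queue forks worker processes, while this sketch uses threads in one process to stay short, and `run_suite` and its parameters are hypothetical names.

```python
import queue
import threading

def run_suite(test_times, run_test, workers=4):
    """Run tests from one shared queue, slowest first.

    Returns {worker_id: [names of tests that worker ran]}.
    """
    work = queue.Queue()
    for name in sorted(test_times, key=test_times.get, reverse=True):
        work.put(name)
    assigned = {w: [] for w in range(workers)}

    def worker(worker_id):
        while True:
            try:
                name = work.get_nowait()  # pull the next slowest remaining test
            except queue.Empty:
                return  # queue drained, worker is done
            run_test(name)
            assigned[worker_id].append(name)

    threads = [threading.Thread(target=worker, args=(w,)) for w in range(workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return assigned
```

Pulling work dynamically means a worker that draws a slow test simply takes fewer tests overall, so no up-front balancing is needed.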

We use test-queue in a dozen different projects at GitHub and most CI runs finish in ~30s. The largest suite is for the github/github rails app which runs 30 minutes of tests in 90s.

ryanong · on Aug 13, 2015

Thanks for the awesome work you have done on test-queue. We have really appreciated it at SchoolKeep.

beilabs · on Aug 14, 2015

Really interested in this approach, can you point somewhere that talks further about this nginx proxy strategy?

bbuchalter · on Aug 14, 2015

Could you share a bit more about this nginx proxy setup for static assets? Basically mimicking production env?

bittersweet · on Aug 14, 2015

Seconded, I have not thought about this before or come across this idea at all and certainly sounds interesting!

sytse · on Aug 13, 2015

Very cool stuff. For reference at GitLab we use a less impressive and simpler solution. We split the jobs off in https://gitlab.com/gitlab-org/gitlab-ce/blob/master/.gitlab-... These jobs will be done by separate runners, this brought our time down from 1+ hours to 23 minutes https://ci.gitlab.com/projects/1/refs/respect_filters/commit...

teacup50 · on Aug 13, 2015

How much cheaper (in time, code, effort, complexity) would it be if:

- Their language runtime supported thread-based concurrency, which would drastically reduce implementation complexity and actual per-task overhead, thus improving machine usage efficiency AND eliminating the concerns about managing process trees that introduces a requirement for things like Docker.

- Their language runtime was AOT or JIT compiled, simply making everything faster to a degree that test execution could be reasonably performed on one (potentially large) machine.

- They used a language with a decent type system, significantly reducing the number of tests that had to be both written and run?

100k · on Aug 13, 2015

Only if the early engineers could write in a language that they were as productive in as Ruby. Getting Stripe launched was the key thing Stripe needed to accomplish. Everything else follows from that.

teacup50 · on Aug 13, 2015

There are certainly enough languages to choose from.

100k · on Aug 13, 2015

Sure there are tons of languages, with different strengths and weaknesses. The part that is important is that the early engineers need to be productive in the language.

Choose boring (to you) technology.

teacup50 · on Aug 13, 2015

If the technology that's boring to "early engineers" has significantly more weaknesses than the alternatives, then, simply: the wrong people were hired.

100k · on Aug 18, 2015

Stripe is worth $5 billion. I don't think they hired the wrong people.

teacup50 · on Aug 19, 2015

Ex post facto justification of complex interdependent decisions and consequences by virtue of market valuation?

That's just dumb survival bias endemic to the YC camp.

It's not something you should actually base business and technical decisions on.

pekk · on Aug 13, 2015

Thread-based concurrency based on shared mutable state doesn't reduce complexity.

teacup50 · on Aug 13, 2015

Thread-based concurrency doesn't require shared mutable state at the application implementation level.

someone7x · on Aug 13, 2015

Is it fair to assume that time, code, effort, and complexity would be some degree of cheaper? May very well be more expensive. Language choice isn't a silver bullet.

ryanong · on Aug 13, 2015

Would it be cheap enough to encourage a re-write, re-engineer the server stack, and re-train employees? I doubt it.

I think it is an interesting question but a bad one most of the time unless you take into account all the other external factors that don't include the language itself.

amalag · on Aug 14, 2015

JRuby will do the first and part of the second

brianwawok · on Aug 13, 2015

You mean something like Scala?

It would run the tests 10-100x faster per test, and also require less tests (due to having a real type system).

I do giggle a little when I see huge engineering hurdles people have to overcome because of the language that was chosen. Building an app that is going to scale to millions of users? May not want to use Ruby...

(Nothing against Stripe, I am a paying customer - love the product. I do suspect it would be easier to engineer on a better platform than RoR though).

yjgyhj · on Aug 13, 2015

One thing I've noticed since coding with immutable data structures & functions (rather than mutable OOP programs) is how tests run really fast, and are easy to run in parallel.

I/O only happens in a few functions, and most other code just takes data in -> transforms -> returns data out. This means I only have few functions that need to 'wait' on something outside of itself to finish, and much lesser delays in the code.

This is coding in Clojure for me, but you can do that in any language that has functions (preferable with efficient persistent data structures. Like the tree-based PersistentVector in Clojure).

schneems · on Aug 13, 2015

Immutable data structures give you easy parallelism, however there's a hidden runtime cost: you have to allocate way more objects. For example, I was able to save a ton of object allocations here: https://github.com/mime-types/ruby-mime-types/pull/93 mostly by mutating. For tasks that are not easily parallelizable it may be slower to use immutable structures.

I mostly only ever hear about how fast FP languages are, so maybe they use some tricks to avoid allocations somehow. I would be interested in hearing more about it.

yjgyhj · on Aug 13, 2015

Yes, mutating something at one place in memory is more efficient, because you don't need to allocate new memory.

I don't know all too much about other functional languages, as I learned perl -> ruby -> javascript -> little bit C & Java & Go -> now doing Clojure. But I find Clojure's collection data structures interesting. The vector type (a collection like a list or array) looks immutable, but under the hood it is a tree. When you append to a vector, you seem to get a new vector returned.

In reality, you added the new element as a node in a tree, then just modified pointers so that the new and old versions share almost all of the pointers & allocations. With simple arrays or lists, you would allocate every element anew.
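To make the sharing concrete, here is a much simplified sketch of a persistent vector. It is not Clojure's implementation: Clojure uses 32-wide nodes plus a tail optimization, while this version uses 4-wide tuples and only supports `append` and `get`. The names `PVec`, `_new_path`, and `_push` are made up.

```python
WIDTH = 4  # Clojure uses 32; 4 keeps the example readable

def _new_path(depth, value):
    """Build a fresh branch of the given height holding only value."""
    node = value
    for _ in range(depth):
        node = (node,)
    return node

def _push(node, depth, index, value):
    """Return a copy of node with value appended; untouched children are shared."""
    if depth == 1:
        return node + (value,)
    span = WIDTH ** (depth - 1)
    child = index // span
    if child < len(node):  # last child still has room: copy only that path
        return node[:child] + (_push(node[child], depth - 1, index % span, value),)
    return node + (_new_path(depth - 1, value),)  # start a new child

class PVec:
    def __init__(self, count=0, depth=1, root=()):
        self.count, self.depth, self.root = count, depth, root

    def get(self, i):
        node = self.root
        for level in range(self.depth - 1, 0, -1):
            node = node[(i // WIDTH ** level) % WIDTH]
        return node[i % WIDTH]

    def append(self, value):
        if self.count == WIDTH ** self.depth:  # tree is full: add a level on top
            root = (self.root, _new_path(self.depth, value))
            return PVec(self.count + 1, self.depth + 1, root)
        root = _push(self.root, self.depth, self.count, value)
        return PVec(self.count + 1, self.depth, root)
```

Appending returns a new `PVec` and leaves the old one valid; only the nodes on the path to the new element are copied, and every other subtree is the same object in both versions.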

Idk if I can properly explain. Found this blog post very interesting and easier to follow: http://hypirion.com/musings/understanding-persistent-vector-...

Personally, most things in business I find easily parallelisable. You mostly decouple the I/O parts of something with the logic parts of it. But yeah I still have much to learn. Thanks for the link! Interesting :)

schneems · on Aug 14, 2015

That's cool. In Ruby, Hash merges are really expensive. I played around with a non-mutating data structure that holds references to two hashes and behaves as if they had been merged, instead of allocating. It was a fun thought experiment but wasn't 100% API backwards compatible, as mutations got ugly. Thanks for the link.
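For comparison, Python's standard library ships a structure along these lines, `collections.ChainMap`. It is shown here only as an analogue of the idea, not as the Ruby experiment itself; the dictionary contents are made up.

```python
from collections import ChainMap

defaults = {"adapter": "mysql", "pool": 5}
overrides = {"pool": 20}

# Lookups try `overrides` first, then `defaults`; neither dict is copied.
merged = ChainMap(overrides, defaults)

pool = merged["pool"]        # found in overrides
adapter = merged["adapter"]  # falls through to defaults

# Writes land only in the first mapping, one place where mutation gets awkward.
merged["timeout"] = 3
```

Since the view holds references rather than copies, later changes to either underlying dict show up through `merged` as well.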

JBiserkov · on Aug 13, 2015

You are of course correct. I recommend you check out these 5 posts on persistent vectors in Clojure http://hypirion.com/musings/understanding-persistent-vector-... "Spoiler from part 4":

>Transients are an optimisation on persistent data structures, which decreases the amount of memory allocations needed. This makes a huge difference for performance critical code.

http://hypirion.com/musings/understanding-clojure-transients

birdsbolt · on Aug 13, 2015

Allocations don't have to be expensive if your GC is smart. Smart as C++ destructors positioning :D

schneems · on Aug 14, 2015

> if your GC is smart

Can you say more? I don't know how C++ uses destructor positioning? Seems to me if it gets to GC it's too late, you already made an allocation which is the expensive part.

Ono-Sendai · on Aug 13, 2015

You can allocate stuff on the stack.

schneems · on Aug 14, 2015

Allocating things on the stack is still allocating things? It's faster to mutate than to allocate?

Ono-Sendai · on Aug 14, 2015

Allocating stuff on the stack is basically free. You just have to decrement the stack pointer.

jtchang · on Aug 13, 2015

Love this. Sometimes testing can be a huge pain in the ass. I know more than one project I work on where getting them to run is a lot of effort in itself.

There is something to be said about code quality and having tests run in under a few seconds. The ideal situation is when you can have a barrage of tests run as fast as you are making changes to code. If we ever got to the point of instant feedback that didn't suck I'd think we'd change a lot about how we think about tests.

sigil · on Aug 14, 2015

> We opted for an alternate, dynamic approach, which allocates work in real-time using a work queue. We manage all coordination between workers using an nsqd instance... In order to get maximum parallel performance out of our build servers, we run tests in separate processes, allowing each process to make maximum use of the machine's CPU and I/O capability. (We run builds on Amazon's c4.8xlarge instances, which give us 36 cores each.)

This made me long for a unit test framework as simple as:

    $ make -j36 test

Where you've got something like the following:

    $ find tests/

      tests/bin/A
      tests/bin/B
      ...
      tests/input/A
      tests/input/B
      ...
      tests/expected/A
      tests/expected/B
      ...
      tests/output/

    $ cat Makefile

      test : $(shell find tests/bin -type f | sed -e 's@/bin/@/output/@')
      
      tests/output/% : tests/bin/% tests/input/% tests/expected/%
              @ printf "testing [%s] ... " $@
              @ sh -c 'exec $$0 < $$1' $^ > $@
              @ # ...runs tests/bin/% < tests/input/% > tests/output/%
              @ sh -c 'exec cmp -s $$3 $$0' $@ $^ && echo pass || echo fail
              @ # ...runs cmp -s tests/expected/% tests/output/%
     
      clean :
              rm -f tests/output/*

You get test parallelism and efficient use of compute resources "for free" (well, from make -j, because it already has a job queue implementation internally). This setup closely resembles the "rts" unit test approach you'll find in a number of djb-derivative projects.

The defining obstacle for Stripe seems like Ruby interpreter startup time though. I'm not sure how to elegantly handle preforked execution in a Makefile-based approach. Drop me a line if you have ideas or have tackled this in the past, I've got a couple projects stalled out on it.

atonse · on Aug 13, 2015

On a previous project, I had built a shell script that essentially created n mysql databases and just distributed the test files under n rails processes.

We were able to run tests that took an hour in about 3 minutes. It was good enough for us. Nothing sophisticated for evenly balancing the test files, but it was pretty good for 1-2 days of work.

vkjv · on Aug 13, 2015

"This second round of forking provides a layer of isolation between tests: If a test makes changes to global state, running the test inside a throwaway process will clean everything up once that process exits."

But, then how do you catch bugs where shared mutable state is not compatible with multiple changes?

praxulus · on Aug 13, 2015

You write tests specifically for testing multiple changes. You shouldn't be testing changes to global state by seeing how multiple supposedly independent tests interact.

ambicapter · on Aug 13, 2015

Is there any value in designing tests in such a way that they test multiple things at once while still being able to isolate which specific thing is responsible for breaking? I'm thinking of something like JMP.

arturhoo · on Aug 13, 2015

Congratulations on what looks like a very challenging task. I'm assuming a part of those tests hit a database. How have you dealt with it? I assume that a single instance, even on a powerful bare server, could be a road blocker in this situation. A few insights on the Docker/Containerization part of it would also be nice!

nelhage · on Aug 13, 2015

Our test-running infrastructure spins up a pool of database instances on each worker machine, one for each worker process. The test spinup and teardown code handles schema management, hooking into our DB access layer to create and clean up database tables only if they're used by a given test.
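A toy version of the create-tables-only-when-used idea might look like this. It is a sketch under heavy assumptions: an in-memory SQLite database stands in for each worker's private instance, and the schema, class, and method names are invented.

```python
import sqlite3

# Hypothetical schema; a real one would come from the application's migrations.
SCHEMA = {
    "charges": "CREATE TABLE charges (id INTEGER PRIMARY KEY, amount INTEGER)",
    "customers": "CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT)",
}

class WorkerDatabase:
    """One private database per worker process; tables are created on first use."""

    def __init__(self):
        self.conn = sqlite3.connect(":memory:")
        self.used = set()

    def table(self, name):
        # The access layer calls this before touching a table.
        if name not in self.used:
            self.conn.execute(SCHEMA[name])
            self.used.add(name)
        return name

    def teardown(self):
        # Clean up only what this test actually created.
        for name in self.used:
            self.conn.execute(f"DROP TABLE {name}")
        self.used.clear()

db = WorkerDatabase()
db.conn.execute(f"INSERT INTO {db.table('charges')} (amount) VALUES (1000)")
tables_used = set(db.used)
db.teardown()
```

A test that touches one table pays for one `CREATE` and one `DROP` instead of loading and clearing the whole schema.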

Ono-Sendai · on Aug 13, 2015

This is an interesting and possibly overlooked problem with using slow languages like Ruby - your unit tests take forever to run. (unless you spend a lot of engineering effort on making them run faster, in which case they may run somewhat acceptably fast)

Aqua_Geek · on Aug 13, 2015

This isn't just a problem with Ruby. Our ObjC test suite for a project I work on takes about 10 min to run, too.

Ono-Sendai · on Aug 13, 2015

Our main product has approximately 5400 unit tests, over ~150 files. The test suite runs on a single computer in about 14s. This is one of the advantages of using a fast language (C++) with multithreading :)

aianus · on Aug 13, 2015

Not exactly a fair comparison, because you have to compile the changed files first and link which I'm sure takes longer than 14s.

On the other hand, using ruby I can have it continuously run the tests for the specific feature I'm working on without the long building step.

Ono-Sendai · on Aug 13, 2015

Compiling a changed file and linking takes about 5s.

mappu · on Aug 13, 2015

Is that the best case for changing a single cpp file, or the worst case for changing a header file shared across many translation units?

Ono-Sendai · on Aug 13, 2015

That's changing a cpp file. Changing a widely used header file could take 1-2 mins for a full or nearly full rebuild.

raverbashing · on Aug 13, 2015

I guess a lot of problems come from the stupidly brain dead way people usually write tests (because it's the "recommended TDD way")

Things like using the same setup function for every test and setting up/tearing down for every test regardless of dependencies

Also tests like

    def test1():
      do_a() 
      check_condition_X()

then

    def test2():
      do_a() 
      check_condition_Y()

Or

    def test1():
      do_a() 
      check_condition_X()

    def test2():
      do_a()
      do_b()
      check_condition_Y()

When it could have been consolidated into 1 test
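One possible way to consolidate while still reporting each condition separately is `unittest`'s `subTest`. This is only a sketch: `do_a` and the two conditions are placeholders for whatever the real tests exercise.

```python
import unittest

def do_a():
    # Hypothetical expensive operation under test.
    return {"x": 1, "y": 2}

class ConsolidatedTest(unittest.TestCase):
    def test_do_a(self):
        result = do_a()  # the expensive step runs once
        with self.subTest("condition X"):
            self.assertEqual(result["x"], 1)
        with self.subTest("condition Y"):
            self.assertEqual(result["y"], 2)

suite = unittest.defaultTestLoader.loadTestsFromTestCase(ConsolidatedTest)
outcome = unittest.TestResult()
suite.run(outcome)
```

If one sub-test fails, the other still runs and the failure is reported under its own label.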

Then people wonder why it takes so much time?

Also helpful is if you can shutdown database setup for tests that don't need it

aianus · on Aug 13, 2015

The time it saves me when I see 'test2' failed instead of 'test_enormous:137' failed is worth more than the marginal computation required.

These are embarrassingly parallel problems, we just need better tools to fully saturate every core on every node in the test cluster.

hinkley · on Aug 13, 2015

My last project had a mean run time of <9ms per test. We were not at all worried about parallelization. Nobody even mentioned it until we hit 1100 (eleven hundred) tests, and we ended up optimizing the build phase to reduce the code/build/test cycle time instead.

raverbashing · on Aug 13, 2015

Though I certainly don't advocate a test that goes to 137 lines, I think the point of having to guess what the test is doing only by the name/messages is moot, you'll end up checking the test source code to see what it is doing exactly

hinkley · on Aug 13, 2015

mocha and jasmine (in the node/javascript space) support nested setup and teardown methods and it's been really challenging for me to go back to using other frameworks, languages.

Not only does the nesting help limit the amount of setup and teardown you do, but when broad-reaching functional changes hit you in version 2, 3, it's so much easier to reorganize your tests to get the pre- and post-conditions right when they are already grouped that way.

The sad thing is that it takes a few release cycles before you feel any difference at all, and a couple more before you're absolutely sure that there are qualitative differences between the conventions. So it seems like a pretty arbitrary selection process instead of an obvious choice.

twerquie · on Aug 13, 2015

I'm not sure if you're talking about the pain of going back to testing ruby / rails after using mocha / node, but I feel that specific pain, especially on projects with old-school Rails purists who insist on Test::Unit style. Switching to rspec gives you nested describe blocks with shared setup and teardown steps, as nice as mocha. Minitest has this BDD style built in too, but somehow the way Rails ties it in makes it difficult or impossible to take advantage of.

falsedan · on Aug 13, 2015

Oh hey, we have the same sort of system here. It's 60,000 Python tests which take ~28 hours if run serially, but we keep it around 30-40 minutes. We wrote a UI & scheduler & artifact distribution system (which we're probably going to replace with S3). We run selenium & unit tests as well as the integration tests.

We've noticed that starting and stopping a ton of docker containers in rapid succession really hoses dockerd, also that Jenkins' API is a lot slower than we expected for mostly-read-only operations.

Have you considered mesos?

badmadrad · on Aug 13, 2015

Have you considered another containerization solution like LXD? I feel like testing like this fits the "container hypervisor" use case and this is what LXD is designed to do.

falsedan · on Aug 14, 2015

We tried docker, then had to drop back to running the tests outside of a container (some old technical decisions in the project under test made it hard to run in a container). It's been improved since then, and we're close to running in containers again.

Each executor gets a non-shared prod-like environment thanks to a handful of docker containers. The same setup is used for dev, so switching the testing environment to LXC would mean switching devs as well.

akoumjian · on Aug 14, 2015

Anything significantly different in your Python implementation?

falsedan · on Aug 14, 2015

Hard to say anything more than what I posted without more details from Stripe.

lawrencewu · on Aug 14, 2015

hi frei

falsedan · on Aug 14, 2015

Other Dan…

cthyon · on Aug 14, 2015

Not sure if this has already been answered, but would Stripe's methods only work with unit tests where tests are not dependent on each other?

How would one go about building a similar distributed testing setup for end-to-end tests where a sequence of tests has to be run in a particular order? Finding the optimal ordering / distribution of tests between workloads would certainly be more complicated. Maybe it could be calculated with directed graph algorithms?
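A rough sketch of what the directed-graph idea could look like: treat "must run before" as edges, keep each connected group of tests together on one worker, order each group topologically, and hand whole groups to whichever worker has the least to do. The test names, the `DEPENDENCIES` map, and the helper names are all made up for illustration; nothing here is from Stripe's setup.

```ruby
# Hypothetical dependency map: test name => tests that must run before it.
DEPENDENCIES = {
  "create_account" => [],
  "add_card"       => ["create_account"],
  "charge_card"    => ["add_card"],
  "search_catalog" => [],
  "view_item"      => ["search_catalog"],
  "health_check"   => []
}.freeze

# Group tests into connected components, treating edges as undirected,
# so that tests which depend on each other always land on the same worker.
def dependency_groups(deps)
  neighbors = Hash.new { |hash, key| hash[key] = [] }
  deps.each do |test, prereqs|
    neighbors[test] # make sure tests without edges still get an entry
    prereqs.each do |prereq|
      neighbors[test] << prereq
      neighbors[prereq] << test
    end
  end

  seen = {}
  groups = []
  neighbors.keys.each do |start|
    next if seen[start]
    group = []
    stack = [start]
    seen[start] = true
    until stack.empty?
      node = stack.pop
      group << node
      neighbors[node].each do |other|
        next if seen[other]
        seen[other] = true
        stack << other
      end
    end
    groups << group
  end
  groups
end

# Order one group so every test comes after its prerequisites (Kahn's algorithm).
def run_order(group, deps)
  remaining = group.to_h { |test| [test, deps.fetch(test, []).dup] }
  order = []
  until remaining.empty?
    ready = remaining.select { |_, prereqs| prereqs.empty? }.keys
    raise ArgumentError, "dependency cycle among #{remaining.keys}" if ready.empty?
    ready.each { |test| remaining.delete(test) }
    remaining.each_value { |prereqs| prereqs.reject! { |p| ready.include?(p) } }
    order.concat(ready)
  end
  order
end

# Give the largest groups out first, each to the worker with the fewest tests so far.
def plan(deps, worker_count)
  workers = Array.new(worker_count) { [] }
  dependency_groups(deps)
    .map { |group| run_order(group, deps) }
    .sort_by { |ordered| -ordered.size }
    .each { |ordered| workers.min_by(&:size).concat(ordered) }
  workers
end

plan(DEPENDENCIES, 2).each_with_index do |tests, index|
  puts "worker #{index}: #{tests.join(' -> ')}"
end
```

This balances by test count only. A real scheduler would presumably weight groups by measured run time, and one long dependency chain still caps how much parallelism you can get.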

matthewmacleod · on Aug 14, 2015

> How would one go about building a similar distributed testing setup for end-to-end tests where a sequence of tests has to be run in a particular order?

I reckon that would be solving the wrong problem. End-to-end tests should be independent of each other, and tests should never be dependent on the order in which they are run. End-to-end tests might be longer as a result, but managing the complexity of test dependencies will quickly cripple any system that uses this approach, I imagine.

givehimagun · on Aug 14, 2015

I'd love to know if their integration tests use a database or reference external services of any sort.

We ended up making a compromise where each test can never expect another test to have run...but some tests expect certain test data to be present and in a known state. To handle that, every test cleans up the data of the previous run (Entity Framework has a nice change tracker where we can keep track of the unit of work before it is persisted). We wouldn't be able to parallelize everything though...we can only accept a single test to be active on the DB at a single point in time.

notduncansmith · on Aug 14, 2015

I think those would not be considered "unit" tests. Often the definition of unit tests includes the ability to run those tests in any order. Any tests that have to be run in a particular order (i.e. "stateful" tests) should be considered a single test, and likely an integration test at that.

hinkley · on Aug 13, 2015

needle scratching on record

They have an average of 9 assertions per test case. I think I may see part of their problem.

junto · on Aug 13, 2015

I'm not sure if you are talking from a performance perspective or a conceptual perspective, but this provides a useful discussion on multiple assertions:

http://programmers.stackexchange.com/questions/7823/is-it-ok...

My 2 cents is that multiple assertions are legitimate, as long as they prove a singular assumption. Hence (as per the test on that page), this is a valid use of multiple assertions:

  [Test]
  public void ValueIsInRange()
  {
    int value = GetValueToTest();

    Assert.That(value, Is.GreaterThan(10), "value is too small");
    Assert.That(value, Is.LessThan(100), "value is too large");
  }

hinkley · on Aug 13, 2015

[Edit] Thanks for the link. I have a whole bunch of comments in there to upvote. Guess my evening is planned :) [Edit]

I would also like to point out that ranges, like the one in your example, are almost always a symptom of an unstable test to begin with. I'd want to know why you're providing a range. Does the test blow up if another server is running tests at the same time? Let's fix that so the tests actually fail when there is a failure.

Now, there are lots of matchers that misbehave for corner case inputs, and an assert like "Make sure there's text, that it's a number, and that the number equals 10" may be necessary in order to prove that "10" appears, especially when you invert the test and say "Make sure the number isn't 10". And in this case I would say "write us a better matcher so that everyone can benefit from you figuring out how to do this".

This should go without saying, but I feel I have to repeat it every time there's an audience:

Green is not the end goal of testing. Red when there is an actual problem is the end goal of testing. Anything else is a very expensive way to consume resources.

hinkley · on Aug 13, 2015

That's two asserts, and yes you are essentially testing the same concept, which I'm comfortable with as long as it's not a regular thing. People go through all sorts of gymnastics to convince themselves "it's one thing" and I find it exhausting, especially since fixing the problem is usually easier than the rationalizing.

If multiple asserts is a regular thing, you can either start breaking down your tests, or write a custom matcher. The custom matcher gives you better diagnostics when it breaks, so is probably the way to go.

Assert.That(value, Is.Within(10, 100)); // matcher generates error message
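Since the thread is about Ruby, here is a minimal sketch of the same custom-matcher idea in plain Ruby. `BeStrictlyBetween` is a hypothetical class, not part of any library; it only implements `matches?` and `failure_message`, which are the two methods RSpec's matcher protocol asks for, so an object like this should be usable with `expect(value).to ...`.

```ruby
# Hypothetical matcher: one check, one diagnostic, instead of two separate asserts.
class BeStrictlyBetween
  def initialize(low, high)
    @low = low
    @high = high
  end

  # True when the value lies strictly inside (low, high).
  def matches?(actual)
    @actual = actual
    actual > @low && actual < @high
  end

  # The matcher builds the error message, so every caller gets the same diagnostics.
  def failure_message
    problem = @actual <= @low ? "too small" : "too large"
    "expected #{@actual} to be strictly between #{@low} and #{@high} (value is #{problem})"
  end
end

matcher = BeStrictlyBetween.new(10, 100)
puts matcher.matches?(42)
puts matcher.failure_message unless matcher.matches?(7)
```

The bounds are exclusive here to mirror the greater-than / less-than pair in the two-assert version.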

chinathrow · on Aug 13, 2015

Any reason why a financial infrastructure provider like Stripe would run CI tests on someone else's infrastructure? Isn't that a no-go from a security point of view? Or - how do you trust the hosted CI company not to look at your code?

patio11 · on Aug 13, 2015

> how do you trust the hosted CI company not to look at your code?

Contracts, not firewalls, make the world go round.

inopinatus · on Aug 13, 2015

Can't upvote this hard enough. It's a classic conceit of secops people that they are the only line of defence against unscrupulous behaviour. Systemic pathologies follow from this misbelief.

cf. also: "Enterprise Architects", a group of people who think building IT systems qualifies you to redesign an entire organisation.

scrollaway · on Aug 13, 2015

To be fair a contract does not guarantee the security framework of the company you are contracting, which means your code is only as safe as their weakest link.

NeutronBoy · on Aug 13, 2015

Which is why contracts include things like right-to-audit, so you can verify for yourself.

brown9-2 · on Aug 13, 2015

> how do you trust the hosted CI company not to look at your code

One can probably assume that they are not relying upon the secrecy of their code for security.

hawkice · on Aug 14, 2015

There are other reasons to keep code proprietary than fearing a security failure in the event the code leaks.

chinathrow · on Aug 15, 2015

Yes - I think their fraud detection code might be worth some $$$ if sold to the right folks.

jwatte · on Aug 14, 2015

If their code is right, everyone in the world reading it wouldn't be a problem.

meesterdude · on Aug 13, 2015

I wrote a rubygem called cloudspeq (http://github.com/meesterdude/cloudspeq) that distributes Rails RSpec specs across a bunch of DigitalOcean machines to reduce test execution time for slow test suites in dev.

one of the things I did that may be of interest is to break up spec files themselves to help reduce hotspots (or dedicate a machine to it specifically)
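To show why breaking up a hotspot file matters, here is a toy sketch (not how cloudspeq actually does it): files are handed out longest-first to the least-loaded machine, and the slowest single file sets a floor on wall-clock time until it is split. The file names and timings are invented.

```ruby
# Longest-first greedy assignment: each file goes to the currently least-loaded machine.
def assign(durations, machine_count)
  machines = Array.new(machine_count) { { files: [], total: 0 } }
  durations.sort_by { |name, seconds| [-seconds, name] }.each do |name, seconds|
    target = machines.min_by { |machine| machine[:total] }
    target[:files] << name
    target[:total] += seconds
  end
  machines
end

# The run finishes when the busiest machine finishes.
def wall_clock(machines)
  machines.map { |machine| machine[:total] }.max
end

# Hypothetical timings in seconds; checkout_spec.rb is the hotspot.
timings = {
  "checkout_spec.rb" => 300,
  "cart_spec.rb"     => 60,
  "user_spec.rb"     => 50,
  "search_spec.rb"   => 40
}

# Same total work, with the hotspot broken into three hypothetical parts.
split = timings.reject { |name, _| name == "checkout_spec.rb" }.merge(
  "checkout_a_spec.rb" => 100,
  "checkout_b_spec.rb" => 100,
  "checkout_c_spec.rb" => 100
)

puts "unsplit: #{wall_clock(assign(timings, 3))}s"
puts "split:   #{wall_clock(assign(split, 3))}s"
```

With these made-up numbers, no amount of extra machines gets the unsplit run below 300 seconds, because one file alone takes that long.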

Not as complex or as robust as what they did, but it works!

grandalf · on Aug 13, 2015

It's interesting to imagine, for a test suite that would take three hours, how much of the execution time is state management vs algorithm execution.

MrBra · on Aug 13, 2015

No, they aren't going to switch to a pure functional language.

jwatte · on Aug 14, 2015

http://engineering.imvu.com/2011/01/19/buildbot-and-intermit...

jwatte · on Aug 14, 2015

Also, since 2011, we have added features and platforms under test, yet decreased test run time to < 4 minutes. So, yay progress!

rubiquity · on Aug 13, 2015

Does this mean each process has its own database or are you able to use transactions with the selenium/capybara tests?

throwaway832975 · on Aug 15, 2015

Pull-based load balancing is a generally underrated technique.
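For anyone who hasn't seen it, a minimal sketch of the pull model using Ruby's built-in `Queue` and threads: nobody assigns work up front; each worker asks for the next job whenever it is free, so a slow job only holds up the worker running it. `run_pull_based` is a hypothetical helper, and in a real test runner the workers would more likely be processes or machines pulling over the network.

```ruby
# Workers pull the next job from a shared queue whenever they become idle.
# Jobs must not be nil or false, since a nil pop is used as the stop signal.
def run_pull_based(jobs, worker_count, &work)
  queue = Queue.new
  jobs.each { |job| queue << job }
  queue.close # once closed and drained, pop returns nil instead of blocking

  results = Queue.new
  workers = Array.new(worker_count) do |id|
    Thread.new do
      while (job = queue.pop)
        results << [id, job, work.call(job)]
      end
    end
  end
  workers.each(&:join)

  # All workers have finished, so the results queue can be drained safely.
  Array.new(results.size) { results.pop }
end

# Hypothetical "tests": sleeping stands in for uneven run times.
finished = run_pull_based([0.03, 0.01, 0.02, 0.01, 0.01], 2) do |seconds|
  sleep(seconds)
  :passed
end

finished.each { |worker_id, job, outcome| puts "worker #{worker_id} ran #{job}s job: #{outcome}" }
```

Which worker ends up with which job depends on timing, which is the point: the split adapts to how long each job actually takes.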

smegel · on Aug 13, 2015

> Initially, we experimented with using Ruby's threads instead of multiple processes

Why, to be cool? Tests are a classic case of things that should be run in isolation - you don't want tests interfering with each other or crashing the whole test suite. Using separate processes would have been the sensible approach to start with.

edoloughlin · on Aug 13, 2015

Was anyone else expecting the article to be about replacing Ruby with a compiled language?

werdnapk · on Aug 13, 2015