One of the great strengths of CircleCI is that it auto-discovers our test types, measures how long each file takes to run, and then allocates files in future runs to try to equalize run times across containers. The only work we had to do was split up our slowest test file once we noticed it took longer to complete than several files combined on the other machines.
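That kind of allocation can be approximated with a simple greedy heuristic: assign each file, slowest first, to whichever container currently has the least accumulated runtime. A sketch (file names and timings invented, not CircleCI's actual algorithm):

```ruby
# Greedy longest-processing-time split: hand each file (slowest first)
# to the bucket with the smallest total runtime so far.
def split_files(timings, containers)
  buckets = Array.new(containers) { { files: [], total: 0.0 } }
  timings.sort_by { |_, secs| -secs }.each do |file, secs|
    bucket = buckets.min_by { |b| b[:total] }
    bucket[:files] << file
    bucket[:total] += secs
  end
  buckets
end

timings = { "user_spec.rb" => 120, "api_spec.rb" => 90,
            "billing_spec.rb" => 60, "auth_spec.rb" => 45, "misc_spec.rb" => 30 }
split_files(timings, 2).each do |b|
  puts "#{b[:total]}s: #{b[:files].join(', ')}"
end
```

This is the classic LPT bin-packing heuristic; it won't always be optimal, but with real per-file timings it gets close enough that containers finish within a few seconds of each other.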
I also like that I can run pronto (https://github.com/mmozuras/pronto) to post RuboCop, Rails Best Practices, and Brakeman warnings as comments on GitHub.
We simply added the linters/code analysis to the CI itself.
Reasoning: we try to have as little "code style" discussion as possible in PRs.
My understanding is that they have a large external dependency (my term: "the money system"), and running integration tests against it might be tricky or even undependable. Do they have a mock banking infrastructure they integrate against?
We don't have a single answer we use for every system we work on, but we employ a few common patterns, ranging from keeping hard-coded strings of expected output up to implementing our own fake versions of external infrastructure. We have, for example, our own fake ISO-8583 authorization service, which some of our tests run against to get a degree of end-to-end testing.
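As a rough illustration of the "fake external infrastructure" pattern (not Stripe's actual service; real ISO-8583 is a binary field format, and the class name, response codes, and thresholds here are all invented simplifications):

```ruby
# A toy stand-in for an authorization endpoint: deterministic responses
# keyed off the request, so end-to-end tests get stable approvals/declines.
class FakeAuthService
  def authorize(request)
    amount = request.fetch(:amount)
    if amount <= 0
      { code: "12", message: "invalid transaction" }
    elsif amount > 50_000
      { code: "05", message: "do not honor" }   # simulate a decline path
    else
      { code: "00", message: "approved" }
    end
  end
end

svc = FakeAuthService.new
puts svc.authorize(amount: 1_000)[:message]   # small charge is approved
```

The value of a fake like this is that decline and error paths, which are hard to trigger against a real bank, become just another deterministic input.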
Back-testing is also incredibly valuable: We have repositories of every conversation or transaction we've ever exchanged with the banking networks, and when making changes to parsers or interpreters, we can compare their output against the old version on all of that historical data.
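The back-testing loop amounts to running both parser versions over the historical corpus and flagging any divergence. A sketch (the parsers here are stand-in lambdas, not the real implementations):

```ruby
# Back-test: return every historical message on which the old and new
# parser versions disagree. An empty result means the change is safe.
def backtest(corpus, old_parser, new_parser)
  corpus.reject { |msg| old_parser.call(msg) == new_parser.call(msg) }
end

old_parser = ->(msg) { msg.split(",").map(&:strip) }
new_parser = ->(msg) { msg.split(",").map(&:strip) }  # refactored version under test

diverging = backtest(["a, b", "c,d"], old_parser, new_parser)
puts diverging.empty? ? "parsers agree on all historical data" : "divergence: #{diverging}"
```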
Are you referring to test data or actual live transaction data? The latter would seem like a huge liability and target for hackers.
This sounds very familiar; we rely heavily on external credit systems. We started by mocking service responses and including the response XML in our unit tests. Now we have a service simulator that returns expected values and has record/playback capability. It's not ideal, and responses occasionally get outdated, but we haven't found a more elegant way to handle it yet.
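A minimal sketch of the record/playback idea (the same shape as Ruby's VCR gem for HTTP, though the class and names here are invented): the first call hits the live service and stores the response; later calls replay the stored response, so the suite runs without the external system.

```ruby
# Record/playback wrapper around a live service. Previously seen requests
# are answered from the store; unseen ones fall through to the real thing.
class Replayer
  def initialize(live_service, store = {})
    @live = live_service
    @store = store
  end

  def call(request)
    @store[request] ||= @live.call(request)
  end
end

live_calls = 0
live = ->(req) { live_calls += 1; "score for #{req}" }
replayer = Replayer.new(live)

replayer.call("customer-42")
replayer.call("customer-42")
puts live_calls   # the live service was only hit once
```

The "responses get outdated" problem is exactly the cache-invalidation half of this: the store has to be re-recorded periodically against the real service.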
I pretty much do this habitually now:
1. Get report of bug in production or on staging.
2. Write test to reproduce the bug.
3. 2/3 of the time get stuck because the environment isn't capable of mimicking the bug.
4. Build upon the integration testing environment to make it capable of mimicking the bug.
I find the counterintuitive part of integration testing is that step 4 ends up being where most of the work is required, and far too many people just don't do it because they don't think it's a worthwhile investment.
I actually ended up writing an open source framework to handle a lot of the boilerplate (which no other frameworks do AFAIK). Especially making mock devices easier to write (see http://hitchtest.com/ and check out HitchSMTP).
That's all I'm saying: tenable, but difficult (read: expensive). Frankly, I'm not convinced it is a worthwhile investment. Hence, my interest in how others have approached a similar problem.
Hitch looks pretty nifty, but I'm not sold on the yml/jinja2 approach. I grew to loathe Cucumber, and this approach seems similar. If you can't convince your non-technical staff to write tests in this language (which, in my experience, you can't), then you're better off writing the honest-to-god code that programmers are comfortable with (and can more easily modularize and refactor). YMMV I suppose!
YAML is different. It has a much clearer syntax, the method mapping is super easy, and it can handle complex data structures being passed in the steps (lists, dicts, lists of dicts, etc.), which Cucumber either couldn't do or required tortuous syntax and horrible regexes to do.
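For example, here is the kind of structured step data YAML carries naturally (the step name and fields are invented for illustration, not Hitch's actual schema), parsed with Ruby's stdlib:

```ruby
require "yaml"

# A step whose argument is a list of dicts -- YAML expresses this directly,
# where Gherkin would need a table plus regexp glue in the step definition.
step = YAML.safe_load(<<~YAML)
  send emails:
    - to: one@example.com
      subject: hello
    - to: two@example.com
      subject: goodbye
YAML

name, args = step.first
puts "#{name}: #{args.length} emails"   # step name maps straight onto a method
```

The parsed structure arrives as ordinary hashes and arrays, so dispatching a step to an execution method is a one-liner rather than a regexp-matching exercise.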
I did it this way mainly to adhere to the rule of least power and to enforce separation of concerns between execution code and test scenarios. Readability by non-programmers is just a nice side effect.
I suppose one day I might make a GUI to generate the YAML - maybe then non-technical staff might write tests, but probably not before.
(e.g. credit card handling could anonymize the card numbers in a dedicated service before they reach the main app)
Integration with third parties is a separate issue that exists regardless of whether banks are involved. I would guess they abstracted that as well, as services or libs, and decide case by case.
The best I ever saw was an internal tool at Microsoft. It could run tests on devices (Windows Mobile phones, but it really didn't care), had a nice reservation and pool system and a nice USB-->Ethernet-->USB system that let you route any device to any of the test benches.
This was great because it was a heterogeneous pool of devices, with different sets of tests that executed appropriately.
The test recovery was the best I've ever seen. The back end was wonky as anything: every single function returned a BOOL indicating whether it had run correctly, and every function call was wrapped in an IF statement. That was silly, but the end result was that every layer of the app could be restarted independently. After so many failures, either a device would be auto-removed from the pool and the tests rerun on another device, or a host machine could be pulled out and the test package sent down to another host machine.
The nice part was the simplicity of this. All similar tools I've used since have involved really stupid setup and configuration steps with some sort of crappy UI that was hard to use en masse.
In comparison, this test system just took a path to a set of source files on a machine plus the compilation and execution command line; if the program returned 0 the test was marked as a pass, and if it returned anything else it was marked as a fail.
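The entire contract, as described, fits in a few lines. A sketch (in Ruby, using ordinary process exit codes):

```ruby
# Exit-status-based test runner: run the command, 0 means pass,
# anything else means fail. That's the whole protocol.
def run_test(command)
  system(*command)
  $?.exitstatus == 0 ? :pass : :fail
end

puts run_test(["true"])    # pass
puts run_test(["false"])   # fail
```

The simplicity is the point: any program in any language that can set an exit code can be a test, with no framework integration at all.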
All of this (except for copying the source files over) was done through an AJAX Web UI back in 2006 or so.
Everything I've used since then has either been watching people poorly reimplement this system (frequently with worse error recovery) or just downright inferior tools.
(For reference, a full test pass was ~3 million tests over about 2 days, and there were opportunities for improvement; network bandwidth alone was a huge bottleneck.)
All that said, the test system in the link sounds pretty sweet.
It must be a little depressing to build a really useful product you know many people need, give it away only hoping people will use it and be happy, then find out everyone would rather build their own.
But we do like to build things, it is in our nature. Plus, what looks better on your resume:
1) I migrated my team's test suite to using a test load balancer in two days, saving hours every test run.
2) I contributed improvements to the open source test load balancer project.
3) I designed and implemented my own distributed test load balancing tool!
This, so many times over.
It doesn't help that when interviewing, I kinda-sorta want to know that the people I hire are capable of understanding systems from the ground up. The best way to demonstrate that is to go and build a system from the ground up...
TLB looks cool; nice to see that such a tool exists in at least one ecosystem!
So did Microsoft open source this? If not, quit complaining. Just because you saw a massive software engineering company doing something better doesn't mean everyone else who doesn't have access to it sucks for not reaching parity.
Microsoft alone has reinvented this at least half a dozen times. At least one version of it, more limited in some regards and more powerful in others, is sold as part of Visual Studio.
Of course the VS one is both much more "enterprisey" and less flexible in numerous ways.
(That said it does have nice charts.)
The industry as a whole though keeps remaking test frameworks again and again.
I admit that a custom-made framework that solves one team's problems is going to be easier to use than an infinitely configurable framework designed to solve everyone's problems. Microsoft used to have that tool as well, and it was widely disliked for how little it did out of the box and how much work it took to get up and running. (Also, in its early days it had serious scaling problems, and its configuration and use required a lot of mental gymnastics.)
I'm just annoyed that we haven't found a nice simple compromise solution, or at least created some fundamental building blocks.
On top of that, so few testing systems pay attention to the user interface. If it takes me 5 minutes to add a single test, damned if I am going to be adding 50 tests.
Lots of test systems go with simple annotations, but the instant I want something more powerful I am boned. MSTest was restricted like this for years; VS2013 finally made it much more extensible, but there is minimal C++ support. Other ecosystems are not a lot better; developers are really good at creating test systems that run on their local dev box, zippity-do-da.
Then again, I have spent most of my developer life in the devices area, which means test results need, at the very least, to be sent across the wire to a host machine of some type (depending on the intelligence of one's device under test).
I want my devs to be able to annotate a source file, have IPC code generated on both sides (device, and PC side library), and then have the test auto added to my test management system.
Bah humbug, I think I'll just write a parsing system with Perl and RegExs.
The manually adding tests to the test system part still sucks though. (There is an API for it, but again, mental gymnastics create a barrier to entry).
One thing that really sped up our test suite was creating an NGINX proxy that served all the static files instead of making Rails do it. This shaved about 10 minutes off our 30-minute tests.
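A sketch of the kind of NGINX config that does this: serve static assets straight from disk and proxy everything else to the Rails app (the paths, port, and upstream name here are placeholders, not the commenter's actual setup):

```nginx
upstream rails_app {
    server 127.0.0.1:3000;
}

server {
    listen 8080;
    root /srv/app/public;

    # Serve files that exist on disk directly; fall through to Rails otherwise.
    location / {
        try_files $uri @app;
    }

    location @app {
        proxy_pass http://rails_app;
        proxy_set_header Host $host;
    }
}
```

With this shape, requests for assets under public/ never touch the Ruby process at all, which is where the time savings come from.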
We use test-queue in a dozen different projects at GitHub and most CI runs finish in ~30s. The largest suite is for the github/github rails app which runs 30 minutes of tests in 90s.
- Their language runtime supported thread-based concurrency, which would drastically reduce implementation complexity and actual per-task overhead, improving machine-usage efficiency AND eliminating the process-tree management concerns that create a requirement for things like Docker.
- Their language runtime was AOT or JIT compiled, simply making everything faster to a degree that test execution could be reasonably performed on one (potentially large) machine.
- They used a language with a decent type system, significantly reducing the number of tests that had to be both written and run?
Choose boring (to you) technology.
That's just dumb survivorship bias endemic to the YC camp.
It's not something you should actually base business and technical decisions on.
I think it's an interesting question but a bad one most of the time, unless you take into account all the other external factors beyond the language itself.
It would run the tests 10-100x faster per test, and also require fewer tests (due to having a real type system).
I do giggle a little when I see huge engineering hurdles people have to overcome because of the language that was chosen. Building an app that is going to scale to millions of users? May not want to use Ruby...
(Nothing against Stripe, I am a paying customer - love the product. I do suspect it would be easier to engineer on a better platform than RoR though).
I/O only happens in a few functions, and most other code just takes data in -> transforms -> returns data out. This means only a few functions need to 'wait' on something outside themselves to finish, and there are far fewer delays in the code.
This is coding in Clojure for me, but you can do that in any language that has functions (preferably with efficient persistent data structures, like the tree-based PersistentVector in Clojure).
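A minimal sketch of that shape in Ruby terms (the function names are invented for illustration): the transformation is a pure function, and I/O is confined to a single thin wrapper.

```ruby
# Functional core / imperative shell: the logic is a pure function
# (data in -> data out), trivially testable with no waiting on anything.
def summarize(lines)
  lines.map(&:strip).reject(&:empty?).tally   # counts occurrences of each line
end

# The only place that touches I/O; everything interesting happens above.
def summarize_file(path)
  summarize(File.readlines(path))
end

puts summarize(["a", "b", "a", "  "]).inspect   # counts: "a" twice, "b" once
```

The pure core can be exercised by thousands of fast in-memory tests, while only a handful of tests need to touch the filesystem wrapper.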
I mostly only ever hear about how fast FP languages are, so maybe they use some tricks to avoid allocations somehow. I would be interested in hearing more about it.
In reality, you added the new element as a node in a tree, then just modified pointers so that the new and old versions share almost all of the pointers and allocations. With simple arrays or lists, you would allocate every element anew.
Idk if I can properly explain. Found this blog post very interesting and easier to follow: http://hypirion.com/musings/understanding-persistent-vector-...
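A cons list shows the same trick in miniature (a toy, not Clojure's 32-way PersistentVector tree): "adding" an element allocates exactly one node and shares the entire existing list as its tail.

```ruby
# Structural sharing with a persistent cons list: the old version and the
# new version are both valid, and they share all but one node.
Node = Struct.new(:head, :tail)

def cons(value, list)
  Node.new(value, list)   # one allocation, regardless of list length
end

old_list = cons(2, cons(1, nil))
new_list = cons(3, old_list)

puts new_list.tail.equal?(old_list)   # true: the old list is shared, not copied
```

Clojure's vectors generalize this from a linked list to a wide tree, so an "update" copies only the log-depth path from root to leaf and shares everything else.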
Personally, I find most things in business easily parallelisable. You mostly decouple the I/O parts of something from the logic parts. But yeah, I still have much to learn. Thanks for the link! Interesting :)
>Transients are an optimisation on persistent data structures, which decreases the amount of memory allocations needed. This makes a huge difference for performance critical code.
Can you say more? I don't understand how C++ destructor positioning comes into it. It seems to me that if it gets to GC it's too late; you already made the allocation, which is the expensive part.
There is something to be said about code quality and having tests run in under a few seconds. The ideal situation is when you can have a barrage of tests run as fast as you are making changes to code. If we ever got to the point of instant feedback that didn't suck I'd think we'd change a lot about how we think about tests.
This made me long for a unit test framework as simple as:

    $ make -j36 test
    $ find tests/
    $ cat Makefile
    test : $(shell find tests/bin -type f | sed -e 's@/bin/@/output/@')

    tests/output/% : tests/bin/% tests/input/% tests/expected/%
            @ printf "testing [%s] ... " $@
            @ sh -c 'exec $$0 < $$1' $^ > $@
            @ # ...runs tests/bin/% < tests/input/% > tests/output/%
            @ sh -c 'exec cmp -s $$3 $$0' $@ $^ && echo pass || echo fail
            @ # ...runs cmp -s tests/expected/% tests/output/%

    clean :
            rm -f tests/output/*

(Recipe lines are tab-indented in the actual Makefile. Each test is a binary under tests/bin/ that reads tests/input/% on stdin, and its stdout is compared against tests/expected/%.)
The defining obstacle for Stripe, though, seems to be Ruby interpreter startup time. I'm not sure how to elegantly handle preforked execution in a Makefile-based approach. Drop me a line if you have ideas or have tackled this in the past; I've got a couple of projects stalled out on it.
We were able to run tests that took an hour in about 3 minutes. It was good enough for us. Nothing sophisticated for evenly balancing the test files, but it was pretty good for 1-2 days of work.
But then how do you catch bugs where shared mutable state isn't compatible with concurrent changes?
On the other hand, using ruby I can have it continuously run the tests for the specific feature I'm working on without the long building step.
Things like using the same setup function for every test, and setting up/tearing down for every test regardless of dependencies. Then people wonder why it takes so much time? It's also helpful if you can skip database setup for tests that don't need it.
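One way to sketch the "skip database setup for tests that don't need it" idea (a hypothetical mini-harness, not a real framework's API): tag tests that need the database, and prepare it lazily at most once.

```ruby
# Tag-driven setup: the database is prepared only when a test declares it
# needs one, and only the first time.
class Suite
  attr_reader :db_setups

  def initialize
    @db_setups = 0
    @db_ready = false
  end

  def run(tests)
    tests.each do |test|
      prepare_database if test[:db] && !@db_ready
      test[:body].call
    end
  end

  private

  def prepare_database
    @db_setups += 1      # imagine schema load + seed data here
    @db_ready = true
  end
end

suite = Suite.new
suite.run([
  { db: false, body: -> {} },   # pure unit test: pays no DB cost
  { db: true,  body: -> {} },
  { db: true,  body: -> {} },
])
puts suite.db_setups   # 1
```

Real frameworks express the same idea with test metadata (e.g. tagging examples and scoping expensive hooks to the tag) rather than a hand-rolled runner.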
These are embarrassingly parallel problems, we just need better tools to fully saturate every core on every node in the test cluster.
Not only does the nesting help limit the amount of setup and teardown you do, but when broad-reaching functional changes hit you in version 2, 3, it's so much easier to reorganize your tests to get the pre- and post-conditions right when they are already grouped that way.
The sad thing is that it takes a few release cycles before you feel any difference at all, and a couple more before you're absolutely sure that there are qualitative differences between the conventions. So it seems like a pretty arbitrary selection process instead of an obvious choice.
We've noticed that starting and stopping a ton of docker containers in rapid succession really hoses dockerd, also that Jenkins' API is a lot slower than we expected for mostly-read-only operations.
Have you considered mesos?
Each executor gets a non-shared prod-like environment thanks to a handful of docker containers. The same setup is used for dev, so switching the testing environment to LXC would mean switching devs as well.
How would one go about building a similar distributed testing setup for end-to-end tests where a sequence of tests has to run in a particular order? Finding the optimal ordering/distribution of tests between workers would certainly be more complicated. Maybe it could be calculated with directed-graph algorithms?
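If the order really is fixed, the dependencies form a DAG, and a topological sort gives a valid schedule. A sketch using Ruby's stdlib TSort (the test names and dependencies are invented):

```ruby
require "tsort"

# Model test dependencies as a directed graph and let TSort produce an
# execution order in which every test runs after its prerequisites
# (TSort raises on cycles).
class TestGraph
  include TSort

  def initialize(deps)
    @deps = deps   # test name => names of tests it depends on
  end

  def tsort_each_node(&block)
    @deps.each_key(&block)
  end

  def tsort_each_child(node, &block)
    @deps.fetch(node, []).each(&block)
  end
end

deps = {
  "create_account" => [],
  "fund_account"   => ["create_account"],
  "withdraw"       => ["fund_account"],
  "close_account"  => ["withdraw", "create_account"],
}
order = TestGraph.new(deps).tsort
puts order.join(" -> ")   # every test appears after its dependencies
```

Distributing such chains across workers would then mean splitting the graph into independent subgraphs, since nodes connected by edges have to stay on the same worker (or be synchronized across workers).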
I reckon that would be solving the wrong problem. End-to-end tests should be independent of each other, and tests should never be dependent on the order in which they are run. End-to-end tests might be longer as a result, but managing the complexity of test dependencies will quickly cripple any system that uses this approach, I imagine.
We ended up making a compromise where no test may ever expect another test to have run...but some tests expect certain test data to be present and in a known state. To handle that, every test cleans up the data of the previous run (Entity Framework has a nice change tracker that lets us track the unit of work before it is persisted). We can't parallelize everything, though: we can only allow a single test to be active on the DB at any point in time.
They have an average of 9 assertions per test case. I think I may see part of their problem.
My 2 cents is that multiple assertions are legitimate as long as they prove a singular assumption. Hence (as per the test on that page), this is a valid use of multiple assertions:
    public void ValueIsInRange()
    {
        int value = GetValueToTest();
        Assert.That(value, Is.GreaterThan(10), "value is too small");
        Assert.That(value, Is.LessThan(100), "value is too large");
    }
I would also like to point out that ranges, like the one in your example, are almost always a symptom of an unstable test to begin with. I'd want to know why you're providing a range. Does the test blow up if another server is running tests at the same time? Let's fix that so the tests actually fail when there is a failure.
Now, there are lots of matchers that misbehave for corner-case inputs, and an assert like "make sure there's text, that it's a number, and that the number equals 10" may be necessary in order to prove that "10" appears, especially when you invert the test and say "make sure the number isn't 10". And in this case I would say "write us a better matcher so that everyone can benefit from you figuring out how to do this".
This should go without saying, but I feel I have to repeat it every time there's an audience:
Green is not the end goal of testing. Red when there is an actual problem is the end goal of testing. Anything else is a very expensive way to consume resources.
If multiple asserts is a regular thing, you can either start breaking down your tests, or write a custom matcher. The custom matcher gives you better diagnostics when it breaks, so is probably the way to go.
Assert.That(value, Is.Within(10, 100)); // matcher generates error message
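The same idea in Ruby terms (assert_in_range is an invented helper, not any framework's API): one purpose-built check for one logical assumption, with a failure message that says exactly which bound was violated.

```ruby
# A purpose-built range assertion: a single logical check whose diagnostics
# name the violated bound, instead of two generic asserts.
def assert_in_range(value, low, high)
  if value <= low
    raise "value #{value} is too small (must be > #{low})"
  elsif value >= high
    raise "value #{value} is too large (must be < #{high})"
  end
  true
end

puts assert_in_range(42, 10, 100)   # true
```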
Contracts, not firewalls, make the world go round.
cf. also: "Enterprise Architects", a group of people who think building IT systems qualifies you to redesign an entire organisation.
One can probably assume that they are not relying upon the secrecy of their code for security.
One of the things I did that may be of interest: break up the spec files themselves to help reduce hotspots (or dedicate a machine to a hotspot specifically).
Not as complex or as robust as what they did, but it works!
Why, to be cool? Tests are a classic case of things that should run in isolation: you don't want tests interfering with each other or crashing the whole test suite. Using separate processes would have been the sensible approach from the start.