How Rust is tested (brson.github.io)
299 points by brson 193 days ago | 76 comments



I love the concept of the "crater run" (or I guess now "cargo bomb"?), where the entire crate (i.e. "package") ecosystem is compiled, as a test for speculative compiler changes.

It's one of the huge benefits of being both a statically-typed compiled language and offering a modern package system. A lot of languages have the types & compiling but no standard package ecosystem (Haskell, Go, Java), or the package ecosystem but not the compiling (Ruby+Gems, JS+npm, etc). Rust is the only language I can think of that has both, although I'm sure there are others. I'm not sure if there's tooling for compiler writers in other languages to compile them all, though.

And this feature will only get more and more powerful as the ecosystem grows. I think it's a seriously important tool to keep the Rust language feeling stable.


What stops Java doing this is not that it doesn't have a "modern" package system, it's that the package system ships compiled bytecode rather than source code.

Conversely, what stops other languages with source-shipping package managers doing this is that packages don't include tests. One notable exception to this is Perl, where packages can, and typically do, include tests, and these are run as part of the normal installation process. There are even some public-spirited souls who download and test everything that gets uploaded:

http://www.cpantesters.org/

This incorporation of tests is both a consequence of and a justification for the fact that Perl inter-package dependencies aren't versioned, so you can never really be sure that you're going to download a combination that works.


An interesting thing I learned recently about cpantesters is that they'll also test packages against the bleeding edge versions of the interpreter, so sometimes a package maintainer will get an issue filed (or a PR) saying "your package is going to break in about eight months", which I find both funny and cool.


In fact I used this for the past Perl release, 5.26, to help test everything, since there was a known breaking change that was going to occur. I need to redo my efforts now that the release has happened, but it did help identify a couple of key packages that needed to be updated.


You say it's the absence of tests, but Ruby breaks RubySpec all the time, so I think the real issue is that the maintainers plain don't care.


This doesn't need a statically typed language at all. Here's the same thing, with test runs, nightly, for Racket: http://pkg-build.racket-lang.org/ (that's building all packages on the current release, there's another one against the nightly snapshot).


Haskell does have a community package repository (hackage) and a command for installing libraries and their dependencies (cabal).

https://hackage.haskell.org/


Actually, I'd say that these days stackage is the place to go for Haskell.


Everything flows from Hackage even if you're using Stackage.


That's a very controversial question; stack is divisive, and very opinionated (as are its authors).


stack is just software. It's software that a large portion (maybe the majority?) of the community prefers to use. Haskell is an open source project. Someone wanted to do package management in a different way, so they put in the work and wrote the code, and it turns out a lot of people like it.

I've seen some intense debates over the merits of the two package managers, but ultimately they both have their pros and cons, and are fortunately fairly compatible.


Software can be "opinionated", when it has very specific ideas about how to do things and enforces a specific workflow. That can be a good thing, if you like that workflow well enough, or a bad thing, if you don't, or if you need flexibility it doesn't provide.


Definitely. Just compare eg git, mercurial, bazaar and darcs. They can do roughly the same things these days, but have very different opinions. (And that's just their CLI. The various GUI front-ends are yet another source of opinions.)


You don't need a statically-typed or compiled language to have this sort of infrastructure! Julia does something similar for important changes — and it actually runs all the unit tests of all the registered packages.


Heck, an extreme is what I did for Carakan, Opera's last in-house JS VM, where we were running the testsuite of pretty much any notable JS library in 2010 in CI; that caught loads of subtle bugs. (That said, obviously harder to automate, given there's no standard for in-browser testing!)

Far as I'm aware, every browser is running some version of the jQuery testsuite nowadays, though I doubt that's the only third-party library tested in such a way in each browser.


This technique is used to at least some extent in the Haskell world as well. Library changes are sometimes judged by the amount of breakage that they cause (often, Stackage or a subset thereof is used for these tests as a set of "commonly" used packages).

Also, Haskell definitely has a standard package system: Hackage. (Stackage is just a "view"/subset of Hackage)


> "crater run" (or I guess now "cargo bomb"?)

Crater is the older tool that runs builds. CargoBomb is the newer tool that runs builds and tests.


Yeah, I was really impressed by that as well when it was first announced. Makes perfect sense once you hear it, but as an old fogey I just wasn't used to thinking on that kind of scale.

I had a similar lightbulb moment years ago when someone pointed out that y'know, these days it's perfectly feasible to test a simple float32 function by running every possible float32 bitpattern through it.
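
To make that concrete, here is a minimal sketch of such a sweep. Everything in it is an assumption for illustration: `fast_abs` is a hypothetical function under test (absolute value by clearing the sign bit), `math.fabs` stands in for a trusted reference, and `first_mismatch` is a made-up helper name. It only works when you have a reference implementation to compare against.

```python
import math
import struct

def bits_to_float32(bits: int) -> float:
    """Reinterpret a 32-bit unsigned integer as an IEEE 754 float32."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def fast_abs(x: float) -> float:
    """Hypothetical function under test: clear the float32 sign bit."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0x7FFFFFFF))[0]

def same_result(got: float, want: float) -> bool:
    """Equal values, treating NaN as matching NaN and +0.0 != -0.0."""
    if math.isnan(want):
        return math.isnan(got)
    return got == want and math.copysign(1.0, got) == math.copysign(1.0, want)

def first_mismatch(candidate, reference, patterns):
    """Return the first bit pattern where candidate and reference disagree."""
    for bits in patterns:
        x = bits_to_float32(bits)
        if not same_result(candidate(x), reference(x)):
            return bits
    return None

# Sampled sweep; pass range(2**32) instead for the exhaustive version.
print(first_mismatch(fast_abs, math.fabs, range(0, 2**32, 65537)))
```

The exhaustive `range(2**32)` sweep would likely take hours in pure Python; the same loop in a compiled language is where the "perfectly feasible" claim really applies.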


These days? Grab a 20 year old Pentium II and you can test a 300-cycle 32-bit function in an hour.


True... as long as you know what output you are expecting for all 2^32 patterns

(Not your point I know)


This is what Stackage(https://www.stackage.org/) LTS releases are in the Haskell world. A compiler and a large subset of the Hackage repository known to compile together and pass tests.


At my employer, we did some internal work on "compiling" - i.e. installing (including compiling any native extensions) and running test suites - on the entire Python, Ruby and NPM ecosystems.

Conference talk: https://www.youtube.com/watch?v=v7wSqOQeGhA


AFAIK, Go devs actually do a similar thing (i.e. test the compiler on a body of selected notable third party packages/apps) [citation needed]. An official central index similar to Rust's https://crates.io for Go is https://godoc.org (though there are other non-official ones too)


Yes, my impression is that many production-grade compiler dev teams have a technique like this, though often they do not talk much about it, often they are proprietary and/or covered by NDAs, and often they are, as you say, 'selected' code bases. I'm very curious for other compiler teams to say more about what they do here.


In the Python world where there are multiple viable implementations of the language, they often take large/notable packages and make those packages' test suites part of the language implementation's test suite.


> no standard package ecosystem ... Java

that's not true! The standard "package" ecosystem in java is maven, and the maven repository!


> There’s a big downside though in that landing patches to Rust is serialized on running the test suite on every patch, and it takes a particularly long time to run the Rust test suite in all the configurations we care about.

One solution is speculative batched testing. Kubernetes uses this to keep up with its very high rate of PRs (>40 merges/day) and ~1hr maximum test times.

Given a queue of patches A, B, C, D, start the normal testing against A, but also start testing the merge of A+B+C+D against the master. If the batch passes first, merge them all. If A is merged first, it's still "compatible" with A+B+C+D, so they can go ahead. Doing a batch in parallel with a single PR like this slightly increases testing load, but ideally doesn't cause any slowdowns versus the fully serialized case.
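
A rough simulation of that scheduling rule, as I understand it. This is a sketch under assumptions, not Kubernetes' actual implementation: `speculative_merge` and the `passes` oracle are hypothetical names, and the two test runs that would execute in parallel in a real CI system are simply run one after the other here.

```python
from typing import Callable, List, Sequence, Tuple

def speculative_merge(
    queue: Sequence[str],
    passes: Callable[[List[str]], bool],
) -> Tuple[List[str], List[str]]:
    """Merge a queue of patches, trying the whole batch alongside the head.

    passes(patches) is a stand-in for "master plus these patches is green".
    Returns (merged, rejected) in order.
    """
    merged: List[str] = []
    rejected: List[str] = []
    pending = list(queue)
    while pending:
        head = pending[0]
        # Speculative run: master + everything still in the queue.
        if len(pending) > 1 and passes(merged + pending):
            merged.extend(pending)
            pending = []
        # Normal run: master + the head of the queue only.
        elif passes(merged + [head]):
            merged.append(head)
            pending = pending[1:]
        else:
            rejected.append(head)
            pending = pending[1:]
    return merged, rejected

# Patch C is broken: the batch fails, so patches land one at a time.
print(speculative_merge(["A", "B", "C", "D"], lambda ps: "C" not in ps))
# (['A', 'B', 'D'], ['C'])
```

When every patch is fine, the first batch run passes and the whole queue merges after a single test cycle.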

It looks like bors already supports something like this with "rollups", where simple fixes can be marked for testing together: https://internals.rust-lang.org/t/batched-merge-rollup-featu...


Doesn't this process break when, e.g., master + A passes tests, master + A + B fails, but master + A + B + C + D passes again (having been fixed by C + D accidentally)?

I mean, it's pretty unlikely to happen, but it's possible. Plus it doesn't matter if all you cared about was that the HEAD of master passes tests, but not the whole history.

However, the current approach, if rebasing is used to squash every merge request into a single commit on top of master, allows you to have a linear history of the master branch where every commit is known to pass the whole test suite that was in existence at that moment, which, I think, helps a lot when bisecting.

I'm not sure if Rust compresses merge requests into a single commit on top of master, though.


In Rust the "verified product" is the chain of bors merges - we even save build artifacts for each node in the chain for easy bisection of issues. A rollup creates a single bors merge for all the commits in them, so it's only 1 link in the chain.
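
For illustration, bisecting over such a chain is just a binary search. The sketch below is not Rust's tooling: `first_bad_merge` and `is_good` are hypothetical names, and it assumes the first merge in the list is known good, the last is known bad, and the regression never "un-breaks" in between.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")

def first_bad_merge(merges: Sequence[T], is_good: Callable[[T], bool]) -> T:
    """Find the first bad merge in a linear chain.

    Assumes merges[0] is good, merges[-1] is bad, and that every merge
    after the first bad one is also bad.
    """
    if len(merges) < 2:
        raise ValueError("need at least one good and one bad merge")
    lo, hi = 0, len(merges) - 1  # invariant: merges[lo] good, merges[hi] bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_good(merges[mid]):
            lo = mid
        else:
            hi = mid
    return merges[hi]

# Ten merges, regression introduced by merge number 6.
print(first_bad_merge(list(range(10)), lambda m: m < 6))  # 6
```

With saved build artifacts, each `is_good` check is a download and a test run rather than a full rebuild, which is what makes this cheap.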


Yes, it's theoretically possible. When it happens in Kubernetes, it's far more common that m+A+B failed because of a flaky test than because B was actually broken, and C+D fixed it.

What we often see is m+A fails because of a flake, and m+A+B+C+D+E passes and merges as a batch. We have a lot of flaky tests. :(


I remember my first experience with Rust's flaky tests! (or more accurately, Cargo's):

https://github.com/rust-lang/cargo/pull/3715


I'd at some point drawn up a proposal for this, back when Rust started having problems with this. At the time I think we didn't really have the budget for such testing either.

bors' rollup support only lets one mark things for rollup, which show up differently in the queue. It's effectively a priority=-1. This is for things which probably won't break the build, and contributors make a batch rollup every now and then (bors has a button for this). Generally when I do rollups I go through these and ensure they're green on travis.

I've considered giving homu (the engine behind bors) better support for automatically doing this (at least automatically showing travis statuses) but I don't have much time.


Oh, neat. I've never heard it proposed quite that way before, where both branches being tested are compatible. That's real smart.

So far we haven't been able to stomach doing parallel integration builds because of the doubling of expenses.

As you see, our current solution is rollups, where a human batches low-risk PRs into one big PR. It works just ok, and is super high-maintenance.


You already trigger Travis tests for PRs before bors does an auto run, right?

In that case, it wouldn't double expenses. Assuming batches are slightly effective, it should reduce expenses: with a >50% success rate on batches of >=3 PRs, you don't have to do individual testing for the later PRs in the batch.


The Travis smoke tests against PRs are only done in a single configuration, so are relatively cheap.

I think it would double expenses because our CI runs at capacity. To do parallel builds we would have to contract for double the compute resources. (With a different purchase structure for our CPU time you may be right about that, but not sure).


If cost is a factor then you could use batch testing as your default mode of operation and not bother testing individual commits at all unless a batch fails.

Using batch testing as a default would both reduce costs and increase merge speed, assuming that your commits usually pass testing.

The downside of using batching as a default is that it wouldn't test every commit in isolation. That means it wouldn't necessarily be safe to roll back to a particular commit if that commit was tested as part of a batch. E.g. if patches A, B and C were tested together, then it's not certain that patch A by itself would pass the tests.


That's a good point that I didn't consider. You could squash commits to get that property back, but the results wouldn't carry much meaning.


If your test-suite pass rate is significantly greater than 70% (and statistically independent), you can gain throughput by grouping PRs into batches.

For example, if the pass rate is 80%, then by batching 3 PRs together there's a 51.2% chance that all three pass, which would save 2 test runs, and a 48.8% chance that the batch fails, which would cost 1 extra run. That's a pretty substantial increase in throughput.
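
The arithmetic can be checked with a few lines. This is a sketch assuming the simplest policy: one run for the batch, and if it fails, one run per PR afterwards. The function names are made up for illustration.

```python
def expected_runs_per_batch(pass_rate: float, batch_size: int) -> float:
    """Expected test runs to land one batch.

    One run for the batch itself, plus batch_size individual runs
    if the batch fails (independent failures assumed).
    """
    p_all_pass = pass_rate ** batch_size
    return 1.0 + (1.0 - p_all_pass) * batch_size

def break_even_pass_rate(batch_size: int) -> float:
    """Pass rate at which batching costs the same as testing one by one.

    Solves 1 + n * (1 - p**n) = n for p, giving p = (1/n) ** (1/n).
    """
    return (1.0 / batch_size) ** (1.0 / batch_size)

print(expected_runs_per_batch(0.8, 3))  # about 2.46 runs instead of 3
print(break_even_pass_rate(3))          # about 0.69
```

For batches of 3 the break-even point comes out near 69%, which is where the "greater than 70%" rule of thumb comes from under this model.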


Similar, but different, yeah: it gives you an easy button to make a new PR out of A + B + C + D together. We tend to batch up documentation changes this way.


That's close to what Google does internally. They have quite a few more than 40 commits a day to one giant repository.


They aren't serialized on each other, though. That's the main problem here.


> At some point though LLVM began doing a valid optimization that valgrind was unable to recognize as valid, and that made its application useless for us, or at least too difficult to maintain.

What optimization was that? Some searching didn't turn up any relevant results (other than this article).


I couldn't quite remember either, and brson told me that https://github.com/rust-lang/rust/issues/11710 was probably it. Can't seem to find where it was actually finally disabled though.


Money quotes:

"This is a known valgrind false positive and Rust already has a lot of suppressions for it. It's reported upstream on the LLVM bug tracker but is not really an LLVM issue, just a limitation of valgrind. LLVM is allowed to generate undefined reads if it can not change the program behavior."

[...]

"Just for the record, the undefined read doesn't change the behaviour because it's just a check whether free should not be called, i.e. either it jumps over the free call right away, or it performs the second check."


> The thing we do differently from most is that we run the full test suite against every patch, as if it were merged to master, before committing it into the master branch, whereas most CI setups test after committing, or if they do run tests against every PR, they do so before merging, leaving open the possibility of regressions introduced during the merge.

Hm, Travis CI on GitHub runs tests on a pull request before the merge. Is what they're doing really that unusual?


As it says in the quote you provided, the difference is that it tests what the result would be of the merge, which differs in that a merge might introduce regressions; if you've never had a bug introduced by a merge not doing what you expected, consider yourself one of the lucky ones.


I'm not sure I'm understanding things correctly. What I meant was that the test chronologically occurs before the merge takes place. But the test includes the merge itself; it's testing what the result would be after the merge. Have you used the system I'm talking about/do you know what I mean? Am I still misunderstanding something?


Here's an example scenario:

1. Your PR is tested with the merge against Master. All green!

2. Another PR is tested with the merge against Master. All green!

3. The other PR goes in first. All good!

4. Your PR goes in (merge happens, but introduces a bug). Oh no!


Thanks! Someone pointed it out in a sibling comment afterwards too. Kind of sucks, I wish they had a "do final check and merge" button that would merge PRs sequentially into master after re-testing when there's a collision like this. Any idea why they don't?


I think this is more of an issue with GitHub not firing another webhook, than Travis. Travis just does what it's told.


But that only works well if the PR is up to date with the targeted branch.

AFAIK, you have to manually merge the targeted branch into the PRs to be in an equivalent situation. Also, other commits might still be added into the targeted branch between the time the tests have been run and the PR is merged.


Oh I see, so the problem is that the merge commit is not the same as the one that the test runs for if there are any intervening commits between them. Interesting, I didn't quite realize that! (Though it seems obvious in hindsight, since a new push doesn't make all past PRs' tests re-run.) Thanks for explaining that! So it seems it's something between what they do and what they were trying to avoid. GitHub should really put a button to re-run tests on merges that are out of date...


Not unusual at all. Any organization using Gerrit with a rebase-only workflow and any popular CI system (like Jenkins) has been doing this for ages.


I appreciate the thoroughness of this. I've been using Rust for 2 years now, and the compiler has been rock solid.


It would be really nice if Travis had Linux-on-ARMv7 and Linux-on-aarch64 options on real ARM hardware. I wonder what the economics of Travis building such a thing on top of an existing ARM cloud offering like Scaleway would be.


> Today the longest-running configuration takes over 2 hours.

I'm curious which configuration that is, and how it could be accelerated.


Looking at the build that's building right now, all of them have passed but one, which has been going on for 2 hours and 23 minutes: https://travis-ci.org/rust-lang/rust/builds/252052639

> "env": "RUST_CHECK_TARGET=dist RUST_CONFIGURE_ARGS=\"--target=aarch64-apple-ios,armv7-apple-ios,armv7s-apple-ios,i386-apple-ios,x86_64-apple-ios --enable-extended --enable-sanitizers --enable-profiler\" SRC=. DEPLOY=1 RUSTC_RETRY_LINKER_ON_SEGFAULT=1 SCCACHE_ERROR_LOG=/tmp/sccache.log MACOSX_DEPLOYMENT_TARGET=10.7\n",

iOS, it seems.

Looking at the other builds,

* s390x Linux

* i686-apple-darwin

* x86_64-apple-darwin

Seems Apple is an overall slowpoke.


This has more to do with Travis-CI than Darwin et al.

https://www.traviscistatus.com/

Notice how the number of active OS X builds reaches the limit of 216 jobs and the backlog of OS X jobs increases without decreasing. This is a capacity issue that is likely related to high OS X build costs.

Another thing to notice is that the backlog correlates strongly with time of day in the EST timezone, where the backlog starts to take off after 9am EST.


It seems to vary from build to build... Compare https://travis-ci.org/rust-lang/rust/builds/251809226 where everything is under 2 hours except i686-apple-darwin.


As mentioned elsewhere in this thread, it stands to reason that Travis-CI is the culprit here.

https://www.traviscistatus.com/

Their slowest build jobs are all on Darwin/OSX-based systems and (notably) this is a problem numerous other projects have -- slow jobs on OSX build systems. There is a large backlog of OSX build jobs that starts around 9am EST.


Is that true for the paid version of Travis as well? If not, might be worth paying for it, to support a community as large as Rust's. (If the paid version is that backlogged as well, I'm surprised they don't put more resources into accelerating it; perhaps not enough of their customers care about macOS?)

Alternatively, given that Mozilla has their own extensive CI infrastructure (for Firefox), perhaps it'd make sense to migrate to that instead of Travis?


We are already paying Travis.

> Given that Mozilla has their own extensive CI infrastructure

While Mozilla does sponsor a lot of Rust stuff, it's fundamentally a community project, and so many of us (including myself) would prefer not to move closer to something that's Mozilla-specific.


FWIW Mac jobs are also often capacity-limited on Mozilla infrastructure. It's not unusual to have thousands of jobs waiting on hundreds of machines, which leads to multi-hour queues at peak times.

Fundamentally the problem with CI testing on Mac is that Apple don't provide the components you want to build a large pool of machines at reasonable cost, so everyone ends up doing slightly mad things involving hundreds of mac minis (obviously the Windows situation also isn't great; this kind of infrastructure loves free software on commodity hardware).


Thanks for the explanations.

> We are already paying Travis.

Have you heard anything from them about improving the issues with macOS? You've certainly got a case study in it being a problem.

> While Mozilla does sponsor a lot of Rust stuff, it's fundamentally a community project, and so many of us (including myself) would prefer not to move closer to something that's Mozilla-specific.

Fair enough; I only suggested it because I had the impression that Mozilla's bits were all based on Open Source systems, and it would just be a matter of sharing hosting infrastructure rather than duplicating it.


I'm not on the infra team so I'm not sure; I do know they pay attention to this stuff though.

Yeah, it's not really so much "it's not open source" as it is "Mozilla is the only organization who's really using this stuff". I used to worry about this a lot, but I think we've mostly gotten away from it now that a lot of organizations are using Rust. That is, if Rust was a Mozilla-only technology, I think Rust would have failed its mission. As such, I'm wary of things that seemingly tie Rust back into Mozilla too specifically.


If I may ask, how did you make Travis take your money?

We wanted to pay them (as an OSS project) for faster and more reliable builds for a long time but they always refused.


I wasn't directly involved, so I'm not sure. You should email the infrastructure team: infra-team@rust-lang.org


This comment should make things clearer: https://github.com/travis-ci/travis-ci/issues/7304#issuecomm...

> The Mac queues for travis-ci.org and travis-ci.com are separate in terms of provisioned capacity, but use the same underlying infrastructure. We make sure we have enough provisioned capacity for our customers on travis-ci.com who are testing private repositories on our Mac infrastructure. For travis-ci.org we provide a greater capacity, in terms of concurrent VMs, but have had to make some hard decisions on how and when to increase this capacity as the cost for helping the open source community is fairly high.


FYI, it's i686-apple-darwin at the moment.


I have been taking inspiration from the Rust project and the way they go about things in general. In my workplace, we have a bot similar to bors, which does the merge request testing and is the gatekeeper.

Tools like bors act as a "force-multiplier".


Heads up that your Google fonts (in site.css) trigger a warning about insecure scripts, and in Chrome at least they therefore don't load by default.


I dream of the day we will have a GCC Rust compiler.


What would the benefits of that be?


Anyone care to enlighten me why fuzz testing and not mutation testing?


I don't get it, because you have just explained how to unit test everything else. Which is way better than nothing. Not sure how it is moot, also not sure how it is related to the article :)


Unit testing in a nutshell, for a business system:

Practically every business system consists of a view, the logic and the database, plus some web services and whatnot, but mainly it's those 3 parts (the so-called 3 layers).

Can you unit test the view layer? Not really. Can you unit test the database with real tests? Again, not really, and it's not always possible. So unit testing mainly focuses on the layer between the view and the database, and both of those, by their nature, can't be automatically tested.

I like the idea of unit testing but, for a business system, it's moot.



