
Testing Firefox more efficiently with machine learning - srirangr
https://hacks.mozilla.org/2020/07/testing-firefox-more-efficiently-with-machine-learning/
======
cmehdy
This is the kind of article I'd love to read more of, as in more of each bit!
It allowed me to discover the very well-made docs for contributing to
Firefox[0], which feel very welcoming to an enthusiastic non-genius-expert
engineer who happens to have some experience with CI, testing automation, and
a couple of languages.

I assume the overhead of the project (and subsequent tweaks to the model,
retraining, and validation) is sufficiently negligible compared to the
measured benefits, even if those weren't as clear-cut as 70%. I'm unaware of
how much compute is required for the task, but it's likely less than many
compute-years per day :)

One thing I did not notice in the approach to modeling the problem is any
link/tag regarding the platform for which the code changes are made, or the
programming languages used. There seems to be some evidence that certain
languages could lead to more defect-fixing commits[1], and I don't know if
there's evidence that some platforms are more prone to bugs (I'm sure wars of
words have been fought over this). But would it make sense to have that sort
of information inform the model in some way? I fully understand that I might
be out of my depth here.

[0] [https://firefox-source-docs.mozilla.org/setup/index.html](https://firefox-source-docs.mozilla.org/setup/index.html)

[1] [https://cacm.acm.org/magazines/2017/10/221326-a-large-scale-...](https://cacm.acm.org/magazines/2017/10/221326-a-large-scale-study-of-programming-languages-and-code-quality-in-github/fulltext)

~~~
IAmEveryone
Looking at the code
([https://github.com/mozilla/bugbug/blob/master/bugbug/model.p...](https://github.com/mozilla/bugbug/blob/master/bugbug/model.py#L184)),
they test for any significant words in the code, comments, commit message,
tags, etc.

I wouldn't be surprised if language is explicitly included as, for example, a
flag in the "data" object. But otherwise the model should be able to figure it
out by itself, by identifying keywords that only some languages (often) use.
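
As a rough illustration of how that could happen (a minimal sketch, not
bugbug's actual feature code; the keyword lists are invented):

    # a minimal sketch, not bugbug's pipeline: one boolean feature per
    # language-indicative keyword, from which a model like XGBoost can
    # recover language identity implicitly
    LANGUAGE_HINTS = {
        "cpp": ["#include", "nullptr", "std::"],
        "rust": ["fn ", "let mut", "::<"],
        "js": ["const ", "=>", "async function"],
    }

    def keyword_features(patch_text):
        return {
            f"{lang}_{kw}": kw in patch_text
            for lang, kws in LANGUAGE_HINTS.items()
            for kw in kws
        }

    print(keyword_features("let mut x = foo::<u32>();"))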

------
hohenheim
Fantastic read. My only concern is that there wasn't any discussion of the
cost of false positives (selecting a test to run where it is unnecessary)
versus false negatives (incorrectly dismissing a relevant test), as those
costs are not symmetrical in their effects.

The cost of a bug slipping through because a test was skipped will be higher
than the cost of running a test that is irrelevant to a commit.
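
To put rough numbers on that asymmetry (all figures invented, not from the
article):

    # toy illustration -- every number here is invented
    cost_useless_test_run = 0.01      # compute wasted on an irrelevant test
    cost_escaped_regression = 500.0   # engineer time to find and fix it later

    # break-even miss probability per skipped run; a scheduler judged only
    # on compute saved ignores the escaped-regression term entirely
    print(cost_useless_test_run / cost_escaped_regression)  # 2e-05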

~~~
jeffbee
There isn't any discussion of the cost at all. It just says the test run rate
is down by 70%; it doesn't say anything about the defect detection rate, even
though they say that is their cost function.

10 core-years per day sounds like a lot but it's only about a 10kW load, and
they've saved 70% of that, or about $20 of opex per day.

~~~
dmurray
Is that really all? That would be 3650 cores running full time. 3W per core
sounds too little for power consumption. And do power costs really dominate
the price of running CPUs? I'm guessing the savings here are at least one
order of magnitude more than your $20/day.

I get about $1000/day based on some EC2 prices for typical machines I've used,
though I'm sure Mozilla's requirements are different and they can negotiate
better prices than I can.
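
For reference, a back-of-envelope version of both estimates (the power price
and the per-vCPU rate are assumptions, not figures from the article):

    # back-of-envelope check of both estimates above
    cores = 10 * 365                     # 10 core-years/day ~= 3650 cores, 24/7

    kwh_per_day = cores * 3 * 24 / 1000  # at 3 W/core: ~263 kWh/day
    print(kwh_per_day * 0.08)            # ~$21/day at an assumed $0.08/kWh

    rate = 0.012                         # assumed discounted per-vCPU-hour rate;
                                         # on-demand is several times higher
    print(cores * rate * 24)             # ~$1,050/day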

~~~
jeffbee
I probably missed a few factors, but I just hate a blog post that uses big-
sounding numbers when they aren't big.

~~~
bonoboTP
Big for who? Hundreds of machines running constantly is big for me.

------
pesenti
Similar work done at Facebook: [https://engineering.fb.com/developer-tools/predictive-test-s...](https://engineering.fb.com/developer-tools/predictive-test-selection/)

------
ackbar03
I always thought software/GUI testing would be a great application for AI,
although I've never really sat down to think about how it could be done.

~~~
kyawzazaw
Check this one out: [https://mesmerhq.com/](https://mesmerhq.com/)

It's for mobile apps mostly though.

------
srinivasupadhya
Similar work at Google:
[https://www.google.com/url?sa=t&source=web&rct=j&url=https:/...](https://www.google.com/url?sa=t&source=web&rct=j&url=https://research.google.com/pubs/archive/45861.pdf&ved=2ahUKEwjq-ay_2sXqAhWSzDgGHZ6mB8IQFjAHegQIBhAB&usg=AOvVaw1XsFJdUcbLPk1oFl9HxWtD&cshid=1594488052487)

~~~
just-ok
URL without redirects & tracking IDs:
[https://research.google.com/pubs/archive/45861.pdf](https://research.google.com/pubs/archive/45861.pdf)

------
Tarq0n
Interesting. So for training they use features:

> In the past, how often did this test fail when the same files were touched?

> How far in the directory tree are the source files from the test files?

> How often in the VCS history were the source files modified together with
> the test files?

But for prediction all they input is a tuple (TEST, PATCH), and XGBoost works
fine without the additional features?

~~~
dmurray
I think they're deriving the additional features at prediction time. The test
and patch don't contain all the information you need to compute the features,
but they contain sufficient information when combined with a big static lookup
table. At least that's the way I read it; agree it could be clearer.
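
Something along those lines seems plausible (a minimal sketch with a
hypothetical data model, not Mozilla's code):

    import os

    # hypothetical data model: the lookup tables are precomputed from VCS
    # history, so (test, patch) is enough input at prediction time
    def derive_features(test_path, patch_files, fail_counts, cochange_counts):
        # fail_counts[(src, test)]: past failures of test when src was touched
        # cochange_counts[(src, test)]: commits modifying both files together
        failures = sum(fail_counts.get((f, test_path), 0) for f in patch_files)
        cochanges = sum(cochange_counts.get((f, test_path), 0) for f in patch_files)
        # directory-tree distance from each source file to the test file
        distance = min(
            os.path.relpath(f, os.path.dirname(test_path)).count(os.sep)
            for f in patch_files
        )
        return [failures, distance, cochanges]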

------
sillysaurusx
The most interesting part of this to me was something tangential: they use RQ
(Redis Queue). Anyone have experience with this? Good or bad impressions?

The documentation is tantalizing, but hilariously short:
[https://devcenter.heroku.com/articles/python-rq](https://devcenter.heroku.com/articles/python-rq)

Very "And then draw the rest of the owl." Oh really, you can just do `from
utils import count_words_at_url; q.enqueue(count_words_at_url,
'http://heroku.com')` and presto, your blocking function -- whose source code
exists locally -- is run successfully at the other end?

I'll have to set aside some time to try this out. Python _does_ have
introspection facilities that could make that possible. I could imagine that
since the code is executed on the same box, it's relatively simple to send a
request like "here's which module the function was loaded from; here's the
order all modules were loaded in; load those modules and call this function."
But it leaves so many questions: serialization, performance, scaling, and all
the tiny bugs that inevitably come up.
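
For what it's worth, my understanding is that RQ stores the function's dotted
import path plus pickled arguments in Redis, and the worker (running the same
codebase) imports the module and calls the function. A minimal sketch of the
enqueue side, assuming `count_words_at_url` lives in an importable `utils.py`:

    from redis import Redis
    from rq import Queue

    from utils import count_words_at_url  # must be importable on the worker too

    q = Queue(connection=Redis())
    # rq stores "utils.count_words_at_url" plus pickled args in redis;
    # a worker process imports that module and calls the function
    job = q.enqueue(count_words_at_url, 'http://heroku.com')
    print(job.id)  # the return value appears on the job object once done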

I guess I was hoping someone could give me a quick gut check of
positive/negative reactions. The full RQ documentation is slightly better:
[https://python-rq.org/docs/](https://python-rq.org/docs/) but has some
worrying signs:

 _Make sure that the function call does not depend on its context. In
particular, global variables are evil (as always), but also any state that the
function depends on (for example a “current” user or “current” web request) is
not there when the worker will process it. If you want work done for the
“current” user, you should resolve that user to a concrete instance and pass a
reference to that user object to the job as an argument._

Yes, sure, global variables are the root of satan, but they're also a fact of
life in many scenarios.

Interesting approach... I wonder how much of a nightmare it makes devops...

~~~
kkaranth
I've used it a bit in production. Our use case avoided a lot of the potential
issues you mentioned, so this may not be entirely helpful:

* serialization: the input was passed as a JSON string argument. The output
was a file uploaded to S3, so just the URL was returned, again as a JSON
string.

* global variables: the program was quite self-contained: there was an
initial state setup that was not mutated afterwards, so RQ's fork-exec model
(the default) worked well enough.

Sorry, I don't have much to say about performance and scaling. It was quite
fine for our needs, and we could scale horizontally up to a certain point by
just starting extra processes, and beyond that with more VMs. Since they all
listened on the same queue, it worked fine. (The number of items in our queue
never really hit any of Redis' limitations either.)

RQ lets you customize the worker model, so you could, for instance, use
threads instead of processes.
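
For example, a minimal sketch using RQ's built-in SimpleWorker, which runs
jobs in the worker process instead of forking:

    from redis import Redis
    from rq import Queue, SimpleWorker

    queue = Queue(connection=Redis())
    # SimpleWorker executes jobs in-process rather than fork-exec
    worker = SimpleWorker([queue], connection=queue.connection)
    worker.work(burst=True)  # drain pending jobs, then exit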

Regarding monitoring: there's RQ dashboard[1] which gives a nice web interface
to view jobs, failures, and restart them.

1: [https://python-rq.org/docs/monitoring/](https://python-rq.org/docs/monitoring/)

~~~
sillysaurusx
Thank you very much for the detailed response :) I really appreciate the
thoroughness; it convinced me to use RQ.

~~~
sillysaurusx
+1 datapoint in favor of rq:
[https://twitter.com/sdan_io/status/1285687026386444288](https://twitter.com/sdan_io/status/1285687026386444288)
was built with it.

------
gorgoiler
Stability and speed from Firefox are always welcome. I’d love to see some
performance gains on armhf (raspberry pi 4) in particular. It’s good, and
close to being blissfully simple.

~~~
65536
Wouldn’t you run arm64 Debian rather than armhf Debian on the RPi 4? I haven’t
used the RPi 4 in a long while so I don’t remember. But it seems weird to me
that what you said would be the case.

[https://stackoverflow.com/a/48954012](https://stackoverflow.com/a/48954012)

[https://www.debian.org/ports/#portlist-released](https://www.debian.org/ports/#portlist-released)

~~~
gorgoiler
32-bit armhf is still the "supported" way of running Raspberry Pi OS. I
believe the 64-bit build is on the horizon, but it's still considered beta.

If the consensus is that it's stable enough, though, I'll give it a go.

------
data_ders
really cool project. nice high-level overview of all the components. However,
I still don't understand the impact measurement -- how do you measure the
impact of this against the baseline? I didn't get that part in the
effectiveness section. Maybe I'm too newb -- but you could A/B test this,
right? 50% of PRs are subjected to automated tooling, 50% manual and compare
compute cost and failures b/w the two?

~~~
sfink
That's what the shadow scheduler is measuring. If you run a superset of the
AI-scheduled set, you can compute how well the AI is doing. Even if you don't
run a superset, you can infer the results from subsequent test runs (on a
tree with the changeset in question, plus a few more, applied); you just have
to be careful not to blame later breakage on your changeset.
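
In rough pseudocode (a minimal sketch with an invented data model, not the
actual shadow-scheduler code):

    # estimate the fraction of real regressions the ML scheduler would
    # have caught, given a shadow run that executed a superset of the
    # ML-selected tests (data model invented for illustration)
    def scheduler_recall(selected, failed):
        """selected: tests the ML scheduler picked for a push.
        failed: tests that failed in the full shadow run, after filtering
        out failures attributable to later changesets."""
        if not failed:
            return 1.0
        return len(failed & selected) / len(failed)

    # e.g. 3 of 4 real failures were in the selected set -> 0.75
    print(scheduler_recall({"t1", "t2", "t3"}, {"t1", "t2", "t3", "t4"}))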

------
Diane09974
yesss

