
Scientist: Measure Twice, Cut Over Once - jesseplusplus
http://githubengineering.com/scientist/
======
nine_k
If you ever wondered why pure functions might be practical, this is an
example. For instance, you can run two pure (side-effect-free) functions in
parallel and compare their performance and results; this is what `science`
does.
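
The idea is small enough to sketch in plain Ruby. This is a simplified stand-in, not Scientist's actual API, and it runs the two behaviours sequentially rather than in parallel:

```ruby
# Minimal sketch of the "science" pattern: run the old (control) and
# new (candidate) implementations, compare results and timings, but
# always return the control's value so production behaviour never changes.
def science(name, control:, candidate:)
  t0 = Time.now
  control_result = control.call
  control_ms = (Time.now - t0) * 1000

  t1 = Time.now
  candidate_result =
    begin
      candidate.call
    rescue StandardError => e
      e # a crashing candidate must never take down production
    end
  candidate_ms = (Time.now - t1) * 1000

  if control_result != candidate_result
    warn "#{name}: mismatch (control=#{control_result.inspect}, " \
         "candidate=#{candidate_result.inspect})"
  end
  warn "#{name}: control #{control_ms.round(2)}ms, candidate #{candidate_ms.round(2)}ms"

  control_result # callers only ever see the control's answer
end

data = [1, 2, 3]
result = science("sum-refactor",
                 control:   -> { data.inject(0) { |s, x| s + x } },
                 candidate: -> { data.sum })
# result == 6 either way; any disagreement is only reported
```

Because both branches must be safe to run side by side, this only works cleanly when they are pure.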

------
vdm
Twitter's Diffy goes further: it first computes a diff between _two_ controls
to detect non-deterministic output (e.g. transaction IDs) as false positives,
which can then be omitted from the metrics, and then diffs this first diff
against the output of the candidate.

[https://blog.twitter.com/2015/diffy-testing-services-without-writing-tests](https://blog.twitter.com/2015/diffy-testing-services-without-writing-tests)
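
The noise-masking step might be sketched like this; the field names and the hash-based comparison are illustrative (Diffy itself diffs full service responses):

```ruby
# Sketch of Diffy's idea: diff two runs of the *control* to find
# nondeterministic fields (transaction IDs, timestamps), then exclude
# that noise when diffing the control against the candidate.
def differing_keys(a, b)
  (a.keys | b.keys).select { |k| a[k] != b[k] }
end

def diffy_compare(primary_control, secondary_control, candidate)
  noise = differing_keys(primary_control, secondary_control) # false positives
  real  = differing_keys(primary_control, candidate) - noise
  { noise: noise, real_differences: real }
end

primary   = { total: 10, txn_id: "abc", created_at: "12:00:01" }
secondary = { total: 10, txn_id: "def", created_at: "12:00:02" }
candidate = { total: 11, txn_id: "xyz", created_at: "12:00:03" }

diffy_compare(primary, secondary, candidate)
# => { noise: [:txn_id, :created_at], real_differences: [:total] }
```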

------
onalark
Awesome. I'm a huge fan of new tools that help improve the process of
refactoring existing code. This looks like a promising tool for Ruby
developers, and I'm always grateful when companies and their employees invest
the time and effort to release their tools to the community. I particularly
liked the point about "buggy data" as opposed to just buggy code; that's an
important distinction.

A few reactions from reading through the release:

Scientist appears to be largely limited to situations where the code has no
"side effects". I think this is a pretty big caveat, and it would have been
helpful in the introduction/summary to see this mentioned. Similarly, I think
it would be nice to point out that Scientist is a Ruby-only framework :)

You don't mention "regression test" at any point in the article, which is the
language I'm most familiar with for referring to this sort of testing. How
does a scientist "experiment" compare to a regression test over that block of
code?

Anyway, thanks again for writing this up, I'll be thinking more about the
Experiment testing pattern for my own projects.

~~~
gregmac
> Scientist appears to be largely limited to situations where the code has no
> "side effects".

That's one of the things I was initially thinking too, but as I thought about
where I could have used it in the past, I could think of only a few cases
where it wouldn't have been possible to keep the side effects isolated.

For example, have your new code running against a non-live data store.
Example: When a user changes permissions, the old code changes the live DB,
while the new code changes the temporary DB. Later (or continuously), you
could also compare the databases to ensure they're identical (easier if the
datastore remains constant, a bit harder if you are changing to a different
storage schema or product).
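
That shadow-store idea might look roughly like this; the class, method names, and hash-backed "databases" are all hypothetical stand-ins:

```ruby
# Sketch of the shadow-store approach: the old code writes to the live
# store, the new code writes to a shadow store, and a separate pass
# compares the two for drift. Hashes stand in for real databases.
class PermissionService
  def initialize(live_db, shadow_db)
    @live_db = live_db
    @shadow_db = shadow_db
  end

  def change_permissions(user, perms)
    old_change_permissions(@live_db, user, perms)   # authoritative
    new_change_permissions(@shadow_db, user, perms) # refactor under test
  end

  # Run later (or continuously) to find users whose records diverged.
  def drift
    @live_db.keys.reject { |user| @live_db[user] == @shadow_db[user] }
  end

  private

  # Stand-ins for the real implementations; they disagree on duplicate
  # permissions, which the drift check will surface.
  def old_change_permissions(db, user, perms)
    db[user] = perms.sort
  end

  def new_change_permissions(db, user, perms)
    db[user] = perms.uniq.sort
  end
end

svc = PermissionService.new({}, {})
svc.change_permissions("alice", [:read, :write])
svc.drift # => [] while the implementations agree
```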

Where it would be in the difficult-to-impossible range is when touching on
external services that don't have the ability to copy state and set up a
secondary test service, but even in that case, you could record the requests
made (or that would have been made) and ensure they both would have done the
same thing.

Definitely an interesting concept overall.

------
diocles
I've been thinking about the naming of this library, and I don't think
"science" is a good metaphor for what it does.

You can only test hypotheses of the form "A is exactly like B": no bug fixes
are allowed, because they will show up as differences.

So a more accurate (but less cool) name might be "Refactoring": you assert
that all your tests still pass, where your tests are your production data.

~~~
masklinn
> You can only test hypotheses of the form "A is exactly like B": no bug
> fixes are allowed, because they will show up as differences.

Of course they will, and that's a feature. The system doesn't intrinsically
know whether a change in behaviour is a bug introduced or a bug fixed; it can
only report that there's a difference in behaviour. It's your job to
investigate whether the control or the experiment is correct.

------
platz
This is also called a "test oracle" in property-based testing. The "oracle" is
your known-good implementation, and you test your subject function by
comparing its output against the oracle's. It works great with property tests
because you don't need to specify the inputs manually: just generate them and
check that the equality holds.
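
Without pulling in a property-testing library, the oracle pattern can be sketched with plain random generation; sorting stands in for a real subject function:

```ruby
# Test-oracle sketch: the known-good implementation is the oracle; we
# generate random inputs and require the subject to agree on all of them.
# A real property-testing library would also shrink counterexamples.
oracle  = ->(a) { a.sort }              # trusted reference implementation
subject = ->(a) { a.sort_by(&:itself) } # new implementation under test

rng = Random.new(42) # seeded for reproducibility
100.times do
  input = Array.new(rng.rand(0..20)) { rng.rand(-1000..1000) }
  actual, expected = subject.call(input), oracle.call(input)
  raise "counterexample: #{input.inspect}" unless actual == expected
end
```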

~~~
eru
Of course, property-based testing is richer, and also works without a
reference implementation. For example, you can test properties like
commutativity:

    f(x, y) = f(y, x)
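
A commutativity check like that can be written as a plain random-testing loop; the function `f` below is a made-up example:

```ruby
# Property check without any oracle: commutativity, f(x, y) == f(y, x).
f = ->(x, y) { (x + y).abs } # hypothetical function under test

rng = Random.new(7)
200.times do
  x, y = rng.rand(-100..100), rng.rand(-100..100)
  raise "not commutative at (#{x}, #{y})" unless f.call(x, y) == f.call(y, x)
end
```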

------
noobiemcfoob
Overall, I love this type of approach. We've begun doing something similar at
work as well.

However, I don't get the restriction on code with side effects.

Would it not be possible to introduce another abstraction layer around those
side effects to allow comparison between the old code's side effects and the
refactor's code side effects?

~~~
onalark
I don't think this would work, for a number of reasons. If it's a database
you're modifying, a lot of operations (increment, delete, etc.) will do the
wrong thing if they're called twice. And even if the operations themselves are
idempotent, you wouldn't be able to verify that the intended side effect was
correct. This is one reason developers spend a lot of time building mock
objects: to capture "side effects".
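
The double-execution hazard is easy to demonstrate with a toy counter standing in for a real datastore:

```ruby
# Why naively running both branches of an experiment is dangerous when
# there are side effects: the effect is applied twice.
counter = { visits: 0 }
increment = ->(db) { db[:visits] += 1 } # identical logic in both branches

increment.call(counter) # control
increment.call(counter) # candidate, run only for comparison
counter[:visits]        # => 2, although the user visited once
```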

~~~
lazaroclapp
How robust do you imagine it would be to just record the call / response pairs
of the mutable objects in the new code and then replay them when running the
experiment on the old code?

For example, suppose you have a db object and two versions of the code
new_code and old_code. You call something that looks like:

    experiment.run(new_code, old_code, mutables: [db])

Then the infrastructure runs new_code normally, but records the arguments and
return value of every call to db (and any other object defined as mutable).
Then, the infrastructure runs old_code, but whenever a method of db is called,
it tries to match it with a call made by new_code and just directly returns
the return value that call returned. If it can't match the call, it signals an
error, but it never actually tries to call db, thus negating the risk of side
effects.

Obviously this would fail when the two versions perform semantically
equivalent but non-identical operations on the database (say, one retrieves a
value and increments it inside a transaction, while the other uses a stored
procedure in the db to increment without fetching). But it still relaxes the
constraint: now you can run the experiment both on code that has no side
effects and on code that has the exact same side effects, as represented by
exact call/return pairs to mutable objects.
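
A rough sketch of that record/replay idea (all names hypothetical; a real version would need argument normalization and ordering rules):

```ruby
# Record/replay sketch: the new code runs against a recording proxy for
# each mutable object; the old code then runs against a replaying proxy
# that never touches the real object and raises on any unmatched call.
class RecordingProxy
  attr_reader :log

  def initialize(target)
    @target = target
    @log = []
  end

  def method_missing(name, *args, &blk)
    result = @target.public_send(name, *args, &blk)
    @log << [[name, args], result]
    result
  end

  def respond_to_missing?(name, include_private = false)
    @target.respond_to?(name, include_private)
  end
end

class ReplayProxy
  def initialize(log)
    @log = log.dup
  end

  def method_missing(name, *args)
    idx = @log.index { |(call, _)| call == [name, args] }
    raise "unmatched call: #{name}(#{args.inspect})" unless idx
    @log.delete_at(idx).last # recorded result; no side effect performed
  end

  def respond_to_missing?(*)
    true
  end
end

db = { "alice" => 1 }
recorder = RecordingProxy.new(db)
new_result = recorder.fetch("alice") # real call, recorded

replayer = ReplayProxy.new(recorder.log)
old_result = replayer.fetch("alice") # replayed; db never touched
```

As noted above, this matches only exact call/argument pairs, so semantically equivalent but non-identical call sequences would fail to match.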

~~~
masklinn
> How robust do you imagine it would be to just record the call / response
> pairs of the mutable objects in the new code and then replay them when
> running the experiment on the old code? […] Obviously this would fail when
> the two versions perform different operations in the database, even
> semantically equivalent

I'd think the latter would be the common expectation by virtue of a different
implementation.

------
gonyea
Wow. We are literally working on the same problem described in this post.

Looks like a great tool! We'll give it a spin.

------
vinceguidry
Sadly, the only project I can think to use this on is still on Ruby 1.8.7,
whereas the gem only works on 1.9+.

------
kelseydh
I wonder if there could be a way to abstract this for testing gem version
upgrades.

~~~
heynk
At first I thought tests would cover this, but it would be pretty cool to
compare performance across gem versions. Unfortunately, you can't really load
two versions of the same gem in the same runtime, so you'd probably fall back
on other benchmarking methods.

------
pdm55
Will you run into copyright issues with that name?
[https://www.micromath.com/](https://www.micromath.com/) have a product with
the same name.

