
Best Practices for Testing in Java (2019) - moks
https://phauer.com/2019/modern-best-practices-testing-java/
======
closeparen
>If this test fails, it’s hard to see what exactly is broken.

Breaking existing tests is rare. I don't mind some detective work when it
happens. You start with a pretty strong lead, anyway: whatever you touched
since the last time the test passed. In my workflow this is unlikely to be
more than five minutes ago. Half the time I know what I broke without even
looking.

I can see how specificity in test results would be important if you're running
tests very infrequently and accumulating a lot of change between runs. But in
that case I would probably invest in faster feedback before I'd invest in
specificity. Huge, wordy test suites are expensive and can ossify bad design
much more strongly than they preserve correct behavior.

~~~
ThePadawan
> Breaking existing tests is rare. I don't mind some detective work when it
> happens. You start with a pretty strong lead, anyway: whatever you touched
> since the last time the test passed.

You describe the exact situation where the advice is not helpful at all.

Now imagine a large, very tightly interwoven codebase where in fact, making
changes to a section of code breaks tests on the complete other side of the
application.

If you don't follow that advice, you are immediately blocked, and pretty much
at the mercy of someone else fixing their test (probably by making it less
needlessly specific, or by asserting less after the requirements changed).

If they structured their test in a way that makes it clear why it should fail,
you have the chance to investigate for yourself how it conflicts with your
changes.

~~~
throwaway_pdp09
> very tightly interwoven

Perhaps this is your problem.

~~~
ThePadawan
Of course. But you can't fix that by testing; you can only improve how you cope with it :)

------
lmilcin
Hardly "best" practices. "Best" would suggest there are no better practices.

I am fed up with unit testing. There is no feedback to tell you whether your
unit testing setup is complete or whether your unit tests are comprehensive.

I am more and more turning to functional regression testing, where the entire
service is tested as a black box against the spec.

Since we are testing the API, it tends to be specified much better than
individual components; there is usually already some kind of documentation.
The API also changes much less than individual components, which means that if
I start some kind of refactoring I may change a lot of internals without
touching the API at all. If I do touch the API, then I need to update the
documentation and inform users anyway, so writing tests for it is not a huge
additional cost.

~~~
chii
> unit testing setup is complete

Coverage reports. They tell you which branches weren't executed in your tests,
which can tell you what may be missing in your unit testing.

> more turning to functional regression testing when entire service is tested
> as a black box against spec.

Functional/regression testing is important, but it can be difficult to use if
your app is huge. You cannot easily isolate an area to test, so you always end
up exercising common components repeatedly, which makes the suite slower to
run.

Not to mention some functional testing mechanisms (such as using a web
browser) can be flaky.

Both have their place in the automated testing paradigm.

~~~
2rsf
Coverage reports are misleading, though: 100% coverage means that you passed
through every line of code, and that's it.

You don't know whether you covered the edge cases, it cannot capture
concurrency well, and it will not tell you whether you exercised a range of
values or the long paths through your code.

~~~
mumblemumble
> every line of code, that's it

Yes. And it's worth noting what's left out of that. Passing through every line
of code is _not_ the same thing as following all paths through the code.

Here's a sampling of code features where that distinction is particularly
important:

    
    
      - successive if-statements
      - switch statements with fallthrough
      - loops
      - mutable variables with a large influence
    

The first two are because {{a}, {b}} ≠ {{a, b}}.
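
To make the first point concrete, here is a minimal sketch (not from the
comment; `describe` is a made-up method): two tests execute every line, yet
never exercise the path in which both ifs are skipped.

    
    
        class PathCoverageExample {
        
          static int describe(int x, int y) {
            int result = 0;
            if (x > 0) {         // branch A
              result += 1;
            }
            if (y > 0) {         // branch B
              result += 2;
            }
            return 100 / result; // fails only when both branches were skipped
          }
        
          public static void main(String[] args) {
            System.out.println(describe(1, -1)); // executes the body of the first if
            System.out.println(describe(-1, 1)); // executes the body of the second if
            // Line coverage is now 100%, yet describe(-1, -1) still throws
            // ArithmeticException: the {skip A, skip B} path was never tested.
          }
        }
    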

By that last one, I mean both mutable variables with a large scope, and
mutable variables within objects that have a large scope. That's probably the
trickiest thing, because, in the presence of mutability like that, it's not
just which lines of code were executed that matters, it's also the order in
which they were executed.

That doesn't strike me as a pretty picture. Overall, it implies that automated
code coverage metrics are the least useful in the kinds of codebases where
they are most desirable.

------
dlandis
> "Use Fixed Data Instead of Randomized Data Avoid randomized data as it can
> lead to toggling tests which can be hard to debug and omit error messages
> that make tracing the error back to the code harder. They will create highly
> reproducible tests, which are easy to debug and create error messages that
> can be easily traced back to the relevant line of code."

Good article, but I don't get this one at all; it almost seems like an anti-
pattern. Choosing fixed data instead of random because the results are more
"reproducible" seems to miss the point. If random data eventually helps
uncover more bugs, then it's worth using!

~~~
slavik81
Tests that use random input data are much more difficult to write correctly.
Your test needs to know the expected result not just for one case, but for
every possible case. That will vastly increase the number of bugs in your test
code, leading to a seemingly endless stream of false-positives.

The worst part is that the feedback is late. The test will randomly fail long
after it was written. There's a lot of needless overhead in relearning the
context of the failing code just so you can fix a test that is not general
enough for the data it was provided.

There are ways to effectively use randomly-generated test data, but it's
harder than you'd think to do it right.
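
One common mitigation (a minimal sketch; the generator and the property here
are hypothetical) is to derive the random data from an explicit seed and
report that seed on failure, so a "random" failure can be replayed exactly:

    
    
        import java.util.Random;
        
        class RandomizedEmailCheck {
        
          public static void main(String[] args) {
            long seed = System.nanoTime(); // paste a reported seed here to replay a failure
            Random rng = new Random(seed);
            for (int i = 0; i < 100; i++) {
              String email = randomEmail(rng);
              if (!hasSingleAtAndADot(email)) { // hypothetical property under test
                throw new AssertionError("Failed for " + email + " (seed=" + seed + ")");
              }
            }
          }
        
          // Hypothetical generator: lowercase ASCII words joined as local@domain.tld.
          static String randomEmail(Random rng) {
            return word(rng) + "@" + word(rng) + "." + word(rng);
          }
        
          static String word(Random rng) {
            StringBuilder sb = new StringBuilder();
            int len = 1 + rng.nextInt(10);
            for (int i = 0; i < len; i++) {
              sb.append((char) ('a' + rng.nextInt(26)));
            }
            return sb.toString();
          }
        
          // Hypothetical property; stands in for whatever invariant the real test checks.
          static boolean hasSingleAtAndADot(String email) {
            return email.indexOf('@') == email.lastIndexOf('@') && email.contains(".");
          }
        }
    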

~~~
chriswarbo
Tests with random inputs can be much easier to write. For example here's a
test for a single hard-coded example:

    
    
        public boolean canSetEmail() {
          User u = new User(
            "John",
            "Smith",
            new Date(2000, Calendar.JANUARY, 1),
            new Email("john@example.com"),
            Password.hash("password123", new Password.Salt("abc"))
          );
          Email newEmail = new Email("smith@example.com");
          u.setEmail(newEmail);
          return u.getEmail().equals(newEmail);
        }
    

Phew! Here's a randomised alternative:

    
    
        public boolean canSetEmail(User u, Email newEmail) {
          u.setEmail(newEmail);
          return u.getEmail().equals(newEmail);
        }
    

Not only is the test logic simpler and clearer, it's much more general. As an
added bonus, when we write the data generators for User, Email, etc. we can
include a whole load of nasty edge cases and they'll be used by _all_ of our
tests. I've not used Java for a while, but in Scalacheck I'd do something like
this:

    
    
        def genEmail = {
          val genStr = arbitrary[String]
            .map(_.replace("@", ""))
            .retryUntil(_.nonEmpty)
    
          for {
            user   <- genStr
            domain <- genStr
            tlds   <- Gen.listOf(genStr)
          } yield (user + "@" + (domain :: tlds).mkString("."))
        }
    

Even using the default String generators, this still gives pretty good
examples, e.g.

    
    
        scala> genEmail.sample
        res1: Option[String] = Some(뀩貲㳱誷Ⓣ壟獝鈂ᗘ䮕鹍尛斿績線孞왁궽偔ቫ﯃쥑瞴䒨䏵艥꿊狿냩쩈簋㷡泪伺鑬鮘䦚벛妹乘饡㙧ꐤ큫ᯕ肳铬呢ூ靪୒틯鏱滄㧄Ｐ莶寕상䐀鹭ᚿ㌤ᇹ䶂攔ꅥ㶓⽌ꅂ뀏뛶盏⍸㇗ɧ⇽줓嵛@퇐䎁鱎馉琿䏍㔰蘿⠶㘮큵휕炠᭑㯠ꇷ氕瑤镦碋䓯鄓헛㥝籆찷舊⸦䁜⯞௵籋㟨㨴鯅鿸刡㽴ﮈ耾刏碓辁亁ᦨ氦ব꧓ꌝ飖䒱찮刍ฤ﯉⯛ஃ颰溯쨚ই媐䄇延醴熜䢐필㉕徫澬폁횱坠줊埬າ⟂.὆蹵箅䀈떥喹뭡㘺ꐟ⯉쉂㯻牏䢺梘쭸칹总⻎恵꣥ﴈ宎嶲ꎮ쌊䯧挤ꓯ⻯ꢠ鏿㰔操駚樅졇੩풻殜ᶝ팱.瘽៣뿔뱡䞿愝滠蛊妏뷩먰⦹挸긖᩿㣽ﮀ嫶ꚸ㗃鈴䇡⋥梏ڝ晡烡.줼欺ࠞ娫)
    

The other advantage is that automated shrinking does a pretty good job of
homing in on bugs. For example, if our test breaks when there's no top-level
domain (e.g. 'root@localhost') then (a) that will be found pretty quickly,
since 'tlds' will begin with a high probability of being empty and (b) the
other components will be shrunk as far as possible, e.g. 'user' and 'domain'
will shrink down to a single null byte (the "smallest" String which satisfies
our 'nonEmpty' test); hence we'll be told that this test fails for \0@\0 (the
simplest counterexample). We'll also be given the random seeds used by the
generators.

~~~
slavik81
That generator function illustrates exactly the problem I'm talking about. The
maximum length of a string in Java is 2^31-1 chars (UTF-16 code units). If
user is an 'arbitrary string', then it could be 2^31-1 chars long. If domain
is also an arbitrary string, then it can also be 2^31-1 chars long. When you
concatenate them and exceed the maximum string length, you will cause a
failure in the test code.

There are almost always constraints within the test data, but they're complex
to properly express, so they aren't specified. Then one day, the generator
violates those unstated constraints, causing the test to fail.

~~~
chriswarbo
> one day, the generator violates those unstated constraints, causing the test
> to fail

Good, that's exactly the sort of assumption I'd like to have exposed. As a
bonus, we only need to fix this in the generators, and all the tests will
benefit. I've hit exactly this sort of issue with overflow before, where I
made the mistaken assumption that 'n.abs' would be non-negative.

In this case Scalacheck will actually start off generating small/empty
strings, and try longer and longer strings up to length 100.

This is because 'arbitrary[String]' uses 'Gen.stringOf':

[https://github.com/typelevel/scalacheck/blob/master/src/main...](https://github.com/typelevel/scalacheck/blob/master/src/main/scala/org/scalacheck/Arbitrary.scala#L148)

'Gen.stringOf' uses the generator's current "size" parameter for the length:

[https://github.com/typelevel/scalacheck/blob/master/src/main...](https://github.com/typelevel/scalacheck/blob/master/src/main/scala/org/scalacheck/Gen.scala#L913)

[https://github.com/typelevel/scalacheck/blob/master/src/main...](https://github.com/typelevel/scalacheck/blob/master/src/main/scala/org/scalacheck/Gen.scala#L1254)

The "size" of a generator starts at 'minSize' and grows to 'maxSize' as tests
are performed (this ensures we check "small" values first, although generators
are free to ignore the size if they like):

[https://github.com/typelevel/scalacheck/blob/master/src/main...](https://github.com/typelevel/scalacheck/blob/master/src/main/scala/org/scalacheck/Gen.scala#L275)

We can set these manually, e.g. via a config file or commandline args, but the
default is minSize = 0

[https://github.com/typelevel/scalacheck/blob/master/src/main...](https://github.com/typelevel/scalacheck/blob/master/src/main/scala/org/scalacheck/Test.scala#L198)

and maxSize = 100

[https://github.com/typelevel/scalacheck/blob/master/src/main...](https://github.com/typelevel/scalacheck/blob/master/src/main/scala/org/scalacheck/Test.scala#L199)

[https://github.com/typelevel/scalacheck/blob/master/src/main...](https://github.com/typelevel/scalacheck/blob/master/src/main/scala/org/scalacheck/Gen.scala#L338)

------
bvanderveen
Great article. The only thing I would add is that anything other than
`assertEquals(expectedValue, actualValue)` is not really permissible. Don't
make assertions about a 'portion' or 'subset' of your results; make assertions
about the entire value! Otherwise, you're throwing away an opportunity to
encode a constraint; sooner or later the undefined behavior will make itself
known. We are writing tests to avoid that scenario in the first place, no?
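
For illustration, a minimal sketch of asserting on the entire value (the
Address record and normalise method are hypothetical stand-ins):

    
    
        import static org.junit.jupiter.api.Assertions.assertEquals;
        
        import org.junit.jupiter.api.Test;
        
        class WholeValueAssertionTest {
        
          // Hypothetical value type; equals/hashCode come from the record.
          record Address(String street, String city, String zip) {}
        
          // Hypothetical code under test.
          Address normalise(Address a) {
            return new Address(a.street().trim(), a.city().trim(), a.zip().trim());
          }
        
          @Test
          void normalisesTheWholeAddress() {
            // Assert against the complete expected value, not just one field of it.
            assertEquals(new Address("1 Main St", "Springfield", "12345"),
                         normalise(new Address(" 1 Main St ", " Springfield ", " 12345 ")));
          }
        }
    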

~~~
joshka
Partial disagreement on the phrasing of this - there are many classes of
assertion, like `assertThat(expected).containsInAnyOrder(foo,bar,baz);`, that
would not meet this bar. Order often does not matter and cannot be
constrained. There are other, similar cases.

~~~
bvanderveen
Value should probably be a `Set` then!

~~~
dodobirdlord
What if it's a repeated field of a protocol buffer? Parse it into a set and
then check it? That's exactly what
`assertThat(expected).containsInAnyOrder(foo,bar,baz);` is doing anyway.

~~~
bvanderveen
I'm not very familiar with protobuf, but from searching, a repeated field seems
to have list- or set-like semantics, so yeah, I'd parse the value into a list
or a set and compare it to a list or a set.
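
A minimal sketch of that approach using plain `assertEquals` (tagsFor is a
hypothetical stand-in for the code under test):

    
    
        import static org.junit.jupiter.api.Assertions.assertEquals;
        
        import java.util.List;
        import java.util.Set;
        import org.junit.jupiter.api.Test;
        
        class TagAssertionTest {
        
          // Hypothetical code under test: returns results in no particular order.
          List<String> tagsFor(String user) {
            return List.of("editor", "admin", "viewer");
          }
        
          @Test
          void comparesTheWholeValueIgnoringOrder() {
            // Normalise the order-insensitive result into a Set, then assert
            // equality against the complete expected value.
            assertEquals(Set.of("admin", "editor", "viewer"),
                         Set.copyOf(tagsFor("alice")));
          }
        }
    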

~~~
dodobirdlord
If you were going to work with protocol buffers frequently you'd probably
quickly find yourself doing this over and over, and would then create the
aforementioned `assertThat(expected).containsInAnyOrder(foo,bar,baz);`.

------
sabujp
Java testing at Google:

    
    
        * Don't mock if you have access to the remote server code; create fakes using calls to the class methods being tested, or use an in-memory version of the entire thing if possible (a sketch follows below).
    

I know it's not easy and it takes time, but it's worth it in the long run. If
something changes you'll detect errors quickly.
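
A minimal sketch of such a fake (all names hypothetical): an in-memory
implementation of the dependency's interface with real, if simplified,
behaviour that many tests can share.

    
    
        import java.util.HashMap;
        import java.util.Map;
        import java.util.Optional;
        
        // Hypothetical interface the production code depends on.
        interface UserStore {
          void save(String id, String email);
          Optional<String> findEmail(String id);
        }
        
        // The fake: no network, no mock setup, reusable across many tests.
        class InMemoryUserStore implements UserStore {
          private final Map<String, String> data = new HashMap<>();
        
          @Override
          public void save(String id, String email) {
            data.put(id, email);
          }
        
          @Override
          public Optional<String> findEmail(String id) {
            return Optional.ofNullable(data.get(id));
          }
        }
    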

    
    
        * When using Guice injections, explicitly override what you need directly in the test setup (a sketch follows below). Don't create huge test modules with everything and the kitchen sink; eventually these become unmaintainable and make it difficult for future maintainers to understand where overrides and injections are coming from.
    

Integration testing is a whole other subject, but I would highly recommend it
for mission-critical production code.
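
Going back to the Guice bullet above, a rough sketch of layering a small test
override on top of the production module (the module and the Clock binding are
hypothetical):

    
    
        import com.google.inject.AbstractModule;
        import com.google.inject.Guice;
        import com.google.inject.Injector;
        import com.google.inject.util.Modules;
        import java.time.Clock;
        import java.time.Instant;
        import java.time.ZoneOffset;
        
        // Hypothetical production module.
        class ProductionModule extends AbstractModule {
          @Override
          protected void configure() {
            bind(Clock.class).toInstance(Clock.systemUTC());
          }
        }
        
        class FixedClockTestSetup {
          // Override only the binding this test cares about.
          static Injector testInjector() {
            Clock fixed = Clock.fixed(Instant.EPOCH, ZoneOffset.UTC);
            return Guice.createInjector(
                Modules.override(new ProductionModule()).with(new AbstractModule() {
                  @Override
                  protected void configure() {
                    bind(Clock.class).toInstance(fixed);
                  }
                }));
          }
        }
    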

------
r0s
Good stuff. I especially appreciate the entreaty to limit assertions and
corner cases to one test concern.

The only thing I didn't like was the Given-When-Then recommendation. BDD-style
story language is really meant for key specification examples, which can be
automated in functional tests, but it's not really appropriate for unit
testing.

~~~
zoover2020
Why not BDD style in unit tests, though? Our team adopted it and I really like
it over AAA (arrange, act, assert).
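
For what it's worth, the two styles map onto each other almost one-to-one; a
minimal sketch with a hypothetical bit of discount logic:

    
    
        import static org.junit.jupiter.api.Assertions.assertEquals;
        
        import org.junit.jupiter.api.Test;
        
        class DiscountTest {
        
          // Hypothetical code under test.
          int discountedPrice(int price, boolean loyalCustomer) {
            return loyalCustomer ? price * 90 / 100 : price;
          }
        
          @Test
          void loyalCustomersGetTenPercentOff() {
            // given (BDD) / arrange (AAA)
            int price = 200;
        
            // when (BDD) / act (AAA)
            int discounted = discountedPrice(price, true);
        
            // then (BDD) / assert (AAA)
            assertEquals(180, discounted);
          }
        }
    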

~~~
lmm
I've found it just obscures what the test is doing. If I'm looking at a test
that failed, I want to see the code that caused the failure, as plainly as
possible; I'm quite capable of reading code (indeed I find code is better at
communicating clearly than English is), so I'd rather see a minimum of
ceremony than some misguided effort at making tests more readable for
imaginary business users.

------
Nihilartikel
I'm pretty lazy, but I like testing..

The habit that I've repeated on my last few projects is this:

Work out a way to gracefully serialize/deserialize the data structures in your
code into a human-readable format, like EDN or pretty-printed JSON. (This is
easier in more civilized languages... -winks at Clojure-)

Pick high level functions that exercise a lot of code for testing.

Generate input data to exercise that function either by hand if it's small and
tractable, or by instrumenting real runs of the program and outputting
serialized copies of that data (to file or console).

Write test helpers that call the function with the input data, and depending
on an environment variable, read the expected 'golden' data from disk, or if
UPDATE_TEST_RESULTS=true, write the function output back to disk with a
filename unique to the test, and mark the test as passed. If the test fails,
print a nice readable, structural diff of the expected/actual data - e.g.
python datadiff, or clojure.data/diff so the exact change is visible.

On the first run you set UPDATE_TEST_RESULTS=true and generate the expected
test results. Check your 'golden' result data into git.

Now you can easily keep expected results up to date even if logic changes, and
get a git diff of exactly how the data has changed.
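
A minimal sketch of such a helper in Java (the path and names are
hypothetical; serialization and diffing are kept trivial here):

    
    
        import java.io.IOException;
        import java.nio.file.Files;
        import java.nio.file.Path;
        
        class GoldenFile {
        
          static void check(String testName, String actual) throws IOException {
            Path golden = Path.of("src/test/resources/golden", testName + ".txt");
            if ("true".equals(System.getenv("UPDATE_TEST_RESULTS"))) {
              Files.createDirectories(golden.getParent());
              Files.writeString(golden, actual); // regenerate and commit to git
              return;
            }
            String expected = Files.readString(golden);
            if (!expected.equals(actual)) {
              // A real helper would print a structural diff here instead.
              throw new AssertionError("Golden mismatch for " + testName
                  + "\nexpected:\n" + expected + "\nactual:\n" + actual);
            }
          }
        }
    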

Acknowledged that this doesn't speak to side-effect-riddled code that needs
mocks, and doesn't buy into the build-the-test-first style of development, but
I've found it's a good return on investment for regression testing, and it
provides visibility into subtle logic changes, expected or otherwise,
introduced by code changes.

~~~
soedirgo
Are you referring to snapshot testing[0][1]? i.e. you first "snapshot" the
output of the function and commit it to VC, and each test run will run the
same input and compare it against the "snapshot", failing and giving a diff if
it differs.

I'm about to try it soon, seems like a good ROI as you said.

[0] [https://jestjs.io/docs/en/22.x/snapshot-
testing](https://jestjs.io/docs/en/22.x/snapshot-testing) [1]
[https://github.com/mitsuhiko/insta](https://github.com/mitsuhiko/insta)

~~~
Nihilartikel
I think I am! Through convergent evolution at least - I hadn't found essays
explicitly advocating it at the time.

My use case was comparing results from chains of Spark RDD & Dataframe
transformations, so having fairly large, realistic input/output datasets was
part of the game, and the main reason that manually writing all expected
results wasn't feasible.

------
beders
Mandatory link to
[https://www.youtube.com/watch?v=EZ05e7EMOLM](https://www.youtube.com/watch?v=EZ05e7EMOLM)

It clarifies a few of the points made in the article.

Personally I would add: Don't overuse mocks, only use dependency injection on
the module level (Ludicrous use of singleton DI will hinder your design, not
help it)

------
codeulike
_" Do programmers have any specific superstitions?"

"Yeah, but we call them best practices."_

(from
[https://twitter.com/dbgrandi/status/508329463990734848](https://twitter.com/dbgrandi/status/508329463990734848)
but that's probably not the original source)

~~~
brabel
Good luck coming up with your own way to write every test, making every
decision as if it were the first time!

~~~
codeulike
I'm happy to take advice, but given how debatable everything is in programming
I'm skeptical of anything labelled 'best practice' ... Over the decades, fads
come and go; today's best practice is often tomorrow's antipattern ...

------
commandlinefan
I'd be happy if we could just get to the point where programmers wrote code
that _could_ be unit tested, i.e. don't write static initializers that connect
to live databases, for a start.
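
To illustrate (a sketch with hypothetical names): the first class opens a
connection in a static initializer, so merely loading it in a unit test hits a
live database; the second takes the dependency through its constructor, so a
test can hand it an in-memory database or a fake.

    
    
        import java.sql.Connection;
        import java.sql.DriverManager;
        import java.sql.SQLException;
        
        // Anti-pattern: connects to a live database when the class is loaded.
        class OrderDaoHardwired {
          static final Connection DB;
          static {
            try {
              DB = DriverManager.getConnection("jdbc:postgresql://prod-db/orders");
            } catch (SQLException e) {
              throw new ExceptionInInitializerError(e);
            }
          }
        }
        
        // Testable alternative: the connection is injected.
        class OrderDao {
          private final Connection db;
        
          OrderDao(Connection db) {
            this.db = db;
          }
        }
    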

~~~
exabrial
Dependency Injection + Mockito allows one to do that. There's a whole
discipline to writing code that's easily tested. Our rule is no more than 4
layers: initiator, business logic, services, anemic domain model. Initiators
abstract away the inbound protocol and serialize to the common model. Business
logic controllers handle all of the branching. Services effectuate single
actions. Services can't call other services. And the domain model is how
everything talks. We all build our apps this way and it's really easy to move
people between projects. Not perfect, but it works for about 85% of the stuff
one has to write.

~~~
rmdashrfstar
I'm not sure calling the domain layer "anemic" by default is correct, as that's
typically a (negative) description of models which are too data-driven instead
of behavior-driven. I would suggest an alternative layer structure:
Initiators/Controllers -> Application/Domain/Infrastructure Services -> Domain
Models/Business Rules/Invariant Checking

~~~
exabrial
We've thought about that; the style I've described is very non-OO. It is easy
to teach, though, and it makes unit testing a breeze.

I like the architecture you've described. I think we arrived where we did
because it's a natural fit when using a DI container, which manages state and
transaction groups for you.

------
vemv
> KISS > DRY

I'd agree elsewhere, but not for tests. Non-DRY tests often mean that there's
an ad-hoc similarity between a series of tests.

Without DRYness, as time passes, domain knowledge fades away, and more
maintainers touch the same test code, it becomes quite likely that this
similarity will be broken. Which is a perfect recipe for false negatives: one
fixes one test but not all the others, which might start exercising a useless
code path due to drift.

~~~
joshka
The more canonical form is DAMP rather than KISS. Descriptive and Meaningful
Phrases. DRY is fine in moderation in tests. E.g. instead of
`Person.builder().withName("Alice").withAge(100).build();` using
`People.grandmother()` is DRY-ish, but certainly something you'd see less of
in production code.
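
A minimal sketch of that factory ("object mother") style; Person here is a
hypothetical record standing in for the real builder-based class:

    
    
        record Person(String name, int age) {}
        
        final class People {
        
          private People() {}
        
          // Names the scenario and hides the builder details irrelevant to most tests.
          static Person grandmother() {
            return new Person("Alice", 100);
          }
        
          static Person newborn() {
            return new Person("Bob", 0);
          }
        }
    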

------
splittingTimes
To easily see what went wrong in test failures, we follow this naming
convention:

UnitOfWorkUnderTest_Scenario_ExpectedBehavior

Which mimics the internal GWT structure.

~~~
joshka
I have suggested the same previously, but the display name approach is better;
test names can then be short:

    
    
        @DisplayName("throws IllegalArgumentException when given bad data")
        void badData() {

------
twic
Recommending AssertJ and not mentioning Hamcrest? Hamcrest is more flexible,
more composable, and more extensible.

~~~
joshka
Having used both extensively, AssertJ's API is more obvious and generally more
useful out of the box. It has replaced all the places where I'd previously used
custom Hamcrest matchers with generally smaller amounts of code that spit out
better errors when tests fail. It's not an either-or, though; sometimes it's
worth looking at whether Hamcrest is the right approach.

------
aabbcc1241
The table of contents is very helpful.

However, I'm curious why this Hugo-generated static site is consuming over
130 MB of memory? For reference, HN consumes only 5 MB.

(The numbers are observed from about:performance in Firefox.)

