I _really_ have to dispute the idea that unit tests score the maximum on maintainability. The fact that they are _so_ tightly tied to lower-level code makes your code _miserable_ to maintain. Anyone who's ever had to work on a system that had copious unit tests deep within will know the pain of not just changing code to fix a bug, but having to change a half-dozen tests because your function interfaces have now changed and a healthy selection of your tests refuse to run anymore.
The "test diamond" has been what I've been working with for a long while now, and I find I greatly prefer it. A few E2E tests to ensure critical system functionality works, a whole whack of integration tests at the boundaries of your services/modules (which should have well-defined interfaces that are unlikely to change frequently when making fixes), and a handful of unit tests for things that are Very Important or just difficult or really slow to test at the integration level.
This helps keep your test suite size from running away on you (unit tests may be fast, but if you work somewhere that has a fetish for them, it can still take forever to run a few thousand), ensures you have good coverage, and helps reinforce good practices around planning and documentation of your system/module interfaces and boundaries.
> Anyone who's ever had to work on a system that had copious unit tests deep within will know the pain of not just changing code to fix a bug, but having to change a half-dozen tests because your function interfaces have now changed and a healthy selection of your tests refuse to run anymore.
In my experience this problem tends to be caused by heavily mocking things out more so than the unit tests themselves. Mocking things out can be a useful tool with its own set of downsides, but should not be treated as a requirement for unit tests. Tight coupling in your codebase can also cause this, but in that case I would say the unit tests are highlighting a problem and not themselves a problem.
Perhaps you're talking about some other aspect of unit tests? If that's the case then I'd love to hear more.
I'd also add to this that people often end up with very different ideas of what a unit test is, which confuses things further. I've seen people who write separate tests for each function in their codebase, with the idea that each function is technically a unit that needs to be tested, and that's a sure-fire way to run into tightly-coupled tests.
In my experience, the better approach is to step back and find the longer-living units that are going to remain consistent across the whole codebase. For example, I might have written a `File` class that itself uses a few different classes, methods, and functions in its implementation - a `Stats` class for the mtime, ctime, etc values; a `FileBuilder` class for choosing options when opening the file, etc. If all of that implementation is only used in the `File` class, then I can write my tests only at the `File` level and treat the rest kind of like implementation details.
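To make that concrete, here's a minimal sketch (in Python, with made-up names, since the thread doesn't settle on a language) of testing only at the `File` level; `Stats` and `FileBuilder` are exercised indirectly through `File`'s public surface:

```python
import os
import tempfile

class Stats:
    """Implementation detail: wraps os.stat() for mtime/ctime/size."""
    def __init__(self, path):
        st = os.stat(path)
        self.mtime, self.ctime, self.size = st.st_mtime, st.st_ctime, st.st_size

class FileBuilder:
    """Implementation detail: chooses options for opening the file."""
    def __init__(self, path):
        self.path, self.mode = path, "r"
    def open(self):
        return open(self.path, self.mode)

class File:
    """The longer-lived unit the tests are written against."""
    def __init__(self, path):
        self._path = path
    def stats(self):
        return Stats(self._path)
    def read_text(self):
        with FileBuilder(self._path).open() as handle:
            return handle.read()

def test_file_reads_back_what_is_on_disk():
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
        tmp.write("hello")
        path = tmp.name
    try:
        f = File(path)
        # Only File's public surface is asserted on; Stats and FileBuilder
        # are covered indirectly, so they can change freely.
        assert f.read_text() == "hello"
        assert f.stats().size == 5
    finally:
        os.remove(path)
```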
It may be that it's difficult to test these implementation details just from the `File` level - to me that's usually a sign that my abstraction isn't working very well and I need to fix it. Maybe the difficult-to-test part should actually be a dependency of the class that gets injected in, or maybe I've chosen the wrong abstraction level and I need to rearchitect things to expose the difficult-to-test part more cleanly. But the goal here isn't to create an architecture so that the tests are possible, the goal is to create an architecture that's well-modularised, and these systems are usually easier to test as well.
There's an argument that this isn't a unit test any more - it's an integration test, because it's testing that the different parts of the `File` class's implementation work together properly. My gut feeling is that the distinction between unit and integration is useless, and trying to decide whether this is one or the other is a pointless endeavour. I am testing a unit either way. Whether that unit calls other units internally should be an implementation detail to my tests. Hell, it's an implementation detail whether or not the unit connects to a real database or uses a real filesystem or whatever - as long as I can test the entirety of the unit in a self-contained way, I've got something that I can treat like a unit test.
At one of my previous jobs we did very few unit tests (basically only pure functions) and tons of behavior/integration tests (i.e. run the service with a real database, real queues, etc., but mock the HTTP dependencies, call its API, and check we get the correct result and side effects), and it was the most stable and easy-to-work-with test suite I’ve ever seen. It was extremely reliable too.
Not mocking the database and other pipes is the single best improvement everyone can make on their test suites.
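A rough sketch of that style, with SQLite from the standard library standing in for whatever real database the service uses (all names here are hypothetical):

```python
import sqlite3

def create_user(conn, email):
    conn.execute("INSERT INTO users (email) VALUES (?)", (email,))
    conn.commit()

def find_user(conn, email):
    row = conn.execute(
        "SELECT email FROM users WHERE email = ?", (email,)
    ).fetchone()
    return row[0] if row else None

def test_create_user_persists_through_real_sql():
    # A real (if embedded) database: the schema, SQL, and constraints all run
    # for real instead of being replaced by a mock's canned answers.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT UNIQUE)")
    create_user(conn, "a@example.com")
    assert find_user(conn, "a@example.com") == "a@example.com"
```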
We also had a test suite that followed the same principle but started all the services together to reduce the mocked surface, it executed every hour and was both incredibly useful and reliable too.
I wanted to drop in and say we had a version of this discussion internally while I was putting this post together. Your observation about fixing a bunch of tests for a simple one line change is something I have seen as well. What we ultimately landed on is that, especially in our service-heavy environment (though not necessarily micro services), the cost of creating and maintaining integration testing infrastructure that is reliable, reasonably fast, and reflective of something prod shaped turns out to be even more expensive. Specifically, we looked at things like the costs of creating parallel auth infra, realistic test data, and the larger, more complex test harness setups and on balance it actually ends up being more expensive on a per-test basis. In fact, in some cases we see meaningful gaps in integration testing where teams have been scared off by the cost.
This isn't to say that unit tests, especially those with heavy mocking or other maintenance issues, don't carry their own costs; they absolutely do! But, and I think importantly, the cost per breakage is often lower, as the fix is much more likely to be localized to the test case or a single class, whereas problems in integration or E2E tests can start to approach debugging the prod system.
As with any "experiential opinion" like this, YMMV. I just set out to try to contribute something to the public discourse that's been reflective of our internal experience.
The testing-type divide feels similar to the schism around ORMs, where one camp (mine) finds that ORMs end up costing far more than the value they bring, while the other claims they've never had such issues and would never give up the productivity of their favorite ORM.
Both sides appear to be describing their experiences accurately, even though it feels like one side or the other should have to be definitively right.
I feel exactly the same about unit tests and ORMs. I'd like to know what kind of business applications these two things have really delivered. In my experience they both only worked in the most trivial sense: either the projects were mostly POCs and no one really cared about production issues, or they were so overstaffed that the team could keep supporting them just to give the devs daily work.
> I _really_ have to dispute the idea that unit tests score the maximum on maintainability. The fact that they are _so_ tightly tied to lower-level code makes your code _miserable_ to maintain.
I disagree to the point I argue the exact opposite. Unit tests are undoubtedly the ones that score the highest on maintainability. They are unit tests after all, meaning they are self-contained and cover the behavior of individual units that are tested in complete isolation.
If you add a component, you add tests. If you remove a component, you delete its tests. If you fix a bug in a component, you both add one or more tests that reproduce the bug and assert the expected output, and use all existing tests to verify your fix doesn't introduce a regression. Easy.
Above all, unit tests serve as your project's documentation of expected behavior and intent. I can't count the times where I spotted the root cause of a bug in a legacy project just by checking related unit tests.
> (...) a whole whack of integration tests at the boundaries of your services/modules (which should have well-defined interfaces that are unlikely to change frequently when making fixes), and a handful of unit tests for things that are Very Important or just difficult or really slow to test at the integration level.
If it works for you then it's perfectly OK. To me, the need for "a whole whack of integration tests" only arises if you failed to put together decent unit test coverage. You specify interfaces at the unit test level, and it's at the unit test level where you verify those invariants. If you somehow decided to dump that responsibility on integration tests then you're just replacing many fast-running, targeted tests that pinpoint failures with many slow-running, broad-scope tests that you need to analyze to figure out the root cause. On top of that, you are no longer adding checks for these invariants in the tests you must change when the interface changes. Those who present this as an extra maintainability burden of unit tests are completely missing the point and creating their own problems.
I think the whole problem is just terminology. For example take your comment. You start talking about unit tests and units, but then suddenly we're talking about components. Are they synonymous to units? Are they a higher level, or a lower level concept?
People have such varying ideas about what "unit" means. For some it's a function, for others it's a class, for others yet it's a package or module. So talking about "unit" and "unit test" without specifying your own definition of "unit" is pointless, because there will only be misunderstandings.
This is obvious, as another commenter said, but it is nonetheless useful.
You can use it to show graduates. Why have them waste time repeating the same mistakes? You probably need a longer blog post with examples.
It is useful as a checklist, so you can pause when working earlier in the lifecycle to consider these things.
I think there is power in spelling out the obvious. Sometimes experienced people miss it!
The diagram can be condensed by saying SMUR + F = 1. In other words, you can slide towards Fidelity, or towards "nice testability", which covers the SMUR properties.
However it is more complex!
Let's say you have a unit test for a parser within your code. For a parser, a unit test might have pretty much the same fidelity as an integration test (running the parser from a unit test, rather than, say, doing a compilation from something like Replit online). But in this instance the unit test still keeps all the other properties of a unit test.
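For instance, a pure parser is a case where the unit test already exercises essentially the same code path an integration test would (a hedged Python sketch with an invented format):

```python
def parse_version(text: str) -> tuple:
    """Parse a 'MAJOR.MINOR.PATCH' string into a tuple of ints."""
    major, minor, patch = text.strip().split(".")
    return (int(major), int(minor), int(patch))

def test_parse_version():
    # Same fidelity as calling the parser from the full pipeline, but with
    # unit-test speed, reliability, and isolation.
    assert parse_version("1.2.3") == (1, 2, 3)
    assert parse_version(" 10.0.1 ") == (10, 0, 1)
```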
Another point is that you are not really testing anything if you have zero e2e tests. You get a lot (more like 99-1 than 80-20) from having some e2e tests; beyond that, the other types of tests almost always make sense too. In addition, e2e tests, if well written and well considered, can also be run in production as synthetics.
After 10+ years working on testing practices inside Google, I have found that even the most obvious practices somehow get ignored or misunderstood. As with a lot of programming practices, for every person who has thought deeply about why a practice exists, there exist many, many more who just apply the practice as a matter of course (e.g. mocking, dependency injection, microservices, etc.).
It might be useful to provide a little more context for why I wanted to write this in the first place. Over the last 15 or so years we have been tremendously successful at getting folks to write tests. And like any system, once you remove a bottleneck or resource constraint in one place, you inevitably find one somewhere else. In our case we used to take running our tests for granted, but now the cost of doing so has real implications that we need to consider. I also observed some internal discussions that had become a little too strident about the absolutes of one kind of test or another, and often in such a way that treated terms like "unit" or "integration" as universal categories, completely ignoring the broad, practical implications we have bound together into a few shorthand terms.
My goal when trying to develop this idea was to find a way to succinctly combine the important set of tradeoffs teams should consider when thinking, not about a single test, but about their entire test suite. I wanted to create a meme (in the Dawkins sense) that would sit in the background of an engineer's mind and help them quickly evaluate their test suite's quality over time.
What's useful here? There's nothing actionable, no way to quantify if you're doing "SMURF" correctly. All the article describes is semi-obvious desirable qualities of a test suite.
You're not "doing SMURF". It's not an approach or a system. It's just a specific vocabulary to talk about testing approaches better. They almost spell it out: "The SMURF mnemonic is an easy way to remember the tradeoffs to consider when balancing your test suite".
It's up to your team (and really always has been) to decide what works best for that project. You get to talk about tradeoffs and what's worth doing.
I touched on this a bit up thread, but I just want to note that my intention wasn't to get anyone to "do SMURF correctly". My goal was to create an idea to compete with the "Test Pyramid" which, while a useful guide in an environment with limited or no testing, didn't lead to productive conversations in an organization with a lot of tests.
My hope is that this little mnemonic will help engineers remember and discuss the practical concerns and real-world tradeoffs that abstract concepts like unit, integration, and E2E entail. If you and your team are already talking about these tradeoffs when you discuss how to manage a growing test suite, then you will likely find this guidance a bit redundant, and that's fine by me :)
This is interesting, but I see a few issues with it:
- Maintainability is difficult to quantify, and often subjective. It's also easy to fall into a trap of overoptimizing or DRYing test code in the pursuit of improving maintainability, and actually end up doing the opposite. Striking a balance is important in this case, which takes many years of experience to get a feel for.
- I interpret the chart to mean that unit tests have high maintainability, i.e. it's a good thing, when that is often not the case. If anything, unit tests are the most brittle and susceptible to low-level changes. This is good since they're your first safety net, but it also means that you spend a lot of time changing them. Considering you should have many unit tests, a lot of maintenance work is spent on them.
I see the reverse for E2E tests as well. They're easier to maintain, since typically the high-level interfaces don't change as often, and you have fewer of them.
But most importantly, I don't see how these definitions help me write better tests, or choose what to focus on. We all know that using fewer resources is better, but that will depend on what you're testing. Nobody likes flaky tests, but telling me that unit tests are more reliable than integration tests won't help me write better code.
What I would like to see instead are concrete suggestions on how to improve each of these categories, regardless of the test type. For example, not relying on time or sleeping in tests is always good to minimize flakiness. Similarly for relying on system resources like the disk or network; that should be done almost exclusively by E2E and integration tests, and avoided (mocked) in unit tests. There should also be more discussion about what it takes to make code testable to begin with. TDD helps with this, but you don't need to practice it to the letter if you keep some design principles in mind while you're writing code that will make it easier to test later.
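As a small illustration of the first suggestion (a hedged sketch with invented names): inject the clock instead of calling time.time() or sleeping, and the test never has to wait or race:

```python
import time

def is_expired(created_at: float, ttl_seconds: float, now=time.time) -> bool:
    """Return True if the item is older than its TTL (hypothetical helper)."""
    return now() - created_at > ttl_seconds

def test_is_expired_without_sleeping():
    fixed_now = lambda: 1_000_000.0          # frozen clock, no sleep() needed
    assert is_expired(created_at=999_000.0, ttl_seconds=60, now=fixed_now)
    assert not is_expired(created_at=999_970.0, ttl_seconds=60, now=fixed_now)
```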
I've seen many attempts at displacing the traditional test pyramid over the years, but so far it's been the most effective guiding tool in all projects I've worked on. The struggle that most projects experience with tests stems primarily from not following its basic principles.
> If anything, unit tests are the most brittle and susceptible to low-level changes.
I don't find this to be the case if the unit tests are precise (which they should be).
That is, if you are writing non-flaky unit tests which do all the "right" unit-testy things (using fakes and dependency injection well, and so isolating and testing only the unit under test), you should end up with a set of tests that
- Fails only when you change the file/component the test relates to
- Isn't flaky (can be run ~10000 times without failing)
- Is quick (you can do the 10000 run loop above approximately interactively, in a few minutes, by running in parallel saturating a beefy workstation)
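A hedged sketch of what a test in that style can look like (invented names; the fake is injected, so the test is fast, deterministic, and coupled only to the unit under test):

```python
class FakeRateStore:
    """In-memory stand-in for whatever persistent store production uses."""
    def __init__(self):
        self.counts = {}
    def increment(self, key):
        self.counts[key] = self.counts.get(key, 0) + 1
        return self.counts[key]

class RateLimiter:
    def __init__(self, store, limit):
        self._store, self._limit = store, limit
    def allow(self, client_id):
        return self._store.increment(client_id) <= self._limit

def test_rate_limiter_blocks_after_limit():
    limiter = RateLimiter(FakeRateStore(), limit=2)
    assert limiter.allow("client-1")
    assert limiter.allow("client-1")
    assert not limiter.allow("client-1")   # third call exceeds the limit
```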
This compares to integration/e2e tests, which inherently break due to other systems and unrelated assumptions changing (sometimes legitimately, sometimes not), and can have flakiness rates of 1-10% due to the inherent nature of "real" systems failing to start occasionally and the inherently longer test-debug cycle that makes fixing issues more painful (root-causing a bug that causes a test to fail 1% of the time is much easier when the test takes 0.3 CPU-seconds than when it takes 30 or 300 CPU-seconds).
Very few tests I see are actually unit tests in the above sense; many people only write integration tests because the code under test is structured in ways that are inherently untestable or difficult to test.
That's true. I probably used brittle in the wrong sense there.
What I mean is that after any code change that isn't a strict refactoring you will inevitably have to touch one or more unit tests. If you're adding new functionality, you need to test different code paths; if you change existing functionality, you need to update the related test(s); if you're fixing a bug, you need to add a unit test that reproduces it, and so on. All this means is that unit tests take the most effort to maintain, so I'm not sure why the article claims they have "high" maintainability, or that that's a good thing. In contrast, higher-level tests usually require less maintenance, assuming they're stable and not flaky, since you're not touching them as often.
> Very few tests I see are actually unit tests in the above sense, many people only write integration tests because the code under test is structured in inherently un- or difficult- to test ways.
Very true. I think the recent popularization of alternative guidelines like The Testing Trophy is precisely because developers are too lazy to properly structure their code to make pure unit testing possible, and see the work of maintaining unit tests as too much of a chore, so they make up arguments that there's no value in them. This couldn't be farther from the truth, and is IMO an irresponsible way to approach software development.
Sure. But how do you achieve that in practice? All functions can't be pure, and at some point you need to handle messy I/O and deal with nondeterminism. How do you structure your code in the best way to do that, while also ensuring you can write useful tests? None of this is trivial.
After I wrote my comment, I realized I was actually somewhat wrong. (I have been in this debate, on the anti-TDD side, for quite some time now, and it was kind of hard for me to figure out why, but I think I finally got it.) Testability is more than just referential transparency (RT); RT is a necessary but not sufficient condition for testability.
> But how do you achieve that in practice?
The way to achieve referential transparency is, in my mind, functional programming. Specifically, use monads to model side effects. You can model any computer system as a composition of lambda terms exchanging data (a specific subset of terms) monadically, so in theory this can be achieved. So you can imagine any program as a tree of functions, each of which builds a more complex function as a composition of smaller, simpler functions, until the whole program is put together in the main() function.
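Since the thread has no single language, here is a very rough Python sketch of that idea (a toy IO monad with invented names, not a recommendation): effects are described as values and only run at the edge, so the core stays referentially transparent and trivially testable:

```python
class IO:
    """Wraps a deferred side effect; nothing happens until run() is called."""
    def __init__(self, thunk):
        self._thunk = thunk
    def run(self):
        return self._thunk()
    def map(self, f):
        return IO(lambda: f(self.run()))
    def flat_map(self, f):
        # f returns another IO; this sequences the two effects.
        return IO(lambda: f(self.run()).run())

def pure(value):
    return IO(lambda: value)

# Pure core: testable without performing any effect.
def greeting(name: str) -> str:
    return f"Hello, {name}!"

# Effectful edge: described, not executed, until the program is run.
def read_name() -> IO:
    return IO(lambda: input("name: "))

def write_line(line: str) -> IO:
    return IO(lambda: print(line))

program = read_name().map(greeting).flat_map(write_line)

if __name__ == "__main__":
    program.run()   # side effects happen only here

# In a test you never run the edge: assert greeting("Ada") == "Hello, Ada!",
# or build the same pipeline starting from pure("Ada") instead of read_name().
```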
However, I need to add two other conditions for a program to be testable: each function into which you decompose your program has to (2) be reasonably short (have a limited number of compositions) and (3) have a clear specification, based on which it can be determined whether the function is correct in itself. Condition (2) is strictly speaking not required, but because we are humans with a limited ability to understand, we want it to help us create (3).
Now I believe that the more you have RT and (3), the more testable your program is. This is because testing is pretty much just partial type-checking by sampling: you create some sample values of the type you expect, and you verify that the program produces the expected values. The advantage of sampling is that you don't have to formally specify your types, so you don't need a complete formal specification. The conditions RT and (3) are pretty much necessary if you want to properly type your programs (for example, we could specify every function using a type in dependent type theory).
So testable (which is a spectrum) really means "close to type-checkable" (which is a binary). I do, however, need to address a misconception (one which I had): the types that we assign to functions (i.e. the specification) do not come from the program itself, but rather from the domain knowledge we expect to impart into the program. Literally, types are (or can be) the specification.
And by the way, condition (2) determines how small the units of the program you can test are.
Now after the above setup, let me get to the main point, which I will call testability tradeoff: The conditions (2) and (3) are mutually exclusive for some programs, i.e. there is a tradeoff between (2) making units small and (3) giving them a good (easy to interpret) specification.
Let me give some extreme examples of testability tradeoffs for different programs to illustrate this concept. A library of functions usually has only a small testability tradeoff, because most functions are independent of each other, so each of them (at the API level) can satisfy both (2) and (3). On the other end of the spectrum you have things like a trained neural network or a program that implements a tax code: even if you can decompose those programs into little pieces to satisfy condition (2), it is not possible to then assign these pieces a meaning sensible enough to construct useful tests per condition (3). Such programs are simply not understandable in detail (or better to say, we don't know how to understand them).
The hidden assumption of TDD folks (and proponents of massive unit testing, in the test pyramid) is that we can always restructure a program to have (2) and (3), i.e. in their view, the testability tradeoff can always be made as low as needed. But I don't think this is true in practice, and I have given examples above: we have complex, useful programs that cannot be decomposed into little pieces where each piece can be meaningfully specified in the business domain. Such programs, I claim, cannot be effectively unit tested, and can only be e2e or integration tested. (However, they can be type-checked against the full specification.)
So because (as stated above) the testability of a program is pretty much your ability to meaningfully assign (expected) types and typecheck, I now think that TDD proponents, when they talk about testability, want to have as much specification as possible. Which is kind of funny, because they started as a kind of opposition to that idea. Ah well, paradoxes of life...
Anyway, I know my response is a bit messy, but hopefully I explained the main idea enough so it will make more sense on rereading.
This model ("mnemonic") feels like a good tool to reason about your testing strategy. I ran across the "testing trophy" in the past, which really changed my thinking already, having been indoctrinated with the testing pyramid for such a long time before that. I wanted to share my favorite links about the testing trophy, for those interested:
I think the test pyramid is a great idea, in theory. In practice, having lots of unit tests with mocked dependencies doesn’t make me sure that everything works as it should. Thus, I use a real database in my unit tests. There, I test serialization errors, authentication problems, network issues, etc. All the real problems which can occur in real life. Leaving these scenarios for the integration test layer would turn a test pyramid into a test diamond.
And what was the rationale behind mocking the database in the first place, speed? Disable synchronous WAL writes, or run your Postgres instance in RAM. Your test suite execution speed will skyrocket.
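For anyone curious, these are the usual postgresql.conf knobs people flip for throwaway test instances (never for production data you care about, since they trade away durability); putting the data directory on a tmpfs mount covers the "in RAM" part:

```
fsync = off                 # don't force WAL writes to disk
synchronous_commit = off    # don't wait for WAL flush on commit
full_page_writes = off      # safe only because the test data is disposable
```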
This entire heuristic is not even about testing. The people who created it aren’t interested in testing— they want an excuse to release their shitty products.
They believe that experiencing a product is just an afterthought. They are like chefs who never taste the food they cook.
Testing is a process of investigation and learning. What this post covers is mechanical fact checking.
End-to-end tests verify high-level expectations based on the specification of the system. These high-level expectations generally stay stable over time (at least compared to the implementation details verified by lower-level tests), and therefore end-to-end tests should have the best maintainability score.
> end-to-end tests should have the best maintainability score.
End to end tests encompass far more total details than implementation or unit tests. If you're testing a website, moving a button breaks a test. Making the button have a confirmation breaks the test. The database being slower breaks the tests. The number of items in a paginated list changing breaks the tests. You're testing not just the behavior and output of interfaces, you're testing how they're composed. The marketing team putting a banner in the wrong place breaks the tests. The product team putting a new user tour popover on the wrong button breaks the tests. The support team enabling the knowledge base integration in your on-page support widget breaks the tests.
Moreover, the cost of fixing the tests is also often higher, because end to end tests are necessarily slower and more complicated. Tests often become flaky because of a larger number of dependencies on external systems. It's often not clear why a test is failing, because the test can't possibly explain why its assertion is no longer true ("The button isn't there" vs "The button now has a slightly different label").
E2E tests are less maintainable almost by definition. They are the type of test that has the most dependencies on sub-systems and other systems, which means that if a test fails you'll need to do more work to figure out what went wrong than with an integration test, which depends on fewer sub-systems.
The expectations can be pretty stable, but because they cover so much of the system, they tend to be more fragile. End to end tests are also often flakier because they're dealing with the system at levels meant for human interaction, like by driving actual browsers. Because they encompass so much, they're also the slowest. When you have a problem with an end-to-end test, they can take way more time to debug because of that.
So I'd agree with them; E2E tests are the hardest to maintain.
Another unit test defense: they're the most accessible and inspectable, in the sense that there's practically zero barrier to running one immediately in your IDE and stepping through it if necessary.
I feel like "testing trophy" has been in vogue for a while now, and definitely feels more right as someone who's made a career of unfucking test suites, but there's almost no area of software engineering as involved in navel-gazing as testing.
In 2011 I wrote a blog-post[0] about the dangers of taking TDD too literally, and until I disabled comments on it ~5 years later people were still angrily shaking their fists at each other online. 139 angry comments for a very short post.
Someday I may write up the comparative experiences of joining an existing 120kloc mess where everything in the app code was done wrong and had no unit tests (that I recall finding) on one side, vs a carefully engineered ISO 9001 certified project on the other.
The mess repeatedly won awards.
End users don't care about unit tests themselves, they care about the stuff the unit tests are a proxy for.
(Previously wrote "customers don't care", but sometimes the customer is a business where the tests are a requirement, YMMV).
Tests are the map, absence of bugs is the territory.
It always amazes me how speed and testing are placed in the same bracket. I want solid verification, and a strong pattern for repeating that verification no matter what. This then allows for fast implementation. So something that reliably integrates as many components as possible makes the most sense (verification-wise): I want to verify value early. It is eyebrow-raising that this pyramid nonsense has hung around.
As an example, your search graph might define your nodes as possible application states and the edges as actions to transition to a new state.
Some edges might be "create a post" and "change post privacy settings". You might then define an invariant saying "after a user creates a post the new post is in their follower's timeline if privacy settings allow".
Your search might eventually come across some set of actions that lead to this invariant being violated which is a discovered bug, e.g. if a user A creates a post -> user A changes settings to private -> user B can still see the post.
Since the search space is infinite, the effectiveness is heavily influenced by how much time you can spend searching the tree & how fast you can evaluate states. This is a kind of exploratory test & wouldn't replace unit/integration tests, but it does let you catch certain kinds of bugs that would be impossible to find otherwise.
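Here's a deliberately tiny, hedged Python sketch of that shape (all names invented): states are immutable values, actions are the edges, and a bounded breadth-first search looks for an action sequence that violates the invariant:

```python
from collections import deque
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class State:
    posts: frozenset = frozenset()      # (author, post_id, is_private)
    timelines: frozenset = frozenset()  # (viewer, post_id)

def create_post(state, author, post_id, followers):
    posts = state.posts | {(author, post_id, False)}
    timelines = state.timelines | {(f, post_id) for f in followers}
    return replace(state, posts=posts, timelines=timelines)

def make_private(state, author, post_id):
    posts = {(a, p, True if (a, p) == (author, post_id) else private)
             for (a, p, private) in state.posts}
    # Deliberate bug: timelines are not purged when a post goes private.
    return replace(state, posts=frozenset(posts))

def invariant(state):
    """No timeline may still contain a post that is now private."""
    private_posts = {p for (_, p, private) in state.posts if private}
    return not any(p in private_posts for (_, p) in state.timelines)

def actions(state):
    yield ("A creates post 1", lambda s: create_post(s, "A", 1, followers={"B"}))
    yield ("A makes post 1 private", lambda s: make_private(s, "A", 1))

def search(max_depth=4):
    queue = deque([(State(), [])])
    while queue:
        state, trace = queue.popleft()
        if not invariant(state):
            return trace                 # a concrete bug-reproducing sequence
        if len(trace) < max_depth:
            for name, act in actions(state):
                queue.append((act(state), trace + [name]))
    return None

print(search())  # ['A creates post 1', 'A makes post 1 private']
```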
When the test suite is too slow, it becomes unwieldy, it gets in the way (instead of being part of a valuable feedback loop) and everyone starts looking for shortcuts to avoid running it.
For sure. And I'd add that the specificity of unit tests is hugely valuable. If I do some sort of refactoring and found I've broken an end-to-end test, it may tell me very little about where the problem is. But the pattern of unit test breakage is very informative. Often just seeing the list of broken tests is enough to tell me what the issue is without even having to look at an error message.
And yet, people focus endlessly on unit tests, pyramid in hand, saying things like: they're the best kind of test, or at least better than integration tests. It takes some maturity to articulate the nuances, and while I've tried, I think SMURF may be a good aid in that. While I moved away from the religious focus on unit tests long ago, I appreciate learning about SMURF today.
That's fair. It's easy to get philosophical about such things so something you can point to that's more based on metrics can help a discussion be more objective.
otoh countering opinion with fact doesn't always work well - it might just turn into quibbling over where on each axis different test types' strengths lie ;)
I think that a lot of folks have never heard anything else but the testing pyramid, repeated over and over. I find them often very open to other ideas, in my case I've previously heard about the "testing trophy", and found willing audiences.