I really like the way they keep repeating this as the answer to so many questions. To me it reads as a softly softly approach to weaning people off what appears to be a mania for micro level unit testing, driven by people like Uncle Bob.
"The structure of your tests should not be a mirror of the structure of your code. The fact that you have a class named X should not automatically imply that you have a test class named XTest."
"The structure of the tests must not reflect the structure of the production code, because that much coupling makes the system fragile and obstructs refactoring. Rather, the structure of the tests must be independently designed so as to minimize the coupling to the production code."
Does anyone know of explanations of this with a more hands-on approach, or is this simply a collection of ideas that can't really be shown?
Now you are testing the behaviour of your application through its public surface. This reduces the brittleness of your tests because you can change the internal implementation without rewriting your tests. You have higher assurance. It'll also force your hand to enforce invariants and place guard clauses in the right places rather than "everywhere".
If you follow Cockburn's Ports and Adapters approach too, then you can substitute adapters like persistent state, buses, HTTP clients for appropriate in-memory equivalents.
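The Ports and Adapters idea can be sketched roughly like this. This is a minimal, hypothetical example: the names (`UserStore`, `InMemoryUserStore`, `register_user`) are invented for illustration, not taken from any particular codebase.

```python
# Hypothetical sketch of a "port" (an interface the application depends on)
# with an in-memory adapter that a test can substitute for a real database.
from abc import ABC, abstractmethod


class UserStore(ABC):
    """Port: application code depends only on this interface."""

    @abstractmethod
    def save(self, user_id, name): ...

    @abstractmethod
    def find(self, user_id): ...


class InMemoryUserStore(UserStore):
    """Adapter used in tests: fast, no external dependencies."""

    def __init__(self):
        self._users = {}

    def save(self, user_id, name):
        self._users[user_id] = name

    def find(self, user_id):
        return self._users.get(user_id)


def register_user(store, user_id, name):
    # application behaviour under test, expressed against the port
    if store.find(user_id) is not None:
        raise ValueError("user already exists")
    store.save(user_id, name)


# a test wires in the in-memory adapter; production would wire in a real one
store = InMemoryUserStore()
register_user(store, 1, "alice")
```

The test exercises real application behaviour through the public surface; only the adapter behind the port is swapped.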
So far, I've not been able to find any examples of test architecture that follows those principles online or in my day job.
I've also been too lazy to do a side project and explore a more decoupled and scalable test suite. Maybe I should get off my arse and finally do it.
Don't they use the wrong term for them?
I believe these tests are more commonly called "integration tests", not "unit tests".
Or, did I misunderstand something?
The jump indicated by the "therefore" confuses me. I essentially couple my test to production behavior, insofar as it is possible. I do this to test program behavior realistically and because I believe tests should, in some sense, form a specification for the program they test. I don't really care about "testing code"; I care about testing behavior.
I don't have the problem of small changes to code necessitating large changes to tests. It makes me wonder what he does that is causing that.
The more likely it is that a piece of code will break, and the more business damage it will do if it does break, the more tests I wrap around it.
For self-contained algorithms that have a lot of branches or complex cases, I use more unit tests. When the complexity is in the interaction with other code, I write more high-level tests. When the system is simple but critical, I write more smoke tests.
If I’ve got simple code that’s unlikely to break and it doesn’t matter if it does break, I might have no tests at all.
This is a valid observation, but the problem clears up when you start integrating better debugging tools into your testing infrastructure: having the ability to intercept and inspect API calls made by your app, launch a debugger in the middle of the test, etc.
It is also minimized by writing the test at the same time (or just before) the code that makes it pass.
In this same vein, I propose another question:
How similar does the test look to the production code?
I think this comes from "unit tests" where they have assertions such as
For testing dependencies, you need an environment with a test instance of the database, API, or other external service. Then the code can be tested against an actual implementation rather than a mock of it.
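As a concrete sketch of this, SQLite's in-memory mode can stand in as a real test instance of the database; the table and function names here are invented for illustration.

```python
# Sketch: exercising query code against a real (in-memory) SQLite engine
# instead of a mock of the database layer.
import sqlite3


def count_active_users(conn):
    # code under test: a real query, run by a real engine
    row = conn.execute("SELECT COUNT(*) FROM users WHERE active = 1").fetchone()
    return row[0]


# test setup: an in-memory instance stands in for the production database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, active INTEGER)")
conn.executemany("INSERT INTO users (active) VALUES (?)", [(1,), (0,), (1,)])
```

Because the query actually runs, a typo in the SQL fails the test; a mock of the database would have happily returned whatever you told it to.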
I am not really following. You can write a separate function that handles the database query, then call this function in your code under test; or one of the arguments to the code under test is the database handler object (this allows you to abstract over a specific DB handler). But if you don't mock out the handler with a "dummy handler", you will be making real database calls. That is not UT.
You will be doing functional testing.
So you can only abstract so much to get your code testable.
You can, but that still requires that you mock that call in order to test the method that calls your database query method. But, if you make it such that your method takes the result of the database query as a parameter, then you can test your method without having to use a mock. In other words, you can change this (which requires mocking get_db_result to test):
def a_method_to_test(param1, param2):
    db_result = get_db_result(param1)
    # do something with db_result

to this, which needs no mock:

def a_method_to_test(param1, param2, db_result):
    # do something with db_result
Basically, doing this will separate code that manipulates data from code that either writes or reads data. You can unit test methods that manipulate data, but you will need to do functional/integration testing of code that writes or reads data from other sources.
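To make the point runnable, here is the refactored shape with an invented body (the original leaves the body unspecified, so the arithmetic here is purely illustrative):

```python
# Runnable sketch of the refactoring described above: the data-manipulating
# code takes the query result as a parameter, so testing it needs no mock.
def a_method_to_test(param1, param2, db_result):
    # illustrative logic: combine the query result with the other arguments
    return [param1 + param2 + row for row in db_result]


# unit test: hand it a plain value where the query result would go
result = a_method_to_test(1, 2, [10, 20])
```

The function that actually performs the query still exists, but it gets covered by functional/integration tests, not by mocked unit tests.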
If you think this sounds like mocking, you're right that it's similar, but it isn't mocking. It has the bonus of making it easier to inspect the logic and expose it to the user ("hey, I'm about to do these things, does that sound ok to you?").
Well, not really. Because now I have the DB stuff all contained in a separate function (or handler, which can be a proxy to the correct DB engine/type of DB), so with careful design I should be okay.
def create_account(user_manager, user_details):
    # illustrative body: the injected user_manager does the actual work,
    # so a test can pass in a fake manager instead of patching
    return user_manager.create(user_details)
Of course this is ideal, but that was the motive. I do agree that mocking == testing the implementation in general, and functional testing is almost always the way to go... since mocking returns dummy data / fixtures, you might as well return from a real database. The downside is speed :/ (there are various tricks like setUpClass, or running every test or group of tests in Docker containers in parallel, but it still takes much longer than UTs). Ugh, trade-offs.
I'm still not sure how to get the good outcome every time. Maybe it's just well designed interfaces.
Secondly it decreases your confidence in the refactor. In a classic refactor, nothing should break. This is because a good set of tests are a way of documenting and verifying the design/behaviour of the system.
Finally, often in a codebase with micro tests, there will be an expectation that you replace one set of micro tests with another.
That "feedback" is the pain you feel when you have to maintain one form of tight coupling (ordinary bad code) combined with another (micro-level unit tests).
Higher level tests won't cause you that 'guiding pain' because they aren't as tightly coupled to your implementation.
IMO you don't need to write tens of thousands of lines of unit test code to spot that you're building tight coupling in. It's something you can spot simply by knowing what to look for.
Moreover, if you make a design mistake or introduce tech debt lower down the stack, having a panoply of low level unit tests means that refactoring will break those tests even if you haven't introduced any bugs. IME that causes a dangerous cementing effect on bad code.
I think if you've got a module which is self contained, reusable and tightly scoped, it's worth surrounding with a lower level test. But, if you don't, you're doing more harm than good by surrounding a module with tests.
I don't think you do. I think a testing pyramid is a sign of an unhealthy approach and an unhealthy codebase.
I think what you need is a testhenge. For those who aren't familiar with henges, picture a typical (although poorly maintained) one, like Stonehenge.
Rather than being massive at the bottom and then getting thinner as it goes up, there is a continuous band around the top, covering everything, supported by uprights that dive deeper down, just where needed.
That is, you need a layer of high-level tests for everything (for a web application, some mix of browser tests or controller-level integration tests), along with component-level integration tests and isolated unit tests for the bits of the system that are particularly risky (complex, error-prone, critical, new, old, etc).
Because most of the time, it reduces "test precision".
Which means when such a test detects an issue, there's now a bigger area of the code where the issue could be located.
> Doesn't efficiency outweigh respecting feature boundaries in tests?
In the end, what we're trying to minimize here is feature development time.
This of course depends on the feedback loop duration (build + run tests), which is why test efficiency is important; but there are other factors.
Feature development time also depends on the time it takes to locate the error when a test fails; a test signalling that "something is wrong somewhere" is not very useful (in some cases it can be, if it runs extremely fast, because you can generally see the error with "git diff").
However, infinitely precise tests are often undesirable, because they tend to have extremely rigid expectations about the behavior/API of the code under test, discouraging refactoring, and thus slowing down the development process.
Let's keep in mind that "testing = freezing". More precisely, you're freezing what your tests depend upon (by adding friction to future modifications).
So be careful what you freeze: the initial intent of testing is to increase code flexibility. If you can't modify anything without breaking a test, you're probably missing their benefit.
Tests are not a debug tool. Tests are here to tell you when you broke something.
When this happens you can get your debugging toolbox out: stack traces, profilers and things like GDB. And then follow the steps your test script did.
But "when you click on this after that, this happens while it shouldn't" is usually enough info to start debugging: you have reproducible steps. Which you can follow with a debugger running, so you see exactly where and how things break.
And it is a lot less brittle than "well, we tried to refactor some minor thing and now everything is red; but we don't know if the software behavior changed or if our tests were just checking the implementation".
Seen quite a few tests in my time that don't capture the functionality they think they do. They pass, but wouldn't be able to tell if the underlying functionality they capture is genuinely broken. This is why, I guess, the standard practice is to go test red before you go test green.
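A toy illustration of how a test can pass without exercising anything (function and test names are invented):

```python
# Sketch: a test that stays green no matter what the code does.
# Watching it fail first (going red) would have exposed the missing call.
def apply_discount(price, percent):
    return price * (1 - percent / 100)


def test_apply_discount_broken():
    expected = 90
    # Bug in the test: apply_discount is never called, so this can't go red.
    result = 90
    assert result == expected


def test_apply_discount_real():
    # This one actually exercises the code: break apply_discount and it fails.
    assert apply_discount(100, 10) == 90


test_apply_discount_broken()
test_apply_discount_real()
```

Running the broken test against a deliberately broken `apply_discount` is exactly the red-first check: if the test still passes, it isn't testing what you think.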
The answer is always no. Even if you are the only person building something, future you will lick your boots clean in gratitude if there are tests.
Because even the best developers have to work with their own code sometimes.
I became a believer in automated testing when I worked with a large ETL process that handled real estate data. This data was input by real estate agents, varied wildly in quality, and included both image files and structured text data. Releasing this code before we had automated testing was a fearful affair. We'd push changes to staging, wait for three (!) days of data, then examine the staging and production databases manually to see if there were material differences.
Needless to say, we didn't like to release this code very often.
When I left, we weren't doing automated testing on the image processing, but we were on the text side of things. It became far easier to do a release because we had a set of regression tests that gave us confidence we weren't breaking anything. If something did pop up, we could add it to the suite.
Human beings systematically undervalue their future selves. This is why we have a hard time saving for retirement, exercising and writing tests. Think of your future self and write tests.
What's also worrisome is that sometimes your incentives are misaligned; e.g. when product development and maintenance are handled by different teams, or features have tight deadlines and it's "screw the debt, that's a problem for the future" - a future we must now deal with.
I think the answer is mostly no but it's dangerous to think that it's always no.
I've been given many stories in the past for which writing a realistic automated test would have taken days, manual verification took minutes, and the code was fairly isolated and did not get changed very frequently. Writing a test under those circumstances is actually a pretty poor investment.
Also the time savings still pay off later, as automated tests usually take seconds to run and there’s no training required - once it’s in the test suite and the test suite runs are automated, it will always run and quickly identify a failure - no “oops, we forgot to show Jim the Intern that he had to test that part manually...”
Setting up your test suite and automation takes longer, for sure, but not days. Even a complex manual process can be automated relatively quickly... the manual process should be fairly scriptable in any OS nowadays, and most platforms have great test frameworks.
Most of the examples that require "days" (or weeks) for the automated test and 5 minutes for manual verification would involve building or amending an elaborate mock/stub.
Some examples where this happens include rare interactions with weird hardware, odd legacy APIs that are scheduled to be replaced, race conditions and obscure edge cases with mocked services.
I document all of these special cases and they should clearly remain special but I'm not going to blindly assume that automating the story will have a positive ROI.
Also, integrating that tool would probably have applications beyond a single story so even if the change takes 5 minutes and making the test work with vaurien takes half a day, it's probably still worthwhile.
Assuming that no tool like vaurien exists, though (and there are plenty of scenarios out there where you'd have to build it from scratch), building the test scenario could become prohibitively expensive.
If you absolutely need to test connecting to an external system you should mock it or create an environment for testing against.
Your credentials should also be stored as env variables so the test ones are used only for the test environment.
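A minimal sketch of the env-variable seam, assuming an invented variable name (`APP_DB_URL`) and an in-memory fallback for local runs:

```python
# Sketch: credentials/connection strings come from the environment, so the
# test environment can supply test-only values without code changes.
import os


def database_url():
    # fall back to a local in-memory database when the variable is unset
    return os.environ.get("APP_DB_URL", "sqlite:///:memory:")


# a CI job or test runner would export the test value before running:
os.environ["APP_DB_URL"] = "postgres://test:test@localhost/test_db"
test_url = database_url()

# with the variable unset, local runs get the safe default
del os.environ["APP_DB_URL"]
default_url = database_url()
```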
Where are those env variables stored on the CI system? In the job config? I would use Vault for this.
Automated test fundamentalists are subject to the same kinds of folly that other fundamentalists are.
How many customer minutes lost = 1 developer minute?
That said, I think tests are the most expensive and most brittle way to address this problem. They're necessary, but should be deployed sparingly.
A) In such cases I think testing against the real thing often has a greater chance of catching subtle bugs rather than the automated test scenario against the elaborate mock which is highly likely to share many of the same assumptions that caused the bug.
B) Code reviews ought to flag that a piece of code that is not under automated test is being modified and appropriate care should be taken (ideally this alert should be automated but I haven't yet reached that level).
I think it pays to approach these things on a case by case basis, and if a pattern of subtle bugs does appear that's a strong indication that you should change your behavior (I'm a stronger believer in bug retros than I am in any kind of testing dogma).
>How many customer minutes lost = 1 developer minute
Is that a relevant question to ask? If you introduce the presumption that an automated scenario test is more likely to catch a bug than a manual test then I guess it's relevant, but honestly for these types of scenario I think the opposite is true.
I didn't mention it before also but if you have manual testers on hand that changes the calculus too. I'd say it's normal for 3-4 manual tester minutes to be equivalent to about 1 developer minute.
As I mentioned above, I really don't think it pays to be a test fundamentalist.
>That said, I think tests are the most expensive and most brittle way to address this problem.
Or a type system fundamentalist.
Miserably bad. Unforgivably bad.
Slowly refining and testing systematizations of correct software building processes is perhaps the most important thing we can do in the first 2 centuries of "software" as a thing.
Because otherwise, all we'll do is continue to wallow in pride and failure, claiming it can't be helped. All the while using language like "case by case," that I have taken to mean: "I will never do that unless you force me to."
Fortunately, I think the scope of failure and fraud in the software industry has grown so great that folks are starting to take correctness as a requirement and not a nice-to-have. Another Equifax or two and maybe a nice DAO hack or something, and folks are going to start saying, "Maybe it's just too bad we all learn to make bad software," turning to new techniques and practices.
Far-fetched? Maybe. But it is happening with AI...
I actually created my own open source BDD framework and a ton of tooling to help automate stories.
19 out of 20 was a pretty conservative estimate of how much I automate - it's probably more like 39 out of 40. I'm a little obsessive about it because I want to dogfood my work properly so I automate quite a few things where the cost/benefit for a normal programmer would seem a bit low.
I'm very cognizant that the industry as a whole is terrible at testing and I'm hoping my software can one day do a small part to help with that.
Also don’t forget that certain things like Linux are extremely battle-tested. It’s generally unlikely that any piece of software you write will receive that much real-world testing, so a growing set of automated tests will help you cover your own arse.
How much business value is this test adding? That is, if this test failed and we ignored it, how much would the business suffer?
Is the code easy to test? That is, does the design have lots of self-contained components with well-described input/outputs & conditions/assumptions? Do the docs clearly communicate that?
Will the test still work if we change the implementation? How much work to update/remove the tests if the behaviour has to change to follow new business requirements?
This is one that I struggle with in JS with React.js components. If you have a little helper component in a file that isn't exported but is used in the same file by a component that is exported, it is sometimes difficult to test that non-exported component. Because of how enzyme shallow rendering works, you don't get the full tree, so that component, if sufficiently nested, might never be touched. This forces me to export the component just to test it.
Extracting code from a big component to helper functions and extracting those functions from the component can lead to cleaner code, and it makes it much easier to test the behaviour of the helpers directly than having to render the component with enzyme.
A good example of this is moving state changes to pure functions, which makes them much easier to test, but you'll need to export those functions to test them.
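The same idea, sketched language-agnostically in Python (the state shape and function name are invented): a state change expressed as a pure function can be tested directly, with no component rendering at all.

```python
# Sketch: a state transition as a pure function that a test calls directly.
def add_todo(state, text):
    # returns a new state dict rather than mutating the old one
    return {**state, "todos": state["todos"] + [text]}


initial = {"todos": []}
next_state = add_todo(initial, "write tests")
```

No rendering harness, no shallow-vs-full-tree concerns; the test is just input in, output out.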
This boils down to something I've had on my mind a lot lately, though with a different spin. I write a lot of Go. I prefer testing interfaces while some others want to use mock generators. This quote captures part of my reasoning behind avoiding mocks. I plan to write a detailed post at some point full of examples. I think this quote will work its way in there.
- Can I run the tests in random order?
- Are the tests optimised for readability?
- Are the tests unnecessarily testing third-party code?
- Do individual tests contain the whole story?
(Or do you have to look in many places to understand each test?)
Can I run these tests more than once? Will they ever go stale?
Can I run these tests and have them clean up after themselves?
Having to update tests because they didn't take into account the date changing (Happy birthday Joe Test!) or manually cleaning up data is a huge time suck.
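The date problem usually disappears if the clock is injected rather than read inside the code under test; here is a minimal sketch with an invented `is_birthday` function, where the `today` parameter is the seam (production code would pass `date.today()`).

```python
# Sketch: injecting "today" so date-dependent tests never go stale.
from datetime import date


def is_birthday(user_birthday, today):
    # pure comparison; no hidden call to date.today() inside
    return (user_birthday.month, user_birthday.day) == (today.month, today.day)


# the test pins the date, so it passes identically next year
on_the_day = is_birthday(date(1990, 4, 1), today=date(2020, 4, 1))
day_after = is_birthday(date(1990, 4, 1), today=date(2020, 4, 2))
```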
* Are these tests relying on API that is likely to change?
* Can I make the API surface area used by all tests even smaller?
* Can I make a library that wraps the existing API of the unit to get a smaller and/or more stable API for use in the tests only?
An acceptance test for an editing form is relying on the save button having a certain CSS style to find it and click it. This is API that is likely to change. An unrelated change in the looks of a button may break the test.
If we switch to using the text of the button ("Save"), that's better, because that is what the user is likely to rely on too when they try to find a save button. But it's still not perfect, as the text of the button could change.
The final step is to make a little library function that finds a save button within a certain form. Then we can encode the logic of the test but vary the kinds of text that are considered "save" (or even the method of finding a save button - perhaps a standard CSS class of save buttons in the future!); the test logic itself remains "permanent" since it doesn't rely on any implementation detail anymore.
The small library function would be `getSaveButton(form)` or even `save(form)` - now every form save test no longer encodes the knowledge of how a form's save buttons are made, whether that's by using a certain ID, class, text or anything else.
Now when we get that requirement for a screen with two forms, we'll no longer get mad and try to attack the idea (two save forms on one screen? that's inconsistent with our product, it's confusing the users, etc. etc.) when the real reason is that it creates pain updating our tests. Instead we just update the save function.
The general idea is to encode the meaning of the test and separate out the implementation details (clicking the save button might even be too concrete - "saving the form" is probably about the right level of abstraction). A good way to do this is to describe the test in prose and check whether it's encoding details that may vary - does this sound like something universally true / something that will be true forever, or accidentally true due to current circumstances?
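A toy version of the helper, using a plain dict instead of a real browser driver (the form model, `SAVE_LABELS`, and the `btn-save` class are all invented for illustration):

```python
# Sketch: one helper owns the knowledge of how save buttons are identified.
# Change the lookup strategy here and every test that calls save() still works.
SAVE_LABELS = {"Save", "OK", "Apply"}


def find_save_button(form):
    for button in form["buttons"]:
        if button["label"] in SAVE_LABELS or button.get("css_class") == "btn-save":
            return button
    raise LookupError("no save button in this form")


def save(form):
    # tests express intent ("save the form"), not mechanics
    find_save_button(form)["clicked"] = True


form = {"buttons": [{"label": "Cancel"}, {"label": "Save"}]}
save(form)
```

With a real tool the helper body would wrap the driver's own element lookup, but the shape of the abstraction is the same.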
But getting back to my original idea, what I want to highlight is the need to add cases that cover application security.
Since "breaking code" is super subjective, and generally speaking, trying to "cover everything" is a recipe for hell.
Anyone able to expand on what the author meant by this?
What I had in mind specifically in the answer, was the case of changing "interfaces" between parts of code. For example, the case of changing a function's arguments, or how it uses them, but omitting to change a call site. Checking that the call site just calls the function would not be enough to produce a failing test, especially without type safety. The test would actually need to assert on what the function does, e.g. its return value for a pure function.
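A toy contrast between the two kinds of assertion (the functions and the 10%-off logic are invented): if the called function's behaviour changed, only the assertion on its output would go red.

```python
# Sketch: "was it called?" vs "did it do the right thing?"
calls = []


def discount(price):
    calls.append(price)     # record the call so we can assert on it
    return price * 0.9      # the behaviour: 10% off


def final_price(price):
    # the call site under test
    return round(discount(price), 2)


final_price(200)
# weak assertion: only proves the call happened; if discount() changed
# to 15% off, this would still pass
call_check = (calls == [200])

# strong assertion: pins the actual behaviour at the call site
value_check = (final_price(200) == 180.0)
```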
Yes, I think trying to test against every single possible breaking change is not valuable.
Basically, you put the customer first. Make sure your features are tested such that they can't fail without a failing test before writing lower-level tests.
This is also the approach advocated by the popular book Growing Object-Oriented Software Guided by Tests.
Here is where testing becomes very helpful.
Only to the extent that your spec is incomplete.
Regarding testing "glue": static typing often gives evidence (but not proof) that code is glued together appropriately [refactoring even small projects without tests in Haskell is a joy: the compiler essentially tells you what to change]. However, it doesn't give evidence that the high level behaviour is what it needs to be. So I think higher level tests are still needed.
I think maybe changing the first question from...
> Am I confident the feature I wrote today works because of the tests and not because I loaded up the application?

...to...

> Am I confident the feature I wrote today works because of the tests and type checking, and not because I loaded up the application?
will probably help you to answer the question about how much static types allow you to forgo tests. My instinct is that in most cases, high level tests are still worthwhile.