Something I've learned in Ruby land (probably standard elsewhere, forgive my ignorance) that seems a bit different from what the article advocates for (fake services):
- Write your service wrapper (eg your logic to interact with Twilio)
- Call the service and record API outputs, save those responses as fixtures that will be returned as response body in your tests without hitting the real thing (eg VCR, WebMock)
- You can now run your tests against old responses (this runs your logic except for getting a real response from the 3rd party; this approach leaves you exposed to API changes and edge cases your logic did not handle)
For the last part, there are two approaches to overcome this:
- Wrap any new logic in try/catch and report to Sentry: you avoid breaking prod and get info on new edge cases you didn't cover (this may not be feasible if the path you're inserting new logic into does not work at all without the new feature; address this with thoughtful design/rollout of new features)
- Run new logic side by side with the old to see what happens to it when running in production (https://github.com/github/scientist)
I use the first approach because we're a small startup. A rough sketch of the record/replay step follows below.
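For illustration, here's roughly the same workflow in Python using the vcrpy library (the Python analogue of VCR); the wrapper class, URL, and cassette path are made up:

```python
import requests
import vcr


class StatusClient:
    """Thin wrapper around a hypothetical third-party status API."""

    def fetch_status(self, account_id):
        resp = requests.get(f"https://api.example.com/accounts/{account_id}")
        resp.raise_for_status()
        return resp.json()["status"]


# The first run hits the real API and records the response into the cassette;
# later runs replay the saved response without touching the network.
@vcr.use_cassette("fixtures/status_client.yaml")
def test_fetch_status_returns_status_field():
    assert StatusClient().fetch_status("acct-1") == "active"
```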
My caution with recording is that I've found re-recording to get progressively harder and more brittle as time goes on, because the test system inevitably drifts further and further from the state it was in when the test was first recorded.
Granted, it creates more work, but personally I would follow the OP article's advice, and have your production code's tests only run against your in-memory fakes/stubs (which have no coupling to "this is the state of the external system when I happened to make the first recording").
And reserve record/replay solely for testing your wrapper of the 3rd-party API, with the rationale that "just testing the wrapper" has lower coupling to "what the state of the external system is" than your ~10s/100s of production business cases, which can just assume the wrapper primitives like "save this user" / "create this user" work.
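A minimal sketch of that split, with a hypothetical SMS wrapper: the business-logic test only ever sees an in-memory fake, and the fake carries no opinion about what the external system looked like on any particular day.

```python
class FakeSmsClient:
    """In-memory stand-in for the real wrapper; records what was sent."""

    def __init__(self):
        self.sent = []

    def send_sms(self, to, body):
        self.sent.append((to, body))
        return "queued"


def notify_user(sms_client, user_phone):
    # Production business logic relies only on the wrapper primitive.
    return sms_client.send_sms(to=user_phone, body="Your order shipped")


def test_notify_user_sends_shipping_text():
    fake = FakeSmsClient()
    assert notify_user(fake, "+15550100") == "queued"
    assert fake.sent == [("+15550100", "Your order shipped")]
```

The fake is deliberately dumb; anything smarter belongs in the wrapper's own record/replay tests.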
this
>But this isn’t obvious, because hundreds of lines of boilerplate obscure this.
And when this boilerplate in your own code or someone else's code/framework changes, then what?
This is a grossly naive take. We write tests in the first instance to prove that new code works. We then use them to prove that the code under test didn't regress as we make more code changes, framework updates, etc. Later we may change them to demonstrate that destructive modifications still lead to expected outcomes per new requirements.
The writer seems concerned only with the first of these, boiling testing down to moments in time etched into their codebase for eternity. It makes the whole argument rather a straw man.
The author has some great points, and combined with asimpletune's functional approach in a sibling comment, it can work wonders.
You probably have never seen purely functional complex code: I don't blame you as many developers haven't (I've seen engineers with 15 years of experience who haven't).
It becomes obvious how you can test all the intricacies "just enough". By that I mean that in a function integrating multiple pieces, you test just that those pieces are integrated: no need to test all edge cases of every implementation being used. OTOH, every implementation is independently unit-tested.
And in contrast to your claim, that approach really lends itself to constant refactoring that's simple and obvious (solve for the problems in front of you, not for imaginary problems), without leaving stray complicated tests that diminish in value as code gets refactored.
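A toy sketch of what I mean (all names invented): the composing function is tested only for wiring, with trivial stand-ins for the pieces, while each real piece gets its own unit tests elsewhere.

```python
def build_report(load, summarize, render):
    # The only job of this function is to integrate the three pieces.
    return render(summarize(load()))


def test_build_report_wires_the_pieces_in_order():
    result = build_report(
        load=lambda: [1, 2, 3],
        summarize=lambda rows: sum(rows),
        render=lambda total: f"total={total}",
    )
    assert result == "total=6"
```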
The only issue is getting the entire team to switch to this mindset so they don't start breaking the pattern leading to a bigger mess.
I work with purely functional code in my day job and on several side projects, so I'm not sure why you'd assume that. 20 years of experience as an SE here, but honestly I don't think it adds to or subtracts from my point, nor should you use it to prop up yours.
I think asimplemind's advice is good advice but it is complementary to good and established testing strategy that is in widespread use across paradigms.
I treat with deep suspicion the idea that a programming paradigm can subvert the fundamental principles of good testing.
I don't doubt your experience at all: I just extremely rarely see the functional approach applied to larger-scale projects, and as we mostly work on legacy projects throughout our careers, I find it more likely that people have not seen it than otherwise.
TBH, I've seen only two large projects completely following those patterns in my slightly shorter (than yours) 17 years of professional career (on top of a few years of active free software contribution): at my company, I am looking to introduce those patterns for new projects, but none of the existing stuff is doing that.
Basically, you get extreme benefits once you follow the pattern fully: smaller codebases are relatively easier to keep that way, but the benefits are not as pronounced either.
Anyway, my apologies if you see my previous message as a knock on your experience: I only consider myself lucky to have seen this applied on a complex project integrating many components (external and internal but from other teams) and with heavy development, and I really wish everybody gets to experience it once too :)
Again, you possibly did, which is wonderful! It's also ok if that didn't make you as much of a fan as me :)
The frustrating thing about these testing discussions is that they so often boil down to extreme positions, such as:
- only write integration tests
- only write unit tests
- never use mocks
- mock all your dependencies
- always use fakes instead of mocks (or the other way around)
- always use factories instead of fixtures
- unit tests can only test a single class/method
- there can only be one assertion per test
and so on.
What gets me is how divorced from the pragmatics of our tests these discussions are. It's as if we don't apply the same nuance to our test code that we're comfortable applying to our main code base (namely, that different parts of our code may require different coding styles).
First and foremost I wish testing wasn't treated as such an afterthought both in education and in actual practice; many developers say "I don't check test cases in code review", but badly written tests are as much a maintenance burden as other bad code.
We should work towards an understanding of different techniques and styles in testing, just as we cultivate an understanding of architectural patterns. When having to write - or maybe refactor - tests, we should start with a set of leading questions:
- What is worth testing more, the logic of the component itself, or its interaction with other components?
- Is the external component that I use regular and predictable enough that I can get away with a shallow fake, or is it weird, stateful and complex, so that I should run the tests against a real instance (e.g. a test container)?
- How can I write the tests such that they're comprehensible and maintainable?
- How can I write them to be as fast as possible while still giving me the guarantees I need?
- If I have a set of tests that are particularly hard to work with, or are really bad at catching regressions, how can I fix that and make the tests better?
Yes, that may mean that more time has to be spent on tests. But the alternative is messy tests of questionable value which slow down progress and don't catch enough regressions.
> The problem with this approach is it’s completely tautological. What are you asserting? The tests themselves.
Every time I've heard this kind of rationality it's from someone who doesn't understand what a unit test is and is pushing for integration tests.
It's extremely useful to have very rapid tests that boil down to making sure `add(2, 2) == 4` because you can have them running constantly and catch inadvertent logic changes.
Should you also have more heavy tests using local/fake services? Obviously - but it's not one or the other.
I think this is a natural conclusion from working with a lot of imperative code. The usual test you see for an imperative code block is like "mock this out, assert it was called N times with these arguments". That approach is usually redundant and tautological. I have a weakly-held preference (that I will not fight for in code reviews) for not unit testing imperative code at all because I think it has very limited usefulness and is just more code to maintain. Instead, I prefer to rely on integration or end-to-end testing to validate the behavior of those parts of the system.
But I think unit tests are extremely useful for functions. It looks silly when the example function is `add(2, 2)` but most functions aren't just redefining mathematical operators, but rather are encoding non-trivial domain logic. I think it is always useful to have edge case tests - when is it an error, what happens right on the boundary between error and non-error, etc. - and to have at least one illustrative happy-path example. (I think this can be frustrating for more mathematically-inclined people, who want to be able to test the function not just for a couple examples, but for the entire domain of inputs, which then leads to a lot of interesting but less pragmatic fancy ideas approaching formal verification.)
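A hypothetical illustration of what I mean by edge case and boundary tests, with a toy domain rule standing in for real logic:

```python
import pytest


def bulk_discount(quantity):
    """Toy domain rule: 10% off for orders of 100 or more items."""
    if quantity < 0:
        raise ValueError("quantity must be non-negative")
    return 0.10 if quantity >= 100 else 0.0


def test_discount_error_and_boundaries():
    with pytest.raises(ValueError):
        bulk_discount(-1)              # error case
    assert bulk_discount(99) == 0.0    # just below the boundary
    assert bulk_discount(100) == 0.10  # right on the boundary
```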
Where this breaks down is when the vast majority of code is imperative, which is very common in practice. My usual approach there is to lean heavily on higher-level testing, and refactor out little chunks that can be more reasonably unit tested, when practical. And I also write and maintain lots of these not-very-useful imperative tests, because people get itchy about low unit test coverage and I'd rather write the tests than spend time convincing people that they aren't very useful.
Clean functions with limited dependencies which encode non-trivial domain logic are indeed well served by unit tests - even if they're a bit imperative.
But outside of contrived interview scenarios, this type of code is actually somewhat niche, and outside of that niche unit tests suck.
I've worked on tons of projects where it accounted for 1-2% of the code base. Unit tests not only weren't capable of catching bugs, they weren't even capable of replicating 98% of them. Unit test theater.
Meanwhile, the other day I did TDD with hermetic end to end tests using playwright, mitmproxy, mailpit and screenshot/snapshot testing. It felt like I was living in the future.
In most software, the units themselves are not that interesting. Basic business logic and data transforms, surrounded by a larger codebase that glues all the integration points together into a working stateful system.
I've met a few unit-testing extremists who insisted on pristine-mocked unit tests and only unit tests. Their tests were incomprehensible (a good test should serve as documentation and acceptance criteria - but their code was 90% concerned with mocking service APIs) and, importantly, their unit tests repeatedly failed to catch serious regressions that should have been apparent to anyone who simply executed "main" without mocks. We had production outages on the regular.
The funny thing is, the unit test extremists doubled down, rejecting the idea of integration or e2e testing entirely (yes, I'm serious) and claiming that prod failures indicate we need to unit test EVERYTHING. They even went so far as to claim developers need not actually run their software; unit testing is fully sufficient. Not surprisingly, their code had perfectly constructed units but no overall design... the units composed into an unmaintainable dumpster fire as an actual system.
I would generally argue that if your unit tests have mocks, they're not actually unit tests. What you're describing sounds like integration tests that completely missed the point (i.e. the "integration" bit).
Even very small "units" have dependencies in the sense of other objects or functions that they use. Or at least they should, if concerns are being separated. It's good to use fakes or mocks to make those dependencies not depend on state external to the test, but not to structure tests around what methods are called on which dependencies in what order.
Yeah, I guess I would make the distinction between simple stubs and heavyweight mocks and fakes. If you're using a lot of the latter in your unit tests, that's a code smell to me; unit tests should ideally not be concerned with the state of the world, which is where mocks (especially fakes) tend to come in. Obviously there are no absolutes, but that's my rule of thumb and it's served me well.
I think the distinction is still important because it frames how you should write the test. Unit tests basically just assert that, given some inputs, the output of the "unit" is what you would expect. Integration tests, on the other hand, test the business logic that ties different units of code together, which changes the assertions to be less about the direct output and more about the state of the world. For example, if you have a Twilio wrapper, the tests for that wrapper are unit tests because you're just testing that the wrapper makes the correct HTTP calls, etc; but the code using that wrapper should be tested via integration tests to ensure that the correct calls are made, that state transitions are handled properly, etc.
This might sound like a meaningless distinction, but I've found that a large portion of bad tests that I've seen in practice were integration tests written in the style of unit tests (i.e. treating stateful code as if it were just a pure function of its inputs).
Given that integration testing is inherently harder (as you've alluded to), I'd rather see more unit tests, but that requires an architecture that breaks code down into as small and focused of units as possible (e.g. a functional style), which isn't always feasible for practical reasons.
I agree with everything else you said but I totally disagree that functions like that are "niche". It's true that many codebases are written such that they have few of them, but that's "accidental" not incidental. Almost nothing is pure math, but almost everything is data transformation that is deterministic given some set of parameters and dependency configuration. And mocking out external state to control the behavior of the thing being tested is great. But the point is not to say "and then this method of that dependency is called with this input", it's to control the behavior of the thing being tested, by controlling what its dependencies return.
Writing unit tests can help catch early bugs, but once they're done and green, I don't think anyone expects them to catch new and exotic bugs (it happens but it's so rare).
Instead they stay as a live and enforced documentation of how the target function behaves.
> I think this is a natural conclusion from working with a lot of imperative code. The usual test you see for an imperative code block is like "mock this out, assert it was called N times with these arguments". That approach is usually redundant and tautological.
I agree. I usually see the I/O layer weaved with the data transform layer. Often in the same function call. And, usually, the answer is to pull those two layers apart, put tests around the transforms, and push the I/O to the edges. Usually...
> It looks silly when the example function is `add(2, 2)` but most functions aren't just redefining mathematical operators, but rather are encoding non-trivial domain logic.
(de)serialization functions are a good general case purely-functional example. Data is transformed and you need to verify the transform was correctly performed, has all kinds of unique edge-cases, and should gracefully inform the caller when something is wrong.
> I think this can be frustrating for more mathematically-inclined people, who want to be able to test the function not just for a couple examples, but for the entire domain of inputs, which then leads to a lot of interesting but less pragmatic fancy ideas approaching formal verification.
I am that person! I am a big fan of property based testing and try to add such tests whenever possible. They serve as "edge-case hunters" and are usually part of my programming loop. As soon as something has been modified, a test suite executes. Writing property tests first and the code later allows you to think about the invariants and behaviour of what you are building in a way that regular unit tests just don't.
Yeah I get it. But if I have a function where the domain is signed 64-bit integers and the specification is some f(x) between zero and a billion and an error otherwise, I don't want to iterate through all the legal inputs and make sure they do the right thing. If I'm using a language that can do that with its type system or other kind of capability, then I'm happy to use that! But I'm also not grumpy about using python or rust or all the other normal languages just because they can't do this. I'm happy to just write a few cases mostly concentrated around zero and a billion.
Technically you can do that with property based testing just fine, e.g. for all values outside boundaries ensure a throw. The goal of property based testing is to help you identify invariants in your code and then verify them.
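Something like this minimal Hypothesis sketch; `f` and its valid range are invented stand-ins for the function described above:

```python
import pytest
from hypothesis import given, strategies as st

LOW, HIGH = 0, 1_000_000_000


def f(x):
    if x < LOW or x > HIGH:
        raise ValueError("out of range")
    return x * 2  # placeholder for the real computation


# Generate 64-bit integers strictly outside the valid range.
out_of_range = st.one_of(
    st.integers(min_value=-2**63, max_value=LOW - 1),
    st.integers(min_value=HIGH + 1, max_value=2**63 - 1),
)


@given(out_of_range)
def test_out_of_range_inputs_raise(x):
    with pytest.raises(ValueError):
        f(x)
```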
I’ve found those “obvious” checks to be useful for catching unexpected hardware issues as well. I had a customer upset that my code was crashing. I sent a debug version with asserts around the arithmetic at the core of the algorithm and, lo and behold, those asserts failed.
His machine's memory was bad and corrupting the internal state of the application. Until I could prove that his machine was calculating the equivalent of 2+2=5, he wouldn’t believe us.
He was talking about mocking, not tests such as CHECK(add(2, 2) == 4).
And BTW, I would challenge anyone working with complex real-world systems (for example, a stock exchange, or a robotic warehouse) to write a worthwhile mock for them.
If I have a function that hits a JSON API and returns a path from the resultant object and I mock that API to return 4, I'm doing exactly that.
I can write a mock for every data type the API could return and one for every error the API can return and test all of them near-instantly without having to do a HTTP round trip to the fake server (which presumably also now needs configuring before each test)
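Something like this sketch, using Python's unittest.mock; the function under test and its http_get dependency are made-up names:

```python
from unittest.mock import Mock


def fetch_value(http_get, url):
    # Function under test: pull a nested field out of a JSON API response.
    return http_get(url).json()["result"]["value"]


def test_fetch_value_reads_nested_path():
    fake_response = Mock()
    fake_response.json.return_value = {"result": {"value": 4}}
    http_get = Mock(return_value=fake_response)

    assert fetch_value(http_get, "https://example.test/api") == 4
```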
The point of mocking is not to reimplement a service you depend on, it is to provide the answer that service would give in the specific example you're testing.
What's relevant to you is your element under test, not the dependency.
Does the component you're testing have a different behaviour for each different response? If yes, you're out of luck (but also, maybe you should split it in multiple components). If not, then you probably don't need to test every possible response the dependency returns.
You're not out of luck. Your testing framework should allow you to hardcode different responses to different requests. Include the time in the response if you need to, it shouldn't be hard.
I've started adding random elements to my mock responses to force other developers to not assert things that aren't relevant to the test.
Anything is mockable if your code is broken into smaller, easy-to-reason-about, easy-to-test chunks and you have a complete understanding of what the thing you're mocking is able to do.
If you don't have a complete understanding of what it's able to do then I think how you write your tests is the least of your worries.
True. But to mock something like a stock exchange, you are basically going to have to write a stock exchange (and mock clients other than your own client app). And then you are going to have to test that mock. This could all take a while.
I really don't think that people who suggest mocking realise how complex applications work and how they are always changing their interfaces - for stock exchanges, due to regulatory pressures, for example.
> True. But to mock something like a stock exchange, you are basically going to have to write a stock exchange
All your mock has to do is provide the specific answer you're expecting to receive from that service in the specific case you're testing. In most situations, re-implementing logic in a mock is a mistake that should be avoided.
With a stock exchange, it's a dialog - don't expect any particular response. For example - using sell and buy just to keep things simple, with me as the seller (I could not be bothered looking up real prices or tickers):
sell 10000 BT at 50GBP // i put this on the exchange
buy 500 // somebody bought some
buy 2000 // somebody else (or maybe the same) bought some
crash // exchange crashed (not rare) we recover
cancel 500 // they can do this in a limited window
buy 8000 // we have filled
You see, it isn't simple, but that is about as simple as it gets.
I'm not sure what the component under test is and what the dependency is here. But I don't see this as particularly complex to test.
Assuming you're writing a client:
- use the client to send a sell order, and check that you properly handle the ack (or lack thereof) from the trading platform.
- check that the client handles properly a buy order received from the platform.
- check that the client handles properly a cancel order.
- check that when an order is filled, the client closes it properly after the cancel window is closed.
- check that the client recovers properly after a crash.
To do this, assuming the platform is a dependency of the client, in each test case, you have to implement one or two hardcoded messages from the platform.
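For instance, the first case might look something like this sketch, with a deliberately tiny invented client and a single hardcoded message from the platform:

```python
class OrderClient:
    """Hypothetical client-side order book."""

    def __init__(self):
        self.orders = {}

    def submit_sell(self, order_id, qty, price):
        self.orders[order_id] = {"qty": qty, "price": price, "state": "pending"}

    def on_message(self, msg):
        # Only the ack handling matters for this test case.
        if msg["type"] == "ack":
            self.orders[msg["order_id"]]["state"] = "open"


def test_ack_moves_order_to_open():
    client = OrderClient()
    client.submit_sell("A1", 10000, 50)
    client.on_message({"type": "ack", "order_id": "A1"})
    assert client.orders["A1"]["state"] == "open"
```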
> > You don't need the exchange's state to decide what case you are testing
> yes you do, the exchange needs to know the state of the trade.
No you really don't.
For your given example, you configure your application state to think it has just submitted a buy order for 500 units. Then, you write a series of tests that check to see what it does if it receives
- One sell order for 500
- Two sell orders for 250
- Five hundred sell orders for 1
- One sell order for 501
- One sell order for 250 but nothing else for N simulated minutes
- Nothing for N simulated minutes
- The exchange socket closes with an error
- A sell order for -500
- A sell order for NaN
- A buy order for 500
- A corrupted message
- etc etc etc
For unit tests you are not checking the full behaviour of the system, you are checking to see how it behaves in a very specific set of circumstances per test.
If you do not know all the messages this API can reply with then just capture a few days or weeks of traffic and you can make a damn good guess from there.
If your code doesn't support this level of isolation then you need to rewrite it until it can, I highly recommend "Working Effectively with Legacy Code" by Michael C. Feathers to help you.
Only if you have a stateful mock server with multiple mock clients attached to it. For example, when you put a sell on the server, you are basically going to be listening on a socket for multiple responses, and what the responses are will depend on the buyers (which you must also mock). And don't get me started on buying, ticker conversions between different exchanges, currency, time, etc. It is too hard to mock.
To mitigate this, some exchanges, such as London and Paris, do provide test exchanges, which are crap, and available every other Friday, or whatever. When I worked in this area, my boss used to test my code by putting a trade on the exchange and then immediately cancelling it. This was illegal, as he wasn't a licensed trader, and the exchanges got enraged. I refused to have anything to do with it.
But my point is - you cannot do meaningful unit tests on very complex systems. Even integration tests can be very hard, but they are worthwhile.
> But my point is - you cannot do meaningful unit tests on very complex systems.
I think you're missing the point: unit tests should not be testing entire systems; that's the point of integration or end-to-end tests, which are different beasts entirely. You can very much write unit tests against parts of a complex system because each test should only be testing one isolated bit of functionality in the form of inputs (including local state) and outputs.
My problem with these is that they are generally written by the developer who wrote the function.
So devs code the happy path in their function, and test the happy path in their unit test.
Yes, in fact, 2+2 is in fact 4.
But do they test add(2,-2) or add(-2,-2) or add(null,2) or add(0,null), etc.?
Whatever cases they forgot exist and never coded for, never get tested for either. If this is not caught in the initial commit by another dev, then it makes it through to prod happily despite having "unit tests".
This is why I think tests where the application is forced to interact with real world data are better because if your app will be broken by negative/0/null, and that state exists in the real world data, it will break in testing.
Of course, this is all a long dissertation on how to code & test the addition of 2 integers, one of the simplest possible functions. Now do this on more complex software.
The unit test, I think, biases later developers to believe the original developer actually coded the software correctly, when in a large number of cases it was wrong to begin with, but uncaught.
Write everything as functions. Your dependencies are also functions, which are passed in as arguments to your functions.
To unit test your logic, pass a dependency that does the situation you’re looking to test, Eg () => panic. With this alone you can catch most of your logic issues.
For integration tests pass the real things in, but only test the interfaces you publicly expose. With this you can discover when a dependency has broken/changed out of band.
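In Python terms, a rough sketch of what that looks like (names invented), including the "() => panic" style failure injection:

```python
def charge_order(order, charge_fn):
    # charge_fn is a dependency passed in as a plain function.
    try:
        receipt = charge_fn(order["amount"])
    except ConnectionError:
        return {"status": "retry_later"}
    return {"status": "paid", "receipt": receipt}


def test_happy_path():
    result = charge_order({"amount": 10}, lambda amount: f"rcpt-{amount}")
    assert result == {"status": "paid", "receipt": "rcpt-10"}


def test_payment_outage_is_handled():
    def exploding_charge(_amount):
        raise ConnectionError("payment provider down")

    assert charge_order({"amount": 10}, exploding_charge) == {"status": "retry_later"}
```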
Last thing is write your functions as you write your tests. This way you can discover requirements while it’s still easy to change your code.
Every once in a while it makes sense to refactor. The process of writing your tests will tell you when, and the tests themselves will tell you that your new code is equivalent.
This is more or less what I’ve been doing for years now and it works. Very few bugs get past this. Code is easy to read and easy to change. Abstraction and documentation happen progressively too. Boss and team mates are also happy.
Maybe the one source of friction from this approach has been the type who scream “test against reality”. That and jira.
(The one thing I forgot to mention is to model data like records as much as possible. The data is really the key, everything else above is easy.)
I do a slight modification of the approach: I pass everything in, but if you need related functions passed in, it makes sense to pass them in as a service object containing those functions.
Those service objects really have no side effects (the constructor basically just holds context), allowing for smaller function signatures, especially for inherently side-effecting stuff (logging, caching...). Basically, it saves on passing both "get_from_cache" and "set_cache" functions in when you can pass a CacheService instance in.
But the premise is the same: keep code functional, allowing for easy testing and clear implementation.
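A small sketch of the service-object variation, with invented names; the cache service bundles get/set so the function signature stays small:

```python
class InMemoryCacheService:
    """Fake cache service: bundles get/set instead of two separate functions."""

    def __init__(self):
        self._store = {}

    def get(self, key):
        return self._store.get(key)

    def set(self, key, value):
        self._store[key] = value


def lookup_user(user_id, cache, fetch_fn):
    cached = cache.get(user_id)
    if cached is not None:
        return cached
    user = fetch_fn(user_id)
    cache.set(user_id, user)
    return user


def test_second_lookup_is_served_from_cache():
    cache = InMemoryCacheService()
    calls = []

    def fetch(uid):
        calls.append(uid)
        return {"id": uid}

    assert lookup_user(1, cache, fetch) == {"id": 1}
    assert lookup_user(1, cache, fetch) == {"id": 1}
    assert calls == [1]  # the fetch dependency was only hit once
```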
The one thing I disagree with is that this does not allow testing against the real thing: you simply have a test that confirms the real dependency and the simulated fake (function or service) have the same basic behaviour your code expects. If the real thing works using an external service, you can decide whether you want to test that as a system test (I run those occasionally) or with mocks.
What language are you using? I'm using C# at work, but dabble in F# in my free time. I try to find a F# alternative to the C# way, seems like the simplest alternative to dependency injection through containers are just function passing in F#, like you are mentioning.
One thing that I would like to know is how you do the equivalent of "registering dependencies" like you would using a DI container. Where do you set up all the (partially applied) "real" functions and all the "fake/mock" functions, etc.? I guess it's just a matter of structuring it in a nice way.
The above related mostly to work I did in Scala, but I am writing mostly Rust now. Each language has a way of doing things so you have to adapt what I said to how it makes sense for you. Still, the process is the same in spirit.
> how do you do the equivalent of “registering dependencies”
In what I was describing earlier, dependency injection is literally just calling a function. The dependencies are arguments to a function, and you “inject” them by supplying them to the function. The “registering” is just based on your function signature. That is already a contract. Dependencies are functions too.
> where do you set up all the functions
Usually one of two places: tests or your app’s entry point. In tests you just setup your function wherever it’s needed. Like I said if you need to test what happens when a dependency of yours does something, then just write a function that returns what you are looking to test.
For your app, this same thing is in your app’s entry point, like main. So you read from config and build your dependencies, and then at the end you ultimately pass them to a function.
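In Python it might be as small as this (everything here is an invented placeholder): partially apply the real implementations once in main and pass them down.

```python
import functools


def real_charge(amount, api_key):
    ...  # the real HTTP call to the payment provider would live here


def run_app(charge_fn):
    ...  # the rest of the application only ever sees charge_fn


def main():
    api_key = "example-key"  # read from config/env in a real app
    charge_fn = functools.partial(real_charge, api_key=api_key)
    run_app(charge_fn)


if __name__ == "__main__":
    main()
```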
At the end of the day this is about inversion of control, but it’s something that makes more sense if you invent it on your own from scratch before trying to understand what that means through more complicated abstractions.
If using F# the easiest way is usually to make an interface for the impure stuff and write a mock implementation of that. Pass the interface in as a parameter to your functions. Keep that interface as small as possible.
I do not agree. Mocking doesn't mean you assert what you just mocked; mocking lets you test a piece of code in isolation. When writing unit tests I mock all the dependencies, even if I control them, as I only want to test a particular function at a time and not the external code. If you want to test more, then you can write integration tests or end-to-end tests, but that is another matter.
I think you miss the point. With a lot of code once you've mocked out all the dependencies you're not left with anything to test apart from assert derp == derp.
Is this really that common? Maybe I'm out of touch, but I find it hard to imagine much code that just strings calls to dependencies together and has none of its own behaviour. Surely decisions are being made based on the dependencies behaviour? Isn't that worth unit testing?
Yes. For example, most CRUD apps are like this almost by definition.
It's more common for domain logic to be fairly simple and smeared across multiple systems and languages, though - e.g. some is in SQL, some is in a separate microservices, some is in module A and some is in module B.
In either case unit tests end up being mimetic tests: assert derp == derp.
I worked with a lot of code like that. The decisions made on dependencies were trivial. Bugs happened when the dependency answered with something we did not expect. Or when we sent out wrong data (not knowing parameter X is mandatory when Y equals Z, that sort of thing).
Meanwhile, all those mocks were just clutter.
Integration tests made a whole lot of sense. Mocks did not.
It felt like the author wanted to test all of their dependencies in their unit tests, like that was the part of the system they didn’t trust. They don’t seem to value testing that their code functions as intended given expected behaviour of the dependencies.
We rarely write unit tests because we are worried about missing an edge case while writing the code. It's often faster to manually test than write the automated case.
We write tests because we need a specification that lives past that moment in time. We also write tests because other parts of the system change because business requirements change in other parts of the system, and the necessary conversation doesn't happen.
I trust what the system does on July 9, 2023. I don't trust what the system will do on July 9, 2024. How do I guard against that?
> When writing unit tests I mock all the dependencies
I dislike this approach because whenever any of those dependencies change, I need to change all mocks. I prefer designing my code in such a way that dependencies can be easily pulled into a test instead of mocking. It may be more work upfront but pays off in the future. Instead of spending the time to write and maintain mocks, spend the time to ensure dependencies can be pulled into a test.
I’m surprised they didn’t mention a superior way to test against reality - record&replay actual responses. Run your tests against actual external services once, and then use those responses in future local-only tests. Periodically refresh your local response log with a cron job.
I find simulated fakes whose behaviour can be verified by running the same test against both the real implementation and the fake to be really superior.
It's even better when you've got control of the external dependency (open source project, public API, or internal project) when you can implement something I call "real fakes": use as much as possible from the real code that's fast (i.e. input validation, output adaptation, same basic logic), and have that project export a fake service as a library for easy reuse. This avoids the most common problem with fakes: outdated simulated/mocked implementations.
E.g. wouldn't it be wonderful if every project had a "real fake": imagine PostgreSQL providing a really fast fake-pg from the same codebase, tracking their releases, or Amazon SQS providing fake-sqs, instead of having to rely on externally provided simulated fakes?
This very much captures what I am working on. It is (currently Java) an SDK dependency + IDE plugin which records input/return values of Java methods. Once recorded, you can either instantly replay it on a running process (it also works across hot-reloads) OR save it as a fixture which you can replay after a process restart. In our workflow we save these fixtures to git so anyone on the team can replay them. But there are a couple of things I feel are needed on top of this:
- a flexible assertion framework, currently I am doing a straightforward string equals to identify breaking changes
- a mocking feature for downstream calls, currently nothing is mocked out so it is basically an integration test, but as the discussion shows, mocking can be quite handy in a variety of situations
Another day, another blog post by someone who misunderstands the purpose of unit tests and mocks, probably because they've never worked in a huge codebase with complex business logic and lots of precise edge cases.
I've worked in such environments. I've been bitten by mocks and I've been bitten by false positives from ugly E2E systems. I've seen flaky E2E tests linger indefinitely, so I know why trying to spin up the whole system for a small set of tests is bad.
I also know that test data management and systems like VCR are very difficult to maintain. So, we mock because it's more painful to do otherwise. (That is; the test data management does not keep every edge case because it's extremely expensive to do so.)
Yet, the mocking is painful too in the complex system. There's no way to tell if the mock's response has been changed from underneath. You introduce false confidence in your code. Most bugs, in my experience, are based on a misalignment of expectations between two separate parts of the system (or internal/external systems.) Some of this is based on changes, others are based on unknown edge cases.
So, given that this is a hard problem, where do you point intermediate practitioners?
You state unit tests and mocks do not catch all bugs. I absolutely agree with you.
But the answer is not, to not do it at all. The answer is to test in multiple ways. Unit, integration, E2E, manual - how much of each you need depends on what kind of system you're building.
Testing against real services (probably in some kind of staging environment) is mandatory, IMO, but it is only really useful for verifying the happy path, or some simple failure modes. Unit tests can verify the system's logic is correct in the face of complex edge cases that are difficult to replicate in integration or end-to-end testing.
Unit tests also allow you to add new functionality to the codebase with confidence, as you can know that your modifications did not break some specific edge case or cause some bug regression. If you're in a dynamic language they also help ensure your refactoring did not introduce some type mismatch somewhere.
Which is why I draw the conclusion that anyone who thinks unit testing and mocking boils down to `response = "derp"; assert response == "derp"`, must not have much experience working in such codebases.
I wouldn't call mocking testing against reality. Like all testing it has its limitations:
- Mocking doesn't validate schema. Vendors can change schemas on a dime and the schema you're working with is often the schema as you know it at the time. This is partially negated through aggressive dependency management and languages with type systems and vendor provided interfaces.
- Mocking doesn't validate behavior. APIs can be weird and have bugs, vendor bugs still become your bugs in mocking.
- Mocking doesn't simulate network calls. You can certainly implement mocked latency, but again this is a smooth facade over a complex problem.
It's also worth noting that using mocks, especially on vendor interfacing code, will shape how that code reads and not everyone is going to grok that right away. It's definitely different.
No single kind of testing is perfect or will catch all of the above, but if you're cognizant of these limitations you can get close by combining good unit tests with other kinds of testing. That said, calling it reality is a misnomer imo.
> Vendors can change schemas on a dime and the schema you're working with is often the schema as you know it at the time. This is partially negated through aggressive dependency management and languages with type systems and vendor provided interfaces.
This is why testing in production is so important. Building monitoring and alerting tools, making code observable for all of that to be feasible, and building an incident response culture can all fall under the "testing in production" umbrella to me.
Shifting testing left and right has been a big thing of mine for the last few years and I think I'm onto something: planning your architecture and projects early with an eye to making things easy to test and observe in production, as well as monitoring and alerting after it's live.
Cindy Sridharan has an amazing treatise on this topic.
The part I didn't cover, which probably ties your ideas and mine together, is that no testing is foolproof, and the ultimate goal of testing in modern software is to minimize the time you spend in an incident - not to absolve yourself of incidents entirely.
The downside of synthetic testing is that it's relatively expensive. Doing it in staging is one thing, and personally as a SRE this is where I'd put it. Putting it in production requires cordoning these requests (and data) in some way which is additional overhead in a set of testing which is already relatively expensive in terms of operational complexity and point at which errors are discovered (later in the SDLC).
> The downside of synthetic testing is that it's relatively expensive. Doing it in staging is one thing, and personally as a SRE this is where I'd put it
I actually have the exact opposite opinion. It is _because_ synthetic testing is so expensive, you want it running in production where the value can be maximized. We abolished our staging envs completely because we realized all the effort put into maintaining them and duplicating all production changes to staging as well - was much better spent in making testing in production safe. Far too often an issue only exists in staging and not production, or staging doesn't catch something because of the plethora of delta that will always exist between different envs.
When I was in QA - I found myself caring less and less if an issue was present while testing locally, or in some sandbox or staging environment. I only cared about production, so that is where I invested time into testing.
If a synthetic test breaks production in some unanticipated way, that is incredibly valuable because one user shouldn't be able to break production for everyone and you just found one heck of a bug to address.
To your first point, I agree that's probably a smart allocation to make given the expense. Where we differ is the value driven by lower order tests; synthetic test breakages are also more expensive fixes since they've traveled the SDLC. Overweighting synthetic tests compared to unit and integration can lead to a lack of confidence in proposed changes. That's to say, I think it's good to have the larger volume of your testing allocated in unit and integration tests while synthetic tests can validate major contracts.
> If a synthetic test breaks production in some unanticipated way, that is incredibly valuable because one user shouldn't be able to break production for everyone and you just found one heck of a bug to address.
This is a valid point, however, I wasn't referring to breakages. I was referring to usage statistics, data, etc. Conflating synthetic usage and data with real usage and data can be problematic. There's ways to mitigate that, but they add overhead, which was my original point that synthetic testing and monitoring adds a good deal of operational and code complexity.
When you are “mocking” services in a way that still involves real network requests, your tests are going to slow down dramatically, and that is a problem because you run so many of them so often.
His diagrams miss how bad the problem is, because it is not that client code is calling 20 different external servers, but that something “simple” (a React component fetching a JSON document and drawing it) involves React calling your code multiple times. Then there is the deep asynchrony of an application that really is a message broker, where an email gets sent from here and received there.
When you are in control you really can break it all up into functions but I still don’t have an answer I like for writing tests in React other than “prop drilling”.
Great article & visualizations of how systems have gotten more complex over time.
Per his suggestion of "testing against a service stub / fake server" (instead of method-level mocking), my suggestion for large internal orgs is to double down on this and have your service owners ship their own stubs:
I.e. if your `accounts-service` team has an API that is used by ~20 other internal teams, why should each of those teams either a) pollute their tests with low-level method-based mocks that are brittle coupling to impl details, or b) each write their own `AccountsServiceStub` when the `accounts-service` team could ship both `accounts-service-impl` and `accounts-service-stub` artifacts that implement that same API.
But the stub artifact is a) in-memory, and b) has additional methods to facilitate creating & asserting against the in-memory data, both of which should ideally create a pleasant out-of-the-box testing experience for their downstream consumers.
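A sketch of what such a stub artifact might look like (the class and methods here are hypothetical, not an existing library):

```python
class AccountsServiceStub:
    """Shipped by the accounts-service team alongside the real client."""

    def __init__(self):
        self._accounts = {}

    # Same API surface as the real client.
    def get_account(self, account_id):
        return self._accounts.get(account_id)

    # Test-only helpers that the real client does not have.
    def seed_account(self, account_id, **fields):
        self._accounts[account_id] = {"id": account_id, **fields}

    def created_account_ids(self):
        return list(self._accounts)
```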
This is great advice for internal libraries as well. In my opinion, any sufficiently complex code that gets reused in multiple places should also expose a stub or mock for that code.
That mocking is tautological is true. However, it helps to know that the "wiring" of your code "works" if you're using a non-statically-typed programming language like JS or Python. Unfortunately, for the countless junior engineers I have seen come and go, mocking seems to be the "real" way of testing because they usually don't know anything besides JS and its associated frameworks.
When I see this criticism of mocks, it appears to me that the critic is primarily used to seeing tests that either mock the thing-under-test itself or were written without understanding the boundaries of the thing-under-test.
I once briefly experimented with the idea of a programmable fake server:
1. At startup, the fake server is tabula rasa
2. The "before" part of the API test script will call specialised POST endpoints of the server to set up the fake endpoints (and the responses they are supposed to send) required for each test. The state is in-memory.
3. When the service that is the test target is started in test mode, all its dependencies point to the fake server.
4. The test script then runs the test on the service
This required zero application level mocking code. All the mocks were defined in the test scripts.
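For illustration, here is a rough Python/Flask rendition of the same idea (the original was an Express/Node server; the setup path and payload shape are invented):

```python
from flask import Flask, jsonify, request

app = Flask(__name__)
canned = {}  # (method, path) -> response body; starts empty (tabula rasa)


@app.route("/<path:path>", methods=["GET", "POST", "PUT", "DELETE"])
def serve(path):
    path = "/" + path
    # Test scripts programme the fake by POSTing the responses they need.
    if path == "/_fake/setup" and request.method == "POST":
        spec = request.get_json()
        canned[(spec["method"], spec["path"])] = spec["response"]
        return "", 204
    body = canned.get((request.method, path))
    if body is None:
        return jsonify({"error": "no fake configured for this endpoint"}), 404
    return jsonify(body)
```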
I believe we explored this, but instead settled on a simple express-based Node.js server because that was the tech stack familiar to the team, and also due to the programmable nature of responses, a non-statically-typed language like JavaScript was preferred.