Understanding Why Resilience Faults in Microservice Applications Occur (christophermeiklejohn.com)
48 points by cmeiklejohn on March 20, 2022 | 4 comments



> in order to solve this problem I watched 77 presentations from industrial conferences and blog posts on the use of chaos engineering to identify the types of resilience issues that companies experience and how they go about identifying them.

I believe what is observed here are symptoms, not the root cause.

My experience in this area tells me that after you go the "microservices" route, there is no coherent view of the system, and the holistic design & architecture ends up being derived from the many integration issues instead of from improving the data domain and its inherent business challenges. So basically (over)engineering vs. creating features..

I can't see how an academic could arrive at this conclusion unless he took part first hand in several organizations going this route and contrasted that with first hand experience of a more "monolith" approach - or one with less emphasis on "micro-servicing-all-the-things"..


He has plenty of industrial experience and is not a pure academic: https://christophermeiklejohn.com/meiklejohn-cv.pdf


I think the main thesis here is that chaos testing is the only way to detect ‘unscalable error handling’, but that most ‘unscalable error handling’ faults could be eliminated by testing separately for ‘missing error handling’ and ‘unscalable infrastructure’, which should be testable with less disruptive techniques than ‘chaos’.

I’m not sure I follow the argument though.

Just because you have demonstrated that a system is scalable, and that it is tolerant of errors, does not imply it is tolerant of errors at scale.

The example given is Expedia’s error handling, which, they claim, could have been verified without chaos testing:

> Expedia tested a simple fallback pattern where, when one dependent service is unavailable and returns an error, another service is contacted instead afterwards. There is no need to run this experiment in production by terminating servers in production: a simple test that mocks the response of the dependent service and returns a failure is sufficient.
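A minimal sketch of that kind of test in Go, using hypothetical names (Offer, fetchWithFallback, stubProvider) rather than anything from Expedia's actual code: the failing dependency is stubbed, so no servers need to be terminated.

    package booking

    import (
        "errors"
        "testing"
    )

    // Offer is a hypothetical result type; none of these names are Expedia's.
    type Offer struct{ Price int }

    // provider abstracts a call to a downstream service.
    type provider interface {
        Fetch() (Offer, error)
    }

    // fetchWithFallback tries the primary dependency and, on error, contacts
    // the secondary instead (the fallback pattern described in the quote).
    func fetchWithFallback(primary, secondary provider) (Offer, error) {
        if o, err := primary.Fetch(); err == nil {
            return o, nil
        }
        return secondary.Fetch()
    }

    // stubProvider lets the test script either a fixed response or a failure.
    type stubProvider struct {
        offer Offer
        err   error
    }

    func (s stubProvider) Fetch() (Offer, error) { return s.offer, s.err }

    // TestFallbackOnPrimaryFailure stubs the primary to fail and asserts that
    // the secondary's response is returned.
    func TestFallbackOnPrimaryFailure(t *testing.T) {
        primary := stubProvider{err: errors.New("service unavailable")}
        secondary := stubProvider{offer: Offer{Price: 100}}

        got, err := fetchWithFallback(primary, secondary)
        if err != nil {
            t.Fatalf("expected fallback to succeed, got: %v", err)
        }
        if got.Price != 100 {
            t.Fatalf("expected the secondary's offer, got: %+v", got)
        }
    }

That covers the ‘missing error handling’ case, but it deliberately says nothing about cache state or retry behaviour under load, which is where the questions below come in.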

When the first service becomes unavailable, does the alternate service have a cold cache? Does that drive increased timeouts and retries? Is there a hidden codependency between that service and whatever caused the outage of the first service?

Maybe that can all be verified by independent non-chaos scalability testing of that service.

But chaos testing is, in effect, the integration test over the units that the individual service load tests and mock-error tests have verified. Sure, in theory this service fails over to calling a different dependency. And in theory that dependency is scalable.

Running a chaos test confirms that those assumptions are correct - that scalability + error tolerance actually delivers resilience.
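For contrast, a lightweight approximation of such an experiment is to inject faults into the real call path and then drive production-like load through it (real chaos experiments go further and terminate instances or degrade infrastructure). A hedged sketch, reusing the hypothetical names from the example above:

    package chaos

    import (
        "errors"
        "math/rand"
        "time"
    )

    type Offer struct{ Price int }

    type provider interface {
        Fetch() (Offer, error)
    }

    // faultyProvider wraps a real client and injects latency plus a
    // configurable error rate, so failover behaviour can be observed under
    // real load rather than against a mock.
    type faultyProvider struct {
        inner     provider
        errorRate float64       // fraction of calls that fail outright
        extraLag  time.Duration // latency added to every call
    }

    func (f faultyProvider) Fetch() (Offer, error) {
        time.Sleep(f.extraLag)
        if rand.Float64() < f.errorRate {
            return Offer{}, errors.New("injected failure")
        }
        return f.inner.Fetch()
    }

Wrapping the primary client with something like this during a load test is what surfaces the cold caches, retry amplification and shared dependencies that the per-service mock tests cannot see.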


After seeing the Audible block diagram, I'd add 4th & 5th takeaways:

> Most of this conversation can be obviated by spending time minimizing the number of systems, dependencies, vendors and other 3rd party items required to satisfy the product objectives. Prefer more "batteries-included" ecosystems when feasible.

> Start with a monolithic binary, SQLite and a single production host. Change this only when measurements and business requirements actually force you to. Plan for the possibility that you might have to expand to more than one production host, but don't prioritize it as an inevitability. There is no such thing as an executable that is "too big" when the alternative is sharding your circumstances to the 7 winds.
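As a rough illustration of that starting point, a single binary with an embedded SQLite file and an HTTP listener is only a handful of lines. A sketch in Go, where the driver, database path and port are all assumptions rather than recommendations:

    package main

    import (
        "database/sql"
        "log"
        "net/http"

        _ "github.com/mattn/go-sqlite3" // any SQLite driver works; this one is an assumption
    )

    func main() {
        // One file on the one production host; no separate database server to run.
        db, err := sql.Open("sqlite3", "app.db") // "app.db" is a placeholder path
        if err != nil {
            log.Fatal(err)
        }
        defer db.Close()

        http.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
            if err := db.Ping(); err != nil {
                http.Error(w, err.Error(), http.StatusInternalServerError)
                return
            }
            w.Write([]byte("ok"))
        })

        // One binary, one process, one host; everything beyond this is deferred
        // until measurements or business requirements force the change.
        log.Fatal(http.ListenAndServe(":8080", nil))
    }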



