
Microservices – Combinatorial Explosion of Versions - taleodor
https://worklifenotes.com/2020/03/04/microservices-combinatorial-explosion-of-versions/
======
bethesque
I was at realestate.com.au when they started writing microservices in 2013 and
we were all facing this exact issue. Out of that experience, the (now open
source) contract testing tool called "Pact" was written (I am now one of the
maintainers). I wrote a blog about the combinatorial explosion problem here
many years ago now! [https://www.rea-group.com/blog/enter-the-pact-matrix-or-
how-...](https://www.rea-group.com/blog/enter-the-pact-matrix-or-how-to-
decouple-the-release-cycles-of-your-microservices/)

From the Pact docs: 'Contract testing is a technique for testing an
integration point by checking each application in isolation to ensure the
messages it sends or receives conform to a shared understanding that is
documented in a "contract".' By focussing just on the messages we get tests
which are fast, give us quick feedback, and scale linearly instead of
combinatorially.
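Stripped of tooling, the core check is tiny. A minimal sketch in Python (the dict shapes, the handler, and the `verify` function are all illustrative inventions, not the Pact API):

```python
# A "contract" is just a recorded expectation about one message exchange.
# The consumer writes it; the provider replays it against itself in isolation.
contract = {
    "request": {"method": "GET", "path": "/users/42"},
    "response": {"status": 200, "body": {"id": 42, "name": "any-string"}},
}

def provider_handles(request):
    """Stand-in for the real provider's request handler (hypothetical)."""
    user_id = int(request["path"].rsplit("/", 1)[-1])
    return {"status": 200, "body": {"id": user_id, "name": "Alice"}}

def verify(contract, handler):
    """Check that the provider's actual response satisfies the contract.

    Only structure is checked: the status must match, and every field the
    consumer relies on must be present. Extra provider fields are fine.
    """
    actual = handler(contract["request"])
    expected = contract["response"]
    if actual["status"] != expected["status"]:
        return False
    return all(field in actual["body"] for field in expected["body"])
```

Because each side is tested against the contract alone, adding a new consumer adds one contract, not a new combination of deployed versions.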

Some good resources are:

[https://pact.io](https://pact.io) (for information about contract testing and
the Pact tool itself)

[https://pactflow.io/how-pact-works/](https://pactflow.io/how-pact-works/)
(explains how Pact works)

[https://docs.pact.io/faq/convinceme](https://docs.pact.io/faq/convinceme)
(answers the question of why you would want to do contract testing)

[https://slack.pact.io](https://slack.pact.io) (a friendly 1000+ member
community which is very experienced dealing with these kinds of issues)

[https://docs.pact.io/pact_broker/can_i_deploy](https://docs.pact.io/pact_broker/can_i_deploy)
(addresses how we handle and channel this combinatorial explosion for good
instead of evil!)
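The `can_i_deploy` idea boils down to a lookup in a matrix of verified version pairs. A toy sketch (the data and function here are simplifications of the concept, not the Pact Broker's actual model):

```python
# Pairs of (consumer_version, provider_version) whose contract verification
# has already passed. Deployment is safe only if the candidate version has a
# passing result against everything currently in the target environment.
verified = {
    ("web-v3", "api-v7"),
    ("web-v3", "api-v8"),
    ("web-v4", "api-v8"),
}

def can_i_deploy(candidate, deployed_partners):
    """True if the candidate was verified against every deployed partner."""
    return all((candidate, p) in verified or (p, candidate) in verified
               for p in deployed_partners)
```

The point is that only pairs that will actually coexist in an environment ever need verifying, which is what turns the combinatorial problem into a linear one.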

~~~
hinkley
I did a stupid simple thing for a system with a small fanout footprint. Not
sure this scales.

The organizing principle was that if my code passed the acceptance tests once,
the very same code should still pass with your service/library/data added to
the mix. If it doesn't, the problem is likely in your code, not mine.

This might not seem like much but it cuts hours of finger pointing out of a
three party system. I expect this effect would be multiplied for a five or ten
party scenario.

And in particular, feedback latency has a huge negative impact on behavior
change. People will keep doing things that they get yelled at about three
weeks later. They will stop doing things they get called on within hours.

Which I would hope offsets the additional test matrix complexity.

------
jennyyang
The diagram of the service in the blog post is absurd.

Anyone who designs a microservice environment like this should be fired. The
interdependencies between services in the picture make the entire system
fragile and everything will fail if a single service goes down.

A real production microservice environment should be designed so that
interdependencies are limited and system-wide failures won't occur if a
service or two go down. Once you limit the interdependencies, the
combinatorial explosion doesn't exist anymore. You might have some services
with a wide range of versions used by clients, but you don't have the
interdependencies that complicate things.

Then, versioning isn't such a big deal.

~~~
danielovichdk
You cannot build software without dependencies.

The dependency pattern around microservices is either a contract on a
queue/stream/topic or a version of an API from another service it depends on.

Don't care how many times you tell me that services should be independent
and bla bla bla... In the end a service has dependencies, and those
dependencies must have a versioning strategy so humans are not in doubt.

Versioning a microservice is no different than versioning a linked library or
a Rest API.

Documentation is key. Tell people what versions break.

~~~
taleodor
Thank you, I very much second this. Another way of putting this: if we need
to ensure that any 2 environments are identical, we have to care about fine
details and exact versions.

Otherwise, we're back at "works on my machine" mentality.

And yes - trying to make microservices independent from each other is also a
form of pruning and sometimes works well to a point, but also requires good
tooling to be done right.

~~~
jennyyang
My point is that your article is based on a fake premise, or a poorly
designed premise. The architecture that creates this combinatorial explosion
does so only because your supposed microservice architecture is poorly
designed. If you look at a real microservice architecture, it won't be nearly
as complicated, and versioning isn't as big of a deal.

~~~
taleodor
> My point is your article is based on a fake premise, or a poorly designed
> premise.

That's a strong statement which you should at least try to support with a
counter-example - just saying "you designed it wrong" is not enough. I've
seen this kind of problem in real life left, right, and center - so I take it
there is a real issue, and it's not happening because "everybody is doing it
wrong".

> If one service goes down, they all go down

I believe you don't understand the diagram. Lines are not dependencies, but a
way to connect components of different versions into a single product. (I.e.,
you pick either v1 or v2 of each component, and it becomes a single product
along the lines - it doesn't necessarily mean there is a hard dependency.)

To my point, I treat the whole architecture as a product. I don't necessarily
speak about dependencies. Instead, I'm taking a general view of this as a math
problem - first of all, I establish the search space - that is, the number of
available versions to the power of the number of microservices.
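For concreteness, here is the size of that search space with hypothetical numbers:

```python
# Search space = versions available per component, raised to the number of
# microservices. Even modest numbers blow up quickly.
versions_per_service = 3
service_count = 10
combinations = versions_per_service ** service_count  # 3^10 = 59049
```

Ten services with just three candidate versions each already give 59,049 possible system configurations, which is why exhaustive testing is off the table.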

Then I very much support and want to discuss various ways to reduce this
search space via pruning - and what you're talking about is just one of the
options for pruning it (by reducing dependencies). But as others pointed out
(rephrasing in my words), you can apply a greedy algorithm to an NP-hard
problem, but first of all understand what the real problem you're dealing
with is, and second of all realize that your algorithm is greedy - meaning it
may have flaws in the edge cases, and it's better to be prepared for those.

In your specific case I claim that you can't be completely sure that you
actually don't have any interdependency between components - I believe it
would be impossible to prove for any system. And again, I've seen really hard
bugs in real life coming from those assumptions. Recent example (this is not
about microservices - but please try to solve this case):
[https://stackoverflow.com/questions/60486853/aws-ecr-
uploadi...](https://stackoverflow.com/questions/60486853/aws-ecr-uploading-
docker-image-give-below-error) \- The tools in question should be completely
independent, but somehow they are not. So assuming something is completely
independent is frequently dangerous.

------
simonw
A lesson I learned working with microservices that are deployed independently
of each other is that you need a policy that ALL changes must be backwards
compatible with existing clients.

This sounds crazy, but it's essential if you can't guarantee that you can
atomically deploy a new service version AND any other services that call it.

Creating backwards compatible releases means sticking to some rules, things
like:

\- you can add fields but you can never delete them

\- you can add new methods but you can't remove old ones

Having really detailed logging helps a lot. If you want to deprecate a method
you can remove it from calling clients first and then use the logs to confirm
it isn't being called any more before removing it from the service.
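That log-then-remove workflow can be sketched like this (the decorator and names are hypothetical, not any particular framework):

```python
import logging
from collections import Counter

call_counts = Counter()

def log_usage(method_name):
    """Decorator that records every call to a candidate-for-removal method."""
    def wrap(fn):
        def inner(*args, **kwargs):
            call_counts[method_name] += 1
            logging.info("deprecated method still in use: %s", method_name)
            return fn(*args, **kwargs)
        return inner
    return wrap

@log_usage("get_user_v1")
def get_user_v1(user_id):
    """Old method we would like to delete (illustrative)."""
    return {"id": user_id}

def safe_to_remove(method_name):
    """Only delete the method once the logs show zero remaining callers."""
    return call_counts[method_name] == 0
```

In practice the counter lives in your logging/metrics system rather than in-process, but the decision rule is the same: remove from clients first, then watch the count drop to zero.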

It's a lot of work, and teaching a large engineering team how to be
productive in this kind of environment is decidedly non-trivial.

~~~
daotoad
Never say never.

Adding fields or methods is easy. You can do it at any time.

Removing them is difficult. You must first go through a long deprecation cycle
and ensure ALL clients are updated.

You need good logging AND good monitoring and analysis of the logs in both
client and server to be able to detect access to removed features.

~~~
simonw
A coworker pointed out that this is a huge point in favour of GraphQL:
because it forces clients to explicitly ask for every field, you can log
those queries and use them to spot when a field is no longer in use.

------
mattlondon
The typical approach to catching this that I have seen is to deploy the new
version of Service X to only a small percentage of all machines for a certain
amount of time. E.g. 1% of machines get v1.1 while all others stay on v1.0.
You then monitor error rates in v1.1 (and its dependencies) and stop if there
are elevated levels, or crank the release up slowly if the error levels aren't
elevated. Obviously some automation in your release management and monitoring
tools helps a lot with this.
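That ramp-up policy can be sketched as a single decision function (the step size and tolerance below are made-up values, not any particular tool's defaults):

```python
def next_canary_weight(current_weight, canary_error_rate, baseline_error_rate,
                       step=0.05, tolerance=1.5):
    """One step of an automated canary rollout (illustrative policy).

    Drain all traffic from the new version if its error rate is noticeably
    worse than the baseline's; otherwise ramp up by a fixed step, capped at 100%.
    """
    if canary_error_rate > baseline_error_rate * tolerance:
        return 0.0  # abort the rollout
    return min(1.0, current_weight + step)
```

Run on a timer against your monitoring system, this is essentially what "crank the release up slowly" automates.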

Of course this does not "solve" the problem of incompatibilities themselves -
for that the simple solution that I've seen used is just bog standard
versioning of APIs (/v1/... /v2/... /v2019-03-04/... etc) and some procedures
for managing the supported versions, e.g. not doing breaking changes in an
existing version, only supporting a set number of versions (i.e. previous,
current, and next), and having proper sunset-periods for older versions before
they are turned off so clients have time to update. All this management slows
velocity, but that is life in a complex system anyway.
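A toy dispatcher for that URL-versioning scheme (the version sets and sunset date are invented; the `Sunset` response header itself comes from RFC 8594):

```python
SUPPORTED_VERSIONS = {"v1", "v2"}   # previous and current
SUNSET_VERSIONS = {"v0"}            # still routed, but clients are warned

def route(path):
    """Dispatch a versioned path like '/v2/users' (illustrative only)."""
    version = path.strip("/").split("/", 1)[0]
    if version in SUPPORTED_VERSIONS:
        return {"status": 200, "version": version}
    if version in SUNSET_VERSIONS:
        # RFC 8594-style hint that this version is going away
        return {"status": 200, "version": version,
                "headers": {"Sunset": "Sat, 01 Jan 2022 00:00:00 GMT"}}
    return {"status": 404}
```

Keeping the breaking change behind a new prefix is what lets old and new clients coexist during the sunset period.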

------
dgrin91
Don't monorepos solve this? Let's presume I'm not a FAANG company (though
some of those have monorepos) with a bazillion services, I'm just a startup
with, say, 10 services. At that scale I think it's pretty realistic to
maintain a monorepo for all 10 services. That way every release only bumps
the version count by one.

~~~
jennyyang
A startup has no business running microservices. Microservices are only for
when the engineering team can't scale properly because they keep stepping on
top of each other.

A startup should be using a monolith until it becomes a victim of its own
success and then should migrate thoughtfully to microservices only when it has
to. It also requires a very heavy investment in devops and tools since
microservices will fail in production in every which way possible. It also
requires a huge investment in metrics and alerts and logging otherwise you
will have no idea what is wrong with your now too-complicated production
system.

~~~
danielovichdk
Do you know the domain of the business? Otherwise it sounds like you should
reconsider the advice you give out.

~~~
jennyyang
My statement applies across the board. It doesn't matter what the domain of
the business is. Microservices cost more and have a higher impact on
productivity, and are only worth it if your engineering team needs the
flexibility, at the cost of increased devops investment and more downtime.

~~~
danielovichdk
No.

~~~
jennyyang
Ok. Once you have real world experience, then you should continue commenting.

------
sudeepj
A few points based on my experience with (micro)services:

Note: none of these are cast in stone; they have simply worked well for us.

1\. Decide in your team/org what constitutes a new version and what it means.
Does every change mean a version change? If an API adds an optional
parameter, do I need to update the version?

2\. Not everything has to be a service. Consider whether the same
functionality can be consumed in the form of a library/SDK instead of a full
service.

3\. The dependencies amongst services should form an acyclic graph, with no
cycles. This limits the impact of version changes.

4\. Think about abstraction. Can we logically group a set of services A, B &
C and provide a unified API via a pass-through service D? Only D needs to
handle API breakage most of the time.
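Point 3 is mechanically checkable: a depth-first search over the dependency graph finds cycles. A sketch (service names invented):

```python
def has_cycle(deps):
    """Detect cycles in a service dependency graph via DFS.

    deps maps each service to the list of services it calls.
    """
    WHITE, GRAY, BLACK = 0, 1, 2          # unvisited / on stack / done
    color = {s: WHITE for s in deps}

    def visit(node):
        color[node] = GRAY
        for neighbour in deps.get(node, []):
            if color.get(neighbour, WHITE) == GRAY:
                return True                # back edge: cycle found
            if color.get(neighbour, WHITE) == WHITE and visit(neighbour):
                return True
        color[node] = BLACK
        return False

    return any(color[s] == WHITE and visit(s) for s in deps)
```

Running something like this in CI against a declared service graph is one cheap way to enforce the acyclic rule before a cycle sneaks in.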

------
BerislavLopac
If you have this problem, you don't have microservices -- you have a
distributed monolith.

If each service can talk to every other service, the number of possible
communication paths grows quadratically -- this is manageable up to a point,
but even the largest teams will have to start investing in orchestration
sooner or later.

There are many orchestration patterns and mechanisms, and this is not the
place to list them all -- which one is the right choice will depend on the
complexity and needs of the system.

~~~
taleodor
This is not about microservices talking to each other, but rather about
seeing the customer-facing product as a whole, with all its different
components and their versions.

And yes, we can agree that there are various (usually "greedy") strategies to
simplify this - but this problem is a fact of life, and it makes the thinking
process easier to accept it as such. Same as accepting that it is impossible
to solve consensus in an asynchronous system, yet there are algorithms that
perform fairly well.

\+ I summed this up in more detail in the comments down below.

------
fxtentacle
Turns out the latest hype is a dud for most companies ^_^

There are good reasons to use microservices, like when you have 100+ engineers
working on your planet-scale cloud or social network. At that scale, you need
to have internal API documentation to ensure people that never met can still
work together productively.

But if you are a normal company, you will probably never have enough engineers
to make microservices worthwhile in the first place. So just put everything
into one big git repo and call it a day.

Also, the ability to do rolling restarts and no-downtime upgrades by running
multiple versions in parallel adds a lot of work and complexity. But for most
small to medium companies, a planned 5-minute downtime in the middle of the
night is no problem at all, so all that zero-downtime-upgrade work is just
wasted effort.

I mean, even my bank has a fixed offline maintenance window every night from
3:00 to 3:30 am. As does Amazon RDS.

So don't solve problems that you don't have :) Most companies do fine without
microservices.

~~~
fiedzia
> There are good reasons to use microservices, like when you have 100+
> engineers

It's enough to have 2 who work in different domains with different
constraints and different tools. Company scale is completely irrelevant here.

> But if you are a normal company, you will probably never have enough
> engineers to make microservices worthwhile in the first place

Many "normal" companies use microservices, and a common reason is that they
simply cannot function otherwise.

> But for most small to medium companies, a planned 5 minute downtime in the
> middle of the night is completely no problem,

The fact that some site doesn't function for 5 minutes may not be a problem;
the fact that any change requires many people agreeing on a deployment time
often is.

~~~
jdc
_Its enough to have 2 that work in different domains with different
constraints and different tools._

Wouldn't that more or less be just two monoliths?

~~~
fiedzia
If they both work for the same organization to achieve a common goal,
communicate through some interface they agree on, and have to coordinate
occasional changes that affect both codebases, then no.

If they are completely isolated, then yes, but that's a really rare scenario.

A common example would be a website that gathers/presents some data from
users, and a service that provides analytics of said data. Both domains may
require completely different technologies.

------
jayd16
Is this a real problem people deal with? People actually build these
spaghetti services and deploy several versions of them at once? Each of these
versions is somehow routable to and from every other version with no
abstraction?

Surely this doesn't actually happen.

------
bcrosby95
Consumer driven tests can help with this too. Rather than just service
providers writing tests, service consumers write them too, which are then
given to the service provider as part of their test suite.

------
ewindal
I don’t see how this is a problem. I see that the combinatorial explosion
exists, but it isn’t an actual problem. The entire point of microservices is
independent deploys, meaning you literally only have two permutations per
deploy: new and old. If you think deploying 10 services at once is feasible,
you don’t understand the appeal of microservices, and should never have opted
for them in the first place.

~~~
taleodor
Deploying one at a time may be fine if you always test that each one works in
a backward-compatible manner. However, you still have the same search space
and the same problem as before.

Imagine

a - you have 10 microservices, 1 update for each. Each is supposed to be
backward-compatible. You start rolling out, one by one. 5 go fine, the 6th
breaks. You end up in a weird state where 5 out of 10 are updated.

You hope it's fine due to backward compatibility, but you never really tested
this exact configuration. (Which is why I might prefer either converging to
deploying all 10 or rolling back fully - if that was a known good state - and
having a canary cluster rather than a canary microservice in many cases.)

b - same as above, but now you try to catch this behaviour on test / staging.
You still have the same hard problem at hand.

Key here is that you clearly can't try every possible variation of what may
break, so you need to make conscious decisions about what to do in the case
of failure.

------
pachico
Although it's an interesting calculation, you rarely have all your
microservices communicating with each other, since you tend to group them by
area. And even if that were the case, if good practices are in place
(semantic versioning, good test cases, etc.) I don't see why it should be a
problem.

------
atrandom
In my experience, most service APIs grow by accretion and need to break
backward compatibility very, very rarely. Naturally, clients should always be
Tolerant Readers.
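A Tolerant Reader can be as small as this sketch (field names invented): pick out only what you need and ignore everything else.

```python
def read_user(payload):
    """Tolerant Reader: extract only the fields we need, with defaults for
    optional ones. Unknown fields added by a newer provider version are
    silently ignored instead of causing a failure.
    """
    return {
        "id": payload["id"],                     # required field
        "name": payload.get("name", "unknown"),  # optional, with a default
    }
```

A client written this way keeps working unchanged when the provider adds fields, which is most of what "only a single version per service" requires.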

So in many cases, you will have only a single version per service.

Moreover, in my opinion, when we are talking about internal APIs, there
should be nothing preventing the service owner from updating the consumers if
they wish to converge more quickly after breaking compatibility - just as
they would when refactoring a monolith. The culture should allow and
encourage this kind of collaboration.

------
iblaine
This is an interesting void for micro services that’s not yet being met.

------
TeeWEE
If you deploy new versions of services and roll them out quickly, the number
of combinations stays manageable.

Make the interfaces of services backwards compatible, either by testing
and/or by using stricter contracts like gRPC (which has a way of dealing with
compatibility).

Then you're mostly covered. However, integration testing does still make
sense.

------
jillesvangurp
Microservices only have two versions you should worry about: the one that's
currently live and the next one. Two for each microservice you are
responsible for. IMHO rollbacks are not a thing. Instead, git revert and roll
forward when you need to deploy a fix. So once a service has been fully
deployed, you permanently stop caring about the previous version.

Microservices should not be mass-deployed together but managed separately.
That vastly simplifies the combinatorial explosion of versions you need to
worry about: only the currently live ones.

The rest is just applying SOLID principles to your microservices and avoiding
services with poor cohesion or tight coupling (most of the SOLID principles
boil down to affecting these two metrics). Bad service design with tight
coupling - where deploying service A also requires redeploying B, C, and D -
is basically a design problem, not a microservices problem. There's a lot of
bad design in our industry. Monoliths allow you to get away with it, but
that's also the reason breaking them up is a hard problem. Just because you
are getting away with it does not mean it is not a problem, though.

You can further mitigate integration issues by deploying using modern
practices like blue-green deployments, where you gradually move traffic to a
new version and adapt based on things like error rates and other metrics, A/B
testing, etc. So if it breaks, you don't end up breaking it for everyone, and
you can fix the problem and try deploying again in a controlled way.

Basically your goal is to keep your customers (a.k.a. dependents) happy and
make sure you don't negatively affect them (which you should be actively
monitoring). Likewise, if you have dependencies (to whom you are a customer)
and they stop working when you deploy something new, you probably want to
detect and fix that before you break all your customers. If that is a regular
occurrence, consider having integration tests, contract tests, etc.

Staging environments are a controversial topic in this context. My view is
that they don't make much sense in a properly run micro service deployment
since you are not testing in a real environment with real users, real data,
and lots of things happening concurrently and features interacting. And of
course your customers doing real things they care about. The only realistic
environment that has that in most complex micro service deployments is called
production. Once you add serverless and edge computing to the mix, these
things become even more true. The bigger the organization, the less feasible
it becomes to have a staging environment.

Update. I forgot to add this but doing continuous deployment means small
deltas that are low risk. Any bigger change can be behind a feature flag.
There are a few more strategies.
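A feature flag can be as simple as an environment-variable check; this sketch (with invented names) stands in for a real flag service:

```python
import os

def flag_enabled(name, default=False):
    """Minimal feature-flag lookup backed by environment variables."""
    value = os.environ.get(f"FLAG_{name.upper()}", str(default))
    return value.lower() in ("1", "true")

def checkout(cart_cents):
    """New pricing logic ships dark and is switched on per environment."""
    total = sum(cart_cents)
    if flag_enabled("NEW_PRICING"):
        return total * 9 // 10   # new code path: 10% discount, integer cents
    return total
```

The bigger change is deployed (small delta, low risk) but inert until the flag flips, and flipping it back is an instant "rollback" with no redeploy.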

Also, I'm not actually a microservices proponent for small teams. It just
creates deployment and operational overhead (read: go-to-market bottlenecks).

~~~
dosethree
Pretty much yes on all of this. I don't mind rollbacks though - there are
issues with rolling back, but if I can, I will. If there is a database
migration, I typically roll forward, as you say.

Staging envs are generally a crutch. They can help but they can also hurt.

