
We built network isolation for 1500 services - p10jkle
https://monzo.com/blog/we-built-network-isolation-for-1-500-services
======
d4nt
Locking down this network of services is a massive security improvement and
they've used some very neat ways of achieving it. Overall, I really appreciate
them writing this up.

However, 1500 services? That really feels like they're separating things at
too granular a level. Does every one of those things really need to sit behind
a network call? Couldn't some of that re-use be via code libraries? I wonder
what the service to developer ratio is?

~~~
segmondy
Code libraries in microservices can lead to a distributed monolith. Each
service needs to be able to evolve independently. If you end up with a
library where an update requires more than one service to be updated at
the same time, then you are losing the advantage of microservices.

Nevertheless, 1,500 is a lot, but if you think it's a lot, wait till someone
builds out their entire business with lambda/serverless. :-)

~~~
p10jkle
Yeah, our (mostly) ban on shared code is a major reason we end up with a lot
of services; there's even service.business-day, which tells you whether a day
is a working day.

~~~
d4nt
I’m interested in the term “distributed monolith” and why shared libraries are
bad, can you elaborate? (I’m an experienced dev who’s never worked on
microservices.) Say a service that consumed 6 other services became a service that
consumed 3 other services and 3 libraries. The interfaces were the same, the
libraries are consumed via package management and have good test coverage. Why
is that worse?

I’m actually not surprised you have a service for handling working days. The
list of UK bank holidays should ideally be stored in one place, with
everything else consuming that. But not all of the 1500 services have their
own data stores, do they?

~~~
trollied
It seems crazy to me that you'd have to do an HTTP(S) request to find out
whether a day was a working day. An HTTP(S) call which then probably does a DB
lookup over the network. Just seems like an extra layer of overhead when you
could just have a library instead.

I've never worked on microservices, so I may just have a fundamental
misunderstanding of how granular you're supposed to be & also don't know if
this is the norm.

Would be nice to hear some other opinions on this.

~~~
eigenrick
Figuring out whether today is a working day is actually a dynamic problem. It
is different for every locale, and it changes more frequently than one might
expect.

It's actually a great example of something that changes rather frequently, and
it's a concern that cuts across dozens (or in Monzo's case, maybe hundreds) of
services.

Causing a roll-out to hundreds of critical business functions (whether hosted
in a monolith or microservices) because Uganda added a holiday seems quite
excessive.

The rule my company and I follow is that the shape of a module should follow
its deployment model and scope. Some services are global in scope (literally
targeted at the whole earth); others are scoped to a small subset of our inner
network. They should have different code repos because they have different
rollout schedules.

In the case of the working-days service, only one rollout needs to occur, and
it seems like a fairly safe one.

Let's consider another service, the CriticalTransactionService, or CTS. It
executes critical business transactions in a very stateful way. When deploying
a new version, in order to avoid any loss of availability, often a special
rollout dance must be executed. E.g. switching the master transaction writer
to another region, which means changing databases from passive replication to
active master, and vice-versa.

One might consider this rollout "risky" and therefore limit it to 3am on a
Saturday. It seems like a good idea to limit the scope of this 3am Saturday
rollout to the CriticalTransactionService alone, with most other services free
to deploy to prod whenever they wish.

~~~
trollied
I guess what I don't get is that the service needs to be rolled out/restarted
too.

Doesn't the data just live in a database & there's a cache going on that can
be invalidated?

Or is everything just so overcomplicated these days?

~~~
eigenrick
This particular case could be a couple of tables in a database. To follow
proper encapsulation guidelines, it would have to be front-ended by some stored
procedures so that the underlying representation is free to change.

How would you roll out this change to production? If you just swapped out
tables and stored procedures, you'd have downtime. You can't just install a
parallel table and stored procedures, because there is no way to tell all of
the consumers to use the v2 of your functions. So you'd have to temporarily
remove availability of the WorkingDays functionality.

If it were deployed as a microservice, you would be free to deploy a v2 of
WorkingDays in parallel. When it's live, the old one goes away, and there is no
loss of availability.

------
all_usernames
Great post. I really appreciate engineering blogs written in this storytime
format. I don't have time to dive into the implementation of Calico or <insert
one of the 1,261 kubernetes projects here> _, but I learn a lot from reading
the process a team goes through in figuring out and iterating on a solution.

_ [https://landscape.cncf.io/](https://landscape.cncf.io/)

~~~
dfc
What is the purpose of the link at the end of your comment?

~~~
fastball
It's a ref for his "1261 k8s projects" statement.

------
sansnomme
Another potential solution is to use a constraint solver like MSFT Z3, or if
you want a nicer syntax and more flexibility, Prolog.

E.g. [https://medium.com/@ahelwer/checking-firewall-equivalence-with-z3-c2efe5051c8f](https://medium.com/@ahelwer/checking-firewall-equivalence-with-z3-c2efe5051c8f)

This is much more scalable in the long run.
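
To illustrate what a solver buys you: two rule sets are equivalent iff no packet is classified differently by them. Here's a toy brute-force version of that check (rules and the tiny packet space are invented for illustration); Z3 performs the same check symbolically, which is why it scales to real 32-bit address spaces where enumeration can't:

```go
package main

import "fmt"

// A toy packet: a pretend /24 network id (0..3) and a destination port.
type packet struct {
	srcNet  int
	dstPort int
}

// Rule set A: allow network 1 to ports 80 and 443.
func allowA(p packet) bool {
	return p.srcNet == 1 && (p.dstPort == 80 || p.dstPort == 443)
}

// Rule set B: written differently, intended to be equivalent to A.
func allowB(p packet) bool {
	if p.srcNet != 1 {
		return false
	}
	return p.dstPort == 80 || p.dstPort == 443
}

// equivalent enumerates the packet space looking for a distinguishing
// packet; a solver would return the same counterexample as a "model".
func equivalent(a, b func(packet) bool) (bool, packet) {
	for net := 0; net < 4; net++ {
		for port := 0; port < 1024; port++ {
			p := packet{net, port}
			if a(p) != b(p) {
				return false, p
			}
		}
	}
	return true, packet{}
}

func main() {
	ok, _ := equivalent(allowA, allowB)
	fmt.Println(ok) // true: no distinguishing packet found
}
```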

------
gravypod
If the authors are reading this I was wondering two things:

1\. Why was static analysis of the code chosen over observing the system
during runtime and integration testing?

2\. What was the reason the CNI layer was chosen for this implementation over
the service mesh layer?

Something that really interests me about bazel/buck/pants/please is it
automates #1 entirely with dep queries.

~~~
p10jkle
1\. We have a lot of code paths; there isn't an integration test for
everything. And for runtime, just because something is rarely called doesn't
mean it's never called; a bank can have yearly processes.

2\. We'll do service mesh too, but this was probably an easier first step. We
have rolled our own mesh with Envoy sidecars, and moving it towards Istio-style
behaviour isn't trivial.

~~~
tjungblut
> We have a lot of code paths, there isn't an integration test for everything.

but you will have the service names that you're contacting configured
somewhere (or in the worst case hardcoded). I would think this is much easier
to analyze than the traffic flows.

~~~
YawningAngel
It's done by importing a protobuf, so I imagine rpcmap just checks all the
protos imported by a service

~~~
p10jkle
It's slightly more complicated than that, because you can import protos to use
a constant or to consume an event from another service. But that is the gist.
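
A toy version of that import scan, using Go's own parser (the package paths and the `/proto` suffix convention are invented; the real rpcmap's heuristics aren't public):

```go
package main

import (
	"fmt"
	"go/parser"
	"go/token"
	"strings"
)

// A pretend service source file with named imports of generated proto
// packages (paths are hypothetical).
const src = `package ledger

import (
	"fmt"

	businessdayproto "example.com/service.business-day/proto"
	accountproto "example.com/service.account/proto"
)
`

// protoImports parses a Go file and returns imports that look like
// generated proto packages, i.e. the candidate service dependencies.
func protoImports(source string) []string {
	fset := token.NewFileSet()
	// ImportsOnly: we never need full function bodies just to see edges.
	f, err := parser.ParseFile(fset, "ledger.go", source, parser.ImportsOnly)
	if err != nil {
		panic(err)
	}
	var deps []string
	for _, imp := range f.Imports {
		path := strings.Trim(imp.Path.Value, `"`)
		if strings.HasSuffix(path, "/proto") {
			deps = append(deps, path)
		}
	}
	return deps
}

func main() {
	fmt.Println(protoImports(src))
}
```

As the parent notes, imports alone over-approximate (a proto import might only be used for a constant or an event type), which is presumably where the extra heuristics and override comments come in.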

------
z3t4
Network filtering, while a nice extra layer, should not be the only layer.
Services should require authorization as if they exposed an open API.

~~~
p10jkle
Completely agree

------
rawoke083600
"But we already have over 1,500" wow... I would start there...

------
purple_ducks
> attempt to find code that looked like it was making a request to another
> service.

> We generally fixed those cases by adding a special comment in the code that
> told rpcmap about the link

Why not enforce that all endpoints/URLs be defined in a config file and
sidestep this? Scanning code for URLs/constructed URLs is overkill and brittle.

------
aSplash0fDerp
Nice write-up! That's the beauty of scale: explain a part in detail, then go
with the 30,000-foot view.

IMO, the security orchestration may actually become the "app" as speeds
continue to increase, compute costs go even lower, and losses incurred from
compromised data/networks increase.

A true zero trust platform that keeps all of the doors closed or
"instances/vm" offline until (the milliseconds) they're needed is the security
symphony we might see on the horizon.

Data silos and walled gardens may never go out of style, they'll just take on
new acronyms.

------
grandinj
Strikes me that some services ideally need to expose multiple interfaces, and
that isolation should be on a per-service-interface basis.

E.g. the monitoring service should only be able to access the metrics part of
each service.

~~~
p10jkle
Monitoring can only access metrics; it's scoped to the port. We'd like
endpoint-level controls, and that's the next step.

------
angry_octet
Impressive achievement. It still sounds like callees have more knowledge of
callers than is justified. Is it a security property or a component
functionality property? How do those interact?

A centralised graph representation of the security/functionality properties
would be a better way to represent this information, so it can catch adding
interfaces which should be forbidden. Also able to be configuration managed as
sets of microservices.

If you have a connectivity graph it would be good to do taint analysis to see
how far bad information can propagate.

~~~
p10jkle
I think we do effectively have a graph, but it's not stored in a centralised
way. If it were, then we'd have to somehow gatekeep that state, which becomes
tricky when you consider that we do hundreds of deployments a day.

~~~
angry_octet
Then the state at present is changing hundreds of times a day without being
centrally visible via runtime query or a configuration service. It seems like
it would be hard to retrospectively analyze what could connect to what. Maybe
consider a permissive system that allows a service code change to alter the
connectivity graph without a review step but centrally logs the change;
another service can then supervise what's happening and alert if strange
things happen. It would help to have a system for categorising service types;
the graph properties of the system might be useful for understanding that.
Looking at service changes in terms of graph properties, like edit distance,
can also be useful.

~~~
p10jkle
I feel like we could do all of that by consuming the GitHub API. We
essentially have a directed graph stored as files.

------
brentis
Nice work. If you define your policies based on a tagging taxonomy, you could
centrally manage these inbound/outbound service relationships. Every new
instance or container would assume the same network policies based on its tag.
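
For a concrete picture of the tag-based idea: a standard Kubernetes NetworkPolicy (which Calico enforces) already expresses policy via label selectors, so every new pod carrying the right labels inherits the rules automatically. The service names and labels below are invented:

```yaml
# Hypothetical sketch: pods labelled app: ledger may receive traffic
# only from pods labelled app: payment; everything else is denied once
# the pod is selected by any policy.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: ledger-ingress
spec:
  podSelector:
    matchLabels:
      app: ledger
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: payment
```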

------
matdehaast
Curious if you looked at using OAuth with the client credentials grant for
each service?

Also didn't see any mention of prior art like
[https://cloud.google.com/beyondcorp/](https://cloud.google.com/beyondcorp/).

Thanks for the great writeup!

~~~
p10jkle
This is probably our next project, although we might solve this problem with
mutual TLS instead of JWTs.

~~~
matdehaast
OAuth generally won't give you JWTs; it's recommended that it gives you an
opaque token which you introspect. What's nice is that you delegate the auth
part to your AS, and then your resource servers (microservices) can make
determinations based on the client_id/claims from introspection.

------
hu3
> This would read all the Go code in our platform, and attempt to find code
> that looked like it was making a request to another service.

Is there a link about how much Go Monzo uses?

~~~
Spinosaurus
Almost all of the microservices referenced are written in Go.

------
mschuster91
1,500 services? What the... the run times for calls must be _atrocious_ with
all the network communication and latency that is happening.

~~~
jacquesm
It does not have to be, but that is the least of your worries. Even so, there
might be valid reasons for setting things up this way, but it would be a
relatively rare case. They do have 150 devs, and that alone is a difference
that makes this approach untenable for most companies.

------
voltarolin
Can a service mesh such as Istio provide the capability that Monzo have
implemented themselves here?

------
kasey_junk
Using YAML for critical infrastructure specification is one of the stupidest
things we’ve ever done as an industry.

~~~
pokoleo
It’s better than XML!

~~~
angry_octet
When you see how many parsing ambiguity problems there are with YAML, you
begin to wish for XML. Other modern alternatives include:

[https://jsonnet.org/](https://jsonnet.org/)

[https://json5.org/](https://json5.org/)

See [https://arp242.net/yaml-config.html](https://arp242.net/yaml-config.html)
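
A classic example of those ambiguities is YAML 1.1's implicit typing of unquoted scalars (the "Norway problem"):

```yaml
# Under YAML 1.1 rules, unquoted scalars are type-guessed:
countries:
  - gb       # the string "gb"
  - no       # parsed as the boolean false, not the string "no"
version: 1.10  # parsed as the float 1.1, not the string "1.10"
debug: off     # parsed as the boolean false
```

Quoting every scalar avoids this, but a config language whose safe use depends on remembering to quote "no" is exactly the kind of footgun the grandparent is complaining about.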

