
How does Monzo keep 1,600 microservices spinning? Go, clean code, strong team - dustinmoris
https://www.theregister.co.uk/2020/03/09/monzo_microservices/
======
SheinhardtWigCo
I hate to rain on the parade, but on the face of it, this sounds utterly
ridiculous.

I simply cannot believe that their product really calls for an architecture
consisting of 1600 individual microservices.

Don't get me wrong, I don't think one huge monolith is ideal either, but how
is this better?

Who at Monzo could possibly understand how a significant portion of their
platform works, in detail, and be confident that knowledge isn't going to be
outdated ten minutes from now? Who understands how changes to each of these
tiny pieces will impact the others? What are they actually solving by doing
things this way?

Would this abomination exist if they didn't feel the need to hire people whose
primary objective is to get the words "microservice" and "Kubernetes" stamped
onto their resumes?

Am I just out of touch? Is this the new normal?

~~~
pjc50
Not to mention that it's a financial service, and doing transactions across
microservices is a total nightmare. I wonder how many of those services are
allowed to touch actual account data directly.
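
To make "nightmare" concrete, here's a sketch (hypothetical Go, made-up
client names - not Monzo's actual code) of what a transfer looks like once
it can't be wrapped in a single database transaction. Every failure branch,
including a failed compensation, is yours to handle:

    // Sketch only: hypothetical client names, not Monzo's actual code.
    package transfer

    import (
        "context"
        "fmt"
    )

    // LedgerClient stands in for a generated RPC client.
    type LedgerClient interface {
        Debit(ctx context.Context, account string, amount int64) error
        Credit(ctx context.Context, account string, amount int64) error
        Refund(ctx context.Context, account string, amount int64) error
    }

    // Transfer moves money across two services. Without a shared database
    // transaction, every failure branch needs explicit compensation.
    func Transfer(ctx context.Context, debits, credits LedgerClient, from, to string, amount int64) error {
        if err := debits.Debit(ctx, from, amount); err != nil {
            return err // nothing has happened yet, safe to bail
        }
        if err := credits.Credit(ctx, to, amount); err != nil {
            // The credit failed - or merely timed out, in which case we
            // don't even know whether it landed. Compensate, knowing the
            // compensation itself can also fail.
            if rerr := debits.Refund(ctx, from, amount); rerr != nil {
                return fmt.Errorf("credit failed (%v), refund failed (%v): manual intervention needed", err, rerr)
            }
            return err
        }
        return nil
    }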

~~~
okal
There's a lot of back-office tech you need to build/buy if you're working in
consumer or financial tech. Monzo is both. Most people working at a bank
aren't finance workers. There are entire domains that will never be touched in
the course of a transaction.

------
gregdoesit
I joined Uber in 2016, right around the time when at every conference you'd
hear a talk along the lines of "Lessons learned at Uber on scaling to
thousands of microservices" [1].

After a year or two, those talks stopped. Why? Turns out, having thousands of
microservices is something to flex about, and it makes for good conference
talks. But the cons start to weigh after a while - and when addressing those
cons, you take a step back towards fewer, bigger services. I predict Monzo
will see the same cons in a year or two, and move to the more pragmatic
approach of fewer, better-sized services that I've seen at Uber.

In 2020, Uber probably has fewer microservices than in 2015. Microservices are
fantastic for autonomy. However, that autonomy also comes with drawbacks.
Integration testing becomes hard. The root cause of most outages becomes
parallel deployments of two services that cause issues together. Ownership
becomes problematic when a person leaves who owned a microservice that was on
the critical path. And that no one else knew about. Oncall load becomes tough:
you literally have people owning 4-5 microservices that they launched. Small
ones, sure, but when they go down, they still cause issues.

To make many services work at scale, you need to solve all of these problems.
You need to introduce tiering: ensuring the most critical (micro)services
have the right amount of monitoring, alerting, proper oncall and strong
ownership. Integration testing needs to be solved for critical services -
often meaning merging multiple smaller services that relate to each other.
Services need to have oncall owners: and a healthy oncall usually needs at
least 5-6 engineers in a rotation, making the case for larger services as
well.
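
For a rough idea of what tiering means in practice, it often boils down to a
bit of metadata per service that drives operational requirements. Something
like this (illustrative only - not Uber's or Monzo's actual scheme):

    // Illustrative only - not Uber's or Monzo's actual scheme. The point
    // is that a service's tier drives how much operational rigour it
    // must carry.
    package tiering

    type Tier int

    const (
        Tier0 Tier = iota // on the money path: strictest requirements
        Tier1             // user-visible but degradable
        Tier2             // internal tooling, best-effort
    )

    type ServiceSpec struct {
        Name         string
        Tier         Tier
        MinOncallers int  // e.g. Tier0 demands a healthy 5-6 person rotation
        PagesOncall  bool // Tier0/1 page a human; Tier2 might only file a ticket
    }

    var services = []ServiceSpec{
        {Name: "ledger", Tier: Tier0, MinOncallers: 6, PagesOncall: true},
        {Name: "avatar-resizer", Tier: Tier2, MinOncallers: 1},
    }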

Microservices are a great way to give teams more autonomy. Just beware of
that autonomy turning into something that's hard to manage, or something that
burns people out. Uber figured this out - other companies are bound to follow.

[1] [http://highscalability.com/blog/2016/10/12/lessons-learned-f...](http://highscalability.com/blog/2016/10/12/lessons-learned-from-scaling-uber-to-2000-engineers-1000-ser.html)

~~~
imtringued
>And that no one else knew about.

Yeah, exactly. Having two people dedicated to an "Account" service that has 10
features sounds much better than having 2 people responsible for 5
microservices each. You might end up with a reset-credentials service, a
register-customer service and a delete-account service without any coherent
overarching design strategy, instead of just having a plain CRUD service with
6 extra operations. I can't blame them for having 160 services, because a lot
of enterprise-tier organizations are truly that complicated (I work at one),
but I do blame them for having 1600.
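
To illustrate the contrast (made-up Go, not Monzo's code):

    // Made-up interface, just to contrast the two shapes: three
    // separately deployed one-endpoint services vs. one coherent
    // Account service with a single owner.
    package account

    import "context"

    // What you often get instead of this: a reset-credentials-service,
    // a register-customer-service, and a delete-account-service, each
    // its own deployable with its own oncall.
    type AccountService interface {
        Register(ctx context.Context, email string) (userID string, err error)
        ResetCredentials(ctx context.Context, userID string) error
        Delete(ctx context.Context, userID string) error
        // ...the "6 extra operations" live here too, behind one team.
    }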

------
franze
From the same newspaper 6 months ago.
[https://www.theregister.co.uk/2019/08/06/monzo_pins_reset/](https://www.theregister.co.uk/2019/08/06/monzo_pins_reset/)

"[480k] ... of its [Monzo] customers to pick a new PIN – after it discovered
it was storing their codes effectively as plain-text in log files."

------
zoltrain
I'm not even entirely sure the microservice part is the bit that's ridiculous.
I remember interviewing with them a few years back, and when I was told the
architecture I was like "so you have all these services but you run them off
the same database cluster, isn't that super risky?"

Lo and behold, pretty much every major outage they've had has been because of
their Cassandra cluster. To be fair, they have addressed this now, but for it
to take N outages and 5 years kind of tells me they could focus a little more
attention on properties people value in their banks, like, I don't know...
uptime.

This is why they'd have to work pretty hard to convince me to make them my
primary bank, and 1600 microservices communicating over the most brittle
part of your stack doesn't win them any points in this space.

To put it in context, I really can't remember the last time HSBC had a major
outage... Sure, it's boring, but it works pretty much all the time.

Monzo seems more of a "60% of the time, it works every time" situation: great
for sending your mates money, but I wouldn't trust it to pay my rent on time.
Having 1600 microservices kind of explains this.

If there's anyone at Monzo listening, please focus on not going down, as
everything else you do is great.

------
altmind
It's weird to flex about how complicated the system you created is.
Creating simple systems is harder.

~~~
okal
I think that's the point. It's a set of simple services. A microservice is as
simple as it gets. I imagine that orchestration can become hellish at a
certain point, but every team works on a few simple microservices, rather than
one complex system.

~~~
gdy
Maybe the services are simple but their interaction definitely isn't.

~~~
okal
"Definitely" is a strong word. I'd imagine that it's designed so that you
don't need more than a few to interact with each in real time to fulfill a
business process.

I think people's anxiety around this comes from thinking that they need to
understand every single system in intimate detail, or even a significant
number of them. On most days, you probably only need to understand what
happens at the boundaries of the ones that your systems need to interact with.
Good instrumentation means you have enough visibility into them without
needing to understand the implementation details.

~~~
Traster
Let's put this in concrete terms: I have 1600 pieces of cardboard, and each
one only interacts with 4 other pieces of cardboard. In total, they come
together to form a single, fully functional product which shows the full
picture.

We literally call this product a puzzle.

------
Narkov
That visualisation graph is horrific. There's no way someone can look at 1,600
services and say that this is a good idea.

~~~
Nextgrid
It _is_ a good idea when "profit" is way down on your company's list of
objectives but "being hip" and "bragging about complex tech" are at the top.

The more complexity they have, the more people they can justify hiring, the
more "growth" they get, the more investment money they can solicit.

This kind of complexity benefits everyone individually. London has a huge
bullshit startup scene that enjoys complexity for complexity's sake (for the
aforementioned reasons) so it's a good career move for these developers to
have "microservices" on their CV. Managers also gain as they have to manage a
team 10x the size. The people above the managers can boast about how they're
"scaling teams" and battling complex engineering problems (even though these
problems are self-inflicted). The company enjoys it because it makes them look
hip and is presumably good bait for investors (just like AI and blockchain).
As a whole, everyone loses because a simple task is now taking 10x the effort
thus money but cares about that as the downside is a long-term cost that is
distributed across everyone when the upside is a direct, personal benefit to
someone's career.

When you're playing with investors' money, it's a very good idea. When you're
playing with your own money and doing a cost/benefit analysis, it's a terrible
idea.

------
Hates_
"Any organization that designs a system (defined broadly) will produce a
design whose structure is a copy of the organization's communication
structure."

 _— M. Conway (Conway's Law)_

~~~
Traster
To compare and contrast: [https://monzo.com/blog/2018/06/27/engineering-management-at-...](https://monzo.com/blog/2018/06/27/engineering-management-at-monzo)

To be fair, the article didn't say what I was expecting which was "Our
organisation is entirely flat".

------
Slartie
This looks an awful lot like back in the day when OOP really took off and
people started to brag about how many thousands of classes they had in their
systems. They even drew dependency diagrams awfully similar to those in that
presentation. And they thought all that complexity and interconnectedness was
a good thing, something to applaud! Of course, most of this happened in small
presentations, not on HN or at giant conferences. But it still happened.

Nowadays, bragging about your thousands of classes in your OO application
earns you laughs, or at best pitiful stares - rightfully so.

I'm wondering if history is going to repeat itself...

------
clementroche
I'd be interested to understand how they manage their "common" library,
especially the RPC bit.

I find it very painful to version API signatures and protocols. Let's say
service S1 has a method "DoSomething(a, b)" and you want to add a new feature
that makes it "DoSomething(a, b, c)" - how do you handle that? From what I've
seen you would do "DoSomething_v2(a, b, c)", but that seems hardly sustainable:

- you need a new version of the shared library for the new RPC calls
- you need all services to maintain a lot of code to support different signatures

And I don't think rewriting all the callers to use the new signature is really
an option. Or maybe they keep previous versions of the services deployed and
have some kind of routing that finds which service supports what signature
(plus probably some metadata like a minimum version number?).
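
For what it's worth, the closest thing to a fix I know of is the
protobuf-style request struct, where adding a field doesn't change the method
signature at all (hypothetical sketch, names made up):

    // Sketch of the request-struct style (protobuf-like, names made up):
    // adding a field is backward compatible, so old callers keep working
    // and DoSomething_v2 never needs to exist.
    package s1

    import "context"

    type DoSomethingRequest struct {
        A string
        B string
        C *string // new optional field; nil when sent by old callers
    }

    type Client interface {
        DoSomething(ctx context.Context, req *DoSomethingRequest) error
    }

    func handle(ctx context.Context, req *DoSomethingRequest) error {
        if req.C != nil {
            // new behaviour for callers that set C
        }
        // old behaviour is unchanged when C is absent
        return nil
    }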

People of HN, what would be your advice on approaching this situation?

~~~
okal
I'm generally quite against shared libs across services for this very reason.
Share patterns, sure. But code? You end up coupling services in ugly ways,
because Team X has to think about how Team Y will be impacted if they need to
make a change to the shared library. API clients (whatever protocol they may
be using to communicate) aren't generally the hardest part of building a
service. Trying to build shared ones (in my experience) is more trouble than
it's worth.

------
adtac
Go with mandatory error handling enforced by a linter is probably one of the
best ways I've seen to reduce mistakes. Not all aspects of the language are
perfect, but I'm confident it is loads better than exception-based approaches
with catch-all exceptions and generic error handling strewn everywhere. I
imagine this is especially useful from a banking perspective, where mistakes
can wipe out millions.
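
For example, with a linter like errcheck wired into CI, a silently dropped
error fails the build instead of failing quietly in production (illustrative
snippet):

    package audit

    import (
        "fmt"
        "os"
    )

    // writeAudit appends an entry to an audit log. With a linter like
    // errcheck in CI, dropping any of these returned errors (e.g. writing
    // just `f.WriteString(entry)`) gets flagged at build time.
    func writeAudit(entry string) error {
        f, err := os.OpenFile("audit.log", os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o600)
        if err != nil {
            return fmt.Errorf("open audit log: %w", err)
        }
        if _, err := f.WriteString(entry + "\n"); err != nil {
            _ = f.Close() // discarded explicitly: the write error matters more
            return fmt.Errorf("append audit entry: %w", err)
        }
        if err := f.Close(); err != nil {
            return fmt.Errorf("close audit log: %w", err)
        }
        return nil
    }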

~~~
johannes1234321
Microservices, however, remove that guarantee: if a microservice fails to
respond, you have no indication of whether it succeeded or not, which can
lead to inconsistent state.
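
The usual mitigation is to make the call safe to retry, e.g. with an
idempotency key (hypothetical sketch, not any particular vendor's API):

    package payments

    import "context"

    // Hypothetical client. With a caller-generated idempotency key, a
    // timed-out request can be retried: if the first attempt actually
    // succeeded, the server recognises the key and returns the original
    // result instead of performing the action twice.
    type PaymentClient interface {
        Pay(ctx context.Context, idempotencyKey string, amount int64) error
    }

    func payWithRetry(ctx context.Context, c PaymentClient, key string, amount int64) error {
        var err error
        for attempt := 0; attempt < 3; attempt++ {
            if err = c.Pay(ctx, key, amount); err == nil {
                return nil
            }
            // On a timeout we don't know whether the payment landed;
            // retrying with the SAME key is safe, retrying without one is not.
        }
        return err
    }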

------
manthideaal
Using so many microservices seems like a good use case for Erlang. Googling, I
found (1):

(1) [http://blog.plataformatec.com.br/2019/10/kubernetes-and-the-...](http://blog.plataformatec.com.br/2019/10/kubernetes-and-the-erlang-vm-orchestration-on-the-large-and-the-small/)

------
DelightOne
> We have an RPC filter that can detect you are trying to send a request to a
> downstream that isn't currently running, it can compile it, start it, and
> then send the request to it.

How does that work? DNS not available == service not running, and non-prod ->
contact deployment operator?
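
My guess at the shape of it (pure speculation, Go - none of these names are
Monzo's):

    // Pure speculation: an RPC middleware that, in a dev environment,
    // builds and starts a missing downstream before forwarding the request.
    package rpcdev

    import "context"

    type Request struct{ Service string }
    type Response struct{ Body []byte }
    type Handler func(ctx context.Context, req *Request) (*Response, error)

    type Registry interface {
        IsRunning(service string) bool
        WaitReady(ctx context.Context, service string) error
    }

    type Launcher interface {
        // BuildAndStart might shell out to `go build` and exec the binary.
        BuildAndStart(ctx context.Context, service string) error
    }

    func AutostartFilter(reg Registry, l Launcher, next Handler) Handler {
        return func(ctx context.Context, req *Request) (*Response, error) {
            if !reg.IsRunning(req.Service) {
                if err := l.BuildAndStart(ctx, req.Service); err != nil {
                    return nil, err
                }
                if err := reg.WaitReady(ctx, req.Service); err != nil {
                    return nil, err
                }
            }
            return next(ctx, req)
        }
    }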

And what if those services need DBs? Or is the single Cassandra/etcd instance
enough due to standardization?

And how the automatic metrics work is interesting too. Probably a Prometheus
HTTP handler auto-starting with tie-ins into the RPC/DB access code? How
automated can it get?

------
cosmiccatnap
There are two ways to make a system. One is so simple that there are obviously
no issues, and the other is so complicated that there are no obvious issues.
The former is much harder.

------
helsinkiandrew
Monzo is rapidly growing and has only been going for 4-5 years.

The real test will be when they have to retire or alter services, or face
product/compliance changes that affect many services.

