How does Monzo keep 1,600 microservices spinning? Go, clean code, strong team (theregister.co.uk)
39 points by dustinmoris on March 30, 2020 | 54 comments

I hate to rain on the parade, but on the face of it, this sounds utterly ridiculous.

I simply cannot believe that their product really calls for an architecture consisting of 1600 individual microservices.

Don't get me wrong, I don't think one huge monolith is ideal either, but how is this better?

Who at Monzo could possibly understand how a significant portion of their platform works, in detail, and be confident that knowledge isn't going to be outdated in ten minutes from now? Who understands how changes to each of these tiny pieces will impact the others? What are they actually solving by doing things this way?

Would this abomination exist if they didn't feel the need to hire people whose primary objective is to get the words "microservice" and "Kubernetes" stamped onto their resumes?

Am I just out of touch? Is this the new normal?

The orchestration overhead must be huge, but it sounds like they have focused on automating the pain of that away as much as possible.

To me it looks like an Alan Kay style of OOP design. From that standpoint, replace 'microservice' with 'class' (or 'actor') and 1600 of them isn't that ridiculous.

It'd be an interesting system to browse. I'm actually curious if they've built any tooling around making the codebase easier to navigate and edit. Smalltalk handles a huge amount of tiny classes by having good browsing tools. What would you use for a Go codebase of this size?

You can replace 'microservice' with 'class, but completely asynchronous, where any interaction with it can fail without the caller knowing whether it succeeded or failed at all'.

Yes, that type of class is known as "actor".

As someone with a little experience troubleshooting actor based systems, having 1600 of them (actors, not instances) sounds like a nightmare.

I have some experience with Akka. With the amount of praying you have to do during production issues, it's no surprise it feels like a religion.

We tried it about 4 years back. Maybe there are better debugging tools now, but for us both the learning curve and the pain of prod issues meant it wasn't worth it.

Some people love Erlang.

Maybe a little out of touch. Not about the new normal, but about being a consumer bank.

We are not micro services, we are a 'monolith', but with over 1600 components our developers call 'applications' and over 7,500 touch-points between them our developers call 'interfaces'. A single 'customer information file' touches dozens and dozens of these 'applications'.

For us, at least, what's going on is regulation. Every aspect of banking is heavily regulated, and each component of the monolith has to be compliant.

A notorious example is check settlement order. When checks come in for payment, the bank is supposed to process them in an order favorable to the customer, not to the bank. For example: $1000 in an account, and you write one $995 check and a hundred $10 checks. If the bank pays the $995 check first, the rest would bounce, causing $3500 in bounced-check fees. If the bank pays all the $10 ones first, just the $995 one would bounce, causing $35 in fees. This is now regulated.

There are countless such rules, and no dev team can know them all. So you assemble third-party component software that has the rules baked in, or, if you're greenfielding, you have compliance teams that study all the rules for their areas and bake them into the software requirements -- and good luck, because in the USA these rules can differ state by state or even zip code by zip code.

Easier to just decompose your monolith into components maintained by a group that contractually guarantees regulatory compliance for their specialization, and then pray (or rather, spend months regression testing after any change for quality assurance) the whole thing works when you glom it together.

In that world, suddenly something like K8s makes a lot of sense.

> Who at Monzo could possibly understand how a significant portion of their platform works, in detail, and be confident that knowledge isn't going to be outdated in ten minutes from now? Who understands how changes to each of these tiny pieces will impact the others?

Tbf, this can apply to any sufficiently complex system, whether or not it uses microservices.

It's why interfaces are so damned important, especially communicating them with stakeholders. Done properly, you only need to focus on the bits of a system that matter to you. At least microservices get to use OpenAPI.

There's a major difference between 600 components that fall for the fallacies of distributed computing and 600 that don't.

Agree completely, but that wasn't the point I was replying to, which was about totally understanding your system.

Don't worry. I've suffered (and still do) as much as anyone working with microservices, trying to reconcile and decipher logs.

But total knowledge is rarely possible in a large system. At least microservices acknowledge this. If anything, they can help by following Conway's Law.

I think for traditional webservices it sounds a bit strange, but look at things like actors in erlang, or how ROS (robot operating system) uses nodes. It's all individual processes that are fully asynchronous, and the number can run into the hundreds or thousands.

Not to mention that it's a financial service, and doing transactions across microservices is a total nightmare. I wonder how many of those services are allowed to touch actual account data directly.

There's a lot of back office tech you need to build/buy if you're working in consumer or financial tech. Monzo is both. Most people working at a bank aren't finance workers. There are entire domains that will never be touched in the course of a transaction.

They're using Apache Cassandra, which only got transactions in v2.0, and those are "lightweight transactions". It's not really designed for strongly consistent data. I wonder how well that really works for financial OLTP.

Also, how many of those microservices do exactly the same thing, but are implemented by different people who don't know about the other microservices existing.

Sometimes I get paranoid and assume that startups of this kind get subsidised by the state to lower unemployment. It looks like they had money to burn on engineers, and the approach was to give them tasks to keep them busy so that they don't appear idle, and give them an easy language to work with so that they don't complain.

Why should someone want/need to understand how the whole platform/significant portion of it works in detail? You just need to understand what happens at the boundaries of the systems you build or interact with. This works beautifully when you need to scale a team. Trying to get everyone to know every system in intimate detail seems the wrong way to go about it (to me). It's a bank. I imagine they automate a lot of things that would be considered minutiae in other organisations. Every Lambda, every little bot is a service. I get your cynicism, and I'd have shared it a few years ago, but 1600 now sounds like a perfectly reasonable number to me.

To understand the problem when everyone blames one another, every component seems to work fine but the entire system does not.

Suppose you design a bank. How many different RPCs would be involved? 1600 sounds modest.

Can't find how many developers they have to maintain this monster.

An article[1] from 2018 mentions 70 engineers. Let's assume 50% growth yoy, which puts them at 157 engineers now. Making another assumption, that only 20% of these engineers are productive, gives us 31 engineers. That's 1600 microservices for 31 engineers.

That sounds large, but with 1600 microservices you are _most likely_ writing a single service where you'd normally write a class. Does 50 classes per engineer sound like a lot? Sure does, but they didn't create this infrastructure out of thin air; this is an accumulation of many years. For the sake of argument, let's say 2 years. Across 24 months that's only about 2 microservices per developer per month.

Not unthinkable, and this is assuming only 20% of their engineering force is actually productive.


I think a lot of people are extrapolating from their experience with monoliths and imagining that Monzo devs are maintaining 1600 tiny monoliths. That's an anxiety-inducing thought.

I would prefer maintaining a "tiny monolith" over a micro service. At least with the monolith you're writing value-delivering business-logic code. From my experience with micro services you spend more time on boilerplate and the whole HTTP/RPC part and the communication with other services than the actual task the service is supposed to be doing.

Or, maybe you are spending a bunch of time making sure you don't break a rats' nest of dependencies and waiting for things to build.

After the n^th iteration of building an rpc server, such things may be fairly standardised, no?

If you look at what Apollo Federation (graphql) is proposing, then you can see the architecture calls for essentially turning each database table (users, orders, etc) into a micro service running with its own port.

I joined Uber in 2016, right around when at every conference you'd hear a talk along the lines of "Lessons learned at Uber on scaling to thousands of microservices" [1].

After a year or two, those talks stopped. Why? Turns out, having thousands of microservices is something to flex about, and makes for good conference talks. But the cons start to weigh after a while, and when addressing those cons you take a step back towards fewer, bigger services. I predict Monzo will see the same cons in a year or two and move to the more pragmatic approach of fewer, better-sized services that I've seen at Uber.

In 2020, Uber probably has fewer microservices than in 2015. Microservices are fantastic for autonomy. However, that autonomy also comes with drawbacks. Integration testing becomes hard. The root cause of most outages becomes parallel deployments of two services that cause issues. Ownership becomes problematic when a person leaves who owned a microservice that was on the critical path. And that no one else knew about. Oncall load becomes tough: you literally have people own 4-5 microservices that they launched. Small ones, sure, but when they go down, they still cause issues.

To make many services work at scale, you need to solve all of these problems. You need to introduce tiering: ensuring the most critical (micro)services have the right amount of monitoring, alerting, proper oncall and strong ownership. Integration testing needs to be solved for critical services, often meaning merging multiple smaller services that relate to each other. Services need oncall owners, and a healthy oncall usually needs at least 5-6 engineers in a rotation, which also makes the case for larger services.

Microservices are a great way to give more autonomy for teams. Just beware of the autonomy turning into something that's hard to manage, or something that burns people out. Uber figured this out - other companies are bound to follow.

[1] http://highscalability.com/blog/2016/10/12/lessons-learned-f...

>And that no one else knew about.

Yeah exactly. Having two people dedicated to an "Account" service that has 10 features sounds much better than having 2 people responsible for 5 microservices each. You might end up with a reset-credentials service, a register-customer service, and a delete-account service without any coherent overarching design strategy, instead of just having a plain CRUD service with 6 extra operations. I can't blame them for having 160 services, because a lot of enterprise-tier organizations are truly that complicated (I work at one), but I do blame them for having 1600.

> And that no one else knew about.

Yep. I've encountered a production issue that was traced to a Location service (you pass in some info and get location information back) that had been running for 3 years without an owner. The developers had all left, and the team was dissolved.

Not an inherent fault of microservices, but having 1000s of them running around will cause some to slip through the cracks.

From the same newspaper 6 months ago. https://www.theregister.co.uk/2019/08/06/monzo_pins_reset/

"[480k] ... of its [Monzo] customers to pick a new PIN – after it discovered it was storing their codes effectively as plain-text in log files."

I'm not even entirely sure the microservice part is the bit that's ridiculous. I remember interviewing with them a few years back and when I was told the architecture I was like "so you have all these services but you run them off the same database cluster, isn't that like super risky?"

Lo and behold, pretty much every major outage they've had has been because of their Cassandra cluster. To be fair, they have addressed this now, but for it to take N outages and 5 years kind of tells me they could focus a little more attention on properties people value in their banks, like, I don't know... uptime.

This is why they'll have to work pretty hard to get me to switch to them as my primary bank, and 1600 microservices communicating over the most brittle part of your stack doesn't win them any points in this space.

To put it in context I really can't remember the last time HSBC had a major outage... Sure it's boring but it works pretty much all the time.

Monzo seems more of a "60% of the time it works everytime" situation, great to send your mates money but I wouldn’t trust it to pay my rent on time. Having 1600 microservices kind of explains this.

If there’s anyone at Monzo listening, please focus on not going down as everything else you do is great.

It's weird to flex about how complicated the system you created is. Creating simple systems is harder.

I think that's the point. It's a set of simple services. A microservice is as simple as it gets. I imagine that orchestration can become hellish at a certain point, but every team works on a few simple microservices, rather than one complex system.

Maybe the services are simple but their interaction definitely isn't.

"Definitely" is a strong word. I'd imagine that it's designed so that you don't need more than a few to interact with each in real time to fulfill a business process.

I think people's anxiety around this comes from thinking that they need to understand every single system in intimate detail, or even a significant number of them. On most days, you probably only need to understand what happens at the boundaries of the ones that your systems need to interact with. Good instrumentation means you have enough visibility into them without needing to understand the implementation details.

Let's put this in concrete terms, I have 1600 pieces of cardboard, each one only interacts with 4 other pieces of cardboard. In total, they come together to form a single, fully functional product which shows the full picture.

We literally call this product a puzzle.

But on bad days, you need to debug distributed transactions.

No, there is no way around knowing the entire system and by that I don't mean the implementation but rather knowing which services exist and what their responsibility is. If you have 10x as many services as usual your work just got 10x harder.

That visualisation graph is horrific. There's no way someone can look at 1,600 services and say that this is a good idea.

It is a good idea when "profit" is way down on your company's list of objectives but "being hip" and "bragging about complex tech" is at the top.

The more complexity they have, the more people they can justify to hire, the more "growth" they get, the more investment money they can solicit.

This kind of complexity benefits everyone individually. London has a huge bullshit startup scene that enjoys complexity for complexity's sake (for the aforementioned reasons), so it's a good career move for these developers to have "microservices" on their CV. Managers also gain, as they get to manage a team 10x the size. The people above the managers can boast about how they're "scaling teams" and battling complex engineering problems (even though these problems are self-inflicted). The company enjoys it because it makes them look hip and is presumably good bait for investors (just like AI and blockchain). As a whole, everyone loses, because a simple task now takes 10x the effort and thus money, but nobody cares about that: the downside is a long-term cost distributed across everyone, while the upside is a direct, personal benefit to someone's career.

When you're playing with investors' money it's a very good idea. When you're playing with your own money and doing a cost/benefit analysis, it's a terrible idea.

"Any organization that designs a system (defined broadly) will produce a design whose structure is a copy of the organization's communication structure."

— M. Conway (Conway's Law)

To compare and contrast: https://monzo.com/blog/2018/06/27/engineering-management-at-...

To be fair, the article didn't say what I was expecting which was "Our organisation is entirely flat".

This looks awfully much like back in the days when OOP really took off and people started to brag about how many thousands of classes they have in their systems. They even drew dependency diagrams awfully similar to those in that presentation. And they thought all that complexity and interconnectedness was a good thing, something to applaud! Of course most of this happened in small presentations, not on HN and giant conferences. But it still happened.

Nowadays, bragging about your thousands of classes in your OO application earns you laughs, or at best pitiful stares - rightfully so.

I'm wondering if history is going to repeat itself...

I'd be interested to understand how they manage their "common" library, especially the RPC bit.

I find it very painful to version API signatures and protocols. Let's say service S1 has a method "DoSomething(a, b)" and you want to add a new feature that makes it "DoSomething(a, b, c)". How do you handle that? From what I've seen, you would do "DoSomething_v2(a, b, c)", but that seems hardly sustainable:

- you need a new version of the shared library for the new RPC calls

- you need all services to maintain a lot of code to support different signatures

And I don't think rewriting all the callers to use the new signature is really an option. Or maybe they keep previous versions of the services deployed and they have some kind of routing that would find which service supports what signature (+probably some metadata like a minimum version number ?).

People of HN, what would be your advice on approaching this situation ?

I'm generally quite against shared libs across services for this very reason. Share patterns, sure. But code? You end up coupling services in ugly ways, because Team X has to think about how Team Y will be impacted if they need to make a change to the shared library. API clients (whatever protocol they may be using to communicate), aren't generally the hardest part of building a service. Trying to build shared ones (in my experience) is more trouble than it's worth.

You need some kind of compatibility. For example, if you're using REST, you can add a new parameter with a default value and old clients will keep working. If you're using RPC, you need to research how it handles versioning, but generally you should be able to add a new parameter with a default value, or add a new method, while keeping old clients working. You need a global update if you're changing an existing method in an incompatible way, but that should be a rare situation.
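A minimal sketch of that REST-style compatibility in Go, with a hypothetical JSON `DoSomething` request body: the new field `c` simply gets a server-side default when an old client omits it.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// doSomethingRequest is a hypothetical request body. C was added
// after A and B shipped; old clients never send it.
type doSomethingRequest struct {
	A string `json:"a"`
	B string `json:"b"`
	C int    `json:"c,omitempty"` // added later; zero means "not sent"
}

// handle decodes a request and fills in the default for the new field,
// so clients built against the old contract keep working unchanged.
func handle(body []byte) (doSomethingRequest, error) {
	var req doSomethingRequest
	if err := json.Unmarshal(body, &req); err != nil {
		return req, err
	}
	if req.C == 0 {
		req.C = 10 // default preserves the old contract's behaviour
	}
	return req, nil
}

func main() {
	oldClient, _ := handle([]byte(`{"a":"x","b":"y"}`))
	newClient, _ := handle([]byte(`{"a":"x","b":"y","c":42}`))
	fmt.Println(oldClient.C, newClient.C) // 10 42
}
```

The same idea carries over to protobuf-style RPC, where newly added fields decode to their zero value on old messages.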

I'd really recommend listening to Rich Hickey's "Spec-ulation" talk; it's about changing (growing and breaking) APIs.

The main part starts about 20 minutes in, but the whole talk is good. https://youtu.be/oyLBGkS5ICk

v2 as a suffix is ugly, so I'd probably have something like v2.DoSomething(a, b, c).

Beyond that, I'd typically see a single binary serving both v1.DoSomething and v2.DoSomething. Most of the code path would be shared, with v1.DoSomething calling the same method internally that v2.DoSomething does, except it provides some kind of reasonable default as the third argument. Vast majority of the code ends up shared.

Clients would choose which client library they prefer to use (and they can choose to use both, if they need to for whatever reason). Ideally older versions would end up deprecated over time, but it's rarer to deprecate than add a new version.
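A sketch of that layout, with hypothetical names: both versions delegate to one shared implementation, and v1 fills in an assumed default for the new argument, so the vast majority of the code path is shared.

```go
package main

import "fmt"

// defaultC is the assumed value v1 callers implicitly relied on.
const defaultC = 10

// doSomething is the single shared implementation both versions call.
func doSomething(a, b string, c int) string {
	return fmt.Sprintf("%s-%s-%d", a, b, c)
}

// DoSomethingV1 keeps the old two-argument contract alive by
// supplying the default for the parameter added in v2.
func DoSomethingV1(a, b string) string {
	return doSomething(a, b, defaultC)
}

// DoSomethingV2 exposes the new parameter directly.
func DoSomethingV2(a, b string, c int) string {
	return doSomething(a, b, c)
}

func main() {
	fmt.Println(DoSomethingV1("a", "b"))     // a-b-10
	fmt.Println(DoSomethingV2("a", "b", 42)) // a-b-42
}
```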

Go with mandatory error handling enforced by a linter is probably one of the best ways I've seen to reduce mistakes. Not all aspects of the language are perfect, but I'm confident it is loads better than exceptions-based approaches with catch-all exceptions and generic error handling strewn everywhere. I imagine this is especially useful from a banking perspective, where mistakes can wipe out millions.

Microservices however remove that, i.e. if a microservice fails to respond you have no indication whether it succeeded or not, which can lead to inconsistent state.

Using so many microservices seems like a good use case for Erlang. Googling, I found (1):

(1) http://blog.plataformatec.com.br/2019/10/kubernetes-and-the-...

> We have an RPC filter that can detect you are trying to send a request to a downstream that isn't currently running, it can compile it, start it, and then send the request to it.

How does that work? dns not available == service not running & non-prod -> contact deployment operator?

And what if those services need DBs? Or is the single cassandra/etcd instance enough due to standardization?

And how those automatic metrics work is interesting too. Probably a prometheus http handler auto-starting with tie-ins into rpc/DB access code? How automated can it get?

There are two ways to make a system: one so simple that there are obviously no issues, and the other so complicated that there are no obvious issues. The former is much harder.

Monzo is rapidly growing and has only been going for 4-5 years.

The real test will be when they have to retire or alter services, or face product/compliance changes that affect many services.
