I simply cannot believe that their product really calls for an architecture consisting of 1600 individual microservices.
Don't get me wrong, I don't think one huge monolith is ideal either, but how is this better?
Who at Monzo could possibly understand how a significant portion of their platform works, in detail, and be confident that knowledge isn't going to be outdated ten minutes from now? Who understands how changes to each of these tiny pieces will impact the others? What are they actually solving by doing things this way?
Would this abomination exist if they didn't feel the need to hire people whose primary objective is to get the words "microservice" and "Kubernetes" stamped onto their resumes?
Am I just out of touch? Is this the new normal?
To me it looks like an Alan Kay style of OOP design. From that standpoint, replace 'microservice' with 'class' (or 'actor') and 1600 of them isn't that ridiculous.
It'd be an interesting system to browse. I'm actually curious if they've built any tooling around making the codebase easier to navigate and edit. Smalltalk handles a huge amount of tiny classes by having good browsing tools. What would you use for a Go codebase of this size?
We tried it about 4 years back. Maybe there are better debugging tools now, but for us, the learning curve and the pain of debugging production issues made it not worth it.
We are not micro services, we are a 'monolith', but with over 1600 components our developers call 'applications' and over 7,500 touch-points between them our developers call 'interfaces'. A single 'customer information file' touches dozens and dozens of these 'applications'.
For us, at least, what's going on is regulation. Every aspect of banking is heavily regulated, and each component of the monolith has to be compliant.
A notorious example is check settlement order. When checks come in for payment, the bank is supposed to process those checks in an order favorable to the customer, not to the bank. For example: with $1000 in an account, you write one $995 check and a hundred $10 checks. If the bank pays the $995 check first, the hundred $10 checks all bounce, causing $3500 in bounced-check fees (at $35 per check). If the bank pays the $10 checks first, only the $995 check bounces, for a single $35 fee. This is now regulated.
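The two orderings can be sketched as a tiny simulation. This is purely illustrative: the $35 flat fee matches the arithmetic in the example above, but real fee schedules vary by bank.

```go
package main

import "fmt"

// bouncedFees pays checks in the given order against a starting balance,
// charging a flat fee for each check the balance cannot cover.
// The $35 fee is the figure from the example, not a real bank's fee.
func bouncedFees(balance int, checks []int) int {
	const feePerBounce = 35
	fees := 0
	for _, amount := range checks {
		if amount <= balance {
			balance -= amount
		} else {
			fees += feePerBounce
		}
	}
	return fees
}

func main() {
	// $1000 balance, one $995 check and a hundred $10 checks.
	largeFirst := []int{995}
	smallFirst := []int{}
	for i := 0; i < 100; i++ {
		largeFirst = append(largeFirst, 10)
		smallFirst = append(smallFirst, 10)
	}
	smallFirst = append(smallFirst, 995)

	fmt.Println(bouncedFees(1000, largeFirst)) // large check first: 100 bounces -> 3500
	fmt.Println(bouncedFees(1000, smallFirst)) // small checks first: 1 bounce -> 35
}
```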
There are countless such rules, and no dev team can know them all. So you assemble third party component software that has the rules baked in, or if trying to greenfield, you have compliance teams that study all the rules for their areas and bake them into the software requirements -- and good luck, because in the USA, these rules can be different state by state or even zip code by zip code.
Easier to just decompose your monolith into components maintained by a group that contractually guarantees regulatory compliance for their specialization, and then pray (or rather, spend months regression testing after any change for quality assurance) the whole thing works when you glom it together.
In that world, suddenly something like K8s makes a lot of sense.
Tbf, this can apply to any sufficiently complex system, whether or not it uses microservices.
It's why interfaces are so damned important, especially communicating them with stakeholders. Done properly, you only need to focus on the bits of a system that matter to you. At least microservices get to use OpenAPI.
Don't worry. I've suffered (and still do) as much as anyone working with microservices, trying to reconcile and decipher logs.
But total knowledge is rarely possible in a large system. At least the microservices approach acknowledges this. If anything, it can help by following Conway's Law.
That sounds large, but if you have 1600 microservices you are _most likely_ writing a single service where you'd normally write a class. Does 50 classes per engineer sound like a lot? Sure does, but they didn't create this infrastructure out of thin air; this is an accumulation of many years. For the sake of argument, let's say 2 years. Across 24 months, that's only 2 microservices per developer per month.
Not unthinkable, and this is assuming only 20% of their engineering force is actually productive.
After the n^th iteration of building an rpc server, such things may be fairly standardised, no?
I've seen this play out at Uber. Engineers there used to give conference talks about running thousands of microservices. After a year or two, those talks stopped. Why? Turns out, having thousands of microservices is something to flex about and makes for good conference talks, but the cons start to weigh after a while. When addressing those cons, you take a step back towards fewer, bigger services. I predict Monzo will see the same cons in a year or two and move to the more pragmatic approach of fewer, better-sized services that I saw at Uber.
In 2020, Uber probably has fewer microservices than in 2015. Microservices are fantastic for autonomy. However, that autonomy also comes with drawbacks. Integration testing becomes hard. The root cause of most outages becomes parallel deployments of two services that interact badly. Ownership becomes problematic when the person who owned a microservice on the critical path leaves, and no one else knew about the service. Oncall load becomes tough: you literally have people owning the 4-5 microservices that they launched. Small ones, sure, but when they go down, they still cause issues.
To make many services work at scale, you need to solve all of these problems. You need to introduce tiering: ensuring the most critical (micro)services have the right amount of monitoring, alerting, proper oncall and strong ownership. Integration testing needs to be solved for critical services, which often means merging multiple smaller services that relate to each other. Services need oncall owners, and a healthy oncall usually needs at least 5-6 engineers in a rotation, which also makes the case for larger services.
Microservices are a great way to give more autonomy for teams. Just beware of the autonomy turning into something that's hard to manage, or something that burns people out. Uber figured this out - other companies are bound to follow.
Yeah, exactly. Having two people dedicated to an "Account" service that has 10 features sounds much better than having 2 people responsible for 5 microservices each. You might end up with a reset-credentials service, a register-customer service and a delete-account service without any coherent overarching design strategy, instead of just having a plain CRUD service with 6 extra operations. I can't blame them for having 160 services, because a lot of enterprise-tier organizations are truly that complicated (I work at one), but I do blame them for having 1600.
Yep. I've encountered a production issue that was traced down to a Location service (you pass in some info and get location information back) that had been running for 3 years without an owner. The developers had all left, and the team was dissolved.
Not an inherent fault of microservices, but having 1000s of them running around will cause some to slip through the cracks.
"[480k] ... of its [Monzo] customers to pick a new PIN – after it discovered it was storing their codes effectively as plain-text in log files."
Lo and behold, pretty much every major outage they've had has been because of their Cassandra cluster. To be fair, they have addressed this now, but for it to take N outages and 5 years kind of tells me they could focus a little more attention on the properties people value in their banks, like, I don't know... uptime.
This is why they have to work pretty hard to make me switch over to it being my primary bank, and 1600 microservices communicating over the most brittle part of your stack doesn't win them any points in this space.
To put it in context I really can't remember the last time HSBC had a major outage... Sure it's boring but it works pretty much all the time.
Monzo seems more of a "60% of the time, it works every time" situation: great for sending your mates money, but I wouldn't trust it to pay my rent on time. Having 1600 microservices kind of explains this.
If there’s anyone at Monzo listening, please focus on not going down as everything else you do is great.
I think people's anxiety around this comes from thinking that they need to understand every single system in intimate detail, or even a significant number of them. On most days, you probably only need to understand what happens at the boundaries of the ones that your systems need to interact with. Good instrumentation means you have enough visibility into them without needing to understand the implementation details.
We literally call this product a puzzle.
The more complexity they have, the more people they can justify to hire, the more "growth" they get, the more investment money they can solicit.
This kind of complexity benefits everyone individually. London has a huge bullshit startup scene that enjoys complexity for complexity's sake (for the aforementioned reasons), so it's a good career move for these developers to have "microservices" on their CV. Managers also gain, as they get to manage a team 10x the size. The people above the managers can boast about how they're "scaling teams" and battling complex engineering problems (even though these problems are self-inflicted). The company enjoys it because it makes them look hip and is presumably good bait for investors (just like AI and blockchain). As a whole, everyone loses, because a simple task now takes 10x the effort and thus money. But nobody cares, because that downside is a long-term cost distributed across everyone, while the upside is a direct, personal benefit to someone's career.
When you're playing with investor's money it's a very good idea. When you're playing with your own money and doing a cost/benefit analysis it's a terrible idea.
"Any organization that designs a system (defined broadly) will produce a design whose structure is a copy of the organization's communication structure."

— M. Conway (Conway's Law)
To be fair, the article didn't say what I was expecting, which was "Our organisation is entirely flat".
Nowadays, bragging about your thousands of classes in your OO application earns you laughs, or at best pitiful stares - rightfully so.
I'm wondering if history is going to repeat itself...
I find it very painful to version API signatures and protocols:
Let's say service S1 has a method "DoSomething(a, b)" and you want to add a new feature that makes it "DoSomething(a, b, c)". How do you handle that? From what I've seen, you would do "DoSomething_v2(a, b, c)", but that seems hardly sustainable:
- you need a new version of the shared library for the new RPC calls
- you need all services to maintain a lot of code to support different signatures
And I don't think rewriting all the callers to use the new signature is really an option. Or maybe they keep previous versions of the services deployed and have some kind of routing that figures out which service supports which signature (plus probably some metadata, like a minimum version number?).
People of HN, what would be your advice on approaching this situation?
The main part starts about 20 minutes in, but the whole talk is good.
Beyond that, I'd typically see a single binary serving both v1.DoSomething and v2.DoSomething. Most of the code path would be shared, with v1.DoSomething calling the same method internally that v2.DoSomething does, except it provides some kind of reasonable default as the third argument. Vast majority of the code ends up shared.
Clients would choose which client library they prefer to use (and they can choose to use both, if they need to for whatever reason). Ideally older versions would end up deprecated over time, but it's rarer to deprecate than add a new version.
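A minimal Go sketch of that pattern, with made-up names (DoSomethingV1/V2 and the default value are purely illustrative): v1 keeps its old signature and delegates to the same internal implementation that v2 calls, supplying a default for the argument it lacks.

```go
package main

import "fmt"

// doSomething is the single real implementation; both API versions call it.
// The names and the default value below are hypothetical, for illustration.
func doSomething(a, b string, c int) string {
	return fmt.Sprintf("a=%s b=%s c=%d", a, b, c)
}

// DoSomethingV2 is the new signature exposed to callers.
func DoSomethingV2(a, b string, c int) string {
	return doSomething(a, b, c)
}

// DoSomethingV1 keeps the old signature working by supplying a
// reasonable default for the argument it doesn't have.
func DoSomethingV1(a, b string) string {
	const defaultC = 0
	return doSomething(a, b, defaultC)
}

func main() {
	fmt.Println(DoSomethingV1("x", "y"))    // old callers keep working
	fmt.Println(DoSomethingV2("x", "y", 7)) // new callers get the extra argument
}
```

The vast majority of the code path stays shared, so maintaining both versions costs little until v1 can finally be deprecated.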
How does that work? DNS not available == service not running, and in non-prod -> contact the deployment operator?
And what if those services need DBs? Or is the single Cassandra/etcd instance enough due to standardization?
And how those automatic metrics work is interesting too. Probably a Prometheus HTTP handler auto-starting, with tie-ins into the RPC/DB access code? How automated can it get?
The real test will be when they have to retire or alter services, or face product/compliance changes that affect many services.