
Why Monzo's bank transfers weren't working on the 30th of May - robinson-wall
https://monzo.com/blog/2019/06/20/why-bank-transfers-failed-on-30th-may-2019/
======
gregdoesit
This is a well-written post-mortem for public reading. I encourage people to
read through it.

Being someone who also works in the payments space currently, relying on
gateways, I have gone through several similar outages, where we detected a
gateway issue causing an outage, notified the gateway who ack’d... and then we
waited. More than one time, like Monzo, we built a workaround on our end,
before the gateway provider could even mitigate the outage.

Hats off to the Monzo team, who clearly have a solid oncall and incident
mitigation strategy in place. They detected the outage in 4 minutes, built a
workaround as best they could and deployed it in 2 hours, while it took the
gateway provider 9 hours just to mitigate the change that caused the issue in
the first place. Granted, the issue seemed complex, but this is still slow.

Unfortunately, in cases like this, the best one can do is make sure there is a
clear SLA in place with the third party, with a contract stating financial
liability in case the third party fails to meet this SLA. Monzo will not tell
us much about this part, but I suspect the gateway will have to pay a hefty
fee to Monzo, as their availability dropped to under 99% for this month, which
should trigger payments/fee reductions from the third party with a well-
written contract. It is good to see they are pushing the third party to do a
proper post-mortem and prevention actions, as well as holding them
accountable.
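
As a back-of-envelope check on the "under 99%" figure: the sketch below assumes the gateway counted as fully unavailable for the whole 9 hours and that availability is measured over the calendar month of May. Both are assumptions; the real SLA terms are not public.

```python
# Rough monthly availability for a single 9-hour outage in May.
# Assumption: the full 9 hours count as downtime (the post describes
# partial failures, so a real SLA calculation may differ).
HOURS_IN_MAY = 31 * 24            # 744 hours
outage_hours = 9                  # figure quoted in the comment above

availability = 100 * (1 - outage_hours / HOURS_IN_MAY)
budget_hours = 0.01 * HOURS_IN_MAY   # downtime allowed by a 99% monthly target

print(f"availability: {availability:.2f}%")          # availability: 98.79%
print(f"99% allows {budget_hours:.2f} hours down")   # 99% allows 7.44 hours down
```

Under those assumptions, a 99% monthly target allows roughly seven and a half hours of downtime, so a 9-hour incident exceeds it on its own.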

Nice work!

------
mwexler
What I think is fascinating is not just that we applaud Monzo for this, but
that we allow other important services that control our lives to get away with
revealing nothing about what happened or what they've changed to prevent it.
Can you imagine any large bank (for the US, say JP Morgan Chase, Citi, Bank of
America, etc.) putting out a note with this level of transparency,
accountability, and clear direction to change?

For more about what makes a good apology, see
[https://withoutbullshit.com/?s=apology&submit=Search](https://withoutbullshit.com/?s=apology&submit=Search)
by Josh Bernoff, a former Forrester editor and a very direct writer.

~~~
dx7tnt
I can't imagine one of those giant banks having this kind of outage.

~~~
robinson-wall
It happens all the time, and they just don't tell you. We get automated
notices when banks connect and disconnect from the Faster Payments network,
something happens every few days. Not always this length, but occasionally.

Just yesterday a major high street bank stopped sending payments for an hour,
and was telling customers on Twitter that there were no problems.

Hell, the central system (what I called the Hub in this article) had a 12 hour
split brain meltdown last July which had banks emailing each other
spreadsheets back and forth for two weeks afterwards.

~~~
7ewis
Who manages the 'hub'?

~~~
hwatson
Faster Payments is managed by Pay.UK who contract the infrastructure to
VocaLink (which is 93% owned by MasterCard).

Pay.UK are currently holding a procurement exercise to find a new operator for
Faster Payments [http://www.fasterpayments.org.uk/infrastructure-
renewal](http://www.fasterpayments.org.uk/infrastructure-renewal)

------
ziddoap
Clear, detailed but accessible, plans in place moving forward, apology read
sincerely, providing support to affected customers immediately, and answering
follow up questions to technical users who are interested in more detail.

A+ job on handling the unfortunate situation, Monzo.

We can only hope more companies follow this great example.

~~~
ccrush
Every time I see a company say "we're sorry" I can't help but think about the
South Park episode where the BP CEO says "we're sorry!" It's either that or
the one with the Time Warner employees with the nursing flaps in their shirts.

~~~
_carl_jung
Using "we" in any public announcement, writeup, blog post, etc. is always in
danger of sounding contrived.

~~~
ziddoap
Personally, I prefer something written in a relaxed style rather than a
formal-voice only in most cases, especially for blog posts.

Formal only generally comes across, to me, as cold and distant. Great for a
persuasive essay or other mediums where you want to remove the topic from the
author, not so great for communicating with your audience and wanting to come
across as sincere.

If anything, a strict formal-voice only _blog_ post would come across, to me,
as contrived.

To each their own.

~~~
_carl_jung
I'm not advocating for formal over relaxed. I find blog posts work best in a
relaxed style from the first person singular. This sort of apology should come
from an individual at the top, and make reference to the whole team. You're
probably right about personal differences though.

------
robinson-wall
I just posted this semi-technical post-mortem on Monzo's blog about why we had
an outage with Faster Payments (UK bank transfers) last month.

I'll hang around here to answer any more technical questions if anyone's
interested.

~~~
hc91
You guys are using Form3 for FPS, is this correct?

~~~
wrboyce
We can take an educated guess...

[https://status.form3.tech/incidents/wyhyxydxgh30](https://status.form3.tech/incidents/wyhyxydxgh30)

~~~
robinson-wall
I would note that this status page says "Our FPS Direct gateway provider".

~~~
wrboyce
Fair point. FWIW I thought your postmortem was excellent, and certainly puts
Form3's to shame regardless of where the blame lies.

------
playpause
This is a perfect post-mortem. Their communication and support has always been
really good. I've been using Monzo as my primary bank account ever since they
registered as a bank, and I've converted a lot of friends to it. But... over
the last year, the iOS app has fallen in quality: long UI freezes, frequent
sign-outs with no explanation, silly UI bugs. My non-technical friends have
noticed the same issues. It's a real shame.

~~~
Nextgrid
Agreed. This caused me to leave them for Starling Bank, though I’m considering
switching back - I’d rather take a faulty app but good customer support than a
good app but no support at all.

------
yingw787
@robinson-wall Nice writeup, definitely raises the standards in the banking
industry! I have a few questions:

1\. Was this post-mortem part of an official process or something of an
individual initiative? I saw it published on the blog, but it might be helpful
to have this information disambiguated from marketing material on a separate
site:
[https://status.cloud.google.com/summary](https://status.cloud.google.com/summary)

2\. I'm not sure how payment processors work, but would having multiple
payment processors from Monzo's interface make sense from a cost/benefit
perspective?

3\. Any plans to expand to the U.S. anytime soon, or recommend any banks that
follow Monzo's best practices? ;-)

~~~
robinson-wall
1\. A mix of both, we have a culture of being transparent by default - it's
one of the first things that attracted me to come and work here. I was the
incident lead for this on the day, and volunteered to write up this post-
mortem. I did have help from colleagues in the marketing team to try and make
this as accessible as possible.

As another poster mentioned we already have a status page where we post about
incidents as they happen (though obviously not in quite as much detail as
here). Personally I think our main blog is a reasonable place to have this.

2\. Multiple redundant payment processors would be great, but ultimately
infeasible. As a settling FPS participant we have to have a single Bank of
England settlement account, tied 1:1 to a "bank code". Multiple sort codes map
to a single bank code, and migrating sort codes between bank codes is non-
trivial.

It'd be great if we could migrate sort codes easily between redundant
connections, but as we build our own Gateway we'll have complete control over
how our failover mechanisms work. Here's to much greater uptime in the future!

3\. As another commenter mentioned - yes! We're just doing staff testing for
now, but we've got a waiting list up. It'll be a prepaid product issued by
another bank before we get a US banking license, just like we were in the UK a
couple of years ago.
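
The constraint in point 2 can be pictured as two lookups. This is only a sketch of the relationship described there; every identifier and value below is made up, and it is not how any real participant stores this data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BankCode:
    code: str                 # hypothetical identifier
    settlement_account: str   # exactly one settlement account per bank code


# All values are invented for illustration.
PRIMARY = BankCode(code="BANK-A", settlement_account="BOE-ACCT-1")

# Many sort codes map to one bank code.
SORT_CODE_TO_BANK_CODE = {
    "00-00-01": PRIMARY,
    "00-00-02": PRIMARY,
}


def settlement_account_for(sort_code: str) -> str:
    """Resolve the settlement account a payment to this sort code settles against."""
    return SORT_CODE_TO_BANK_CODE[sort_code].settlement_account


print(settlement_account_for("00-00-01"))  # BOE-ACCT-1
print(settlement_account_for("00-00-02"))  # BOE-ACCT-1
```

Because every sort code resolves through the same bank code, failing over to a second processor would mean remapping sort codes, which is the non-trivial part.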

~~~
breakingcups
I'll ask unashamedly, any plans for the EU?

------
PhantomGremlin
The software bug at the heart of the problem:

 _The bug was in a computer program the Gateway uses to translate payment
messages between two formats. When the program was operating under load, the
system tried to clear memory it believed to be unused (a process known as
garbage collection)._

 _But because it was using an unsafe method to access memory, the code ended
up reading memory that had already been cleared away, causing it not to know
how to translate the date field in payment messages._

So apparently a dangling reference.

~~~
seanmcdirmid
Is that really proper use of the term garbage collection? If you are doing
memory management manually, it sounds more like the lack of garbage
collection. Unless they were using an unsafe GC for C/C++?

~~~
jey
Sounds like they were hanging onto a pointer to an object allocated by GC. For
example, in Python/C API if you use a borrowed reference PyObject* after it
has gone out of scope and been GC'd.
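
A borrowed `PyObject*` can't be reproduced safely from pure Python, so here is an analogy using `weakref` instead: a reference that does not keep its target alive. The class name is hypothetical and this is not the Gateway's code. The difference is that a weak reference tells you the object is gone, while a raw pointer just reads whatever is there.

```python
import gc
import weakref


class PaymentDate:
    """Stand-in for a value owned by the garbage collector (hypothetical name)."""

    def __init__(self, value: str) -> None:
        self.value = value


owner = PaymentDate("2019-05-30")
borrowed = weakref.ref(owner)   # does not keep the object alive

assert borrowed().value == "2019-05-30"   # fine while the owner still exists

del owner        # the last strong reference goes away
gc.collect()     # make collection explicit for non-refcounting interpreters

# A weak reference reports the loss safely; a raw pointer in unsafe code
# would instead read whatever now occupies that memory.
print(borrowed())  # None
```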

------
edraferi
This is a very well-written postmortem. It’s clear enough that a non-technical
customer affected by the outage could understand the explanation, at least at
a high level. It’s also detailed enough that a technical person can trace the
root cause to a buggy garbage collector in a format transformation function.
The whole thing uses clear language with a bare minimum of jargon. Nice work!

~~~
aeorgnoieang
> the root cause to a buggy garbage collector

Or, rather, unsafe access of memory managed by a garbage collector:

> The bug was in a computer program the Gateway uses to translate payment
> messages between two formats. When the program was operating under load, the
> system tried to clear memory it believed to be unused (a process known as
> garbage collection).

>

> But because it was using an unsafe method to access memory, the code ended
> up reading memory that had already been cleared away, causing it not to know
> how to translate the date field in payment messages.

------
retube
What I still don't understand with bank transfers is: what control is there to
ensure that debits and credits are offsetting? Doesn't this rely on the bank
being honest? Can't the sending bank just not debit the sender's account?

~~~
jessaustin
The sending bank is on the hook for that money when everything gets settled
up, so it has a strong incentive to perform that debit.

~~~
retube
what is this settling up? how does that work?

~~~
lukevp
I can speak to how it works for card processing, and ACH is most likely
similar. To participate in payment processing in the banking network, you have
to have a Merchant ID that is tied to a bank account. The processor or gateway
is holding a suspense/escrow account on your behalf throughout the day and
when a batch of transactions settles, it will resolve the balance difference
with your bank account.

The amount of payments allowed into or out of your escrow account is set by
the processors based on your company's financial health and a risk analysis,
since if you just debited say $10 million from the escrow account and you only
had $5 in your account, the processor would need to collect that debt from
you, and they do not have a guarantee that they'll be able to do so. This is
how it works for debit cards and bank accounts since $ amounts are real.

It's slightly different for credit cards because the $ amount is in a way
fictional, so they don't do the escrow holding and just temporarily "allocate"
part of the credit limit (this is called an authorization) and when it is
settled this is "captured", which enqueues the authorization for future
processing. A few days later it will process and be included in a lump sum of
funding into your merchant account.

This reply is my personal understanding and meant for educational reasons and
doesn't represent opinions or viewpoints of any company, and should not be
considered advice of any kind and it may be inaccurate.
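
A much-simplified sketch of the "settling up" idea from this subthread: payments are recorded during the day, then each participant's obligations are netted into one position. Bank names and amounts are made up, and real schemes add limits, collateral and multiple settlement cycles that are not modelled here.

```python
from collections import defaultdict

# (sender_bank, receiver_bank, amount_in_pence); all values are invented.
payments = [
    ("BankA", "BankB", 10_000),
    ("BankB", "BankA", 4_000),
    ("BankA", "BankC", 2_500),
]


def net_positions(payments):
    """Return each bank's net position; negative means it owes at settlement."""
    positions = defaultdict(int)
    for sender, receiver, amount in payments:
        positions[sender] -= amount
        positions[receiver] += amount
    return dict(positions)


positions = net_positions(payments)
print(positions)  # {'BankA': -8500, 'BankB': 6000, 'BankC': 2500}
```

The positions always sum to zero, and BankA owes 8,500 regardless of whether it debited its own customers, which is why it has a strong incentive to do so.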

------
ablation
Thank you for posting this. Great read, and nice to see the team at Monzo
sharing this level of detail: consumable but still detailed.

------
GordonS
The project I'm currently working on has a QA lag of 4-5 _days_ for code to
reach production.

I'm seriously impressed they were able to deploy mitigations to production twice
in the same few hours, especially given they are a bank (and a small one, at
that), and the consequences of fucking up are enormous.

It's been said here many times already, but I'll join those saying "well done"
for handling this so well, and for the extraordinary level of transparency!

------
spiderfarmer
Somewhat related question: How can Monzo offer 1.55% interest while the
interest with most banks is around 0.3%?

~~~
djhworld
The savings accounts are offered by third party banks, not Monzo.

The 1.55% rate is fixed term for 12 months with no withdrawals

------
sandGorgon
Just curious - what's the stack you guys run?

I'm wondering what do you use to call these external processing APIs. I assume
these are blocking calls.

~~~
robinson-wall
There's a good writeup by Oliver, our head of engineering, about our tech
stack on our blog[1] with an accompanying Kubecon talk[2].

TL;DR- Largely Go microservices running on k8s, with http-based RPC calls for
synchronous communication, and kafka for asynchronous communication.

As for sending and receiving of this kind of payment message, they are largely
async but it does depend on the payment system we're talking about. When we
build our own FPS gateway we're going to have to have something to manage
"sessions" (TCP connections) which will block waiting for a response to an
individual payment message. Right now our communication with our third party
Gateway is via a queue.

[1]: [https://monzo.com/blog/2016/09/19/building-a-modern-bank-
bac...](https://monzo.com/blog/2016/09/19/building-a-modern-bank-backend/)

[2]:
[https://www.youtube.com/watch?v=YkOY7DgXKyws](https://www.youtube.com/watch?v=YkOY7DgXKyws)
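
To make the "via a queue" point concrete, here is a minimal in-process sketch of the pattern. It uses Python's standard `queue` and `threading` modules purely as stand-ins; it is not Monzo's Go/Kafka setup, and the message fields and worker are hypothetical.

```python
import queue
import threading

outbound = queue.Queue()   # stands in for the queue to the third-party Gateway
results = {}               # payment id -> status, filled in asynchronously


def gateway_worker():
    """Hypothetical consumer: takes messages off the queue and records a result."""
    while True:
        message = outbound.get()
        if message is None:          # sentinel: stop the worker
            outbound.task_done()
            break
        results[message["id"]] = "accepted"
        outbound.task_done()


worker = threading.Thread(target=gateway_worker)
worker.start()

# The sender enqueues and moves on; it does not block on a per-payment reply.
for payment_id in ("p1", "p2", "p3"):
    outbound.put({"id": payment_id, "amount_pence": 1_000})

outbound.put(None)   # ask the worker to stop
outbound.join()      # wait here only so the example can print a final state
worker.join()
print(results)       # {'p1': 'accepted', 'p2': 'accepted', 'p3': 'accepted'}
```

A session-based gateway would look different: each connection would send one message and block until its response arrives, so something has to manage a pool of those sessions.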

~~~
sandGorgon
actually I kind of like this one - [https://softwareengineeringdaily.com/wp-
content/uploads/2018...](https://softwareengineeringdaily.com/wp-
content/uploads/2018/06/SED616-Monzo-Bankbuilding.pdf)

We have learnt a lot from you guys as we build out similar systems in India.
Thank you for putting this stuff out!

Quick question that I have always wondered about - would you have used
something like Uber Cadence
([https://github.com/uber/cadence](https://github.com/uber/cadence)) as the
core of your infrastructure if it had been available back then?

------
baby
I've been using Monzo less and less since I moved to the US due to the cost of
topping it up. It's really sad that there is no true equivalent to Monzo here
:(

~~~
Qasaur
TransferWise has a debit card and full banking facilities, I assume they are
available in the United States?

~~~
xchaotic
Yes, I used Transfer Wise borderless card in the US in 2017

------
kjlfhg8
Not related to the outage, but any plans to provide banking on PCs instead of
just phones, and any plans to provide small business accounts in the future?

~~~
lol768
Yes, they've had job positions open for more web work. I think they will
expand this.

They already offer business accounts. I have one open for my Ltd company.

~~~
ownagefool
I registered my interest a while ago but they haven't given me one yet, so I
wouldn't say they offer business accounts, so much as they're going to/are
testing this.

Starling does offer business accounts now, but you can only have one Person of
Significant Control, i.e. over 25% owner. There is no monthly fee with their
offering though, so it's probably the better offer.

~~~
lol768
Note that Starling will prevent you opening a business account if you've held
a personal account in the past and closed it. That was the position I found
myself in.

For what it's worth, I've seen some suggestions regarding business account
pricing on the Monzo Slack, and there are plans for separate tiers (including
a free tier which lacks some of the more advanced accounting integrations).

~~~
ownagefool
I use a monzo personal account, would be more than happy using them for my ltd
if and when they support me doing so.

------
peteretep
> They later tell us they believed that datacentre was introducing the
> corruption

What now? Their datacentre was ... rewriting (presumably) encrypted packets?

~~~
robinson-wall
Sorry, perhaps this isn't very clear as I've tried to simplify the explanation
to make it accessible to a wide audience.

What I meant here is they could tell that the corruption was being introduced
by some component in their infrastructure, and they were only observing it for
messages passing through one of their two active-active sites.
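
One way to see "only one of two sites" in practice is to break failures down by site. The sketch below uses invented observations and hypothetical site names; it is not the Gateway provider's tooling or data.

```python
from collections import Counter

# Made-up sample of (site, date_field_valid) observations.
observations = [
    ("site-1", True), ("site-1", True), ("site-1", True),
    ("site-2", True), ("site-2", False), ("site-2", False),
]

total = Counter(site for site, _ in observations)
corrupt = Counter(site for site, ok in observations if not ok)

for site in sorted(total):
    print(site, f"{corrupt[site]}/{total[site]} corrupt")
# site-1 0/3 corrupt
# site-2 2/3 corrupt
```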

~~~
noir_lord
I understood that as you intended.

It's a fine line between being understandable to laymen and people being
pernickety, sadly.

------
osrec
Tldr; unsafe memory management in a third party's software corrupted dates
(under high load, due to garbage collection), causing transactions to fail or
get reversed.

------
nvr219
You can do anything at monzocom

