
An IT migration corrupted 1.3B customer records - fagnerbrack
https://increment.com/testing/what-broke-the-bank/
======
dimitar
[https://www.tsb.co.uk/news-releases/slaughter-and-may/](https://www.tsb.co.uk/news-releases/slaughter-and-may/) \- The bank has commissioned an independent report

[https://www.tsb.co.uk/news-releases/slaughter-and-may/slaugh...](https://www.tsb.co.uk/news-releases/slaughter-and-may/slaughter-and-may-report.pdf) \- the report itself

Interesting notes from the summary:

* Functional testing took 17 months, and began very late due to project delays.

* The board was told there were only 800 bugs, yet the independent review found there were more like 2,000.

* Non-functional testing seems to have been rushed.

* The Sabadell COO reported that SABIS, one of its subsidiaries and the major IT contractor on the migration, didn't have the capacity to respond to and resolve incidents after go-live.

And the TSB board decided to go live anyway.

Looks like a classic IT failure story: the project is late, costs are cut on
operations, everyone gets tired of testing and fixing and decides it is good
enough to be released. But there are also some very weird incentives: SABIS
was apparently not the best vendor for the job, but it is part of Sabadell,
the bank that bought TSB.

The "Integrated Master Plan" Gantt chart of the project on page 44 is
Waterfall as hell, and they still failed to plan for non-functional testing.

These are some first impressions; it looks like an interesting read.

~~~
alecco
In UK/EU banking the way to get promoted is to be a yes man. Business has
complete control and abhors IT. Impossible deadlines and hands-off approach
make atrocious code bases. And this becomes a never ending feedback loop going
on for decades. There’s never time to pay down technical debt. And now with the
replacement of commercial staff for IT staff there’s a growing divide. Stay
away.

~~~
drcross
I have experience of this and can confirm. Banking is not a sector for young
smart people because it will wear you down. It's a constant battle of change
approval boards, fighting their risk-averse fear of causing service
disruption while at the same time expecting things to be delivered. Save
these types of roles for when you are old and want a boring job.

~~~
temporallobe
It’s the same in the US Federal government IT space, especially when dealing
with archaic legacy systems. Innovation is nearly impossible and change
approval boards reign. Everyone is risk-averse because of previous breaches,
even to the point where engineers are not allowed to run the code they wrote
themselves. You heard that right. We can write code, but we can’t run it
ourselves.

~~~
lowercased
Had similar experience talking to ... someone at fidelity. I can't remember
his title, but we were trying to negotiate some data exchange with them. He
asked for our development workflow/process, and balked when he saw that a
"developer" would also be the same person who had "access" to the database
with customer information.

"Developers can not touch the database"

Umm... someone needs to, and we're a small team.

"Developers can not touch the database"

"Why not?"

"They can exfiltrate data".

"So could someone else who has access to the database"

"Developers can not touch the database"

"If we give a non-developer access to the database, by definition, they have
access, and therefore the ability to exfiltrate data"

"Developers can not touch the database"

"Can you provide some examples of how other companies who've passed your
inspection are making any sort of database changes/upgrades without someone
having access to the database?"

"Developers can not touch the database"

And that ended it apparently.

~~~
orf
This is a pretty normal requirement. I believe what they are saying is that
you should not be able to access sensitive customer data by virtue of “just”
being a developer, and you should audit who has access to such data and why.
And by “touch” they mean unfiltered access, not “you cannot do anything with
the database and therefore cannot work”.

This is pretty basic information hygiene, and if you don’t have enough care to
implement even that then why should they partner with you? Sounds like a
liability waiting to happen.

~~~
lowercased
> This is pretty basic information hygiene

They would not give us any indication of what they expected.

> I believe what they are saying is that you should not be able to access
> sensitive customer data by virtue of “just” being a developer, and you
> should audit who has access to such data and why.

It should be pretty straightforward for them to provide a "here's our initial
base minimum requirements guidelines" and work from there.

The parameters they laid out were similar to "anyone who would be capable of
running any code on the database should not have access to any data". Every
clarifying question about "what about X? what about Y?" was met with
basically, "no, that person could exfiltrate the data". Which, yes, anyone who
has access to the data could, theoretically, exfiltrate the data.

I showed a demonstration (from laptop) and the guy got really upset that I had
'customer data' on the screen and in my database. It took several minutes of
back and forth to explain to him that it was "faked" data.

~~~
orf
It sounds like your team has no clue how to approach these discussions and
lost out on the partnership. Fidelity don’t know you from Adam, and you
seemingly gave no impressions that you had any idea what they wanted or how
this process works.

They shouldn’t have to spell out how to protect data to you, and simply
asking “what about X”, as if you were the first to ever point out that there
are of course ways data could be exfiltrated, betrayed your ignorance, not
theirs.

Did you expect them to say “my god! Of course! Why didn’t I see this before?
This regulation we are legally and contractually bound to obey is not perfect,
so hey why don’t you just ignore it.”?

And to top it off you started showing realistic looking data without being
upfront and clear to them that it was synthetic? Come on.

~~~
lowercased
Not the same as "we're doing this and... " being cut off and being told "this
is 100% wrong, and we will not tell you any scenarios that will bring you in
to compliance with our view of secure".

We had plenty of secure process in place. The moment he heard the word
"developer" and "database", he basically shut down and refused any meaningful
engagement.

It is not his job to educate on secure practices. It was his job to inspect
what we had in place, and he stopped doing that the moment he heard those two
words.

"Would having a non-developer administrator connect and process the changes work?"

"Developers can not have any access to any production data".

That's not a meaningful response to the question.

> And to top it off you started showing realistic looking data without being
> upfront and clear to them that it was synthetic? Come on.

I'm not sure how much more 'up front' you can be besides saying "we develop
with faked sample data, this is faked and unrelated to or connected with the
production system" at the start of that walkthrough. Perhaps if we'd opened up
a Northwind database it might have looked more plausible?

No one was expecting them to ignore anything. We were expecting someone to ask
"are you doing X? how are you doing Y?", then followup with "you need to
remediate these aspects of your system before we proceed further". That's what
I would expect.

~~~
lima
That's because the person you talked to probably did not know, either. They
were basically reading off a checklist.

------
theonemind
"making individuals within a company responsible for what goes wrong within
that company’s IT systems" seems patently ridiculous, especially without
"paying individuals in proportion to value generated by IT systems".
Otherwise, you have an unlimited downside, and a hard capped upside, which,
personally, means I quit IT and start serving coffee for a living.

It gets a bit murkier with executive positions, in my opinion, since the
unlimited upside does start coming into play.

~~~
mrweasel
If companies want to hold engineers responsible to that extent, they would
also need to give them a way out. Meaning that if a developer, regardless of
deadlines, does not believe that his code is production ready, then you either
delay the launch or the responsibility shifts to the manager who chose to
ignore the developer's concerns.

As developers and engineers we’re responsible for letting management know if a
project is not on track. What they choose to do with that information is their
problem.

~~~
ignoramous
Individuals do get the stick, and it does make me uncomfortable. For example,
here's an Etsy engineer on how a bunch of people were fired for fudging the
scale-out:
[https://mobile.twitter.com/mcfunley/status/11947137113378529...](https://mobile.twitter.com/mcfunley/status/1194713711337852928)

Discussion:
[https://news.ycombinator.com/item?id=21849977](https://news.ycombinator.com/item?id=21849977)
(but not many folks discussing the layoffs).

------
3fe9a03ccd14ca5
> _Perhaps the easiest way to avoid outages is to simply to make fewer
> changes_

Sure, here’s another option: hire good engineers and pay them the market rate
for good engineering.

We all know what _really_ happened here: difficult engineering work was
outsourced to a cost leader.

Even if they had better “regression testing” (as the author laughably says was
missing) this project was doomed with this leadership.

~~~
latch
Outsourcing to save money is a straw man.

There are countless stories of large scale failures involving non-consultants
as well as very well paid consultants.

Software is hard and most developers aren't very good. As the complexity of
software increases, the number of developers and teams that can manage it
becomes vanishingly small; to a point where no amount of money can help.

The best chance for a project like this to succeed is to break it down and
migrate over the course of years (which might not always be possible).

~~~
BrandoElFollito
I witnessed the outsourcing of 3,000 IT staff at a large company in the
mid-2000s. The contract was great for the outsourcing company and was signed
off by a bunch of bozos who had no idea how IT runs.

The idea was to have tickets for everything, with "golden" ones which cost a
fortune but are super high priority.

IT was mad at being outsourced and played the game. "Super important" thing
coming? Ticket. "Just a cable, man"? Ticket.

What the idiots in management discovered is how THEIR company actually works.
How important the human relationship is.

One of my colleagues in IT had the pleasure of telling a board member, who
came expecting to be serviced right away, to fuck off (his words) and to open
a ticket, if he even knew what that was. Then he picked up the phone as
someone was calling and turned away.

~~~
toyg
_> IT was mad at being outsourced and played the game._

This is called "work to rule", and anybody who doesn't know the risks and
repercussions of such an event should not be in business management in the
21st century. Unfortunately, a lot of people try to pretend that trade
unions never existed, and in so doing they lose an excellent occasion to learn
how workers actually think and act.

------
tnolet
Apart from the opening paragraphs, there is zero information about what
actually went wrong here.

I’m super curious how customers ended up seeing other people’s accounts. That
seems like a massive flaw in logic somewhere.

Sadly, the article goes into “general history and commentary” mode way too
quickly.

~~~
wolfgang42
_> I’m super curious how customers ended up seeing other people’s accounts._

I don’t know anything about this particular case, but the usual cause of this
is over-zealous caching (usually by an intermediate load balancer or the like)
that returns cached pages while ignoring the session data. The result is that
one person logs in and gets their account info, and then everyone else who
tries to view the same page gets _that_ person’s info instead of their own.
For example, this happened to Steam in 2015:
[https://arstechnica.com/gaming/2015/12/valve-explains-ddos-i...](https://arstechnica.com/gaming/2015/12/valve-explains-ddos-induced-caching-problem-led-to-xmas-day-steam-data-leaks-and-downtime/)

Of course, there are also other possibilities, but this is the most likely.
Alternatives that come to mind (though I’ve never seen any of these actually
happen):

\- A bad RNG creating the same session token for multiple users

\- Concurrency bug in the web server returning results to the wrong connection

\- Messed-up migration causing account IDs to be associated with the wrong
credentials (maybe IDs from the old and new systems got mixed up)

~~~
bonzini
> Messed-up migration causing account IDs to be associated with the wrong
> credentials (maybe IDs from the old and new systems got mixed up)

I would put money on this one.

------
Buge
> The paper also suggests a potential change to regulation—making individuals
> within a company responsible for what goes wrong within that company’s IT
> systems. “When you personally are liable, and you can be made bankrupt or
> sent to prison, it changes [the situation] a great deal, including the
> amount of attention paid to it,” Warren says. “You treat it very seriously,
> because it’s your personal wealth and your personal liberty at stake.”

Being sent to prison for buggy code... that's quite extreme. Are people even
fired for buggy code? If we want to get more extreme than firing, how about
losing 1 year's compensation in addition to being fired.

~~~
_ZeD_
I'm not paid enough to be legally liable for the code I write at my job.

...and moreover, what about the usage of third party software? (worse if it's
free software...)

~~~
_trampeltier
Look at it another way. Almost every worker outside the software world is
legally liable for their work: from the chef in the kitchen (no poison in the
food) to the construction worker (the building should not fall apart) or the
electrician (proper insulation and grounding, etc.). I work in industrial
automation, and I would really like to see software become simple again.
Today's software depends on so many external things.

~~~
_ph_
In Germany, and I guess in most countries, almost no worker is legally liable,
as long as they are not actively ignoring the safety regulations applying to
their work. So to make a software developer legally liable, you first have to
set up a clear framework of "safety regulations" by which they can be judged.

~~~
_trampeltier
Yes, every job has its regulations, except pure software. Software people
seem to have no rules at all.

~~~
fierarul
I assume you never saw a 'workmanlike manner' clause in an independent
contractor contract?

Any industry, including ours, has best practices.

------
lsb
> The culprit was insufficient testing

The culprit was _not_ insufficient testing!

The culprit was insufficient testing, _plus_ a lack of restorable backups, on
multiple levels.

Last level of backup: there are 5 million records; use a few dozen reams of
paper and print out the account totals for every single account.

For all of the complex systems you can think of (planes, living organisms,
etc), the reason things usually go so well is that there are multiple levels
of checks and balances. Everything is usually veering towards entropy, and
fail-safe systems try to ensure when things go wrong that they fail safely.

~~~
ars
It's not possible to restore from backup when the transaction is between your
bank and a different bank.

You might restore from backup, but the other bank won't.

The issues here are far more complex than just keeping track of the current
balance: a first year CS student could do that part.

~~~
mcny
Wait. I don’t understand. If you have the opening balance and have all the
transaction logs, what are you missing?

I don’t see why it shouldn’t be possible to feed data to both the new system
and the old system for a while...

~~~
jyounker
But that requires forethought, planning, and a dedication to testing.

------
rullopat
Working as a test automation engineer, I'm not surprised at all. There is
little to no investment in testing at most companies because it's considered
just a cost. I can't count the times I asked for better servers for testing,
or for more time and people to build a better testing infrastructure, and the
answer was always "we will do it! We definitely need to get better at
testing!". Still waiting for any action. Most of the managers I met just want
to meet the deadline they set and ask: "can't you test it another way? How is
it possible that it takes so much time?". They don't give a damn about
testing; they just want a pat on the shoulder that everything is fine, and to
hand the responsibility to someone else if something goes wrong.

~~~
mtm7
Scientist: “I’ve created [new medicine]!”

Manager: “Perfect. Let’s sell it!”

Scientist: “Well, we’ll need to test it first. We should run some trials —“

Manager: “We definitely need to test more often! But I already agreed to have
it out by next Tuesday, so maybe we can just release it and aim for more
testing next time.”

~~~
mdpopescu
I call these "real scientists", because they only exist in books and movies.

In reality, the scientist will say "let's publish! who cares about testing,
publish or perish!".

~~~
rullopat
Publish or you'll be the trouble maker that will never have a career, that's
what you mean?

------
orangepenguin
> "Perhaps the easiest way to avoid outages is to simply to make fewer
> changes."

The exact opposite is true. When a process happens infrequently, it's more
likely that the people involved will make mistakes and overlook steps. The
correct answer is to make changes more often, and to develop robust processes
and automation around those changes.

~~~
aledalgrande
I think you are implicitly referring to change size too here: smaller changes
are better. Can probably be included in "robust processes".

------
fartbagxp
Ah, the classic enterprise big-bang migration.

Enterprises tend to favor big-bang migrations on a specific date because
somebody higher up set that date and everything falls into place on a Gantt
chart running waterfall. The reality is that it falls to a few technical
folks to triage a large number of teams, including the ones from the company
they're trying to break away from (which introduces friction). This adds
significant risk to the project.

"TSB chose April 22 for the migration because it was a quiet Sunday evening in
mid-spring."

This might've gone better if TSB had scheduled the long-duration migration
and testing to be completed months before April 22nd, with a period of weeks
or months for going live post-migration. The F5 load balancer (a hardware
commodity) could've slowly cut traffic over to the new site 10% at a time to
get a feel for the user experience. Coordination with the TSB network team
would be necessary to accomplish that.
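The gradual 10%-at-a-time cutover can be sketched with deterministic bucketing: hash a stable customer ID into a bucket from 0 to 99 and compare it to the rollout dial. This is a generic illustration of the technique, not TSB's or F5's actual mechanism, and the function names are invented.

```python
# Sketch of percentage-based cutover: each customer deterministically lands
# on the old or new system based on a hash of their ID, so raising the dial
# only moves *more* customers over -- nobody flips back and forth.

import hashlib

def routes_to_new_system(customer_id: str, rollout_percent: int) -> bool:
    """Bucket a customer into 0..99 via a stable hash; route to the new
    system if their bucket is below the current rollout percentage."""
    digest = hashlib.sha256(customer_id.encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < rollout_percent

# Turning the dial from 10% to 50% is strictly additive: every customer
# routed to the new system at 10% is still routed there at 50%.
ids = [str(i) for i in range(1000)]
at_10 = {c for c in ids if routes_to_new_system(c, 10)}
at_50 = {c for c in ids if routes_to_new_system(c, 50)}
assert at_10 <= at_50
```

The key property is determinism: keying on a stable customer ID (rather than, say, random sampling per request) means a customer never sees the old system one minute and the new one the next.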

It is a tough spot though, I hope the team learned something from that.

~~~
orthoxerox
> The F5 load balancer (hardware commodity) could've slowly cut over the
> traffic 10% at a time to the new migration site to get a feel for user
> experience.

I doubt an F5 load balancer would work in this specific case. But there
definitely should've been a software router-adapter that routed requests to
two systems and converted their replies to a single format. This would've let
them migrate their customers in batches instead of a big bang cutover.

That's what I've always done when migrating data to a new banking system.

~~~
jazzkingrt
What I've seen work well, for multiple migrations, (at an admittedly smaller
scale) is using shadow writes/reads and a source-of-truth toggle.

The API layer made requests to both the old DB and a new DB that had been
populated during a small window of scheduled downtime.

We spent a couple of weeks/months running checks in production to verify that
the old DB and the new DB were returning identical results, while still
returning the old DB's results as the source of truth. Eventually, we flipped
the source of truth to the new DB, and some time later decommissioned the old
DB.
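The shadow-write/shadow-read pattern with a source-of-truth toggle can be sketched roughly like this. It's a toy data-access layer, not the poster's actual implementation, and the class and field names are invented for illustration.

```python
# Sketch of shadow writes/reads with a source-of-truth toggle: write to both
# stores, read from both and log divergences, serve whichever store the
# toggle points at.

class DualStore:
    def __init__(self):
        self.old, self.new = {}, {}
        self.source_of_truth = "old"   # flipped after weeks of clean diffs
        self.mismatches = []

    def write(self, key, value):
        self.old[key] = value          # shadow write: both stores get the data
        self.new[key] = value

    def read(self, key):
        old_val, new_val = self.old.get(key), self.new.get(key)
        if old_val != new_val:         # shadow read: record, don't fail
            self.mismatches.append((key, old_val, new_val))
        return old_val if self.source_of_truth == "old" else new_val

store = DualStore()
store.write("acct-1", 100_00)
assert store.read("acct-1") == 100_00 and not store.mismatches
store.source_of_truth = "new"          # the cutover, with no big bang
assert store.read("acct-1") == 100_00
```

Because the toggle is just a flag, it can be flipped back instantly if the mismatch log starts filling up after cutover.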

~~~
Mouse47
Great approach imo. Once you get flipped over to the new source of truth you
have to make sure the business prioritizes decommissioning the old DB, though.
I've seen certain departments (looking at you, BI) treat the old database as a
permanent backwards compatibility layer.

~~~
8note
when the old database stops getting writes, it stops being useful pretty
quickly

------
Rainymood
The lack of testing is a symptom of a deeper root cause: cost cutting. The
audacious thinking that you can just shuffle around developers who have been
on the project for multiple years, outsource their work to cheaper countries,
count that as a win because the balance sheet looks better that year, and
then not expect this to happen. Honestly, there is a lot of hidden knowledge
that NEVER gets fully transferred in these kinds of knowledge transfers, and
you'll only notice it once shit really starts hitting the fan.

------
planetjones
I have been out of big bank IT for eight years now. Clearly I don’t miss it.
For one programme we had to produce upfront estimates. Clearly anything we
produced would be wildly inaccurate. But we tried. The IT director then cut
the estimate in half and added another hundred people to it. Cue complete
chaos and a massive delay. That was the calibre of senior management back
then. I doubt it’s much better today.

~~~
rsecora
A classic: "Adding manpower to a late software project makes it later". Fred
Brooks - The Mythical Man-Month.

------
a012
I didn't read everything, but the mistake they made resembles the one in "The
Phoenix Project" book[0]; it's awful but can realistically happen at many
corporations.

0 - [https://www.goodreads.com/book/show/17255186-the-phoenix-pro...](https://www.goodreads.com/book/show/17255186-the-phoenix-project)

------
alexfromapex
Having done a large data migration for medical records, I would be sweating
bullets doing this with presumably fluctuating transactional bank data. Out of
curiosity, does anyone have recommendations for their favorite data migration
tool?

~~~
andreskytt
IMHO it is unreasonable to expect these sorts of things to be doable with a
big-bang approach. The complexity exceeds our cognitive capabilities and
disaster strikes. Therefore, it is better to see these projects not as data
migrations but as functionality migrations: you move some customers to
another set of functionality. Thus, my recommended tool is to use none.

~~~
aledalgrande
100% - what I always do is start the new logic/system so it processes in
parallel with the old one and at the same time backfill the old data into the
new system. When everything is proven working, feature flag to switch the new
system as the main one, old one keeps running. After a period of quarantine,
switch off the old system.

I don't understand the requirement of doing the "nuclear button" migration,
except maybe as a shortsighted way of trying to save costs.

~~~
joshschreuder
We do this too and I think it's a great approach. One thing I will say is you
should try and decommission the old system as soon as feasible - running both
has additional testing and development costs, particularly if for some reason
you need to add new features to both.

We generally aim to leave the old system toggled off for a release or two
(allowing us to switch back in the case of a serious defect) and then rip all
that old code out in the subsequent release.

------
tarsinge
The article lacks substance; it just repeats « it’s a very complex system, it
was not tested enough » 10 times. I was hoping for some details.

------
enriquto
There's not a lot of detail in that article. But it seems they were dealing
with a moderate amount of data (about 100 GB?), which is trivial to back up
many times. They do not explain what the big deal with that is.

------
mongol
Very interesting article. But I miss a story about what happened in the days
after the disastrous migration. Could customers find a correct balance on
their accounts eventually?

------
xenihn
Potentially dumb question: I've done front-end (iOS) my entire career. If I
wanted to go from 0 to being able to do a SQL DB migration in production
without disaster, what should I do?

I saw [https://www.masterywithsql.com/](https://www.masterywithsql.com/)
posted on HN a few months ago, and I'm planning to finally start it over
break. But what do I use after that?

~~~
readonly
\- pick a relational database (ie: mysql or postgres)

\- learn how to backup the database (I suggest via the official documentation)

\- restore that backup (on a different environment)

\- validate the restore is complete (compare the data)

\- if the backup and restore are good, now you can start learning how to
migrate without the stress that you'll lose historical data
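The "validate the restore" step above can be sketched using SQLite as a stand-in for whichever database you pick: back up one database, restore it into another, and compare per-table row counts and a content checksum before trusting the procedure. Table and function names here are invented for the example.

```python
# Sketch of restore validation: copy a database via SQLite's online backup
# API, then verify the copy matches the source table-by-table.

import sqlite3, hashlib

def checksum(conn, table):
    """Row count plus an order-independent hash of the table's contents."""
    rows = conn.execute(f"SELECT * FROM {table}").fetchall()
    digest = hashlib.sha256()
    for row in sorted(map(repr, rows)):
        digest.update(row.encode())
    return len(rows), digest.hexdigest()

source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE accounts (id INTEGER, balance INTEGER)")
source.executemany("INSERT INTO accounts VALUES (?, ?)",
                   [(1, 100_00), (2, 50_00)])
source.commit()

# "Backup" and "restore": sqlite3.Connection.backup copies the whole DB.
restored = sqlite3.connect(":memory:")
source.backup(restored)

assert checksum(source, "accounts") == checksum(restored, "accounts")
print("restore validated")
```

For MySQL or Postgres the mechanics differ (`mysqldump`, `pg_dump`/`pg_restore`), but the habit is the same: a backup you haven't restored and compared is not yet a backup.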

~~~
tzmudzin
You're absolutely right, but we could all be talking about different subjects
here:

\- database restore

\- database migration (your case)

\- data migration to a different platform and presumably a different data
model (the bank's case, or my answer here)

Not sure which case the question was referring to, though...

~~~
xenihn
How about all 3 :)

------
duelingjello
A bank fails at risk management. Hmm.

A commonly held factoid statistic says a business that loses this much vital
customer data has a 50% chance of going out of business within 6 months. I
hope they made backups and didn’t just rely on snapshots, database
transactions, or journaling.

IT professional prime directive #0: don’t lose vital data.

------
crdoconnor
This ought to be an object lesson in not cheaping out on hiring good
developers and architects to run sensitive projects like this.

Alas, instead it's seen as an opportunity to push the risk onto developers.

With this kind of response and IR35 coming up I can foresee more TSBs
happening.

~~~
fierarul
Management was doing all right: the project was late and over budget. So,
money was there.

But I suspect money never reached the developers. Until the late panic stages
when it was too late to correct course.

------
peter_retief
I have bad dreams about this very thing; I can't even read the article.

~~~
de_watcher
Don't worry, there aren't any interesting details in it.

~~~
netsharc
Such a weak article with so much filler. The first several paragraphs were
interesting and then he goes on to say in the 60s only employees used the bank
computers, but nowadays (to paraphrase) "there's this thing called online
banking".

You don't say... (Insert Nic Cage meme here)

~~~
warent
I admit feeling bewildered at that random flashback and scrolling endlessly
just to get back to the point of the article

------
blendo
Article quoted 2500 person years, over a period of a couple of years, to move
from a Lloyds system to a clone of an existing Banco Sabadell system.

Does that seem like a lot of effort for 5 million accounts? Maybe worth it
since they were already paying EUR 100 million per year to license the old
Lloyds system.

I’d bet they moved from one big honking RDBMS to another. Curious if old and
new were different DB vendors, if anyone here has insight.

~~~
masklinn
> Does that seem like a lot of effort for 5 million accounts?

It's… close to a person-hour per account… At that point you might as well
migrate each account by hand.

------
jacquesm
Git 'blame' would take on a whole new meaning.

------
OliverJones
Important safety tip. If you hear any vendor executive brag about the large
number of people it took to do an IT project, seek out a different vendor.

I don't get how small-value transactions could turn into large-value
transactions on migration, though. Hollerith cards? Wrong column numbers in
the COBOL program? Decimal point / comma confusion? Whisky Tango Foxtrot?

~~~
tgsovlerkhgsel
There was a recent post about in-flight entertainment/information systems on
the front page (clickbaited with "Widevine" in the title), which showed a
(presumably _somewhat_ recently designed) JSON object containing flight data
information.

I think the date was "00YYMMDD" or something like that, time zone offsets were
represented as a float-represented-as-string, but to indicate a negative
offset they added 80000 or so...

So if that's what happens to systems designed today, imagine how legacy cruft
from decades ago must look. I would not be surprised if the answer to
"Hollerith cards? Wrong column numbers in the COBOL program? Decimal point /
comma confusion?" was "all of the above, and then some".

~~~
OliverJones
Oh, terrific, some ancient program in C, where somebody cobbled in a poorly
designed and untested C version of JSON.stringify(). That is 21st century
craniorectal inversion, not decades-old tech debt.

I started out working in that era (no COBOL, but all the rest of it). At least
some of us were suspicious of data-in-a-few-characters and character column-
number based records (ummm, punchcards). Some of us checked all kinds of
things on input to try to reject garbage. Spaces in the middle of numbers?
BOOM. Something unexpected in the "record type" letter? BOOM. The card reader
actually had a diversion output tray where we could spit out the rejected
cards.

But, still, lots of bad stuff got through.

------
masonic

      Then, in 1967, the world’s first automated teller machine (ATM) was installed outside a bank in north London
    

I wonder how much rework was needed when decimalisation came a few years
later.

------
ausbah
Best software practices for safety-critical systems seem essential. I don't
think it's crazy to think that some sort of regulation or licensing would
help enforce those practices.

~~~
DaiPlusPlus
Is banking “safety critical” in a legal sense, though?

------
StreamBright
It is not specific to finance; I would say the majority of IT operations are
like this. Why? Because this is a new profession, and because of ignorance.

------
mikece
This article reads like a dystopian, horror version of the rollout of
"Phoenix" in the book The Phoenix project.

------
fock
seems like consultants are aiming right for the three 9s, but it's the cloud,
so I guess 2 nines are ok eventually.

