Hacker News new | past | comments | ask | show | jobs | submit login
Ask HN: How do you update multiple disparate databases?
93 points by todaysAI 6 months ago | hide | past | favorite | 51 comments
Talked to a CTO of a small bank. He has 4 disparate legacy databases which need to be updated when a mortgage rate changes, which is obviously a problem.

What solutions does the HN community have in dealing with this issue?

Banks use an ancient but powerful architecture called Source of Truth (SoT), or system of record. It's one of the first techniques developed to manage heterogeneous distributed databases.

One master is picked as the final authority, then changes flow to other databases, replicas, etc. either using database tools or applications.

Awareness of deadlines or how much latency (ie. when the updates are needed for each downstream pipeline) is helpful.

A practical example that solves a simple case is the recent release of a tool by Netflix.

Pro tip: when you interview people for a project like this, they should already know the above. :)

Source: DBA.



I get the feeling OP is the one being interviewed.

Mortgage rates change frequently.

The bank likely deals with this exact scenario on a regular basis and should have an established procedure in place.

Source of Truth should (in my opinion) be used in many sectors but even finance doesn't use it everywhere. Having each data point originate in only one system helps a lot with data quality but is complex to introduce in existing systems.

No, no interview. He's a friend of mine but I kept his problem in my head to possibly provide some solution.

You underestimate what how unknowledgeable a small financial institution can be.

I was in a client meeting a while back where one of their consultants corrected me with 'system of record' when I said source of truth. Your comment made me wonder where this distinction/difference was from? Do you know the history behind this?

Some good answers in this thread but I can flesh this out more.

The distinction and purpose is clearer when you think about aggregating data, such as the various Corona virus trackers are doing. This is a modern example but the scenario applies as far back as humans have been recording data. (And will apply as far forward as well, because physics)

Right now every hospital or testing center is recording the tests they give and their results. Plus they record how many inpatients, how many ICU beds, how many ventilators, etc.

Each hospital has a (at least one) database where that data is recorded.

Separately various news orgs and political offices are keeping track of all of the counts for their region, etc. So some number of times a day they call each of the hospitals to get their counts. (Ok, they don't actually call, but you get the idea). And they aggregate those counts.

Depending on the time that the organizations call the hospitals, a given hospital will give them a different count. The count As Of a particular time.

So different aggregation organizations may all have slightly different counts throughout the day.

You in following the progression of the cases will have picked some particular source that you choose to give you a trustworthy count.

So- the hospital, in keeping Records- is a System of Record.

The news orgs as Sources of Trustworthy information for consumers are Sources of Truth.

The two seem synonomous at first blush- and colloquially when people treat them interchangeably they really at referring to Source of Truth- but on reflection it should make sense that they serve quite different purposes and have quite different requirements. And once you the distinction you see it everywhere. Hope that helps.

It totally depends on the instance:

Was it a piece of atomic data at the system of generation? Likely SoR

Was it an aggregated/mastered/other data? Likely Source of Truth

SoR = Where it came from

SoT = Where you get it from

FWIW I've heard both used interchangeably (in banking, adtech, and other tech verticals).

Edit: the linked wikipedia articles discriminate the terms by defining SSOT as a single place to store and edit every data element, while SOR allows for replicas, but in the case of data discrepancies, the SOR's state wins.

It's probably just what they call it.

That's been my experience as well. People call it different things. I have met some that are down right religious about the name and chide you for not using their term, but in the end it is the same thing.

I have had people try to explain to me how they are different, but when you remove all their BS, it is the same thing.

I have no idea about this particular case, but for a lot of older concepts the root of divergent terminology is IBM. IBM invented its own lingo for lots of things, and persisted in using it even when the rest of the world settled on a different term. (This seems to be a habit companies pick up when they become huge and dominant; see also Micro-speak and Google-speak.)

The thing everybody else called a "hard disk," for instance, was long referred to by IBMers as either a "Winchester disk" (see https://en.wikipedia.org/wiki/History_of_IBM_magnetic_disk_d... for the etymology on that one) or a "fixed disk."

Cf Emacs' "windows", "frames", "kill", "yank", etc...

> A practical example that solves a simple case is the recent release of a tool by Netflix.

Which tool is this?

What do non-banks use?

The same. The concept of a system of record is fairly universal in the Enterprise IT world.

Thanks - that's what I thought but OP's phrasing confused me.

Step 1) Generate a GUID

Step 2) Send a message to all the `downstream` systems (i.e. not the Source of Truth systems) with the change description and GUID

Step 3a) Update source of truth with the new info, write the GUID to a log table in source of truth in the same transaction

Step 3b,c,d) Downstream systems poll the source of truth for existence of the GUID in the log table indicating the change has been committed to source of truth. Load latest value from source of truth and commit the change to the downstream system.

Step 4) If GUID never shows up, transaction was rolled back in source of truth (or an error occurred trying to send to one of the other downstreams, aborting before transaction)- do something relevant for your use case. Alert somebody, carry on, whatever.

This seems entirely reasonable, but you forgot the 0'th step.

Create a source of truth. I'd bet a 'source of truth' internal to the bank doesn't exist right now, and they're probably pulling something manually from an external 3rd party.

And on with the regress..

Step -1) What form will this internal source of truth take, how much will it cost, and who builds that system?

Step -2) Who gets to make that decision?

My uncle used to do this kind of work and man, he has some horror stories. The common thread among his clients seemed to be 'We know we need to do something, we know it will cost a lot of time and money, and it absolutely cannot fail.' The hard part always seemed to be finding the person within the org who had both the authority to spend the money, the political clout to make the decision. Nobody wants to own something like that going wrong. The actual technical part of the work, they had that down pretty well as it was all they did.

That's table stakes as far as I'm concerned. If it can't be done quickly and correctly, you're in for a world of hurt. An organization in that position shouldn't be doing their own engineering. An individual in that situation should be brushing up their resume.

It's amazing how hard the "system of record" conversation can be.

Note that downstream systems can skip updates with this system, but will always be eventually consistent. Buyer beware

Is this a known pattern in distributed systems?

Found that SOT, SOR and "distributed transactions" are the related topics.

It's basically a generational cache pattern

Honestly, the solution depends highly on your circumstances. Microsoft SQL has a concept of a distributed transaction, but I'm going to assume that "4 disparate legacy databases" aren't Microsoft SQL.

Here's the high-level approach I would take:

1: Understand which databases serve what, and what the needs for accuracy are. (For example, a database serving their public-facing website can probably be a few seconds behind compared to a database handling real money.)

2: Figure out which database is the source of truth. Update that database first.

3: Update the remaining three databases in a loop: Update the database, then check to make sure the update happened

You will probably need to add some state management where you put yourself into an "updating" state, so that if the update breaks in the middle, you can go back and recover.


Another approach is to refactor the applications that are critical (Those that handle money, not the public facing web site) to always read from the database you decide is the source of truth.

So, the first solution to consider is if this needs a technical solution at all.

If it happens infrequently enough, you can have simply have a proceed "change mortgage rate" that touches the 4 legacy databases. Done.

If it does need a technical solution, you have to establish the leading system / source of truth / whatever you call it. This can be either an existing system or a new one. And then all the others need to follow the leading system.

A more distributed approach (every of the 4 legacy systems can generate events that the other 3 work with) likely won't work with 4 legacy systems. It's often hard enough to get high-fidelity events/deltas out of a single system.

There might be an addendum to this good answer: OP seems to be thinking they all need to show the same rates at the same time, so in essence you're looking for a "distributed transaction", which, as we all know... well, exercise for the reader and all that.

However, one technique I've used which works with all your approaches and guarantees consistency if the data model allows it is to have "future value" entries.

If you just have a single value:

    mortgage_rate_percent: 3.0
And this is in 4 databases, there will be a period where they don't match. But if you put a future "valid_from" in there:

    mortgage_rate_percent: 3.0
    valid_from: 2020-04-01 00:00:00-0000
Well, you now have the ability to have the rates all input at different times in different databases, and by querying for only the most recent one (or use from/to dates and find the intersection), they all become active at the exact same moment in time (clock drift aside) as time passes. No magic needed.

This seems really obvious once you've seen it, but it's amazing how many people try and worry about source of truth, synchronisation and distributed transaction problems when in many cases (like this), the change can be planned so you can just say "this value will apply in the future" and you have ample time to make sure it gets synch'ed before that time and everything "just works"

> This seems really obvious once you've seen it

I think that's true for 90% of patterns/problems in dev. Did you see that Prince of Persia post the other day? How the original creator came up with Shadow Man? Kind of resonates with what you are getting at in being obvious.

I embarrassingly have a 8 year old top StackOverflow post asking what a fluent interface was. Keeping in mind I'd been devving for 5+ years already at that point. It was just a "ahh F* yeah, that's obvious" moment.

There could be a rate-change once a week on average. Of course if there is a central bank change then probably every mortgage product will need to be changed that day.

Rates are the important data-item but some of the legacy DBs will need to be notified of changes like fees, as well.

Change Data Capture. - share the consulting loot when you get yours. https://en.wikipedia.org/wiki/Change_data_capture

It would be hard to formulate a one-size-fits-all solution even if you were talking about relatively modern systems.

There’s a trade-off between having a manual vs. automated process in the best of circumstances. You might think intuitively that an automated process would be easier and more reliable, but depending on local factors for each of the databases, there might be issues like availability or locking that could make an automated solution more “fragile” than simply having an employee with domain expertise doing an update on a schedule. Reliably automating a process like this would require a lot of careful QA.

So throwing legacy systems into the mix would just aggravate this further. It may be that the cost/benefit of automating this just isn’t there. If it’s cheaper in terms of salary cost to do it by hand for N years than the one-time expense to automate, it may just not make sense to automate. (Obviously over an indefinite time scale it would almost always make sense to automate)

This is such a good answer because sometimes a system where people are used to solve the issue is the right way.

And I say 'system' because as long as the company knows that Sally does the updating each day, but if Sally is on holidays or sick or quits, then there is a protocol where the work is done and the checks/balances are done to ensure that that mortgage rate change is done.

There are no standard mechanisms for keeping heterogeneous databases in sync, or even DBs of the same types but with different schemas.

The standard thing to do is to have a source of truth, as described in various comments here like https://news.ycombinator.com/reply?id=22690991&goto=item%3Fi...

Sometimes you'll have multiple sources of truth for (hopefully non-overlapping) subsets of the data in question.

These things are messy. You have to do the hard work of writing sync tools. You want sync tools that work off of incremental / delta feeds, but you also need (slower) tools that can make-it-right based on the totality of your data -- at least the first time you sync two systems, and often also for auditing purposes to make sure that the incremental (fast) sync is working correctly.

CDC via Kafka Connect is one very powerful option.

Goldengate might have a higher billing rate.

Oh, I've done this. The 4DBs are for forward-origination, servicing and reverse origination, servicing.

We had a prime app, a fifth DB driving it. It had scripts to pull rates and a UI to adjust. Logs all changes.

These rates are pulled by the other systems (eg the AS400 pulled over FTP). Additional internal apps could query over SMB and HTTP.

The prime updated slowly and each of the services would poll it very frequently (4m) + on-demand when needed

1) Create an internal endpoint for each of the databases to change the mortgage rate. Each one is different so you will need to do this 4 different times. EG: setMortgageRate() . This is PUSH not Pull. Create this too, while you are at it: getMortgageRate().

2) Then once this is standardized, write a for loop.

Need to watch out for update failure, some sort of transaction management/atomic commit[2] such that if one update fails, other updated dbs should rollback.

1. https://en.wikipedia.org/wiki/Atomic_commit

I researched a pretty long article about this and documented my findings here: https://www.rallyhealth.com/devops/database-updates-across-t...

This simplest way possible. If the rate change can be processed via an endpoint that endpoint then updates the 4 databases. If talking to ACID databases, do a 2 phase commit to reduce the chance of them getting out of sync. Send me 10% of your consulting fee please :-)

Surprised no one has mentioned two phase commit:


Debezium plus Kafka would be good here - see https://www.youtube.com/watch?v=fU9hR3kiOK0

Exactly how small?

A little planned downtime goes a long way.

Oh...you don't want to know that... (in the tone of J Frank Parnell explaining you don't want to look in his Malibu's trunk)

this is just a comment test

Send 4 API calls to update the 4 databases.

Name the bank so we know not patronize patronize it.

Hey guys check it out! This guy just solved distributed databases.

This works until one of the DBs is down, or someone queries two if them at the same time. Without rollback you get the TSB fiasco.

Not sure why this would turn you off to an organization as this is a very common problem.

My money is on your suggestion being what most organizations actually do, though perhaps with a workflow process to manage it.

Legacy systems are hard.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact