
Ask HN: What is your solution to data duplication across applications? - Dansvidania
I work in enterprise and there is an incredible amount of replication and &quot;de-normalisation&quot; across both old and new applications.<p>I am investigating methods of record linkage, but it does not seem like a solved problem.<p>Is there a tried and tested method to go about matching the records and fixing the duplication?
======
tixocloud
TLDR: Reframe the data duplication problem into a business problem.

It's an incredibly tough problem and I haven't seen it solved with technology
alone (mind you, I've been working at banks for the last few years and have
usually 100k+ databases so projects usually run out of money before all the
data is consolidated!)

From a theoretical standpoint, the way to solve this and the holy grail would
be a single source that becomes the system of record. Except all applications
are now attempting to become the system of record for their domain which they
really shouldn't.

There isn't really a tried and tested method to match the records and fix
other than to go in there and create pipeline rules that fix it before it's
being saved/ingested. Tools like Trifacta could help.

The real way to solve the problem is a competent IT team at both the executive
and operator level to create a proper strategy that also is consistently
delivering value as the data cleansing exercise is being done. What this means
is having mapped out all the systems, prioritize and identify where the
business gets most value from the data cleansing exercise and start fixing it
there.

The challenge is that data cleansing exercises are usually multi-year
projects. At the same time, you're expected from the business to deliver value
so you need to figure out the strategy and consistently get and maintain buy-
in from the business. Vendors are always touting their latest and greatest
which exacerbates the problems further.

