
Why Doesn't Anyone Want to Solve Duplication? - tabtab
I've been doing CRUD programming & design for a living for decades. These systems are typically for tracking, categorizing, and reporting/searching on domain "things": people, tasks, complaints, products, customers, etc. I'm surprised at the amount of redundancy in typical CRUD stacks. Info about a given field (RDBMS column) and its associations is *replicated* all over the place. My attempts to produce stacks that factor out duplication seem to confuse other developers such that they'd rather live with duplication.

There are always exceptions (variations) that need to be applied. For example, the default title of a given field might be too long for a given screen listing, because the user wants to see a lot of info on a single listing. Thus, an exception (override) is needed for that title on that page. A centralized "data dictionary" is useful, but cannot carry the entire load. Further, sometimes the variations need to be computed dynamically.

A way to manage these variations is needed that doesn't confuse developers. Perhaps if an industry standard were devised, then coders would just learn the factor-friendly system and we'd end the duplication. It feels to me the industry spends too much time chasing the latest UI trends/fads instead of factoring.

I find our tools are too rigid to even propose specifics of such a stack. Code and/or references to code would need to be managed in something akin to an RDBMS, and not a hierarchical file system nor OOP inheritance hierarchies. Trees are just too rigid. There are tools to automate duplication, but few that avoid it. Auto-dup only helps initial coding, not maintenance. Should we be expected to "just live with" duplication, or are our standards and tools lacking?
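
To make the "data dictionary plus overrides" idea concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the field name, the screen names, the `columns_shown` context key, and the `field_spec` helper are all hypothetical, and this is only one of many ways such layering could be done.

```python
from collections import ChainMap

# Hypothetical central data dictionary: one entry per RDBMS column.
DATA_DICTIONARY = {
    "customer.preferred_contact_name": {
        "title": "Preferred Contact Name",
        "max_length": 60,
        "required": True,
    },
}

# Hypothetical per-screen overrides. Only the deviations are stored.
# A value may be a callable when it has to be computed at render time.
SCREEN_OVERRIDES = {
    ("customer_list", "customer.preferred_contact_name"): {
        "title": lambda ctx: (
            "Contact" if ctx.get("columns_shown", 0) > 8 else "Preferred Contact"
        ),
    },
}


def field_spec(screen, field, ctx=None):
    """Return the effective attributes of `field` as shown on `screen`."""
    ctx = ctx or {}
    # Overrides are consulted first; anything missing falls through
    # to the central dictionary entry.
    layered = ChainMap(
        SCREEN_OVERRIDES.get((screen, field), {}),
        DATA_DICTIONARY[field],
    )
    return {key: (value(ctx) if callable(value) else value)
            for key, value in layered.items()}
```

A crowded listing would call `field_spec("customer_list", "customer.preferred_contact_name", {"columns_shown": 10})` and get the short title, while any screen without an override gets the dictionary default. The column size and the required flag are stated exactly once.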
======
jakequist
I've felt your pain. So much so that I took 6 months off and put a lot of
groundwork into starting a company that would solve this problem. But in the
end, I decided to abandon it.

I realized that in order to be 10x better than the alternatives, I was going
to need to solve some very tricky AI problems. For example, accurately
deduplicating a customer record "John Doe" vs "Johnathon Doe" is not
straightforward. Maybe it's two different people? Maybe it's just a spelling
mismatch? The system must have a great deal of context to accurately determine
if the data is indeed duplicated. And even if it does, perhaps there's a
perfectly good reason for the spelling mismatch (e.g. perhaps one table holds
his preferred name, while the other is just referential, etc.). In the end,
deduplication often comes down to the requirements of the company and it's
hard to generalize.

I think there's space in the market for this kind of business, but it'll be a
slog. Unless you have a 10x solution (i.e. super AI), you'll be competing with
the likes of Trifacta, etc. And it's hard to compete with that kind of sales
force.

Really good question. Thanks for posting.

~~~
PaulHoule
@jakequist I think Trifacta and similar tools are aimed at "data analysis"
more than operations.

I think the question is about line of business software and issues there are
very different.

For instance, there is a literature on record matching and good techniques
exist, but without an exception handling workflow you don't have a way to deal
with the unusual cases the code turns up.
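
As a rough sketch of what "matching plus an exception handling workflow" could look like, the Python below scores name pairs with the standard library's `difflib.SequenceMatcher` and routes the ambiguous middle band to a human review queue. The thresholds, the `triage` and `process` helpers, and the queue are all hypothetical. Real record matching would use more fields and better techniques than plain string similarity.

```python
from difflib import SequenceMatcher

# Hypothetical thresholds; real values would be tuned per shop.
AUTO_MERGE = 0.95
NEEDS_REVIEW = 0.70


def similarity(a, b):
    """Case-insensitive similarity in the range 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def triage(name_a, name_b):
    """Classify a candidate pair as 'merge', 'review', or 'distinct'."""
    score = similarity(name_a, name_b)
    if score >= AUTO_MERGE:
        return "merge"
    if score >= NEEDS_REVIEW:
        return "review"
    return "distinct"


def process(pairs):
    """Split pairs into automatic merges and a queue for human review."""
    merged, review_queue = [], []
    for pair in pairs:
        decision = triage(*pair)
        if decision == "merge":
            merged.append(pair)
        elif decision == "review":
            review_queue.append(pair)  # the unusual cases land here
    return merged, review_queue
```

With these thresholds, "John Doe" vs "Johnathon Doe" is neither merged nor dismissed. It goes to the review queue, which is the part most matching code leaves out.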

I would love to talk and share notes about what you did.

~~~
tabtab
Paul, you are correct. It's more about preventing duplication in the software
design rather than cleaning it up after the fact (which is an interesting
problem, but a different topic).

Think of it this way: one could build a detailed Entity Relationship diagram
(or OOP equivalent) in a machine-readable format with all the relationship and
column-size constraints defined. One could then push a button and have a
machine generate a working version of the software. Those tools do exist. But
they are usually missing useful details and result in UIs poorly tuned for
how employees will likely be using the system.

Many of the tweaks to make it "practical" will be exceptions or local
customizations to the original ER diagram data. Those
customizations/deviations are the bottleneck, so in practice most stacks
duplicate the info instead. See DRY ("Don't Repeat Yourself") on software
engineering slang sites.
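
One way to picture keeping those deviations as data rather than as copies: store the defaults once and store only the per-screen exceptions in a second table, then resolve them at query time. This is a minimal sketch using Python's built-in `sqlite3`. The table names, columns, and sample rows are all made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE field_default (
    field TEXT PRIMARY KEY,
    title TEXT NOT NULL,
    width INTEGER NOT NULL
);
CREATE TABLE field_override (
    screen TEXT NOT NULL,
    field  TEXT NOT NULL REFERENCES field_default(field),
    title  TEXT,      -- NULL means "inherit the default"
    width  INTEGER,   -- NULL means "inherit the default"
    PRIMARY KEY (screen, field)
);
INSERT INTO field_default VALUES
    ('complaint.date_received', 'Date Complaint Received', 12),
    ('complaint.status', 'Status', 10);
INSERT INTO field_override VALUES
    ('complaint_list', 'complaint.date_received', 'Received', NULL);
""")


def screen_fields(screen):
    """Return (field, title, width) rows with any overrides applied."""
    rows = conn.execute(
        """
        SELECT d.field,
               COALESCE(o.title, d.title),
               COALESCE(o.width, d.width)
        FROM field_default AS d
        LEFT JOIN field_override AS o
               ON o.field = d.field AND o.screen = ?
        ORDER BY d.field
        """,
        (screen,),
    )
    return rows.fetchall()
```

Here `screen_fields("complaint_list")` picks up the shortened title but still inherits the default width, and a screen with no override rows just gets the defaults. It does not cover dynamically computed variations, which would need references to code stored alongside the rows.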

------
PaulHoule
Standards and tools are lacking.

One take on the vision of the Object Management Group is that the application
should almost write itself once you've figured out the data structures. The
designers of Visual Studio Team Edition wrote this book:

[https://www.amazon.com/product-reviews/0471202843](https://www.amazon.com/product-reviews/0471202843)

which describes a hypothetical product which is much much better than VSTE.

It is a hard problem. The structure of common sense knowledge is something
like: "A(x) is true" but "A(3) is false". In principle this kind of system
would be good

[https://docs.jboss.org/drools/release/7.17.0.Final/drools-do...](https://docs.jboss.org/drools/release/7.17.0.Final/drools-docs/html_single/index.html)

but the error messages it returns often make no sense at all and I haven't
learned the source code for the engine well enough to be able to read them.
Products like that have several different mechanisms that can be used to model
"default logic" (X is true unless...) but there is no simple standard
mechanism that makes people happy.
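
To show the shape of "X is true unless...", here is a toy Python sketch of the "A(x) is true" but "A(3) is false" case. It is not how Drools or any rule engine models it. The `DefaultRule` class and its fields are hypothetical, and it handles only a flat default with point exceptions.

```python
from dataclasses import dataclass, field


@dataclass
class DefaultRule:
    """A predicate that holds by default, with explicit exceptions."""
    default: bool
    exceptions: dict = field(default_factory=dict)

    def holds(self, x):
        # A specific exception wins over the general default.
        return self.exceptions.get(x, self.default)


# "A(x) is true" in general, but "A(3) is false".
A = DefaultRule(default=True, exceptions={3: False})
```

`A.holds(1)` gives `True` and `A.holds(3)` gives `False`. The hard part is what happens when exceptions have their own exceptions or two rules conflict, and that is where the competing mechanisms diverge.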

(Look at my HN profile and contact me if you want to chat)

~~~
tabtab
Part of the problem is that each shop has different conventions & preferences
based on existing products/apps and domain-related issues. An out-of-the-box
solution probably won't work well without tuning it for the shop, but there
are often too many "knobs and dials" to know how to tune well without a big
learning curve. A mix-and-match kit along with some sample reference apps may
be a better approach than a sealed black box with lots of knobs (like
Visual Studio). I'll consider your offer. Thanks.

------
tabtab
There are similar prior discussions such as:
[https://news.ycombinator.com/item?id=12061453](https://news.ycombinator.com/item?id=12061453)
However, I'm not really talking about making fancy APIs or complex
abstractions, but more about just dealing with regular database
column-oriented information. Although, there may be some overlap.

