
Ask HN: How do you monitor data integrity? - superzamp
I'm finding little if any literature on this topic; all resources about monitoring are heavily focused on capturing and analyzing run-time traces: app logs, HTTP statuses, etc. Database constraints and unit tests do exist, but so do bugs. Since it's almost impossible to prevent bad data from being saved sooner or later, do you have a strategy for monitoring your data integrity?
======
hypewatch
You mentioned database constraints and testing already, which are great
automated methods.

Besides these automated methods another way to monitor data integrity is for a
human to manually analyze the data for inconsistencies. And to ask other
domain experts at your company to analyze the data for issues. Sometimes this
is called the “smell test”. This method can be an effective way to find deeper
integrity issues with your data that would be hard to find with automated
tests.

Maintaining data integrity is very difficult, domain dependent, and takes lots
of work. Good luck!

------
Someone1234
If it were possible to create a rule that allowed me to find "bad data", then
I'd use that same rule in a constraint and outlaw the data as it was written.
"Data integrity" is entirely domain dependent, and externalities outside the
actual data can cause it to lose integrity (i.e., real life can ruin data's
integrity).
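
A minimal sketch of that first point, using Python's sqlite3 (the table and column names here are invented for illustration): if a rule can identify bad data after the fact, the same rule expressed as a CHECK constraint rejects it at write time instead.

```python
import sqlite3

# Hypothetical schema: the "rule for finding bad data" (quantity must be
# positive, status must be a known value) is encoded directly as
# constraints, so a violating row fails on INSERT rather than being
# discovered later by an audit.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        quantity INTEGER NOT NULL CHECK (quantity > 0),
        status TEXT NOT NULL CHECK (status IN ('pending', 'shipped', 'cancelled'))
    )
""")

# A valid row is accepted.
conn.execute("INSERT INTO orders (quantity, status) VALUES (3, 'pending')")

try:
    # A negative quantity violates the CHECK constraint and never lands.
    conn.execute("INSERT INTO orders (quantity, status) VALUES (-1, 'pending')")
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```

The bad row raises `sqlite3.IntegrityError`, so there is no after-the-fact cleanup step at all.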

For example, you could have mapping data telling drivers how to travel between
sites, but what if a site's entrance moves? What if there's temporary road
construction? What if a new bridge is built, shortening the route? All of these
harm your data's integrity without the data itself changing.

So my only strategy is:

- When you find yourself correcting bad data, consider if a constraint is
_possible_.

- Referential constraints are superior to logical ones, but both are better
than nothing (the caveat being that logical constraints risk embedding
business rules in the schema, which can balloon over time).

- Data types are "super weapon" constraints. Define exactly what you plan on
storing, don't future-proof, and avoid nullable types whenever possible.
Storing "magic strings" is usually wrong; create a table instead.

- Simpler data sets generally stay higher quality for longer. Consider
simplifying and streamlining your actual data design (e.g. using generic
building blocks and references for common domain tasks).

In general, hunting for bad data/low integrity amounts to completely
defining the whole problem domain in one go and then deciding not to use that
same information for constraints; it doesn't really make sense. If you're
consistently finding yourself looking for and correcting data integrity
problems, go back to the design and fix it there.

