
Data Consistency Checks - kiyanwang
https://slack.engineering/data-consistency-checks-e73261318f96
======
commandlinefan
An analog in code that I've seen so often I've given up fighting it is this
"pattern" (in Java, but it's universal):

    
    
        try {
            doSomethingImportant();
        } catch (Exception e) {
            logger.error("oops", e);
            // continue on, even though the result is useless at this point
        }
    

... and hope that somebody will pay attention to the logs (hint: they won't).
Of course, if you can actually recover from an error (hint: you probably
can't), you should catch it and recover from it, and of course, you should
definitely log what happened so you can investigate later but FOR THE LOVE OF
GOD DON'T JUST KEEP GOING!
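
For contrast, a minimal sketch of the alternative - catch, log, then fail fast
by rethrowing so nobody can accidentally keep going. This is only an
illustration (the method name is made up, and I'm using java.util.logging to
keep it dependency-free):

    import java.util.logging.Level;
    import java.util.logging.Logger;

    class FailFastExample {
        private static final Logger logger =
                Logger.getLogger(FailFastExample.class.getName());

        static void doSomethingImportant() throws Exception {
            throw new Exception("simulated failure");  // stand-in for the real work
        }

        public static void main(String[] args) {
            try {
                doSomethingImportant();
            } catch (Exception e) {
                // Log it so you can investigate later...
                logger.log(Level.SEVERE, "doSomethingImportant failed", e);
                // ...but surface the failure instead of silently continuing.
                throw new IllegalStateException("doSomethingImportant failed", e);
            }
        }
    }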

Unfortunately, this sort of papering over error conditions is not just
accepted, but encouraged - most organizations bizarrely seem to prefer
returning wrong answers to admitting that an error occurred. Old mechanical adding
machines had moving parts that would degrade over time, making the answers
less and less accurate as the parts started to wear out. Adding machine
manufacturers addressed this by adding a gear that would cause the machine to
lock up completely so that the operator knew not to continue using it. While
it seems obvious that most users would rather get an error than the wrong
answer, most programmers seem to consider doing the opposite the "safe play".

~~~
nicoburns
Oh god. I inherited a codebase like this (except in many places it didn't have
the log). It was horribly unreliable, and when a customer reported a bug it
was almost impossible to tell what had caused it. The funny thing was they had
a pretty extensive e2e test suite. But it was mostly useless because it only
tested happy paths. The rule is now that if you touch a file for more than a
single-line change, you have to remove any such code.

------
iblaine
There seems to be a trend where companies are creating tools to monitor data
quality...

* Great Expectations: [https://github.com/great-expectations/great_expectations](https://github.com/great-expectations/great_expectations)

~2 years old, does data profiling

* Data Sentinel: [https://engineering.linkedin.com/blog/2020/data-sentinel-aut...](https://engineering.linkedin.com/blog/2020/data-sentinel-automating-data-validation)

It's not open sourced, however.

* Data Anomaly Detector: [https://www.slideshare.net/iblaine/using-airflow-for-tools-d...](https://www.slideshare.net/iblaine/using-airflow-for-tools-development)

Something I built at One Medical that is not open sourced... but this deck has
some interesting concepts in it. The approach here is to minimize the effort
(lines of code, number of clicks) needed to test your data.

* Amundsen: [https://github.com/lyft/amundsen](https://github.com/lyft/amundsen)

This is open sourced, and some contributors are building features for data
quality analysis. Amundsen seems like it has a bright future.

------
nicoburns
This is why I think schemaless databases ultimately cost more time than they
save. If your database is typed then you get a poor man's version of a lot of
these kinds of checks essentially for free. Otherwise it's _very_ easy to end
up with all sorts of inconsistent data in your database, which will result in
lots of difficult-to-track-down bugs that consume _far_ more time than
writing out a database schema would.
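
As a rough illustration of what "for free" means here (this assumes an
in-memory H2 database on the classpath, and the table is invented for the
example), even a minimal typed schema rejects at write time the kind of
inconsistent rows a schemaless store would happily accept:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.SQLException;
    import java.sql.Statement;

    class SchemaChecksExample {
        public static void main(String[] args) throws SQLException {
            try (Connection conn = DriverManager.getConnection("jdbc:h2:mem:demo");
                 Statement st = conn.createStatement()) {
                // The schema itself encodes a few consistency rules.
                st.execute("CREATE TABLE orders (" +
                           "  id INT PRIMARY KEY," +
                           "  email VARCHAR(255) NOT NULL," +
                           "  quantity INT NOT NULL CHECK (quantity > 0))");
                st.execute("INSERT INTO orders VALUES (1, 'a@example.com', 2)");  // accepted
                try {
                    // Rejected at write time: NULL email, non-positive quantity.
                    st.execute("INSERT INTO orders VALUES (2, NULL, -5)");
                } catch (SQLException e) {
                    System.out.println("rejected: " + e.getMessage());
                }
            }
        }
    }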

------
mahmoudimus
I think we need to caveat _what_ is a database when we make statements like
this:

> Databases. They are the single source of truth for our most critical
> business data, yet as engineers we tend to overlook tooling with this in
> mind.

Most folks, like myself, will assume that we're talking about traditional
concepts of databases, like RDBMSes or key/value stores. But over time, after
building distributed systems, it becomes important to realize that _logs_ are
also a form of streaming database, and we need to treat them as such.

This is the basic principle behind the venerable article by Jay Kreps (one of
the authors of Kafka): The Log: What every software engineer should know about
real-time data's unifying abstraction [1].

Once we start thinking this way, we realize that everything ends up having a
schema (protobuf, thrift, avro, etc.) -- it's just a "database" with a
completely different viewpoint.

IMHO - and this is how things work where I work - The Log is considered the
single source of truth, and what is traditionally referred to as a "database"
is just a "just-in-time" view of the data, materialized from The Log.
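
A toy sketch of that viewpoint (nothing Kafka-specific, all names invented):
the append-only log is authoritative, and the key/value "database" is just a
view you can rebuild from it at any time:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    class LogAsSourceOfTruth {
        record Event(String key, String value) {}  // one append-only log entry

        public static void main(String[] args) {
            // The Log: the single source of truth, only ever appended to.
            List<Event> log = new ArrayList<>();
            log.add(new Event("user:1", "alice"));
            log.add(new Event("user:2", "bob"));
            log.add(new Event("user:1", "alice.smith"));  // a later update

            // The "database": a just-in-time view materialized by replaying the log.
            Map<String, String> view = new HashMap<>();
            for (Event e : log) {
                view.put(e.key(), e.value());  // last write wins
            }
            System.out.println(view);  // e.g. {user:1=alice.smith, user:2=bob}
        }
    }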

Anyway, that is just my 2c.

[1]: [https://engineering.linkedin.com/distributed-systems/log-wha...](https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying)

~~~
skybrian
It seems like there's a fundamental distinction between systems where you
control data entry (it's manually entered and you can refuse to accept data if
it doesn't follow the rules, until the error is corrected) and systems where
you are recording whatever you got.

In the latter case, it's often better to keep data for debugging purposes than
drop it on the floor. The raw data may not conform to any particular schema,
but perhaps you transform it or only use a subset of it for more advanced
queries, and do something else with the rejects.

In older systems, this may be the difference between the paper form you got in
the mail and what gets entered in the computer.

Refusing to accept input has implications for power relations. Who loses when
data is rejected? Who gets a chance to fix it, and how hard do they try? What
error messages are there, and how long does it take to retry?

Case study: vote by mail.

~~~
mikemotherwell
>In older systems

If only it were older systems!

I get data from a variety of undocumented sources, and I started logging what
I actually receive next to the normalised data, e.g. I either put raw JSON
into the database next to the actual data it contains, or I put all downloaded
data into a git repo and update the repo after every query.

Why raw JSON? I've had data come back as JSON that did not validate - unescaped
quotes, blank fields missing a value (e.g. {"dog": , "cat": "yes"}), and just
plain wrong data (a number field that read "no number at present"). Putting it
into the DB lets me fix the errors and reprocess.

With non-JSON, I've had a raft of other errors/problems, and I find git works
well here to show what changed and caused the errors.

When dealing with anyone else's data, it is vital not just to validate it but
also to log it in a structured way that makes discovering errors easier.

------
arpa
So, how is this approach significantly better than good database design and
practices ensuring data consistency? E.g. column length, foreign keys, stored
procedures for updates/inserts that require validation logic?

~~~
skybrian
From the article:

> Some databases can enforce relationships at the systems level, using foreign
> keys, constraints, and other similar concepts, but these built-in checks and
> constraints represent only a small set of possible characteristics that may
> be modeled in any given system. Likewise, ORMs like ActiveRecord can check
> data at write time using validations, but these often need to be disabled in
> performance sensitive code, and can themselves contain bugs.

~~~
gav
You can add domain-specific validations that are hard or impossible to capture
with database constraints.

For example, I built a similar tool where you could write simple rules to
validate product data. It would run every night and generate a report for the
data quality team. One of the advantages was that you could decide to have
stricter requirements and then choose whether they applied only to new
products, or whether the team had to go back and fix all the old products too.

Some rules I remember:

\- Validating min-length, max-length, and casing of product names. Spell-checking
the words found.

\- Verifying product images existed and were the right dimensions, below a
certain file size, etc.

\- Making sure that certain product categories had dimensions, and that the
metric/imperial conversions were rounded in the industry-norm way (e.g. 3 3/4"
was 96mm)

\- Making sure that pricing/margins made sense on a vendor/product category
level, to try to catch typos like $1 or $1000.

The thing I found handy was to add a new rule whenever a problem occurred; the
new rule would then also catch and fix the same issue in historical data.
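
To make the shape of such a tool concrete, here is a minimal sketch (product
fields and thresholds are invented for illustration) of rule checks that
collect violations into a nightly report rather than blocking writes:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.function.Predicate;

    class ProductRuleChecks {
        record Product(String name, double price, int imageWidth, int imageHeight) {}
        record Rule(String description, Predicate<Product> ok) {}

        public static void main(String[] args) {
            List<Rule> rules = List.of(
                new Rule("name length between 3 and 80",
                         p -> p.name().length() >= 3 && p.name().length() <= 80),
                new Rule("price within a sane range",
                         p -> p.price() > 1.0 && p.price() < 1000.0),
                new Rule("image at least 500x500",
                         p -> p.imageWidth() >= 500 && p.imageHeight() >= 500));

            List<Product> products = List.of(
                new Product("Cabinet Handle 96mm", 4.99, 800, 800),
                new Product("x", 0.0, 120, 80));  // should trip several rules

            // Nightly job: run every rule over every product and report violations.
            List<String> report = new ArrayList<>();
            for (Product p : products) {
                for (Rule r : rules) {
                    if (!r.ok().test(p)) {
                        report.add(p.name() + ": failed '" + r.description() + "'");
                    }
                }
            }
            report.forEach(System.out::println);
        }
    }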

------
jtwaleson
This is great! As a CTO, I would trust live integrity checks far more than
our current unit tests that cover data integrity. Development is nice, but
live systems are real.

~~~
tremon
Strictly speaking, data integrity isn't part of your role. Properly
implementing business-wide data governance policies is the domain of the CDO.
Data integrity and validity are not properties of any single application, but
should be formalized as part of the business requirements and processes. This
includes:

\- identifying data flows: which data travels across which systems?

\- determining data authority: which application/data store is authoritative
for which piece of data?

\- assigning data stewardship roles: which manager is tasked with maintaining
data integrity for which authoritative data?

\- documenting data definitions: what are the exact semantics of certain data
fields?

Once the business has formalized its requirements on its data, you can use
those requirements in your technology strategy (in purchasing decisions,
in-house development, and worker training). But in the end, these requirements
are business-driven, not technology-driven.

~~~
jtwaleson
You don't know a single thing about my organization. Why are you telling me
what my responsibilities are?

