Should there be an empty date, as a corollary to empty string, for representing ...

kbenson · on Feb 4, 2020

If that's the data policy of the group collecting the data, then an empty date (which some systems support, 0000-00-00 is valid in MySQL for certain modes of strictness IIRC), or a sentinel value could be used (1111-11-11, depending on expected values).

If the data needs to differentiate between those cases and a separate boolean bit of data to track whether it was collected is no feasible (as it so often isn't in the real world), then you do what you must. How that works out in the real world would be the policy for how to interpret the data in that field. It's really no different than any time you've seem a field that is named something counter-intuitive to what it holds in some schema. Something tracks that difference, whether it be institutional knowledge, come conversion code, or a label prior to display. That's what I mean by it being a "schema/data policy question".

talaketu · on Feb 4, 2020

Yes absolutely a data schema question. And yes, you can set whatever arbitrary policy you want. eg "nmn" for no-middle-name. Or require that a horse's birthday is recorded as 1st August.

But SQL and the relational model is meant to be a logical system, and I'm interested in preserving some conceptual integrity in the domain modelling.

kbenson · on Feb 4, 2020

I would say there's a spectrum. My guess is that whatever schema you could come up with could have an extra layer of metadata applied to describe the data to some benefit, but at some point you have to make a decision about how something is best represented.

To keep with the example of asking questions, if a sruvey is asked and each question is represented by field in a table for that survey, but they also want to track whether any particular question is answered, there are multiple ways to track this information. It can be done through a separate table (or multiple), it can be done with an extra boolean field per original question field, or it could be done by making each question field nullable, and ensuring that no question answered will ever be left null.

Which of those cases breaks the conceptual integrity of the domain modeling? I would argue that as long as they are applied consistently and accurately, none of them break it.

talaketu · on Feb 4, 2020

If your SQL query tells you the most popular name is "nmn" or that that a whole lot of people died before they were born (DOD = 1111-11-11) then yes, I think there has been a loss of integrity.

kbenson · on Feb 4, 2020

That only matters if your SQL data is ever meant to be used in isolation. If the only appropriate way to access the data and retain integrity is either through the same set of applications that insert it, or through fucntions written to mediate access within the SQL instance, than how the data is stored on disk is mostly irrelevant.

If all your access is mediated by a layer that takes care of this, then complaining about how it's stored in SQL is no different than complaining about how your DBMS is storing the data on disk. It doesn't matter, you aren't supposed to be doing anything with it there anyway.

Data integrity is not something a database can ensure, for any non-trivial use. ANSI SQL provides tools to help ensure data integrity, through data types, referential integrity checks if you define them and they are supported, but ultimately, those tools can only account for what the database supports. Will they prevent the favorite_color field from receiving the data that should have been entered into the pet_name field? Unlikely?

It's no different in programming languages and bugs (as a data integrity problem is a bug in data domain). Using C, Java, Rust and Haskell will not prevent all your bugs, and Perl, Python, JavaScript and Bash do not mean you will have bugs. A bug free Perl script is possible, and so is a bug riddled Rust script, as not all bugs are (or even can be) caught by the assurances the more strict languages provide.

Unless all your SQL in a schema use is through a directly connected SQL client that loads data directly from native SQL formats, runs queries, and exports data as the output of those queries, the applications that have ingress and egress to that database are what really matters for data integrity, and worrying that a date field might contain 1111-11-11 as a special sentinel value that means something other than what null value in the same field means is being unable to see the forest for the trees.