If you have a business need to represent "empty" or "n/a" or "declined to answer...

crazygringo · on Aug 24, 2022

You can't without defining yet another field, usually. It's really annoying to double the number of columns so that [time_started, time_finished] becomes [time_started_exists, time_started, time_finished_exists, time_finished].

NULL values make business logic far more compact and intuitive. For enumerated values in fields it's easy enough to define another value in the same field to mean 'unknown' or 'not entered', but you can't do that for strings, numbers, datetimes, etc. -- you have to throw in a bunch of unwieldy additional boolean fields instead.
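As a rough illustration of the two layouts being compared, here is a minimal sketch using Python's built-in sqlite3 module. The table names (tasks_nullable, tasks_flagged), the sample rows, and the empty-string placeholder are all hypothetical, chosen only to show the column doubling:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Layout 1: nullable columns, NULL means "no value".
CREATE TABLE tasks_nullable (
    id INTEGER PRIMARY KEY,
    time_started TEXT,
    time_finished TEXT
);
-- Layout 2: no NULLs, so every value needs a companion flag
-- plus a placeholder (here an empty string, an arbitrary choice).
CREATE TABLE tasks_flagged (
    id INTEGER PRIMARY KEY,
    time_started_exists INTEGER NOT NULL,
    time_started TEXT NOT NULL,
    time_finished_exists INTEGER NOT NULL,
    time_finished TEXT NOT NULL
);
INSERT INTO tasks_nullable VALUES
    (1, '2022-08-24 09:00', '2022-08-24 10:00'),
    (2, '2022-08-24 09:30', NULL);
INSERT INTO tasks_flagged VALUES
    (1, 1, '2022-08-24 09:00', 1, '2022-08-24 10:00'),
    (2, 1, '2022-08-24 09:30', 0, '');
""")

# Both queries find the unfinished task, but the flagged layout
# carries twice as many columns to express the same thing.
unfinished_nullable = [r[0] for r in conn.execute(
    "SELECT id FROM tasks_nullable WHERE time_finished IS NULL ORDER BY id")]
unfinished_flagged = [r[0] for r in conn.execute(
    "SELECT id FROM tasks_flagged WHERE time_finished_exists = 0 ORDER BY id")]
```

In the flagged layout nothing stops a row from having time_finished_exists = 0 alongside a real-looking timestamp, which is part of why it feels unwieldy.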

Quekid5 · on Aug 24, 2022

The underlying problem here is that SQL lacks Sum Types (aka Tagged Unions). Such types solve all these problems effortlessly.

In contrast to what SQL has, Sum Types combined with Product Types (which is basically what a row is) are actually a universal way to model all possible data[0]. (Of course you may want syntax sugar, etc. on top of that, but Sums and Products at the bottom are sufficient.)

[0] I'm actually not sure if I should qualify that -- I believe Sums+Products can actually model anything, assuming you allow recursive type definitions -- which might be hard to make perform well. Storing a linked list in a database field, e.g. might not be the best idea.
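For readers unfamiliar with the idea, here is a minimal sketch of a sum type in Python, standing in for what a language with native tagged unions would offer. The variant names (Known, NotEntered, DeclinedToAnswer) and the Task record are hypothetical and exist only for illustration:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Union


@dataclass(frozen=True)
class Known:
    value: datetime


@dataclass(frozen=True)
class NotEntered:
    pass


@dataclass(frozen=True)
class DeclinedToAnswer:
    pass


# Sum type: a field is exactly one of these variants.
TimeField = Union[Known, NotEntered, DeclinedToAnswer]


# Product type: a row-like record whose fields are each a sum type.
@dataclass(frozen=True)
class Task:
    time_started: TimeField
    time_finished: TimeField


def describe(field: TimeField) -> str:
    """Handle each variant explicitly instead of checking a flag column."""
    if isinstance(field, Known):
        return field.value.isoformat()
    if isinstance(field, NotEntered):
        return "not entered"
    return "declined to answer"


task = Task(Known(datetime(2022, 8, 24, 9, 0)), DeclinedToAnswer())
summary = (describe(task.time_started), describe(task.time_finished))
```

Each "missing" reason is its own variant, so no extra boolean field is needed and a datetime can only appear in the Known case.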

Koshkin · on Aug 24, 2022

Indeed, the issue of non-existence is a somewhat tricky philosophical question (e.g. there are many ways something may not exist).

andy81 · on Aug 24, 2022

A better option in many cases is to check the primary key.

e.g.

select questions.Id, questions.Text, answers.Id, answers.Text
from questions
left join answers on answers.QuestionId = questions.Id

answers.Id is a primary key and therefore non-nullable, so if answers.Text is null but answers.Id is not null, they've declined to answer. If answers.Id is itself null, the left join found no answer row at all.
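A small sketch of how that check plays out, using Python's sqlite3 module. The schema and sample rows are assumptions made up for this example; only the join itself comes from the query above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE questions (Id INTEGER PRIMARY KEY, Text TEXT NOT NULL);
CREATE TABLE answers (
    Id INTEGER PRIMARY KEY,
    QuestionId INTEGER NOT NULL REFERENCES questions(Id),
    Text TEXT  -- nullable: NULL here means "declined to answer"
);
INSERT INTO questions VALUES (1, 'Name?'), (2, 'Age?'), (3, 'Income?');
-- Question 1 is answered, question 2 is declined, question 3 has no row.
INSERT INTO answers VALUES (10, 1, 'Alice'), (11, 2, NULL);
""")

rows = conn.execute("""
    SELECT questions.Id,
           CASE
               WHEN answers.Id IS NULL THEN 'no answer row'
               WHEN answers.Text IS NULL THEN 'declined to answer'
               ELSE 'answered'
           END
    FROM questions
    LEFT JOIN answers ON answers.QuestionId = questions.Id
    ORDER BY questions.Id
""").fetchall()
```

The CASE expression tests answers.Id first, so a NULL produced by the outer join is never mistaken for a NULL stored in answers.Text.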

dspillett · on Aug 24, 2022

That may be implementation or circumstance specific. For instance, in MS SQL Server with a heap table (one without a clustered index), or a table where the primary key is not the clustering key, checking the primary key will result in extra page reads to fetch that field's value. (The query planner / engine could infer from it being the PK that it can never be null, so the lookup is unnecessary, but IIRC it does not do this.) As the columns used in the join predicate have to be read to perform the join anyway, no extra reads will result from using them for other filtering.

In your example it is very likely that the primary key is the clustering key, so it will be present in the non-clustered index that I assume will be on answers.questionId, making my point moot. But if for some unusual reason neither Id nor questionId were the clustering key, checking Id may result in extra reads being needed.

In DBMSs without clustering keys implemented similarly to SQL Server, there may be such concerns in all cases.