
Database Design: Relation Predicates and “Identical Relations” - sgeneris
http://www.dbdebunk.com/2016/09/database-design-relation-predicates-and.html
======
gnud
Wow, this was hard to read. Not because it was very complex, just because of
the incomplete sentences and unessescary use of abbreviations.

In the other post [1], linked from this one, he nit-picks because the question
uses the terms "tables" and "fields" instead of "relations" and "attributes".

I'm sure the author knows more about relational modelling than me, but I'm not
sure I would enjoy asking him for advice.

1:
[http://www.dbdebunk.com/2012/09/NormalOrthogDBDesign_3.html](http://www.dbdebunk.com/2012/09/NormalOrthogDBDesign_3.html)

~~~
sgeneris
> Wow, this was hard to read.

Well maybe the reader had something to do with it.

~~~
mhluongo
Are you trying to communicate, or be right on the internet? If you're trying
to communicate, readability matters.

------
garyclarke27
Article is poorly written, long winded, nonsense. I am happy for Relation
names to distinguish entities. He is basically arguing that this violates some
sacred principle and that all meaning should be available from column names,
so that relations names can be ignored. I don't think that even Chris Date who
is a zealous RM fanatic would argue this - his books are actually quite good
much more logical than Fabian Pascall. I'm also sure that Joe Celko would not
agree with Fabian on this.

------
kahirsch
> If two relations have identical designs, the difference can only be encoded
> in the relation (and possibly attribute) _names_ , in violation of the IP
> [information principle]. As such, it is inaccessible to the DBMS, which
> cannot rely on the RPs [relation predicates] to accept/reject tuples as
> correct/incorrect. Users--with little, or no help from the DBMS--must ensure
> that the correct tuples are inserted in the proper relation. Moreover,
> because information is not represented explicitly (as data values),
> relational operations lose it: if you UNION the two relations, you get users
> that either viewed or downloaded items, or both.

The author appears to be suggesting that the name of a relation has nothing to
do with the meaning of a tuple in it. I've read dozens of books and papers on
database design and I have never come across anyone ever suggesting any such
thing.

He also seems to be saying:

(1) The name of a relation is inaccessible to the DBMS.

(2) You should be able to throw tuples at a DBMS and it should just figure out
where they go.

(3) If there are tables A and B in your database and you can't reconstitute A
and B separately from (A UNION B), then there's something wrong with your
database design.

~~~
catnaroek
> The author appears to be suggesting that the name of a relation has nothing
> to do with the meaning of a tuple in it.

The name of a relation has to be a good mnemonic for human use, but
ultimately, the real meaning of a relation is what the DBMS can enforce about
its tuples, just like the meaning of a type is what the type system can
enforce about its inhabitants. You can't prove much about how humans will
actually use a relation (or a type), since that depends on their whims, but
you can prove useful things about the _limits_ of what humans can do with a
relation (or type), even in principle. And the purpose of a good database
schema (or typeful design) is to make nonsensical things impossible, even in
principle.

~~~
ulber
>And the purpose of a good database schema (or typeful design) is to make
nonsensical things impossible, even in principle.

Can you elaborate on what is the nonsensical thing possible with two tables of
the same type?

To me the (A UNION B) example makes very much sense exactly in the case that
you want all x s.t. (x is in A) OR (x is in B). With differently named id
columns you might have to write this as (excuse my pseudo code):

    
    
      (RENAME(A, a_id TO a_or_b_id) UNION RENAME(B, b_id TO a_or_b_id))
    

which is indeed more explicit. Having written that I might see the rub here.
In the article the column "item_id" is kind of a "view_or_download_id" and
which it is is only specified the namespace (i.e. the name of the table ITEMS
or DOWNLOADS the tuple belongs to). In an OO world this would be the
difference between:

    
    
      class Person { public string Name { get; set; } }
      ...
      var employees = new List<Person>(...);
      var managers = new List<Person>(...);
    

and the version where employee and manager types are incompatible:

    
    
      class Person { public string Name { get; set; } }
      class Employee : Person {}
      class Manager : Person {}
      ...
      var employees = new List<Employee>(...);
      var managers = new List<Manager>(...);
    

The latter would indeed prevent e.g. employees.Concat(managers) on accident.
However, in many programming scenarios the former would be preferred for the
increased code reusability. Is this not an objective in database design (with
e.g. stored procedures that could be applied to more than one set of tables)?

~~~
catnaroek
> Can you elaborate on what is the nonsensical thing possible with two tables
> of the same type?

There's nothing intrinsically wrong with it a priori. But see below.

> To me the (A UNION B) example makes very much sense exactly in the case that
> you want all x s.t. (x is in A) OR (x is in B).

To me, the very idea of making a collection whose elements have different
logical types is completely nonsensical. I consider it a weakness of SQL that
I can form the relation:

    
    
        select customer_id as id, customer_name as name
        from customer
        union
        select employee_id as id, employee_name as name
        from employee
    

The attributes customer_id and employee_id should have _different abstract
types_ , so that they can't be accidentally conflated with one another, even
if their internal representation is the same (say, an autoincremental int).
Then, if you want to make a derived relation containing both customers and
employers, you should be forced to use a sum type (à la Haskell's Either) as
the primary key.

> In an OO world this would be the difference between: (snippet)

Object-orientation is simply the wrong paradigm for building data-oriented
applications, at least if you care about automatically enforced data integrity
and declarative data manipulation (which you should):
[https://news.ycombinator.com/item?id=11861520](https://news.ycombinator.com/item?id=11861520)

------
combatentropy
Hey, I can talk like that too.

    
    
      A mental construct has occurred to my consciousness
      subsequent to the exposure of said mind to myriad
      graphological artifacts (GE's) conjecting observations
      that are encapsulated in a word structure (WS) of such
      syllabic cornucopia that the reliable transmission of
      said observation to the recipients takes on a high
      degree of uncertainty.
    

Now it's your fault if you don't understand it.

~~~
sgeneris
I wouldn't expose your inability to understand if I were you.

~~~
combatentropy
Why is that?

------
burnstek
I would say the biggest decision point here is how either entity might evolve.
If these are truly different domain entities then one may end up with
different attributes than the other in the future, meaning these should
definitely be different tables. They just happen to have the same set of
fields at this point in time.

------
dragonwriter
Wow, for a site that claims to be "database fundamentals made accessible",
that's a fairly inaccessible, overwritten piece.

I think the fundamental argument it is making has some value; if 2 or more
tables [0] are structurally different _only_ in the name of the table, then
the facts in those tables are almost certainly related instances of some
general common supertype, such that there should be one table, with an
additional field disambiguating the subtype (and which probably a foreign key
into a new table specifying the valid subtypes, which have a 1:1 relationship
to the tables in the "bad" schema.)

OTOH, in lots of practical applications no application would ever care about
that "ideal" base table, and all interaction would be through views that
exactly correspond to the tables in the "bad" design. Outside of a DB serving
as an ideal, application-independent store (which is an important role, and
even in single-app DBs designing for this can have advantages in dealing with
growth and change and unexpected future uses), the _value_ of this ideal
transformation may be minimal.

[0] base relations, if you prefer.

------
cafard
Fabian Pascal tends to write as if the point of database work were to
construct perfect ontologies. Perhaps we would be better off if we did that,
perhaps not. In this case, perhaps it would be a cleaner design to have a
single table with ID, USER_ID, FIRST_VIEWED, FIRST_DOWNLOADED. But would I
lose sleep over this? No.

~~~
catnaroek
> Fabian Pascal tends to write as if the point of database work were to
> construct perfect ontologies.

Wait, is it not?

------
makebelieve
I found this approach deeply troubling because it moves us away form semantic
design towards logic design... which always runs into problems when the
database itself becomes semantic content. The write up was rather confusing.

The solution seems straightforward. A single table that captures the meaning
expressed by the separate VIEWS and DOWNLOADS tables. eg. USERACTION (USER_ID,
ITEM_ID, ACTIONTYPE) where ACTIONTYPE is a value like V for view and D for
download. Of course, that solution is hard to see because it's a synthesis of
meanings occurring at different levels and not the product of predicate logic.

~~~
sgeneris
Database design IS logic. That's the point of the RDM: to formalize and
symbolize semantics such that the DBMS can enforce integrity on and manipulate
data, such that logical and semantic correctness is guaranteed. Leave that to
users in apps at your peril. We used to do this before the RDM and the whole
shabang collapsed. And we're still doing it because practitioners know nothing
beyond SQL and coding.

~~~
dragonwriter
> We used to do this before the RDM and the whole shabang collapsed.

No, it didn't, and relational-theory-purists aren't going to sell their ideas
to practitioners in the real world by pretending that it did. The RDM
certainly _offers_ all kinds of abstract benefits, which practitioners often
do not fully understand or leverage, and there is a very real problem when the
not fully leveraging is due to not fully understanding (rather than weighing
practical costs and benefits in the particular use case.)

OTOH, the reason that things built on the relational model took off in
practice wasn't that non-relational systems had reached a point of
catastrophic logical failure that led to their rejection, but because the
relational model had a convenient mapping to implementations that were
convenient and efficient in the technology of the day (particularly, hard disk
storage), combined with _some_ of the structural improvements over other
approaches being particularly attractive for important application domains.

> And we're still doing it because practitioners know nothing beyond SQL and
> coding.

Yeah, look, we're probably never going to have a time when most practitioners
are deep theoreticians rather than expert tool users, and if you want to sell
practitioners on deeper consideration of the underlying theoretical models,
you're going to need to make explanations of the practical benefits _much_
more accessible than you have in the source article _or_ your comments in this
thread (and you're going to need to be a _lot_ less personally abusive.)

~~~
sgeneris
Right, there is never time to do it right and lots of time to do it over.

I do not think that your reading of history of the field is anywhere close to
reality. I do suggest that you read as carefully my comments as I write them:
I did not say practitioners ought to be theoreticians, I said they should no
engage in a field founded on logic without ANY intro to logic. Big difference.

In fact, the initial mapping to implementation--direct image SQL
implementations--was not in the relational spirit at all and is in large part
responsible to logical-physical confusion and confusion of tables with
relations. And to call those initial implementations efficient in the
technology of the day is from another planet. IBM would not budge implementing
the RDM until Oracle forced it. It just so happened that even the limited
relational fidelity of SQL proved superior to the rigidity, complexity and
lack of soundness of hierarchic and network technologies. Have you ever seen
IMS or Codasyl code?

Listen, have done nothing but exactly making the practical implications of the
theory for the last 40 years. I suggest you read my stuff and tell me exactly
what is wrong with it. The problem is lack of fundamental education which has
been replaced by tool training. Practitioners are not even aware that there is
something beyond experience with tools that they need to know.

So pls.

~~~
dragonwriter
> Right, there is never time to do it right and lots of time to do it over.

Purists like to snear when they say this, but in point of fact its often true:
its often more efficient to do things _good enough for now_ and fix the things
that turn out to need to fixing later (because the real pace of change often
means the things that are broken-in-theory, but good-in-enough-in-practice are
going to need completely replaced because of requirements changes before they
become problematic in practice.)

But its true, OTOH, that lack of knowledge of relational theory, the anomalies
that it identifies that are tied to improper data models, and the practical
impact of these can lead to poor analysis of the tradeoffs, and that the
common cargo-cult rules of thumb (say, for degree of normalization to pursue)
that are frequently used in practice are poor substitutes for deep
understanding of the relational model and the problem of concern, and analysis
of the real risks in the system under design of taking shortcuts.

> I do suggest that you read as carefully my comments as I write them

One of the biggest problems with your comments is that they don't appear to be
written carefully -- particularly, if you hope to influence the practitioners
that you treat with such condescension, you have failed to put due care into
consideration of your approach to the audience, which is a central element of
any communication.

> I did not say practitioners ought to be theoreticians, I said they should no
> engage in a field founded on logic without ANY intro to logic.

You seem to also, however, keep suggesting that either a lack of deep
familiarity with relational theory or a disagreement with your interpretation
of how that theory ought to shape practice are equivalent to (or can only be a
result of) a complete lack of grounding in logic. Whether you are actually
_conflating_ these things or just engaging in particularly obnoxious
condescension and personal abuse when you do this is less than clear, but
neither is helpful or useful.

> In fact, the initial mapping to implementation--direct image SQL
> implementations--was not in the relational spirit at all and is in large
> part responsible to logical-physical confusion and confusion of tables with
> relations.

I'm not sure what you mean by the "relational spirit". SQL's design was
clearly shaped by the relational data model, though SQL itself (in its current
form as well as its early forms) is certainly not _ideal_ from a relational
perspective, even before considering the whole NULL controversy.

> And to call those initial implementations efficient in the technology of the
> day is from another planet.

The initial implementations weren't what became popular though; the
implementations that _were_ efficient were key to driving popularity.

> It just so happened that even the limited relational fidelity of SQL proved
> superior to the rigidity, complexity and lack of soundness of hierarchic and
> network technologies.

Sure. I just think that its easy for theory-purists to overstate the degree to
which the "lack of soundness" was the problem driving adoption, rather than
"rigidity and complexity". Insofar as the _linguistic_ and _expressive_
features of SQL and the relational model proved attractive, simplicity and
flexibility were particularly important, and while the capacity for soundness
_is_ an important improvement, its one that's been underused since day one.
Its simply not the case, as you seem to present, that we've "fallen" to a
state where that is ignored from some rosier days when that aspect of RDBMS
capacity was strongly embraced and effectively and rigorously used by
practitioners.

> Have you ever seen IMS or Codasyl code?

I've even had to write (well, modify) some IMS code, far more recently than
I'd prefer to have.

> The problem is lack of fundamental education which has been replaced by tool
> training.

I don't think that's true. I think that the _number_ of people with
"fundamental education" in the field is probably greater than ever before.
Sure, the number of people with tool training has increased faster, but that's
not tool training replacing fundamental education, its just that with _any_
technology, the first generation of users will _all_ (or, at least,
disproportionately) be versed in the underlying principles because they are
also the builders of the technology, but over time that's going to fade as,
even with more people educated in the principles, people who are just
pragmatic users of the technology with a more limited focus are going to grow
at a faster rate.

> Practitioners are not even aware that there is something beyond experience
> with tools that they need to know.

Most practitioners I've encountered seem to be aware that relational theory
_exists_. Sure, lots of them aren't well versed in it or what light it has to
shine on their craft, but abstruse descriptions without clearly explained
pragmatic benefits aren't an effective way to correct that and motivate them
to dig more into theory, and neither is condescension and abuse.

~~~
sgeneris
> Purists like to snear when they say this, but in point of fact its often
> true: its often more efficient to do things good enough for now and fix the
> things that turn out to need to fixing later (because the real pace of
> change often means the things that are broken-in-theory, but good-in-enough-
> in-practice are going to need completely replaced because of requirements
> changes before they become problematic in practice.)

Do you have stats that prove your point, or is it based on the fact that this
is how it is usually done because that is what is possible given the poor
level of education and knowledge in the industry? If you're not part of that,
don't underestimate its size. I spent 40+ years demonstrating the ignorance
and its consequences.

By the way what does "impure theory" mean?

> But its true, OTOH, that lack of knowledge of relational theory, the
> anomalies that it identifies that are tied to improper data models, and the
> practical impact of these can lead to poor analysis of the tradeoffs, and
> that the common cargo-cult rules of thumb (say, for degree of normalization
> to pursue) that are frequently used in practice are poor substitutes for
> deep understanding of the relational model and the problem of concern, and
> analysis of the real risks in the system under design of taking shortcuts.

> Glad we agree on something. My claim is that there's more to that than you
> seem to think.

> One of the biggest problems with your comments is that they don't appear to
> be written carefully a-- particularly, if you hope to influence the
> practitioners that you treat with such condescension, you have failed to put
> due care into consideration of your approach to the audience, which is a
> central element of any communication.

The title of one of my books is "for the THINKING practitioner". He is the one
I try to influence and aren't that many. It's not entirely their fault--it's
how the industry and business in general operates. I can detect very easily
the difference between a thinker who is uninformed and a non-thinker and I
treat them differently. It's just that there's many more of the latter than
the former.

> You seem to also, however, keep suggesting that either a lack of deep
> familiarity with relational theory or a disagreement with your
> interpretation of how that theory ought to shape practice are equivalent to
> (or can only be a result of) a complete lack of grounding in logic. Whether
> you are actually conflating these things or just engaging in particularly
> obnoxious condescension and personal abuse when you do this is less than
> clear, but neither is helpful or useful.

So according to you it's not possible to detect the difference between a poor
argument due to ignorance of logic and one grounded in logic? Again, I have
spent 4 decades doing this and pls permit to believe that I discern quite
readily who should be treated with respect and who not. You're entitled to
disagree.

> I'm not sure what you mean by the "relational spirit". SQL's design was
> clearly shaped by the relational data model, though SQL itself (in its
> current form as well as its early forms) is certainly not ideal from a
> relational perspective, even before considering the whole NULL controversy.

The authors of SQL did not have a good grasp of the RDM, which is why SQL
cannot be considered truly relational. It violates too many rel. principles.
This according to Codd and Date who were both at IBM when SQL was developed.
The specific spirit I was referring to is physical independence, a core
objective of the RDM which a direct image implementation is not in the spirit
of.

> The initial implementations weren't what became popular though; the
> implementations that were efficient were key to driving popularity.

Yes, but many of the reasons for which SQL was slow to make efficient was its
poor relational fidelity. I've written a few articles on that subject.

> Sure. I just think that its easy for theory-purists to overstate the degree
> to which the "lack of soundness" was the problem driving adoption, rather
> than "rigidity and complexity". Insofar as the linguistic and expressive
> features of SQL and the relational model proved attractive, simplicity and
> flexibility were particularly important, and while the capacity for
> soundness is an important improvement, its one that's been underused since
> day one. Its simply not the case, as you seem to present, that we've
> "fallen" to a state where that is ignored from some rosier days when that
> aspect of RDBMS capacity was strongly embraced and effectively and
> rigorously used by practitioners.

I did not say soundness drove the adoption. In fact, there is no way this can
happen given that a vast majority of practitioners have no clue of how a
formal foundation for db mgmt is different and superior to non-formal ones.
This is not different than the notion that data science is science. That's
precisely why I keep stressing that substituting training for education killed
the capacity to appreciate the difference.

> I've even had to write (well, modify) some IMS code, far more recently than
> I'd prefer to have.

My sympathy.

> I don't think that's true. I think that the number of people with
> "fundamental education" in the field is probably greater than ever before.
> Sure, the number of people with tool training has increased faster, but
> that's not tool training replacing fundamental education, its just that with
> any technology, the first generation of users will all (or, at least,
> disproportionately) be versed in the underlying principles because they are
> also the builders of the technology, but over time that's going to fade as,
> even with more people educated in the principles, people who are just
> pragmatic users of the technology with a more limited focus are going to
> grow at a faster rate.

Well, having spent so many years documenting the deterioration of education, I
believe my evidence than your perceptions.

> Most practitioners I've encountered seem to be aware that relational theory
> exists.

Very well put. That's about the gist of it. And whatever little they know
about it is wrong.

> Sure, lots of them aren't well versed in it or what light it has to shine on
> their craft, but abstruse descriptions without clearly explained pragmatic
> benefits aren't an effective way to correct that and motivate them to dig
> more into theory, and neither is condescension and abuse.

They should not be motivated to do it on their own. There should be a basic
level of education required to be inducted into the profession by faculty that
are themselves proficient in the material and not industry hires that are
teaching coding and tools because that's what univs teach now to be
"relevant".

~~~
Roboprog
I hear you on the education thing. Intuitively, the "two tables with same
structure" approach is redundant, but it's helpful to have the math/logic to
unambiguously define what "redundant" is.

Having 2 tuple types with 3 values seems harmless. This explodes quickly when
done with many columns or many tables. Having some kind of foreign key to act
as a "discriminator" scales much better to constrain the number of value types
(# of tables x # of columns) that must exist.

But I only say this because of having to deal with a legacy DB at work that
violates the Hell out of this :-)

... which brings us full circle to the tragedy of a lack of education.

------
ErwinSmout
"If two relations have identical designs, the difference can only be encoded
in the relation (and possibly attribute) names,"

True, though the (and possibly attribute names) part seems dubious to say the
least. Different attribute names makes the designs non-identical, no ?

in violation of the IP [information principle].

This interpretation of the IP is outright absurd.

One, the "I" in "IP" was obviously intended to cover only ever the
[end-]user's own business information. The "information" that is being
"hidden" under this absurd interpretation is the mapping that applies from
relation names to intended interpretation. That information is never part of
the [end-]users "genuine business information". How could it ? The mapping in
question arises only when the models are being developed. As long as no
computers are involved, no information models and no mapping, but the "genuine
business information" stays the same.

Two, the relation names _are_ present as a value of an attribute in a tuple in
a relation. In the catalog that documents the structure of the database that
will contain the [end-]users "genuine business information".

As such, it is inaccessible to the DBMS, which cannot rely on the RPs
[relation predicates] to accept/reject tuples as correct/incorrect.

Not sure what is intended here. Is it claimed that because "table names are
inaccessible to the DBMS", it is impossible for the DBMS to enforce declared
constraints ? Ouch. The SQL REFERENCES clause ought to suffice to counter
that.

Users--with little, or no help from the DBMS--must ensure that the correct
tuples are inserted in the proper relation.

So they must know the mapping from relation names to intended interpretation.
That is not an insurmountable problem. Hundreds of thousands of developers
have already been doing that [or something extremely similar] since before
databases even existed.

Moreover, because information is not represented explicitly (as data values),
relational operations lose it: if you UNION the two relations, you get users
that either viewed or downloaded items, or both.

Then don't union the two together. (It is alas left unclear whether the usage
of the term UNION here refers to invocations of that relational operator on
the two relations in the original design (in which case the "loss of
information" is (a) intentional and (b) not really loss of information because
the original relations have not magically disappeared by computing the union),
or in the sense of blindly mergeing the two tables together at the schema
level without adding an indication of view vs. download. In which case it's
just a stupid design mistake even my cat probably wouldn't commit.)

~~~
sgeneris
The RDM is an attempt to maximize database management by the DBMS. What is
left only in the mind of users--particularly the kind of users we witness here
--is nothing but trouble. Reliance on sheer user discipline--for no good
reason, for the only reason that SQL is not truly relational and people don't
know, understand and appreciate the theoretical foundations of the RDM is what
is really absurd here.

That's the reason the IP insists that ALL information is represented
EXPLICITLY and in EXACTLY one way, to maximize soundness, power and
simplicity. Whoever does not understand this will get nothing better than SQL.

