
A Simple Guide to Five Normal Forms in Relational Database Theory (1982) - brudgers
http://www.bkent.net/Doc/simple5.htm
======
ef4
My biggest pet peeve in the world of data storage is that the most powerful
ideas from relational data theory are still mostly locked up in books and
papers, and most programmers who think they know what "relational" means
aren't aware of the full picture.

The people who did all the pioneering work on the subject (mostly E.F. Codd
and Chris Date) seem to have made several blunders in trying to popularize
their ideas. For example, one of Date's later books ("Go Faster: The
Transrelational(tm) Approach to DBMS Implementation") contains some very
useful ideas. But it was written in 2001 and not published until 2011, because
he was working with a now-failed startup that was trying to keep it all a
trade secret.

Most of their writing is not available online. You have to buy their books.
Which is an author's prerogative, but seriously limits the reach of the ideas.

The world thinks it already has relational databases that are good enough.
Convincing it otherwise requires a web-savvy marketing approach that has so
far been lacking.

~~~
jules
I just checked out that book. The results are presented as some kind of
revolution but:

1. The system presented is simply an inefficient way of doing a single-column
index on each column.

2. The book contains NO benchmark results. It only claims that it's efficient
because of X, Y and Z, without any numbers to back that up. Meanwhile the
method involves a lot of pointer chasing, which is extremely slow, especially
on disk. Queries that require multi-column indices will also be extremely
inefficient of course, because they will require scans.

I wouldn't go so far as saying that the book is worthless, but its claims are
certainly dubious at best.

~~~
barrkel
"Queries that require multi-column indices will also be extremely inefficient
of course, because they will require scans."

If a query uses just the columns that are in a multi-column index, and doesn't
sort in any order other than the index order, then the query may be much more
efficient than with a single-column index, since all the data may be
retrieved from the index rather than by seeking to the row.

I think you left something out.

~~~
jules
We can decide which indices we have, you know.

The conditions you state are far from the only ones under which a multi-column
index helps. Even if you do not sort by the index order, and you do not have
all the data in the index, the index may still be way more efficient. For
example: SELECT A WHERE X=3 AND Y=5 ORDER BY B with an index on (X,Y).
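That claim can be checked directly against a query planner. A minimal sketch using SQLite (the table, columns, and data are made up for illustration): the composite index on (X, Y) satisfies both equality filters, even though neither A nor B is in the index.

```python
import sqlite3

# Hypothetical table matching the example query above.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (A INTEGER, B INTEGER, X INTEGER, Y INTEGER)")
con.execute("CREATE INDEX idx_xy ON t (X, Y)")
con.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?)",
    [(i, i % 7, i % 11, i % 13) for i in range(1000)],
)

plan = con.execute(
    "EXPLAIN QUERY PLAN SELECT A FROM t WHERE X=3 AND Y=5 ORDER BY B"
).fetchall()
# The plan reports a SEARCH using idx_xy rather than a full-table SCAN.
print(plan)
```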

------
geebee
I actually got a chance to chat about nosql with one of the people who coined
the term "nosql" (here on hn, no less). It sounds like it was a term that got
away from them a bit. I don't think anyone really intended to suggest that sql
should be completely abandoned. notalwayssql, or
maybeconsidersomethingotherthansql might have been better terms (well, not
_better_, but more specific to what was actually intended).

Relational databases and sql are an exceptionally sound approach to a lot of
data storage and retrieval problems. Some of the "nosql" technologies emerged
in response to genuinely new problems (such as full-text indexing and
network-structured data).

Progress is an interesting thing, and doesn't always happen on an even keel.
These days, I think you're far less likely to encounter knee-jerk resistance
to an rdbms or sql. The dust is settling a bit, and in this case, that's a
good thing.

------
jules
Can't you summarise all of that simply as: "a database schema should be able
to represent only valid data"?

e.g. a single (employee, department, location) table is not good because that
schema can represent invalid data: a single department can't be in two
locations, but the schema can represent that.
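A hedged sketch of the decomposition this implies, using SQLite (table and column names invented for illustration): the department-to-location fact gets its own table keyed by department, so a second location for the same department becomes unrepresentable.

```python
import sqlite3

# Instead of one (employee, department, location) table, split the
# department->location fact into its own table keyed by department.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE departments (
        department TEXT PRIMARY KEY,  -- one row per department...
        location   TEXT NOT NULL      -- ...so exactly one location
    );
    CREATE TABLE employees (
        employee   TEXT PRIMARY KEY,
        department TEXT NOT NULL REFERENCES departments(department)
    );
""")
con.execute("INSERT INTO departments VALUES ('sales', 'london')")
try:
    # A second location for 'sales' violates the primary key:
    con.execute("INSERT INTO departments VALUES ('sales', 'paris')")
except sqlite3.IntegrityError:
    print("rejected: a department has exactly one location")
```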

The same rule of thumb applies to types in programming: a data type should be
able to represent only valid values. This is why sum types are superior to
error codes: with error codes a function can return a result AND an error
code, when what we usually want is for the function to return a result OR an
error code. When invalid values are unrepresentable, a lot of errors become
impossible. Though sometimes your type system can't express all the
constraints that you have, e.g. that an array should be sorted (an unsorted
array is representable but not valid => bad).
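A minimal Python sketch of the result-OR-error idea (the types and function are invented for illustration): a caller only ever sees one of the two cases, never a result paired with an error code.

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class Ok:
    value: float

@dataclass
class Err:
    message: str

# A sum type: a Result is Ok OR Err, never both and never neither.
Result = Union[Ok, Err]

def safe_div(a: float, b: float) -> Result:
    if b == 0:
        return Err("division by zero")
    return Ok(a / b)

r = safe_div(10, 2)
if isinstance(r, Ok):
    print(r.value)      # 5.0
else:
    print(r.message)
```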

~~~
bcoates
The various normal forms are named for the kinds of invalid data they reject,
so they can be used as a design process to ensure the higher-level goal (that
"a database schema should be able to represent only valid data").

------
graycat
I thought that the main explanation of normal forms was that "the rows be a
function of the key, the whole key, and nothing but the key".

------
crshults
_First normal form excludes variable repeating fields and groups_

What are the definitions of "variable repeating fields" and "groups?"

What would violating records look like?

~~~
yangyang
A text column with a list of {comma, space, ...}-delimited values representing
a set, e.g. of tags, is something you occasionally see.
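A quick sketch (data invented) of what fixing that looks like: split the delimited column into one row per value, which is what first normal form asks for.

```python
# A post's tags stored as one comma-delimited string per row
# (the 1NF violation described above):
denormalized = {1: "sql,normalization", 2: "nosql"}

# Normalized: one (post_id, tag) row per tag.
post_tags = [
    (post_id, tag.strip())
    for post_id, tags in denormalized.items()
    for tag in tags.split(",")
]
print(post_tags)  # [(1, 'sql'), (1, 'normalization'), (2, 'nosql')]
```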

------
kellros
If you find it hard to read, set the body style e.g. width:600px; margin:auto;
font: 18px/24px Calibri,Arial,Helvetica; color:#333; background:#F3F3F3;

------
zvrba
OT, but: the guy retired in 2000 and died in 2005. That's sad :(

~~~
brudgers
From the Amazon page for _Data and Reality_:

    
    
       About the Author
    
       William Kent (1936-2005) was a renowned researcher
       in the field of data modeling. Author of Data and 
       Reality, he wrote scores of papers and spoke at 
       conferences worldwide, posing questions about 
       database design and the management of information 
       that remain unanswered today. Though he earned a 
       bachelor's degree in chemical engineering and a 
       master's in mathematics, he had no formal training 
       in computer science. Kent worked at IBM and later 
       at Hewlett-Packard Laboratories, where he helped 
       develop prototype database systems. He also served 
       on or chaired several international standards 
       committees. Kent lived in New York City and later 
       Menlo Park, Calif., before retiring to Moab, Utah, 
       to pursue his passions of outdoor photography and 
       protecting the environment.
    

While 69 is fairly young, his death does not appear to be tragic in the sense
of unfulfilled ambitions.

~~~
zvrba
> While 69 is fairly young, his death does not appear to be tragic in the
> sense of unfulfilled ambitions.

My comment was from the perspective of the course of his life. After
retirement, he had only 5 years to pursue "his passions of outdoor photography
and protecting the environment". (Note that he did not continue working on his
research.)

He was under pressure to deliver results (school, working, etc.) for the
majority of his life, and he had only 5 years of total and independent (within
financial constraints) freedom. _That_ is sad.

------
eksith
It's worth noting that this model works quite well for the vast majority of
client facing applications out there.

I.e. things categorized by tags or groups, where 1-to-n relationships are
necessary. Of course, the extreme of this is EAV (Entity-Attribute-Value),
which I would limit to meta/taxonomy data for performance reasons.

But information where the labels (or quantity) aren't known before input is
easier to deal with in normalized databases.

E.g. getting metadata on a specific entry with no prior knowledge of said data
except the parent id:

    
    
      SELECT id, label, content FROM meta WHERE id IN (
        SELECT meta_id FROM posts_meta WHERE post_id = :id
      );
    

Or if you're using Postgres, you can return the metadata as an associative
array. E.g.
[http://stackoverflow.com/a/11942726](http://stackoverflow.com/a/11942726)

The following is an excerpt from the actual schema I used on a very simple
forum that ran on SQLite for years before switching to Postgres fairly
recently. The normalization (varying forms) afforded a lot of flexibility.
Maybe someone will find it useful.

    
    
      CREATE TABLE posts (
      	id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL, 
      	created_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP, 
      	updated_at DATETIME DEFAULT NULL, 
      	title VARCHAR NULL,
      	summary TEXT NOT NULL, 
      	body TEXT NOT NULL, 
      	plain TEXT NOT NULL, 
      	quality FLOAT NOT NULL DEFAULT 0, 
      	status INTEGER NOT NULL DEFAULT 0, 
      	reply_count INTEGER NOT NULL DEFAULT 0, 
      	auth_key VARCHAR NOT NULL 
      );
      
      
      CREATE INDEX idx_posts_on_status ON posts ( status );
      CREATE INDEX idx_posts_on_created_at ON posts ( created_at );
      
      CREATE VIRTUAL TABLE posts_search USING fts4 ( search_data );
      
      CREATE TABLE posts_family (
      	parent_id INTEGER NOT NULL, 
      	child_id INTEGER NOT NULL, 
      	PRIMARY KEY ( child_id, parent_id )
      );
      
      CREATE TABLE taxonomy (
      	id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL, 
      	label VARCHAR NOT NULL, 
      	term VARCHAR NOT NULL, 
      	created_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP, 
      	updated_at DATETIME DEFAULT NULL, 
      	status INTEGER NOT NULL DEFAULT 0
      );
      
      
      CREATE TABLE posts_taxonomy (
      	post_id INTEGER NOT NULL, 
      	taxonomy_id INTEGER NOT NULL, 
      	PRIMARY KEY ( post_id, taxonomy_id ) 
      );
      
      CREATE TABLE taxonomy_family (
      	parent_id INTEGER NOT NULL, 
      	child_id INTEGER NOT NULL, 
      	PRIMARY KEY ( child_id, parent_id )
      );
      
      CREATE TABLE meta (
      	id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL, 
      	label VARCHAR NOT NULL, 
      	parse_as VARCHAR NOT NULL DEFAULT 'text', 
      	content TEXT NOT NULL
      );
      
      CREATE TABLE posts_meta (
      	post_id INTEGER NOT NULL, 
      	meta_id INTEGER NOT NULL, 
      	PRIMARY KEY ( post_id, meta_id ) 
      );
      
      CREATE UNIQUE INDEX idx_taxonomy_on_terms ON taxonomy ( label ASC, term ASC );
      CREATE INDEX idx_taxonomy_on_status ON taxonomy ( status );
      
      
      -- Triggers
      
      -- Post create procedures
      CREATE TRIGGER post_after_insert AFTER INSERT ON posts FOR EACH ROW 
      BEGIN
      	INSERT INTO posts_search ( docid, search_data ) 
      		VALUES ( NEW.rowid, NEW.plain );
      	UPDATE posts SET updated_at = CURRENT_TIMESTAMP WHERE id = NEW.rowid;
      END;
      
      -- Post update procedures
      CREATE TRIGGER post_before_update BEFORE UPDATE ON posts FOR EACH ROW 
      BEGIN
      	DELETE FROM posts_search WHERE docid = OLD.rowid;
      END;
      
      CREATE TRIGGER post_after_update AFTER UPDATE ON posts FOR EACH ROW 
      BEGIN
      	INSERT INTO posts_search ( docid, search_data ) 
      		VALUES ( NEW.rowid, NEW.plain );
      	UPDATE posts SET updated_at = CURRENT_TIMESTAMP WHERE id = NEW.rowid;
      END;
      
      -- Post deletion procedure
      CREATE TRIGGER post_before_delete BEFORE DELETE ON posts FOR EACH ROW 
      BEGIN
      	UPDATE posts SET reply_count = ( reply_count - 1 ) 
      		WHERE id != OLD.rowid AND id IN (
      			SELECT parent_id FROM posts_family WHERE child_id = OLD.rowid 
      		);
      	DELETE FROM posts_family WHERE parent_id = OLD.rowid OR child_id = OLD.rowid;
      	DELETE FROM posts_search WHERE docid = OLD.rowid;
      	DELETE FROM posts_taxonomy WHERE post_id = OLD.rowid;
      	
      	DELETE FROM meta WHERE id IN (
      		SELECT meta_id FROM posts_meta WHERE post_id = OLD.rowid
      	);
      	DELETE FROM posts_meta WHERE post_id = OLD.rowid;
      END;
      
      -- Post parent insert procedures
      CREATE TRIGGER posts_family_after_insert AFTER INSERT ON posts_family FOR EACH ROW 
      BEGIN
      	-- NEW.parent_id is the post gaining a reply (NEW.rowid would be
      	-- the rowid of the posts_family row itself, not the child post)
      	UPDATE posts SET reply_count = ( reply_count + 1 )
      		WHERE id = NEW.parent_id;
      END;
      
      
      -- Taxonomy procedures
      CREATE TRIGGER taxonomy_after_insert AFTER INSERT ON taxonomy FOR EACH ROW 
      BEGIN
      	UPDATE taxonomy SET updated_at = CURRENT_TIMESTAMP WHERE id = NEW.rowid;
      END;
      
      CREATE TRIGGER taxonomy_after_update AFTER UPDATE ON taxonomy FOR EACH ROW 
      BEGIN
      	UPDATE taxonomy SET updated_at = CURRENT_TIMESTAMP WHERE id = NEW.rowid;
      END;
      
      CREATE TRIGGER taxonomy_before_delete BEFORE DELETE ON taxonomy FOR EACH ROW 
      BEGIN
      	DELETE FROM posts_taxonomy WHERE taxonomy_id = OLD.rowid;
      	DELETE FROM taxonomy_family WHERE parent_id = OLD.rowid OR child_id = OLD.rowid;
      END;

~~~
yangyang
> Or if you're using Postgres, you can return the metadata as an associative
> array. E.G.
> [http://stackoverflow.com/a/11942726](http://stackoverflow.com/a/11942726)

That's not an associative array, it's just an array of a composite type. The
hstore type is probably the closest you'd get to an associative array in
PostgreSQL, but the keys and values are always strings.

~~~
dragonwriter
> The hstore type is probably the closest you'd get to an associate array in
> PostgreSQL

No, a table (possibly a temp table with the results of a particular query)
with an appropriately defined primary key (or other unique index) would be the
closest you can get to an associative array -- since a relation with _n_
candidate keys is exactly equivalent to _n_ different associative arrays with
the same data but different key/value splits.

But an array of a composite type drawn in such a way that some set of columns
is guaranteed unique is pretty much the same thing from a data perspective,
even though you may need to do work on the client side to load it into a
structure that supports associative array operations (e.g., efficient
key/value lookups).

~~~
yangyang
I meant as a self-contained type. PostgreSQL doesn't support nested tables.

The "associative" in associative array implies association is a property of
the data structure, not the way you happen to be using it.

------
mixologic
Keep in mind that preserving space was of the utmost importance in 1982.
Eliminating data redundancy was considered the pinnacle of design, at the
expense of maintainability, intelligibility, and real application behavior.

Please do not allow your data model designs to be artificially constrained by
the requirements of the past.

~~~
olavk
Normalization might actually take up more space, e.g. if you have to break out
a column into a separate table and introduce a new primary key.

The reason for normalization is to preserve integrity of the data. If the same
information occurs multiple times, it is possible to change it in one place
and not the other, thereby introducing inconsistent data. This might have
disastrous results depending on the application.
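A toy illustration of that update anomaly (rows invented): the same fact stored twice drifts apart after a partial update.

```python
# The department's location is repeated once per employee:
rows = [
    ("alice", "sales", "london"),
    ("bob",   "sales", "london"),
]

# The department moves, but only one row gets updated:
rows[0] = ("alice", "sales", "paris")

# The data now answers "where is sales?" two different ways.
locations = {loc for _, dept, loc in rows if dept == "sales"}
print(sorted(locations))  # ['london', 'paris']
```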

As far as I know, data integrity is still pretty important even today.

Redundancy should be a cause for concern regardless of the amount of storage
available. Redundancy might be used deliberately, e.g. for caching or
robustness, but even then you usually want to have one canonical version of
the data.

~~~
collyw
>As far as I know, data integrity is still pretty important even today.

Tell that to the NoSQL hipsters.

~~~
virmundi
For the love of (insert small thing here). NoSQL is not necessarily about
throwing data integrity out the window. It's about re-organizing your data so
that the things you care about keeping consistent stay together, such as a
document with a sub-document or list. Pretty much every data store promotes
some kind of data integrity.

It might not be ACID data integrity. It might not enforce referential
integrity outside of the document, or value, or whatever the modeling paradigm
is. There is still integrity. If you don't like that, fine. Fair enough. Some
might be okay with such a system. Name-calling is of little benefit. It is a
form of violence. "Violence is the last refuge of the incompetent" - Isaac
Asimov.

If you would like to learn more about the new data stores, see my short book,
New Data: a Field Guide.
[https://leanpub.com/NewDataAFieldGuide](https://leanpub.com/NewDataAFieldGuide).
It's targeted at managers and others with little time who want to get a good
overview.

~~~
phkahler
>> Name calling is of little benefit. It is a form of violence. "Violence is
the last refuge of the incompetent" - Isaac Asimov.

I hope you saw the irony in that.

~~~
virmundi
It's ironic in the Alanis Morissette way.

However, it was more apt against the parent's comment, which did not say why
NoSQL users are problematic, or even wrong. It just flat-out lambasts anyone
who walks away from the orthodoxy of ACID-SQL with an epithet. It truly was an
incompetent comment.

I'm using ArangoDB on my project. My reason is that, honestly, what I'm doing
really is document-y. I could use MySQL or PostgreSQL, but I'm opting for
something different for the sake of something different. I do want to have
integrity, though. Arango has integrity at the document level as well as
distributed transactions. It doesn't enforce references, though. I'm fine with
that.

What annoys me is when Internet commentators attack something new without at
least a cursory show of reasoning.

~~~
brudgers
When the internet annoys me, I try to hit the back button. Though sometimes, I
just have to turn it off.

[Disclaimer: this was written instead of dumping gasoline on the fire with the
first version.]

