
Understanding database normalisation: a very short tutorial - delan
http://azabani.com/blog/2012/05/understanding-database-normalisation/
======
neilk
I find it a bit disturbing how this post reaches the top of HN. But I suppose
I shouldn't be surprised.

I probably live in my own little bubble, but only lately have I realized that
NoSQL has two audiences: (1) People for whom normalization can't work because
of their application's characteristics and the limitations of current
hardware. (2) People who just don't understand basic relational concepts, or
why they were invented in the first place.

It's kinda sad. I've consulted on projects where people implemented sharding
before adding any indices to MySQL.
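
For anyone unfamiliar, the missing step is a one-liner. A minimal sketch, with made-up table and column names:

    -- Speed up lookups on a hot column long before reaching for sharding:
    CREATE INDEX idx_comments_user ON comments (user_id);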

The thing about being in group (1) is that you can also recognize when the
ground shifts beneath your feet. Artur Bergman is one of those guys.

[http://www.youtube.com/watch?v=H7PJ1oeEyGg&feature=youtu...](http://www.youtube.com/watch?v=H7PJ1oeEyGg&feature=youtube_gdata_player)

~~~
diminish
The point of NoSQL is mostly in group (3): web2+ startups with a possible 1B
users and 1000B things (comments etc) per user. Joins and normalization here
are a bit costlier, and what is written in your Oracle development manuals for
your small intracompany workgroup app doesn't work here.

So group (0), mostly db developers from the client/server architecture: when
they attack web2.0 problems, they fail because they stick to dogmatic notions
as if they were universally true. Though (2) people are ignorant of relational
concepts, (0) people are stubborn, uneducatable people who end up creating all
types of scalability problems. They overuse the notions of normalization, but
forget that they are attacking the wrong problem with the wrong tools.

PS: I am not yet using NoSQL in production and have a solid past in
Oracle/Db2/Ms SQL/Sybase, and am now doing startups in MySQL/Postgres and
Mongo.

~~~
sanderjd
Your group (3) is just his group (1) stated differently.

~~~
raleec
He denormalized the data.

~~~
diminish
Yea, I show the NoSQL way by repeating the data.

------
japhyr
This is one of the cleanest explanations of normalization I have come across.
I will use this with a few high school students I am working with, and I
expect it will be pretty easy for them to understand.

~~~
ajross
Really, normalization is just a special case of DRY. And it has the same value
for the same reasons. The attempt to enumerate the "kinds" of normal forms is,
IMHO, a mistake. The only point to doing it is reducing bug surface (and just
maybe storage space efficiency), so really: who cares whether or not your data
is in second or third normal form? Does the duplication of fields help or hurt
your application?

A good example of this is US ZIP codes. In theory (I think), a ZIP uniquely
identifies a city and state. But in practice no one stores it this way because
the maintenance of that data (by all parties: some people don't know zips, or
type them incorrectly) is more expensive than the savings from normalization.
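
To make the trade-off concrete, a rough sketch of the two designs (table and
column names are hypothetical):

    -- Normalized: city and state are stored once, keyed by ZIP.
    CREATE TABLE zip_codes (
        zip   CHAR(5) PRIMARY KEY,
        city  VARCHAR(100) NOT NULL,
        state CHAR(2) NOT NULL
    );
    
    -- Denormalized (what most people actually do): duplicate city and
    -- state on every address and tolerate the redundancy.
    CREATE TABLE addresses (
        id     INTEGER PRIMARY KEY,
        street VARCHAR(200),
        city   VARCHAR(100),
        state  CHAR(2),
        zip    CHAR(5)
    );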

~~~
joe_the_user
_normalization is just a special case of DRY_

Uh, relational databases are a _well_ established field. The different normal
forms, among other useful things, provide a guide to preventing specific
kinds of logical errors that can otherwise appear during inserts, updates and
deletes. There's a lot here.

DRY - "don't repeat yourself" - is simply a slogan that doesn't provide
specific procedures for making things compact, how you do it or when its
appropriate. What you're saying is a lot like "abstract algebra is just a
special case of thinking logically" - sort of true but mostly deceptive if it
lets you dismiss a lot of accumulated knowledge with "just think logically".
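
To make those "specific kinds of logical errors" concrete, here's a sketch of
a classic update anomaly against a made-up unnormalized table, where the
winner's country is repeated on every row:

    -- Any code path that updates some rows but not others leaves the
    -- table contradicting itself:
    UPDATE winners SET country = 'Serbia'
    WHERE player = 'Novak Djokovic' AND year = 2012;
    -- Rows for other years still carry the old country value.

Normal forms tell you, procedurally, how to restructure the schema so that
class of bug becomes impossible by construction.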

~~~
ajross
I don't think you're seeing my point: the _value_ of "normalization" isn't its
specificity or the fact that it's grounded in a formal description. It's that
it's a straightforward way to reduce the maintenance burden of software using
the database (your words: "prevent specific kinds of logical errors"). That's
precisely the value of DRY philosophy.

I don't see it as controversial at all to paint it as a "subset" of DRY; I'm
sorry if it offends your academic aesthetic.

~~~
chris_wot
If you are curious as to why you have been voted down: your position is "why
do people care whether your data is in a specific normal form"? The answer, of
course, is that if you don't normalise a relational database then you will
find all sorts of problems down the track.

Enumerating the normal forms is important because you typically satisfy them
one after another. It's not just academic formalism to expound three normal forms. I
suspect you don't know much about relational theory, which is why you feel
that normalisation is a "mistake".

------
tomc1985
So are we supposed to be designing all our databases to conform to third
normal form? I am not very adept at DB stuff but doesn't that increase the
number of lookups needed to retrieve a row of useful data? Is the performance
hit from that less painful than storing redundant (perhaps simply for caching)
fields in one table?

~~~
mwexler
This is the classic denormalized model vs. the traditional database model. The
denormalized model, of storing all data in one lump, is popular with nosql
systems and object oriented systems, in that all the data you need for one key
is all in one "place". If you know that that will always be your primary
access pattern, then yep, it may be faster for you.

However, if you have a variety of ways to get at your data by this key for
this situation, or this key for that one, or if you have large amounts of
redundant data that you'd like to be updated "all at once", such as what
salesperson is assigned to the Northeast region for all customers, then you
may find a normalized form is more efficient.
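
A rough sketch of that salesperson case (table and column names assumed): in
the normalized design the assignment lives in one row, so the "all at once"
update is a single cheap statement.

    -- Normalized: one row per region holds the assignment.
    UPDATE regions SET salesperson = 'J. Smith' WHERE name = 'Northeast';
    
    -- Denormalized: the same change must touch every matching customer
    -- row, and any row missed is silently stale.
    UPDATE customers SET salesperson = 'J. Smith' WHERE region = 'Northeast';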

The big learning is YMMV: for years, relational was a religion. Snowflake and
other designs showed that there was a middle ground... and nosql has shown
that non-relational designs can work well also, esp in many online and large-
data situations. The trick is to pick the right one for the expected workload
your app will be facing, not to adhere to any religious dogma (3NF, NoSQL, or
anything in between).

~~~
sophacles
So a common thing in tech seems to be "oh this new tool will solve
everything!". I think this phenomenon is good in the sense that it means that
"religious" adherents to a tech will push its boundaries, find out where it is
well suited and where it lacks, etc. I think this is just a natural outgrowth
of excitement about the new shiny thing -- software people are surprisingly
passionate about their craft. Of course this has a tendency to lead to
annoying religious wars (sometimes at the level of pointless minutia like
semi-colons :) ).

But for me the exciting part is when the fervor dies down a bit, and there is
a clear understanding of "opposing" tools' strengths and weaknesses. That is
when interesting tech and techniques are developed to handle both based on the
strengths.

I think we are starting to get to that point with relational and nosql. Tools
are starting to emerge on the relational DB side to allow more nosql-like
structures within traditional DBMS systems. I think in the coming year or two
we'll start seeing a lot more interesting tools to mix relational and non-
relational datastores. Currently the common way of mixing is to use kv or
document stores as a caching layer for relational data, or to have relational
systems separate from nosql systems, with custom code to select from each. My
hope tho is that more uniform access methods will develop, allowing relational
and non-relational data to share a common, well understood access API layer.

For example I have a project where the core data consists of users, certain
user objects, and relations between users, between users and their or others'
objects, and between objects, basically shaped like a graph. Using neo4j +
gremlin makes dealing with this portion of the app trivial - the graphdb
literally solves problems for me. But another aspect of the app is relational
data on objects related to user records, basically it is shaped as rows where
n objects' rows need to be merged, filtered and then operated on (aka JOINed).
SQL makes this set of operations trivial, in this case a DBMS literally solves
problems for me. But, as it stands, I need to query one DB, get the object
record pointers I need, then connect to a different DB, and query the relevant
tables if I want to let the relevant DBs do the work. This involves different
APIs, weird connection management, and sysadminning two different database
environments.

Essentially I'm hoping someone smarter than me will come up with a way for me
to:

    
    
       SELECT MY_COLLATE(d1.a, d2.b) FROM
          GREMLIN{find object o1}.data as d1 JOIN
          GREMLIN{find object o2}.data as d2 ON
              d1.col == d2.col
       WHERE conditions

~~~
deafbybeheading
>I think we are starting to get to that point with relational and nosql. Tools
are starting to emerge on the relational DB side to allow more nosql-like
structures within traditional DBMS systems. I think in the coming year or two
we'll start seeing a lot more interesting tools to mix relational and non-
relational datastores.

Yep. Postgres showed some pretty strong commitment to key-value datatypes in
2006 by including the hstore contrib module in the mainline tree, and native
JSON support is coming in September with 9.2.
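
For the curious, a minimal sketch of hstore in practice (assuming PostgreSQL
9.1+, where CREATE EXTENSION is available):

    -- Schemaless key-value pairs inside an ordinary relational table:
    CREATE EXTENSION hstore;
    
    CREATE TABLE items (
        id    SERIAL PRIMARY KEY,
        attrs hstore
    );
    
    INSERT INTO items (attrs) VALUES ('color => "red", size => "L"');
    
    SELECT attrs -> 'color' FROM items;  -- returns 'red'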

------
fabricode
Very nice article, but I did not like the example data he was using.

The problem is that both population and tournament city could be dependent on
year.

Population (obviously) changes from year-to-year, so either his original data
is incorrect or he's recording a city's latest population along with all
historic tennis championship winners -- neither interpretation makes a lot of
sense. Why not use something unlikely to change such as either elevation or
country?

Likewise, tournament city is not a fixed value for a tournament. For example,
the Australian Open, which he uses as one of his examples, has also been held
in Sydney, Adelaide, Brisbane, Perth, Christchurch (NZ), and Hastings (NZ).

To resolve this, he would either have to introduce a two column primary key
(tournament, year), or pick some simpler data. I suggest the latter.
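
The composite-key version would look something like this (column names
sketched from the article's example):

    -- A result is identified by tournament AND year together:
    CREATE TABLE winners (
        tournament VARCHAR(50),
        year       INTEGER,
        winner     VARCHAR(100) NOT NULL,
        PRIMARY KEY (tournament, year)
    );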

Even with all of this criticism, I think it's one of the cleanest
introductions of the normal forms for beginners.

~~~
delan
These are very valid points that I did not consider when I was constructing
the example data.

I've replaced city and population with country and area to reduce confusion.

~~~
fabricode
Fantastic.

I clicked on your CV just because the link was available on the side, and I
was _shocked_ to find out that you're still in high school. There are
professionals that would have taken 10x as long to explain it only half as
clearly. For context, I've been in the business for ~25 years. Keep up the
great work.

------
chris_wot
Very clear. I found that the most comprehensive explanation of normalization
was in an Apress book, _Applied Mathematics for Database Professionals_.

Incidentally, did you cover 4NF and 5NF in the course? IMO, you'll almost
never need 6NF.

~~~
hzy
By 3NF, _most_ database schemas are good enough(tm); going further usually
means large amounts of work for only minor changes.

~~~
ticks
Over the years, I have come to realise that even the third normal form is
perhaps too much of an overhead for web apps. I guess that's why we have all
these modern alternative DBs.

~~~
hzy
I think 3NF is a good starting point, then as I said in another comment, you
start profiling to find where expensive queries are coming into play and
hurting your performance, then denormalize those issues away.

~~~
jebblue
There must be a better answer than denormalizing. I'd try to get a DBA or DBE
involved to look at the situation if possible. Re-structuring the query might
be all that's needed.
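
For example, the first thing to check is usually the plan itself (EXPLAIN
works in both MySQL and Postgres; the tables here are made up):

    -- See what the optimiser is actually doing before denormalizing:
    EXPLAIN
    SELECT c.name, r.salesperson
    FROM customers c
    JOIN regions r ON c.region_id = r.id
    WHERE r.name = 'Northeast';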

~~~
hzy
Of course that's an over-simplification; de-normalisation certainly fits in
after many other forms of optimisation.

~~~
chris_wot
I agree, I think denormalization is one of those things you do _after_ trying
to rewrite the query. After all, if you can get the optimiser to choose a
different access plan that's quicker by rewriting the query to be more
efficient, then you're better off overall. There are costs to denormalization
too... Not to mention you need to ensure that your application code writes
the duplicated data correctly.

------
zerovox
Surprised to see no mention of BCNF or 4NF, as these are the two most commonly
used in practice.

~~~
henrikschroder
I suspect that 1NF or 2NF are the most commonly used forms, simply because
most database schemas are made by people who don't really know or care, and
are satisfied when it works.

------
mattsouth
This is a good summary/introduction. Would it be improved with examples of
insertion/deletion/modification anomalies using the context of your data
example? This seems like a loose end to me at the moment.

------
jebblue
This was the best short article I've ever read on this complex subject.

