Understanding database normalisation: a very short tutorial (azabani.com)
152 points by delan on May 5, 2012 | 56 comments



I find it a bit disturbing how this post reaches the top of HN. But I suppose I shouldn't be surprised.

I probably live in my own little bubble, but only lately have I realized that NoSQL has two audiences: (1) People for whom normalization can't work because of their application's characteristics and the limitations of current hardware. (2) People who just don't understand basic relational concepts, or why they were invented in the first place.

It's kinda sad. I've consulted on projects where people implemented sharding before adding any indices to MySQL.

The thing about being in group (1) is that you can also recognize when the ground shifts beneath your feet. Artur Bergman is one of those guys.

http://www.youtube.com/watch?v=H7PJ1oeEyGg&feature=youtu...


There's a group (3), but it's likely just as small as group (1) or smaller: your data is easily normalized (and easier to work with in that form), but the cost of getting an RDBMS to support your write load is an order of magnitude higher than that of a persistent 'nosql' datastore from which you do batch dumps into an RDBMS.

The catch is that very few people end up in group (3) and still find that the cost of running/administering the RDBMS rather than the 'nosql' one actually matters.


The point of NoSQL is mostly group (3): web 2.0+ startups with potentially 1B users and 1000B things (comments etc.) per user. Joins and normalization are costlier here, and what's written in your Oracle development manuals for a small intra-company workgroup app doesn't work.

Then there's group (0), mostly DB developers from the client/server architecture: when they attack web 2.0 problems, they fail because they cling to dogmatic notions as if they were universally true. Whereas group (2) people are ignorant of relational concepts, group (0) people are stubborn and unteachable, and end up creating all kinds of scalability problems. They overuse normalization and forget that they're attacking the wrong problem with the wrong tools.

PS: I am not yet using NoSQL in production and have a solid past in Oracle/Db2/Ms SQL/Sybase; I'm now doing startups in MySQL/Postgres and Mongo.


Your group (3) is just his group (1) stated differently.


He denormalized the data.


yea, I show the nosql way by repeating the data.


I have to disagree with you. Normalisation is a very basic concept. Whilst it can cause a couple of problems, I am very sceptical about most startups hitting them. I'd like to see DBs introduce denormalisation as a feature, separating the logic from practicality.

Having said this, I like working with MongoDB. I like schemaless design and flexibility.


I was just going to write a similar thing. This shows that there is a significant percentage of HN users who haven't had a formal introduction to basic CS concepts.


I'm one of those people. I was your typical web monkey, learning everything from how-to guides and O'Reilly books. Luckily I had the chance to read O'Reilly's Oracle Design (1997), which is actually a cleverly disguised general purpose RDBMS design handbook, including a good description of the normalized forms.


The upvotes could also mean that people appreciate a good presentation of introductory material, and they may see a use for it as something to send to the newbies on their team.

For this subject, I usually send people to Bill Kent's Guide to the Five Normal Forms: http://www.bkent.net/Doc/simple5.htm


I'm one of those. I keep a list of good intro material I can send to friends, nephews, etc, who want to learn programming. HN and reddit are two good sources of those kinds of posts.


This is one of the cleanest explanations of normalization I have come across. I will use this with a few high school students I am working with, and I expect it will be pretty easy for them to understand.


It would have been a good explanation of normalization had the author picked some better examples for his data. The way it's written, the normalization ends up changing the semantics of the data, which is not what it's supposed to do.

- The normalized schema assumes that the location of a tournament will never change. If the 2014 Australian Open were to be held in Sydney, then looking up the location of the 2012 Australian Open would thereafter yield "Sydney".

- Population has the same problem. It's not likely that the population of a city will remain constant over time. The original database would give you the population at the time the tournament was held. The normalized version always gives you the current population of the city.
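
To make that concrete, here is a rough sketch of the kind of normalised schema involved (table and column names are assumed, not taken from the article), showing how a single location update rewrites history:

    -- Hypothetical normalised schema; names assumed.
    CREATE TABLE tournaments (
        name     VARCHAR(64) PRIMARY KEY,
        location VARCHAR(64)              -- one location per tournament, forever
    );

    CREATE TABLE results (
        tournament VARCHAR(64) REFERENCES tournaments(name),
        year       INT,
        winner     VARCHAR(64),
        PRIMARY KEY (tournament, year)
    );

    -- If the tournament later moves, updating its single location row...
    UPDATE tournaments SET location = 'Sydney' WHERE name = 'Australian Open';

    -- ...silently rewrites history for every past year:
    SELECT r.year, t.location
    FROM results r JOIN tournaments t ON t.name = r.tournament
    WHERE r.tournament = 'Australian Open' AND r.year = 2012;  -- now reports 'Sydney'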


I've updated the article to use data that does not (generally) change from year to year: country and area instead of city and population. This should eliminate the semantic issue and make it clearer.


Probably okay for your tutorial...but country areas change from time to time due to territorial disputes being resolved, or due to older erroneous info being changed.

Cities can also change country. Pristina used to be in Serbia but is now in Kosovo (although this is not universally acknowledged). Bratislava used to be in Czechoslovakia but is now in Slovakia.

While just pedantic in this case, it demonstrates how hard it can be to create a strong data model.


Yes, that's quite true. Creating even what I would have thought to be a 'simple' example isn't as trivial as I had expected.


Really, normalization is just a special case of DRY. And it has the same value for the same reasons. The attempt to enumerate the "kinds" of normal forms is, IMHO, a mistake. The only point to doing it is reducing bug surface (and just maybe storage space efficiency), so really: who cares whether or not your data is second or third normal form? Does the duplication of fields help or hurt your application?

A good example of this is US ZIP codes. In theory (I think), a ZIP uniquely identifies a city and state. But in practice no one stores it this way because the maintenance of that data (by all parties: some people don't know zips, or type them incorrectly) is more expensive than the savings from normalization.
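
As an illustration (purely hypothetical schema), the pragmatic choice usually looks like this:

    -- Deliberately "denormalised": city and state are stored with the address,
    -- even though in theory zip -> (city, state). Maintaining a canonical zip
    -- lookup table costs more than the redundancy saves, and user-typed zips
    -- are often wrong anyway.
    CREATE TABLE addresses (
        id     INTEGER PRIMARY KEY,
        street VARCHAR(128),
        city   VARCHAR(64),
        state  CHAR(2),
        zip    CHAR(5)
    );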


> normalization is just a special case of DRY

Uh, relational databases are a well established field. The different normal forms, among other useful things, provide a way to prevent specific kinds of logical errors that can otherwise appear during inserts, updates and deletes. There's a lot here.

DRY - "don't repeat yourself" - is simply a slogan: it doesn't give you specific procedures for making things compact, how to do it, or when it's appropriate. What you're saying is a lot like "abstract algebra is just a special case of thinking logically" - sort of true but mostly deceptive if it lets you dismiss a lot of accumulated knowledge with "just think logically".
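
To make "specific kinds of logical errors" concrete, here is a small illustration (invented schema, not from the article) of the classic update anomaly that 3NF rules out:

    -- Violates 3NF: department_phone depends on department, not on the key (id).
    CREATE TABLE employees (
        id               INT PRIMARY KEY,
        name             VARCHAR(64),
        department       VARCHAR(64),
        department_phone VARCHAR(16)
    );

    -- Update anomaly: changing one department's phone number must touch every
    -- employee row in that department; miss one and the data contradicts itself.
    UPDATE employees
    SET department_phone = '555-0100'
    WHERE department = 'Sales';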


I don't think you're seeing my point: the value to "normalization" isn't its specificity or the fact that it's grounded in a formal description. It's that it's a straightforward way to reduce the maintenance burden of software using the database (your words: "prevent specific kinds of logical errors"). That's precisely the value of DRY philosophy.

I don't see it as controversial at all to paint it as a "subset" of DRY; I'm sorry if it offends your academic aesthetic.


If you are curious as to why you have been voted down: your position is "why do people care whether your data is in a specific normal form"? The answer, of course, is that if you don't normalise a relational database then you will find all sorts of problems down the track.

Enumerating the normal forms is important as you typically must do one after another. It's not just academic formalism to expound three normal forms. I suspect you don't know much about relational theory, which is why you feel that normalisation is a "mistake".


There are actually ZIP codes that intersect multiple cities, counties, and even states. This is because ZIP codes are designed to make delivering mail easy. Your main point still stands, of course.


So are we supposed to be designing all our databases to conform to third-normal form? I am not very adept at DB stuff but doesn't that increase the number of lookups needed to retrieve a row of useful data? Is the performance hit from that less painful than storing redundant (perhaps simply for caching) fields in one table?


This is the classic denormalized model vs. the traditional database model. The denormalized model, storing all data in one lump, is popular with nosql systems and object-oriented systems, in that all the data you need for one key is in one "place". If you know that will always be your primary access pattern, then yep, it may be faster for you.

However, if you have a variety of ways to get at your data by this key for this situation, or this key for that one, or if you have large amounts of redundant data that you'd like to be updated "all at once", such as what salesperson is assigned to the Northeast region for all customers, then you may find a normalized form is more efficient.
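
A minimal sketch of that normalised form (table and column names assumed):

    -- The region -> salesperson assignment lives in exactly one row, instead of
    -- being repeated on every customer record.
    CREATE TABLE region_assignments (
        region      VARCHAR(32) PRIMARY KEY,
        salesperson VARCHAR(64)
    );

    CREATE TABLE customers (
        id     INTEGER PRIMARY KEY,
        name   VARCHAR(128),
        region VARCHAR(32) REFERENCES region_assignments(region)
    );

    -- Reassigning the Northeast is a single-row update, applied "all at once":
    UPDATE region_assignments SET salesperson = 'J. Smith' WHERE region = 'Northeast';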

The big lesson is YMMV: for years, relational was a religion. Snowflake and other designs showed that there was a middle ground... and nosql has shown that non-relational designs can work well too, especially in many online and large-data situations. The trick is to pick the right one for the expected workload your app will be facing, not to adhere to any religious dogma (3nf, nosql, or anything in between).


So a common thing in tech seems to be "oh this new tool will solve everything!". I think this phenomenon is good in the sense that it means that "religious" adherents to a tech will push its boundaries, find out where it is well suited and where it lacks, etc. I think this is just a natural outgrowth of excitement about the new shiny thing -- software people are surprisingly passionate about their craft. Of course this has a tendency to lead to annoying religious wars (sometimes at the level of pointless minutiae like semicolons :) ).

But for me the exciting part is when the fervor dies down a bit, and there is a clear understanding of "opposing" tools' strengths and weaknesses. That is when interesting tech and techniques are developed to handle both based on the strengths.

I think we are starting to get to that point with relational and nosql. Tools are starting to emerge on the relational DB side to allow more nosql-like structures within traditional DBMS systems. I think in the coming year or two we'll start seeing a lot more interesting tools to mix relational and non-relational datastores. Currently the common way of mixing is to use kv or document stores as a caching layer for relational data, or to have relational systems separate from nosql systems, with custom code to select from each. My hope, though, is that more uniform access methods will develop, allowing relational and non-relational data to share a common, well understood access API layer.

For example, I have a project where the core data consists of users, certain user objects, and relations between users, between users and their or others' objects, and between objects - basically shaped like a graph. Using neo4j + gremlin makes dealing with this portion of the app trivial; the graphdb literally solves problems for me. But another aspect of the app is relational data on objects related to user records: basically it's shaped as rows where n objects' rows need to be merged, filtered and then operated on (aka JOINed). SQL makes this set of operations trivial; in this case a DBMS literally solves problems for me. But, as it stands, if I want to let the relevant DBs do the work, I need to query one DB, get the object record pointers I need, then connect to a different DB and query the relevant tables. This involves different APIs, weird connection management, and sysadminning two different database environments.

Essentially I'm hoping someone smarter than me will come up with a way for me to:

   SELECT MY_COLLATE(d1.a, d2.b) FROM
      GREMLIN{find object o1}.data as d1 JOIN
      GREMLIN{find object o2}.data as d2 ON
          d1.col == d2.col
   WHERE conditions


>I think we are starting to get to that point with relational and nosql. Tools are starting to emerge on the relational DB side to allow more nosql-like structures within traditional DBMS systems. I think in the coming year or two we'll start seeing a lot more interesting tools to mix relational and non-relational datastores.

Yep. Postgres showed some pretty strong commitment to key-value datatypes in 2006 by including the hstore contrib module in the mainline tree, and native JSON support is coming in September with 9.2.
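
A quick hstore sketch (table, column and key names invented for illustration):

    CREATE EXTENSION IF NOT EXISTS hstore;

    -- Schemaless key/value attributes inside an ordinary relational row.
    CREATE TABLE products (
        id    INTEGER PRIMARY KEY,
        name  TEXT,
        attrs HSTORE
    );

    INSERT INTO products (id, name, attrs)
    VALUES (1, 'tennis racquet', 'weight=>310g, grip=>2');

    SELECT name, attrs -> 'weight' AS weight
    FROM products
    WHERE attrs ? 'grip';   -- "contains key" operator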


That's a process called de-normalisation, covered here: http://en.wikipedia.org/wiki/Denormalisation You should only consider doing that after thorough profiling showing your joins are causing the performance loss.


Premature optimization.

Normalized data is one of the important elements of relational database modeling, and almost all relational databases expect the data to be normalized, so they have optimizations in place to make those lookups fast; some will even create indices on foreign key columns automatically.

The first thing to do is use a normalized data model and profile it. If it's too slow add some indices and profile it again. If it's still too slow, update the table to store redundant information and profile it again.
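
Roughly, the progression looks like this (table and column names assumed):

    -- 1. Keep the schema normalised and measure the slow query.
    EXPLAIN SELECT o.id, c.name
    FROM orders o JOIN customers c ON c.id = o.customer_id
    WHERE o.placed_at > DATE '2012-01-01';

    -- 2. Add an index on the join/filter columns, then measure again.
    CREATE INDEX orders_customer_placed_idx ON orders (customer_id, placed_at);

    -- 3. Only if it is still too slow: store the customer name redundantly on
    --    orders, accepting that application code must now keep it in sync.
    ALTER TABLE orders ADD COLUMN customer_name VARCHAR(128);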


Not really. If you are concerned about extra lookups, you might want to consider a materialised view (or an indexed view in SQL Server). But bear in mind that write performance will be affected; whether this is worth the cost is a tradeoff. Also consider that databases have some very efficient join mechanisms. YMMV though.
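
A rough sketch of the materialised-view approach, in Postgres-style syntax (table names assumed):

    -- The join is computed once and stored, so reads skip the extra lookups
    -- at the cost of refresh/write overhead.
    CREATE MATERIALIZED VIEW tournament_results AS
    SELECT r.tournament, r.year, r.winner, t.country
    FROM results r
    JOIN tournaments t ON t.name = r.tournament;

    -- Reads hit the precomputed data; it is only as fresh as the last refresh.
    REFRESH MATERIALIZED VIEW tournament_results;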


I read somewhere a long time ago,

"Normalize till hell, denormalize till scale"

Or something like that :).


Normalize 'til it hurts, denormalize 'til it works.


Get it correct, then make it fast.


That's it. Normalisation is about correctness.

Every other aspect of software development using RDMSs will attempt to challenge decisions made for correctness' sake.


It's more important for storage space, quicker updates and faster indexing. If those things are less important (most cases), don't worry.


Cool, thanks :-)


Very nice article, but I did not like the example data he was using.

The problem is that both population and tournament city could be dependent on year.

Population (obviously) changes from year-to-year, so either his original data is incorrect or he's recording a city's latest population along with all historic tennis championship winners -- neither interpretation makes a lot of sense. Why not use something unlikely to change such as either elevation or country?

Likewise, tournament city is not a fixed value for a tournament. For example, the Australian Open, which he uses as one of his examples, has also been held in Sydney, Adelaide, Brisbane, Perth, Christchurch (NZ), and Hastings (NZ).

To resolve this, he would either have to introduce a two column primary key (tournament, year), or pick some simpler data. I suggest the latter.
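
For reference, the two-column-key option would look roughly like this (names assumed):

    -- The city now depends on (tournament, year), so a tournament can
    -- legitimately move between years without rewriting earlier rows.
    CREATE TABLE tournament_editions (
        tournament VARCHAR(64),
        year       INT,
        city       VARCHAR(64),
        winner     VARCHAR(64),
        PRIMARY KEY (tournament, year)
    );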

Even with all of this criticism, I think it's one of the cleanest introductions of the normal forms for beginners.


These are very valid points that I did not consider when I was constructing the example data.

I've replaced city and population with country and area to reduce confusion.


Fantastic.

I clicked on your CV just because the link was available on the side, and I was shocked to find out that you're still in high school. There are professionals that would have taken 10x as long to explain it only half as clearly. For context, I've been in the business for ~25 years. Keep up the great work.


Very clear. I found that the most comprehensive explanation of normalization was in an Apress book, Applied Mathematics for Database Professionals.

Incidentally, did you cover 4NF and 5NF in the course? IMO, you'll almost never need 6NF.


CJ Date describes Normal Form as common sense in Database in Depth: "...the principles are common sense - but they're formalised common sense. (Common sense might be common, but it's not so easy to say exactly what it is!)".

Formalisation of your normalisation! Applied Mathematics for Database Professionals is on the reading list, looking forward to it.


By 3NF, most database schemas are good enough(tm); going further usually results in minor changes for large amounts of work.


I worked as a DBA somewhere that did everything in 5NF. It was a nightmare. I only stayed there 3 months. In my last week there, I remember having to debug a 35 table join. I'm so glad I left.


The performance must have been horrendous! The query optimiser must have gone nuts - that's 35! possible join combinations :-(


I had left before they were willing to performance test the application. But I heard that after they launched, it was awful.

The Oracle optimizer quits optimizing after the first ~7 joins so we had to do manual optimization after that. Almost every query was at least 7 joins. The guy running the DBA group was following the Data Model Resource Book to a "T", which I think is appropriate for an OLAP database, but is not cool for OLTP.


If you want to see something hilarious, get a 12 join query and add "AND 1=0" on the end of the WHERE clause.

Oracle doesn't look at the query and immediately return no rows; instead, the optimizer continues to generate a full plan. Apparently, according to Oracle support, they have no intention of finding "AND false" predicates to shortcut query plan compilation.
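
For illustration only (the real query had 12 joins; the tables here are made up):

    SELECT t1.*
    FROM t1
    JOIN t2 ON t2.t1_id = t1.id
    JOIN t3 ON t3.t2_id = t2.id
    -- ...nine more joins...
    WHERE t1.status = 'ACTIVE'
      AND 1 = 0;   -- can never match, yet a full join plan is still generated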


That could work in PostgreSQL with the genetic join optimizer. You would not get the best plan or necessarily a stable one but it could be fast enough. I believe the worst reporting queries I have written could have been almost as bad as 35 joins. Of course they were not fast but not as slow as one might think.
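
For reference, the knobs involved are PostgreSQL session settings; a sketch, with the defaults noted as I recall them:

    SHOW geqo;             -- genetic query optimisation, on by default
    SHOW geqo_threshold;   -- exhaustive planning is abandoned at this many relations (default 12)

    -- Trade plan quality for planning time on a pathological reporting query:
    SET geqo_effort = 3;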


Over the years, I have come to realise that even the third normal form is perhaps too much of an overhead for web apps. I guess that's why we have all these modern alternative DBs.


I think 3NF is a good starting point, then as I said in another comment, you start profiling to find where expensive queries are coming into play and hurting your performance, then denormalize those issues away.


There must be a better answer than denormalizing. I'd try to get a DBA or DBE involved to look at the situation if possible. Re-structuring the query might be all that's needed.


Of course that's an over-simplification; de-normalisation certainly fits in after many other forms of optimisation.


I agree, I think denormalization is one of those things you do after trying to rewrite the query. After all, if you can get the optimiser to choose a different, quicker access plan by rewriting the query to be more efficient, then you're better off overall. There are costs to denormalization too... not to mention you need to ensure that your application code writes the redundant data correctly.


It depends what you are doing. If you have a lot of semi-structured data, like documents, then go with a document-oriented database. If you have semi-structured data that you don't want to do much analytics on, consider a key/value store. But if you require full ACID properties, want to reduce your application code and not worry so much about integrity, then an RDBMS is still the way to go.


Surprised to see no mention of BCNF or 4NF, as these are the two most commonly used in practice.


I suspect that 1NF or 2NF are the most commonly used forms, simply because most database schemas are made by people who don't really know or care, and are satisfied when it works.


The 3A/3B course has very little regard for modern best practices, just what was relevant 15 years ago when the course was written.


This is a good summary/introduction. Would it be improved with examples of insertion/deletion/modification anomalies using the context of your data example? This seems like a loose end to me at the moment.
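
For instance, assuming a flat table shaped like the article's example, the anomalies could be shown along these lines:

    -- Un-normalised: results(tournament, year, winner, tournament_country)

    -- Deletion anomaly: removing the only result recorded for a tournament also
    -- destroys the fact of which country that tournament is held in.
    DELETE FROM results WHERE tournament = 'Australian Open';

    -- Insertion anomaly: there is no way to record a new tournament's country
    -- until it has at least one (year, winner) result to attach the fact to.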


This was the best short article I've ever read on this complex subject.



