

Maybe Normalizing Isn't Normal - bdfh42
http://www.codinghorror.com/blog/archives/001152.html

======
ars
That 6-join query is wrong, and the rest of the article follows from that
incorrect join.

Why would you ever want a query that returned the user data multiple times,
once for every combination of screen_name and phone_number?!? No real site
would ever want to do that.

If you need phone numbers, you get phone numbers; if you need screen names,
you get those - but why would you need every possible combination of phone
number and screen_name?

He's simply wrong. Speeding up joins is never a reason to denormalize, for the
simple reason that denormalizing doesn't speed them up!

A valid reason to denormalize is to precalculate data, which he touches on
very briefly. But you always write that twice: once normalized, and that's the
primary data, and then again, the "cached/precalculated" version. You should
always be able to regenerate that from the normalized data.
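That write-it-twice pattern might look roughly like this (friendships and user_stats are hypothetical names; the point is that the cached table is fully regenerable from the normalized primary data):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE friendships (user_id INTEGER, friend_id INTEGER);
CREATE TABLE user_stats (user_id INTEGER PRIMARY KEY, friend_count INTEGER);
INSERT INTO friendships VALUES (1, 2), (1, 3), (2, 1);
""")

def rebuild_stats(con):
    """Regenerate the denormalized cache from the normalized primary data."""
    con.execute("DELETE FROM user_stats")
    con.execute("""
        INSERT INTO user_stats (user_id, friend_count)
        SELECT user_id, COUNT(*) FROM friendships GROUP BY user_id
    """)

rebuild_stats(con)
counts = dict(con.execute("SELECT user_id, friend_count FROM user_stats"))
print(counts)  # {1: 2, 2: 1}
```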

OK, one final reason to denormalize which he didn't even mention: if you need
the where clause on one table but the order by on a different table, you need
to denormalize, because you cannot create a combined index across two tables.
(Databases with function-based indexes might be able to, but that's pretty
complicated.)
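A rough sketch of that case, assuming a hypothetical posts/users schema: copying users.name into posts as a denormalized user_name column lets a single composite index serve both the filter and the sort:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE users (user_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE posts (post_id INTEGER PRIMARY KEY, user_id INTEGER,
                    category TEXT,
                    user_name TEXT);  -- denormalized copy of users.name
CREATE INDEX posts_cat_name ON posts (category, user_name);
INSERT INTO users VALUES (1, 'zoe'), (2, 'amy');
INSERT INTO posts VALUES (10, 1, 'tech', 'zoe'), (11, 2, 'tech', 'amy');
""")

# The composite index can satisfy both the WHERE and the ORDER BY in one
# pass; with normalized data the ORDER BY column lives in a different table,
# so no single index could cover both.
rows = con.execute("""
    SELECT post_id, user_name FROM posts
    WHERE category = 'tech' ORDER BY user_name
""").fetchall()
print(rows)  # [(11, 'amy'), (10, 'zoe')]
```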

------
gaius
Firstly, he has misunderstood database design. The design that the systems
analysts produce will be in 3NF, because their job is to understand the data in
its most abstract sense. They will also specify - again, fully normalized -
the main queries that will need to be run. Then it will go to the implementors
who will transform the tables and the queries for performance. Denormalizing
too early is a premature optimization. Not denormalizing at all is
inexperienced.

Secondly, if you need to scale you are most likely using a query cache (e.g.
Tangosol) between your RDBMS and your middle tier, so you can preserve a far
cleaner data model now than you could back in the day.

~~~
biohacker42
I think a lot of people also completely miss the fact that there is more than
one way to have a normalized DB. Much like OO design, there's a big gap
between smart, good, efficient implementations and the rest.

------
davidmathers
Signs of a certain type:

1. A post about relational databases where the word "relation" is never used.
Resulting in...

2. Missing the whole idea/point of relational databases--that relations are
logical data structures, not physical data structures. Resulting in...

3. Thinking that normalization applies to the physical table structure,
rather than the logical relation structure. And...

4. Not realizing that there are only 3 relations being represented by the
6-table representation. Resulting in...

5. A lot of writing about an absurd false dichotomy.

------
Tichy
Not a db expert, but I can't imagine joining tables is much of a problem. At
least in theory, the db should be able to handle the tables internally as if
they were only one table. Not sure if all db engines would be good enough at
optimizing, but it is probably one of the easier approaches to performance
tuning. Therefore I would expect that most db engines already handle that
nicely.

I think it is OK to cache some things, like aggregated values (for example how
many friends does a user have). But those cached values (or views) I would not
consider part of the actual base data model. Even though those values might be
stored in the same db, I would consider them to be another layer in a way.

~~~
michaelbuckbee
I'm not sure what you mean by "handle the tables internally", but as a really
rough exercise, you can think of just raw table reads.

If you have 6 tables that need to be joined, you're going to need to "look
through" each to find the ones with a matching id. If it's all one table, it's
essentially just one read.

Now, there are many, many caveats to all of this (how indexes are set up,
relative sizes of tables, etc.), but in general it is more computationally
expensive to do more joins.
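One way to see what indexes do to that "look through" is SQLite's EXPLAIN QUERY PLAN, which reports an index probe as a SEARCH rather than a full-table SCAN (table names here are made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE users (user_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE screen_names (user_id INTEGER, screen_name TEXT);
CREATE INDEX sn_user ON screen_names (user_id);
INSERT INTO users VALUES (1, 'alice');
INSERT INTO screen_names VALUES (1, 'alice_aim');
""")

# The plan's detail column shows the inner side of the join being probed
# via an index (a SEARCH step) instead of scanned row by row.
plan = con.execute("""
    EXPLAIN QUERY PLAN
    SELECT u.name, s.screen_name
    FROM users u JOIN screen_names s ON s.user_id = u.user_id
""").fetchall()
for row in plan:
    print(row[-1])
```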

~~~
Tichy
In theory the db could keep the separate tables in one table internally. But
granted, probably it won't always do so.

~~~
olavk
Yeah, this is called a materialized view.

------
michaelbuckbee
It's not mentioned in the article, but I think the advent of services like
Amazon SDB and Google App Engine's datastore (a la BigTable), where traditional
joins are not possible, is affecting application design.

It's probably a worthwhile exercise to think about how you would model your
data if you needed to port it to GAE/AWS SDB, just to figure out where the
faults in your original data model were.

------
neovive
Indexing your heavily used queries and page/view caching are also good
alternatives to denormalization.

------
brianr
_Duplicated data and synchronization problems are often overstated and
relatively easy to work around with cron jobs._

Yikes, I hope he's not actually doing that!

------
edw519
Normalization is not an either/or proposition.

You can have one table, normalize all the way to 4th normal form, or stop
anywhere in between, depending on the requirements of your app. You can use
multi-values within fields to simplify your schema. You can also write your
own processes to extract data, without using just pure SQL.

There's more than one way to do things, all with pros and cons. Naturally, in
the early phases of your project, you're probably best off ensuring the
integrity of your data (that is, normalizing) and worrying about scaling later.

------
ltbarcly
Yea, I think this article is just flaunted stupidity.

The author assumes that any time _any_ user information is needed, _all_ the
information about that user will be needed. That is silly. Secondly, he
completely ignores the billion and one problems he will now have.

For example, how do you ensure that the same AIM username is not used on
multiple accounts? With the normalized schema, this is trivially done via a
unique constraint on the user_id, screen_name, im_service triplet (probably
the 3 make up the composite primary key of the table anyway, so voila,
consistent data is ensured).

With the de-normalized schema, you are hopelessly fucked. All data consistency
would be forced into some kind of ugly trigger.
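ltbarcly's constraint can be sketched like this (table and column names are assumed). Note that the constraint only needs (im_service, screen_name): a unique constraint that also included user_id would still let two different accounts claim the same name:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE screen_names (
        user_id INTEGER,
        im_service TEXT,
        screen_name TEXT,
        UNIQUE (im_service, screen_name)
    )
""")
con.execute("INSERT INTO screen_names VALUES (1, 'AIM', 'cooldude')")

# A second account claiming the same AIM name is rejected by the database,
# so consistency is enforced without any application-level checks.
try:
    con.execute("INSERT INTO screen_names VALUES (2, 'AIM', 'cooldude')")
    taken = False
except sqlite3.IntegrityError:
    taken = True
print(taken)  # True
```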

Finally, joins are really-really-really fast for things like this. Frankly,
the real thing that bites you is inconsistent data that violates basic
assumptions your application makes, leading to unintended behavior of your
application, most often exceptions that are not handled reasonably. In this
case, for example, you might have a way to find your friend via AIM login
name, but your app blows up when 2 separate users are returned by the query
(or you or your ORM only gets the first result, making finding the actual
friend impossible).

And on and on. It is better to have slower fully reliable data than faster
bullshit.

