
Denormalization for Performance: Don't Blame the Relational Model - sgeneris
http://www.dbdebunk.com/2014/02/the-costly-denormalization-for.html
======
EGreg
In my experience, denormalization of a relational model is a special case of
"caching". What you are really doing is caching data where it is most likely
to be used at access time.

The overview is like this:

1) You need some operation to have low turnaround time / latency

2) So you maintain and update a cache, typically while writing to the data
store.

3) Like all caches, you can either invalidate it before changing the data,
slightly harming availability, or you can have it lag behind (eventual
consistency).

So the article is actually wrong: you don't need to trade consistency for
performance. You can increase read performance (lower latency) without losing
consistency by keeping a cache (the denormalization) and invalidating the
relevant cache entries BEFORE writing data (which costs some write latency,
though not necessarily write throughput... and typically we don't care about
write latency as much as read latency anyway.)
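
A minimal sketch of the pattern, using Python's sqlite3 with an in-process
dict standing in for the cache (the schema and names are just for
illustration):

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL)")

    cache = {}  # derived (denormalized) values live here, keyed by customer

    def customer_total(customer):
        # Read path: serve the derived value from the cache when we have it.
        if customer not in cache:
            row = db.execute(
                "SELECT COALESCE(SUM(total), 0) FROM orders WHERE customer = ?",
                (customer,),
            ).fetchone()
            cache[customer] = row[0]
        return cache[customer]

    def add_order(customer, total):
        # Write path: invalidate the relevant cache entry BEFORE writing,
        # which is the ordering described above.
        cache.pop(customer, None)
        db.execute("INSERT INTO orders (customer, total) VALUES (?, ?)",
                   (customer, total))
        db.commit()

    add_order("alice", 10.0)
    print(customer_total("alice"))   # 10.0, recomputed on first read
    add_order("alice", 5.0)
    print(customer_total("alice"))   # 15.0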

~~~
idbehold
Well you clearly have cache invalidation figured out. Any tips on naming
things?

~~~
CurtMonash
Similarly -- any thoughts about off-by-one?

~~~
endymi0n
Concurrency

Nice one. The only hard thing missing now is:

------
perfunctory
"Well implemented true RDBMSs..." exist only in the weird fantasy world of
relational purists. In that perfectly normalized world you just wave your
magic wand and all performance problems disappear.

~~~
kpmah
I think most are pragmatic and admit you may have to denormalise for
performance (e.g. I've read this in C.J. Date's books). I interpret their
point as being that the need for denormalisation is not a flaw in the
relational model, but a flaw in its implementations. The relational model does
not specify an on-disk format.

~~~
calpaterson
So rare to actually have to denormalise for performance today though. Query
planners are miles ahead of where they were in the 80s and 90s. SSDs delivered
an order of magnitude boost to performance. Rapidly expanding memory hugely
increases how much of your working set you can hold in RAM.

No one seems to talk about this but the "Moore's law has ended" meme does not
apply to most database scenarios. Database servers are normally not CPU bound:
they are bound by available memory and disk bandwidth and these are both still
increasing.

~~~
anonymous_fun
Anecdotally, I started at a company six years ago where hundreds of database
servers all ran on spinning disks. It was common to have to trace down bad
queries causing performance issues. Once the servers were upgraded with more
RAM and SSDs, the need to micromanage performance issues disappeared.

~~~
sgeneris
The sheer fact that so many of the comments are about physical storage is an
excellent validation of the article's claim that data professionals don't know
or understand data fundamentals and the RDM.

------
exmicrosoldier
I find zero explanation of how to solve performance problems with a relational
model.

As I understand the article, it seems to say...just because all the existing
databases you have seen suck at performance when normalized doesn't mean
normalization can't be fast.

~~~
julochrobak
There are several basic concepts you can apply to improve performance in an
RDBMS while still avoiding denormalization. For example:

* use as many constraints as possible (this helps the query optimizer)

* use indexes which bring better performance in your use case (e.g. a bitmap join index or even index-organized tables); see the sketch after this list

* apply table/index partitioning

* use materialized views as a query result cache
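
To illustrate the index point, here is a tiny sqlite3 sketch (made-up table,
standing in for a real RDBMS) showing how a covering index changes the query
plan:

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("""CREATE TABLE orders (
        id       INTEGER PRIMARY KEY,
        customer TEXT NOT NULL,   -- NOT NULL echoes the "use constraints" point
        placed   TEXT NOT NULL)""")

    query = "SELECT id, placed FROM orders WHERE customer = ?"

    # Without an index the planner has to scan the whole table.
    print(db.execute("EXPLAIN QUERY PLAN " + query, ("alice",)).fetchall())

    # A covering index turns the same query into an index-only lookup.
    db.execute("CREATE INDEX ix_orders_customer ON orders (customer, placed)")
    print(db.execute("EXPLAIN QUERY PLAN " + query, ("alice",)).fetchall())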

~~~
LordHeini
Or you just avoid all that hassle and have some duplicates.

I have never seen a properly denormalized table. In practice you will get a
"historically grown" system way too often, and doing anything like that will
break things.

The whole article seems to be quite academic. From my personal experience, a
textbook-normalized database is slow beyond belief (I did exactly that once
and we had to revert it).

~~~
MustardTiger
Or better yet, since you don't care about your data anyways, just don't bother
storing it. Infinitely scalable and always blazingly fast.

>I have never seen a properly denormalized table

Do you mean normalized? There's no such thing as "properly" denormalized;
anything that is not normalized is denormalized.

>The whole article seems to be quite academic. From my personal experience, a
textbook-normalized database is slow beyond belief (I did exactly that once
and we had to revert it).

I've seen lots of people say that, but then consistently found those same
people don't actually know what the normalization rules are, and all they did
was create a different denormalized database that happened to have poor
performance for the queries they were using.

------
haddr
Very nice article, but the claims are so far from what is practiced in
industry nowadays (see all the companies working with tons of data and
spinning up big data solutions).

So, while it might be perfectly true, it somehow misses some big point.

~~~
sgeneris
No, I think you missed its point.

I dare you to demonstrate that the "big data solutions" guarantee logical and
semantic correctness the way true RDBMSs do. Without that, it's garbage in,
garbage out. But because those "solutions" are so complex that nobody
understands them, including those who designed them, their results are
BELIEVED to be correct, which is not the same thing as being correct.

------
hvidgaard
I have never worked at true "web scale", so my experience may or may not apply
to such scenarios. I tend to make a more or less fully normalized structure,
and introduce aggregated data outside of the core database. It is essentially
introducting eventual consistency, for performance critical data. This data
can be stored in a database or a cache layer in front of the database if it is
accessed frequently enough to keep it off the disk all the time.
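
A rough sketch of that shape, with sqlite3 standing in for both the normalized
core and the aggregate store (the schema is invented; the refresh would
normally run on a schedule, which is where the eventual consistency comes
from):

    import sqlite3

    db = sqlite3.connect(":memory:")
    # Normalized core: one row per event.
    db.execute("CREATE TABLE page_view (id INTEGER PRIMARY KEY, page TEXT NOT NULL)")
    # Aggregate kept outside the core model and rebuilt periodically.
    db.execute("CREATE TABLE page_view_summary (page TEXT PRIMARY KEY, views INTEGER NOT NULL)")

    def refresh_summary():
        # Rebuild the aggregate from the normalized source of truth.
        db.execute("DELETE FROM page_view_summary")
        db.execute("""INSERT INTO page_view_summary (page, views)
                      SELECT page, COUNT(*) FROM page_view GROUP BY page""")
        db.commit()

    db.executemany("INSERT INTO page_view (page) VALUES (?)",
                   [("/home",), ("/home",), ("/about",)])
    refresh_summary()   # in practice this runs on a schedule, not inline
    print(db.execute("SELECT * FROM page_view_summary ORDER BY page").fetchall())
    # [('/about', 1), ('/home', 2)]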

------
woliveirajr
Good luck trying to show the number of likes on Twitter, Instagram, Facebook,
Kardashian's Tinder...

Any query that needs to calculate some result over many rows benefits from
storing pre-calculated values. With a trade-off in precision, correctness, and
so on.

Might apply to your business or not.
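
A small sketch of the pre-calculated-counter idea (sqlite3, made-up schema):
the detail rows stay the source of truth, and the counter is maintained on
every write so reads never have to aggregate.

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE post (id INTEGER PRIMARY KEY, likes INTEGER NOT NULL DEFAULT 0)")
    db.execute("CREATE TABLE post_like (post_id INTEGER NOT NULL, user_id INTEGER NOT NULL)")
    db.execute("INSERT INTO post (id) VALUES (1)")

    def like(post_id, user_id):
        # Record the fact, then bump the pre-calculated counter.
        db.execute("INSERT INTO post_like (post_id, user_id) VALUES (?, ?)",
                   (post_id, user_id))
        db.execute("UPDATE post SET likes = likes + 1 WHERE id = ?", (post_id,))
        db.commit()

    like(1, 42)
    like(1, 43)
    print(db.execute("SELECT likes FROM post WHERE id = 1").fetchone()[0])   # 2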

~~~
sgeneris
What does it have to do with normalization?

------
RmDen
As the saying goes... Normalize till it hurts... then denormalize till it
works :-)

~~~
sgeneris
What do you know about the hurts of denormalization? Can you even specify what
they are?

~~~
mateo411
Updates, inserts, and deletes become more complicated and expensive when you
denormalize.
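
For example, in a hypothetical denormalized design where the customer's city
is copied onto every order row, a single change of address becomes a multi-row
update, and missing any copy leaves contradictory data behind:

    import sqlite3

    db = sqlite3.connect(":memory:")
    # Denormalized: the customer's city is repeated on every order.
    db.execute("""CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer TEXT NOT NULL,
        customer_city TEXT NOT NULL)""")
    db.executemany("INSERT INTO orders (customer, customer_city) VALUES (?, ?)",
                   [("alice", "Oslo")] * 3)

    # The "simple" change of address is now a write that grows with the
    # customer's order history.
    db.execute("UPDATE orders SET customer_city = 'Bergen' WHERE customer = 'alice'")
    db.commit()
    print(db.execute("SELECT COUNT(*) FROM orders WHERE customer_city = 'Bergen'").fetchone()[0])   # 3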

------
sgeneris
One more thing: As these comments demonstrate, there is an almost exclusive
focus on performance, but nobody considers the drawbacks (costs) of
denormalization. Practitioners are oblivious to them.

------
thom
What systems that are usable today come closest to the Platonic ideal of
RDBMSs?

~~~
calpaterson
Today most RDBMSs have good query planners. Even recent versions of MySQL. In
my experience you very often get improvements in performance by normalising
(to BCNF). Most of the benefit comes from a) narrow tables reducing the
physical size of the working set (you can hold more of it in memory), b)
narrow tables handling writes quicker (more rows per page), and c) more
opportunities for covering indices.

~~~
rco8786
Isn't table width dictated by your data model though? And thus, in a purely
normalized db schema you have no control over table width?

~~~
calpaterson
Normalised designs invariably have pretty narrow tables. In most BCNF schemas
most tables will end up with 3 or 4 columns at most. Additionally, in the
situation where you have a lot of non-prime columns, you always have the option
of decomposing the table into multiple tables sharing the same primary key.
Useful if some columns are large but rarely used.
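
A quick sketch of that decomposition (invented schema, sqlite3): the large,
rarely-read column moves into its own table keyed by the same primary key, so
listing queries only ever touch the narrow table.

    import sqlite3

    db = sqlite3.connect(":memory:")
    # Hot, narrow table: touched by listing queries.
    db.execute("""CREATE TABLE article (
        id    INTEGER PRIMARY KEY,
        title TEXT NOT NULL)""")
    # Cold companion sharing the same primary key: only read on detail pages.
    db.execute("""CREATE TABLE article_body (
        id   INTEGER PRIMARY KEY REFERENCES article(id),
        body TEXT NOT NULL)""")

    db.execute("INSERT INTO article (id, title) VALUES (1, 'Denormalization')")
    db.execute("INSERT INTO article_body (id, body) VALUES (1, '...long body text...')")
    # The join is only paid when the wide column is actually needed.
    print(db.execute("""SELECT a.title, b.body
                        FROM article a JOIN article_body b ON b.id = a.id""").fetchall())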

~~~
MustardTiger
What are you talking about? A table has as many columns as the thing it
describes has attributes to be described. You can have a fully normalized
database with 50 column tables in it, and there's nothing unusual or suspect
about that.

~~~
calpaterson
No, relations describe facts, not things. Heath's theorem means you can
decompose a table/relation that has n non-key attributes into n tables. This
is something you have to do for BCNF.

Theoretically it's possible to imagine a normalised relation with 50 columns.
In practice a table like that is unusual and suspect (frankly, even in a
denormalised design).
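
For what it's worth, a toy check of the lossless-join property behind Heath's
theorem (relation and column names invented): for R(a, b, c) where a -> b
holds, projecting onto (a, b) and (a, c) and joining back on a reproduces R
exactly.

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE r (a TEXT, b TEXT, c TEXT)")
    # Sample rows chosen so the functional dependency a -> b holds.
    db.executemany("INSERT INTO r VALUES (?, ?, ?)",
                   [("a1", "b1", "c1"), ("a1", "b1", "c2"), ("a2", "b2", "c1")])

    original = set(db.execute("SELECT a, b, c FROM r").fetchall())
    rejoined = set(db.execute("""
        SELECT ab.a, ab.b, ac.c
        FROM (SELECT DISTINCT a, b FROM r) AS ab
        JOIN (SELECT DISTINCT a, c FROM r) AS ac ON ac.a = ab.a""").fetchall())
    print(original == rejoined)   # True: the decomposition is lossless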

~~~
sgeneris
There is NO relationship whatsoever between normalization and the number of
attributes, which is determined by the number of properties that entities have
in the real world, i.e., the conceptual model.

~~~
calpaterson
That is not in conflict with what I said.

~~~
sgeneris
No. I just thought the clarification was important.

