

Why do we need database joins? - timf
http://lemire.me/blog/archives/2010/11/29/why-do-we-need-database-joins/

======
Groxx
Sets up a straw-man:

    
    
      Article table like:
      authorID, authorName, publisher, title
    

Where an article only ever has one author. Then proceeds to knock it down
(summarized and slightly satirized):

>That's not even in second normal form. /makes an Author table. Voilà! Future
name changes are free.

>But wait! We don't change author names on published articles! * gasp * _we
don't need joins!_

No, you've just demonstrated that _blind optimizations are sometimes
incorrect_. Duh. They _are_ correct in this respect, however:

> _The dogma of normalization [too] often leads to over-engineering._

But all dogmatic adherence to X results in sometimes-incorrect use of X. This
is true everywhere, not just for RDBMS normalization.

------
blahedo
Meh. If you don't have "database joins", meaning SQL queries that have the
word JOIN in them, then you end up effectively replicating the join in your
glue code somewhere (either on the insert side or on the query side). And if
you put the join in the code, it is algorithmically fixed; there is no chance
that the database can optimise this for you.

On the other hand, if you write your query in SQL on a moderately normalised
database, you have the potential for the database itself to figure out what to
denormalise, and what to parallelise, and any other sorts of optimisations it
might do. It just has to know what the typical queries are.
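
For instance, a hedged sketch of the declarative version (hypothetical
articles/authors tables):

    -- The planner can choose a hash join, merge join, or index lookup
    -- depending on table sizes and statistics; the query doesn't say how.
    SELECT a.title, au.author_name
    FROM articles a
    JOIN authors au ON au.author_id = a.author_id
    WHERE au.author_name = 'Lemire';

Doing the same in glue code pins the strategy: fetch the author row, then loop
and fetch the articles one by one.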

There's a lot of power in using a declarative rather than an imperative
paradigm.

------
jazzychad
Ok, serious question. I have been writing SQL for 7 years now and can
normalize schemas with the best of them, but I want to learn the proper
techniques for creating scalable data sets like what one would store in GAE's
BigTable or Amazon SimpleDB...

Let's say I have a book review site. In my SQL world I have tables of 'users',
'authors', 'books', and 'reviews'. When a user creates a review, it creates a
new row in 'reviews' which contains the userid, bookid, and rating. Throw some
JOINs together and I can pull out rows that have the user data, author data,
book data, and the rating all together.
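
Roughly this shape of query (table and column names are just illustrative):

    SELECT u.user_name, a.author_name, b.title, b.isbn, r.rating
    FROM reviews r
    JOIN users   u ON u.user_id   = r.user_id
    JOIN books   b ON b.book_id   = r.book_id
    JOIN authors a ON a.author_id = b.author_id
    WHERE r.review_id = 42;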

What do you do in the non-normalized world? When a user creates a review, do
you just insert a row into a 'reviews' table that has user_id, user_name,
author_name, book_title, book_isbn, book_etc, ..., rating?
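
i.e. something like this sketch (column names made up):

    -- one wide row per review; reading it back needs no joins
    CREATE TABLE reviews (
        review_id   BIGINT PRIMARY KEY,
        user_id     BIGINT,
        user_name   TEXT,
        author_name TEXT,
        book_title  TEXT,
        book_isbn   TEXT,
        rating      INT
    );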

I can see how this would be a much more scalable table and fetching a review
is just one row, but it just seems like so much duplicate data...? Or maybe I
am thinking about it the wrong way?

~~~
riffraff
You're not. Denormalization implies duplication (though not all normalizations
are created equal). It also usually implies that you have to replicate DB
functionality in the client (e.g. write data in two places yourself).
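
For example (a sketch, assuming the wide denormalized review table from the
question above), a user renaming themselves means the client does the fan-out
itself:

    -- no foreign key keeps these in sync; the application issues both writes
    UPDATE users   SET user_name = 'New Name' WHERE user_id = 7;
    UPDATE reviews SET user_name = 'New Name' WHERE user_id = 7;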

But denormalization is not only a means to "infiniscale" storage; you can use
it even at smaller scale just to get better performance for certain things.

As usual, everything is a trade-off.

------
boyter
I think his argument is a bit flawed. Just because you can come up with one
argument for why joins don't work doesn't mean they are never the best
solution. I know that for my day job they are the correct solution, because I
need the data to be transactional and correct. I shudder to think how much
effort it would be to port it all to a key-value store and maintain the same
levels of data integrity.

~~~
artsrc
I don't know what 'I need the data to be transactional' means. One way to get
isolation is to put a whole transaction in one document. Often this is easier
and more flexible than trying to take the 'car to pieces' every time you park
it. It is also more efficient in network bandwidth.

Everyone wants their data to be correct for some definition of correct.
Normalized data is incorrect if the expectation is that relationships are
preserved as the author created them, rather than re-created as they would be
given the current database state.

For example, say a financial transaction record has an account holder and
their related address.

For a naive implementation, if the address on the record should stay as it was
when the transaction was created then:

    
    
      * The normalized implementation is wrong 
      * The de-normalized implementation is right
    

If the address should be resolved at query time:

    
    
      * The normalized implementation is right
      * The de-normalized implementation is wrong
    

A lot of relational tools (Foreign Keys, Unique Keys) become more complex for
temporal schemas, which address these kinds of issues.
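
For concreteness, a sketch of the two implementations (illustrative names;
assumes a separate holders table):

    -- snapshot semantics: copy the address onto the record when it is created
    CREATE TABLE txn_snapshot (
        txn_id         BIGINT PRIMARY KEY,
        holder_name    TEXT,
        holder_address TEXT  -- frozen as it was at creation time
    );

    -- lookup semantics: resolve the address through the holder at query time
    CREATE TABLE txn_normalized (
        txn_id    BIGINT PRIMARY KEY,
        holder_id BIGINT REFERENCES holders (holder_id)
    );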

~~~
DougBTX
Although I understand your point, I think you're confusing things by saying
that you will get different answers depending on the degree of
normalisation. Strictly speaking, if two databases differ only in their level
of normalisation, then they will store the same information, just in different
places. When you update the denormalised version, you will have to update the
same values in multiple places.

If you don't want to update that value everywhere, then what you have isn't
denormalisation, it is actually a different data model, which stores different
information. Since you wouldn't be able to get to the "normalised" version you
describe without losing information, it is probably wrong to call it
"denormalised" in the first place.

------
tzs
I looked into using some kind of NoSQL solution at work, and one thing puzzles
me. Assume a key/value database, where values are JSON documents or something
similar.

Using such a database, with the key being customer account ID, and the value
being a document that contains all the information about that customer's
account would be convenient and useful. Information about the account would
include things like name, address, email, status of the account, when their
subscription to our service expires, re-billing information, all their
purchases, and so on.

This would definitely be a lot more convenient than our current SQL-based
solution for two of the three things we do with our customer data.

One of those things is displaying a customer's information for our customer
support people when they are dealing with the customer on the phone or
working on a ticket in our help desk. With the NoSQL solution, it would be a
simple matter of getting the JSON document for that customer and displaying
it nicely.

The second thing is providing information to the installed software on the end
user's computer. Our software provides a service that can partly be done
locally, and partly requires back end support. All the back end information
needed to support a given customer's installed software would be in that JSON
document, so again the key/value store would work great.

However, I'm at a loss as to how the third thing would be handled efficiently:
reporting.

For instance, suppose I'm asked for a quick report on how well a new SKU has
sold over the last week. With our SQL database, this is a trivial select from
the "orders" and "line_items" tables. It is almost instantaneous. With the
NoSQL solution, it would appear that I would have to go look at every single
customer's entry.
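
In SQL it's roughly this kind of thing (a sketch; column names are made up,
date syntax is PostgreSQL-flavoured):

    SELECT SUM(li.quantity) AS units_sold
    FROM line_items li
    JOIN orders o ON o.order_id = li.order_id
    WHERE li.sku = 'NEW-SKU'
      AND o.created_at >= now() - interval '7 days';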

All in all, we've got a couple dozen regular reports, many running hourly, and
a few being interactive on-demand via a web interface, that are fairly
straightforward and efficient in our SQL-based solution (some might involve
joining 6 or 7 tables, but it is still straightforward and efficient).

Thus, at first glance, it would seem that going to NoSQL would be a nightmare
for my reports. Many times an hour, it would appear that the database would
get slammed by reports that would have to be accessing every freaking record
in the database. That doesn't sound like something that's going to make me
happy.

How do people deal with reporting when using NoSQL?

~~~
artsrc
If a query against an SQL (or any other) database is fast, there are a few
possibilities:

    
    
       1. There is an index
       2. The database is in memory and a scan is running in process
       3. The query result was precomputed
       4. The query was not that hard
    

With CouchDB you could create a view keyed on SKU and week, and reduce to a
count.
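
A sketch of what that design document might look like (field names like
doc.purchases and p.week are assumptions about how the customer documents are
laid out):

    {
      "_id": "_design/sales",
      "views": {
        "by_sku_week": {
          "map": "function (doc) { (doc.purchases || []).forEach(function (p) { emit([p.sku, p.week], null); }); }",
          "reduce": "_count"
        }
      }
    }

Querying the view with group=true (or with a specific [sku, week] key) reads
counts from the view index, which CouchDB maintains incrementally, instead of
scanning every customer document at report time.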

~~~
tzs
I'll have to give CouchDB a look.

------
wanderr
Thought experiment: imagine you are going to port Bugzilla to a noSQL
solution. How would you do it? There are some pretty obvious relationships
between bugs, authors, commenters and comments.

Imagine a bug that has 100 comments from as many commenters. A normalized
solution with no joins is going to require at least 200 round trips to look up
all the comment and commenter information, so the need to denormalize is
pretty obvious.

So let's say you denormalize all the way and save all relevant information in
the bug document. Each bug has comments embedded in it and each comment has all
of the commenter's information in it. Now each bug requires one lookup, which
is awesome, but what about the fact that commenter information can change?
Sure it's a somewhat rare event, but users can have thousands and thousands of
comments. When a user changes their email address, do you tell them, this
might take a few hours, in the meantime I hope you still have access to your
old address? What kind of load do those massive updates place on your system?
Unlike reads which are fairly steady and regular, these high load queries will
be sporadic. How do you plan for and accommodate the spikes?
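
Concretely, the fully denormalized bug document would look something like this
(a sketch; field names invented):

    {
      "_id": "bug-1234",
      "summary": "Crash on startup",
      "reporter": { "user_id": 7, "name": "Alice", "email": "alice@example.com" },
      "comments": [
        { "text": "Reproduced on trunk.",
          "commenter": { "user_id": 42, "name": "Bob", "email": "bob@example.com" } }
      ]
    }

One lookup per bug, but Bob's email address now lives in every comment he has
ever written.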

~~~
artsrc
One idea is that if an email address is updated the bugs are not updated
straight away (inconsistency). They are updated as load on the system allows.
Then you control the spikes as you see fit, to match your performance
requirements.

Another idea is to keep a list of un-applied changes and post-process with
those (fix the inconsistency). Banks used to do this when I was at university
(balance = overnight balance + today's transactions, plus an overnight batch
to get the balance up to date).
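
In SQL terms the read side of that pattern is a sketch like (made-up tables):

    -- current balance = last overnight snapshot + anything posted since
    SELECT s.balance
         + (SELECT COALESCE(SUM(t.amount), 0)
              FROM transactions t
             WHERE t.account_id = s.account_id
               AND t.posted_at  > s.snapshot_at) AS current_balance
    FROM balance_snapshots s  -- one row per account, written by the nightly batch
    WHERE s.account_id = 12345;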

If your original system was normalized and did not fit in memory, then the
joins cost (disk_access_time * items), which could easily have been 2 seconds
of disk I/O (say 200 lookups at 10 ms each). The notion that a DB round trip
is worse than a disk I/O is just wrong.

~~~
wanderr
There's a solution called sharding for when data doesn't fit in memory on one
box. Anyway, I'm not trying to make the case that one round trip is worse, but
certainly hundreds or thousands can be.

~~~
artsrc
Sharding does not allow arbitrary joins to work efficiently. This is a really
big issue with the relational model for data and flocks of cheap hardware for
storage.

------
andrewcooke
We need database joins so that we can derive new, interesting propositions
from the data already in the database.

A database is a set of facts about something. Joins are the logical reasoning
that lets you deduce new things.

If you never need to find out new things, then you don't need joins. So you
can get away without joins when you have a simple system that is doing
repetitive work. You can't avoid them when you are doing complex analysis or
presenting data in a variety of different ways.
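
For example (a sketch, reusing the book-review tables from elsewhere in this
thread): the stored facts are "who reviewed which book" and "who wrote which
book"; the join derives a new proposition, "which users have reviewed a Borges
book".

    SELECT DISTINCT r.user_id
    FROM reviews r
    JOIN books   b ON b.book_id   = r.book_id
    JOIN authors a ON a.author_id = b.author_id
    WHERE a.author_name = 'Borges';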

------
zmmmmm
This seems to overlook the really hard part where joins become important,
which is things that need to be transactional, such as how much money I have
in my bank account. I might not care if it takes 30 seconds for my change of
name to propagate to 300 different papers I have written, but it is crucially
important that I have only one total for my bank account and that it is never
allowed to fall below zero (simplistically).
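
The sort of thing that is hard to give up, sketched with a made-up accounts
table:

    BEGIN;
    UPDATE accounts SET balance = balance - 100 WHERE account_id = 1;
    UPDATE accounts SET balance = balance + 100 WHERE account_id = 2;
    COMMIT;
    -- with a CHECK (balance >= 0) constraint, the whole transfer rolls back
    -- if the first update would drive the balance negative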

------
rfugger
I can only conclude that this is an ironic statement implying that we need
database joins to keep a web server responsive...

------
Swizec
The site is down for me.

Anyway, I don't think database joins are all that important. They make sense
when structuring your schema - only store a unique piece of data once and then
connect other data to it. This is a very sound design choice.

But in light of modern key/value stores, I don't think one needs to perform
joins in the database anymore. Sure, it's theoretically a bit faster than
performing the join in your app, but once you consider all the other
implications you quickly realise you don't need them.

Consider these two options. One: pull data from a DB and it comes back neatly
tied together with all its meta-data, but the query takes almost a second, and
the DB is always worried about consistency and stuff, so writes are kind of
slow too.

Two: pull data from a DB, and if you need some related data you make another
query based on the key stored in the first piece of data. Each query takes a
fraction of a second; together they don't take as long as the full join query
... and you're only joining stuff when you actually need all of the data.
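
In SQL terms it's the difference between one join and two primary-key lookups,
roughly (a sketch):

    -- first fetch: the review itself, which carries the keys it points at
    SELECT user_id, book_id, rating FROM reviews WHERE review_id = 42;

    -- second fetch, issued only if the page actually needs the user's name
    SELECT user_name FROM users WHERE user_id = 7;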

The added benefit is that in this situation the DB doesn't have to worry about
almost anything and is thus much faster.

If that made any sense, I'm preparing a seminar on NoSQL so I'd love to hear
some input on my understanding of things :)

~~~
dstorrs
Down for me too, so I'm only responding to your comment.

>[using a DB join means that] the query takes almost a second

This is, at best, an overgeneralization and at worst outright FUD. I don't
know what sort of data or database engine you're working with that might make
this true, but it's definitely not a requirement that it be this way. The
RDBMS backing my site is trivially able to return all needed data in fractions
of a second for most cases regardless of the presence of joins. If it can't,
that generally just means that I forgot to index something properly.

Moving past this, going with a simple key/value store instead of an RDBMS
means giving up at least three of the ACID properties in favor of nothing more
than a speed gain.

- Atomic: No transactions. If the system dies between two related
commits... oops, your dataset is inconsistent.

- Isolated: No transactions, again. You cannot make guarantees about what
information an object will have -- e.g., you cannot say "Every user will
always have at least one entry in Pages, because we automatically create a
non-deletable Home page for every user at signup." If you rely on that, you
will eventually hit a user who is halfway through the signup process and has a
user entry but no Page entry (see the sketch after this list).

- Durable: If your backing store has not actually flushed your write to disk
and the system crashes... oops, you're hosed. I hope you weren't storing your
Accounts Receivable data in that NoDB.
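
The signup guarantee above, sketched as one transaction (made-up tables):

    BEGIN;
    INSERT INTO users (user_id, email) VALUES (7, 'new@example.com');
    INSERT INTO pages (user_id, kind)  VALUES (7, 'home');
    COMMIT;  -- afterwards either both rows exist, or neither does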

~~~
Groxx
> _Moving past this, going with a simple key/value store instead of an RDBMS
> means giving up at least three of the ACID properties in favor of nothing
> more than a speed gain._

This is, at best, an overgeneralization and at worst outright FUD.

> _The CouchDB file layout and commitment system features all Atomic
> Consistent Isolated Durable (ACID) properties._

<http://couchdb.apache.org/docs/overview.html> (to pick a _single_
counterexample)

I have no idea what anti-NoSQL things you've been reading, but they've been
wildly incorrect. And as to guarantees of what info an object has, schema-
based NoSQL databases exist, just as schema-free RDBMS databases exist.
Triggers frequently exist - you can automatically create a non-deletable
(permissions exist too) home page for every user at signup.

You're entirely correct that the claim that DB joins take a long time is
similarly wildly incorrect, though I'm not certain that's what they're
claiming. It sounds to me more like they're saying that pulling _all related
info_ on primary object X in your system takes a long time in a
highly-normalized RDBMS compared to a document-based DB where it's all in one
location. Which is frequently correct.

