
Relational Data, Document Databases and Schema Design - binbasti
http://www.paperplanes.de/2010/7/5/relational_data_document_databases_schema_design.html
======
gfodor
Document databases are a passing fad, a backlash against the flaws in the
currently available open source databases. These flaws have nothing to do with
the relational model.

If there were an open source RDBMS that allowed instant/lazy schema migrations,
offered direct map-reduce access to its data, was able to automatically
partition and shard, and somehow magically had performant joins, you would
never have seen these document databases become so popular. Document databases
have no mathematical grounding analogous to the relational model; the reason
is that they are a hackneyed application of the relational model that allows
implementation to bleed even further into the model than current RDBMSes do.
For example, the fact that a developer now has to choose between "embedding"
documents or joining them application-side makes absolutely _zero_ difference
in terms of the true data model. The data model is unchanged in either
scenario; it is just being _materialized_ differently on disk for
implementation reasons.

Edit: To be clear, the way things _should_ work is you should model your data
as a normalized, clean model. Then, you should be able to annotate the data or
provide alternative views that "embed" documents and so on to reduce cross-
table joins. This is an optimization layer and the actual materialization of
these views should be maintained by the database. It should be part of the
query optimizer, and your joins should simply speed up once they can access
the "embedded versions". The fact that you have to do all of this manually,
store the data yourself, and basically take on the role of the query optimizer
as well shows that we've traded C for assembly code in the db world with
these document databases.
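To make the "optimization layer" idea concrete, here is a minimal sketch in Python of what a database-maintained embedded view might look like. Everything here (class and method names, the in-memory dicts) is hypothetical; no real database exposes this API. The point is that the normalized model stays authoritative while the system, not the application, keeps the denormalized copy in sync.

```python
# Sketch: a normalized model plus a database-maintained "embedded"
# materialization, so the application never denormalizes by hand.
# All names are hypothetical; this is not any real database's API.

class MaterializedEmbedding:
    """Keeps an embedded (pre-joined) copy of related rows in sync."""

    def __init__(self):
        self.authors = {}    # author_id -> name            (normalized)
        self.books = {}      # book_id -> (title, author_id) (normalized)
        self.embedded = {}   # book_id -> {"title", "author"} (materialized view)

    def put_author(self, author_id, name):
        self.authors[author_id] = name
        # The system propagates the change into every affected embedded row,
        # playing the role the application plays in a document database.
        for book_id, (title, a_id) in self.books.items():
            if a_id == author_id:
                self.embedded[book_id] = {"title": title, "author": name}

    def put_book(self, book_id, title, author_id):
        self.books[book_id] = (title, author_id)
        self.embedded[book_id] = {"title": title,
                                  "author": self.authors[author_id]}

    def get_book(self, book_id):
        # Reads hit the pre-joined view: no application-side join needed.
        return self.embedded[book_id]
```

Updating an author's name here automatically fixes every embedded book record, which is exactly the bookkeeping a document database pushes back onto the application.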

It's truly a sad state of affairs that people's horrible experiences with
whatever particular RDBMS they've used are causing them to react by adopting
one of these document databases. Now they've got a whole series of new problems.
There is no free lunch until someone addresses some of the pain points like
schema evolution in a way that doesn't sacrifice everything else.

~~~
ljlolel
Hi, I think that you are completely right. Schema evolution and easy
partitioning/sharding are broken in current open-source RDBMSes. Relational
data has strong mathematical guarantees as well as useful normalization
properties for ensuring consistency.

What good solutions do you see for these problems? Are you suggesting that
Oracle, as a commercial product, does this well? Is anybody at all working on
a next-generation RDBMS that makes schema migration truly easy?

For the sharding/partitioning question, is it even possible to make a
relational data scheme that stays available and consistent in the case of
network partition?

~~~
gfodor
The solution is pretty clear: simply make it easy to do what the author
suggests in their article, in a safe and repeatable way. You define your new
schema, and you define the transformation rules to run on the old data. In
turn, the database applies these _lazily_. There'd be some formality necessary
to prove certain operations could be applied lazily while still providing a
consistent view of the data. People are doing this by hand now; it should be
part of the system. One can imagine a mathematical model that involves deferred
relational operators. The constraints imposed would probably be that certain
operations would not be permissible until others complete. For example, it
might be necessary, when running one of these lazy migrations, to "flush" the
remaining untransformed rows before you are able to apply another one. The
upside is that your migration scripts run instantly, because they have no side
effects until the next transaction.
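The lazy-migration scheme described above can be sketched in a few lines of Python. This is a toy design under the stated assumptions (rows tagged with a schema version, transforms applied on read and written back); it is not any existing database's migration machinery, and the names are invented for illustration.

```python
# Minimal sketch of a lazy migration: each row carries a schema version,
# and registered transformation rules are applied on read, so the
# "migration" itself returns instantly. Hypothetical design.

class LazyTable:
    def __init__(self):
        self.rows = {}            # key -> (version, row_dict)
        self.migrations = {}      # version -> transform to reach version + 1
        self.current_version = 0

    def insert(self, key, row):
        self.rows[key] = (self.current_version, row)

    def migrate(self, transform):
        """Register a transform old_row -> new_row. Returns instantly:
        no row is touched until it is next read."""
        self.migrations[self.current_version] = transform
        self.current_version += 1

    def read(self, key):
        version, row = self.rows[key]
        # Apply any pending transforms lazily, then persist the result
        # so each transform runs at most once per row.
        while version < self.current_version:
            row = self.migrations[version](row)
            version += 1
        self.rows[key] = (version, row)
        return row
```

A migration that splits a `name` column into `first`/`last` would then be a one-line `migrate(...)` call with zero upfront cost, and old rows get rewritten the first time they are read.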

The CAP theorem, I think, often gets rolled up into these "NoSQL" discussions
erroneously. One can still model data relationally but materialize it in a way
that relaxes ACID compliance to survive in the face of network partitions.
Arguably this is a red herring, though. In practice, the "network partition"
problem seems to me to be really worth worrying about only if you are Google
or Amazon and are running cross-data-center data storage. For the majority of
the world this is not something that needs to be worried about. (See: VoltDB's
choice on the matter.)

Edit: By the way, I'm not saying any of this is easy. It's hard, really hard.
(I couldn't do it.) But the point here is that it's easy to get caught up and
think these 'new' document databases are solving these problems. They're not.
They're putting you back down to the assembly code level and forcing you to do
these things yourself. The whole point of the RDBMS is to provide an
abstraction layer. They're removing this abstraction layer and heralding it as
a step forward. It is, in a bizarre sense, similar to how it's a "step forward"
for you to be able to access the registers on your CPU directly, but it's
surely a step backwards in many ways as well.

~~~
jchrisa
The compelling thing about CouchDB is the ability to take a copy of the
database offline, make changes, and then sync back up with other copies of the
database later (or never).

If this isn't a network partition I don't know what is.

Of course thinking like this is totally unlike what most relational users are
accustomed to, but this use case is only becoming more prominent as
applications spread to diverse and occasionally connected devices.

I have a hard time imagining practical offline APIs without the full MVCC
document model. Once you see this use case's value it's hard not to want to
write all your apps so that they can be replicated offline for disconnected
use.

~~~
mjw
One relevant point here is that it's relatively easy to merge two branches of
changes for a simple (key/value) data model, but harder to merge changes to
relational data where there are additional constraints (like foreign keys)
which must be preserved.

That said, just because it's easy to merge key/value data doesn't mean that
the results of the merge are necessarily going to be correct. The constraints
still exist; they're just not explicit in the data model. So when your network
partition gets joined up again and everything magically merges at the
key/value level, you may end up with data that's broken in subtle,
difficult-to-track-down ways, because implicit constraints weren't respected
by the merge.

So merging changes in a way that respects the constraints that apply to the
data is a hard problem in general. Key/value stores, with their less-
structured data model, sweep that problem under the carpet somewhat in order
to make things easier in distributed settings. Relational databases suffer
from trying to address the problem more fully.
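The failure mode is easy to demonstrate with a toy merge. The function below is purely illustrative (a simple "changed-replica-wins" merge over dicts, not any particular store's algorithm): every key merges mechanically without conflict, yet the result violates an implicit foreign-key-style constraint that the key/value model never knew about.

```python
# Sketch: a naive per-key merge of two diverged replicas succeeds
# mechanically, yet silently violates an implicit referential constraint.
# Illustrative only; not any real store's merge algorithm.

def merge(base, a, b):
    """Per-key merge: a replica's change (or deletion) wins over the base;
    replica A wins on conflict."""
    merged = {}
    for k in set(base) | set(a) | set(b):
        if k in a and a.get(k) != base.get(k):
            merged[k] = a[k]          # replica A changed or added this key
        elif k in b and b.get(k) != base.get(k):
            merged[k] = b[k]          # replica B changed or added this key
        elif k in a and k in b:
            merged[k] = base[k]       # unchanged everywhere
        # keys deleted by either replica are dropped
    return merged

base = {"author:1": "Codd",
        "book:1": {"title": "A", "author": "author:1"}}

a = dict(base); del a["author:1"]                    # A deletes the author
b = dict(base)
b["book:2"] = {"title": "B", "author": "author:1"}   # B adds a book by them

result = merge(base, a, b)
# No key-level conflict occurred, but book:2 now references a deleted
# author -- the implicit constraint was invisible to the merge.
```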

~~~
gfodor
This is a really important point. There seems to be a certain joy people take
in shedding their schema, since they can just get stuff done without it
getting in the way. This should be as scary as it is liberating. You're
basically flying on manual now, and there's nothing protecting your data
integrity anymore.

------
wanderr
I stopped reading when he said consistency is in your hands (with noSQL).
Since you have to manage relationships, and doing so requires several round
trips to the datastore, and generally doesn't involve transactions, what
happens when inserts 1 and 2 succeed, but insert 3 fails? What happens if you
crash before insert 3? In high volume applications, these situations will
arise and you will end up with inconsistent data. That doesn't prevent noSQL
from being useful in some situations, but it's a far cry from putting control
over consistency in your hands.
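The partial-failure scenario described here is easy to reproduce. The sketch below uses a plain dict as a stand-in for any non-transactional datastore, and the key names and "crash" are invented for illustration: three related writes, a crash before the third, and the first two persist with nothing to roll them back.

```python
# Sketch of the failure mode above: three related writes with no
# transaction, crashing after the second. The dict stands in for any
# non-transactional store; all names are illustrative.

store = {}

def create_order(order_id, user_id, item):
    store[f"order:{order_id}"] = {"user": user_id, "item": item}  # insert 1
    store[f"user:{user_id}:orders"] = [order_id]                  # insert 2
    raise ConnectionError("crashed before insert 3")
    store[f"inventory:{item}"] = "reserved"                       # never runs

try:
    create_order(1, 42, "widget")
except ConnectionError:
    pass

# The order and the user's order list persist, but the inventory was
# never reserved -- and nothing in the store records that fact.
```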

~~~
jchrisa
CouchDB has full ACID on a single document, so you can avoid these scenarios
by writing everything that needs to happen in a transaction into a single
document. Modeling your data like this means you are also more robust against
partial replications (when using offline mode).

It is possible to do banking this way (CouchDB is probably better for banking
than relational systems). Here's an example:

<http://books.couchdb.org/relax/reference/recipes>
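The gist of that pattern, as I understand it, is to make each transfer its own immutable document (one atomic write) and derive balances with a map/reduce-style fold, rather than updating balance fields in place. Here is a rough Python emulation of the idea under that assumption; it is conceptual only, not CouchDB's actual API.

```python
# Rough emulation of the recipe's pattern: each transfer is one document
# (so recording it is a single atomic write), and balances are derived
# by folding over the transfer documents. Not CouchDB code.

transfers = []   # the "database": a list of immutable transfer documents

def transfer(from_acct, to_acct, amount):
    # The whole transaction lives in one document: all-or-nothing.
    transfers.append({"from": from_acct, "to": to_acct, "amount": amount})

def balance(acct):
    # "Reduce": sum every transfer document touching this account.
    credits = sum(t["amount"] for t in transfers if t["to"] == acct)
    debits = sum(t["amount"] for t in transfers if t["from"] == acct)
    return credits - debits

transfer("alice", "bob", 50)
transfer("bob", "alice", 20)
```

Because no document is ever updated in place, a partially replicated set of transfers is still internally consistent; you may be missing history, but never half of a transfer.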

~~~
wanderr
I'm not too familiar with CouchDB in particular or what exactly a "document"
is. Let's say I have a site about books. I need to be able to look up books by
author, title, isbn and users who have it in their catalogue. Is there a way
in CouchDB to ensure consistency while maintaining those relationships?

~~~
jchrisa
Yes, that's what the incremental map-reduce indexes are used for. Each book
would be a document, which you could query by any of those attributes. I'd
store the fact that a user owns a book as its own document, and then you
can easily get a view of books by user, or of users by book.
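To illustrate the shape of those views: a map function emits (key, value) pairs per document, and the resulting index answers lookups by that key. The snippet below emulates the concept in Python (CouchDB itself defines views in JavaScript, and maintains them incrementally rather than rebuilding them as this toy does); the document fields are invented for the book-site example.

```python
# Sketch of map-style views over documents: one document type per fact
# (a book, an ownership), with a map function per query. Conceptual
# emulation only; real CouchDB views are JavaScript and incremental.

docs = {
    "book:1": {"type": "book", "title": "Relax",
               "author": "Brown", "isbn": "978-0-00-000000-0"},
    "own:1":  {"type": "ownership", "user": "alice", "book": "book:1"},
}

def map_books_by_author(doc_id, doc):
    if doc["type"] == "book":
        yield (doc["author"], doc_id)

def map_books_by_user(doc_id, doc):
    if doc["type"] == "ownership":
        yield (doc["user"], doc["book"])

def build_view(map_fn):
    # Run the map function over every document and group by emitted key.
    index = {}
    for doc_id, doc in docs.items():
        for key, value in map_fn(doc_id, doc):
            index.setdefault(key, []).append(value)
    return index
```

Queries by title or ISBN are just more map functions over the same book documents, so the relationships wanderr asked about live in the indexes rather than in foreign-key constraints.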

