
Use the database built for your access model - eigenrick
http://queue.acm.org/detail.cfm?id=2696453
======
eigenrick
I posted this here in hopes of starting a discussion about what follow-ups,
if any, people would like to see to this article. It is a rather broad topic,
and even so the article ran over 5,000 words (apologies).

Are there any related, more specialized topics relating to databases people
would like to learn more about? Distributed? Transactions and serialization
guarantees? Disks?

~~~
nkurz
More specific discussion of "These sophisticated schemes work well for 1 GB of
data but not so well for 1 TB of data" would be nice. Intuitively, multiple
indexes would seem to be more useful on large datasets than on small ones.
What does work better on the bigger datasets?

I also felt uneasy about the piece's transitions back and forth between
"latency" and "throughput" without any discussion of "concurrency". For example,
"At present, a 7200-RPM disk has a seek time of about four milliseconds. That
means it can find and read new locations on disk about 250 times a second. If
a server application relies on finding something on disk at every request, it
will be capped at 250 requests per second per disk" assumes a queue depth of
1, whereas "Assuming that writing a record of data takes five milliseconds,
and you have to write 20 different records to disk, performing these
operations in the page cache and then flushing to disk would cost only a
single disk access, rather than 20" assumes that the writes can all be issued
in parallel. Perhaps a discussion of Little's Law and how it applies to
database architecture?
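
To make that concrete (my numbers, not the article's): Little's Law says the
average number of requests in flight equals throughput times latency (L = λW),
so the 250-requests-per-second figure is really a statement about queue depth
1. A toy sketch in Python:

    # Little's Law: L = lambda * W (in-flight requests = throughput * latency).
    # The 4 ms figure is the article's; everything else here is hypothetical.

    def required_in_flight(throughput_per_s: float, latency_s: float) -> float:
        """Average concurrency needed to sustain a given throughput."""
        return throughput_per_s * latency_s

    latency = 0.004  # ~4 ms per seek on a 7200-RPM disk

    # With only one request in flight, throughput caps at 1 / latency:
    print(f"queue depth 1 cap: {1 / latency:.0f} requests/sec")  # ~250

    # Sustaining 2,000 requests/sec at 4 ms latency requires ~8 requests
    # in flight on average -- i.e. more spindles and/or deeper queues:
    print(f"needed in flight: {required_in_flight(2000, latency):.0f}")  # 8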

------
dragontamer
This article has a fundamental misunderstanding of relational databases.

The building block of relational databases... is the mathematical term
"relation".

http://en.wikipedia.org/wiki/Finitary_relation

Relations are most efficiently stored in tables. The use of tables has
absolutely nothing to do with disk space, and everything to do with
deduplication.

You see, when relations are pure and well understood, certain data-corruption
issues literally cannot come up. Each step of normalization wipes out a class
of possible corruption, from update inconsistencies to insert inconsistencies.

" A man with two watches never knows what time it is. "

Similarly, if you ever store the same data twice, you'll never really know
which copy is real, especially if the two copies were stored differently.
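
To sketch the anomaly (a contrived example of mine, not from the article):
once a fact is stored twice, a partial update leaves the database unable to
say which copy is true.

    # Hypothetical denormalized rows: the customer's email repeats on
    # every order, so the same fact is stored more than once.
    orders = [
        {"order_id": 1, "customer": "ada", "email": "ada@old.example"},
        {"order_id": 2, "customer": "ada", "email": "ada@old.example"},
    ]

    # An update that touches only one row corrupts the data silently:
    orders[0]["email"] = "ada@new.example"

    emails = {row["email"] for row in orders if row["customer"] == "ada"}
    print(emails)  # two different emails -- which one is real?

    # Normalized, the email lives in exactly one place, so this update
    # anomaly cannot occur by construction:
    customers = {"ada": {"email": "ada@new.example"}}
    orders2 = [{"order_id": 1, "customer": "ada"},
               {"order_id": 2, "customer": "ada"}]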

This is the fundamental rule that relational databases and normalization
follow. Normalization can, moreover, be applied to any database; relational
databases just happen to be the most common, so nearly algorithmic
normalization procedures have been developed for them.

I would highly suggest that the author read Chris Date's relational database
books before writing any more articles on databases in general. Chris Date was
one of the earliest authorities on the relational model, btw, so I'll trust
his opinion of relational databases over anyone else's.

~~~
eigenrick
I agree that relational models have everything to do with deduplication.

It is one area where purity in mathematics translates well to solving
practical problems. Often, theoretical models of computation or storage
translate very poorly to everyday use; when they can be represented at all,
it is at a cost, and so they go unapplied.

Maybe I'm a pessimist, but I'd argue that, while normalization does do nice
things for data consistency, people often use (or used) it for the
disk-storage savings. In my experience, they rarely, if ever, achieve a high
enough level of normalization to get much benefit.

~~~
dragontamer
I mainly study publicly known databases for this sort of stuff. I'm not a
professional web developer, but I do what I can to keep up with technology.

From this perspective, we can study Wikipedia's schema.

As far as I can tell, the schema is designed for data consistency.
http://upload.wikimedia.org/wikipedia/commons/3/36/Mediawiki_database_Schema.svg

From what I can tell, Wikipedia mostly operates in 3rd normal form (or
higher), the general standard that most places aim for. (6th normal form is
the theoretical endpoint that nobody bothers to reach.)

Besides, just because relational databases offer an easy path to normalization
doesn't mean you have to take it. If you want to minimize the number of joins
in your database for some reason, 1st normal form (i.e., essentially no
normalization at all) is always available.

That's only recommended if joins really are becoming a problem, though.

----------------

Besides, relational databases are hardly the most efficient with disk space.
Every index, foreign-key constraint, or check condition adds a ton of overhead
on top of every datum.

I'd argue that key-value stores are the most efficient in disk-space usage.
Certainly more efficient than your typical relational database, IMO.

------
hallmark
Would you consider graph to be a fourth type of database, after relational,
key-value, and document/hierarchical?

~~~
eigenrick
Ooh. Good question. In implementation, they end up looking a lot like
key/value stores, since most of the ones I know are implemented as edge-vertex
associations.
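
As a minimal sketch of that edge-vertex idea (key names invented for
illustration; real engines use packed binary layouts):

    # A tiny property graph laid over a plain key/value store.
    kv = {}

    def add_vertex(vid, props):
        kv["v:" + vid] = props
        kv.setdefault("out:" + vid, [])

    def add_edge(src, dst, label):
        kv["out:" + src].append((label, dst))  # adjacency entry per edge

    add_vertex("alice", {"kind": "person"})
    add_vertex("bob", {"kind": "person"})
    add_edge("alice", "bob", "follows")

    # A traversal is just repeated key lookups, one get per hop:
    for label, dst in kv["out:alice"]:
        print(label, "->", kv["v:" + dst])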

However, there is certainly a lot of specialized functionality on top of that.
You can then turn around and apply both relational models and hierarchical
models with them.

There are definitely some use cases for which I would heartily recommend a
graph database over the others, so, yeah, it is another category. It is also
something that should have been mentioned in this article. :)

~~~
jandrewrogers
A simple key-value model is a very low-scalability, low-performance way of
implementing a graph database. The key to graph-database performance is
maintaining consistent locality over traversals, which is no small trick from
a computer science standpoint. I know a lot of graph databases work using
naive key-value lookups, but it is not recommended.

Most modestly scalable graph databases are implemented as traditional
relational-style engines with highly optimized equi-join recursion. The most
scalable implementations use topological methods that are beyond the scope of
this post but definitely not simple key-value designs.
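
To illustrate the contrast (toy code of mine, not any particular engine):
naive key-value traversal issues one lookup per vertex, while equi-join
recursion processes a whole frontier per step against the edge relation,
which is what gives the engine something to batch and localize.

    # Transitive closure via frontier-at-a-time equi-join recursion.
    # The edge relation is hypothetical data for illustration.
    edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]

    def reachable(sources):
        """Repeatedly equi-join the frontier with the edge relation."""
        seen, frontier = set(sources), set(sources)
        while frontier:
            # One join per level, not one lookup per vertex:
            frontier = {dst for (src, dst) in edges if src in frontier} - seen
            seen |= frontier
        return seen

    print(reachable({"a"}))  # {'a', 'b', 'c', 'd'}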

