Are there any related, more specialized database topics people would like to learn more about? Distributed systems? Transactions and serializability guarantees? Disks?
I also felt uneasy about the piece's transitions back and forth between "latency" and "throughput" without any discussion of "concurrency". For example, "At present, a 7200-RPM disk has a seek time of about four milliseconds. That means it can find and read new locations on disk about 250 times a second. If a server application relies on finding something on disk at every request, it will be capped at 250 requests per second per disk" assumes a queue depth of 1, whereas "Assuming that writing a record of data takes five milliseconds, and you have to write 20 different records to disk, performing these operations in the page cache and then flushing to disk would cost only a single disk access, rather than 20" assumes that the writes can all be issued in parallel. Perhaps a discussion of Little's Law and how it applies to database architecture would help?
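Little's Law ties the two views together: concurrency = throughput x latency, so the throughput a disk can sustain depends on how many requests are in flight at once. A minimal sketch of that arithmetic, reusing the article's 4 ms seek figure (the function name and numbers are just illustrative):

```python
# Little's Law: concurrency = throughput * latency, rearranged to
# throughput = concurrency / latency.
SEEK_TIME_S = 0.004  # ~4 ms average seek, per the article's 7200-RPM example

def max_throughput(queue_depth: int, latency_s: float = SEEK_TIME_S) -> float:
    """Requests per second sustainable with `queue_depth` requests in flight."""
    return queue_depth / latency_s

# Queue depth 1 gives the article's ~250 req/s ceiling; 20 requests
# issued in parallel give 20x that, which is why batching wins.
print(max_throughput(1))   # ~250 req/s
print(max_throughput(20))  # ~5000 req/s
```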
The building block of relational databases... is the mathematical concept of a "relation".
Relations are most efficiently stored in tables. The use of tables and all that has absolutely nothing to do with disk space, but everything to do with deduplication.
You see, when relations are pure and well understood, certain data-corruption issues literally cannot come up. Each step of normalization eliminates a class of them, from update anomalies to insertion anomalies.
" A man with two watches never knows what time it is. "
Similarly, if you ever store the same data twice, you'll never really know which copy is real, especially if the two copies were stored differently.
This is the fundamental rule that relational databases and normalization follow. Normalization, moreover, can be applied to any database; relational databases just happen to be the most common, which is why nearly algorithmic normalization procedures have been developed for them.
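The two-watches problem is easy to demonstrate. A small sketch (the table contents are made up) of the update anomaly that normalization removes:

```python
# Denormalized: the customer name is duplicated on every order row.
orders_denorm = [
    {"order_id": 1, "customer": "Acme Corp", "total": 100},
    {"order_id": 2, "customer": "Acme Corp", "total": 250},
]
orders_denorm[0]["customer"] = "Acme Inc"  # rename misses order 2...
names = {o["customer"] for o in orders_denorm}
assert len(names) == 2  # ...two watches, two different times

# Normalized: the name lives in exactly one place, so it cannot
# disagree with itself.
customers = {42: {"name": "Acme Corp"}}
orders = [
    {"order_id": 1, "customer_id": 42, "total": 100},
    {"order_id": 2, "customer_id": 42, "total": 250},
]
customers[42]["name"] = "Acme Inc"  # one update, and every order agrees
assert {customers[o["customer_id"]]["name"] for o in orders} == {"Acme Inc"}
```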
I would highly suggest that the author read Chris Date's relational database books before writing any more articles on databases in general. Date worked closely with E. F. Codd and is one of the pioneers of relational database theory, btw, so I'll trust his opinion of relational databases over anyone else's.
It is one area where purity in mathematics translates well into solving practical problems. More often, theoretical models of computation or storage translate poorly to everyday use: when they can be represented at all, it is at a cost, and so they are not applied.
Maybe I'm a pessimist, but I'd argue that while normalization does do nice things for data consistency, people often use (or used) it for the disk-storage savings. In my experience, they rarely, if ever, achieve a high enough level of normalization to get much benefit.
From this perspective, we can study Wikipedia's schema.
As far as I can tell, the schema is designed for data consistency.
As far as I can tell, Wikipedia mostly operates in 3rd normal form (or higher), the general standard most places aim for. (6th normal form is the theoretical maximum that almost no one bothers to reach.)
Besides, just because relational databases offer an easy path to normalization doesn't mean you have to take it. If you want to minimize the number of joins in your database for some reason, 1st normal form (i.e., essentially no normalization whatsoever) is always available.
Only recommended if joins really are becoming a problem, though.
Besides, relational databases are hardly the most efficient with disk space. Every index adds a copy of part of the data, and foreign-key constraints and check conditions add overhead on top of every datum.
I'd argue that key-value stores are the most efficient in disk-space usage; certainly more efficient than your typical relational database, IMO.
New, redesigned OLTP systems like VoltDB are also much faster than the original disk-based RDBMSs for transactional workloads.
I'm not an expert on the subject, but wasn't MapReduce the primary driver of NoSQL? Hadoop, HBase, etc. Key-value stores, definitely (which have been mentioned), but even the most basic key-value store is more than a glorified hash table.
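To illustrate that last point: even a toy persistent key-value store needs an append-only log and an offset index on top of the in-memory hash table, so it survives restarts. A rough sketch, loosely in the spirit of the Bitcask design (the class and file names are made up, and real stores also fsync, checksum, and compact the log):

```python
import os
import tempfile

class TinyKV:
    """Toy key-value store: append-only log file plus in-memory offset index."""

    def __init__(self, path: str):
        self.index = {}            # key -> byte offset of its latest record
        self.f = open(path, "a+b")

    def put(self, key: str, value: str) -> None:
        self.f.seek(0, os.SEEK_END)
        offset = self.f.tell()
        self.f.write(f"{key}\t{value}\n".encode())  # breaks on tab/newline keys
        self.f.flush()             # a real store would fsync and checksum here
        self.index[key] = offset   # last write wins

    def get(self, key: str) -> str:
        self.f.seek(self.index[key])
        _, value = self.f.readline().decode().rstrip("\n").split("\t", 1)
        return value

path = os.path.join(tempfile.mkdtemp(), "tiny.log")
kv = TinyKV(path)
kv.put("color", "red")
kv.put("color", "blue")   # old record stays in the log; the index moves on
print(kv.get("color"))    # prints "blue"
```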
There are a number of databases using unique storage models to achieve great numbers in the circumstances they're optimized for (duh). DeepDB is an interesting one that is getting great throughput while running complex queries on incoming data. I believe they're using LSM trees under the hood.
I was definitely trying to avoid doing a survey of all the products out there and what they're using (that would be a great article, though). I mainly wanted to make people cognizant of the models they might encounter and why/how those models are used. Even that was a bit ambitious to attempt in a single article.
A graph database is essentially a relational implementation that supports recursive joins. While recursive joins are not part of the strictest, most minimalist relational model in theory, most mature SQL databases support them to some extent. Some databases that support recursive joins will limit you to directed acyclic graphs.
A document database is a relational database implemented around materialized joins: it saves you from performing common joins dynamically. It is more restricted than a graph because the structure is a directed acyclic graph only. Again, most mature SQL databases support this to some extent or another.
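Put another way, a "document" is a join computed ahead of time. A small sketch of that idea (the schema and function name are invented for illustration):

```python
# Normalized rows, joined at query time...
customers = {42: {"name": "Acme Corp"}}
orders = [
    {"order_id": 1, "customer_id": 42, "total": 100},
    {"order_id": 2, "customer_id": 42, "total": 250},
]

def materialize(customer_id: int) -> dict:
    """Pre-compute the customer<->orders join into one nested document."""
    doc = dict(customers[customer_id])
    doc["orders"] = [o for o in orders if o["customer_id"] == customer_id]
    return doc

# ...versus the nested document a document store would hand back directly,
# with no join needed on the read path.
doc = materialize(42)
print(doc["name"], len(doc["orders"]))  # prints "Acme Corp 2"
```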
A relational database that supports both materialized and recursive joins has the same expressiveness as a document and/or graph database. However, it trades some query performance in those individual cases for flexibility and performance in other cases that both document and graph databases are relatively poor at. Document and graph databases have existed since databases were first invented. Most of the implementation optimizations for documents and graphs are optionally available in good SQL databases.
Key-value databases are a different beast. Most people only consider primitive databases based on key equality relationships or ordered ranges, which are quite limited. You can also build them based on multidimensional spatial relationships (i.e. keys are hyper-rectangles that can be relatively addressed), which are equivalent in expressiveness to relational implementations. If you added recursive joins to implementations of the spatial variants, you'd again have something that could express all of the models mentioned.
In summary, these models are APIs; they are not fundamentally different things if the underlying database engine is adequate for the purpose. That is why, for example, PostgreSQL can be used quite effectively these days as a document database.
However, there is certainly a lot of specialized functionality on top of that, and you can then turn around and apply both relational models and hierarchical models with it.
There are definitely some use cases for which I would heartily recommend a graph database over the others, so, yeah, it is another category. It is also something that should have been mentioned in this article. :)
Most modestly scalable graph databases are implemented as traditional relational-style engines with highly optimized equi-join recursion. The most scalable implementations use topological methods that are beyond the scope of this post, but they are definitely not simple key-value designs.