Hacker News
Use the database built for your access model (acm.org)
38 points by eigenrick on Dec 10, 2014 | 14 comments



I posted this here in hopes of starting a discussion about what follow-ups, if any, people might like to see to this article. The topic is a broad one, and even so the article ran to over 5,000 words (apologies).

Are there any related, more specialized topics relating to databases people would like to learn more about? Distributed? Transactions and serialization guarantees? Disks?


More specific discussion of "These sophisticated schemes work well for 1 GB of data but not so well for 1 TB of data" would be nice. Instinctively, multiple indexes would seem to be more useful on large datasets than on small ones. What works better on the bigger datasets?

I also felt uneasy about the piece's transitions back and forth between "latency" and "throughput" without any discussion of "concurrency". For example, "At present, a 7200-RPM disk has a seek time of about four milliseconds. That means it can find and read new locations on disk about 250 times a second. If a server application relies on finding something on disk at every request, it will be capped at 250 requests per second per disk" assumes a queue depth of 1, whereas "Assuming that writing a record of data takes five milliseconds, and you have to write 20 different records to disk, performing these operations in the page cache and then flushing to disk would cost only a single disk access, rather than 20" assumes that the writes can all be issued in parallel. Perhaps a discussion of Little's Law and how it applies to database architecture?
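
To make the concurrency point concrete, here is a back-of-the-envelope sketch of Little's Law (throughput is capped at concurrency divided by latency). The 4 ms seek time comes from the article; the queue depths are just illustrative, and a single spindle only approaches these numbers if the outstanding requests can actually be serviced concurrently (elevator scheduling, multiple disks, etc.):

    # Little's Law: items in flight = throughput * latency,
    # so throughput is capped at concurrency / latency.
    seek_time_s = 0.004   # ~4 ms per random disk access (figure from the article)

    for queue_depth in (1, 4, 32):
        throughput = queue_depth / seek_time_s
        print(f"queue depth {queue_depth:>2}: at most {throughput:,.0f} requests/s")

At a queue depth of 1 this reproduces the article's ~250 requests per second; the point is that the cap scales with how many requests are in flight, not just with per-request latency.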


This article has a fundamental misunderstanding of relational databases.

The building block of relational databases... is the mathematical concept of a "relation".

http://en.wikipedia.org/wiki/Finitary_relation

Relations are most efficiently stored in tables. The use of tables and all that has absolutely nothing to do with disk space, but everything to do with deduplication.

You see, when relations are pure and understood, it is literally impossible for certain data-corruption issues to come up. Each step of normalization wipes out a class of possible data-corruption issues, from update anomalies to insert anomalies.

" A man with two watches never knows what time it is. "

Similarly, if you ever store the same data twice, you'll never really know which one is real. Especially if that bit of data were stored differently twice.

This is the fundamental rule that Relational Databases / Normalization follow. Normalization, furthermore, can be applied to any database. Relational Databases just so happen to be the most common, so nearly algorithmic processes for normalizing them have been invented.
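
A tiny sketch of the update-anomaly point, using an in-memory SQLite database (the table and column names are made up for illustration):

    import sqlite3

    db = sqlite3.connect(":memory:")

    # Denormalized: the customer's email is repeated on every order row,
    # so an update that touches one row but not the other corrupts the data.
    db.execute("CREATE TABLE orders_flat (order_id INTEGER, customer TEXT, email TEXT)")
    db.execute("INSERT INTO orders_flat VALUES (1, 'alice', 'alice@old.example')")
    db.execute("INSERT INTO orders_flat VALUES (2, 'alice', 'alice@old.example')")
    db.execute("UPDATE orders_flat SET email = 'alice@new.example' WHERE order_id = 1")
    print(db.execute("SELECT DISTINCT email FROM orders_flat").fetchall())
    # Two different emails for the same customer: the "two watches" problem.

    # Normalized: the email lives in exactly one row, so it cannot disagree
    # with itself no matter which orders are touched.
    db.execute("CREATE TABLE customers (customer TEXT PRIMARY KEY, email TEXT)")
    db.execute("CREATE TABLE orders (order_id INTEGER, customer TEXT REFERENCES customers)")
    db.execute("INSERT INTO customers VALUES ('alice', 'alice@old.example')")
    db.execute("INSERT INTO orders VALUES (1, 'alice'), (2, 'alice')")
    db.execute("UPDATE customers SET email = 'alice@new.example' WHERE customer = 'alice'")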

I would highly suggest to the author that he read Chris Date's relational database books before writing any more articles on databases in general. Chris Date worked alongside E. F. Codd, the inventor of the relational model, and wrote the standard texts on it, so I'll trust his opinion of relational databases over almost anyone else's.


I agree that relational models have everything to do with deduplication.

It is one area where purity in mathematics translates well to solving practical problems. It is often the case that the theoretical models of computation or storage translate very poorly to everyday use; when they can be represented at all, it is only at a cost, and so they are not applied.

Maybe I'm a pessimist, but I argue that, while normalization does do nice things for data consistency, people often use/used it because of disk storage savings. In my experience, they rarely, if ever, achieve a high enough level of normalization to get much benefit.


I mainly study publicly known databases for this sort of stuff. I'm not a professional web developer, but I do what I can to keep up with technology.

From this perspective, we can study Wikipedia's schema.

As far as I can tell, the schema is designed for data consistency. http://upload.wikimedia.org/wikipedia/commons/3/36/Mediawiki...

As far as I can tell, Wikipedia is mostly operating in 3rd normal form (or higher), the general standard that most places aim for. (6th normal form is the theoretical maximum that almost no one bothers to reach.)

Besides, just because Relational Databases offer an easy path to normalization doesn't mean you have to use it. If you want to minimize the number of joins in your database for some reason, 1st normal form (i.e., essentially no normalization) is always available.

Only recommended if joins really are becoming a problem though.

----------------

Besides, Relational Databases are hardly the most efficient at disk space. Every index, foreign key constraint, or check condition often adds a lot of overhead on top of each datum.

I'd argue that Key-Value stores are the most efficient at disk space usage. Certainly more-efficient than your typical relational database IMO.
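
As a toy illustration of why the key-value model can be so lean on disk: a record can be nothing more than two length prefixes, the key, and the value, with no per-row headers, index entries, or constraint metadata (the file name and record format here are invented for the example):

    import struct

    def append_record(f, key: bytes, value: bytes) -> None:
        # Per-record overhead is just two 4-byte length prefixes.
        f.write(struct.pack(">II", len(key), len(value)))
        f.write(key)
        f.write(value)

    def scan(path):
        # Replay the log, yielding (key, value) pairs in write order.
        with open(path, "rb") as f:
            while header := f.read(8):
                klen, vlen = struct.unpack(">II", header)
                yield f.read(klen), f.read(vlen)

    with open("kv.log", "wb") as f:
        append_record(f, b"user:42", b'{"name": "alice"}')

    print(dict(scan("kv.log")))

Real key-value stores add indexing, checksums, and compaction on top of this, but the baseline per-record cost really is tiny.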


In addition to that, I think the author is also conflating "model" vs "implementation". Some of the legacy relational databases are somewhat slow, but 1 TB of data is hardly much even for something like PostgreSQL. The more sophisticated column-oriented databases and data warehouses would still be considered relational databases, and are much much faster (as I understand, for many applications, the primary downside of those is cost, rather than anything inherent in the relational model or the implementation).

New, redesigned OLTP systems like VoltDB are also much faster than the original disk-based RDBMSs for transactional workloads.


Heck, PostgreSQL 9.3 and above consistently beats MongoDB in speed, for instance. So Relational Databases can be (and often are) very fast databases.

http://vibhorkumar.wordpress.com/2014/05/15/write-operation-...

I'm not an expert on the subject, but wasn't MapReduce the primary driver of NoSQL? Hadoop/HBase, etc. Key-Value stores definitely played a part (as has been mentioned), but even the most basic Key-Value store is more complicated than a glorified hash table.


I love me some PG. :)

There are a number of databases which use unique storage models to achieve great numbers in the circumstances for which they're optimized (duh). DeepDB is an interesting one that is getting great throughput while running complex queries on incoming data. I believe they're using LSM trees under the hood.
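
For readers who haven't met them, here is a very rough sketch of the LSM idea (a toy, in-memory stand-in: no disk, no crash safety, no compaction): writes go to a small "memtable" and are periodically flushed as sorted runs, and reads check the memtable first and then the runs from newest to oldest.

    import bisect

    class TinyLSM:
        """Toy log-structured merge tree: no persistence or compaction."""

        def __init__(self, memtable_limit=4):
            self.memtable = {}     # recent writes, held in memory
            self.runs = []         # older data, as sorted lists of (key, value)
            self.memtable_limit = memtable_limit

        def put(self, key, value):
            self.memtable[key] = value
            if len(self.memtable) >= self.memtable_limit:
                # Flush as one sorted run: on a real system this is a big,
                # cheap sequential write instead of many random ones.
                self.runs.append(sorted(self.memtable.items()))
                self.memtable = {}

        def get(self, key):
            if key in self.memtable:
                return self.memtable[key]
            for run in reversed(self.runs):          # newest run wins
                i = bisect.bisect_left(run, (key,))
                if i < len(run) and run[i][0] == key:
                    return run[i][1]
            return None

    db = TinyLSM()
    for i in range(10):
        db.put(f"k{i}", i)
    print(db.get("k3"))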

I was definitely trying to avoid doing a survey of all of the products out there and what they're using (that would be a great article, though). I mostly wanted to make people cognizant of the models they might encounter and why/how they're used. Even that was a bit ambitious to attempt in a single article.


Would you consider graph to be a fourth type of database, after relational, key value, and document/hierarchical?


Not really (see below). The article confuses a number of things and assumes only the simplest or most naive implementations.

A graph database is essentially a relational implementation that supports recursive joins. While not part of the strictest, most minimalist relational implementation in theory, most mature SQL databases support this to some extent. Some databases that support recursive joins will limit you to directed acyclic graphs.
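
A minimal sketch of "graph traversal as a recursive join", using SQLite's WITH RECURSIVE from Python (the edge data is invented for the example):

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE edges (src TEXT, dst TEXT)")
    db.executemany("INSERT INTO edges VALUES (?, ?)",
                   [("a", "b"), ("b", "c"), ("c", "d"), ("b", "e")])

    # Everything reachable from 'a': an ordinary relational table plus a
    # recursive self-join is enough to express the traversal.
    reachable = db.execute("""
        WITH RECURSIVE reach(node) AS (
            SELECT 'a'
            UNION
            SELECT edges.dst FROM edges JOIN reach ON edges.src = reach.node
        )
        SELECT node FROM reach ORDER BY node
    """).fetchall()
    print(reachable)   # [('a',), ('b',), ('c',), ('d',), ('e',)]

Because this uses UNION rather than UNION ALL, already-seen nodes are not re-queued, so the same query also terminates on cyclic graphs.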

A document database is a relational database implemented around materialized joins. It saves you from doing common joins dynamically. This is more restricted than a graph because it is a directed acyclic graph only. Again, most mature SQL databases support this to some extent or another.

A relational database that supports both materialized and recursive joins has the same expressiveness as a document and/or graph database. However, it trades some query performance in those individual cases for flexibility and performance in other cases that both document and graph databases are relatively poor at. Document and graph databases have existed since databases were first invented. Most of the implementation optimizations for documents and graphs are optionally available in good SQL databases.

Key-value databases are a different beast. Most people only consider primitive databases based on key equality relationships or ordered ranges, which are quite limited. You can also build them based on multidimensional spatial relationships (i.e. keys are hyper-rectangles that can be relatively addressed), which are equivalent in expressiveness to relational implementations. If you added recursive joins to implementations of the spatial variants, you'd again have something that could express all of the models mentioned.

In summary, these models are APIs, they are not fundamentally different things if the underlying database engine is adequate for the purpose. It is why, for example, PostgreSQL can be used quite effectively these days as a document database.
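
As a sketch of that last point, this is roughly what "PostgreSQL as a document database" looks like, assuming a local PostgreSQL 9.4+ server with a jsonb column and the psycopg2 driver (the connection string, table, and document are placeholders):

    import psycopg2

    conn = psycopg2.connect("dbname=test")   # placeholder connection details
    cur = conn.cursor()

    # One jsonb column holds schemaless documents...
    cur.execute("CREATE TABLE docs (id serial PRIMARY KEY, body jsonb)")
    cur.execute("INSERT INTO docs (body) VALUES (%s)",
                ['{"user": "alice", "tags": ["db", "hn"]}'])

    # ...which can still be indexed and queried with ordinary SQL.
    cur.execute("CREATE INDEX docs_body_idx ON docs USING gin (body)")
    cur.execute("SELECT body ->> 'user' FROM docs WHERE body @> %s",
                ['{"tags": ["db"]}'])
    print(cur.fetchall())   # [('alice',)]
    conn.commit()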


What about distributed databases? I agree with your point that Postgres is actually the best option for most use cases. But as a practicing programmer, when should I consider choosing something like MongoDB?


Ooh. Good question. In implementation, they end up looking a lot like key/value stores, since most of the ones I know are implemented as edge-vertex associations.

However, there is certainly a lot of specialized functionality on top of that. You can then turn around and apply both relational models and hierarchical models with them.

There are definitely some use cases for which I would heartily recommend a graph database over the others, so, yeah, it is another category. It is also something that should have been mentioned in this article. :)


A simple key-value model is a very low-scalability, low-performance method of implementing a graph database. The key to graph database performance is maintaining consistent locality over traversals, which is no small trick from a computer science standpoint. I know a lot of graph databases work using naive key-value lookups but it is not recommended.

Most modestly scalable graph databases are implemented as traditional relational-style engines with highly optimized equi-join recursion. The most scalable implementations use topological methods that are beyond the scope of this post but definitely not simple key-value designs.


The "how your DB stores the data" and "how you query it" is kind of interesting. Time series DBs are similar to graph DBs — implementation-wise they're not really distinct, but the specialized functionality makes them extra useful.



