Codd's relational vision – has NoSQL come full circle? (opensourceconnections.com)
107 points by lukaseder on Dec 14, 2013 | hide | past | favorite | 33 comments

Excellent article. As promised, the author brings a historical perspective that's typically lacking in this debate (at least for someone like me who has casually followed it on HN).

Not to stray too far off topic, but this article raised a question that I've had since starting software development in earnest... Why do we have to choose between heavily "normalized" relational databases that structure all data & thus allow more arbitrary queries, OR un-structured databases that are fast and flexible but often slow to query in complex ways? Why can't there be a hybrid "smart" database that dynamically generates indexes (to use the term loosely) based on how it's queried, in order to speed up similar queries in the future? With some kind of additional weight given to the most frequent queries. Granted, it would need some time to "warm up" (not unlike a tracing JIT compiler?), and the implementation might be fairly complex, but other than that, I can't think of any downsides...?

PostgreSQL is the closest to what you're looking for.

It's really fast; if it doesn't meet your requirements you're probably already aware that you need a specialized solution.

Besides explicit indexing, it also analyzes your data and maintains statistics that inform the query planner. As your tables grow and your schema evolves, it will change its query plans accordingly. It can be quite smart. (A few examples: it maintains histograms of common values, null percentages, the correlation between the physical and logical ordering of rows, and much more.)
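You can watch a planner react to an index even in a tiny embedded database. A minimal sketch using Python's sqlite3 (as a stand-in, since it doesn't assume a running PostgreSQL server; Postgres's planner makes the same kind of decision from its statistics):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [(i, "user%d" % i) for i in range(1000)])

# Without an index, the planner has no choice but a full table scan.
before = conn.execute(
    "EXPLAIN QUERY PLAN SELECT name FROM users WHERE id = 42").fetchone()[3]
print(before)   # e.g. "SCAN users"

# After creating an index and refreshing the planner's statistics
# with ANALYZE, the same query switches to an index search.
conn.execute("CREATE INDEX users_id ON users (id)")
conn.execute("ANALYZE")
after = conn.execute(
    "EXPLAIN QUERY PLAN SELECT name FROM users WHERE id = 42").fetchone()[3]
print(after)    # e.g. "SEARCH users USING INDEX users_id (id=?)"
```

The exact plan strings vary by SQLite version, but the scan-to-search flip is the point: the engine picked a different plan because the available structures and statistics changed, with no change to the query.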

Of course, you also have a lot of flexibility to denormalize and get the best of both worlds.

AFAIK RavenDB, which describes itself as a "2nd generation document database", does exactly what you describe. Whenever you run a query that's missing an index, it will automatically create a temporary "dynamic index" to service it. If it finds that this index is used a lot, it will automatically promote it to a permanent index. I haven't tried it yet, however, so I can't comment on how well it works.
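The promotion idea itself is simple enough to sketch. This toy Python wrapper (hypothetical, not RavenDB's actual mechanism; the class name and threshold are invented for illustration) counts how often each column appears as a filter and "promotes" it to a permanent index after a threshold:

```python
import sqlite3
from collections import Counter

class AutoIndexer:
    """Toy sketch of dynamic-index promotion: once a column has been
    used as a filter PROMOTE_AFTER times, create a permanent index."""
    PROMOTE_AFTER = 3

    def __init__(self, conn, table):
        self.conn, self.table = conn, table
        self.hits = Counter()
        self.indexed = set()

    def query(self, column, value):
        self.hits[column] += 1
        if (self.hits[column] >= self.PROMOTE_AFTER
                and column not in self.indexed):
            # Promote: the query pattern has earned a real index.
            self.conn.execute(
                "CREATE INDEX idx_%s_%s ON %s (%s)"
                % (self.table, column, self.table, column))
            self.indexed.add(column)
        return self.conn.execute(
            "SELECT * FROM %s WHERE %s = ?" % (self.table, column),
            (value,)).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (kind TEXT, body TEXT)")
conn.execute("INSERT INTO docs VALUES ('note', 'hello')")
ai = AutoIndexer(conn, "docs")
for _ in range(3):
    ai.query("kind", "note")   # the third call triggers promotion
print(ai.indexed)              # {'kind'}
```

A real implementation would also have to decide when to drop indexes that stopped paying for their write amplification, which is the harder half of the problem.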

But yes, generally speaking, I'm also surprised to see that in 2013 we're creating our indexes manually. While there are of course applications where you really want to ensure that all the right indexes are created ahead of time, for most applications having indexes created automatically based on query patterns would seem like a much better solution.

I agree that automatic creation of indexes looks like a reasonable thing to do for a database like Oracle. Oracle could gather statistics and auto-create indexes where it seems fit, i.e. where there is a lot of querying and little writing going on.

In a way, this is what caches can do. Besides that, Oracle is already very good at giving you statistics and tuning hints to help you assess where you could add an index:


DB2 and SQL Server probably have similar tools. As far as I know, none of them goes as far as automatically creating or dropping any indexes.

Google's App Engine datastore will suggest indexes, but it's rather bad at it IMO.

You might enjoy PostgreSQL, then. It supports a variety of "weakly" structured data types, such as JSON and key/value maps (called hstore in PostgreSQL).

Many databases also support XML data types. And of course, you can always store unstructured data in CLOBs and BLOBs.

Yes, but usually you don't want to use these "weakly structured" BLOBs if you have an alternative. Relational DBs have some weak points, but structured data storage is not one of them. I can't imagine a database without a fixed schema; one always exists, even if you don't have to declare it upfront.

Actually, you read my mind! I'm currently working on a Rails project that uses PostgreSQL with hstore for semi-structured data such as user information, configuration, etc. Haven't gotten too far into it yet, but it looks promising. :)

What I was trying to imagine in my post, though, was some hypothetical DB where all this would be abstracted away--i.e., it would expose a NoSQL-like interface for all data, while working behind the scenes to provide SQL-like speed for frequent queries. Hence the analogy that it would be the "tracing JIT" of databases.

I've been thinking along similar lines, I think. One problem I see with data in RDBMSs is that data gets transformed in the process of importing it to match the schema, and this transformation loses information, i.e. it's hard to recover the original meaning and context, because it's been contorted to fit the schema. Schemas tend to need to evolve, and so you'd like not to be stuck with the assumptions of schema 1.0 forever. Data isn't that big these days in relation to the speed of computers, so you could keep the data in its original form and leave open the ability to "reimport" to fit the current schema as needed. If you're building a website, for instance, then your web app is going to be doing a limited number of queries over and over. If you start from the queries, you can infer things about how the data should be organized, e.g. which indexes are needed. In fact, if you took a query and pre-generated all the possible results of that query and saved them, that's basically an index, the one that fits that query. The inputs to this hypothetical database would be the original unprocessed data, the set of queries you need, and mappings to transform the unprocessed data into queryable form, i.e. the equivalent of a schema. The inputs apart from the data proper would also be considered as data.
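The "pre-generated query results are basically an index" observation can be made concrete. A sketch (hypothetical data, using a plain dict as the structure): materializing the result of a parameterized query for every possible parameter value is exactly a hash index on that attribute.

```python
from collections import defaultdict

# The "original unprocessed data" — here, already parsed into records.
rows = [
    {"id": 1, "country": "DE", "name": "alice"},
    {"id": 2, "country": "US", "name": "bob"},
    {"id": 3, "country": "DE", "name": "carol"},
]

# Query template the app runs over and over: "names where country = ?".
# Pre-generating the answer for every parameter value yields a mapping
# from parameter to result set — i.e. a hash index on 'country'.
index = defaultdict(list)
for row in rows:
    index[row["country"]].append(row["name"])

print(index["DE"])  # ['alice', 'carol']
```

Starting from the query set and deriving structures like this, rather than starting from a schema, is essentially what the parent is proposing; the open question is keeping such materializations consistent under writes.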

The major downside is that it is an inherently non-scalable behavior for a database engine. A few databases do what you suggest but they are limited in the amount of data they can organize in this way without performance becoming problematic. Since data is becoming quite large rather quickly, it is not a sensible architecture choice these days.

There are several reasons why you would not want to design a large-scale database this way. It implies an enormous amount of extra data motion which is arguably the major killer of performance in parallel databases; it is the same reason Hadoop is so slow and inefficient for analytic queries. Additionally, secondary indexing structures and similar types of de-normalization offer poor performance in distributed environments due to the implied consistency and coordination problems; it will severely degrade insert/update/delete operation performance as the data structures grow. Most database applications also value a high-degree of predictability of performance under load, which this definitely will not allow. These tradeoffs are unacceptable for many (most?) applications.

In short, while you could design a database that does what you suggest, it would only be useful at tiny scales or for read-only databases where consistent performance characteristics are not a requirement.

>The major downside is that it is an inherently non-scalable behavior for a database engine

Well, that is true for SQL as a whole, but certainly not true for a rather large subset of SQL. And how much is "quite large"? We have a client who claims that 160GB is big (he is running a cluster of 8 machines now, poor bastard).

Actually, it is not true for all of SQL. It is just true for implementations that require secondary indexing or extensive denormalization. This is essentially the way you would implement SQL if you were copying 1990s database kernel design, but it is not the only way. Unfortunately, almost everyone still implements database engines this way because that is the way they have traditionally been designed; designing a genuinely new database kernel from scratch is not for the faint of heart.

I normally design around databases in the tens of terabytes to tens of petabytes range, it is not an unusual scale. A database engine designed as the parent outlined would start to exhibit unacceptable performance characteristics at single-digit terabyte scales. A terabyte is tiny; I can buy servers with that much memory.

Complex multi-attribute selection and joins are efficiently parallelizable but most databases do not implement algorithms and data structures that allow this to be realized.
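The parallelizability claim rests on hash partitioning: if both join inputs are partitioned on the join key, each partition pair can be joined with no cross-partition data motion. A minimal sketch (hypothetical data; partitions are processed sequentially here, where a real engine would assign each to a different worker or node):

```python
from collections import defaultdict

def partition(rows, key, n):
    """Hash-partition rows on `key` into n buckets, grouped by key value."""
    parts = [defaultdict(list) for _ in range(n)]
    for row in rows:
        parts[hash(row[key]) % n][row[key]].append(row)
    return parts

def partitioned_join(left, right, key, n=4):
    lp, rp = partition(left, key, n), partition(right, key, n)
    out = []
    # Each (lp[i], rp[i]) pair is independent of all the others: matching
    # keys always hash to the same bucket, so no data has to move between
    # partitions — this is what makes the join efficiently parallelizable.
    for i in range(n):
        for k, lrows in lp[i].items():
            for lrow in lrows:
                for rrow in rp[i].get(k, []):
                    out.append({**lrow, **rrow})
    return out

users = [{"id": 1, "name": "u1"}, {"id": 2, "name": "u2"}]
orders = [{"id": 1, "total": 10}, {"id": 1, "total": 5}]
result = partitioned_join(users, orders, "id")
print(result)  # two matched rows, both for id 1
```

The degraded case the grandparent describes shows up when secondary indexes or denormalized copies force updates to touch many partitions at once, reintroducing exactly the coordination this layout avoids.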

Hey thanks for liking the article (I'm the author)

It seems maybe some of the NewSQL stuff is doing this. You use "SQL" but the data is still organized by some kind of hierarchy (like in F1). I'm trying to get into Google's Dremel paper to learn more about how they do it.

In general, though, it seems that if there's any chance something might be normalized, but behind the scenes it's actually denormalized a bit, then folks try to layer some kind of SQL-likeness on top of the database.

Fundamentally, the two problems (maximum throughput for transactions vs. query flexibility) are different. So while you may build a jack-of-all-trades system that can do it all, if you keep growing you'll eventually need to make tuning decisions that negatively impact either queries or throughput.

The solution in most cases is to maintain transaction and query optimized databases separately.

Well, Oracle has its SQL query tuning tools... but in a production system you often don't want too much magical optimization leading to erratic behavior/performance problems. Oracle also has a "plan stability" feature for this very reason.

It's probably possible to incorporate very smart query engines in purely relational database systems that would obviate many of the complaints about relational databases. But SQL does not really implement the relational model and makes that infeasible. See "The Third Manifesto" and co.

Which reminds me of Fabian Pascal, probably somewhere drinking heavily to forget the current state of the DB industry.

(Context: for more than a decade he tried to educate people on the relational model, the ill-thought back-to-the-past ideas that are NoSQL DBs, and more besides. Unfortunately his old site was scrapped, it seems, and as of 2012 he has turned it into this blog: http://www.dbdebunk.com/ )

> My point is that it’s dangerous to just grab a database off the shelf because its fun and trendy. You might end up creating a lot of technical debt for yourself. Do you know you’re handling an extremely high volume of a single somewhat-well-defined “thing”? Then the hierarchical model might be an amazing bet for you. Do you have many different kinds of records that link together, but don’t necessarily contain parts of the whole? A relational data store might be the ideal solution.

This seems to me the most important point of the article and I fully agree here. What I'd like to add is that a DBMS like CouchDB is IMO fully suitable to the use case you describe. ArangoDB, which was also presented at that conference, appears to be even easier for those kinds of datastores - I haven't been able to play around with it however.

There is no such thing as a perfect DBMS, and that applies to RDBMSs as well. With Couch you give up sharding, ad-hoc querying, and per-document security, in return for conditionless replication, relative foolproofness in terms of indexing, the ability to have fully normalized relational data (compared to other NoSQL stores), and mobile offline capabilities.

One thing about ArangoDB: why don't they implement SQL instead of their weird COBOL-esque AQL language?

I have been asking myself the same question - they tried to explain it but I didn't really understand the point to be honest. From what I get SQL was what they first pursued, but gave up at some point when it became clear that it wouldn't be ideal for their use case of a multi paradigm database (document as well as graph database). In general I agree that when developing a new product, it's good to question everything rather than blindly repeat what everyone has done for 20+ years. I'd love to read a blog entry about this decision though.

I agree that one should question a thing or two. But when you look at an example (from https://www.arangodb.org/manuals/current/Aql.html):

    FOR u IN users
      FOR f IN friends
        FILTER u.active == true && f.active == true && u.id == f.userId
        RETURN u.name
What would keep anyone from writing this as

    SELECT u.name
    FROM users AS u
    JOIN friends AS f ON u.id = f.userId
    WHERE u.active = true AND f.active = true
Writing an entirely new language may inhibit adoption, if anything... Their rationale is given here in the FAQ (https://www.arangodb.org/faq)

> For example, think of nested list structures inside a document, graph traversals etc. There is no way to query such structures in standard SQL, and deviating from standard SQL does not make much sense.

I would have thought that PostgreSQL had solved this pretty well. So has Oracle with its OBJECT types. There must have been a complete SQL show-stopper in their use cases for them to invent yet another language.

I guess the language was inspired by XQuery, even if they don't mention XQuery in the FAQ.

Yes, AQL was inspired by XQuery.

I think SQL is really good for querying relational databases.

Actually, we started by implementing something like SQL's SELECT clause in the very beginning of ArangoDB. The rationale was: "why invent another language? SQL is everywhere, so let's use it!"

We very soon found that SQL is not a good fit for working with semi-structured data. There is no definite schema for a collection in ArangoDB, so it is unknown which attributes (think: columns) a document (think: row) will have. Thus using standard SQL would have introduced a lot of potential ambiguities. Example:

    SELECT a, b, c 
    FROM c1 
    INNER JOIN c2 
    ON (c1.x = c2.y)
When inspecting the above query initially, the database has no idea if attribute "a" will come from c1 or c2. Each document in both collections can have an attribute "a", "b", "c" or none at all. So a query like the above could throw an ambiguity error at runtime only, and not at query compile time. Fully qualifying attributes with "table" names would have worked (e.g. "SELECT c1.a"), but would be a deviation from standard SQL, which doesn't require that. And then people would have asked "they claim to support SQL. But why doesn't my SQL statement work in ArangoDB?". Probably a lot of confusion.

Apart from that, it is common to have multi-valued attributes in document databases (and thus in ArangoDB). Think of an attribute which itself is a list.

SQL really is not designed for this. Putting horizontal lists into a single attribute/column is an anti-pattern in the relational world. Instead, you would normalize most data into separate n:m mapping tables etc., and join them later.

No need to do this with a document database: horizontal lists are supported, and there is less NEED for normalization. It is up to you how to model the data.
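The contrast is easy to show side by side. A sketch with invented data (plain Python structures standing in for a document store and for relational tables):

```python
# Document-database style: the multi-valued attribute lives inside
# the record itself, as AQL's FOR/FILTER can traverse directly.
user_doc = {"name": "alice", "tags": ["admin", "beta"]}

# Relational style: the same data normalized into an n:m mapping table.
users = [(1, "alice")]              # (user_id, name)
user_tags = [(1, "admin"), (1, "beta")]  # (user_id, tag)

# Reconstructing the document shape requires a join over the mapping table:
tags_for_alice = [tag for (uid, tag) in user_tags if uid == 1]
print(tags_for_alice)  # ['admin', 'beta']
```

Neither shape is wrong; the point above is that a document database lets the data modeler choose, instead of mandating the normalized form.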

With all that in mind, we very soon switched to a language we thought would better fit a non-relational database such as ArangoDB. We intentionally decided against using SQL keywords, to avoid confusion.

I hope the decision to create AQL is now a little more comprehensible.

One thing that is never directly mentioned in the history of SQL is the memory, CPU, and storage available at the time. When a mainframe had 192KB of core memory and a 300KB/s disk transfer speed, coming up with a way to access all that data was important. The OP's example of data in a text file would be okay(ish) today because you could load the whole thing into memory and be done with it. Yes, we have more data, more rows, and larger records, but we are solving different problems with different tools and significantly more horsepower at our disposal. This is not anti-SQL, but SQL-as-a-solution needs more historical context. With that context, those lessons may be less relevant.

> This is neat because, in my Artificial Intelligence class in college, this kind of “reasoning” is presented as something advanced. In fact there’s a whole language, Prolog, that’s whole job is to take assertions about the world and give you extra facts. I had to learn all that stuff in AI when mundane-old SQL was right under my nose giving me the tools to learn anything I want about the world my relations represented!


I guess broken worldviews like this are why Datalog never really got off the ground in the DB community.

See also: What Goes Around Comes Around (Stonebraker and Hellerstein)


That one's from the time where XML-databases were all the rage. Unfortunately, its publication year is missing. (The youngest cited article is from 97)

Gotta love Stonebraker!

See also:

SQL and NoSQL are two sides of the same coin, mathematically


That seems more like a plea for LINQ than anything else ;-)

One of LINQ's main visions is to unify querying regardless of the data source - and it does so quite well. At the time the author of that article conceived LINQ, he intended to unify querying RDBMS, XML, and objects. "NoSQL" (whatever that is) is a type of data store that was not part of LINQ's original master plan. Thus, LINQ has to be retrofitted / enhanced to accommodate new requirements. It would be all too nice if things were as easy as a simple duality, specifically given the fact that the author of that article has now also created a company called Applied Duality Inc.

But history will teach us where these things go. I currently don't see a second E.F. Codd to solve the complexity introduced with the new abundance of NoSQL data stores - yet.

(As always, when Erik Meijer writes something: Very interesting read, nonetheless)

Author thinks he is being contrarian by defending SQL... huh?

It's a NoSQL conference, so yes.

Ah, I missed that bit.

This happens when someone is trying to actually use NoSQL instead of just praising it ;)

hey thanks for your comment (I'm the author)

I must be biased by my peer group :). Maybe I'm less concerned about SQL being a default choice than Mongo or something being the default choice? I say this as somebody that helps people migrate to NoSQL databases as a job.
