Codd's relational vision – has NoSQL come full circle? (opensourceconnections.com)
107 points by lukaseder on Dec 14, 2013 | 33 comments



Excellent article. As promised, the author brings a historical perspective that's typically lacking in this debate (at least for someone like me who has casually followed it on HN).

Not to stray too far off topic, but this article raised a question that I've had since starting software development in earnest... Why do we have to choose between heavily "normalized" relational databases that structure all data & thus allow more arbitrary queries, OR un-structured databases that are fast and flexible but often slow to query in complex ways? Why can't there be a hybrid "smart" database that dynamically generates indexes (to use the term loosely) based on how it's queried, in order to speed up similar queries in the future? With some kind of additional weight given to the most frequent queries. Granted, it would need some time to "warm up" (not unlike a tracing JIT compiler?), and the implementation might be fairly complex, but other than that, I can't think of any downsides...?


PostgreSQL is the closest to what you're looking for.

It's really fast; if it doesn't meet your requirements you're probably already aware that you need a specialized solution.

Besides explicit indexing, it also analyzes your data and maintains statistics that inform the query planner. As your tables grow and your schema evolves, it will change its query plans accordingly. It can be quite smart. (A few examples: it maintains histograms of common values, null percentages, the correlation between the physical and logical ordering of rows, and much more.)
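For the curious, the planner's view of a table is exposed directly. A minimal sketch (assuming a hypothetical "orders" table):

    -- Refresh the statistics, then inspect what the planner knows
    -- about each column via the pg_stats view.
    ANALYZE orders;

    SELECT attname,            -- column name
           null_frac,          -- fraction of NULLs
           n_distinct,         -- estimated number of distinct values
           most_common_vals,   -- the "histogram of common values"
           correlation         -- physical vs. logical row ordering
    FROM   pg_stats
    WHERE  tablename = 'orders';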

Of course, you also have a lot of flexibility to denormalize and get the best of both worlds.


AFAIK RavenDB, which describes itself as a "2nd generation document database", does exactly what you describe. Whenever you run a query that's missing an index, it will automatically create a temporary "dynamic index" to service it. If it finds that this index is used a lot, it will automatically promote it to a permanent index. I haven't tried it yet however so can't comment on how well it works.

But yes, generally speaking, I'm also surprised to see that in 2013 we're creating our indexes manually. While there are of course applications where you really want to ensure that all the right indexes are created ahead of time, for most applications having indexes created automatically based on query patterns would seem like a much better solution.


I agree that automatic creation of indexes looks like a reasonable thing to do for a database like Oracle. Oracle could gather statistics and auto-create indexes where it sees fit, i.e. where there is a lot of querying and little writing going on.

In a way, this is what caches can do. Besides that, Oracle is already very good at giving you statistics and tuning hints to help you assess where you could add an index:

http://stackoverflow.com/a/2937047/521799
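To give this a concrete flavour (a rough sketch with a hypothetical table, not the advisor tooling from the link above): even the plain execution plan is often enough to spot a missing index.

    -- Ask Oracle for the execution plan of a query:
    EXPLAIN PLAN FOR
    SELECT * FROM orders WHERE customer_id = 42;

    SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);

    -- A TABLE ACCESS FULL step on a selective predicate suggests
    -- that an index might pay off:
    CREATE INDEX orders_customer_idx ON orders (customer_id);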

DB2 and SQL Server probably have similar tools. As far as I know, though, none of them goes as far as automatically creating or dropping indexes.


Google's App Engine datastore will suggest indexes, but it's rather bad at it, IMO.


You might enjoy PostgreSQL, then. It supports a variety of "weakly" structured data types, such as JSON and key/value hashmaps (the hstore type in PostgreSQL).

Many databases also support XML data types. And of course, you can always store unstructured data in CLOBs and BLOBs.
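A small sketch of what that looks like in practice (recent PostgreSQL; table and columns are hypothetical):

    -- Mix ordinary columns with weakly structured ones.
    CREATE EXTENSION hstore;   -- once per database

    CREATE TABLE events (
        id      serial PRIMARY KEY,
        kind    text NOT NULL,   -- ordinary relational column
        payload json,            -- free-form JSON document
        attrs   hstore           -- key/value pairs
    );

    INSERT INTO events (kind, payload, attrs)
    VALUES ('signup',
            '{"plan": "free", "referrer": "hn"}',
            'browser => firefox, country => de');

    -- Query into the weakly structured parts:
    SELECT payload->>'plan' AS plan,
           attrs->'country'  AS country
    FROM   events
    WHERE  kind = 'signup';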


Yes, but usually you don't want to use these 'weakly structured' BLOBs if you have an alternative. Relational DBs have some weak points, but structured data storage is not one of them. I can't imagine a database without a fixed schema; one always exists even if you don't have to declare it upfront.


Actually, you read my mind! I'm currently working on a Rails project that uses PostgreSQL with hstore for semi-structured data such as user information, configuration, etc. Haven't gotten too far into it yet, but it looks promising. :)

What I was trying to imagine in my post, though, was some hypothetical DB where all this would be abstracted away--i.e., it would expose a NoSQL-like interface for all data, while working behind the scenes to provide SQL-like speed for frequent queries. Hence the analogy that it would be the "tracing JIT" of databases.


I've been thinking along similar lines, I think. One problem I see with data in RDBMSs is that data gets transformed in the process of importing it to match the schema, and this transformation loses information, i.e. it's hard to recover the original meaning and context, because it's been contorted to fit the schema. Schemas tend to need to evolve, and so you'd like not to be stuck with the assumptions of schema 1.0 forever. Data isn't that big these days in relation to the speed of computers, so you could keep the data in its original form and leave open the ability to "reimport" to fit the current schema as needed.

If you're building a website, for instance, then your web app is going to be doing a limited number of queries over and over. If you start from the queries, you can infer things about how the data should be organized, e.g. which indexes are needed. In fact, if you took a query and pre-generated all the possible results of that query and saved them, that's basically an index, the one that fits that query.

The inputs to this hypothetical database would be the original unprocessed data, the set of queries you need, and mappings to transform the unprocessed data into queryable form, i.e. the equivalent of a schema. The inputs apart from the data proper would also be considered as data.
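(Today's closest analogue to "pre-generating all the possible results of a query and saving them" is arguably a materialized view; a rough sketch with hypothetical tables, in PostgreSQL syntax:)

    -- Save the results of a query as a relation of its own:
    CREATE MATERIALIZED VIEW active_user_names AS
    SELECT u.id, u.name
    FROM   users u
    WHERE  u.active;

    -- "Reimport" whenever the underlying data (or the schema) changes:
    REFRESH MATERIALIZED VIEW active_user_names;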


The major downside is that it is an inherently non-scalable behavior for a database engine. A few databases do what you suggest but they are limited in the amount of data they can organize in this way without performance becoming problematic. Since data is becoming quite large rather quickly, it is not a sensible architecture choice these days.

There are several reasons why you would not want to design a large-scale database this way. It implies an enormous amount of extra data motion, which is arguably the major killer of performance in parallel databases; it is the same reason Hadoop is so slow and inefficient for analytic queries. Additionally, secondary indexing structures and similar types of de-normalization offer poor performance in distributed environments due to the implied consistency and coordination problems; they will severely degrade insert/update/delete performance as the data structures grow. Most database applications also value a high degree of predictability of performance under load, which this definitely will not allow. These tradeoffs are unacceptable for many (most?) applications.

In short, while you could design a database that does what you suggest, it would only be useful at tiny scales or for read-only databases where consistent performance characteristics are not a requirement.


>The major downside is that it is an inherently non-scalable behavior for a database engine

Well, that is true for SQL as a whole, but certainly not true for some rather large subset of SQL. And how much is "quite large"? We have a client who claims that 160 GB is big (he is running a cluster of 8 machines now, poor bastard).


Actually, it is not true for SQL as a whole. It is just true for implementations that require secondary indexing or extensive denormalization. This is essentially how you would implement SQL if you were copying 1990s database kernel design, but it is not the only way. Unfortunately, almost everyone still implements database engines this way because that is how they have traditionally been designed; designing a genuinely new database kernel from scratch is not for the faint of heart.

I normally design around databases in the tens-of-terabytes to tens-of-petabytes range; it is not an unusual scale. A database engine designed as the parent outlined would start to exhibit unacceptable performance characteristics at single-digit-terabyte scales. A terabyte is tiny; I can buy servers with that much memory.

Complex multi-attribute selection and joins are efficiently parallelizable but most databases do not implement algorithms and data structures that allow this to be realized.


Hey thanks for liking the article (I'm the author)

It seems maybe some of the NewSQL stuff is doing this. You use "SQL" but the data is still organized by some kind of hierarchy (like in Google's F1). I'm trying to get into Google's Dremel paper to learn more about how they do it.

In general, though, it seems that if there's any chance something might be normalized, even if behind the scenes it's actually denormalized a bit, then folks try to layer some kind of SQL-likeness onto the database.


Fundamentally, the two problems (max throughput for transactions vs. query flexibility) are different. So while you may build a jack-of-all-trades system that can do it all, if you keep growing you'll eventually need to make tuning decisions that negatively impact either queries or throughput.

The solution in most cases is to maintain transaction and query optimized databases separately.


Well, Oracle has its SQL query tuning tools... but in a production system you often don't want too much magical optimization leading to erratic behavior/performance problems. Oracle also has a "plan stability" feature for this very reason.


It's probably possible to incorporate very smart query engines into purely relational database systems that would obviate many of the complaints about relational databases. But SQL does not really implement the relational model, which makes that infeasible. See "The Third Manifesto" and co.


Which reminds me of Fabian Pascal, probably somewhere drinking heavily to forget the current state of the DB industry.

(Context: for more than a decade he tried to educate people on the relational model, the ill-conceived back-to-the-past ideas that are NoSQL DBs, and more besides. Unfortunately his old site was scrapped, it seems, and as of 2012 he has turned it into this blog: http://www.dbdebunk.com/ )


> My point is that it’s dangerous to just grab a database off the shelf because its fun and trendy. You might end up creating a lot of technical debt for yourself. Do you know you’re handling an extremely high volume of a single somewhat-well-defined “thing”? Then the hierarchical model might be an amazing bet for you. Do you have many different kinds of records that link together, but don’t necessarily contain parts of the whole? A relational data store might be the ideal solution.

This seems to me the most important point of the article, and I fully agree here. What I'd like to add is that a DBMS like CouchDB is IMO fully suitable for the use case you describe. ArangoDB, which was also presented at that conference, appears to be even easier for those kinds of data stores - I haven't been able to play around with it yet, however.

There is no such thing as a perfect DBMS - and that applies to RDBMSs as well. With Couch you give up sharding, ad-hoc querying and per-document security, in return for conditionless replication, relative foolproofness in terms of indexing, the ability to have fully normalized relational data (compared to other NoSQL stores) and mobile offline capabilities.


One thing about ArangoDB: why don't they implement SQL instead of their weird COBOL-esque AQL language?


I have been asking myself the same question - they tried to explain it, but I didn't really understand the point, to be honest. From what I gather, SQL was what they first pursued, but they gave up at some point when it became clear that it wouldn't be ideal for their use case of a multi-paradigm database (document as well as graph database). In general I agree that when developing a new product, it's good to question everything rather than blindly repeat what everyone has done for 20+ years. I'd love to read a blog entry about this decision, though.


I agree that one should question 1-2 things. But when you look at an example (from https://www.arangodb.org/manuals/current/Aql.html):

    FOR u IN users
      FOR f IN friends
        FILTER u.active == true && f.active == true && u.id == f.userId
        RETURN u.name
What would keep anyone from writing this as

    SELECT u.name
    FROM users AS u
    JOIN friends AS f ON u.id = f.userId
    WHERE u.active = true AND f.active = true
Writing an entirely new language may inhibit adoption, if anything... Their rationale is given here in the FAQ (https://www.arangodb.org/faq)

> For example, think of nested list structures inside a document, graph traversals etc. There is no way to query such structures in standard SQL, and deviating from standard SQL does not make much sense.

I would have thought that PostgreSQL had solved this pretty well. So has Oracle with its OBJECT types. There must have been a complete SQL show-stopper in their use cases for them to invent yet another language.

I guess the language was inspired by XQuery, even if they don't mention XQuery in the FAQ.


Yes, AQL was inspired by XQuery.

I think SQL is really good for querying relational databases.

Actually, we started by implementing something like SQL's SELECT clause in the very beginning of ArangoDB. The rationale was: "why invent another language? SQL is everywhere, so let's use it!".

We very soon found that SQL is not a good fit for working with semi-structured data. There is no definite schema for a collection in ArangoDB, so it is unknown which attributes (think: columns) a document (think: row) will have. Thus using standard SQL would have introduced a lot of potential ambiguities. Example:

    SELECT a, b, c 
    FROM c1 
    INNER JOIN c2 
    ON (c1.x = c2.y)
When inspecting the above query initially, the database has no idea if attribute "a" will come from c1 or c2. Each document in both collections can have an attribute "a", "b", "c" or none at all. So a query like the above could throw an ambiguity error at runtime only, and not at query compile time. Fully qualifying attributes with "table" names would have worked (e.g. "SELECT c1.a"), but would be a deviation from standard SQL, which doesn't require that. And then people would have asked "they claim to support SQL. But why doesn't my SQL statement work in ArangoDB?". Probably a lot of confusion.

Apart from that, it is common to have multi-valued attributes in document databases (and thus in ArangoDB). Think of an attribute which itself is a list.

SQL really is not designed for this. Putting horizontal lists into a single attribute/column is an anti-pattern in the relational world. Instead, you would normalize most data into separate n:m mapping tables etc., and join them later.

No need to do this with a document database: horizontal lists are supported, and there is less NEED for normalization. It is up to you how to model the data.
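To illustrate the difference (a hypothetical example, not taken from the ArangoDB docs): a document can simply carry the list, e.g. { "name": "alice", "hobbies": ["chess", "running"] }, whereas the normalized relational equivalent needs an extra mapping table and a join:

    -- Relational modelling of a multi-valued attribute:
    CREATE TABLE users (
        id   integer PRIMARY KEY,
        name text
    );

    CREATE TABLE user_hobbies (   -- the mapping table
        user_id integer REFERENCES users (id),
        hobby   text
    );

    -- Reassembling the list requires a join:
    SELECT u.name, h.hobby
    FROM   users u
    JOIN   user_hobbies h ON h.user_id = u.id;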

With all that in mind, we very soon switched to a language we thought would better fit a non-relational database such as ArangoDB. We intentionally decided against using SQL keywords, to avoid confusion.

I hope the decision to start AQL is now a little more comprehensible.


One thing that is never directly mentioned in the history of SQL is the memory, CPU and storage available at the time. When a mainframe had 192 KB of core memory and a 300 KB/s disk transfer speed, coming up with a way to access all that data was important. The OP's example of data in a text file would be okay(ish) today, because you could load the whole thing into memory and be done with it. Yes, we have more data, more rows, and larger records, but we are solving different problems with different tools and significantly more horsepower at our disposal. This is not anti-SQL, but SQL-as-a-solution in its historical setting needs more context. With that context, those lessons may be less relevant.


> This is neat because, in my Artificial Intelligence class in college, this kind of “reasoning” is presented as something advanced. In fact there’s a whole language, Prolog, that’s whole job is to take assertions about the world and give you extra facts. I had to learn all that stuff in AI when mundane-old SQL was right under my nose giving me the tools to learn anything I want about the world my relations represented!

[Sigh]

I guess broken worldviews like this are why Datalog never really got off the ground in the DB community.


See also: What Goes Around Comes Around (Stonebraker and Hellerstein)

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.113....

That one's from the time when XML databases were all the rage. Unfortunately, its publication year is missing. (The most recent cited article is from '97.)


Gotta love Stonebraker!


See also:

SQL and NoSQL are two sides of the same coin, mathematically

https://news.ycombinator.com/item?id=3313234


That seems more like a plea for LINQ than anything else ;-)

One of LINQ's main visions is to unify querying regardless of the data source - and it does so quite well. At the time the author of that article conceived LINQ, he intended to unify querying of RDBMSs, XML, and objects. "NoSQL" (whatever that is) is a type of data store that was not part of LINQ's original master plan. Thus, LINQ has to be retrofitted / enhanced to accommodate new requirements. It would be all too nice if things were as easy as a simple duality, especially given the fact that the author of that article has now also created a company called Applied Duality Inc.

But history will teach us where these things go. I currently don't see a second E.F. Codd to solve the complexity introduced with the new abundance of NoSQL data stores - yet.

(As always, when Erik Meijer writes something: Very interesting read, nonetheless)


Author thinks he is being contrarian by defending SQL... huh?


It's a NoSQL conference, so yes.


Ah, I missed that bit.


This happens when someone is trying to actually use NoSQL instead of just praising it ;)


hey thanks for your comment (I'm the author)

I must be biased by my peer group :). Maybe I'm less concerned about SQL being a default choice than about Mongo or something being the default choice? I say this as somebody who helps people migrate to NoSQL databases as a job.



