Not to stray too far off topic, but this article raised a question that I've had since starting software development in earnest... Why do we have to choose between heavily "normalized" relational databases that structure all data and thus allow more arbitrary queries, and unstructured databases that are fast and flexible but often slow to query in complex ways? Why can't there be a hybrid "smart" database that dynamically generates indexes (to use the term loosely) based on how it's queried, in order to speed up similar queries in the future? With some kind of additional weight given to the most frequent queries. Granted, it would need some time to "warm up" (not unlike a tracing JIT compiler?), and the implementation might be fairly complex, but other than that, I can't think of any downsides...?
It's really fast; if it doesn't meet your requirements you're probably already aware that you need a specialized solution.
Besides explicit indexing, it also analyzes your data and maintains statistics that inform the query planner. As your tables grow and your schema evolves, it will change its query plans accordingly. It can be quite smart. (A few examples: it maintains histograms of common values, null percentages, the correlation between the physical and logical ordering of rows, and much more.)
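PostgreSQL, for instance, exposes exactly these statistics through its pg_stats view, so you can see what the planner sees (a minimal sketch; the table name is made up):

SELECT attname, null_frac, n_distinct,
       most_common_vals, histogram_bounds, correlation
FROM pg_stats
WHERE tablename = 'users';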
Of course, you also have a lot of flexibility to denormalize and get the best of both worlds.
But yes, generally speaking, I'm also surprised to see that in 2013 we're creating our indexes manually. While there are of course applications where you really want to ensure that all the right indexes are created ahead of time, for most applications having indexes created automatically based on query patterns would seem like a much better solution.
In a way, this is what caches can do. Besides that, Oracle is already very good at giving you statistics and tuning hints to help you assess where you could add an index:
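For example, you can ask Oracle for a query's execution plan and look for expensive full table scans that an index would eliminate (a minimal sketch; the table and predicate are made up):

EXPLAIN PLAN FOR
SELECT * FROM orders WHERE customer_id = 42;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);

Oracle's SQL Access Advisor can go further and explicitly recommend indexes to create.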
DB2 and SQL Server probably have similar tools. As far as I know, though, none of them goes as far as automatically creating or dropping indexes.
Many databases also support XML data types. And of course, you can always store unstructured data in CLOBs and BLOBs.
What I was trying to imagine in my post, though, was some hypothetical DB where all this would be abstracted away--i.e., it would expose a NoSQL-like interface for all data, while working behind the scenes to provide SQL-like speed for frequent queries. Hence the analogy that it would be the "tracing JIT" of databases.
There are several reasons why you would not want to design a large-scale database this way. It implies an enormous amount of extra data motion, which is arguably the major killer of performance in parallel databases; it is the same reason Hadoop is so slow and inefficient for analytic queries. Additionally, secondary indexing structures and similar types of de-normalization offer poor performance in distributed environments due to the implied consistency and coordination problems; they will severely degrade insert/update/delete performance as the data structures grow. Most database applications also value a high degree of predictability of performance under load, which this design definitely will not allow. These tradeoffs are unacceptable for many (most?) applications.
In short, while you could design a database that does what you suggest, it would only be useful at tiny scales or for read-only databases where consistent performance characteristics are not a requirement.
Well, that is true for SQL as a whole, but certainly not true for a rather large subset of it. And how much is "quite large"? We have a client who claims that 160GB is big (he is running a cluster of 8 machines now, poor bastard).
I normally design around databases in the tens-of-terabytes to tens-of-petabytes range; it is not an unusual scale. A database engine designed as the parent outlined would start to exhibit unacceptable performance characteristics at single-digit terabyte scales. A terabyte is tiny; I can buy servers with that much memory.
Complex multi-attribute selection and joins are efficiently parallelizable, but most databases do not implement the algorithms and data structures that allow this to be realized.
It seems maybe some of the NewSQL stuff is doing this. You use "SQL", but the data is still organized by some kind of hierarchy (like in F1). I'm trying to get into Google's Dremel paper to learn more about how they do it.
In general, though, it seems that whenever something might plausibly be normalized, even if behind the scenes it's actually denormalized a bit, folks try to layer some kind of SQL-likeness onto the database.
The solution in most cases is to maintain transaction-optimized and query-optimized databases separately.
(Context: for more than a decade he tried to educate people on the relational model, the ill-thought-out back-to-the-past ideas that are NoSQL DBs, and more besides. Unfortunately his old site was scrapped, it seems, and as of 2012 it has turned into this blog: http://www.dbdebunk.com/ )
This seems to me the most important point of the article, and I fully agree here. What I'd like to add is that a DBMS like CouchDB is IMO fully suitable for the use case you describe. ArangoDB, which was also presented at that conference, appears to be even easier for those kinds of data stores, though I haven't been able to play around with it yet.
There is no such thing as a perfect DBMS, and that applies to RDBMSs as well. With Couch you give up sharding, ad-hoc querying, and per-document security, in return for conditionless replication, relative foolproofness in terms of indexing, the ability to have fully normalized relational data (compared to other NoSQL stores), and mobile offline capabilities.
For example, a join over two collections in AQL:

FOR u IN users
  FOR f IN friends
    FILTER u.active == true && f.active == true && u.id == f.userId
    RETURN { user: u, friend: f }
versus the equivalent SQL:

SELECT u.*, f.*
FROM users AS u
JOIN friends AS f ON u.id = f.userId
WHERE u.active = true AND f.active = true
> For example, think of nested list structures inside a document, graph traversals etc. There is no way to query such structures in standard SQL, and deviating from standard SQL does not make much sense.
I would have thought that PostgreSQL had solved this pretty well. So has Oracle with its OBJECT types. There must have been a complete SQL show-stopper in their use cases for them to invent yet another language.
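PostgreSQL's array types, for instance, already let you store and query nested lists directly (a sketch with made-up names):

CREATE TABLE documents (id INT PRIMARY KEY, tags TEXT[]);

-- expand the nested list into rows
SELECT id, unnest(tags) AS tag FROM documents;

-- or filter on list membership directly
SELECT * FROM documents WHERE tags @> ARRAY['database'];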
I guess the language was inspired by XQuery, even if they don't mention XQuery in the FAQ.
I think SQL is really good for querying relational databases.
Actually we started by implementing something like SQL's SELECT clause at the very beginning of ArangoDB. The rationale was: "why invent another language? SQL is everywhere, so let's use it!".
We very soon found that SQL is not a good fit for working with semi-structured data. There is no definite schema for a collection in ArangoDB, so it is unknown which attributes (think: columns) a document (think: row) will have. Thus using standard SQL would have introduced a lot of potential ambiguities. Example:
SELECT a, b, c
FROM c1
INNER JOIN c2
ON (c1.x = c2.y)

Without a schema, there is no way to tell whether a, b, and c belong to c1 or c2, so every unqualified attribute reference is potentially ambiguous.
Apart from that, it is common to have multi-valued attributes in document databases (and thus in ArangoDB). Think of an attribute which itself is a list.
SQL really is not designed for this. Putting a horizontal list into a single attribute/column is an anti-pattern in the relational world. Instead, you would normalize most data into separate n:m mapping tables etc., and join them later.
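In SQL that would look something like this (a sketch with made-up names):

CREATE TABLE documents (id INT PRIMARY KEY, title TEXT);
CREATE TABLE tags (id INT PRIMARY KEY, name TEXT);

-- n:m mapping table between documents and tags
CREATE TABLE document_tags (
  document_id INT REFERENCES documents(id),
  tag_id INT REFERENCES tags(id)
);

-- reassembling the list requires two joins
SELECT d.title, t.name
FROM documents d
JOIN document_tags dt ON dt.document_id = d.id
JOIN tags t ON t.id = dt.tag_id;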
No need to do this with a document database: horizontal lists are supported, and there is less NEED for normalization. It is up to you how to model the data.
With all that in mind, we very soon switched to a language we thought would better fit a non-relational database such as ArangoDB.
We intentionally decided against using SQL keywords, to avoid confusion.
I hope this makes the decision to start AQL a little more comprehensible.
I guess broken worldviews like this are why Datalog never really got off the ground in the DB community.
That one's from the time when XML databases were all the rage. Unfortunately, its publication year is missing. (The most recent cited article is from '97.)
SQL and NoSQL are two sides of the same coin, mathematically
One of LINQ's main visions is to unify querying regardless of the data source, and it does so quite well. When the author of that article conceived LINQ, he intended to unify querying over RDBMSs, XML, and objects. "NoSQL" (whatever that is) is a type of data store that was not part of LINQ's original master plan, so LINQ has to be retrofitted and enhanced to accommodate the new requirements. It would be all too nice if things were as easy as a simple duality, especially given that the author of that article has now also founded a company called Applied Duality Inc.
But history will teach us where these things go. I don't yet see a second E.F. Codd solving the complexity introduced by the new abundance of NoSQL data stores.
(As always, when Erik Meijer writes something: Very interesting read, nonetheless)
I must be biased by my peer group :). Maybe I'm less concerned about SQL being the default choice than about Mongo or something being the default choice? I say this as somebody who helps people migrate to NoSQL databases as a job.