Hacker News new | past | comments | ask | show | jobs | submit login
In defence of SQL (seldo.com)
125 points by seldo on July 12, 2010 | hide | past | web | favorite | 126 comments

I've never understood why programmers don't love SQL the language. It's the closest thing you can get to the silly voice interface they use in Sci-Fi:

"Computer, give me a list of orders for red sweaters, but skip ones from customers who signed up more than 3 months ago"

... and it will figure out how to get you that list. And the syntax is so simple that you can do pretty much everything you'd ever want with six keywords. The query to get the list above is actually shorter than its English language description.

Whenever I see some new ORM come out with its own way of asking for records from a database, a little question mark pops up over my head. Why would you do that? Did you really spend time building something to make your problem harder? Your code less elegant? Why why why?

The only thing I can imagine is that people simply don't bother to learn SQL before deciding they don't like it. Too bad too, because writing queries is one of the most enjoyable bits of the software development game.

Part of the attraction to ORMs is that you just define your classes and essentially freeze-dry and thaw objects with a couple of lines of code.

[tangent] That's part of the attraction to NoSQL as well. You just serialize your objects to JSON or something else (probably already being done if you target the web). It's simple.

Back to ORMs, you can't avoid SQL wrappers at least to help w/ injections, and you probably want a layer that serializes/deserializes your objects as well. Once you've written that a few times it's boring so you abstract it. ORMs are pretty much inevitable when writing the common web app.

ORMs should borrow good things from SQL, like LINQ. Mainstream languages are better now than when SQL was created. I mean, it's not the exact syntax you like right? I prefer writing code rather than preparing and parsing strings. Plus if you ever decide to swap out the backend it's far easier when things are loosely coupled.

SQL is akin to JS in the browser in that you need a library to abstract away the implementation differences or you'll be tightly coupled to a specific one. Maybe you don't need to run on all backends simultaneously but options are always nice.

It all depends what the task is anyway though.

Part of the attraction to ORMs is that you just define your classes and essentially freeze-dry and thaw objects with a couple of lines of code.

The problem here is that you stop thinking about your data as an independent thing that must be designed for long-term persistence.

Code itself should be mutable, easily changed and modified. The on-disk data is the opposite of this -- it must be iterated on carefully, and its requirements must be clearly defined as to be supported in the long-term.

I'm very wary of tools that do not make it easy to explicitly and rigorously define the data model entirely separately from the object model that will be used to represent it in a specific version of one application.

I think tools like LINQ are valuable because they make working with relational algebra and projections cleaner, but do not abstract the fact that you are working with a long-term persistent data store that is not the same thing as your comparatively short-lived object graph.

Part of the attraction to ORMs is that you just define your classes and essentially freeze-dry and thaw objects with a couple of lines of code.

That's not unique to ORM. We don't use an ORM at Expat, yet all our objects still have .Load() and .Save() methods. I hope you didn't read my post to mean I advocate hand-writing your CRUD!

I think LINQ is a great example of needlessly re-inventing something that wasn't very hard in the first place. Especially since its syntax is so close to SQL. Why not simply write your ad-hoc reports in SQL itself and save a lot of trouble?

LINQ is far more useful than just as language sugar over SQL. The amount of boilerplate it removes is substantial for anything working with sequences. I actually havent used the SQL backend for it.

But since I often work with large collections of heterogenous data outside of a DB, I use it all the time to transform and filter collections in ways that would be very tedious in other languages. Libraries can get you some of the way there, but the language integration is what makes it so painless.

There's a lot of power and flexibility using a Model with LINQ. The strongly typed nature has saved me many times. When you alter a table and remove or rename a column you know exactly which queries are now broken at build time. If you were generating or hardcoding sql there's a chance that you hosed a lot of things and need to dig through all the code.

Dynamically creating queries is amazing in LINQ. Building them with strongly typed expressions is much better than messy string concatenation.

Again, that's not unique to LINQ. Drop a db column in the thing we use and the project will fail to compile until you fix up the references. And of course, if you're keeping your stored procedures in source control and recompiling them at build time (which of course you should), you'll know when you break them too.

As you say, strongly typed dataobjects are pretty cool. Regardless of how they interface with the database.

In contrast to client-side JS in case of SQL, there are at least some cases where limiting oneself to a single target can make sense. Not every application needs to run on a pletora of different databases.

In addition, over time, the official SQL standard got better and better (quite unlike EcmaScript which took a 10 year break and then returned with minimal improvements) and the databases began following the standard more and more as being standards compliant became a selling feature (which is also quite unlike the ecmascript case where not even the innovative browsers support all of the new features from ES5)

Hence, I would claim that the need for an SQL abstraction layer is way smaller than it is for client-side JS

Agreed. Personally, I really like SQL and whenever possible I'm going as much straight SQL skipping strange abstraction in ORM layers.

Performance wise too, nothing beats hand-optimizing queries rather than hoping that the ORM of your choice doesn't decide that the particular method of retrieving the total value of a specific order will need one query per line item plus manually calculating the total.

I've never understood why programmers don't love SQL the language.

Many programmers do love SQL. It's not a perfect solution, but it's better than most of the via alternatives.

Anyone that "hates" SQL should be forced to read C.J. Date's Database in Depth, after which, if they still hate SQL, it will at least be for the right reasons.


Hey, that's my joke:-) I've always imagined one of the old Star Trek episodes where Kirk says "Computer, select course where Klingons equal to zero" or something to that effect.

SQL, or rather the relational model, is great for what it does. A thorough knowledge of both is central to most of what I've done over the last many years.

But there are places where you don't really want named attributes with the names known beforehand, you want to generate new attributes, perhaps named on the fly, perhaps designated ordinally. The relational model doesn't really have room for this. The various flavors of SQL offer bolted-on ways to do this, incompatible & quite different. But it's a living...

I've been using SQL heavily for 12 years and mostly agree. The problem with SQL though is that it lacks abstractions. Any time you're forced to use dynamic SQL, the language has failed.

For example: suppose you want to write a SQL script that populates a database with dummy data. There's no way to do it without dynamic sql, since SQL doesn't let you parameterize table and column names.

I think he's got it backwards. He says SQL is a great language in front of a crappy model, the relational model. I think SQL is simply a limited language in front of a great model. The NoSQL folks are throwing out both, so I guess that's "the only way to be sure." Going to be fun seeing them re-invent the wheel for the next 5 years.

What we need is a more elegant, flexible language instead of SQL for data retrieval. SQL++, since the relational model is sound. Beyond that, we need a more intelligent data store that will be able to translate a fully normalized schema into a high performance, distributed, persistent representation that is queryable by this language.

There's no reason in theory this can't happen, its just hard. To think that dropping us down to GET and PUT operations is the answer is just saying "we give up."

That's an overly simplistic framing of 'NoSQL'. Despite every piece of FUD to the contrary, 'NoSQL' is a stupid name for a set of tools designed to solve an entirely different set of problems.

I don't want some data in a relational model. Yes, I could put it there, but I would be doing so solely so that it fits in the relational model. Which is pretty pointless.

The fact that the relational model is "sound" doesn't mean that everything belongs in it. Plus, I'm not holding my breath waiting for an "intelligent data store that will be able to translate a fully normalized schema into a high performance, distributed, persistent representation that is queryable by this language". That's the description a mythical database panacea that solves all problems with no downsides.

The continual pitting of NoSQL vs SQL is a gross mis-characterization perpetuated out of holy-war fervor, yet based on few facts or even understanding of the differences.

Well, the defining aspect of "NoSQL" databases is that they don't use SQL. Generally, they have different semantics (key-value, document, etc.) too, but it's not necessary. Its name tends to make it sound adversarial towards relational databases in general, though.

I'd like to see more relational, column-based databases, with better query languages - I like the relational model, but find SQL to be a bad compromise at best. I think the distinction is lost on a lot of the people caught up in NoSQL hype, though.

The relational model is a comprehensive, complete way of representing any form of data. Just because the abstractions in front of the relational model make it somewhat klunky to access, say, keys and value based data models, has nothing to do with the mathematical soundness of the model. The limited "models" most of these NoSQL stores provide are simply subsets or re-projections of the relational model where the implementation basically bleeds all the way up to the modelling abstraction of the system: they are one in the same.

The database I described is not a panacea, you'd still have to do both good modelling and describe the ways to optimize the materialization of your data. It doesn't make the core challenges of data architecture any simpler, you still need a brain. It does, however, provide a clean abstraction and would in theory maintain most of the good things that RDBMSes have brought us without throwing all the hard work of our elders in the trash can.

If you (and everyone else who've made the same argument) substitute "Syntactic Sugar" for "Abstraction" and "Turing Complete" for "Relational Model" I think you've just made the claim that every application can or should be written in Brainf*ck or one of the other Turing tarpits. Just because a model is mathematically proven comprehensive, (I'm not 100% sure if relational is but that doesn't matter) doesn't mean it's the most natural way to express/solve your problem. And really, that is the only thing that matters. At least when comparing models and not implementations.

> The relational model is a comprehensive, complete way of representing any form of data.

Yea, but it's not the best or the easiest to represent all forms of data. Just because a problem can be solved with the relational model doesn't mean it should be; there are many ways to skin a cat and sometimes non relational solutions work better for the problem at hand.

> I don't want some data in a relational model. Yes, I could put it there, but I would be doing so solely so that it fits in the relational model.

Almost all data fits in the relational model. And it's an excellent way of organizing data. A lot of NoSQL solutions are simply unorganized blobs. That might be fun at the beginning of projects (or for small projects) but it's a step in the wrong direction.

i've designed plenty of organized blobs that were queried with SQL. Garbage in...

I've used a SQL database as a document store, using a blob column to store a serialized data structure. It was used to store heterogeneous documents of various types across various installations. I did for all the reasons that have been given for schemaless data storage in NoSQL solutions. For each new document type, there was no need to adjust any database schemas -- whatever properties you have are just stored.

However, unless I had a really large number of documents, I wouldn't do this again. With a schemaless solution, you're just moving the problem of schema changes into the code and whatever structure your data was at any time you have to support forever. Right now, I have conditional branches in code to support an document structures that haven't been relevant in years. In SQL, you can alter the table and write a quick query to update your data and you're done. Whatever mistakes you made in your design are history.

Proponents of schemaless storage claim that it's great for development but I disagree. I change my design constantly -- add tables, remove tables, split columns, you name it. I don't hinder myself because my data is organized into typed columns. I alter the data as needed to fit the new structure. As a benefit, I never have to support my previous mistakes.

Would you say that the schemaless storage debate is then similar to static/dynamic typing, with the same kind of tradeoffs?

Sort of, but worse. If your database has no schema, any mistakes you make can accumulate subtle damage to the integrity of your data that you don't have any way of going back and fixing. I saw this happen in a Notes shop. Over time, some of the documents had been updated by many different versions of the code, until they were in such bizarre and unintended states that not even the developers could say what the correct app behavior should be anymore.

If you are not migrating your data to your current schema, it will decay to garbage. If you are, you already have the old and new schema in your mind, so why not write them down and get some help from the tools?

The other way out is to keep the authoritative version of your data in a store with a schema, and maintain a summary of it in your low-latency store that you can regenerate at need.

I think it's wrong to make the analogy to static/dynamic typing -- but this comes up a lot. It'd be closer to having a type system without any classes, structures, or prototypes -- just hashes. And once you write code to construct an object instance, you can't change that code.

Yea you're right. Relational data is awesome because it enforces consistency which makes everything you build on top of it so much easier.

Well, when the name is "NoSQL", I think it rather much invites being pitted against its namesake.

I guess it's wrong point of view for a hacker. X vs NOX is all about ego, this kind of silly games should be dismissed by a hacker, focusing on technical merits instead.

I completely agree with you that it immature and childish. However, it is something that "hackers" do have a propensity for, just look at all of the "Closure vs. Python" and other garbage threads that come up here. From evidence, these kinds of games are apparently what "true hackers" like the most.


Your point is purely rhetoric.

Does GNU invite being pitted against its namesake by hardcore oldtime Unix fans?

Still, focusing on whether 'NOFOO' name is appropiate or not is quite silly.

It's not a name everyone in the field is happy with, IIRC.

There's no reason in theory this can't happen, its just hard.

Sure there is. I won't attempt some silly proof because it should be obvious that tradeoffs exist in all software. Supporting ad-hoc relational queries restricts the types of optimizations that are possible. Every index you add is going increase write times and yet they may be necessary for some queries. SSDs open up some doors, but still, everything about the way a database is designed is a tradeoff.

I think the point of NoSQL is that nowadays with web software, often we have very simple data that needs to scale massively, and it is just easier to build some functionality with rougher primitives than it is to try to tease the performance out of an SQL database which may not offer fine-grained enough control to offer the basic functionality we need in the performance profile we want.

To think that dropping us down to GET and PUT operations is the answer is just saying "we give up."

SQL database technology is approaching 50 years old. It's had countless billions of dollars and man hours poured into its optimization. Big DBs like Oracle can do amazing amazing things. Even today development continues rapidly on SQL database of all types, both new and old. I'm not sure how a few people working on alternate data stores with radically design criteria can be called "giving up".

Arguably if Oracle was free we'd not be having any of these discussions.

That said, there's still some stuff to be learned from things like BigTable and Dynamo that simply weren't considered when building Oracle due to CAP. I'd wager that what you're seeing now is the dark ages of where we regress back to infancy, forgetting the lessons of the past (throwing out the schema? really?) before someone comes along and solves the root problems correctly.

Edit: Oh, and to your other point about tradeoffs. I'm not saying there aren't tradeoffs involved. The point is, the "tradeoff" should be "oh, please configure this table to not support custom WHERE clauses by materializing it into a simple K-V store. Thank you Mr. Database", or "relax consistency here so we can have availability across data centers on this table, thank you Mr. Database." But these should be easily described and diverge from the default, ACID-compliant, normalized, schema'd, and relationally-queryable data storage.

I agree the relational model is much closer to optimal than SQL.

Looking at JQuery as a query language was a breath of fresh air, and made me realize how wonderful a query language can be, and how capable the software community can be.

The relational model is not a sane way to store a book, a compressed song, a code base, or many other things.

SQL databases are a great way to store some kinds of data.

The relational model is not a way to store anything since there are no implementations which are both faithful to the model and solid data stores.

Real world SQL data stores are suboptimal in different ways for almost all tasks, so progress is possible.

Ultimately a disks (and memory) is get and put. Anything model we put above that makes some assumptions, and they may be the wrong ones.

So where are we going. Here is what I think: requiring the articulation of the domain twice, once for a datastore, and once for application, is spitting on humanity in a grotesque and obscene way, so SQL must die.

Almost every database I've ever seen was used by more than one application, and from different languages. If you succeed, you're going to have ops and sales and support and finance and bizdev needing access, and they're going to want to use their favorite tools rather than wait for you to build poor imitations from scratch.

So we are stuck with SQL forever because it currently has the broadest set of access tools?

We aren't necessarily stuck with SQL (for instance, right now there are LDAP or XML or RDF-oriented datastores out there with fairly open APIs and decent interop), but we are stuck with "articulation of the domain twice, once for a datastore, and once for application" unless one programming language finally conquers all the others and becomes the only platform your schema needs to be usable with.

"Anything model we put above that makes some assumptions, and they may be the wrong ones."

One of the main points of the article is that if you don't have a good handle on what assumptions you need to make about your data, choose a relational database. That is because it makes the fewest assumptions about the kinds of questions you will need to ask about your data. If you pick a different solution that does not support the kinds of questions you would like to ask, you will experience much pain.

If the question you end up wanting to ask, is "Can you make it easy to distribute writes to this store?", then SQL is a poor choice.

We have had a stagnant SQL monoculture as a default choice for data, for 20 years.

Given a project of with an uncertain future we are now starting to get other choices.

We know that using an SQL databases will create pain. Suffering that pain in the hope that later questions may be answered less painfully, is becoming a mindful decision. Especially in startups, not the default that it has been in the enterprise world.

Google seems to be a lot better at answering questions than Sybase is. Out of date, offline, full text indexing is not something SQL databases do well.

For content like movies, music and large documents, we simply don't (and should not) create a relational model, we use big blobs and dump them into whatever tools we have around.

We all would agree that if your data looks like source code you are better off with git than with MySQL. So where else does "if you don't know use SQL" breakdown.

This article reiterates a fundamental point that object stores quite clearly present less friction during development than relational stores. This is plainly true: your domain object is meant to model the domain, and if you can store it directly that is the best you can hope for. There's a reason there's a whole topic of study and research called "Object-relational impedence mismatch" (http://en.wikipedia.org/wiki/Object-relational_impedance_mis...)

However I think ORMs are too quickly dismissed. In the simplest case, an ORM can appear to be a key-value store at the API level. Of course that is a gross abuse of a relational store, but there's no reason it can't be done. Additionally, the author fails to note that an intelligent ORM can provide a caching layer that improves performance over raw SQL. A more detailed examination of the various patterns to resolve the impedence mismatch, such as active record versus data mapper, ORM caching strategies, and how those compare to a key-value store solution, I think would be a valuable topic of discussion.

If you find that your objects don't map to a relational structure, it's probably because you're doing things in the wrong order.

The early design for any system should be done in your database's diagramming tool. Figure out a data structure that holds the information you're application is going to use, and once you've got that you've also got your object model.

You can then generate your whole backend by pointing your favorite code generator at the database schema. You'll end up with the same User, Customer, BlogEntry or whatever objects you would have come up with if you'd just started coding from scratch, but they'll come with a .Save() method that stores them away in a data model that fits them perfectly.

This sounds like the waterfall model of software development to me.

I mean, don't get me wrong, this is what I do, too (I only started using MongoDB over the weekend after years of MySQL and Postgres), but the similarity to the arguments is pretty interesting.

I suppose, provide you never changed the data model once you'd designed it, and you designed the entire application in one go, before writing any code. That would sort of negate the advantage of having a datalayer that regenerated itself on the fly every time you messed with the schema though.

Personally, I find that starting from the schema makes one more agile (small 'a'), not less so.

How do you deal with modifications to the generated code when it's time to change the schema?

I guess the answer is to just make your domain objects glorified structs. I've never liked that answer, as it seems to defeat the whole point of object oriented design.

If the ORM layer is providing a caching layer on top of the RDBMS, that caching layer is effectively a non-relational store for the data. If that works better, why not use it as the primary store?

Look, honestly sometimes ORMs are very helpful for developers, especially when they're built directly into a framework (e.g., Django). They may be more useful if the database itself had knowledge of what you're trying to do (hint, hint). If you're arguing that they will have necessarily worse performance characteristics, you're wrong.

Source: I'm currently a SDE intern in the SQL Server team at Microsoft.

Because you want to do arbitrary queries on the data? Because the data needs to be exposed to other applications?

Not necessarily - you could store a resultset based on the query and parameters, and if the same query and parameters come within a certain amount of time, you return the whole resultset from the cache. In this scenario, the resultset could contain a filtered list, aggregate functions, etc, that are more difficult to replicate in a key-value store.

That is exactly what Oracle Coherence (nee Tangosol) does.

As you state in your article, the ability to easily phrase ad-hoc questions and have them answered with almost no programming is a very useful tool.

Why not use two tools where two are needed?

Many databases run a mix of OLTP (which requires fast cache and no ad-hoc queries) and ad-hoc business intelligence queries.

> easily phrase ad-hoc questions and have them answered with almost no programming

Queries are programming, the worst kind, programming without composable abstractions.

What people actually mean when they say this is they like having a workspace and a live environment. It's like working in Smalltalk or Lisp in a live image.

Put your email in your profile, I'm disappointed you didn't continue the discussion yesterday after saying do it over email, which you don't make available.


It's not plainly true at all. You write as if there's a rigorous, agreed-upon notion of "object". Not so. As such, it can be as vague as you need it to be.

Not every person agrees on every aspect of the relational model, or of the SQL implementations.

There is a rigorous notion of .net object.

Relational algebra is agreed upon. SQL is loosely based on it. On the other hand, there's not even a single agreed upon notion for objects.

There's a rigorous notion of a .NET object but it's not compatible with Java's, C++'s, Smalltalk's, ...

>> The birth of ORM lies in the fact that SQL is ugly and intimidating

I disagree. In my opinion, the birth of ORM lies in the fact that a lot of people want to work on problems that have almost nothing to do with SQL but need persistence. They want to spend time working with objects in their domain model and just have persistence that works.

So if that's the case, why did ORM even appear? Why didn't everyone just spend time figuring out "NoSQL"/key-value stores/object stores and move away from RDBMS? 1) SQL and relational data stores, as the author himself points out, are trustworthy and standard. People may not like working with them or be intimidated by them, but most people can at least figure out simple joins and things. The relational model may not be the best answer for some things, but it at least gets the job done. 2) Just because I want to work with my domain objects in my application, and just because I'm short-sighted and all I care about at the moment is enabling a business idea or solving an interesting problem with my app, doesn't mean that I won't want to "get answers" from my data in the future. Maybe it won't be me - maybe it'll be another group in my company with more RDBMS knowledge than I have that's interested in data-mining my store, and because I've used an RDBMS, it will work extremely well.

I don't think developers are scared of SQL, I think they just don't care about working with it because it doesn't have much to do with what they're interestd in. Maybe ORM isn't a perfect solution, but it lets me have my cake (persistence that gels with my domain model) and eat it too (a relational backend when we need BI and reporting).

I agree. The OP seems to fall into the camp that, if ORM can't replicate everything that you would do in native SQL, then therefore ORM is bad. Which is absolute nonsense of course.

I just don't see how the pure SQL people and ORM people are philosophical opponents, other than how ORM tools make dealing with SQL (which is an awesome, proven technology) so much easier, I suppose it could theoretically be a threat to their livelihood. The way I see it, they are partners in the same cause.

I'm the same as you, I write applications that need to store and interact with data. The difference between using an ORM or not is like light and day, and for me, this constitutes 90% of my work. Certainly, there are situations where it is just common sense to drop down to raw SQL (and if your ORM doesn't support this, you are screwed), but people like the OP seem to subscribe to the "throw the baby out with the bathwater" philosophy of life....if you can find a single problem with a solution, throw the whole thing out.

This is one of those arguments that will never, ever go away. I used to be passionate about the subject, but I don't care much anymore. The anti-ORM people will continue to say it doesn't work (ever), while the pragmatists will continue to use ORM because it does work, and for the edge cases, they will drop down into pure SQL, which most developers know very intimately anyways.

On a related topic, how many DBA's have you met that understand SQL, query plans, etc deeply, and have a strong knowledge of database system tables (as well as you do as a developer?) Most DBA's I've met basically follow: http://en.wikipedia.org/wiki/Argument_from_authority

The reason pure SQL people and ORM proponents don't always get along is that ORMs make it very easy to write really crappy queries. I'm fully in the ORM camp, but it is true that I can't turn on SQL statement logging without being terribly embarassed. And tuning and optimizing these queries isn't always as trivial as some make it sound. For fetching some data, it's not terribly hard, but try explaining to Hibernate that you want to bulk delete a bunch of objects (with inheritance) without screwing up your session...

It's all about people, not the ORMs or other technical inventions.

Most shithead queries I encountered during years of DBAing were written by hand by poor folks abandoned by their colleagues and over-confident managers.

Just no one cared to educate them, but everyone expected "progress" from them. So they did what they could..

In fact I wished many times that they used some abstraction layer which would save them from doing such harm, simply because this layer would be written by more educated people.

> I suppose it could theoretically be a threat to their livelihood.

Bingo. Every time I've seen someone hating on an ORM, it's always a DBA.

You wouldn't believe the number of queries that I've written by hand while thinking "ActiveRecord would have just written this for me" because An ORM Couldn't Possibly Be Performant Enough For Us!

The OP seems to be very fond of SQLs ability to answer questions about your data: ORMs are very good for quickly and easily putting your apps data in a RDBMS, so you can answer questions about its data using SQL.

These arguments make me feel like DBMS technology is insufficiently distinguishable from magic.

Databases do two things: they persist data, and they answer queries about data. Neither of those requirements put any restrictions on how they serialize data. Relational storage, object storage, document-based storage, key-value storage, graph-based storage, etc.—it all just seems like different optimizations on the same "real" data, something that should be abstracted away from the database administrator entirely.

Why can't we just have a database that gets fed a normalized, relational-algebra-style model of the data, but then profiles the kinds of queries fed to it, and the kinds of data given, and builds whatever kind of denormalized tables/indexes/k-v stores are optimal for your usage patterns, using a Relational, or Object DB, or Graph DB, etc. engine when each is appropriate (perhaps firing them off in parallel and returning whichever answer comes first, killing the other threads)? Is this something anyone is working on?

A "query optimizer", let's call it.

That'd be so cool. I wish RDBMSs came with query optimizers. Maybe something that could ANALYZE (or, better yet, EXPLAIN) their performance, so people could tune it.

And yeah, sarcasm aside, it's easier to have a DBA tune things than some AI - of course: "Why can't we just have a compiler that gets fed a description of the problem, then solves it, calculating the optimal solutions for all of the engineering trade-offs, and then builds whatever kind of interfaces are optimal for your users choosing Design Patterns as appropriate. Is this something anyone is working on?"

Yes, RDBMSes already have the capability to tell you how to best optimize them. The problem—and it's a sociological problem, not a technological problem—is that people don't want to tune their databases. Instead, they want to install something that is already tuned 80% correctly (i.e. a NoSQL store), and then ignore the other 20% of possible optimization.

There's a big, unaddressed opportunity here: if you can just make your DB smart enough to tune itself 81% correctly (and that's not very hard—certainly not AI hard) then you can conquer the market for people who don't know how to EXPLAIN their way out of a paper bag. It's not like it would be inventing its schema on the fly; you'd still be feeding it a formal, completely-specified but unoptimized set of tables. It would just do things like:

* Notice when it gets a query with the same parameters 10000x in a row, and grind out an index for it;

* Notice when a large (>100000) set of rows in a table is identical save for their primary key, and only queried by COUNT(foo), and turn those rows into a single row with a foo_count field (and a log table in case a query ever does come in that cares about the IDs);

* Notice when two tables are only ever queried joined together, and morph the primary store for them into a denormalized table, turning the previous, separate tables into views on the denormalized table.

And so on. If you look at, say, Python's hashtable library, it has all sorts of these little "rules of thumb" for when to alter its internal representation, because of what it guesses the user is up to (storing unique pairs, creating mathematical sets, nesting hashes to mimic a complex data structure, etc.) Databases, on the other hand, much bigger systems purely for flexibly storing and querying data, just stubbornly sit there until the DBA (because there must always be a DBA, even if the DB is just being fed migrations written by Rails CRUD coders) comes along to tell them to do the things they already know would make them faster. It's just silly.

Are you really saying that people are too lazy to draw lines between tables in their database? Too lazy to right click a column name and click "Index"? That you wouldn't notice on your own that a particular query took four minutes to run and maybe open it up and poke around? That this should be something the database does for you???

That is not the hard part of my job. That is the easy part. No thank you.

But, as you say, it is part of your job. The difficulty, or ease, of the problem, has nothing to do with it; at some shops, it's nobody's job, and so it just doesn't get done. In 90% of shops, the last time anyone so much as glances at the database is to run the CREATE TABLE query. No one profiles, no one optimizes—you're lucky if anyone even declares any indexes (ahead of time, as a guess, of course.)

And, consider: if it's the easy part, why are [some] people switching to NoSQL just to avoid doing it? ;)

The most frightening part of this debate is that people who are too lazy to properly define indexes are considering NoSQL solutions (and most of those require indexes too)! Properly utilizing a NoSQL requires more work than an RDMS, that's trade off for using a more low-level technology.

I think you misunderstood me.

Part of writing the CREATE TABLE query you mention is hooking up primary and foreign keys and setting up indexes. That's just how you build databases. Even if you never look at it again, you still need to do it.

I agree with your points and ideas of auto-optimization.

I'd just replace "at some shops" with "at most shops" to reflect reality a bit more.

Haha, classic.

This sounds suspiciously like a sufficiently smart compiler.

I was mostly thinking of the profile-driven optimization of the HotSpot JVM—something that was repeatedly called a Sufficiently Smart Compiler, until it happened to actually work.

Apple's Core Foundation data structures do this sort of "usage-based implementation selection," though in a much simpler context: (see e.g. http://ridiculousfish.com/blog/archives/2005/12/23/array/). I'm not sure why this technique isn't more widespread.

This is what is going to need to happen. I think now that people really understand how to build distributed data stores like HBase and Cassandra you'll see those be the "storage engines" behind such a system. The abstractions need to stop leaking into the query language and modelling tools. We'll get there I think.

If you can build this, please, please let me buy it. Or preferably invest in it.

I don't believe that there really are all of these developers around who lack knowledge of SQL.

I learned SQL while I was learning to program. My very first (real?) program was a php hit counter that wrote to a hits.txt on the server. My second one was a guestbook that used mysql. My third one was a basic hierarchical forum. I wasn't a wizard, I was just a kid reading a book, bored at the Pennsylvania Farm Show.

Thus, I have spent my fair share of time working with sql queries in php code. Guess what? The SQL queries weren't very hard at all.

However, SQL Queries had almost nothing to do with actual using SQL itself. 90% of the code written in such a project is guaranteed to be cruft code surrounding the sql queries. Special cases abound. The difference between "column = $val" and "column is null" is so painful that I just can't imagine going through that anymore.

And that's just the first example! What does "" mean if you save it in a db column? Should you save it as a null value, or as a string with zero characters?

I just don't care about this at all, and you'd have to be a damned fool to think that I would ever willingly go back to doing that bollocks.

I already have my in-memory data structure laid out. ORM lets me keep from worrying about it it again (at least not nearly to the detail I would with SQL). With an ORM, 90% of the database code that I used to have to deal with is simply a non issue.

No personal offense intended, but there seem to be a lot of people who kinda-sorta understand SQL in that way, but run into performance issues because they don't understand important basics (indexing, for one), and are too quick to blame their problems on "SQL not scaling".

When I hear people psyched about the whole NoSQL thing but actually quite ignorant about relational databases (like a guy who gave a presentation on CouchDB at a barcamp here last year), I understand why people sometimes call it the "NoMySQL" movement - a lot of the hype seems to come from people whose experience seems couched in PHP+MySQL tutorials, specifically.

    I don't believe that there really are all of these developers around who 
    lack knowledge of SQL.
I'm not sure why you don't believe that, but as someone that has done development at a variety of companies: trust me, there are plenty of those. They may even be reasonable at PHP or Java, but they couldn't subquery their way out of a paper bag. At one place, the CTO, formerly main PHP/MySQL developer, didn't know what an index was. You can figure out the rest ;)

Sorry; I'm not saying that all developers know SQL. I'm saying that the ones that we are all talking about know, or are competent enough to quickly know, SQL.

To me, the people that you are talking about have no context in the discussion in the least. They would be the last to jump on the NoSQL bandwagon because they don't really care, for example.

The universe of discourse only includes the devs who care, e.g. the people who are creating ORMs, using ORMs, using NoSQL, etc, not _all_ developers. I kinda assumed that was implied.

I think you are misunderstanding the article; it's not that object models for data are bad, it's that mapping them to a relational store is inefficient and pointless.

The point against ORMs is not that they have inferior APIs -- in fact I say how bad SQL is. The point is that they wrap quite nice APIs around an inappropriate data store. If you have an object-like in-memory data structure, why not use an object store and have it in your data as well? The relational nature of your store is working against you.

If your ORM exists to keep 90% of the database code hidden, why not dispense with the RDBMS entirely and lose that useless 90%?

Maybe because it makes that 90% legacy code useable?

This article has its amount of religious bias and wishful thinking. The author tries to establish his base on a fact that SQL is all about relational algebra and it's the right paradigm to query your data. While I agree with the latter, the author (and lots of other people) seems to forget that SQL is a dumbed-down-aimed-to-mediocre-programmers contraption inspired by relational algebra, but not really properly relational! It's really far from what Codd had in mind. If you read the history of early SQL dialects and DB2 vs Oracle competition it's clear that SQL was marketed as a language everyone can program in (thus the stupid imitation of a natural language) -- we know in XXI century that this is a bs...

I'd say that SQL is a COBOL of relational algebra...

Are there any (more suitable) alternatives to SQL?

The K/Q people (http://kx.com/) have one interesting interpretation. It's quite expensive for production use* and uses an APL dialect as a query language, which tends to scare away a lot of developers. I'm utterly fascinated with it, albeit from a distance. (J is cool, too, but doesn't come bundled with a high-end in-memory database.)

* I've heard it can run $60k/CPU. It's designed for realtime stock data mining (https://cs1.cs.nyu.edu/cs/faculty/shasha/papers/hpts.pdf), so they get away with it.

Can you give us some syntax examples?


I meant to reply to this, didn't notice until after the edit window closed...

If it's anything like APL you probably don't want to see it...


Chris Date is worth reading for his ideas on what a good language for relational algebra should be like and why SQL isn't that language.

"The birth of ORM lies in the fact that SQL is ugly and intimidating"

The notion that people steer away from SQL because it is intimidating is absurdly ignorant.

The communities that embrace ORMs and NoSQL are the ones who are learning and creating new things all the time. No one's first language is Ruby, they didn't learn node.js in college, they didn't take an (N)Hibernate certification course. They learn in their free time!

Meanwhile, if you were good in SQL 12 years ago then you still are today!

Smarter people than me have written on these subjects, but for me, the answer is simple. As Fowler says, the most pervasive smell in software is duplication. If I litter my code with all the CRUD SQL (even worse with sprocs that contain column sizes) then my software is quite brittle.

If you little all your code with CRUD, your code is horrible indeed.

If I hard code filenames all over my code, I also have horrible code. This doesn't mean that files and file names are a problem or that they caused duplication.

The normal (I hope) practice is to this in terms of transactions or activities and create functions that wrap those activities. SQL should only live within those functions that provide a logical interface to the DB. Its called TAPI or XAPI (transaction API). Think of it as your own domain specific ORM.

> SQL should only live within those functions that provide a logical interface to the DB. Its called TAPI or XAPI (transaction API). Think of it as your own domain specific ORM.

Which is generally repetitive and full of duplication, exactly the thing that drives people to use ORM's in the first place. It doesn't matter that it's isolated into a separate layer, it's still more manual work than an ORM.

It is more work than ORM, but if efficient access to data is an important part of your application, it seems to be extra work that brings benefits. At least that has been my experience.

I'm not clear on why this layer has to be repetitive and full of duplication.

> I'm not clear on why this layer has to be repetitive and full of duplication.

Let's say you have a database with 100 tables in it. How many insert statements might there be, just a rough estimate, in your layer of functions? My ORM has 1. At the most basic level, a simple ActiveRecord ORM has 1 insert, 1 update, 1 detele, and 1 select statement. How many are in your typical app? If it's more than 1 of each, then it's duplication.

I haven't written an insert/update/delete statement in years because I don't need to. When I need a fancy select that's an aggregate of several tables, I write a view and then map it into a list of objects with the ORM. This is virtually no work, I spend all of my time working on the actual problem at hand and virtually none marshaling data in or out of the database.

I'm totally not following here...

How can multiple insert statements be considered duplication if they are inserting different things to different tables? My apps have multiple insert statements. They also have multiple calls to "print". Do you count that as a duplication as well?

We can continue the discussion by email if you prefer. It is interesting to me, but we may be getting off topic.

> How can multiple insert statements be considered duplication if they are inserting different things to different tables?

Because they're all the same pattern of code. Whenever you see common patterns, you're missing an abstraction that can greatly simplify things.

Why would I write 100 different insert functions when I can write one that generates itself dynamically using reflection over whatever object it is given? You only need one insert statement, it just needs to be smarter rather than static hand written code.

You need only tag your objects with a little bit of metadata using attributes to give a generic insert/update/select/delete everything it needs to dynamically generate parameterized statements on the fly for every table in the db that you use. Put simply, the only thing I need to do to use a User table in an app is write something like the the following...

    class User : ActiveRecord<User> {
        [PersistentString(MaxLength=50, Required=true)]
        public string UserName;

        [PersistentString(MaxLength=50, Required=true)]
        public string Password;

        public string About;

        public User(){}
And I'm done, I can now Find, FindAll, Save, Delete, Count, Sum, Min, Max, Avg, Count, Exists, Not Exists, etc... on those records without writing a line of SQL. All of these methods are inherited from the superclass and work on the table User. The same thing applies to view and stored procedures.

CRUD duplication is fine, because if you're doing it right it's all generated from the schema at build time. It'll have columns sizes in the stored procedures, because those are the sizes of the columns as of last time you updated the schema.

Rule One is "Don't touch the CRUD".

The code you write touches the auto-generated base classes that wrap the auto-generated CRUD. It never need concern itself with the duplication you speak of.

Well, it's all happening now. First step towards that which I noticed is Moneta (Ruby gem), that is giving you some abstraction over key value stores. I don't think that it's a whole lot of performance overhead and it does let you switch when you realize it's time to.

Overall, you may also want to use orm for sakes of interchangeability of your SQL as well, and it certainly gives you advantages.

Im working on a project that involves no orm and uses oracle as a primary data store. Though I don't want to install oracle On my machine but repository idea built-in to project doesn't let me use postgre for development. And yet schema creation involves varchar2 that's simply inaccessible in postgre. Its quite easy to fix but I'd expect to have some kind of a fallback....

In any case, it's a good point that key-value storages are not for every problem, but that's quite obvious. someone said that every time you say "nosql", a kitten dies. so, careful there.

SQL always seemed to me to be the way it is in order to lend itself to optimisation at the database end. It is easy to look at any SQL query (that doesn't use subselects) and see how the DB engine can efficiently execute it by walking down the appropriate indexes.

Try doing that with an ORM or a key/value store.

Can you elaborate the issues an ORM has in dealing with indexes? An ORM can be made aware of indexes via, e.g., annotations or other declarative syntax, and prefer those when joining and filtering, and will probably be more consistent about it than a human developer.

I didn't mean that ORMs have problems with indexes; just with efficiency: because you can tell the SQL engine what you want done instead of having to bring the data to the code (in particular when iteration can be folded into the query).

Yes; but even this only goes so far. Execution plans change based on statistics collected by the query optimizer. SQL was invented because query optimizers do a better job than humans. The same is commonly true for ORMs.

In defense of SQL syntax, I think that, in its very constrained domain, it models table / set operations with the best balance of expressivenes, readability, and generality today. At least compared to any other syntax I have seen, including ORM's. We take our old friends for granted until they are gone, and I have had to work with a number of data environments without SQL and its lack was sorely missed. The SAS community was overjoyed at the addition of PROC SQL for this reason, and the goofiness one has to do in R/ SPLUS is quite unpleasant.

Even in its own domain, SQL sacrificed parsing for "user friendly" English-like appearance.

SQL error messages are very unhelpful in every DB. You can't easily debug misbehaving queries and there are no tools to help. There is even a lack of indentation standard.

Fixing very long queries that fetch the wrong results is something I've been doing daily for the past 10 years of my life. It is not easy, and I believe the language is to blame for a large part of the difficulty. In some aspects of doing unexpected things with no warning - it is worse than C.

Further: SQL is not composable. If I have a fragment of SQL in my hand, there is no reasonable way to insert that into a larger expression. You can further tell this is a problem by the way any nontrivial SQL function degenerates into an enormous hellstorm of SQL that any Haskell programmer will see as full of redundancy. This is because SQL is almost immune to abstraction. (I have to specify the "Haskell programmer" bit because of the characteristics of this sort of language is that working in it directly for a long time causes you to think natively in that language and thereby miss the opportunities for abstraction that you no longer see. FWIW, Erlang also bothers me this way.) Yes, the "bad programmer" issue is in play too, but as a good programmer who has tried to write good stored procedures and has experience in numerous other languages I can not help but notice how much SQL fights me on this front.

SQL is very difficult to debug in general, which combines rather poorly with the previous characteristic. Perhaps there are nonstandard extensions in Oracle that make it easier but there's nothing I know of in the core language.

SQL has fooled people into thinking it is equivalent to the Relational Model, when in fact as kunley points out elsewhere it is merely inspired by it.

SQL dates from the 1970s and it shows. There are syntax quirks around every corner. The behavior of NULL is sensible, but it is merely one of several sensible choices and there are reasons why the SQL choice is not the one that has "won". Why is "value IN ()" a syntax error? It's perfectly reasonable. What's with the syntactic drama around subselects, which ought to be a core part of the better language that we really ought to be using? (A language in which the equivalent is probably going to be so simple we won't even believe people had a special syntax for it.) What's with the rigid specification of the order of GROUP BY and ORDER BY and everything else, when those should simply be instances of some sort of combinator that applies in order, in a sensible manner. Why is SELECT used both for data queries and aggregate queries when in fact the two operations are so different they should have entirely different keywords? Why is it so hard to mix an aggregate and non-aggregate query together when it's obvious to me exactly what I want?

Why can't my columns contain tables themselves, recursively? The fact that there are databases that can do this proves it's not impossible.

SQL needs to be replaced by a combinator-based language that backs to a richer data store that is more sensible, both from a relational model point of view and from a recursive point of view. LINQ points the way, but still is hobbled by LINQ-to-SQL being the flagship LINQ product. (And I don't think it's complete as-is.)

> Further: SQL is not composable. If I have a fragment of SQL in my hand, there is no reasonable way to insert that into a larger expression.

This is really annoying. If I have a stored proc for getting data w/ 3 optional parameters for filtering it, I have to either use dynamic SQL (building the query by string concatenation), or have 8 different versions of the query for each combination of filter present/not present. The dynamic SQL part might not even work with bind parameters, because EXECUTE IMMEDIATE / sp_executesql cannot take a dynamic number of bind parameters (0 to 3 bind parameters, depending on the filter).

I have to wonder why none of the database vendors ever created a syntax-tree API for SQL queries. They have the parser right there, so why can't they do something like

  q := PARSE(SELECT x, y, z FROM sometable);
  q.ADD_CONDITION(PARSE(sometable.value = some_dynamic_value)); 

I completely agree with every word you said.

But why do you want columns that contain tables recursively?

Aggregate and non-aggregate mixture is supported in the standard and Oracle with the analytical expressions (if I got your meaning correctly).

"But why do you want columns that contain tables recursively?"

This is the sort of question you ask because you've become so used to the way SQL does things that you can't even think outside the box. (No offense; this is SQL's fault, not yours.) If I recall correctly, the Relational Model fully permits this. And once you start thinking about it, you can see that there are numerous places where you're basically trying to do things like this, and it would be much easier if you could have tables in rows, but instead you have to put a series of weird hacks in. Anytime you have a table that participates in multiple many-to-many relationships, it would be a lot cleaner if SQL could just say "Look, this ID has no meaning except inasmuch as it identifies another set of rows somewhere else, so how about we just cut to the chase and say that those rows are just here?" The joins could be the default and magic access to the ID could be the thing you have to type extra to get.

"Aggregate and non-aggregate mixture is supported in the standard and Oracle with the analytical expressions (if I got your meaning correctly)."

What I mean was more a syntactic issue than a question of whether it's possible; there's very poor separation between the two, so combining them is even harder.

(And crap, the design for my combinator-based language is starting to coalesce in my head. Damn it, that wasn't supposed to happen, I was being vague on purpose....)

Why not columns that contain tables? It would certainly make things easier. e.g. Get a list of customers with their orders:

SELECT ... FROM customers LEFT JOIN orders ...;

To get this into a presentable table, I would have to keep track of the previous customer's ID and only start a new row if the current customer ID is not equal to the previous customer ID. If table-valued columns were available, each customer row would have a column containing a table of their orders.

I have a problem with designing a data model with the "ease of displaying it in a pretty way" as the main consideration.

Nested tables may well be the bee's knees, but probably not for this reason alone...

I agree with you generally, but you seem to imply query performance tuning hasn't been done properly with the major database vendors, but maybe I misinterpreted you?

I don't see where I talked about performance at all. Performance of an implementation is not entirely unrelated to the specification, but they are usually related in really weird, counter-intuitive ways.

Very good article, I myself have found the "NoSQL" label pointless, when you label something you normally tell what it is, instead of what it's not.

"NoSQL" is generally understood to mean "Not Only SQL", not "SQL is evil, begone foul beast".

I disagree that this is the "general understanding".of the term No SQL.

>"NoSQL" is generally understood to mean "Not Only SQL"

I think that that is a bit of convenient revisionism as it becomes clear that the Emperor is wearing no clothes.

There absolutely was a very antagonistic, "this changes EVERYTHING" philosophy behind the movement called NoSQL, and the name is not accidental. Perhaps it was just a bit of trolling to get attention, but in no way was it a live and let live initiative (data tactics that are now bound in "NoSQL" long existed in computer science, since before the days of SQL. They just didn't feel the need to make it a comfortable community of evangelists). It was a "that was yesterday and this is today".

Of course things have changed. A bit of realism entered the mindset, so now it's more of a "Heh heh...no bad blood right? We're all good, eh?"

Actually the name was very much accidental. Read this post by the guy who coined the term: http://blog.sym-link.com/2009/10/30/nosql_whats_in_a_name.ht...

"it was one of 3 or 4 suggestions that I spouted off in the span of like 45 seconds, without thinking."

>Actually the name was very much accidental.

Spur of the moment isn't really accidental, any more than Mel Gibson accidentally said the N-word. There's a bit of a history and a thought process that leads to such a decision.

That would be NOSQL, not NoSQL

The NoSQL datastores are aimed at solving scaling issues. The current relational databases are a pain to scale. It's not about hating SQL.

> Awkward it may be, but SQL is a lot more succint and readable than multiple lines of API calls or crazy, math-like relational algebra languages.

I'd have to totally agree with this. SQL also achieves higher level of abstraction than any of the above.

Perhaps, as the author said, there is a chance that SQL (or subset of it) will be used as the query language for No SQL-type back-ends.

If you know just a tiny bit of relation algebra, then sql is not hard at all. I love ActiveRecord, since it makes you mix the DSL and SQL. Hibernate is impressive, but sometimes i miss SQL, and get's frustrated with it's HQL.

>It triumphed over other query languages not because it was particularly great (though it was easier to read), but because it was standard.

Isn't that a tautology?

Not all standards are popular. Microsoft's OOXML, for example, is technically an ECMA and ISO standard but I wouldn't call it exactly popular (even though it's widely used due to Office's market position). Because being a standard does not always imply popularity, the statement is not a tautology.


Standard just meant that it worked in lots of different places. It didn't mean that it was the only option, first option, or best option. But, by being standard, it became a lowest common denominator that people largely stuck to.

The links at http://en.wikipedia.org/wiki/Query_language include many of the competitors to SQL. In particular look at Datalog, ERROL, ISBL, and QUEL.

Registration is open for Startup School 2019. Classes start July 22nd.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact