SQL JOIN tables should be used rather than comma-separated tables
Having used SQL since long before the ANSI JOIN syntax was well supported (first Sybase, then MS SQL and then Oracle) I resisted it for a long time out of habit, and also because at first the syntax was buggy when used in Oracle.
But I have come around to being in favor of it. The bugs have been fixed, and the main advantages are that inner and outer joins are stated more clearly than with '*=' or '(+)' suffixes on one side of a predicate, and that the join criteria are clearly separated from the WHERE clause. It makes it much easier to see how the tables are being joined vs. how the results are being limited.
They stop Cartesian products from accidentally occurring. That alone should force the use of ANSI joins. Cross-joining two million-row tables is not a pleasant experience!
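A minimal sketch of the difference, using SQLite from Python (the emp/dept tables are made up for illustration). With comma-style syntax, forgetting the WHERE silently produces a Cartesian product; with ANSI syntax, the ON clause is a required, visible part of the join:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE dept (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE emp (id INTEGER PRIMARY KEY, dept_id INTEGER, name TEXT);
INSERT INTO dept VALUES (1, 'eng'), (2, 'sales');
INSERT INTO emp VALUES (1, 1, 'ada'), (2, 1, 'brian'), (3, 2, 'carol');
""")

# Comma style: the join predicate was forgotten, and the query still runs.
cartesian = cur.execute("SELECT * FROM emp, dept").fetchall()
print(len(cartesian))  # 6 rows: every emp paired with every dept

# ANSI style: the join condition sits right next to the table it joins.
joined = cur.execute(
    "SELECT emp.name, dept.name FROM emp JOIN dept ON emp.dept_id = dept.id"
).fetchall()
print(len(joined))  # 3 rows, one per employee
```

At two million rows each, the first query returns four trillion rows instead of failing fast.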
Limiting syntax to use only ANSI joins also means that you end up with one and only one table as the root table (or view) to select from. I think this makes code much clearer.
I tend to write my SQL like Lisp, very compositional with lots of embedded views, in order to be very explicit about how I want tables and views joined. If the optimizer is good, it will know what parts of your statement it can optimize, and if it's not, it usually follows my code, and I don't have to resort to optimizer hints in order for Oracle to get its black magic done.
I think the key is that join condition problems are easier to spot. There are several types of query issues which are classic join-condition issues. If you use a cross join and WHERE, these are harder to spot when troubleshooting. I think this is actually a bigger issue than accidental cross joins. The latter happens only occasionally; the former happens much more frequently, so the wins there in terms of debugging time are quite important.
The way I see SQL in my mind is programming Venn Diagrams ( http://en.wikipedia.org/wiki/Venn_diagram ) followed by creating a projection that uses one or more of the areas within that diagram (contrasting strongly with imperative languages; it's closer to logic programming). Unfortunately it's not straightforward to do that because SQL is a superset of Venn Diagrams, but that line of thinking is where you need to be.
Yes, exactly. To use SQL well you need to think in terms of sets and set operations. Venn Diagrams are a good idea to help visualize this. If you find yourself writing a lot of loops over cursors, there's a good chance you're doing it wrong.
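A small illustration of the two mindsets, using SQLite from Python (the orders table and the 10% markup are invented for the example). The row-by-row loop and the single set-based UPDATE compute the same result, but the latter expresses it as one operation over the whole set:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
cur.executemany("INSERT INTO orders VALUES (?, ?)",
                [(1, 10.0), (2, 20.0), (3, 30.0)])

# Cursor-style thinking: fetch every row, transform each one in a loop.
row_by_row = [(oid, round(total * 1.1, 2))
              for oid, total in cur.execute(
                  "SELECT id, total FROM orders ORDER BY id").fetchall()]

# Set-style thinking: the same transformation as one statement.
cur.execute("UPDATE orders SET total = total * 1.1")
set_based = [(oid, round(total, 2))
             for oid, total in cur.execute(
                 "SELECT id, total FROM orders ORDER BY id").fetchall()]

print(row_by_row == set_based)  # True: same result, one statement vs. a loop
```

The set-based form also gives the engine the whole operation at once, so it can plan it rather than being driven a row at a time.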
I thought this was a very good article, unusually so for a blog post about databases. I do take issue, however, with the idea that there's a beauty to SQL. There's a beauty in the conceptual model, but you have to squint sideways to see it through the deeply hostile syntax.
I'm not sure the "Relational Model" is even a great conceptual model for common business or analytical processing. I think SQL is ugly because the model is a poor fit; if the model was solid, a clean syntax would follow naturally. As for alternatives, MDX has superseded it for core "pivot table" analytics. For application processing, it seems modern document/hierarchical databases are good for many transactional needs.
What modern relational databases have... is a very intuitive abstract storage model. That's awesome. We've been working on an alternative "navigational model" (inspired from CODASYL and ORMs) based on this storage model. Our experimental query language is at http://htsql.org
That's not complaining about the relational model at all, he's complaining about implementation details of common RDBMSes being obsolete, as they're based on row-oriented storage, slow disks, small memory, and limited concurrency.
Things like Amazon Redshift are still relational but make radically different decisions on most of those points.
Edit: that talk is well worth reading, btw. I wonder why he completely ignores flash though?
Actually the title is "The Traditional RDBMS Wisdom is All Wrong".
He's arguing against multithreaded systems in favour of a partitioned OLTP platform like H-Store (VoltDB). I didn't see anything about the relational model being "all wrong". It works well for lots of scenarios, and turning it into a KV store is also straightforward.
As far as SQL, I think people agree it's not the best query language. QUEL may have won, but Stonebraker says[1]: "The only reason SQL won in the marketplace was because IBM released DB2 in 1984 without changing the System R query language."
Are you criticising the relational model, or SQL? You seem to be contradicting yourself with your own language, which you claim works well with relational databases (ergo, works well with the relational model).
The relational query model (its operators and set algebra) is a poor fit for common business inquiries. I view SQL as a practical syntax for encoding relational queries; SQL has deviated from the pure model when the model failed to address real-world business problems. I see SQL as a success, in that it works and applies the relational model to real-world concerns.
What is intuitive to users is the idea of related files/tables, and this really doesn't have much to do with the relational algebra. While we have implemented HTSQL using SQL, this is an implementation choice -- our experiment is syntax/semantics, we didn't want to be in the business of data storage and query optimization, and we use (the incredibly awesome) PostgreSQL. HTSQL's logical query model depends upon a tabular abstract storage model.
I have found SQL to be cumbersome for expressing temporal relationships, e.g. find all the event As that happen within one week of event B. There's not necessarily a data schema link between the table for event A and the table for event B.
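For what it's worth, SQL can express this as a non-equi join: the ON clause is just a predicate, so it can be a date-range condition rather than a key match. A sketch using SQLite from Python (the event tables and dates are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE event_a (id INTEGER, at TEXT);
CREATE TABLE event_b (id INTEGER, at TEXT);
INSERT INTO event_a VALUES (1, '2012-01-03'), (2, '2012-01-20');
INSERT INTO event_b VALUES (1, '2012-01-01');
""")

# No schema link between the tables: the join condition is purely temporal.
rows = cur.execute("""
    SELECT a.id, b.id
    FROM event_a a
    JOIN event_b b
      ON ABS(julianday(a.at) - julianday(b.at)) <= 7
""").fetchall()
print(rows)  # [(1, 1)]: only event_a #1 is within a week of event_b #1
```

It works, though the commenter's point stands: range joins like this are awkward to write, and most optimizers handle them far worse than equi-joins.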
Let's say you wish to list all students in a university, and then, show all other students with the same birth year and month. So, you'd start with ``/student`` and then you could select all columns using the asterisk. Next, you could use the ``fork()`` operation to join to the same table based on a set of columns that are correlated.
With the more general case, let's say you're interested in listing per semester, which students started during that semester. In this case, you start with ``/semester`` and then define the correlated set, ``starting_students`` which uses the ``@`` attachment operator to link the two sets. Then, you'd select columns from semester and the list of correlated starting students.
Typically, you'd include the new link definition, ``starting_students`` in your configuration file so that you don't have to define it again... then the query is quite intuitive:
/semester{, /starting_students }
While this all may seem complex, it's not an easy problem. More importantly, HTSQL has the notion of "navigation". You're not filtering a cross product of semesters and students, instead, you're defining a named linkage from each semester to a set of students. The rows returned in the top level of query are semesters and the rows returned in the nested level are students.
Not being a computer scientist, I have to admit I do not appreciate the intricacies of the 'problems with SQL' blog entries. But working with HTSQL, I gotta say it seems a lot more intuitive than SQL. It feels like the logic corresponds much better to my mental model, and there is much less of the jumping up and down the code to nest my SQL logic that I find myself doing all the time.
Is there a way to install this on a PostgreSQL instance on Win8?
HTSQL requires Python 2.7 and psycopg2 binary. As long as those install, I don't see any problems. We have tested it with older versions of Windows without any issues. If you run into a problem, you could report an issue: https://bitbucket.org/prometheus/htsql/issues
I found the relational beauty in being able to "store every unique value just once." It resonated with the programming concept of "write a function so that you don't need to change things in many places." This in turn gave me a lens to viewing how businesses operate independent of what systems you put on top of the data.
Does this mean SQL is pretty? Not really, but it gets the job done. Though I did have to spend a lot of time optimizing SQL to force certain query optimizations.
Does this mean relational should be used everywhere? No, but it has a distinct place. I think beer is beautiful too, but I won't drink it for breakfast with the in-laws.
I disagree. There are issues with SQL (how many things can NULL mean?) but there is also beauty in it. Every language has issues.
You don't see the beauty in SQL until you try to get as much bang for the buck out of specific queries. Then the idea that you can think in terms of math and sets really shines. You can do a lot in SQL which would require a lot more app code to accomplish, and the SQL will be more elegant and easier to maintain.
> NULL is a marker that means this data value does not exist in the database.
Both of the uses I stated are data values that do not exist in the database.
> Your second, and any other 'meanings' are a misuse of NULL
No, they aren't. They are different, and substantively different, reasons why data does not exist in a database. One is the value does not exist in the domain of which the DB is model, and one is that the value exists in that domain but it is not known.
(Others have pointed out the additional case of "we do not know if it exists", which is ambiguity between those two cases.)
These two substantially different meanings of NULL were recognized fairly early in the history of the relational model by the creator of the model (E.F. Codd), who proposed refining the model by having separate markers for each of those two cases.
It doesn't exactly mean that the value does not exist, it's that it's unknown whether it exists. NULL means nothing at all, not the absence of a value (though applications will often abuse it to mean that, and pragmatically that's usually OK).
The conceptual model is great, but IMHO a little impractical. Anyone who has ever had to trick the optimiser into choosing a better plan by changing the semi join from WHERE EXISTS to NOT IN (or vice versa) will appreciate what I'm saying :-)
The problem of a query optimizer doesn't go away by not having one.
They were invented for a reason: data changes. Data volume changes and data distribution changes. If your algorithms don't change to suit the inputs, performance will still fall off a cliff. That's a fundamental problem and not unique to SQL.
And this really only matters for somewhat "interesting" queries anyway. It's not like the database will all of a sudden change an "UPDATE foo SET x = x + 1 WHERE id = 13" into a 14-table join.
In practice, you have many queries adapting nicely to changing data, and then a couple where they get some estimates wrong and drive you crazy. But just because you are able to fix the plans in some of those instances doesn't make it a valid comparison. You have to compare it against hand crafting all of the plans (or writing equivalent algorithms) and the results of doing that. I suspect that in most cases, more plans will drive you crazy if you choose the plans rather than letting an optimizer do it.
Remember, the data is going to change once you get in production.
Of course. Sorry, I didn't mean to imply CBOs are bad or unnecessary. They are fundamental and necessary. Tuning queries is of course something that needs to be reviewed as data changes - pruning unnecessary indexes is often as important as adding new ones!
Hopefully 95% or more of queries don't need to be nudged :-)
Not necessarily. It's almost impossible for a CBO to get the best query plan for every query under every load under all data distributions. But the relational model is still beautiful, I find its limitations are minor in the grand scheme of things.
"It's almost impossible for a CBO to get the best query plan for every query under every load under all data distributions."
It's also impossible for a human to do it. At least the optimizer will be awake at 3 in the morning and able to respond to data changes; the human will be asleep.
That really depends on the database. With MySQL? Yes, you have to apply hundreds of tricks. With Oracle? Chances are that the CBO knows much better than you how to transform your SQL.
Another thing to understand is how NULLs are handled in different contexts, using either the familiar two-valued logic or the more exotic three-valued logic. It's kind of messy but really worth knowing if you're working with SQL. The wikipedia page actually gives a pretty good account of the issue: https://en.wikipedia.org/wiki/Null_%28SQL%29
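A quick demonstration of the three-valued logic in practice, using SQLite from Python (the one-column table is invented for the example). Any comparison involving NULL yields UNKNOWN (rendered as NULL/None), and a WHERE clause only keeps rows where the predicate is TRUE:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Ordinary comparisons with NULL are UNKNOWN, not TRUE or FALSE.
eq_null = cur.execute("SELECT NULL = NULL").fetchone()[0]
print(eq_null)  # None (UNKNOWN), not 1

# IS NULL is the two-valued test designed for this job.
is_null = cur.execute("SELECT NULL IS NULL").fetchone()[0]
print(is_null)  # 1 (TRUE)

# WHERE drops UNKNOWN rows, which is why x = NULL never matches anything.
cur.execute("CREATE TABLE t (x INTEGER)")
cur.executemany("INSERT INTO t VALUES (?)", [(1,), (None,)])
eq_count = cur.execute("SELECT COUNT(*) FROM t WHERE x = NULL").fetchone()[0]
is_count = cur.execute("SELECT COUNT(*) FROM t WHERE x IS NULL").fetchone()[0]
print(eq_count, is_count)  # 0 1
```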
The most beneficial thing you can do when you start with SQL is to NEVER EVER put join clauses in the WHERE section. ALWAYS put the join clauses as close to the table you're joining as possible.
e.g., never do:

    SELECT * FROM A, B WHERE A.ID = B.a_id AND B.id > 10

but always:

    SELECT * FROM A JOIN B ON A.ID = B.a_id AND B.id > 10

The second way of doing it scales much better when you add more tables and start mixing left and right joins.
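The difference really bites with outer joins, where ON and WHERE are not interchangeable. A sketch using SQLite from Python (the A/B tables are invented): a filter in the ON clause preserves the unmatched left rows, while the same filter in WHERE runs after the join and silently turns the LEFT JOIN into an inner join:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE a (id INTEGER);
CREATE TABLE b (a_id INTEGER, id INTEGER);
INSERT INTO a VALUES (1), (2);
INSERT INTO b VALUES (1, 5), (1, 20);
""")

# Filter in ON: rows of 'a' with no surviving match are kept, NULL-extended.
on_rows = cur.execute("""
    SELECT a.id, b.id FROM a
    LEFT JOIN b ON a.id = b.a_id AND b.id > 10
    ORDER BY a.id
""").fetchall()
print(on_rows)  # [(1, 20), (2, None)]

# Same filter in WHERE: the NULL-extended row fails b.id > 10 and is dropped.
where_rows = cur.execute("""
    SELECT a.id, b.id FROM a
    LEFT JOIN b ON a.id = b.a_id
    WHERE b.id > 10
""").fetchall()
print(where_rows)  # [(1, 20)] -- the LEFT JOIN has become an inner join
```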
The reason is that IN uses a disjunction and NOT IN uses a conjunction. When you compare a value to NULL it returns UNKNOWN, which ultimately keeps the Boolean expression from ever being true.
Those expressions can be worked out via first order predicate logic. Damned tricky, I can definitely see how that would trip up pretty much anyone at first glance. If you weren't aware of logical equivalences then you might be scratching your head for some time!
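Here's the classic trap worked through with SQLite from Python (the t/s tables are invented). `x NOT IN (2, NULL)` expands to the conjunction `x <> 2 AND x <> NULL`; the second conjunct is always UNKNOWN, so the whole predicate can never be TRUE:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE t (x INTEGER);
CREATE TABLE s (y INTEGER);
INSERT INTO t VALUES (1), (2);
INSERT INTO s VALUES (2), (NULL);
""")

# NOT IN against a set containing NULL: the conjunction is never TRUE.
not_in = cur.execute(
    "SELECT x FROM t WHERE x NOT IN (SELECT y FROM s)").fetchall()
print(not_in)  # [] -- empty, even though 1 is 'obviously' not in s

# NOT EXISTS sidesteps the three-valued comparison and behaves as expected.
not_exists = cur.execute("""
    SELECT x FROM t WHERE NOT EXISTS (SELECT 1 FROM s WHERE s.y = t.x)
""").fetchall()
print(not_exists)  # [(1,)]
```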
I was just about to say that! Truthfully, this is a good introduction but is definitely not everything you need to know about SQL. (Obviously.) Handling NULLs is just one of many sticking points that people using SQL frequently screw up.
Fantastic article. I clicked on it, hoping to find a way to help some of my less SQL-savvy co-workers to understand SQL a little better. I instead came away with a better understanding of my own.
I like SQL. The relational model is often underestimated. All of the complex structure that goes beyond individual rows is orthogonal and is itself first-class data. Contrast with document databases where hierarchies are part of the low-level storage mechanism. By making the structure first-class data, the relational model keeps you from painting yourself into a corner.
> If a has 3 columns and b has 5 columns, then the "output table" will have 8 (3 + 5) columns. The records contained in this combined table reference are those of the cross product / cartesian product of a x b.
This is where the easy steps become hard for real noobs.
Those are expressions. They only work once you have picked the table(s) - FROM - and determined the rows you want returned - WHERE. The query then groups and orders the results and calculates the scalar value of COUNT(1) which it repeats on every row, then it determines the weekday for every row for the column created_at and uses that value for output.
My point is that a better explanation of the execution ordering is:
FROM, JOIN, WHERE, SELECT, GROUP BY, HAVING, DISTINCT, UNION, ORDER BY, LIMIT.
The author has SELECT after GROUP BY and HAVING, which doesn't make sense because GROUP BY is aware of expressions in the SELECT statement. Grouping and filtering happen after selection.
MySQL and Postgres. You can use the source tables, anything in the select statement, or an ordinal position in the select statement. In the example "GROUP BY 1", "GROUP BY wkday", and "GROUP BY WEEKDAY(created_at)" are all the same.
    WITH orders(created_at) AS (
      SELECT date '2012-01-01' UNION ALL
      SELECT date '2012-01-02' UNION ALL
      SELECT date '2012-01-03' UNION ALL
      SELECT date '2012-01-04' UNION ALL
      SELECT date '2012-01-05' UNION ALL
      SELECT date '2012-01-06' UNION ALL
      SELECT date '2012-01-07' UNION ALL
      SELECT date '2012-01-08'
    )
    SELECT EXTRACT(DOW FROM created_at) wkday, COUNT(1)
    FROM orders
    GROUP BY wkday
That's really disturbing. I fixed the linked article accordingly.
Know thy database engine for if you do not, your local DBA will be mightily pissed off when you do a cross join across 15 tables and watch his/her IOPS go through the roof...