Hacker News
Easy Steps to a Complete Understanding of SQL (tech.pro)
274 points by lukaseder on Sept 3, 2013 | 80 comments

SQL JOIN tables should be used rather than comma-separated tables

Having used SQL since long before the ANSI JOIN syntax was well supported (first Sybase, then MS SQL and then Oracle) I resisted it for a long time out of habit, and also because at first the syntax was buggy when used in Oracle.

But I have come around to being in favor of it. The bugs have been fixed, and the main advantages are that: inner and outer joins are more clearly stated than by using '*=', or '(+)' suffixes on one side of a predicate; and the join criteria are clearly separated from the WHERE clause. It makes it much easier to see how the tables are being joined vs. how the results are being limited.

They stop Cartesian products from accidentally occurring. That alone should force the use of ANSI joins. Cross joining two million-row tables is not a pleasant experience!

Limiting syntax to use only ANSI joins also means that you end up with one and only one table as the root table (or view) to select from. I think this makes code much clearer.
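To make the accidental-Cartesian-product point concrete, here is a minimal sketch using Python's sqlite3 with two hypothetical tables (author and book, names invented for illustration). Forgetting the join predicate in the comma-separated form silently multiplies the row counts, while the ANSI form keeps the predicate next to the table it joins:

```python
import sqlite3

# Hypothetical two-table schema, just to illustrate the point.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE book (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
    INSERT INTO author VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO book VALUES (1, 1, 'A'), (2, 1, 'B'), (3, 2, 'C');
""")

# Comma-separated tables: forgetting the WHERE silently yields
# the full Cartesian product (2 x 3 = 6 rows).
cartesian = con.execute("SELECT * FROM author, book").fetchall()
assert len(cartesian) == 6

# ANSI JOIN: the join condition lives in the ON clause, right next to
# the table being joined, so it is much harder to forget or misplace.
joined = con.execute("""
    SELECT author.name, book.title
    FROM author
    JOIN book ON book.author_id = author.id
""").fetchall()
assert len(joined) == 3
```

With 2 and 3 rows the mistake is obvious; with two million-row tables it is a very expensive typo.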

I tend to write my SQL like Lisp, very compositional with lots of embedded views, in order to be very explicit on how I want tables and views joined. If the optimizer is good, it will know what parts of your statement it can optimize, and if it's not, it usually follows my code, and I don't have to resort to optimizer hints in order for Oracle to get its black magic done.

I think the key is that join condition problems are easier to spot. There are several types of query issues that are classic join-condition issues. If you use a cross join and WHERE, these are harder to spot when troubleshooting. I think this is actually a bigger issue than accidental cross joins. The latter only happens occasionally; the former happens much more frequently, so the wins in terms of debugging time are quite important.

The way I see SQL in my mind is programming Venn Diagrams ( http://en.wikipedia.org/wiki/Venn_diagram ) followed by creating a projection that uses one or more of the areas within that diagram (contrasting strongly with imperative languages, i.e. logic programming). Unfortunately it's not straightforward to do that because SQL is a superset of Venn Diagrams, but that line of thinking is where you need to be.

Yes, exactly. To use SQL well you need to think in terms of sets and set operations. Venn Diagrams are a good idea to help visualize this. If you find yourself writing a lot of loops over cursors, there's a good chance you're doing it wrong.

I thought this was a very good article, unusually so for a blog post about databases. I do take issue, however, with the idea that there's a beauty to SQL. There's a beauty in the conceptual model, but you have to squint sideways to see it through the deeply hostile syntax.

I'm not sure the "Relational Model" is even a great conceptual model for common business or analytical processing. I think SQL is ugly because the model is a poor fit; if the model was solid, a clean syntax would follow naturally. As for alternatives, MDX has superseded it for core "pivot table" analytics. For application processing, it seems modern document/hierarchical databases are good for many transactional needs.

What modern relational databases have... is a very intuitive abstract storage model. That's awesome. We've been working on an alternative "navigational model" (inspired by CODASYL and ORMs) based on this storage model. Our experimental query language is at http://htsql.org

Our critique of SQL is at http://htsql.org/doc/overview.html#why-not-sql

I'm sure you'll agree with Stonebraker's recent (actually, not so recent) claims that the relational model is "all wrong":


htsql seems very interesting. I should soon blog about your product and critique. Have you been publishing elsewhere?

That's not complaining about the relational model at all, he's complaining about implementation details of common RDBMSes being obsolete, as they're based on row-oriented storage, slow disks, small memory, and limited concurrency.

Things like Amazon Redshift are still relational but make radically different decisions on most of those points.

Edit: that talk is well worth reading, btw. I wonder why he completely ignores flash though?

Actually the title is "The Traditional RDBMS Wisdom is All Wrong".

He's arguing against multithreaded systems in favour of a partitioned OLTP platform like H-Store (VoltDB). I didn't see anything about the relational model being "all wrong". It works well for lots of scenarios, and turning it into a KV store is also straightforward.

As far as SQL, I think people agree it's not the best query language. QUEL may have won, but Stonebraker says[1]: "The only reason SQL won in the marketplace was because IBM released DB2 in 1984 without changing the System R query language."

1: http://iggyfernandez.wordpress.com/2011/12/23/nocoug-journal...

Thanks for linking to that talk, it deserves a submission all of its own.

Absolutely. Very interesting talk. Definitely deserves a submission.

BTW, the person asking the last couple of questions is Ed Bugnion, one of the co-founders of VMware. He is now a faculty member at EPFL.

Interesting, I didn't know that!

I actually submitted it, but it didn't get too much traction...

Are you criticising the relational model, or SQL? You seem to be contradicting yourself with your own language, which you claim works well with relational databases (ergo, works well with the relational model).

The relational query model (its operators and set algebra) is a poor fit for common business inquiries. I view SQL as a practical syntax for encoding relational queries; SQL has deviated from the pure model when the model failed to address real-world business problems. I see SQL as a success, in that it works and applies the relational model to real-world concerns.

What is intuitive to users is the idea of related files/tables, and this really doesn't have much to do with the relational algebra. While we have implemented HTSQL using SQL, this is an implementation choice -- our experiment is syntax/semantics, we didn't want to be in the business of data storage and query optimization, and we use (the incredibly awesome) PostgreSQL. HTSQL's logical query model depends upon a table-based abstract storage.

Oh, the relational query model based on relational algebra! I thought you wrote "relational model".

For what it's worth, a join is indeed part of relational algebra - it's a composition of binary relations.

I have found SQL to be cumbersome for expressing temporal relationships, e.g. find all the event As that happen within one week of event B. There's not necessarily a data schema link between the table for event A and the table for event B.

How does HTSQL do with this?

Let's say you wish to list all students in a university, and then, show all other students with the same birth year and month. So, you'd start with ``/student`` and then you could select all columns using the asterisk. Next, you could use the ``fork()`` operation to join to the same table based on a set of columns that are correlated.

/student{, /fork(year(dob),month(dob))}


You could wrap this up in a definition to make it a bit easier to understand:

/student .define(students_same_month := fork(year(dob),month(dob))) .select(, /students_same_month )


With the more general case, let's say you're interested in listing per semester, which students started during that semester. In this case, you start with ``/semester`` and then define the correlated set, ``starting_students`` which uses the ``@`` attachment operator to link the two sets. Then, you'd select columns from semester and the list of correlated starting students.

/semester .define(starting_students:= (student.start_date>=begin_date& student.start_date<=end_date)@student) {, /starting_students }


Typically, you'd include the new link definition, ``starting_students`` in your configuration file so that you don't have to define it again... then the query is quite intuitive:

/semester{, /starting_students }

While this all may seem complex, it's not an easy problem. More importantly, HTSQL has the notion of "navigation". You're not filtering a cross product of semesters and students, instead, you're defining a named linkage from each semester to a set of students. The rows returned in the top level of query are semesters and the rows returned in the nested level are students.

Thank you for the examples.

I made some queries that are analogous to my temporal query needs

Here I'm looking for every student, other students with DOB within 2 weeks of the given student: http://demo.htsql.org/student.define(similar_students:=%20(s...

Here for every semester with at least one student, I find the oldest and the youngest students enrolled: http://demo.htsql.org/semester%20.define(starting_students:=...

Not being a computer scientist, I have to admit I do not appreciate the intricacies of the 'problems with SQL' blog entries. But working with HTSQL, I gotta say it seems a lot more intuitive than SQL. It feels like the logic corresponds much better to my mental model, and there is much less of the jumping up and down the code to nest my SQL logic that I find myself doing all the time.

Is there a way to install this on a PostgreSQL instance on Win8?

HTSQL requires Python 2.7 and psycopg2 binary. As long as those install, I don't see any problems. We have tested it with older versions of Windows without any issues. If you run into a problem, you could report an issue: https://bitbucket.org/prometheus/htsql/issues

I found the relational beauty in being able to "store every unique value just once." It resonated with the programming concept of "write a function so that you don't need to change things in many places." This in turn gave me a lens to viewing how businesses operate independent of what systems you put on top of the data.

Does this mean SQL is pretty? Not really, but it gets the job done. Though I did have to spend a lot of time optimizing SQL to force certain query optimizations.

Does this mean relational should be used everywhere? No, but it has a distinct place. I think beer is beautiful too, but I won't drink it for breakfast with the in-laws.

I disagree. There are issues with SQL (how many things can NULL mean?) but there is also beauty in it. Every language has issues.

You don't see the beauty in SQL until you try to get as much bang for the buck out of specific queries. Then the idea that you can think in terms of math and sets really shines. You can do a lot in SQL which would require a lot more app code to accomplish, and the SQL will be more elegant and easier to maintain.

> There are issues with SQL (how many things can NULL mean?)

Please elaborate, I can only think of one.

The two main meanings of NULL are "Does not exist" and "Exists but is unknown", though there are more.

Also, `we don't know if it exists'.

NULL is a marker that means this data value does not exist in the database.

Your second, and any other 'meanings', are a misuse of NULL, which is your fault, not SQL's.

> NULL is a marker that means this data value does not exist in the database.

Both of the uses I stated are data values that do not exist in the database.

> Your second, and any other 'meanings' are a misuse of NULL

No, they aren't. They are different, and substantively different, reasons why data does not exist in a database. One is that the value does not exist in the domain of which the DB is a model, and one is that the value exists in that domain but is not known.

(Others have pointed out the additional case of "we do not know if it exists", which is ambiguity between those two cases.)

These two substantially different meanings of NULL were recognized fairly early in the history of the relational model by the creator of the model (E.F. Codd), who proposed refining the model by having separate markers for each of those two cases.

NULL can mean a bunch of things:

1. Value is unknown

2. Value does not exist (for example, outer join results).

These have very different implications. For example:

You would think somestring || does_not_exist would equal somestring, but with unknown we don't know.

It doesn't exactly mean that the value does not exist, it's that it's unknown whether it exists. NULL means nothing at all, not the absence of a value (though applications will often abuse it to mean that, and pragmatically that's usually OK).
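A quick way to see that propagation, sketched with Python's built-in sqlite3 (SQLite follows the standard behavior here for string concatenation):

```python
import sqlite3

con = sqlite3.connect(":memory:")

# 'abc' || NULL is NULL, not 'abc': concatenating with an unknown
# value makes the whole result unknown.
(concat,) = con.execute("SELECT 'abc' || NULL").fetchone()
assert concat is None

# COALESCE makes the "treat the missing value as an empty string"
# decision explicit instead of relying on NULL semantics.
(coalesced,) = con.execute("SELECT 'abc' || COALESCE(NULL, '')").fetchone()
assert coalesced == 'abc'
```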

The conceptual model is great, but IMHO a little impractical. Anyone who has ever had to trick the optimiser into choosing a better plan by changing the semi join from WHERE EXISTS to NOT IN (or vice versa) will appreciate what I'm saying :-)

The problem of a query optimizer doesn't go away by not having one.

They were invented for a reason: data changes. Data volume changes and data distribution changes. If your algorithms don't change to suit the inputs, performance will still fall off a cliff. That's a fundamental problem and not unique to SQL.

And this really only matters for somewhat "interesting" queries anyway. It's not like the database will all of a sudden change an "UPDATE foo SET x = x + 1 WHERE id = 13" into a 14-table join.

In practice, you have many queries adapting nicely to changing data, and then a couple where they get some estimates wrong and drive you crazy. But just because you are able to fix the plans in some of those instances doesn't make it a valid comparison. You have to compare it against hand crafting all of the plans (or writing equivalent algorithms) and the results of doing that. I suspect that in most cases, more plans will drive you crazy if you choose the plans rather than letting an optimizer do it.

Remember, the data is going to change once you get in production.

Of course. Sorry, I didn't mean to imply CBOs are bad or unnecessary. They are fundamental and necessary. Tuning queries is of course something that needs to be reviewed as data changes - pruning unnecessary indexes is often as important as adding new ones!

Hopefully 95% or more of queries don't need to be nudged :-)

Lovely conceptual models always founder on the shoals of terrible implementation.

Not necessarily. It's almost impossible for a CBO to get the best query plan for every query under every load under all data distributions. But the relational model is still beautiful, I find its limitations are minor in the grand scheme of things.

"It's almost impossible for a CBO to get the best query plan for every query under every load under all data distributions."

It's also impossible for a human to do it. At least the optimizer will be awake at 3 in the morning and able to respond to data changes; the human will be asleep.

I'm not disagreeing :-)

Although the poor DBA on pager may not agree with you.

That really depends on the database. With MySQL? Yes, you have to apply hundreds of tricks. With Oracle? Chances are, that the CBO knows much better than you how to transform your SQL.

Tell that to all those people still adding rule hints to their queries :-)

Good point. It's been some time though, since I've last been using anything below Oracle 12c ;-)

Another thing to understand is how NULLs are handled in different contexts, using either the familiar two-valued logic or the more exotic three-valued logic. It's kind of messy but really worth knowing if you're working with SQL. The wikipedia page actually gives a pretty good account of the issue: https://en.wikipedia.org/wiki/Null_%28SQL%29
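A minimal sketch of that three-valued behavior, using Python's sqlite3 with a hypothetical one-column table t (invented for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE t (x INTEGER);
    INSERT INTO t VALUES (1), (NULL);
""")

# x = NULL and x <> NULL both evaluate to UNKNOWN for every row,
# and WHERE keeps only rows whose predicate is TRUE, so neither matches.
eq_count = con.execute("SELECT COUNT(*) FROM t WHERE x = NULL").fetchone()[0]
ne_count = con.execute("SELECT COUNT(*) FROM t WHERE x <> NULL").fetchone()[0]
assert eq_count == 0 and ne_count == 0

# IS NULL is the two-valued test that actually finds the missing value.
is_null_count = con.execute("SELECT COUNT(*) FROM t WHERE x IS NULL").fetchone()[0]
assert is_null_count == 1
```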

The most beneficial thing you can do when you start with SQL is to NEVER EVER put join clauses in the WHERE section. ALWAYS put the join clauses as close to the table you're joining as possible.

e.g., never do

    SELECT * FROM A, B WHERE A.ID = B.a_id AND B.id > 10

but always

    SELECT * FROM A JOIN B ON A.ID = B.a_id AND B.id > 10

The second way of doing it "scales" much better when you add more tables and start mixing left and right joins.
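For outer joins, the placement is not just style but changes the result. A sketch with Python's sqlite3 (hypothetical tables a and b, invented for illustration): a right-table filter in the ON clause keeps unmatched left rows, while the same filter in WHERE silently discards them:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE a (id INTEGER PRIMARY KEY);
    CREATE TABLE b (id INTEGER PRIMARY KEY, a_id INTEGER);
    INSERT INTO a VALUES (1), (2);
    INSERT INTO b VALUES (20, 1);
""")

# Filter on b inside the ON clause: rows of a without a match survive,
# with NULLs for b's columns.
on_rows = con.execute("""
    SELECT a.id, b.id FROM a
    LEFT JOIN b ON a.id = b.a_id AND b.id > 10
""").fetchall()
assert len(on_rows) == 2 and (2, None) in on_rows

# Same filter in WHERE: the NULLs produced by the outer join fail
# b.id > 10, silently turning the LEFT JOIN into an inner join.
where_rows = con.execute("""
    SELECT a.id, b.id FROM a
    LEFT JOIN b ON a.id = b.a_id
    WHERE b.id > 10
""").fetchall()
assert where_rows == [(1, 20)]
```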

Great article! I wish all SQL libraries would adhere to this structure instead of forcing devs to write glorified (string) SQL. I.e., let me write

    from(join(table1, table2))
    .where(table1.col1 > 10)
    .project(col1, avg(col2))
and only in that order.

I wrote my own library in Java recently for something similar. It was inspired by LINQ and SQL grammar.

     Select statement = table1.select()
       .join(table2,[join condition])
       .selection(col1, Fn.avg(table1.col2))

     Select s = new Select()
        .join(table2, table2.t1_col1.eq(table1.col1))
        .selection(col1, Fn.avg(table1.col2));

have you seen empire db? http://empire-db.apache.org/ no-one much seems to use it, but it's actually quite neat.

Check out http://www.jooq.org, by the same author as the article (me)

Looks great!

Do you happen to know whether jooq runs with DB2 v10? I see it supports DB2 v9.7.

Have you tried LINQ?

Be careful with point 6. If you use a subquery through an IN clause and it returns a NULL as one of the rows, nothing will return.

That's only true for NOT IN predicates (which is mentioned in the article)

Example: http://sqlfiddle.com/#!4/d41d8/16785

    with a(x) as (
      select 1    from dual union all
      select 2    from dual union all
      select null from dual
    )
    select 1 from dual where 1 in (select x from a)

Doh! Of course. :-)

The reason is that IN uses a disjunction and NOT IN uses a conjunction. When you compare the value to NULL it returns UNKNOWN, which ultimately causes the Boolean expression to be false.

I hope that redeems my reputation a little :-)

Btw, great article!
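The disjunction/conjunction expansion above can be checked without a running Oracle instance; here is a sketch using Python's sqlite3, which implements the same three-valued IN semantics:

```python
import sqlite3

con = sqlite3.connect(":memory:")

# IN expands to a disjunction: 1 = 1 OR 1 = 2 OR 1 = NULL
# is TRUE OR ..., so the NULL in the list is harmless here.
(in_result,) = con.execute("SELECT 1 IN (1, 2, NULL)").fetchone()
assert in_result == 1

# NOT IN expands to a conjunction: 3 <> 1 AND 3 <> 2 AND 3 <> NULL
# is TRUE AND TRUE AND UNKNOWN, i.e. UNKNOWN (surfaced as NULL),
# and a WHERE clause treats anything but TRUE as a non-match.
(not_in_result,) = con.execute("SELECT 3 NOT IN (1, 2, NULL)").fetchone()
assert not_in_result is None
```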

> which ultimately causes the Boolean expression to be false

Strictly speaking, the boolean expression is also UNKNOWN ;-) I've tried to explain this here: http://blog.jooq.org/2012/01/27/sql-incompatibilities-not-in...

... but to this date, I'm still not 100% sure if I correctly understood SQL NULL. A quite worrying thing is documented here: http://blog.jooq.org/2012/12/24/row-value-expressions-and-th...

It's about row value expressions and NULL predicates. The following are not the same!

        (A, B) IS NOT NULL
    NOT((A, B) IS     NULL)
> I hope that redeems my reputation a little :-)

You're forgiven ;-) Thanks!

Darn it, right again. Sigh.

Those expressions can be worked out via first order predicate logic. Damned tricky, I can definitely see how that would trip up pretty much anyone at first glance. If you weren't aware of logical equivalences then you might be scratching your head for some time!

Another great article, btw.

I was just about to say that! Truthfully, this is a good introduction but definitely not everything you need to know about SQL (obviously). Handling NULLs is just one of many sticklers that people using SQL frequently screw up.

Fantastic article. I clicked on it, hoping to find a way to help some of my less SQL-savvy co-workers to understand SQL a little better. I instead came away with a better understanding of my own.

Stuff in here they never taught me all those years ago either! And a smart geek is never too proud for a refresher.

Where do you go to keep up with things like this?

I was taught at a reputable university to put joins in the where clause. Now it seems that is frowned upon, and I only find out 13 years later.

I like SQL. The relational model is often underestimated. All of the complex structure that goes beyond individual rows is orthogonal and is itself first-class data. Contrast with document databases where hierarchies are part of the low-level storage mechanism. By making the structure first-class data, the relational model keeps you from painting yourself into a corner.

> If a has 3 columns and b has 5 columns, then the "output table" will have 8 (3 + 5) columns. The records contained in this combined table reference are those of the cross product / cartesian product of a x b.

This is where the easy steps become hard for real noobs.

A graphical example of a Cartesian join - http://web.utk.edu/sas/OnlineTutor/1.2/en/60477/m71/m71_3.ht....

Here are some other graphical illustrations of the operators from relational algebra - http://bandilab.org/bandicoot-algebra.pdf

It focuses on the Bandicoot language, but the concepts are taken from relational algebra. It should match the logic of SQL.

I think the author has the order wrong. Consider:

    SELECT WEEKDAY(created_at) wkday, COUNT(1) 
    FROM orders
    GROUP BY wkday
The above query works fine, thus GROUP BY, ORDER BY, and HAVING are all aware of SELECT.

Those are expressions. They only work once you have picked the table(s) - FROM - and determined the rows you want returned - WHERE. The query then groups and orders the results and calculates the scalar value of COUNT(1) which it repeats on every row, then it determines the weekday for every row for the column created_at and uses that value for output.

My point is that a better explanation of the execution ordering is:


The author has SELECT after the GROUP BY and HAVING, which doesn't make sense because GROUP BY is aware of expressions in the SELECT statement. Grouping and filtering happens after selection.

(I had to correct myself)

SELECT is evaluated after GROUP BY, at least in most databases that implement SQL correctly. SQL Server's documentation nicely explains this:


Skip to "Logical Processing Order of the SELECT statement".

MySQL, PostgreSQL, and SQLite behave differently, though!

Which RDBMS? This would not work on Oracle. You would need "group by WEEKDAY(created_at)" in this example.

MySQL and Postgres. You can use the source tables, anything in the select statement, or an ordinal position in the select statement. In the example "GROUP BY 1", "GROUP BY wkday", and "GROUP BY WEEKDAY(created_at)" are all the same.
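SQLite is another engine where the alias resolves; a sketch via Python's sqlite3 (table and dates invented for illustration, with strftime('%w') standing in for WEEKDAY):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE orders (created_at TEXT);
    INSERT INTO orders VALUES ('2012-01-01'), ('2012-01-02'), ('2012-01-08');
""")

# SQLite resolves the SELECT alias wkday inside GROUP BY.
rows = con.execute("""
    SELECT strftime('%w', created_at) AS wkday, COUNT(1)
    FROM orders
    GROUP BY wkday
""").fetchall()

# 2012-01-01 and 2012-01-08 were Sundays ('0'); 2012-01-02 a Monday ('1').
assert sorted(rows) == [('0', 2), ('1', 1)]
```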

Aagh, it works indeed in PostgreSQL: http://sqlfiddle.com/#!12/d41d8/1585

    WITH orders(created_at) AS (
      SELECT date '2012-01-01' UNION ALL
      SELECT date '2012-01-02' UNION ALL
      SELECT date '2012-01-03' UNION ALL
      SELECT date '2012-01-04' UNION ALL
      SELECT date '2012-01-05' UNION ALL
      SELECT date '2012-01-06' UNION ALL
      SELECT date '2012-01-07' UNION ALL
      SELECT date '2012-01-08'
    )
    SELECT EXTRACT(DOW FROM created_at) wkday, COUNT(1) 
    FROM orders
    GROUP BY wkday
That's really disturbing. I fixed the linked article accordingly.

You fixed the article incorrectly. SELECT aliases are never available in a WHERE clause.

Edit: it seems more complicated than I initially assumed. MySQL doesn't allow it. SQLite does.

You're right, the new wording was ambiguous. A better wording is this:

Rule number 2, for instance, does not apply exactly in the above way to MySQL, PostgreSQL, and SQLite.

True, the explanation was maybe a bit simplistic. Some expressions are allowed to appear even if they're not declared in the GROUP BY clause...

Only scalar expressions. You still need to retrieve the rows.

Not a bad article but it needs a step 11:

Know thy database engine for if you do not, your local DBA will be mightily pissed off when you do a cross join across 15 tables and watch his/her IOPS go through the roof...

Well, to beginners it's always good to express things in a positive way. The "DBA rules" are part of the second-year SQL experience tutorial :-)

This is why you have a test instance for developers (or more likely now, every developer runs his own engine locally or in a VM).

That's irrelevant as it usually only goes bang in production under production load and data volumes.

How many platforms do not permit setting limits on processing of queries?

Thanks for this article, very beneficial. I'm learning about databases at the moment, so this definitely comes in handy while learning SQL.

