
A beginners guide to thinking in SQL - craigkerstiens
http://www.sohamkamani.com/blog/2016/07/07/a-beginners-guide-to-sql/
======
hadley
I think this would really benefit from some thoughtful indenting to illustrate
the structure. For example, instead of:

    
    
      SELECT
      members.firstname AS "First Name",
      members.lastname AS "Last Name"
      FROM borrowings
      JOIN books ON borrowings.bookid=books.bookid
      JOIN members ON members.memberid=borrowings.memberid
      WHERE books.author='Dan Brown';
    

Do:

    
    
      SELECT
        members.firstname AS "First Name",
        members.lastname AS "Last Name"
      FROM borrowings
        JOIN books ON borrowings.bookid = books.bookid
        JOIN members ON members.memberid = borrowings.memberid
      WHERE books.author = 'Dan Brown';
    

This helps reinforce that there is one FROM statement that combines three
tables using joins

~~~
Semiapies
Some time ago, I settled on a fairly similar idiosyncratic indentation style
for SQL:

    
    
      select
        members.firstname as "First Name",
        members.lastname as "Last Name"
      from 
        borrowings
        join 
        books on borrowings.bookid = books.bookid
        join 
        members on members.memberid = borrowings.memberid
      where 
        books.author = 'Dan Brown';
    

Normally, I'm a fairly slavish follower of style standards, but almost every
bit of SQL I come across in my job is an unreadable mess, even when whoever
wrote it did have the vague idea that consistent formatting was good.

~~~
sbuttgereit
Sometimes with complex queries, it can be tough to adhere to a simple set of
style rules while also making clear where things like logical units
start/stop. Don't get me wrong I'm not disagreeing with you in principle; but
acknowledging that the declarative nature of the language cause you to break
style for the sake of readability in some cases. I'm not going to post an
example of such a case (I'd have to go digging and I'm too tired).

On a different note, I don't like a lot of the styles I do see. I like yours,
with the exception of the joins. My convention is a bit different:

    
    
      SELECT
          members.firstname as "First Name"
         ,members.lastname as "Last Name"
      FROM borrowings
        JOIN books 
            ON borrowings.bookid = books.bookid
        JOIN members 
            ON members.memberid = borrowings.memberid
      WHERE books.author = 'Dan Brown';
    

Some of that is old habit (like the comma first thing which I know is out of
favor with many) which I found useful when working with long column lists. I'd
also typically and systematically alias the tables, but that wasn't done in
the examples. Outside of tastes, the only real downside I run into is that
it's easy to get too far to the right with the indentation style I use. But
it's not that often that I get there.

~~~
Semiapies
I haven't had trouble with complicated queries with this style. I've been
doing it for ages, though.

There nothing wrong with comma-first, it's sensible; I just never got into the
habit. My joins are so I can instantly glance over the tables involved in a
query.

(And, oh yes, I, too, normally alias the crap out of tables if there's more
than one in a query.)

------
jakub_h
I have to wonder why a relation algebra language isn't actually algebraic.
What about projections in the form of something like c := a x b etc? Nooo,
they had to make it look like COBOL.

~~~
SapphireSun
In databases class, they used Pi for projection, bowtie for natural join, and
sigma for row filtration. They added conditions on the operators with
subscripts. SQL is just the standardized version of that.

~~~
jakub_h
"Standardized"...and without the legibility of symbolic notation and the
option to use intermediate named relvars?

~~~
zenexer
It's designed to be "natural", meaning that technical symbols were largely
avoided. Someone who knows little about formal notation but a decent amount
about databases can usually figure out what's going on. It's also easier to
type, seeing how most of the relevant symbols aren't on your average keyboard.

------
filereaper
What's the canonical reference book one should go to regarding SQL? (ie K&R is
the book you go to for C)

I understand there are differences between DB vendors, but my question is
about general patterns and best practices.

~~~
crdb
Chris Date's Introduction to SQL [1], or Edgar Codd's last book [2].

Actually both are worth reading to discuss differences in philosophy (e.g. "no
NULLs" vs 4-valued logic including "unknown" and "not applicable" NULLs).

[1] [https://www.amazon.com/Introduction-SQL-Mastering-
Relational...](https://www.amazon.com/Introduction-SQL-Mastering-Relational-
Database/dp/0321305965)

[2] [https://www.amazon.com/Relational-Model-Database-
Management-...](https://www.amazon.com/Relational-Model-Database-Management-
Version/dp/0201141922/ref=sr_1_1)

~~~
qilo
First book [1] is not by C. J. Date. So, is this the book you recommend and
just mixed up the author's name, or... ?

~~~
crdb
Good catch. Too late to edit the comment.

Yes, I meant C.J. Date's "Introduction to SQL".

I was trying to get a link to the latest edition which appeared to be the 2nd
link on an Amazon search for "Date introduction to SQL"... anyway, it is
continuously edited so it's worth seeking the latest version which has twice
as many pages as some of the earlier ones.

~~~
qilo
The thing is, and I probably should have pointed it out in the first comment,
sorry about that: I can't find such a book by Cris Date. My guess is you meant
"Introduction to Database Systems".

~~~
crdb
Yes, you are right. Database Systems.

------
Animats
Could you write their example

    
    
        SELECT members.firstname || ' ' || members.lastname
        AS "Full Name"
        FROM borrowings
        INNER JOIN members
        ON members.memberid=borrowings.memberid
        INNER JOIN books
        ON books.bookid=borrowings.bookid
        WHERE borrowings.bookid IN (SELECT bookid
           FROM books
           WHERE stock>(SELECT avg(stock)
           FROM books))
        GROUP BY members.firstname, members.lastname;
    

more simply as

    
    
        SELECT @avgstock = avg(stock) from books; 
    
        SELECT members.firstname || ' ' || members.lastname
        AS "Full Name"
        FROM borrowings, members, books
        WHERE members.memberid = borrowings.memberid
          AND books.bookid = borrowings.bookid 
          AND borrowings.bookid IN 
            (SELECT bookid
             FROM books
             WHERE stock > @avgstock))
        GROUP BY members.firstname, members.lastname;
    

Are those explicit INNER JOIN clauses necessary? By now, most SQL query
optimizers can figure out on their own how to join tables.

~~~
mattmanser
My impression was that few people use the comma syntax these days, it's old
fashioned, harder to read, harder to change, harder for the query plan
optimizer, etc.

I know this is an appeal to authority, but the consensus is pretty clear if
you search SO and programmer stackexchange:

[http://stackoverflow.com/questions/1018822/inner-join-on-
vs-...](http://stackoverflow.com/questions/1018822/inner-join-on-vs-where-
clause)
[http://programmers.stackexchange.com/questions/78225/using-j...](http://programmers.stackexchange.com/questions/78225/using-
join-keyword-or-not)

You're also introducing a lot more concepts by adding variables, and that code
wouldn't work on SQL Server, for example, as you need to declare the variable
first. Then again the `||` doesn't work on SQL Server either.

~~~
hobs
As an active DBA I would tell you that most of the time I would expect the
comma join syntax to come from someone with grey hair, its an older format
that once was standard.

I would also say that if you are suggesting style, you might ommit that INNER
from the JOIN (if you are targeting sql server its redundant.)

~~~
bigger_cheese
I tend to mix and match the two. The comma join looks way more natural to me.
One thing I do habitually which I don't see other people do much is use
aliases.

I learnt SQL through SAS which is where the excessive aliases came from and it
kind of stuck with me.

~~~
lucio
The "comma join" disconnects the join from the condition. It is hard to find
the actual join condition between all the "where" conditions. By using JOIN
... ON .., you explicitly link tables with their join condition and also
separate join conditions from other filter conditions on the "where".

~~~
DanielBMarkham
_The "comma join" disconnects the join from the condition._

Yes, because it's more important to first know _what 's_ being joined, then
traverse that list looking for _how_ it's being joined. You get the full name
in the conditionals, so it's not like it's a mystery what they attach to.

But I have grey hair, so there you go.... :)

Comma syntax makes much more sense to me. But now I know that this is not The
Way Things Are Done. I also learned from the comments that comma-first isn't
recommended. So now my code will be more legible. It's all good.

~~~
lucio
I'm 45. I'm so old that I'm proficient using (+) to code outer joins in oracle
:) Comma-join was the standard for me, and I had a hate-phase towards
LEFT/RIGHT JOIN. But I must admin that once I've really tried the "new" join-
on syntax it has great advantages that outweighs the pain of change.

~~~
DanielBMarkham
Thanks. Good to hear!

I'm 51, and I'm a "full stack" programmer, which used to just be a regular
programmer before everybody started overspecializing. When I learned
programming, you met a guy, he told you what he wanted, and you made the
computer do that. You didn't say "But I'm just the Java guy"

It wasn't that you were an expert in a dozen things, you just figured out
whatever the heck you had to do.

I've kept with this philosophy and it's served me well.

So I dive in and out of SQL. I picked up my habits on SQL intensive projects
in the 80s and 90s.

It'll be difficult for me to switch until I get into a spot where I'm using it
heavily for a week or two. Hopefully I'll remember this thread then and give
the new way a shot. Sounds freaky, but hell, I wasn't crazy about pointer
arithmetic when I first heard about it either. :)

------
optforfon
This is a very naiive and simple question:

I really love the idea of SQL and am looking for a project to use it. It seems
like a great way to encapsulate most of the state of your program so you can
make your code more decoupled. For crunching logs or other data-dumps it seems
like a no brainier, but for everything else I don't really have a a sense for
the performance characteristics. Like would it make sense to keep objects'
data in an SQL table (like a pImpl) ? is going over the rows of SQL table
instead of std::vectors a different order of magnitude performance wise (I
assume with something like SQL-Lite I can keep the tables in memory and not on
disk).

I guess when is it appropriate to use a database and when is it not?

EDIT:

For context - I'm working almost exclusively in C++ and am usually pretty
performance sensitive. In my last job we did radar simulations with lot of
entities (radar platforms, aircraft, ballistic missiles etc.) that just lived
in STD containers. Decoupling state from the simulation was always a bit of a
struggle. Would keeping something like that in an SQL database make sense?

Another example. At the moment I'm working on an art project and I need to
keep track of vectors and polygons which will be later used by a ray tracer.
I'd also like to keep the state decoupled from my program. Would a SQL
database be a bottleneck?

~~~
hobs
It really depends on your performance characteristics as you mentioned, you
can certainly make a database fast as hell but most of the time I would
recommend SQL for when you need ACID characteristics(Atomicity, Consistency,
Isolation, Durability), not performance per-se because while SQL can be
insanely fast, if you do not understand the abstraction it can get slow real
fast.

If you find yourself storing large tables of information that is vaguely
relational and you value the consistency of your data over raw performance I
would generally recommend SQL engines.

~~~
optforfon
I guess i'm more envisioning something like a SQL container. So if you have
multiple instances of a pImpl class you would keep the actual state-
variables/implementation in a DB. When you want to say do some complicated
look up (ex: which objects fit some criteria) you would express that operation
in terms of SQL queries

------
mysterypie
Is there any open-to-everyone SQL database on the net that I can query to try
out some simple operations after reading this tutorial?

~~~
niftich
You can try [http://sqlfiddle.com/](http://sqlfiddle.com/), although I've not
used it.

If you don't object to running one locally, SQLite Manager [1] is a good
Firefox addon that lets you visually manage local SQLite databases.

[1] [https://addons.mozilla.org/en-US/firefox/addon/sqlite-
manage...](https://addons.mozilla.org/en-US/firefox/addon/sqlite-manager/)

------
kevindeasis
Anyone have a resource for mysql 5.6/5.7 about geoqueries and storing json?
Also, resources about how people should build their table schema? It seems
like what I've learn from school might be very different on how you should do
it industry.

~~~
evanelias
You'll need 5.7 to make full use of JSON features; ditto for geo. MySQL 5.7 is
still considered pretty new (GA for ~9 months) so many companies haven't
upgraded yet. Amazon RDS and Google Cloud SQL do offer 5.7 though, if you want
to try out these features.

Best MySQL resources are generally the manual [0],[1] and Percona's blog
[2],[3]

As for MySQL table schema best practices, definitely a broader topic, but yes
it tends to differ greatly from what's taught academically. Baron's book [4]
is probably the best starting place -- it's a few versions old by this point,
but most of the core recommendations around InnoDB tables and indexing in
there still apply.

[0] [http://dev.mysql.com/doc/refman/5.7/en/spatial-
extensions.ht...](http://dev.mysql.com/doc/refman/5.7/en/spatial-
extensions.html)

[1]
[https://dev.mysql.com/doc/refman/5.7/en/json.html](https://dev.mysql.com/doc/refman/5.7/en/json.html)

[2] [https://www.percona.com/blog/2016/02/03/new-gis-features-
in-...](https://www.percona.com/blog/2016/02/03/new-gis-features-in-
mysql-5-7/)

[3] [https://www.percona.com/blog/2016/03/07/json-document-
fast-l...](https://www.percona.com/blog/2016/03/07/json-document-fast-lookup-
with-mysql-5-7/)

[4]
[http://shop.oreilly.com/product/0636920022343.do](http://shop.oreilly.com/product/0636920022343.do)

------
weaksauce
Aside from using temp tables is there a way to assign sub-queries as a
variable and refer to them later on?

completely simple and contrived for example purposes only, something like:

    
    
        active_users = select * from users where user.login > 0;
    
        select * from active_users where active_users.country is not null;
    

seems like that would clean up the query nesting quite a bit. Might need some
kind of begin / end blocks like with transactions so the query planner would
know the scope of those sub-queries.

~~~
ketralnis
Yep, a few.

1\. Temporary views. This one's my favourite but not everything supports it
and sometimes DBMSs have weird limitations, not like allowing parameter
binding in them.

2\. Temporary tables, as you mention

3\. CTEs ("with expressions"). These would be my favourite if they weren't an
optimisation fence in postgres.

    
    
        WITH active_users
            AS (select * from users where user.login > 0)
    
        select * from active_users where active_users.country is not null;
    

4\. If you're only using it once, a subselect lets you add a name to it if
that's all you're after

    
    
        select *
        from (select * from users where user.login > 0) AS active_users
        where active_users.country is not null;
    

5\. If you're using active_users frequently and want it to be globally
accessible, a regular view. Often you can make a DB much more useable with a
few well placed views like this and you'll need a few less abstractions as
well.

~~~
weaksauce
Thanks for the options. can you expand on why CTEs are a postgres optimization
fence?

as for 4: I am aware of the named subselect but once you add a few of those to
the main select statement you start to get into unreadable territory.

back to CTEs, it seems like it should be an easy substitution problem for the
postgres query planner. I am not a DB developer so I am probably not aware of
all the issues. The CTE seems like the thing I was trying to describe though.
Do you think that the optimizations would be more possible with begin and end
scopes? It seems like they are part of the same select statement so i doubt
scopes would help it.

~~~
gleb
In practice optimization fence is not normally a problem, and can even be an
advantage. The general rule is - just use CTEs. And the specific rule is -
learn to profile and optimize using EXPLAIN ANALYZE

------
Numberwang
Thank you for this. It may be too basic for most of the HN crowd but for me it
is a good read for my further development.

------
lukaseder
Every article that makes it clear that JOIN are effectively operating on
tables and thus informally part of the FROM clause (rather than being
independent clauses) deserves upvoting.

------
chris_wot
The one thing I've always wanted to be able to do is to be able to control the
projection part of the query. So in other words, I'd love to be able tell the
select to display all attributes that are like '%name%'.

~~~
bduerst
Have you tried using CASE with HAVING?

    
    
         SELECT
         ...,
         CASE WHEN attributes LIKE '%name%' THEN name 
              ELSE null END AS name
         FROM table
         HAVING name IS NOT null
    

Or did you mean selecting by the column's name specifically?

~~~
koolba
I think GP means "SELECT all columns that have %foo% in the name of the
column". It's not possible directly though you can do it with dynamic SQL by
querying the data dictionary and dynamically creating the columns retrieved in
the SELECT clause.

~~~
screature2
I think you can search for column names by querying the information_schema to
the table schema (at least in postgres)

    
    
      select 
        table_name 
      from 
        information_schema.columns 
      where 
        column_name ilike '%foo%';
    

see [https://www.postgresql.org/docs/devel/static/infoschema-
colu...](https://www.postgresql.org/docs/devel/static/infoschema-columns.html)

For searching for items with '%foo%' in all tables and columns, yeah i think
that you're probably right. [http://stackoverflow.com/questions/5350088/how-
to-search-a-s...](http://stackoverflow.com/questions/5350088/how-to-search-a-
specific-value-in-all-tables-postgresql)

~~~
ZenoArrow
> "For searching for items with '%foo%' in all tables and columns"

To do this you could perform two queries of the information_schema (one query
for '%foo%' in tables, one query for '%foo%' in columns) and join them
together with a UNION statement (or if you didn't mind duplicates you could
use UNION ALL in this case as it's faster and the results from both queries
can have the same fields).

------
danso
I love this guide and will probably steal ideas from it when writing my own
SQL tutorials. That said, I don't think SQL is _that_ hard to read compared to
other computer languages, at least for non-programmers.

I say this as someone who learned object-oriented programming far before I
learned about SQL or its declarative model. SQL was "hard" for me because the
declarative style is so functionally different than OOP, and to this day I
still haven't ventured to figure out the SQL equivalents of functions,
variables, and other mainstays of general programming; I'm satisfied with
using a scripting glue language like Python to automate simple SQL SELECT
statements.

But I love teaching SQL to non-programmers. You can do a lot of important
real-world things with it without having to jump deep into the logical puzzles
of "regular" programming. By "deep", I mean things like for-loops; it never
fails to surprise me how befuddled beginners are about how for-loops work --
not just the syntax, but the concept of a "block" of code, that for-loops can
exist inside of for-loops...or more fundamentally, you can run more than one
line of code inside a for-loop.

With SQL, you don't have to worry about any of that. You simply declare what
columns you want from a table, what values to filter by, and optionally, how
to aggregate it or join it with another table -- concepts that have much
clearer analogies to everyday thinking about information (or, at least
spreadsheets).

SQL gets complicated when you try to do complicated things. In the OP's
opening query, there's a joining of three tables. But without reading it
closely, I can already kind of describe the real-world information-need that
it fulfills: List the people who have borrowed books that have been borrowed
_something something_ to do with `stock`. If I spent a few more minutes, I
could figure out what the fuck that subquery, `stock > (SELECT avg(stock) FROM
books)` is all about, but the difficulty there is just my ignorance of the
domain and/or particular conventions in this hypothetical problem. If I, or
even the average novice, wrote this query line-by-line, its purpose would not
be so opaque (as writings rarely are to their own authors). The point is that
it's much easier for a beginner to author their own SQL without fucking things
up in the way you can easily do so with even just a for-loop. I can't imagine
that the Python or Ruby equivalent would be nearly as clear as this SQL
statement, even for a far simpler kind of join.

My main complaint about SQL is that it seems more logical that the `FROM`
clause should come before the `SELECT` statement. Though my initial guess is
that that hampers the declarative-clarity of the statement when `JOINS` are
involved.

edit: If I could make one suggestion to the submitted article, it would be
that the first "Simple Query" is not simple at all:

    
    
         SELECT bookid AS "id", title
         FROM books
         WHERE author='Dan Brown';
    

I'd argue that this is a simpler statement, and one in which the results are
immediately recognizable to every spreadsheet user:

    
    
         SELECT * FROM books;
    
    

The default behavior of Excel is to show all columns of a data file. When you
get more advanced as a SQL user (and more advanced in understanding the
fundamentals of computation), _then_ you start being selective about SELECT
and realizing that, as an end user (nevermind the cost of computation), it's
not always ideal to see everything by default:

    
    
         SELECT title FROM books;

~~~
Practicality
Ironically most programmers I know have no trouble with loops but struggle
with SQL. The most common SQL error they make is trying to use loops. (RBAR:
Row by Agonizing Row)

~~~
ZenoArrow
If a programmer is struggling with SQL in the way you describe, I'd suggest
they learn how to perform MapReduce in the programming language of their
choice. That should help give them a better understanding of how to approach
SQL. If their language doesn't have good support for the MapReduce approach, a
quick introduction in any functional language should help.

~~~
danso
I teach SQL to students who may have no programming experience at all,
unfortunately

~~~
ZenoArrow
The MapReduce advice was only meant for programmers who had to unlearn
imperative programming habits. Non-programmers won't have imperative
programming habits to unlearn.

------
mozumder
Some of the tables don't work in Safari, but otherwise a good intro.

