

How I write SQL - rachbelaid
http://www.craigkerstiens.com/2012/11/17/how-i-write-sql/

======
unoti
A neat trick for writing sql that I learned working with Oracle consultants:

    
    
        SELECT field1
        , field2
        , field3
        , another_field
        FROM ...
    

By placing the comma at the beginning of the line, instead of at the end, we
can very easily reorganize the sequence without fiddling with commas most of
the time:

    
    
        SELECT field1
        , field3
        , another_field
        , field2
        FROM ...

~~~
ebbv
No, no, no. Comma first variable listing is the worst thing ever in the
history of the world.

Having the comma in the proper place has such a negligible cost (oh no I might
have to delete the comma on the last element and add it to the formerly-last
element!), and having it like this looks so god awful and doesn't really save
you anything if you're swapping the first element instead of the last one.

~~~
hdctambien
Why is "after the element" the proper place for a comma?

~~~
ebbv
The same reason that you should put spaces before and after your logical
operators. The same reason you should use consistent spacing around left and
right parentheses. The same reason you should use caps for the SQL keywords
and lower case for all table and field names.

Because it makes your SQL more legible.

In English, commas go after the previous word, not before the next word. The
convention is the same in most programming languages. The fact that most
compilers and interpreters don't require it is not an excuse for abusing it.

~~~
cptskippy
I disagree with you regarding legibility. I've been using commas like this in
SQL for 10 years and I find it to be much more legible and functional this
way.

Besides the subjective argument of legibility, what other benefits does it
have?

Having a comma precede an item signals that it is one of many. When
debugging, you can easily comment out a column or table in the query.
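
For instance, commenting out a middle column with `--` leaves the query
valid. A quick sketch of that in Python with SQLite (table and column names
made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (a INTEGER, b INTEGER, c INTEGER)")
con.execute("INSERT INTO t VALUES (1, 2, 3)")

# Comma-first style: column b is commented out for debugging, and the
# query still parses because no dangling comma is left behind.
query = """
    SELECT a
    --, b
    , c
    FROM t
"""
row = con.execute(query).fetchone()
print(row)  # (1, 3)
```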

~~~
ebbv
Again, you can comment out lines with the normal syntax too. There's no reason
for this gimmick syntax. In theory it makes the last 2 elements easier to deal
with but it makes the first two harder. It's silly, and it's ugly.

------
larsolefson
My personal style diverges considerably.

First, most of my SQL scripts are multiple statements, typically 6+, ranging
up as high as 100. When I'm reading and trying to digest such scripts, the
long format described by the author, particularly putting each column on its
own row, makes it difficult to take in the script at a glance; I'm forced to
scroll constantly to make sense of the statements in relation to one another.
Instead, I prefer my SQL to be more compact, so I can view as much of the
total script as possible.

To accomplish this, and to maintain readability, I structure each statement
something like this:

    
    
        CREATE/INSERT/UPDATE line
        SELECT line
        FROM line
        JOINS (if present)
        ON (if present)
        WHERE (if present)
        AND (if present)
        Additional SQL (qualify, group by, order by)
        ;
    

If I have a lot of columns, the select portion will get split into several
lines, usually where there is a case when, column operations, or when the line
is 200ish characters long.

This makes the most sense to me, as each command (create, update, insert,
select, from, join, on, where, and, group by) is on its own line, followed by
the information relevant to it.

~~~
btilly
The same argument applies to writing normal code. The semi-colon exists so
that you can put more statements on a line, and make your code fit into your
editor, right?

You don't agree?

What is the difference between coding in that language and SQL? I submit that
it is only the amount of it you write.

I spent several years of my life focusing on reporting, and spending more time
writing/maintaining SQL than writing any other language. In that time I
discovered that complex SQL queries are a language like anything else. For
"hello world" you can get away with anything. But as soon as you are doing
complex stuff, the layout matters.

Did you know that it can make a difference whether a condition is in your ON
or your WHERE? It can. (Think left joins.) Did you know that the
location/order of the ON statements can make a difference? It does. Is it
visually obvious where this particular condition is? It should be. Did you
know that the order you put things in in your query can have performance
impacts? It shouldn't, but it does (particularly for MySQL - MySQL is stupid).
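
To make the ON-versus-WHERE point concrete, here's a minimal sketch in
Python with SQLite (invented tables; the behavior is standard across
databases):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE users (id INTEGER, name TEXT);
    CREATE TABLE orders (user_id INTEGER, total INTEGER);
    INSERT INTO users VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO orders VALUES (1, 50);
""")

# Condition in ON: the left join keeps every user, with NULLs where no
# order matched the extra condition.
in_on = con.execute("""
    SELECT u.name, o.total
    FROM users u
    LEFT JOIN orders o ON u.id = o.user_id AND o.total > 100
""").fetchall()

# Same condition in WHERE: it filters *after* the join, so the NULL rows
# produced by the left join are thrown away too.
in_where = con.execute("""
    SELECT u.name, o.total
    FROM users u
    LEFT JOIN orders o ON u.id = o.user_id
    WHERE o.total > 100
""").fetchall()

print(in_on)     # [('alice', None), ('bob', None)]
print(in_where)  # []
```

The WHERE version silently turns the left join into an inner join, which is
usually the bug.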

If you've got 200ish character lines and you are unwilling to format, well,
I'm glad that I don't work with you. Because I'm likely to be asked to figure
it out at some point, and I don't want to maintain crap like that.

~~~
larsolefson
Perhaps I didn't explain myself well. I also noticed that my spacing got
messed up. Each SQL command (create, update, insert, select, from, join, on,
and, where, group by, etc...) is on its own line, followed by the portion
relevant to it. My SQL is still formatted, but the preference is towards
putting each "section" of the statement on its own line, rather than
spreading it across several.

For example:

    
    
        select
        a.*,
        b.column1,
        b.column2,
        b.column3
        from
        my_table a
        left join
        my_other_table b
        on
        a.col = b.col
        and
        a.col2 = b.col2
        where
        a.col = condition
        group by
        column1, column2
        ;
    

vs.

    
    
        select a.*, b.column1, b.column2, b.column3
        from my_table a
        left join my_other_table b
        on a.col = b.col
        and a.col2 = b.col2
        where a.col = condition
        group by column1, column2
        ;
    

I'll take the 2nd approach any day, particularly when I have statements above
and below that reference that statement, because I can more easily understand
the context of the entire script. I also prefer this, because in my mind,
there is greater continuity, my_table is related to from, so it makes sense
that it should follow it. I read left to right, and don't need to move down to
a new line.

~~~
btilly
That is more reasonable than what I thought you were saying.

However what happens when your list of columns is long? What happens if you
want to include a CASE statement in a field?

I use vim and personally solve the scrolling problem with :split. This is
particularly important in making sure that the SELECT and GROUP BY match up.
(I am perpetually annoyed that the GROUP BY is not inferred from the SELECT.
Unfortunately multiple databases have invented different inconsistent behavior
for missing stuff in a GROUP BY, so there is absolutely no possibility of
getting agreement on the convenient default of grouping on all non-aggregate
functions that appear in the SELECT and HAVING clauses.)
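
As a sketch of that inconsistency: SQLite (like MySQL without
ONLY_FULL_GROUP_BY) accepts a bare non-aggregated column and just picks its
value from an arbitrary row in each group, where standard SQL rejects the
query outright:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE sales (region TEXT, amount INTEGER);
    INSERT INTO sales VALUES ('east', 10), ('east', 30), ('west', 5);
""")

# 'amount' is neither aggregated nor listed in the GROUP BY. SQLite
# returns the amount from some arbitrary row of each group rather than
# raising an error.
rows = con.execute("""
    SELECT region, amount, sum(amount)
    FROM sales
    GROUP BY region
""").fetchall()
print(rows)  # e.g. [('east', 30, 40), ('west', 5, 5)]
```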

~~~
larsolefson
Sorry for the confusion.

When my list of columns is long, I split it up into multiple lines, usually
7-8 per line (where I work, column names are capped at 30 chars, hence the 200
character estimate), unless it is a case statement. In case statements, each
case/when clause gets its own line, so you'll have:

    
    
        case when condition then result
        when condition_2 then result_2
        ...
        when condition_n then result_n end as column_name
    

All of my queries and statements follow the same logic and set of rules; I
just prefer a more compact view than most, it seems (which could indicate
that I'm optimizing for a different set of constraints/preferences).

I'll have to look into using :split.

Honestly I don't think about formatting that much anymore, as I've started
storing most of my statements as "metrics" which can be easily repurposed for
use in another script. Each metric contains the name of the metric, the
columns added, the metric source (usually a table, but can also be a select
stmt), the possible join conditions, extra SQL like where, group by, qualify,
and any indices. When I want to use that metric in another script, I can do so
by specifying the metric name, the table or select statement I'd like to
append the metric's columns to, and my join conditions. It works great for
simple to moderately complex queries, and provides you a good starting point
for really complex queries. Plus, it auto-formats everything how I like it. I
also use comments to explain what the statement is doing, so someone
theoretically should be able to get a pretty good idea of what's going on
just by reading that.

I really only think about formatting when I'm reading others' code and trying
to make sense of it, which is where I find formatting like the author
mentioned most annoying/frustrating (perhaps because it's different from how
I think/do things?).

~~~
btilly
My personal experience when I tried to clump things together is that I lost
track of what columns were where. My eyes scan vertically really fast. And
once I was used to it for large queries, well, what works for large queries
works for small as well if you're used to it.

 _Honestly I don't think about formatting that much anymore, as I've started
storing most of my statements as "metrics" which can be easily repurposed for
use in another script. Each metric contains the name of the metric, the
columns added, the metric source (usually a table, but can also be a select
stmt), the possible join conditions, extra SQL like where, group by, qualify,
and any indices. When I want to use that metric in another script, I can do so
by specifying the metric name, the table or select statement I'd like to
append the metric's columns to, and my join conditions. It works great for
simple to moderately complex queries, and provides you a good starting point
for really complex queries. Plus, it auto-formats everything how I like it. I
also use comments to explain what the statement is doing, so someone
theoretically should be able to get a pretty good idea of what's going on
just by reading that._

I do something different that you might like to borrow. If you're working in a
database with real tmp tables (PostgreSQL, MySQL and MS SQL all do, Oracle
does not) you can control execution plans for complex queries by creating a
series of tmp tables in order, adding indexes if you want, then joining them
down the chain. I initially did this for performance reasons, but quickly
found that I could easily reuse the logic to create specific tmp tables in
different places, and that a series of tmp tables was much more readable than
complex subqueries.
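
A toy version of that tmp-table chain, sketched in Python against SQLite
(invented tables; the real queries would be far bigger):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE events (user_id INTEGER, kind TEXT);
    INSERT INTO events VALUES (1, 'click'), (1, 'buy'), (2, 'click');

    -- Step 1: materialize an intermediate result as a temp table.
    CREATE TEMP TABLE clicks AS
        SELECT user_id, count(*) AS n_clicks
        FROM events WHERE kind = 'click'
        GROUP BY user_id;

    -- Optionally index it to steer the plan for later joins.
    CREATE INDEX tmp_clicks_user ON clicks (user_id);

    -- Step 2: the next query in the chain joins against the temp table
    -- instead of repeating the subquery.
    CREATE TEMP TABLE buyers_with_clicks AS
        SELECT e.user_id, c.n_clicks
        FROM events e
        JOIN clicks c ON c.user_id = e.user_id
        WHERE e.kind = 'buy';
""")
rows = con.execute("SELECT * FROM buyers_with_clicks").fetchall()
print(rows)  # [(1, 1)]
```

Each step of the chain can be inspected (or reused) on its own, which is the
readability win over deeply nested subqueries.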

You may wonder why I specifically dissed Oracle; after all, they claim to
have tmp tables. They do... but all tmp table definitions live in a global
namespace that requires special permissions to write to, and are shared
across all database handles. All of the other databases that I named allow
you to create/destroy them on the fly without special permissions, and don't
have an issue if two database handles want to use the same tmp table name
with different definitions.

This opens you up for all sorts of nasty interaction problems. And increases
the cognitive overhead of the technique to the point where it becomes not
worthwhile.

~~~
larsolefson
It would seem we're each doing what works for us.

I work mainly in MySQL and Teradata (which has tmp tables, called volatile
tables there), and I do exactly what you describe when creating complex
queries. My metric system is just a way to build those temp tables more
rapidly.

I use two main functions to manipulate metrics:

    
    
      create_metric(metric_name,{cols_added},join_src,{join_cols},{extra_sql},{indices});
    

This stores the metric for later use.

    
    
      add_metric(metric_name, my_other_table, {my_join_conditions});
    

This retrieves the metric, and returns a string of (Teradata) SQL in the form:

    
    
      CREATE VOLATILE MULTISET TABLE add_metric_name AS (
      SELECT a.*, b.cols_added_1, b.cols_added_2,...,b.cols_added_n
      FROM my_other_table a
      LEFT JOIN join_src b
      ON a.my_join_condition_1 = b.join_col_1
      AND a.my_join_condition_2 = b.join_col_2
      ...
      AND a.my_join_condition_n = b.join_col_n
      [If there is extra sql, like where a.condition = X, or group by's like group by 1, 2, it would show up here. SQL here can reference the join columns and table name in an add_metric stmt]
      ) WITH DATA PRIMARY INDEX(indice_1, indice_2) ON COMMIT PRESERVE ROWS;
    

I can also store entire create tmp table chains as metrics, with the last
table appending all of the information from that chain to another source (I
do this by storing the chain as preparatory sql, which is run before the
create add_metric_name statement).

It also allows me to search all of my metrics on a number of different
dimensions: the common name of the information I am adding (metric name),
column names, tables, join conditions (particularly useful - It helps you map
how you'll get from one metric to another), indices, or any combination of the
above. For example I can find all metrics that have the word phone in the
metric name and are joinable on user_id.

I'm aware of Oracle's lack of tmp tables. My fiancée has to use Oracle SQL at
work, and I quickly discovered its lack of tmp tables when trying to help her
solve a SQL issue.

~~~
btilly
Yup, they look like similar solutions to similar problems.

One of the things that I built into reports at that location was the ability
to see all of the tmp tables that had been created, and the ability to stop
the report on any particular one and display that. I built this as a debugging
aid for myself, but was quite surprised when finance came to me one day and
said, "Report X is going wrong on step Y - it looks like you're filtering out
duplicate records."

I like having users that will debug my stuff. :-)

------
btilly
My personal style has a lot of similarities, but with some glaring
differences. The biggest is that I put the comma in front of the next item,
rather than trailing the one before. What this means is that when I add a new
thing to the list of columns, I'm less likely to leave a comma out.

See <http://bentilly.blogspot.com/2011/02/sql-formatting-style.html> for what
this looks like in practice.

~~~
radiowave
I put the commas in front as well, but the reason I started doing it that way
is that you can then comment out (and uncomment) any column without breaking
the query.

------
tom_b
Enjoy. A CTE, Xpath, and Oracle-specific hierarchical bits for a data report
I'm generating. My favorite part was the XML in a CLOB column where all the
values were stored in XML element attributes. I tend to build these
monstrosities up in a very repl-like way. SQL actually led me to Lisp . . .

BTW, I can't imagine any of the below is legible, but I find myself enjoying
these types of things.

    
    
      with bgu as 
      (select
        protocol_code,
        application_code,
        description,
        lab,
        proj."seqno",
        proj."project",
        proj."protocol",
        'false' "nexttag",
        'true' "materialtag"
      from
        protocol,
        xmltable('for $a in /*
                     for $n at $nidx in //*/MATERIAL
                     return <result>
                     <project>{fn:data($n/@project)}</project>
                     <protocol>{fn:data($n/@protocol)}</protocol>
                          </result>'
                 passing xmltype(protocol.xml)
                 columns
                     "seqno" for ordinality,
                     "project" varchar2(100),
                     "protocol" varchar2(100)) proj
      where
        active = 'Y'
      union
      select
        protocol_code,
        application_code,
        description,
        lab,
        proj."seqno",
        proj."project",
        proj."protocol",
        'true' "nexttag",
        'false' "materialtag"
      from
        protocol,
        xmltable('for $a in /*
                     for $n at $nidx in //*/NEXT
                     return <result>
                     <project>{fn:data($n/@project)}</project>
                     <protocol>{fn:data($n/@protocol)}</protocol>
                          </result>'
                 passing xmltype(protocol.xml)
                 columns
                     "seqno" for ordinality,
                     "project" varchar2(100),
                     "protocol" varchar2(100)) proj
      where
        active='Y'
      order by 1,2,4 )
      select 
        * 
      from 
        (select distinct 
              protocol_code,
              sys_connect_by_path(protocol_code,'||||'),
              connect_by_isleaf "isleaf"
        from bgu
        where protocol_code like 'PROTNAME%'
        connect by nocycle prior "protocol" = protocol_code)
      where "isleaf"=1

~~~
jmelloy
I can't stand the XMLTable syntax. I like the _idea_ of turning an xpath
expression into a result set, but in practice it's a massive pain in the ass.

~~~
tom_b
XML in Oracle drives me crazy in general. You wind up with neither a
relational model nor a generally useful data storage/exchange format.

I am almost always working with legacy schemas, many generated by long-extinct
tools or developers, and being a genuinely curious person, wandering down
strange code paths to see if I can satisfy random reporting requests.

------
Svip
I don't write SQL like that. He has a good point about putting each and/or
condition in the where clause on its own line and starting the line with the
operator (easier to remove).

But I don't think all queries should be written in several lines. If a query
is simple enough (e.g. no aggregation, group by, having and at maximum one
where condition and few fields to select), it can be written in one line,
like:

SELECT field FROM table WHERE value = @value

That's easy to understand and readable (which was initially part of SQL's
goals, although we can discuss at a later date how successful that was).
Complicated SQL queries - which are bound to happen - follow, in my style, a
more structured approach:

There are new lines after _every_ grouping (FROM, WHERE, GROUP BY, ORDER BY,
HAVING, etc.) and indenting. In fact, I prefer to indent my fields twice,
because I consider FROM, WHERE, etc. to be subordinate to SELECT (even if
part of an INSERT). I also tend to avoid extra keywords such as 'OUTER' in
'LEFT OUTER JOIN' (there is no such thing as 'LEFT INNER JOIN'). And then I
follow each JOIN statement (indented compared to the ground table in from) by
the table name, a new line, an indent and the ON condition. If another table
relies on that table being joined, I indent it further to indicate that, so I
can quickly look at my code and see which tables go through which (which is
useful if you have a lot of relationship tables).

It should be noted, however, that I most commonly write against Sybase at
work, which tends to have no preference - style wise - on whether you write
the keywords in upper case. My co-workers do not, so I don't either. But when
I write my own projects against MySQL/MariaDB, I use upper case. But I
generally follow the same style.

~~~
jmelloy
If I'm doing a simple query, I still structure it as

    
    
      select blah, blah, blah
      from dual
      where ...
    

I find that even for short ones, lines quickly get very long and illegible if
you're smushing them all onto one line.

------
rosser
Please, please, please don't perform your JOINs with comma-separated items in
the FROM clause. As soon as you also need to do an OUTER JOIN, you're going
to get unexpected results.

Per the SQL spec, explicit JOIN-style syntax binds more tightly than FROM-list
elements, which can not only constrain the query planner in the choices it can
make, but can actually _yield a different result_.

------
joulethief
I wonder why so many people hate writing sql with such a poisonous intensity
these days. I remember my database-systems class being one of my favorite
classes when I was in school. What's wrong with relational algebra and
writing sql? I always thought they were fun.

------
prawks
I was hoping to see some examples of more complex WHERE clauses. They can get
a little messy when you start mixing ANDs and ORs at different levels of
nesting.

Also, why are

    
    
      wine_tags as (
        SELECT DISTINCT
          unnest(tags) as tag,
          wine_id
        FROM 
          app_rating
        GROUP BY 
          wine_id, tags),
    
      wine_detail as (
        SELECT
          app_wine.name as name,
          app_wine.id,
          app_winery.name as winery
        FROM
          app_wine,
          app_winery
        WHERE app_wine.winery_id = app_winery.id
       ) 
    

handling closing parentheses differently? This affects how you organize nested
parentheses.

~~~
btilly
My rule is that if I'm mixing ANDs and ORs in a WHERE, I use parentheses and
extra indentation.

Not everyone knows that AND binds more tightly. Nor should they have to.
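
A tiny sketch of why the parentheses matter (Python/SQLite, made-up table):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE t (a INTEGER, b INTEGER, c INTEGER);
    INSERT INTO t VALUES (1, 0, 0), (0, 1, 1);
""")

# AND binds tighter than OR, so this means: a = 1 OR (b = 1 AND c = 1)
implicit = con.execute(
    "SELECT count(*) FROM t WHERE a = 1 OR b = 1 AND c = 1"
).fetchone()[0]

# Parentheses force the other grouping: (a = 1 OR b = 1) AND c = 1
explicit = con.execute(
    "SELECT count(*) FROM t WHERE (a = 1 OR b = 1) AND c = 1"
).fetchone()[0]

print(implicit, explicit)  # 2 1
```

Same columns, same constants, different row counts - which is exactly why
spelling the grouping out is kinder to the next reader.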

------
jmduke
I'm not a big fan of this style. It's technically 'organized' (in the sense
that there's a definite consistency), but it's vertical to the point of
reduced readability: trivial queries can take up dozens of lines.

~~~
kbenson
> trivial queries can take up dozens of lines.

My general rule-of-thumb is that trivial things can have syntax constraints
relaxed, _as long as it makes them more readable, not less_. I may adopt the
style of the article for a complex query, but for something simpler I may go
with:

    
    
        SELECT   foo, bar, baz
        FROM     my_table
        WHERE    foo=1 AND bar > 10
        GROUP BY bar
        ORDER BY baz
    

But that becomes less readable after more than a couple fields or conditions
for each line. Until that point, I think it's fairly concise and efficient.

That said, if your SQL is intermixed with code whose sole purpose isn't to
deal with that SQL, you are probably in for hurt later on anyway. If it is
safely quarantined into data access routines of some sort, the size doesn't
really matter, since the function or method containing the query should really
be _about that query_.

------
fusiongyro
The fact that the first thing he does with the tags is unnest them is, IMO,
material evidence for a traditional 1NF formulation. It's worth considering
that using arrays is a violation of the _first_ normal form. That's a good
indicator of how obvious Codd et al. thought this rule was.
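
For reference, the traditional 1NF formulation keeps one tag per row instead
of an array column; sketched here with SQLite and invented names:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- One row per (wine, tag): no array column, nothing to unnest.
    CREATE TABLE wine_tag (wine_id INTEGER, tag TEXT);
    INSERT INTO wine_tag VALUES (1, 'dry'), (1, 'oak'), (2, 'dry');
""")

# The "all tags per wine" view the article builds with unnest() falls
# out of a plain GROUP BY here.
rows = con.execute("""
    SELECT wine_id, group_concat(tag, ',') AS tags
    FROM wine_tag
    GROUP BY wine_id
""").fetchall()
print(rows)  # e.g. [(1, 'dry,oak'), (2, 'dry')]
```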

Other than that, I use a variant of this style, where I put things on one line
if possible (especially GROUP BY and ORDER BY). And I tend to line up my joins
like so:

    
    
        FROM foo f
        JOIN bar b ON b.foo_id = f.id
        LEFT JOIN baz ON ...

~~~
AlisdairO
I would tend to normalise arrays too, but I can see the argument that when
getting an 'app' and turning it into an object in $programming_language (say),
it's rather a pain to have to go through the sql result set building up an
object from the denormalised data (and arguably results in a lot of wasted
data being transmitted). It gets particularly bad if you have more than one
1-many relationship on the object you're trying to build up.

~~~
fusiongyro
If you're using an ORM, and you almost certainly are, it's equal to all the
other work you do to map your objects. If you're querying directly,
_array_agg_ on the way out is the same work as _unnest_ is on the way in. You
can have your cake and eat it too. Tagging has been discussed a lot recently as
a good excuse to use arrays in Postgres, but since the whole point of tagging
is to enhance searching I think it really misses the point to do it this way.
It appeals to our intuitions only because we think of rows as being large and
string arrays as being small, but in practice this sort of intuitive
"optimization" is almost always wrong.

~~~
AlisdairO
> If you're using an ORM, and you almost certainly are

I'm not a web developer - I spend more of my time in raw-sql-land, so this
discussion is largely academic to me :-). That said, even if you're using an
ORM, it's presumably not exactly free for the ORM to produce an object out of
the thousands of distinct rows you can end up getting if you have multiple one
to many relationships.

In general, the availability of arrays for the purposes of data transport -
less so actual storage - in postgres is rather helpful for these sort of
problems. I find working with other DBs to do object storage quite
constricting by comparison.

~~~
fusiongyro
Without some numbers I see no reason to believe that Postgres would be more
efficient at transporting arrays than rows. One of those is a highly optimized
use-case that literally everybody relies on every day. The other one is a
corner case. I'm not saying it's impossible but I'm going to need more than a
good feeling to buy it.

~~~
AlisdairO
Perhaps my wording was unclear - I'm not talking about postgres' efficiency,
but efficiency of processing the data that postgres produces.

Consider the example of an ORM that's storing an object with three multi-
valued properties, each with 100 values. Encoding option one is a row with
three arrays in it; option two is the result set from joining to three
separate tables. In the end, it's quite plainly obvious that it's going to be
cheaper (and, frankly, easier) to construct the object from a row with three
arrays of size 100 in it than from a million-row result set.
Yes, this is absolutely a corner case (particularly with many multi-valued
props of such a cardinality), but I'm not really sure why it requires numbers
to prove - unless there's some kind of JDBC/ODBC technique for dealing with
this situation that I'm simply unaware of?

------
iblaine
What it comes down to is: use _any_ type of formatting as opposed to _no_
formatting. Too often people crank out code, don't comment it and don't
format it. This applies to every language. As far as SQL syntax goes, what
irks me are:

* No use of spaces.

* Obscure use of aliases. If you have 4 tables all aliased as a,b,c,d then you are making life hard for someone else.

* Not capitalizing operations or reserved words.

------
mast
I don't often deal with SQL and when I do, it is usually quite simple. But, I
think there is actually an error in the first step:

    
    
       SELECT 
         avg(rating),
         wine_id
       FROM 
         app_wine
       GROUP BY
         wine_id;
    

The query should be against the app_rating table. It does get corrected later
on, but I was confused for a while.

------
tragomaskhalos
One thing I wish more people would do is to indent joins appropriately; e.g.

    
    
        ...
        Inner join a on a.... = main..
          Inner join b on b... = a...
            Inner join c on c... = b...
          Inner join x on x... = a...
    

The syntax of SQL is a real basket case, readers need all the help you can
give them.

~~~
bcgraham
Whenever I see it, it takes too long to differentiate it from a bunch of
random indentation. Whether this indentation system is needlessly time-
consuming or necessary depends a lot on the type of query and the underlying
schema.

------
NicolasBeuzeboc
Postgres' WITH ... SELECT is very handy, but in that case I would create
three views instead. It seems that wine_ratings, wine_tags and wine_detail
could be used in other queries, and having the views created would prevent
duplicating their code.

------
xpose2000
I follow a similar style as well. One of the things I love about SQLyog is
that it has a 'format query' button that will make any query readable.

It's a huge help when analyzing bad queries that newrelic might spit out.

------
dmourati
Without further ado.

<http://grammar.quickanddirtytips.com/ado-versus-adieu.aspx>

------
mjt0229
I have developed a very similar style, for the same reasons. It's funny to see
this blog post, because I have gobs of SQL just like it on my other monitor.

------
rachbelaid
I really like the WITH statement in Postgres.

It allows you to break down the logic and compose the result.

Any idea if it's standard SQL?

~~~
bigdubs
[strikeout, see comment above, it's in SQL99]

MSSQL has a similar construct with Common Table Expressions (CTEs) though, so
it's not unique to PG.

~~~
NDizzle
Another tip for using WITH on MSSQL - prefix it with semicolon like this:
;WITH

MSSQL makes it so that people pick up bad habits over the years, and often
you'll find missing semicolons at the end of statements. While the parser
will intelligently handle missing semicolons for the majority of scripts, it
will bomb 100% of the time (at least up to 2008R2) on WITH.

------
leeoniya
I recently dealt with a lot of this while discussing/resolving some of these
issues at <https://github.com/jdorn/sql-formatter>

