Hacker News new | comments | show | ask | jobs | submit login
Writing more legible SQL (craigkerstiens.com)
195 points by craigkerstiens on Nov 3, 2016 | hide | past | web | favorite | 162 comments



Here is how I write SQL:

    select
        t1.col1,
        t2.col2,
        t3.col3
    from table1 t1
        join table2 t2 on t1.col2 = t2.col1
        join table3 t3 on
            t1.col3 = t3.col1 and
            t3.col2 = something_else
    where
        t1.col1 > 0 and
        t2.col2 <> t1.col4
    order by col2
    limit 100
So:

1. SQL capitalization is not sacred. I lowercase everything.

2. I just indent subclauses, with four spaces, like I indent other languages. I don't go out of my way to line up things vertically.

3. Conjunctions like "and" are put at the end of a line, like a comma, not at the beginning. I think it lines things up better vertically.

4. I try to keep lines short, but I also try to keep it from getting needlessly long vertically. So if the statement is short and simple, I might fold some things back onto one line, like:

    select col1, col2, col3
    from table1
    where col1 = something_or_other

or even:

   select * from table1 where col1 = something


I fully agree with the sentiment, and especially with treating SQL code as code, because it really is code. My style differs in 4 regards:

1. "join"s are at the same level as "from", and the contents of from/join are indented

2. prefer tuple comparisons of multi-comparisons, i.e. "(a,b)=(c,d)" instead of "a=c and b=d"

3. operators auch as "and" are at the beginning of a line, as in style guide of almost all other programming languages, too

4. the contents of "order by" and "group by" are also indented, the same way as the "select" columns

Example:

    select
        t1.col1,
        t2.col2,
        t3.col3
    from
        table1 t1
    join
        table2 t2 on t1.col2 = t2.col1
    join
        table3 t3 on (t3.col3,  r3.col2) = (t1.col3, something_else)
    where
        t1.col1 > 0
        and t2.col2 <> t1.col4
    order by
        col2 asc,
        col1 desc
    limit 100


For item 3, the benefit of "and" at the beginning is that you can add or remove lines to the condition without disturbing the previous/next lines. The same works for ","


But the disadvantage is that you don't immediately see whether it is "and", "or" or something else.

That's why in most programming languages the coding style says you should put the operators at start of line instead of end of line, e.g.

GOOD:

    if ((...some_condition_that_may_make_sense...)
        && (...some_other_condition...)
        && (...and_yet_some_condition_that_makes_sense...))

    x = ((some expresssion)
         + (some other expression of different length)
         - (some final expression))
BAD:

    if ((...some_condition_that_may_make_sense...) &&
        (...some_other_condition...) &&
        (...and_yet_some_condition_that_makes_sense...))

    x = ((some expresssion) +
         (some other expression of different length) -
         (some final expression))
Examples:

- Python PEP-0008: "Should a line break before or after a binary operator?" https://www.python.org/dev/peps/pep-0008/#should-a-line-brea...

- Topvoted SO answer for "If you break long code lines, how do you indent the stuff on the next line?" http://stackoverflow.com/a/699347


Very similar to my style, though I rarely use <> and instead mostly use !=

Never realized I could do tuple compare and will probably adopt that.


  >> prefer tuple comparisons of multi-comparisons,
  >> i.e. "(a,b)=(c,d)" instead of "a=c and b=d"

    > never realized I could do tuple compare and
    > will probably adopt that.
Indeed. Mind blown.


I always keep the joins indented from the table that they are joined to:

  select    t1.col1,
            t2.col2,
            t3.col3,
            t4.col4
  from      table1 t1
                join table2 t2 on t2.colx = t1.colx
                join table3 t3 on t3.coly = t1.coly
                    join table4 t4 on t4.colz = t3.colz
                join tablen tn on tn.coln = t1.coln
  where     tn.colnx in (...)

            -- Table 3 and Table 2 have some values while Table n value is not something OR Table n has value that is exactly something
            and (
                    (t3.col = x and t2.col = y and tn.col != z)
                    or 
                    tn.col = z
                )
  ...
So in above you can see by the indent on which table some other table is joined to. Tables t2, t3 and tn are related to t1, but t4 is related to t3, not directly t1.


> indented from the table that they are joined to

There is not always "the" table.

How does your coding style work if you join into multiple tables? (because in reality you join the new table with the joined result of the previous tables, and hence can reference any combination of any previous table columns, unless you group your joins with parentheses)

  select
      ...
  from
      table1 t1
  join
      table2 t2 on t2.colx = t1.colx
  join
      table3 t3 on (t3.coly, t3.colz) = (t1.coly, t2.colz)


What you are saying is correct. In this case I would have the join of t3 on the same indent level as the join of t2 as t3 table has also the common key with t1. Somehow I just find the parent comment style not so intuitive for me.


I much prefer the style of the comment you replied to because it's very easy (for me) to scan the table/view names when they are vertically aligned.

They are at the top of the hierarchy of information I want when I'm reading a query.

The information I want to be able to identify the quickest are.

1. Tables/view names

2. How they are joined

3. Columns

4. Filters

5. Grouping/Ordering/Anything Else


Regardless of the particulars, I find the most important thing for readability is for the author to have made some kind of decision and stuck to it. The most illegible style is no style at all.


As long as it does not deviate into getting really creative with the spec, keeping everything consistent is overrated in my opinion.

There are the exceptions of very widely used projects where contributors will only look at the code a couple times when making a change to be pushed upstream or where there are significantly junior developers that need a narrow scope of the language to help them with the learning curve and getting up to speed with the code base.

In my experience(which is very anecdotal I admit), the requirement for consistency is more about a particular personality trait that many developers have - the need for order and control - as opposed to a requirement for quality production code.


Where did you pick up this style?

I have been following this pattern for over a decade. I believe I picked it up from some project's style guide, but I never found it again.


If I came across your SQL, I'd probably hunt you down and give you a hug. There's so much horridly formatted code and it's unusual to see a developer care.

I prefer to uppercase command syntax to make stand apart visually from the parameters of the query. I don't agree with your conjunctions at the end of the line, I actually prefer commas at the beginning of the next line though I don't do that so as to conform to convention. I also use the AS keyword to explicitly denote aliases.

If I had it my way, all SQL would look like this:

    SELECT t1.col1
          ,t2.col2
          ,t3.col3
    FROM table1 AS t1
    JOIN table2 AS t2 
        ON t1.col2 = t2.col1
    JOIN table3 AS t3
        ON t1.col3 = t3.col1
        AND t3.col2 = something_else
    WHERE t1.col1 > 0
        AND t2.col2 <> t1.col4
    ORDER BY col2
    LIMIT 100


Very much like this style, very close to what I do... really the only difference between your style and mine is that I will indent the joins. For me it makes it somewhat easier to be able to scan for the where clause. Also I'll typically not use the AS keyword for table aliasing, no real good reason for that choice other than many years of habit.

But mostly commas first and logical operators and such first: If I have complicated queries that I'm working through... having the ability to just comment bit out here and there without worrying too much if I've left a dangling comma (et al) is a good time saver.


That's pretty close to how I do it. I started putting commas at the beginning of a line because when I started doing so I stopped ever having queries error out because I copy and pasted something from one query to another and forgot to remove a trailing comma. Having the commas at the front of the line makes finding those kinds of errors automatic for me.


Instead of using uppercase, which looks awful, just get an editor with color-coding. Solves the same problem, no more ugly uppercase.


That matches what I do pretty much with the exception of capitalization. I agree, it's not totally necessary, but it does provide a visual delineation of each section/component of the the statement, which, for large statements, can be very helpful in quickly scanning what it does:

    SELECT
        t1.col1,
        t2.col2,
        t3.col3

    FROM
        table1 t1
        join table2 t2 on t1.col2 = t2.col1
        join table3 t3 on
            t1.col3 = t3.col1 and
            t3.col2 = something_else

    WHERE
        t1.col1 > 0 and
        t2.col2 <> t1.col4

    ORDER BY col2
    LIMIT 100
    ;


The lower-case probably works better for IDEs that have SQL syntax-coloring. If your IDE doesn't have that (looking at you XCode..), then upper-case keywords are better.


>> If your IDE doesn't have that

I'm looking at you, System i Navigator. Also you don't support anti-aliased fonts or scale properly (because you're a 90's Java application)


I do almost the same thing, but I capitalize all SQL keywords. I'm sure that's annoying to someone somewhere but I find it useful in separating the SQL from the relation names.


We've had great syntax highlighting tools for decades now, I'd much rather my tooling do it for me with subtle colorations than be forced to resort to COBOL SCREAMING.

I've trended towards a preference to "sentence-cased" SQL where the first word in an SQL statement is Capitalized and everything else is lower-cased. Makes things read more like a narrative and is a nicer hint than trying to spot an optional semi-colon to determine if you've reached the end of a command yet (and helps in those times when you need to find a place where it turned out an optional semi-colon wasn't in fact optional).

    Select *
    from table
    where condition
    order by fields

    Insert into table
      (...)
    values
      (...)
So far as I know I'm about the only person that likes sentence-casing SQL, but it helped me out quite a bit on some projects I worked on and it kept things readable.


+1 Vertical whitespace.


yes, neat, and yes SQL is code. Problem is that there is not SQL linter I know of.


Yep. Same here. I find screaming one's reserved words far more distracting than the highlighting helps.

Another SQL habit that seems to far more annoying than helpful is putting commas at the start of next line. Great, they line up. Did you know that publishers have a term for vertical patterns in blocks of text? They're called 'rivers', and one generally attempts to avoid them, because they're visually distracting.


One developer's 'river' is another's 'grid'. Being able to scan down quickly makes it easier to compare lines against each other. That's not something you do when reading a novel or a newspaper article, but it's something you do all the time when reading code.


Maybe I'm a weirdo. I find it massively distracting. About the only time column-alignment is helpful is when scanning for typos, but that's handled pretty capably in other ways by most editors more sophisticated than Text Edit.


I think "rivers" are not the same. if the same as in edition of books in French, rivers are those unwanted "paths" your eye sees floating downwards inside the rectangle of words that makes a page. They come out of randomness, when whitespaces happen to draw a negative path. It has to be corrected by slightly modifying the whitespaces between some words. AFAIK this is done manually and is one reason for book edition to be done by humans.

In short, it has nothing to do with purposeful perfect vertical alignment in code.


When I worked in magazine production (90s, California) we used the term for both whitespace and other patterns. Other patterns are a lot less common, because whitespace is the most common "character", but they happen.


I used to agree with you about putting commas at the beginning of lines, but now I've changed my mind. By putting the commas at the beginning of lines it becomes much easier to do dynamic sql and to be able to add remove select statements without having to check the ends of the lines to make sure the last line does not have a trailing comma and to make sure the preceding lines each does have the requisite comma.


I go back and forth with conjunction at the end or beginning of the line. At the end you get to line up column names, at the beginning it makes it easier to comment out or remove the line and basically contains the intended logic on a single line.

I think I usually end up with the later because I think it makes more sense.


Why not this?

  WHERE A
    AND B
    AND C
Or better, this? You can comment out any part easily.

  WHERE 1
    AND A
    AND B
    AND C
(With keywords capitalized, as g-d intended. :-) Or not.)


I like the "WHERE 1" trick, thanks!


I have the same problem and also with where to put the commas separating columns names. You can't win the game though.

Sometimes you want to remove/comment out the first line, sometimes the last, sometimes one in the middle.


"2. I just indent subclauses, with four spaces,"

Surely you mean tabs? runs


"runs"

Into my arms.


Interesting! I independently ended up on nearly the exact same style. The only difference is I don't indent join clauses.


I prefer putting the comma at the beginning of the line (this makes dynamic sql much easier). I prefer putting and/or at the beginning of the line. I indent multiple items like case statements in the select. I use parens for join statements with multiple join conditions. I separate the select/from/where clauses with a blank line when a section has more than one line.

select

t1.col1

, t2.col2

, t3.col3

, (case when t1.col1 = 'Y'

   then '1'

   else '0'

   end) as the_flag
, t3.col2

from table1 t1

join table2 t2 on t1.col2 = t2.col1

join table3 t3 on (t1.col3 = t3.col1

   and t3.col2 = something_else
)

where t1.col1 > 0

and t2.col2 <> t1.col4

order by

col2

, col1

limit 100


1. Why indent the joins? Are table2 and table3 less important than table1? Is table1 special? Is that why it enjoys privileged status in the from-clause?

2. Why place some predicates in the where-clause and others in the join-clauses? What's the thinking here? Why not put all predicates up in the join-clauses, nearer to the tables that they affect?


I think of joins as operators and from as the block similar to and/or in the where clause. He is being inconsistent with line breaks for select, from and where blocks. Personally I want to be able to visually pick out the blocks of the query and the easiest way to do that is with indention imo.


If "joins" are operators like "and", and you put "and" at the end of the line, shouldn't you put "join" at the end too?

    from
        table1 t1 join
        table2 t2 on t1.col2 = t2.col1 join
        table3 t3 on
            t1.col3 = t3.col1 and
            t3.col2 = something_else


well I don't put and/or at the end of lines so there is that... but I have seen people but joins at the end of lines


sorry, mistook you for the top-level poster.


  > Why indent the joins?
  > Are table2 and table3 less important than table1?
It's often arbitrary which tables are joined and which one is from'd. But the joins are part of the from-clause. They all join together into one big from.

  > Why place some predicates in the where-clause and
  > others in the join-clauses? What's the thinking here?
  > Why not put all predicates up in the join-clauses,
  > nearer to the tables that they affect?
The join conditions are just to line up the rows of the different tables with each other, to avoid a cartesian product, to form one big table.

This giant table is then filtered through the where-clause, like a funnel. You can put the filters in the joins, and I have in the past, but putting them in the where-clause better reflects the picture in my head.

Tangentially, it would have been better if SQL had the select-clause after the from- and where-clauses:

    from table1 t1
        join table2 t2 on t1.col2 = t2.col1
        join table3 t3 on
            t1.col3 = t3.col1 and
            t3.col2 = something_else
    where
        t1.col1 > 0 and
        t2.col2 <> t1.col4
    select
        t1.col1,
        t2.col2,
        t3.col3
To understand the select-clause, I always first have to jump down to the from-clause anyway. This would also mirror the other statements: insert, update, and delete, which begin with the table names.

This better reflects the flow of data. First you decide the source of data (which tables). Then you filter down to which records (which rows, the where-clause). Finally you determine which fields to get (which columns, the select-clause).


I've been writing SQL for close the 20 years, and my style has settled into something very similar to yours. Lowercase everything, and use indentation to clearly separate the major parts of the query.

And, like you, be pragmatic. If it is a 1 liner, keep it a 1 liner.


I read somewhere, at some point, couldn't provide a ref, that the caps were used for clarity back in the days of monochrome screens. Now, with colour syntax highlighting it's no longer as useful.


The purpose of capitalization is to visually distinguish the keywords.

In this style that's done in other ways so there is no need.


You didn't mentioned the most important thing to make readable SQL and that is demonstrated in your clause: use the "join" keyword instead of doing the join in the "where" clause.

I can always auto-reformat a complex SQL in a IDE, but if it isn't using join clauses, it will stink.


My first thought was "why would you do that?". Glad to say I've rarely seen joins in the where clause.


This is what he means, and I see it all the time: SELECT FROM table1 JOIN table2 WHERE table1.field = table2.field

An unsettling number of SQL developers never actually learned join syntax and write it that way.

It's functionally equivalent to putting the equality in the JOIN ... ON clause, and any modern database will optimize it to execute the same way, but it's a worse syntactical representation of what's actually happening.


Great, thanks for that - I've been using (typically INNER) JOIN .. ON .. for all my life, well 20 odd years of development, and not once did I ever put the ON clause into WHERE. :)

I can vaguely recall seeing it a few times now, actually, and being rather confused as to why you'd do it that way - as perhaps it was a more optimal way of putting it, and that I'd done it wrong all the time?

As you state though, it's just bad; WHERE clauses ought to be used for filtering the data, not declaring how the tables join. Glad to have cleared that up. :)


You'd do it that way if you just didn't know any better. "Two tables where A = B" falls into a natural line of thinking for someone who picked up SQL ad hoc rather than formally learning what a join is and the syntax for it.


Totally agree..This should be put on a page published somewhere. Whenever I am trying to look at performance issues (or any other issue) with a SQL, this is the first thing I do and pretty much follow the same strategy.


quick question about the indent of subclauses. How would you write a query where table3 needs to be joined with both table1 and table2? Would it change anything?


No. I indent all joins to the same level, regardless of the join condition. The effect of joins is to make one large flat table, so nested indentation is an unnecessary and misleading signal.


Oh yes please. Capitals just make it harder to read.


A couple of jobs ago, I worked at a company that did a ton of SQL and we used right justified keywords, which looks like:

    select t1.col1,
           t2.col2,
           t3.col3
      from table1 t1
           join table2 t2 on t1.col2 = t2.col1
           join table3 t3 on t1.col3 = t3.col1 
            and t3.col2 = something_else
     where t1.col1 > 0 
       and t2.col2 <> t1.col4
     order by col2
     limit 100
Initially it seemed weird to see the ragged left edge, but over time I got used to it and grew to prefer it. It doesn't really hold up for extremely complex queries with lots of subqueries because everything tends to start drifting too far to the right (unrealistic example below):

     select *
       from blah b
      where b.whatever in (
          select blah 
            from c
           where c.whatever = in (
               select blah 
                 from d
                where d.whatever = 'foo'
           )
      )
I almost never see this style used these days, and I was curious if anyone else does it this way?


I use this style. I work with queries that are a little bigger than average, and it's very difficult to keep track of it unless it's well laid out.

conditionals, loops, etc, get formatted like they would in a procedural language. the DML gets formatted similarly to what you've posted.

a not unrealistic example from this morning is as follows:

  if (select object_id('aTableName', 'u')) is not null begin
    select column
      from aTableName
     where thisIsTrue = 1
           and somethingIs in (select aList from aListTable)
           and somethingElse = (
             select aValue
               from aValueTable
              where aValue = 'somethingImportant'
           )
  end else begin
    print('Oh noes! I couldn't find somethingImportant or something in someList form aTable!')
  end
we work with huge amounts of data with systems and processes that don't lend themselves to other methods (yet... it's coming.) so we have to make SQL as readable as possible.

edit: yes... i know... subqueries are bad... it's just an example. those would have been done in joins or as temporary variables or something else.


Yes! All the time, but with some variation. Let me explain by rewriting your query:

    select t1.col1,
           t2.col2,
           t3.col3
      from table1 t1
     inner join table2 t2 on t1.col2 = t2.col1
     inner join table3 t3 on t1.col3 = t3.col1 and t3.col2 = something_else
     where 1 = 1
       and t1.col1 > 0 
       and t2.col2 <> t1.col4
     order by col2
     limit 100
My logic is that I: 1 - There should be a clean vertical line between the select/from/inner/where/etc keyboards to the right hand side for clean alignment. 2 - I always use the where 1 = 1 syntax so that I can easily comment out the following conditions while testing.

And thanks for not capitalising the keywords :-)


This is very close to the style we use. The only big difference is that we opt for CTEs over subqueries whenever possible.


This is the style I use other than I uppercase the reserved words.


We also use right-justified keywords (and never "select *", though I guess that was just to keep your example brief).


Yes I too am of the river people now. +1 for long queries and being able to comment out everything. 30 years SQL. bah!


I also use UPPER case for SQL Keywords to visually distinguish them from user-defined names (i.e. similar purpose to syntax coloring), but I don't collapse SELECT over multiple lines and line up each statement so the conditions line-up, e.g:

    SELECT foo, bar
      FROM baz
     WHERE foo > 3
       AND bar = 'craig.kerstiens@gmail.com'


That covers the simplest case.

What would you do when you're selecting 20 columns from 5 different tables joined together?


Re uppercase: can't afford a syntax-highlighting editor?


It's true that there are some editors which highlight SQL inside strings, but most don't.


sqlparse comes to mind - https://github.com/andialbrecht/sqlparse

I know the maintainer, they also run a service at https://sqlformat.org/

You can try it on the command line e.g. on Fedora:

  $ sudo dnf install python3-sqlparse
  $ echo "select c1, c2, c3 from table1 join table2 t2 on t1.col2 = t2.col1 where 1=2 order by col2 limit 100" | sqlformat -k upper -i lower -sa -
  SELECT c1,
         c2,
         c3
    FROM table1
    JOIN table2 t2
      ON t1.col2 = t2.col1
   WHERE 1 = 2
   ORDER BY col2
   LIMIT 100


Am I the only one who puts commas on the next line? I think of them as an operator analogous to AND or ON:

    select
        t1.col1   as col1
        , t2.col2 as col2
        , t2.col3 as col3
It makes commenting out columns painless.


You're not the only one, but I wish you were, because then I'd have to read a lot less of it.

I doesn't eliminate your commenting problem, it moves your commenting problem from the last line of your list to the first.

Also, it is visually distracting as all get out.


> It makes commenting out columns painless.

No it does not. If you have to comment out the first column you'll have to remove the comma from the second column. It's the same problem that happens if you put the commas at the end of the line, except that the offending line will be the last column. So putting commas at the beginning just makes it look weird because it differs from how we use commas in natural language.


Exactly, You're also hiding the most important part of the line which is the variable name.


Same here ... except the first field would be on the same line as the select, and the comma would be aligned with select's "t"


Yes I do. As you say so I can comment out easily but not so much on with the 'as' and maintaining the river.


Dunno if it's the quality of the posts here, but I find interesting to see that when we talk about SQL nobody complains about it : nobody says there are better alternative, nobody talks about the great schism between the last 2 major revisions of the SQL standard,...

(IMHO, SQL is one of the oldest languages and it is still super powerful and used like hell in production environment, but that's just a point of view)


I have plenty of gripes about SQL but am not aware of any real alternatives ... which I do find bizarre - in the realm of programming languages in general we live in a time of Cambrian explosion in terms of the number and type of languages being used and developed, and yet SQL sits aloof and unassailable on its pedestal.

There are a number of clear paths not taken:

1/ A new, better language that compiles down to SQL;

2/ A set of enhancements to SQL that compile down to regular SQL (<= best imo);

3/ A new language that compiles down to the same AST that is generated from SQL: this of course would need to be done by the database implementors.

That we don't see this I attribute to:

1/ Perception that SQL is "good enough". Which is true as far as it goes, but again odd compared with the endless innovation in the world "normal" programming languages;

2/ Innate conservativism of the database tech community;

3/ Wide use of tools that already sit on top of SQL - query builder APIs, ORMs etc.


I've wondered about this too. I think ORMs serve effectively the same purpose as Javascript transpilers (which gives you 1 & 2)

You can also write stored procedures in Postgres in other languages.


ORMs attempt to bridge object and relational models. (With results that vary from "works fairly well" to "screw it, give me a DB handle and I'll just write SQL".) I see what you're saying, but they operate at a different place than generators/translators.

Stored procedures are just that - they get a name and persistence. PLSQL has procedural extensions, but it is still SQL.

Fabian Pascal (who has made a career out of SQL criticism) proposed a replacement system called Raquel that would have a "real" set-theory-based query language. I don't think it went anywhere.

My best guess as to why there aren't many alternatives: SQL is both mature and Good Enough(tm), but importantly, also unsexy and complicated. Database engine development is more similar to kernel development than, say, web application development, and there are simply far fewer engineers qualified to do it. Building some framework in Javascript currently gets a ton of love, and in comparison is far easier than developing a SQL replacement that is roughly as performant as SQL and either less awkward linguistically or functionally superior.

People do take up those sorts of projects, but they're pretty lonely unless some combination of talent, luck, and timing turns into rock star status (Linus wasn't the only one writing unix clones at the time).


> Stored procedures are just that - they get a name and persistence. PLSQL has procedural extensions, but it is still SQL.

PostgreSQL distributions come with support for PL/Python, PL/Tcl, PL/Perl and PL/pgSQL (a PostgreSQL specific PL/SQL clone). Other languages exist, but aren't included. See https://www.postgresql.org/docs/9.6/static/external-pl.html


Yes. And PL/Perl is perversely fun. I don't think I've ever seen other languages used in production, though[1]; I think folks tend to stick to PLSQL because of defaults.

[1] I'm sure people do, but this discussion has an anecdata deficiency.


I use it (way more than a lot) and I love it, but we could do better.

1) It really needs algebraic data types (and pattern matching!). I refuse to use a general purpose language that doesn't have them, and I always end up struggling to represent them using SQL.

2) Null, and the subsequent binary-but-maybe-ternary logic that it imposes on everything is a terrible hazard. It needs Option types. I guess this goes along with algebraic data types, but it is important enough to merit its own mention.

3) It needs more expressive constraints: foreign keys on partial unique indexes or views, declarable immutability, window constraints, etc.

4) It needs better graph representation and queryability. Anything remotely resembling a graph (or even trees!) ends up as a huge incomprehensible hack. Graphs are still relational, they're just ignored in favor of constructs that only tenably represent them.

5) It needs better ways to present output. Dumping everything as a table leaves the end user to their various hacky devices if what they really want is not an array of flat data structures. Users should be able to easily output data as a graph, tree, arbitrarily nested structs, associative maps, sets, arrays, and even scalars.

6) It needs extensive compile time checking of queries and statements, complete with errors and warnings and lints. As it stands right now, you only get syntax validation and checks for object existence. And although this is implementation level and not language level, SQL engines need to provide way better IDEs to compliment the better compilation.

7) It needs phantom keys for multi-column foreign keys against multi-column primary-keyed tables. As it stands right now, I have to lose the semantic benefits of a natural key in favor of a surrogate key in order to not duplicate a large number of columns in a 1-many table relationship. I should be able to use the many-columned natural foreign key, but invisible and behind the scenes map to a surrogate primary key.


In Enterprise Land a lot of people use ETL tooling as an alternative / in addition to writing SQL.

Databases, data interchange formats, and the concepts they represent are mapped out in graphical tools like ER Studio.

The SQL operations you do to load in data from various sources are done in graphical ETL tools like Microsoft SSIS, Talend etc let you draw diagrams which replace SQL. They are more properly a way of programming an engine that runs the SQL but also does things like monitor for incoming files, query webservices etc. Some also provide a technology neutral way to express transforms that can be run in SQL but also on Hadoop or whatever. That said, you sometimes need to go in and write the SQL for optimization.

Enterprise Service Bus sit between multiple databases and multiple data sources and consumers, and the interfaces these provide replace SQL queries. Typically graphical tools or configuration files are used to configure the relationship between the endpoints they provide and databases.

Even outside enterprise land, people rarely write SQL when defining and interfacing with databases. They use an ORM, and a lot of popular ORMs will create a database and provide a way of accessing it so that you never write SQL.

Technologies like OData mean that applications can query data by proxy without needing to know the model it is stored as, and the mapping between the model OData presents and the database is typically implemented using an ORM.

Cloud computing means people with very large databases increasingly want to spread the processing across many slower machines rather than spending the budget on one specialist data processing server with very fast IO. SQL is replaced by other languages in this case (although SQL like languages are making a bit of a comeback).

Finally, in academia and certain very data datacentric industries, SPARQL and successors are still a thing.

I think there is a big divide between enterprise development and what we learn in school / hackernews here. A lot of people outside the industry are not aware of these ecosystems of applications.

I wouldn't recommend any of these over SQL in all cases. They are technologies that solve a problem and you should consider them if you have that problem.


Hum.. SQL is a "mathematically pure" language, like Lisp or, to a smaller extent, Haskell. Those tend to not attract criticism (mostly because they age slowly, but also because the most vocal language critics are biased towards them).

SQL is also a standardized language. People tend to see those as something you should deal with, not something you should improve. Thus less criticism.

It's also a language that is mostly embedded inside others. So any weakness is easy to compensate on the outer code.

Besides this, it is a pretty well designed language.


The main trouble is that even if a better alternative is suggested, how convince the RDBMS developers to use it?

Unfortunately, RDBMS are highly coupled. I remember when someday I ask if is possible to remove the SQL part of sqlite and substitute with my own, and get laughed!

RDBMS are "expected" to be a black box enclosed in dark mystery.


I make very liberal use of functions in SQL, as much as I would in other languages - any complex calculation or WHERE clause I'll usually abstract out into a function with a meaningful name. Postgres has nice syntactic sugar that makes these look pretty much like columns, so if your function takes a row type as an argument, you can just call it as `row.function_name` instead of `function_name(row)`. With the right hints performance is fine, and where it's not, it's possible to build indexes over function call results.

The article touches on CTEs, while mentioning they are on optimisation boundaries on some platforms (e.g. Postgres). This touches on my main annoyance with SQL - most of the abstractions are very costly. Even just phrasing a query in the simplest, most natural way is rarely optimal. I spend hours of my life wondering why query planners are so deliberately obtuse. And there are countless specific little annoyances, e.g. 'percentile_disc' seeming like a natural fit for calculating a median, but being wildly slower than more hacky solutions.

I would love to know what's next, to be honest. A readable query language, which captures the intention and semantics of a query clearly and minimally, is designed to allow abstractions to be composed together, and compiles to a mostly optimal query plan. I'd probably settle for SQL as-is, if there was an RDBMS with a query planner that was willing to go away and think for five minutes so I could just write the SQL I wanted without thinking too hard about the implementation.

Anyway, the formatting's important, but like any language, the way you can create and compose abstractions, and reveal the intention of your code is the biggest thing for readability.


I prefer any formatting over no formatting. If had to choose then I would do this:

* Lowercase for tables & columns and upper case for syntax.

* Having commas first makes it easier to move lines around.

* I avoid small aliases to minimize conflicts when code is resused.

* Every column needs a table reference. Do not make me guess.

* Alias every column in the SELECT clause.

    SELECT
        table1.col1 AS table_col1
      , table2.col2 AS table_col2
      , table3.col3 AS table_col3
    FROM
        table1 table1
    INNER JOIN
        table2 table2
    ON
        table1.col2 = table2.col1
    INNER JOIN
        table3 table3
    ON
        table1.col3 = table3.col1
    AND table3.col2 = something_else
    WHERE
        table1.col1 > 0
    AND table2.col2 <> table1.col4
    ORDER BY
        table2.col2
    LIMIT 100;


Is it just me, or are the code examples not monospaced ?

This doesn't help when talking about alignment.


It isn't just you, and indeed, the alignment examples make little sense with a proportional typeface.

I checked the CSS, and for some reason the code blocks use the exact same fonts as the rest of the text (albeit with a new font-family clause just for them).


Came here to say the same; really struggled to see the difference between before and after. :) I think I'll enjoy reading the comments here more.

(Quite liked SQL Prompt's auto formatter, but a shame so hideously expensive).


If you're in the world of Microsoft SQL Server:

1. http://poorsql.com/

2. SSMS plug-in: http://architectshack.com/PoorMansTSqlFormatter.ashx#Downloa...

Your life reading and saving SQL will change forever.



It was on HN three weeks ago, here es is the link to the discussion:

https://news.ycombinator.com/item?id=12671667 (146 comments)


My pet hate, directly out of SQL Server Management Studio, is how their code generation places commas. It generates code like this:

  select
      columnA
      ,columnB
      ,columnC
  from
      tableName
Whereas I want my commas AFTER the column names on the same line:

  select
      columnA,
      columnB,
      columnC
  from
      tableName
Or just looks so much neater.


Leading with commas makes it easier to refactor if you're prototyping a query. You're less likely to cause an error when removing a column from:

  SELECT 
      first
    , second
    , third
  ...
than from

  SELECT 
    first,
    second,
    third
  ...
Which becomes:

  SELECT 
    first,
    second,
  ...


At best, that may shave off a few seconds of the time needed to refactor.

The trade-off is code that is harder to mentally parse, because we are used to trailing commas, not leading commas.

If you spend more time reading code than writing it (which I assume applies to the majority of development), then the trailing comma is a much, much better choice of style.


I spend a lot of time writing ad hoc queries for analysis. That means I don't know what columns I want, I might be switching things up from aggregates to subsets and back. That's why you'll find things like:

    WHERE   1=1
    AND     a.id = 12
    --AND     a.Col1 > 23
    AND     a.Col1 = 23
Similarly the leading comma makes it faster to cut out things I'm not using anymore. Now obviously for production code you could argue my reasons are no longer valid - but probably they'll leak through because that's what all my shk hotkeys generate and I'm used to writing


> we are used to trailing commas

Leading commas and other operators are common in some languages. Haskell, for example:

https://github.com/tibbe/haskell-style-guide/blob/master/has...


The time saving isn't as important as avoiding frustration in your tools. Stubbing your toe on silly problems like trailing commas breaks flow and makes exploring less fun.

In my experience, these commas don't add any real meaning to the human readers. Will it really obfuscate the code for a future reader? I can imagine it can look unfamiliar and consequentially grate someone's nerves. Reminds me of the arguments around R's "<-" vs "=".

I agree with the sentiment for most code, but I doubt I spend as much time reading my SQL queries as I spend writing and refactoring.


Allowing commas on the last list item makes it easier to refactor.

Leading commas just shift the problem around, making the beginning of the list hard to change, instead of the end.


Agreed.


Think of the leading commas as being the same as leading AND/OR in a where clause or a leading JOIN in a join clause. I used to prefer the commas after the column names as well, but I've since been converted to the wisdom of placing them to the front. It does take a little while to get used to it, but now I visually prefer it and find that I can read my SQL and find any potential issues much faster. Plus its much better for dynamic SQL and refactoring SQL statements.


I'd love to get HN feedback on a library I wrote to make SQL more manageable - https://github.com/hashedin/jinjasql.

The library is built on the Jinja template engine. You can create macros, functions and other reusable sql fragments, and then use them to create the query. You can also write loops and conditionals, so you can do all the things you routinely do to generate HTML.

The library tracks bind parameters, and doesn't let user input into the query. At the end of the day, you get the generated SQL and an array of bind parameters. You can use these two to execute the query using whatever database & driver combination you like.


So I think it is much easier to read if you can line up like clauses:

  SELECT spf.blah_blah,
         count(*) AS cnt
    FROM jv_CTLG_ENTITY_ATTR_PROD_MAP prod_map
    JOIN jv_SSA_PRODUCT_FACT spf
         ON prod_map.prod_ref_id = spf.PROD_REF_ID
   WHERE prod_map.INVSBL_IND = '0' AND
         spf.LSTG_END_DT >= date_sub(from_unixtime(unix_timestamp()), 2) AND
         spf.LSTG_START_DT <= date_sub(from_unixtime(unix_timestamp()), 2) AND
         spf.lstg_site_id = 0 AND 
         prod_map.ENTITY_ATTR_TYPE_DESC IN ('BMPN','GTIN') AND
         spf.CTLG_ID = 1075 AND
         spf.LSTG_TYPE_CD NOT IN (10,12,15)
  GROUP BY spf.blah_blah;


Only tangentially related, but this reminds me of something I've wanted for a while: Are there any languages for relational queries, but with improved syntax (composability, epxression capture, static typing?) and which "compile" down to SQL?

Ideally, I imagine something that looks something like RethinkDB's ReQL, but DSL-ed up a bit beyond what's possible when it's "just" a JS library (eg, a nicer syntax for referencing table and row instead of "r.table('table name').row('row name')" and support for operator literals so you can write ">" instead of ".gt" or similar.).


There's Alf, which has a functional, pipeline-style, though I don't know if it's still active: http://www.try-alf.org/blog/


Sweet, Alf seems to be exactly what I wanted!

Shame it no longer seems to be under active development, but it seems to be pretty mature?


C# has Linq


I use the following indentation pattern and find it very useful:

    select 
        t1.col1,
        t2.col1

    from 
        table1 t1 
        left join table2 t2 on t2.primarykey = t1.foreignkey

    where 
        t1.somevalue is not null
        and isnull(t2.someothervalue, 0) > 0

    order by 
        t1.someothervalue desc
By using this pattern I can easily locate columns, tables, criterias, grouping fields, orders etc. And it works well with more complicated queries as well given all the subqueries, cases etc. written in a similar fashion.


Is there a good sql autoformatter? For cleaning up ORM-generated queries so I can read them. I've used python's sqlparse but it produces output that's often still unreadable.


IntelliJ's SQL editor is fabulous. Auto formatting and superb autocomplete, even with aliased tables and CTEs. If you don't want the full IntelliJ you can get Datagrip which is just the DB part. Can't say enough nice things about it.


If you happen to use MS SQL Server, there is ApexSQL Refactor, a plugin to SQL Server Management Studio (https://www.apexsql.com/sql_tools_refactor.aspx).

As the name implies it mainly does stuff other than formatting, but the latter is what I use it for, and I am rather happy with it.

When I write SQL myself, I tend for format it rather scrupulously, but when I have to read SQL written by somebody else, it is very useful.


I have used this quite a bit.. http://www.dpriver.com/pp/sqlformat.htm


Redgate's SQL Prompt has a great auto formatter with definable rule sets, if you use SSMS - and have the budget for it.


If you like the formatter in the current version of Prompt, wait until you see what's coming out in version 7.3. They've gone nuts. The new mechanisms for formatting are far beyond anything we've had so far. I'm pretty jazzed about it (disclosure: I work for Redgate).


If Oracle is your flavour of SQL, PL/SQL Developer is very nice and has a good formatter built in.


> CTEs

>First, yes they can be an optimisation boundary. But they can also make your query much more read-able and prevent you from doing the wrong thing because you couldn’t reason about a query.

In my experience, CTEs may be simpler for the writer to reason about, but they make the queries complicated for the reader to understand. If you use a CTE because you find the problem too complicated, chances are that you are not yet understanding the problem well enough to describe it in a single plain SQL query.


Sometimes I understand the problem perfectly well, but a single beautiful query is not performant enough. Anything that can help to write and maintain big ugly queries without a cost in performance is welcome.


Care to elaborate on what situation would a CTE improve performance? Because the OP itself mentioned CTEs degrading performance.


It wouldn't improve performance, but it might improve readability without denigrating performance. I don't have much experience with CTEs, but I do have lots of experience trying to debug unreadable SQL.


> Sometimes I understand the problem perfectly well, but a single beautiful query is not performant enough.

> It wouldn't improve performance, but it might improve readability without denigrating performance.

These two statements of yours are negating each other, unless people started using "performant" as synonymous for readable now.


OK, a clarification.

My first attempt at a query is readable but performance is terrible. I fix the performance problem, but the query is now much less readable. I use a CTE as a compromise between performance and readability.


CTE's are necessary for some types of queries but I've seen developers use them where a simple join would have worked. It can add a lot of unnecessary noise.


Exactly! In some cases like in recursive queries, CTEs may be the only way to achieve some result. In almost all other cases, people use them just because they apparently can't reason about joins, or don't like to do so. I frequently have to deal with SQL that looks like the person was writing an imperative algorithm.


One thing missing that I can't recommend more highly - a comment in the SQL clause indicating where in the code it is being called from. These comments are invaluable when debugging or doing performance tuning of queries, especially when you have a large codebase.

i.e.

    SELECT *
    FROM students
    WHERE 1
    /* somemodule.somemethod */
As for query formatting, as with everything, consistency in style is more important than a dogmatic style guide. My time as a DB admin has given me some specific style guidelines, typically stolen from the formatting provided by the DB itself. More specifically, I match my style to what you see when you run a command like `SHOW CREATE TABLE students;`. I tend to reserve my indentation for things like subqueries.


I agree it would be great to show where the code is being called from, but often we have queries that are called from dozens of different places and it would be unwieldy to add all of those. Plus, if one time a developer starts using a query somewhere in the code and forgets to add that place to the query, all of the sudden all places where the query is called from becomes suspect. I think it is better to give your query block a unique name or namespace that allows you to quickly search all of your code for where that name is used.


Good post. I use one of Jetbrains IDEs and I am big fan of their auto-formatting. You select the dialect of your SQL (or connect it to the data source) and the IDE almost formats it ideally. Sometimes I enforce certain line breaks but for the most part it does a really good job.


DataGrip is pretty excellent from them. The only thing I'm not really happy with is it's support for diagramming table structure, but I haven't found anything that really does a good job of that.


I highly recommend Joe Celko's book, SQL Programming Style. [1] He does deal with formatting, but offers a comprehensive usage standard that goes beyond readability to discussing thorny issues like the taxonomy of words like "type," "category," and "code," as well as referencing relevant international standards for identifier naming.

[1] https://www.amazon.com/Celkos-Programming-Kaufmann-Managemen...


I loathe wide lines. I think it's much easier to mentally parse vertical lists of things and I write my SQL much like this.

In PHP / Python / Go when inlining SQL I also line up sequential items indentation with spaces (which most editors do by default when hitting enter inside a multi line string).

Nice to see this codified. I really just wish I would have worked more in my career on teams that care as much about readability and comprehension in 6+ months as I do.

Hopefully more people will follow the advice in the article and write cleaner, easier to mentally parse SQL.


This is how I write SQL:

  select
     col1
    ,col2
    ,col3
  from table1 a
  
  left join table 2 b
    on a.col1 = b.col2
  
  where 1=1
    and col1 = 'condition'
    and col2 = 'condition2'
  ;


 * everything lower case (except strings)
 * leading commas
 * conditions indented by two spaces
 * select columns indented by two spaces, except the first colunm which is indented by three.
 * where 1=1 for easier commenting/uncommenting of conditions


Came here to find out who else uses leading commas. I learned this several years ago at my current job and love it. I find it much more readable and also makes adding to the list clearer for some reason. Plus if you ever need to insert like 50 commas at the start of a bunch of lines in an ad hoc query it is much easier at the start of the line (yes, I am an animal).


I am a visual thinker. The method of listing columns as rows (i.e., the one line per column format) is too annoying to focus on the underlying code. Splitting a list is something like pagination or even regular text, so I suppose my mind just compensates for that without my thinking about it.

But I just can't do the columns as rows thing. I feel a little bit out in the cold because of it, but I'll just keep reformatting everybody else's code so I can read it. :-)


The document styles make code blocks sans-serif, and then attempt to use monospace indentation—the end result looks worse, not better, than no leading indentation.


I while back I wrote a crappy SQL mapper that has an interesting SQL template language that uses SQL comments to make Jdbc place holders: https://github.com/agentgt/jirm/blob/master/jirm-core/README...

One of the interesting side effects of using the parser was easier to read SQL.


One issue that bit me pretty badly wrt comments is either Perl's DBI or the ODBC layer, but when talking to SQL Server from a Perl script via ODBC, I had to remove all comments from a rather convoluted SQL query.

I ended up keeping the comments in my source code and removing them via a regex at runtime, but it took me quite a while to figure out why my Perl script failed with a query that ran perfectly when copied and pasted into SSMS.


This is what is wrong with 'best practices'

`SELECT foo, bar FROM baz`

Is not at all more legible than:

`SELECT foo, bar FROM baz`

On first glance I even missed the 'bar' column completely and just saw it when compressing this line.

As things get longer it gets more important to make it legible, but saying that my first example is better than the second is just nonsense.

Don't worry so much about what you should or should not do, just use common sense.


Haha woops, multiline did not work.

The first example is supposed to read:

    SELECT foo, 
        bar 
    FROM baz


Am I the only one to see or missing obviously. The following is wrong, where clause has table name (baz)?

    SELECT foo,
       bar
    FROM baz
    WHERE foo > 3
      AND baz =  'craig.kerstiens@gmail.com'
Edit: Technically you can have a column name exactly same as table name, however I am finding it hard to find difference between the two queries presented.


Yeah, definitely a typo in the example, will fix.


> CTEs: First, yes they can be an optimisation boundary.

Note that this varies by database engine. They are an optimisation fence in postgres, they generally aren't in MS SQL Server, I'm not sure about Oracle or DB2 but I think they can optimise predicate application over CTE boundaries too.


This is all trivial. Simple queries are pretty easy to read as long as the formatting is vaguely sane. What we need to talk about is how to deal with queries with multiple nested subqueries (in the select clause and the where clause) and how to make that legible.


I write my SQL as follows:

    SELECT `second_column`,
           `fourth_column`
      FROM `table_name`
     WHERE `first_column` = 'Value'
       AND `third_column` = 3
  ORDER BY `fifth_column` ASC
     LIMIT 1


Well that is cute. I find it odd that the second line of the where clause is on the same level as the where but a neat format otherwise. Tabs or spaces though?


What happens when you need to do something non-trivial? Ie. nested selects, joins or even only some more complex expression in WHERE?


Yes, readability is good, it makes any scripting easier. SQL is no exception. Hand in hand with this is calling things what they are and good comments for the 'why' rather than the 'what' parts of a statement.


I prefer my own style where comma is placed before every column. It makes columns, subqueries and case expressions line up nicely, especially when you have 15 columns or more.


Yeah, I've definitely seen this and tried it myself. Visually for some reason, it just pains me too much. It does make removing lines much easier though so I can understand the appeal.


I do this often mainly due to beine annoyed by missing a comma somewhere or having to worry about commas when commenting out the last item


Just use Redgate SQL Prompt. It's the easiest fix. I do not work for Redgate.


I prefer keywords in their own line eg:

<code> SELECT foo, bar, blah FROM baz WHERE foo > 3 AND blah = 'dont use baz' </code>


Would have been useful to see some thinking on joins and the naming conventions when you have the same column in both tables


Use aliases, and if you have more than two joins you should prefix every column, unless the column names are obvious.


I think that may be more compact, if the IDE can reflect the algebra beneath the SQL clauses.


How do you handle indentation in python (especially if the line is already indented)?


Triple quoted strings. Since SQL doesn't care about whitespace, you can line up your SQL with the rest of your code quite neatly.


Anyone got some good Emacs advice for automated formatting?


SQL is 'blighted' by inertia.


Is there indent for SQL?


Poor Man's SQL Formatter for Notepad++ can do it, it leaves a lot to be desired though in terms of customization. It has saved me a lot of times, since I'd rather read Poor Man's code than the code of the other people.


Another vote for Poor Man's.

But I think it's an important caveat that the whole team use the same formatter and the same settings.

Otherwise diff's turn to useless noise* because Cubicle Bob has personalized his settings.

*Yes, I've tried the Winmerge plugin. It's helpful but not ideal.


For Microsoft's SQL Server Management Studio, try SSMS Boost (http://www.ssmsboost.com/). There is a free community edition.]

They just released a beta for SSMS 2016 that has a lot more flexible SQL formatting features. See http://www.ssmsboost.com/social/posts/m12279-SSMSBoost-v3-0-....


I am using Jetbrains Datagrip to write SQL all day, every day. It can auto-format SQL just like all their other tools can auto-format the language in use. It also has a bunch of other handy stuff like symbol completion from the current database.




Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact

Search: