Hacker News new | past | comments | ask | show | jobs | submit login
Sqlfmt: an opinionated online SQL formatter (cockroachlabs.com)
207 points by mjibson on Sept 27, 2018 | hide | past | favorite | 89 comments

Tangentially related. I'm of the opinion that SQL shouldn't be uppercased. I don't want to hit shift or caps lock and I find lowercase easier to read.

Are there others out there like me? I assume I'm in the minority, but I don't understand why. People stopped uppercasing HTML tags, why can't we do the same with SQL?

First, I'm sure mere inertia plays a role, but there are practical reasons as well if you're writing more than trivial SQL queries.

In most other languages you can read the statements fairly linearly, one statement leads to the next, and to the next, etc. At least within functions/methods and then you usually have indentation to help with understanding boundaries. Each "logical line" may be spread out over one or even several on-screen lines, but it's fairly easy to see the whole logical line and understand it at a glance.

Not always so with SQL.

If I get a complex SQL query to analyze, the first thing I do is not look for the SELECT clause (I'll usually have some sense of that before I start), but rather the FROM clause including it's JOINs and such. When keywords are uppercase, scanning the page for where in the statement to start my reading is much easier. Next I want the WHERE clause and its predicates... and finally I want to see what's being selected. So I'm not reading things nearly as linearly to understand a single logical statement like I would to understand say, a well written function. Also, single queries can be very long, so being able to jump around them where there are not going to necessarily be other cues can be important.

I could be wrong, but the density of user defined names (schemas, tables, columns, etc) also is higher than in other languages, so having the keywords in uppercase makes those more spread-out statement boundaries and fewer numbers of keywords in uppercase again helps me scan the statements for those major parts as I need them.

If you're writing single table queries, no sub-queries, maybe a dozen of columns or so, and only a relatively small number of WHERE clause predicates, then it might not matter much. For better or worse, that's not the world I usually find myself in :-)

Wouldn't syntax highlighting help with most of the issues you've outlined?

If I've got my developer hat on, yeah, sort of. I still find the case differences more pronounced without necessarily forcing me into a more in-my-face code formatting theme. Also, for all the reason A code format is important, the fact that I have 20 years experience with up-to-now SQL norms mean that I'm trained to look for certain things and can, almost without thinking, focus in on those with the help of historical formatting. Arbitrary changes to that history mean that now I have to discard that muscle memory and learn new ways of looking at things... I'm not sure the gain of a few less caps locks keystrokes merits forcing that learning curve. Sure the kids won't care about that because there's nothing to lose, but some of us old timers lose the benefit of our experience.

But developers with good tooling to deal with more static views of the code aren't the only readers of SQL, and perhaps not the most common.

With my DBA hat on, trying to figure out what's going on reading through logs and such, I don't always have that kind of tooling available. In this context the casing can help as well.... and this may form some of my bias; my first professional experiences on the application side of the world were in very much DBA roles dealing with other people's applications running on my servers. Insofar as access tools preserve developer formatting, having good "color-blind" formats made reading complex queries easier.

Yeah indeed -- though I'd prefer to write all lowercase for the sake of ease, I too stick to the 'uppercasing' norms mainly to help with ease of readability -- especially in shell-only environments where syntax highlighting isn't available [e.g., using psql]

Syntax highlighting is a huge boon working with SQL, but unfortunately you often find you are working with SQL embedded as a string in some other programming language.

I wonder if programming languages should have a way of specifying that a string constant contains embedded code or a template, so text editors can display it accordingly.

I deal with a lot of SQL embedded in code, and prefer uppercase keywords as it seems to help me visually parse the overall file.

PyCharm (and likely other editors by Jetbrains) supports embedded programming languages! However, language level support so that this could be automatically done by all editors would be great.

a lot of programming languages do have a way of specifying that a string constant contains embedded code or a template. Sublime Text for example will look for PHP heredocs with an identifier of XML, HTML, SQL etc. to know what to lex/highlight the string as. Or it will check whether the string starts with an uppercase "SELECT", as the chances are it's gonna be SQL. Code fences in GitHub Formatted Markdown are also a good example of this hinting.

I never uppercase the SQL keywords. If it's more than a one-liner select statement, I format it so the indentation tells me what's going on.

To throw out a tossed-off example:

        P.full_name, A.full_address, AT.type as address_type_name
        People as P
        inner join
        Addresses as A on P.address_id=A.id
        P.deleted is null

That's also my preferred approach. When formatting your code like this, the caps don't really offer an advantage anymore.

Plus, it's more comfortable to type as well.

Dimitri Fontaine writes about this in Master PostgreSQL in Application Development. He agrees. I found this to be a great modern sql resource.

If you namedrop something always link https://tapoueh.org/

it’s because if i’m writing sql, i’m definitely angry

I'm all for uppercasing the keywords and lowercasing your variables and columns. It makes for much easier reading...

I do the same. hand-cased words don't uniquely identify keywords or not, and more than once I've seen keywords not-upcased or column names that shadow keywords (with inconsistent upcasing), so they're inherently untrustworthy anyway.

Just use syntax highlighting, which can make it unambiguous what's a keyword vs func vs name.

I agree, I Write SQL code all day long, can’t explain why, I have just always hated upper case, so never used it for SQL. I also like the expressiveness of Postgres double quoted identifiers, so use these a lot in for auto generated app tables, reserving snake case for meta hand coded tables.

I never understood why people want to uppercase anything. It's like source code from 1970, when even bold highlight was something unavailable and developers used uppercase for keywords. I don't uppercase SQL, I prefer software which does proper highlighting.

When I see non uppercased SQL I think I’m reading it from some hipster that just discovered e e cummings and that makes me angry. Added to the fact that I’m writing SQL so I’m already angry (thanks poster above! I’m stealing that).

But seriously upper case helps my eyes focus on the sql commands so that I have some sort of hand hold on the wall of text.

In the beginning, there was uppercase

Originally, back in the Roman era, you had to delimit your VPDATE statements with interpuncts.

I always uppercase the keywords. Maybe for short one-off queries I’ll type all lowercase. But for compound statements, readability is more important, and keywords usually are a small portion of the overall query code anyway

The same can be seen in the fortran community. The common factor is that both languages are case insensitive. All upper case makes no sense whatsoever.

I'm with you, I prefer to uppercase the keywords only, but most SQL formatters are more liberal with their uppercasing.

You are not alone. SQL server and sybase's system stored procedures are mostly lowercased.

My personal style is to lowercase SQL commands, but uppercase or initialcase objects like tables, columns, etc. for readability and minimal use of caps.

That might only work if you came up with the schema or your collation is case-insensitive.

Author here. Based on feedback I am going to make a sqlfmt binary for easy editor integration.

You can currently already do this by using the CockroachDB binary (https://www.cockroachlabs.com/docs/releases/v2.1.0-beta.2018...) but understandably people want a more minimal solution.

Hey, thank you for the formatter! I really like the way it formats statements. Although I have found two issues:

1. It lowercases table and field names, rendering the formatted statement useless for a project with not exclusively lowercased names. Could you please fix this?

2. I'd appreciate accepting common placeholder "?" as a valid part of the SQL statement so that we can format queries with placeholders. Would that be possible?

The formatter adheres to CockroachDB rules which is why it acts like you describe. Case is insensitive in names unless you double quote it. Postgres-style (CockroachDB is one) uses "$1" for placeholders instead of "?". Adding "?" would complicate our grammar enough that it's not worth adding it.

sqlfmt is not super useful for non-postgres grammars.

I understand it about placeholders. Fair enough. I can do quick s/\?/\$1/g, no prob.

Still how about not lowercasing identifiers? This is actually a blocker, I'm sure you see it:)

Consider those of us stuck with Java adapters to Postgres, which means Postgres syntax but ? for placeholders.

Is that a thing? The Java driver must actually parse those and convert them. Java is crazy. https://github.com/mjibson/sqlfmt/issues/25 is the issue to track '?' as placeholders.

It might be crazy, but having just gotten my hands dirty with some seriously DB programming, JDBC has a hell of a lot of value in my view.

Hey! I love your tool.

Any chance of allowing keywords that are specific to a SQL implementation? (e.g. PostgreSQL)

If I enter "CREATE MATERIALIZED VIEW ... WITH DATA;" in for example, it would tell me "materialized" and "with data" is wrong.

As CockroachDB matures these kinds of things will slowly get added.

But sqlfmt doesn't (currently) have a goal to format all SQL, just cockroach SQL.

I thought it was a ploy to get people to install cockroach, that's what I'm doing now.

If you're working with something close to PostgreSQL syntax support, why wouldn't you start with pgFormatter (http://sqlformat.darold.net/)? It's been around for awhile and I can't see anything new or substantially more advanced in the proposed solution.

I have to say, I do most of my development work in the database, and I'm not thrilled with any of the automatic formatters thus far. The ones I've seen can do a good job in many cases, but if you start getting very advanced at all, the weaknesses of an automatic formatting approach start to become more evident. It may just be that anything other than relatively simple SQL, PL/SQL are not well suited for automatic formatting.

I'm still working through a good set of formatting rules, and I always find I end up making exceptions for readability.

pgFormatter formats

  SELECT 1, 2

which is one of the core complaints the article addresses.

That actually seems like a relatively minor complaint on the order of submitting a pull request to an existing fairly good project (as is pgFormatter) rather than coming up with standard 15 for SQL formatting in a new tool. An option where, if the column text doesn't overrun a column limit, is single line and then split once it does overrun doesn't sound like a bad feature, but it does sound like a small one.

(addition) More than anything that's wrong with all of these formatters for SQL is that there real consensus for what good formatting is. There are historical norms for sure, and that's likely a starting point, but reconciliation between the communities around each database vendor (and sometimes each company developing with that database) make that a difficult challenge. So this new tool will become the CockroachDB formatter, pgFormatter is as far as I know the most popular PostgreSQL formatter, other databases your mileage will vary. Each not really an SQL formatter, but an SQL formatter for (stick affinity here). For CockroachDB, I'm not sure how much payoff there is for moving away from the more popular tool for Postgres if they're stressing their similarity to PostgreSQL SQL.

It sounds like this formatter is sufficiently different in its approach (based on the prettier paper) that a PR would basically rewrite the existing one?

My point is why create something new on any basis, if there's a fairly good solution to the problem that you can help improve?

Sure, once you've gone and created a brand new tool you've gone past the point of a PR unless you were starting with a fork. My original question was, why jump to a new tool out of the gate when the differences seem fairly minor. Yes, I'm sure that pgFormatter probably didn't start with the prettier paper... but so what? If you aren't coming up with something really different, I'm not sure that matters.

Actually, thinking about it. What I need more than an SQL formatter is something closer to a linter quite frankly. In complex queries it's easy to miss something... I know, I know... I'm probably talking about something more like an IDE since many of the errors are likely to be logical rather than syntactic, but still... that would be a better problem to solve than a formatter right now.

Brief question, only tangentially related:

Is it me or is "opinionated" software popping up all over the place? It looks to me like the author considers it a selling point, too.

Can someone explain to me why I would want an "opinionated" software over any regular old software? Especially over a configurable software?

I'm not a native speaker, but I never encountered "opinionated" being used with positive connotation. It has so far always been in the "well, uh, he's very opinionated!" sense of apologizing for that old grumpy person who gets on everybody's nerves.

The word does indeed have initially negative connotations.

Perhaps it resonates more positively now to people who have had to work with "flexible" software such as Spring in the old days.

There were 100 ways to skin a cat with old Spring, and each way was a tortuous devil's brew of XML. Googling for any solution was a nightmare, exacerbated by hundreds of wannabe experts polluting the interwebs with their own strange takes.

Spring Boot was introduced in response as an opinionated approach. You can still skin the cat in 100 different ways, but at least you start with sensible defaults that someone far more knowledgeable than yourself had deemed to be good.

That's a good point.

In programming, there's the idea that "The code shouldn't merely work, it should clearly work". If your configuration system becomes so complex that the same principle applies there, that's a bad sign.

sqlfmt is inspired by the style of gofmt (https://blog.golang.org/go-fmt-your-code). The reason a tool having an opinion about formatting is useful is that it removes humans from having to care about formatting. I didn't like how gofmt formatted my Go code at first. But since I couldn't care, I didn't care, and now I never care about formatting. Without an opinion, we spend lots of time quibbling about stuff that doesn't matter.

Not having to argue about whether you want the shed to be blue or red really is super valuable.

One thing I learned somewhat recently as well is that within google3, these automatic formatters come in super handy when doing large-scale changes, e.g. when a library is deprecated and the maintainers are moving all the clients to the new one.

"Opinionated" can make sense because choice increases complexity and potentially decreases interoperability in a team (e.g. the "black" code formatter in Python removes choice from the user, so everyone's code looks the same).

Having a sensible set of defaults can be helpful as well.

That said, sometimes "opinionated" can also just mean "I haven't had the have time to make things configurable yet".

The best of both worlds is "opinionated defaults but configurable". I use an excellent free SQL formatter called ApexSQL Refactor for SQL Server (Windows), and it came with decent defaults, but is also extremely tweakable.

In software it's supposed to mean: does things one way, which is (hopefully) very good, doesn't include a lot of options to change behavior if you don't like it.

It's become popular because people became sick of tools with bad defaults and a million options that could maybe make it better, but nobody uses.

I see it as people reacting to the tendency for software and source code to get quite messy or complex as the number of developers involved grows, more so than one would expect from the result of an increased feature set alone.

Specific (or opinionated) design, interfaces, or formatting feels like an early declaration that the people in charge want to deliberately restrict an aspect of the software, in order to keep things consistent and reduce cognitive load during use or development. Constraints are freedom, and all of that.

There was a trend recently of "unopinionated" software, where it has a ton of options to do exactly what you want. The problem was, it took a long time to configure it, and development was slow because every variant and option had to be tested.

To me, "opinionated" is shorthand for "not a lot of options", so it is less extensible, but will probably also work better and right out of the box.

"opinionated" means some options have been hidden or removed for your sake because in the author's opinion they're wrong/old/unnecessarily complex/etc.

good examples are micro frameworks that wrap very large complex APIs and simply don't pass through whole swaths of complexity.

It Jeans the color of the bikeshed will be decided by one strong opinion and everyone can get back to work.

Great idea.

I work on a project that's grown in complexity and team size. Developers have come and gone. Sometimes a bit of code that's grown over a few years needs its style to evolve. This kind of formatter clearly realizes this.

What I tell developers that complain about this or that style is that it's more important to have a consistent code style throughout the lifetime of a project, then to adopt whatever bracing style someone happens to like at the time.

I'd really like all of my code to be automatically formatted with a tool like this, as part of the build process. This would help with situations where code evolves, as it would reformat it when needed.

The problem with opinions is everyone has a different one.

I like https://www.sqlinform.com/ - keep a copy of notepad around just so I can use it. Wish there was a VSCode extension. The free version was good enough for my needs, but it's so handy I upgraded to the pro version just to show appreciation. The best part is the ability to really customize the layout and save the settings. It's always step #1 when someone sends me a massive blob of badly formatted SQL.

works for the mentioned case of SELECT 1, 2

Web site is kind of horrible.

This looks great! I notice you mentioned Prettier in the post — do you have any plans to integrate sqlfmt with that?

Prettier actually has a beta Postgres plugin, but progress on it seems to have stalled. Hooking it up with sqlfmt would be neat - then prettier could format SQL inside other languages too, like code fences in Markdown or template literals in JS.

I wish they didn't stop development: https://github.com/benjie/prettier-plugin-pg

I use plv8 very often, and this would be perfect.

Looks really cool, but I have one suggestion. Have you considered using leading commas? It makes big SQL statements much easier to edit and it improves diffs. I normally use trailing commas for my code, but most sql dialects don't support it so I fallback to leading ones there.

I believe and direct my team to use leading commas, with fields on individual lines, to facilitate easier copy/paste and also to make it easier to comment out sections of the query.

It does lead to longer queries and poor use of screen real estate, but it is very readable.

Oh god, no. No leading commas, anywhere. Not in javascript, not in SQL. It's like nails on a chalkboard.

Funny because I was reading other people's comments and thinking "wow, that person is VERY opinionated on query formatting" (but now here I am).

> Funny because I was reading other people's comments and thinking "wow, that person is VERY opinionated on query formatting" (but now here I am).

There are opinions, and there are crimes against humanity (although that is a matter of opinion).

For really complex queries it can be totally confusing. At least for me it does. I prefer indentation. I like my SQL blocky.

I'm sad this isn't available as a binary, it's not much use to me as a website.

It's available as part of the cockroach binary (https://www.cockroachlabs.com/docs/releases/v2.1.0-beta.2018...) as a subcommand. And an external contributor has made a more standalone version at https://github.com/lopezator/sqlfmt.

I searched around and found https://github.com/jackc/sqlfmt, which is an older open source project also called sqlfmt, also written in Go. I don't know anything about how they compare.

It simply doesn't work with cockroachdb it throws syntax errors with valid schemas/queries.

I was just wishing for a better SQL formatter! I'm wasting too much time pasting queries into a doc just to reformat them.

It would be great if the parser supported question marks as placeholders. The pg_stat_statements table in Postgres saves denormalized queries, replacing the literal values with '?' placeholders (e.g., `SELECT * FROM table_name WHERE id = ?`).

I love this. Now to implement as:

- javascript lib / (free atom plugin with this)

- jetbrains plugin

- vim plugin

What I want: an SQL (actually, it could be just Postgres as I only use that) parser that takes two schema declarations and calculates the commands needed to go from one to the other, so we can have automatic migrations.

For example, if A is

  CREATE TABLE person (
    id serial PRIMARY KEY,
    name text
and B is

  CREATE TABLE person (
    id serial PRIMARY KEY,
    name text,
    age int
Then this tool would output

  ALTER TABLE person ADD COLUMN age int

I’d love it if StackOverflow would let me select code text and automatically reformat it according to some standard spec (I don’t care which set of opinions are used provided it’s readable!) - too many inexperienced (or unqualified, it seems) people copy and paste ad-hoc written SQL that I always have to manually reformat myself when I edit their questions for readability.

I would love to package this (most likely from https://github.com/lopezator/sqlfmt) for my package manager of choice. What is the license?

Not very aware of license internals, as it uses CockroachDB codebase we'll better ask @mjibson and add a LICENSE to "lopezator/sqlfmt" for better understanding depending on their answer :)

that would be really cool to have as a local package you can plug into your text editor instead of copy and paste. people using this at companies might leak sensitive information or possible attack vectors to the opaque service.

See my other comment. There are a few options for a local binary.

cool thanks!

My opinion: text should never be aligned to right as the keywords here are.

That's exactly what makes this formatter so good.

The style is similar to https://www.sqlstyle.guide/, which is excellent.

I would challenge the decision to UPPER CASE keywords, it looks oldschool and is unnecessary since syntax highlighting is ubiquitous. And right-justifying is something that only looks clean for the simplest of queries and wreaks havoc with editors that detect mixed tabs/spaces.

I still prefer upper case and indentation. It's cleaner imo. Lower casing makes everything blend together regardless of modern editors highlighting the text for key commands.

standalone Query parser would be nice :)

In case you're looking for something Postgres syntax compatible, and find cgo acceptable: https://github.com/lfittl/pg_query_go

(this uses the actual Postgres parser wrapped into a library)

Wow nice thank you for pointing it out!

not sure what this offers over




both of which seem to give nicer output and handle more diverse sql without erroring.

also not really sure why it considers itself "opinionated"

Why all the SCREAMING? Is something wrong?

It's 2018, we aren't using teletypes with bad shift keys and we all have syntax coloring IDEs, we don't need to scream out all our KEYWORDS like we were still writing COBOL or FORTRAN in the 1960s.

It's become my biggest pet peeve with most SQL formatters. I think sentence cased SQL is far more readable.

    Select * from sometable where SomeColumn > 3
But I realize I'm a minority with the opinion right now.

Yours is a fair opinion and I can't come up with any reason why it's wrong.

Personally I like it because it clearly separates keywords from objects (tables, columns, databases, etc).

This is very much like the tabs vs. spaces debate.

Syntax highlighting already clearly separates keywords from everything else, and more flexibly to individual taste. (The keywords being both uppercase and bright bold purple on the linked page here particularly caught my personal distaste.)

Certainly there are still far too many places people are writing/reading/interacting with SQL without the benefit of syntax highlighting, but we have the technology to solve that (and in many cases need to solve that; a lot of the places that SQL is embedded and the outer language has no idea about are opportunities for SQL injection attacks).

> This is very much like the tabs vs. spaces debate.

It's sadly worse than that. It's something like 60 years of momentum versus user/developer experience and readability concerns.

Despite my under-appreciated jokes that most SQL databases probably don't need to be shouted out to do their jobs (Oracle maybe being an exception. ;), it actually is a serious readability issue.

We have decades of research showing that people read/scan by word shape and that lower-case and mixed-case is far faster and easier to scan/read than all caps. 'from' has a more interesting and relatively more unique shape than 'FROM', making it easier to spot. (Especially in a monospace language where the outermost shape of all four letter all caps words is essentially the same width and height rectangle.) (If you try a syntax highlighting scheme that does something more interesting than bolding keywords, italicizing them with a good italic monospace font, the lower-case word 'from' can be almost beautiful and very easy to find. A nice benefit to syntax highlighters is that sort of variation is a personal option between you and your editors of choice.)

Typesetters have known for a long time that all caps is slower to read, and should be used sparingly because it exhausts readers.

Sentence-casing SQL statements benefits from the readability of all the English we read everywhere else.

I make jokes about shouting in our database queries, because it's sadly easier than changing more than a half-century of momentum in the status quo, but I do think that all-caps keywords in SQL statements is a huge papercut in dev productivity/ergonomics that we have the technology to fix but apparently not the interest/willpower.

I wrote an SQL formatter as part of a larger source-to-source compiler for a re-engineering tool many years ago and ended up with initial caps for SQL keywords. E.g.

    Select * From sometable Where SomeColumn > 3
It doesn't SCREAM but provides some visual handle for spotting the keywords. It's particularly effective if the table and column names use a lowercase convention which many of systems we had to deal with did.


"As my code gets more complicated or deeplier nested"

OMG will someone please tell college professors about Clean Code? Perhaps it takes years of writing deeplier nested code before you even know what Uncle Bob is talking about.

Applications are open for YC Winter 2022

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact