I'm glad to see pipe syntax moving forward! In my time at Google, I wrote a lot of SQL queries that required a long sequence of intermixed filtering, grouping, aggregation, and analytic window operations. But the ordering of those operations rarely matched SQL's inside-out syntax, so I commonly had to chop up my logic and glue it into multiple chained CTEs. This busywork made my code's logic harder to follow and was frustrating, especially since I was familiar with alternative coding models, such as R's dplyr, that would have allowed me to express my logic as a simple linear pipeline.
I hope SQL pipes become commonplace so that I can use them anywhere I have to write SQL queries.
What’s wrong with CTEs though? I have never thought of them as busywork and start with them if I know my query is going to be a multi-step process. I already thought of them as “pipes” (in the abstract, not related to this paper). If the query flow is linear, so is the sequence of CTEs.
The only slightly annoying thing with CTEs is that you always have to name them. You might consider this an overhead if the flow is strictly linear. But when the flow is not linear (eg when doing two different aggregations over the same data, to join them later) you need a way to refer to the intended CTE, so always requiring a name does add some consistency and doesn’t seem a big deal overall.
CTEs are not inherently busywork. I rather like them. What is busywork is having to chop a linear flow of operations into chunks and then wrap those chunks in CTEs that you must wire together. All this, simply because the SQL syntax doesn't let you express that flow directly.
Depends on the DB engine, I suppose. I've come across cases where certain operations were not allowed in CTEs, and they can be an optimization barrier.
However if your query is dynamically modified at runtime, then CTEs can be a no-go. For example, we have a grid component which first does a count and then only selects the visible rows. This is great if you have expensive subselects as columns in a large table. However to do the counting it turns the main query into a sub-query, and it doesn't handle CTEs.
Understood. I should have asked my question a bit more specifically: what's wrong with CTEs that wouldn't be an issue with this new pipe syntax. I briefly scanned the paper and it appears there aren't any specific benefits to the pipe syntax that would make optimization easier. So we can expect that if a SQL engine doesn't optimize CTEs well it would likely have the same limitations for the pipe syntax.
Section 2.1.4 of the paper lists the benefits of the pipe syntax over CTEs, and they are all based on ergonomics. As someone who has never had issues with the ergonomics of CTEs, I must say I am not convinced that the proposed syntax is better. It may be that I've been doing SQL for so long that I don't see its warts. Overall SQL feels like a very well designed and consistent language to me. The new pipe syntax appears to bolt on an imperative construct to an otherwise purely functional language.
> The new pipe syntax appears to bolt on an imperative construct to an otherwise purely functional language.
It's not imperative. The pipe symbol is a relational operator that takes one table as input and produces one as output. It's still purely functional, but it has the advantage of making the execution order obvious. That is, the order is a linear top-down flow, not the inside-out flow implicit in vanilla SQL. Further, when your wanted flow doesn't match vanilla SQL's implicit ordering, you don't have to invent CTEs to wire up your flow. You just express it directly.
As for ergonomics, consider a simple task: Report some statistics over the top 100 items in a table. Since LIMIT/ORDER processing is last in vanilla SQL's implied ordering, you can't directly compute the stats over the top items. You must create a CTE to hold the top items and then wire it into a second SELECT statement to compute the stats. That's busywork. With pipe syntax, there's no need to invent that intermediate CTE.
> It's not imperative. The pipe symbol is a relational operator that takes one table as input and produces one as output.
Maybe I used the wrong term. In my mental model, the query planner decides the order in which the query is evaluated based on what table stats predict is most efficient query plan, and I actually don't really want to think about the order too much. For example, if I create a CTE, I don't necessarily want it to be executed in that order. Maybe a condition on the later query can be pushed back into the earlier CTE so that less data can be scanned.
I will admit that technically there should be no difference in how a query planner handles either. But to me the pipe syntax does not hint as much at these non-linear optimizations as CTEs do. I called the CTE syntax more functional because it implies less to me.
> but it has the advantage of making the execution order obvious.
So we're back to ergonomics which I just never had an issue with...
> As for ergonomics, consider a simple task: Report some statistics over the top 100 items in a table. Since LIMIT/ORDER processing is last in vanilla SQL's implied ordering, you can't directly compute the stats over the top items.
Could I not compute the stats over all values, then order and limit them, and depend on the query planner to not do the stat calculation for items outside the limit? If the order/limit does not depend on a computed statistic that should be possible? Or does that not happen in practice?
No, the wanted stats are a function of the top 100 items.
As a concrete example, consider computing the average sales volume by category for the top 100 items. Here's the vanilla SQL for it:
WITH
  TopItems AS (
    SELECT category, sales_volume
    FROM Items
    ORDER BY sales_volume DESC
    LIMIT 100
  )
SELECT category, AVG(sales_volume) AS avg_sales_volume
FROM TopItems
GROUP BY category;
Because ORDER/LIMIT processing is implicitly last in vanilla SQL, if you need to do anything after that processing, you must do it in a new SELECT statement. Thus you must capture the ORDER/LIMIT results (e.g., as a CTE or, heaven forbid, as a nested SELECT statement) and then wire those results into that new SELECT statement via its FROM clause.
In contrast, with SQL pipes you can express any ordering you want, so you can feed the ORDER/LIMIT results directly into the statistical computations:
FROM Items
|> ORDER BY sales_volume DESC
|> LIMIT 100
|> AGGREGATE AVG(sales_volume) AS avg_sales_volume
   GROUP BY category
That's way simpler and the data flows just as it reads: from top to bottom.
Okay, thanks for that example. The ability of the pipe syntax to re-order the standard SQL pipeline order does indeed provide for more compact queries in this case.
At least in SQL Server, CTEs are syntax-level, so multiple uses of a CTE in a query cause it to get expanded in each of those places, which typically increases the complexity of the query and can cause issues with the optimizer and performance.
The paper's section 5.1 (“Usage at Google”) and its Figure 4 (page 9) show the usage growing steadily, “despite limited documentation and incomplete tooling” — of course 1600 users is still small relative to the number of SQL users at Google, and it's possible the growth will plateau at some point, but I for one adopted pipe syntax the very day I saw it (even sent an excited email to their mailing list), and have been using it ever since. As it's an extension, it's always possible to use regular SQL for parts of the same query (copying others' queries etc.), but often I find myself rewriting queries into pipe SQL and find it significantly simplifies them (CTEs can be removed, etc.).
FROM r JOIN s USING (id)
|> WHERE r.c < 15
|> AGGREGATE sum(r.e) AS s GROUP BY r.d
|> WHERE s > 3
|> ORDER BY d
|> SELECT d, s, rank() OVER (order by d)
Can we call this SQL anymore after this? This re-ordering of things has been done by others too, like PRQL, but they didn't call it SQL. I do think it makes things more readable.
The point of SQL pipe syntax is that there is no reordering. You read the query as a sequence of operations, and that's exactly how it's executed. (Semantically. Of course, the query engine is free to optimize the execution plan as long as the semantics are preserved.)
The pipe operator is a semantic execution barrier: everything before the `|>` is assumed to have executed and returned a table before what follows begins.
From the paper:
> Each pipe operator is a unary relational operation that takes one table as input and produces one table as output.
Vanilla SQL is actually more complex in this respect because you have, for example, at least 3 different keywords for filtering (WHERE, HAVING, QUALIFY) and everyone who reads your query needs to understand what each keyword implies regarding execution scheduling. (WHERE is before grouping, HAVING is after aggregates, and QUALIFY is after analytic window functions.)
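Roughly, for a made-up table (QUALIFY is non-standard and only some engines support it):
SELECT category, SUM(sales) AS total_sales
FROM items
WHERE sales > 0              -- filters rows before grouping
GROUP BY category
HAVING SUM(sales) > 100;     -- filters groups after aggregation

SELECT item_id, sales
FROM items
QUALIFY RANK() OVER (ORDER BY sales DESC) <= 10;   -- filters on window-function results
With pipe syntax, all three become a WHERE placed wherever the filter belongs in the flow.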
Golly, QUALIFY, a new SQL operator I didn’t know existed. I tend not to do much with window functions and I would have reached for a CTE instead but it’s always nice to be humbled by finding something new in a language you thought you knew well.
It's not common at all; it's a non-ANSI SQL clause that AFAIK was created by Teradata, syntactic sugar for filtering using window functions directly without CTEs or temp tables, especially useful for dedup. In most cases at least; for example, you can't do a QUALIFY in a query that is aggregating data, just as you can't use a window function when aggregating.
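The classic dedup pattern looks something like this (made-up table and column names):
SELECT *
FROM events
QUALIFY ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_ts DESC) = 1;
Without QUALIFY you'd have to wrap the ROW_NUMBER() in a CTE or subquery before you could filter on it.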
Other engines that implement it are direct competitors in that space: Snowflake, Databricks SQL, BigQuery, ClickHouse, and duckdb (the only OSS implementation I know of). Point is: if you want to compete with Teradata and be a possible migration target, you want to implement QUALIFY.
Anecdote: I went from a company that had Teradata to another where I had to implement all the data stack in GCP. I shed tears of joy when I knew BQ also had QUALIFY. And the intent was clear, as they also offered various Teradata migration services.
> The point of SQL pipe syntax is that there is no reordering.
But this thing resembles other FROM-clause-first variants of SQL, thus GP's point about this being just a reordering. GP is right: the FROM clause gets re-ordered to be first, so it's a reordering.
> The pipe operator is a semantic execution barrier: everything before the `|>` is assumed to have executed and returned a table before what follows begins
I already think about SQL like this (as operations on lists/sets); however, thinking of it like that, with previous operations feeding into the next, which is conceptually nice, seems to make it harder to do, and to think about:
> (the query engine is free to optimize the execution plan as long as the semantics are preserved)
since logically each part between the pipes doesn't know about the others, global optimizations, such as using indexes to restrict the result of a join based on the WHERE clause, can't be done or are more difficult.
This kind of implies there's better or worse ordering. AFAIK that's pretty subjective. If the idea was to expose how the DB is ordering things, or even make things easier for autocomplete OK, but this just feels like a "I have a personal aesthetic problem with SQL and I think we should spend thousands of engineering hours and bifurcate SQL projects forever to fix it" kind of thing.
The post I was referring to said that this new pipe syntax was a big reordering compared to the vanilla syntax, which it is. But my point is that if you're going to understand the vanilla syntax, you already have to do this reordering in your head, because the order in which the vanilla syntax executes (inside out) is the order in which the pipe syntax reads. So it's just easier all around to adopt the pipe syntax so that reading and execution are the same.
This is an extension on top of all existing SQL. The pipe functions more or less as a unix pipe. There is no reordering, but the user selects the order. The core syntax is simply:
query |> operator
Which results in a new query that can be piped again. So e.g. this would be valid too:
SELECT id,a,b FROM table WHERE id>1
|> WHERE id < 10
The multiple uses of WHERE with different meanings are problematic for me. The second WHERE, filtering an aggregate, would be HAVING in standard SQL.
Not sure if this is an attempt to simplify things or an oversight, but favoring convenience (no need to remember multiple keywords) over explicitness (but the keywords have different meanings) tends to cause problems, in my observation.
I was not there at the original design decisions of the language, but I imagine it was there specifically to help the person writing/editing the query easily recognize and interpret filtering before or after an aggregation. The explicitness makes debugging a query much easier and ensures it fails earlier. I don't see much reason to stop distinguishing one use case from the other, I'm not sure how that helps anything.
I also wasn't there, but I think this actually wasn't to help authors and instead was a workaround for the warts of SQL. It's a pain to write
SELECT * FROM (SELECT * FROM ... GROUP BY ...) t WHERE ...
and they decided this was common enough that they would introduce a HAVING clause for this case
SELECT * FROM ... GROUP BY ... HAVING ...
But the real issue is that in order to make operations in certain orders, SQL requires you to use subselects, which require restating a projection for no reason and a lot of syntactical ceremony. E.g. you must give the FROM item a name (t), but it's not required for disambiguation.
Another common case is projecting before the filter. E.g. you want to reuse a complicated expression in the SELECT and WHERE clauses. Standard SQL requires you to repeat it or use a subselect since the WHERE clause is evaluated first.
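Something like this, with hypothetical table and column names:
-- Standard SQL: repeat the expression (or push it into a subselect):
SELECT order_id, price * quantity * (1 - discount) AS net_amount
FROM Orders
WHERE price * quantity * (1 - discount) > 1000;

-- Pipe syntax: compute it once, then filter on the result
-- (EXTEND adds a computed column, if I'm reading the paper right):
FROM Orders
|> EXTEND price * quantity * (1 - discount) AS net_amount
|> WHERE net_amount > 1000
|> SELECT order_id, net_amount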
I think this stems from the non-linear approach to reading a SQL statement. If it were top-to-bottom linear, like PRQL, then the distinction does not seem merited. It would then always be filtering from what you have collected up to this line.
I think the original sin here is not making aggregation an explicitly separate thing, even though it should be. Adding a count(*) fundamentally changes what the query does, and what it returns, and what restrictions apply.
If you perform an aggregation query in a CTE, then filter on that in a subsequent query, that is different, because you have also added another SELECT and FROM. You would use WHERE in that case whether using a CTE or just an outer query on an inner subquery. HAVING is different from WHERE because it filters after the aggregation, without requiring a separate query with an extra SELECT.
> HAVING is different from WHERE because it filters after the aggregation, without requiring a separate query with an extra SELECT.
Personally I rarely use HAVING and instead use WHERE with subqueries for the following reasons:
1. I don't like repeating/duplicating a bunch of complex calcs; it's easier to just do WHERE in an outer query on the result.
2. I typically have outer queries anyway for multiple reasons: to break the logic into reasonable chunks for humans, and also for join/performance reasons (to give the optimizer a better chance at not getting confused).
You can always turn a HAVING in SQL into a WHERE by wrapping the SELECT that has the GROUP BY in another SELECT that has the WHERE that would have been the HAVING if you hadn't bothered.
You don't need a |> operator to make this possible. Your point is that there is a reason that SQL didn't just allow two WHERE clauses, one before and one after GROUP BY: to make it clearer syntactically.
Whereas the sort of proposal made by TFA is that if you think of the query as a sequence of steps to execute then you don't need the WHERE vs. HAVING clue because you can see whether a WHERE comes before or after GROUP BY in some query.
But the whole point of SQL is to _not have to_ think of how the query is to be implemented. Which I think brings us back to: it's better to have HAVING. But it's true also that it's better to allow arbitrary ordering of some clauses: there is no reason that FROM/JOIN, SELECT, ORDER BY / LIMIT have to be in the order that they are -- only WHERE vs. GROUP BY ordering matters, and _only_ if you insist on using WHERE for pre- and post-GROUP BY, but if you don't then all clauses can come in any order you like (though all table sources should come together, IMO).
> The second WHERE, filtering an aggregate, would be HAVING in standard SQL.
Only if you aren't using a subquery otherwise you would use WHERE even in plain SQL. Since the pipe operator is effectively creating subqueries the syntax is perfectly consistent with SQL.
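E.g. the second WHERE in the example upthread corresponds to something like this in plain SQL (ignoring the ORDER BY and window function at the end):
SELECT d, s
FROM (
  SELECT r.d AS d, sum(r.e) AS s
  FROM r JOIN s USING (id)
  WHERE r.c < 15
  GROUP BY r.d
) t
WHERE s > 3;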
Perhaps, but then you lose the option of writing WHERE/HAVING sum(r.e) > 3, so if you forget what the alias s means, you have to figure that part out before proceeding. Maybe I'm just used to the existing style, but as stated earlier, this seems to reduce explicitness, which IMO tends to lead to more bugs.
A lot of SQL engines don't support aliases in the HAVING clause and that can require duplication of potentially complex expressions which I find very bug-inducing. Removing duplication and using proper naming I think would be much better.
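For example (made-up names):
SELECT customer_id,
       SUM(CASE WHEN status = 'refunded' THEN amount ELSE 0 END) AS refunded_total
FROM orders
GROUP BY customer_id
HAVING SUM(CASE WHEN status = 'refunded' THEN amount ELSE 0 END) > 100;
-- the whole aggregate expression is repeated because, in many engines,
-- HAVING can't reference the refunded_total alias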
I will already use subqueries to avoid issues with HAVING.
> A lot of SQL engines don't support aliases in the HAVING clause
We're moving from SQLAnywhere to MSSQL, and boy, we're adding 2-5 levels of subqueries to most non-trivial queries due to issues like that. Super annoying.
I had one which went from 2 levels deep to 9... not pleasant. CTEs had some issues so couldn't use those either.
I'm surprised you had issues with CTEs -- MS SQL has one of the better CTE implementations. But I could see how it might take more than just trivial transformations to make efficient use of them.
One issue, that I mentioned in a different comment, is that we have a lot of queries which are used transparently as sub-queries at runtime to get count first, in order to limit rows fetched. The code doing the "transparent" wrapping doesn't have a full SQL parser, so can't hoist the CTEs out.
One performance issue I do recall was that a lateral join of a CTE was much, much slower than just doing 5-6 sub-queries of the same table, selecting different columns or aggregates for each. Think selecting sum packages, sum net weight, sum gross weight, sum value for all items on an invoice.
There were other issues using plain joins, but I can't recall them right now.
CTEs (at least in MS SQL land) are a syntax-level operation, meaning they get expanded as if you had written the same subquery at each place the CTE was referenced, which frequently impacts the optimizer and performance.
I like the idea of CTEs, but I typically use temp tables instead of CTEs to avoid optimizer issues.
I use them on purpose to "help" the optimizer by reducing the search space for the query plan (knowing that query-plan optimization is a combinatorial problem and the optimizer frequently can't evaluate enough plans in a reasonable amount of time).
Should we introduce a SUBSELECT keyword to distinguish between a top-level select and a subquery?
To me that feels as redundant as having WHERE vs HAVING, i.e. they do the same things, but at different points in the execution plan. It feels weird to need two separate keywords for that.
In order for a thing to be considered legacy, there needs to be a widespread successor available.
SQL might have been invented in the 70s but it's still going strong as no real alternative has been widely adopted so far - I'd wager that you will find SQL at most software companies today.
I mean kinda? It's legacy in the "we would never invent this as the solution to the problem domain that's today asked of it."
We would invent the underlying engines for sure but not the language on top of it. It doesn't map at all to how it's actually used by programmers. SQL is the JS to WebAssembly, being able to write the query plan directly via whatever language or mechanism you prefer would be goated.
It has to be my biggest pain point dealing with SQL, having to hint to the optimizer or write meta-SQL to get it to generate the query plan I already know I want dammit! is unbelievably frustrating.
> having to hint to the optimizer or write meta-SQL to get it to generate the query plan I already know I want dammit!
That's not in the domain of SQL. If you're not getting the most optimized query plan, there is something wrong with the DBMS engine or statistics -- SQL, the language, isn't supposed to care about those details.
That's my point, I think we've reached the point where SQL the langage can be more of a hindrance than help because in a lot of cases we're writing directly to the engine but with oven mitts on. If I could build the query from the tree with scan, filter, index scan, cond, merge join as my primitives it would be so nice.
That's the thing though, I still want my data to be relational so NoSQL databases don't fit the bill. I want to interact with a relational database via something other than the SQL language and given that this language already exists (Postgres compiles your SQL into an IR that uses these primitives) I don't think it's a crazy ask.
> It's legacy in the "we would never invent this as the solution to the problem domain that's today asked of it."
I don't think that definition of legacy is useful, because so many things which hardly anyone calls "legacy" fit the definition - for example: JavaScript as the web standard, cars in cities, and two-party democracy.
I think many of us would say that none of these is an ideal solution for the problem being solved, but it's what we are stuck with, and I don't think anyone would call them "legacy systems" until a viable successor is widespread.
Not bad, very similar to dplyr syntax. Personally I'm too used to classic SQL though, and this would be more readable as CTEs. In particular, how would this syntax fare if it were much more complicated, with 4-5 tables and joins?
Maybe not, just as we don't call "rank() OVER" SQL. We call it SQL:2003. Seems we're calling this GoogleSQL. But perhaps, in both cases, we can use SQL for short?
This is consistent, non-pseudo-English, reusable, and generic. The SQL standard largely defines the aesthetic of the language, and is in complete opposition to these qualities. I think it would be fundamentally incorrect to call it SQL.
Perhaps if they used a keyword PIPE and used a separate grammar definition for the expressions that follow the pipe, such that it is almost what you’d expect but randomly missing things or changing up some keywords.
Honestly SQL screwed things up from the very beginning. "SELECT FROM" makes no sense at all. The projection being before the selection is dumb as hell. This is why we can’t get proper tooling for writing SQL, even autocompletion can’t work sanely. You write "SELECT", what’s it gonna autocomplete?
PRQL gives me hope that we might finally get something nice some day
Doesn’t change anything, you can still have the select at the end, and optional from and joins at the beginning. In your example, the select could be at the end, it’s just that there’s nothing before.
I also hate having SELECT before FROM because I want to think of the query as a transformation that can be read from top to bottom to understand the flow.
But I assume that that’s part of why they didn’t set it up that way — it’s just a little thing to make the query feel more declarative and less imperative
At this point I think that vanilla SQL should just support optionally putting the from before the select. It's useful for enabling autocompletion, among other things.
My initial reaction is that the pipes are redundant (syntactic vinegar). Syntactic order is sufficient.
The changes to my SQL grammar to accommodate this proposal are minor. Move the 'from' rule to the front. Add a star '*' around a new filters rule (e.g. zero-or-more, in any order), removing the misc dialect-specific alts, simplifying my grammar a bit.
To see how well it works, I uploaded the PDF to Google AI Studio and ran the prompt "Convert this document to neatly styled semantic HTML" against the gemini-1.5-pro-exp-0801 model - the result is actually pretty good!
I would welcome this! I’ve been using SQL for over 20 years and I still sometimes have trouble expressing certain queries, which in my head are well understood. I typically get there but I wish the amount of trial and error was shorter, and the process of turning what I have in my head inside out disappeared.
Absolutely. As soon as I start doing multiple joins and throwing in all sorts of trailing syntax to further refine/filter/group, I get lost.
But piping is both intuitive and very easy to understand without having to get everything right from the beginning.
I really hope Postgres 16 can implement pipes. It might be interesting to be able to use triggers and functions to pipe to and from; that might actually simplify that aspect of it too.
Same here. I often connect to mariadb via the console and try my queries there. I confess that I've used GPT to help me with some hairy ones, and it worked quite well.
It's difficult for me to tell how sincere your confession was, I am guessing you meant it in jest.
But it is worth stating nonetheless. No one confesses when they search StackOverflow for something they've solved 100 times before, or when they ask Google a question a dozen times a day. Asking ChatGPT for some insight should equally not be considered sinful :)
There are variations between mysql and mariadb; I've used mysql for a long time now, and when I switched to mariadb I found those differences. I find writing complex queries in the console first and then committing them to the code a good strategy. I can see the results right away, along with any errors I might have made.
I've got plenty of complaints about SQL, but as one of the most useful programming languages to have ever been invented, I have to say that syntax complaints are one of lowest items I would have on my list of things to be prioritized for a change. Sure, the syntax could be better, but why do we care so much about it over the dozens of other problems with it?
How about we get a SQL successor with algebraic data types, true boolean logic (as opposed to SQL's ternary logic), or functional composition? Null values are the bane of any query writer's existence, and we should have a reasonable solution by now...we've already done it with other programming languages.
It sounds like you want SQL to be more like a "real programming language", but I feel like there's a real chasm. SQL itself seems to be oriented towards "non-programmers" which is why it has declarative English syntax. But so many systems that interface with SQL databases are software systems written by programmers.
Why are our programming languages, which have rich ADTs, Boolean logic, etc. serialising queries into an English syntax written for ad-hoc business intelligence tasks? Why not have a binary query interface to the database that provides a programmatic, rather than human-readable, API?
The first time I got a "too many bind variables" error I was flabbergasted. All I wanted to do was insert a ton of rows into a table. But the database expects me to construct an English sentence containing a placeholder for each value of each row?
It definitely makes things easier to follow, but only for linear, i.e. single-table, transformations.
The moment joins of multiple tables come into the picture, things get hairy quickly, and then you actually start to appreciate plain old SQL, which accounts for exactly this and allows you to specify column aliases throughout the entire CTE clause. With this piping you lose the scope of the table aliases, and then you have to use weird hacks like mangling the names of the joined-in table in Polars.
For single-table processing the pipes are nice though. Especially eliminating the need for multiple different keywords for filtering based on the order of execution (WHERE, HAVING, QUALIFY (and the pre-join filter, which is missing)).
A missed opportunity here is the redundant [AGGREGATE sum(x) GROUP BY y]. Unless you need to specify rollups, [AGGREGATE y, sum(x)] is a sufficient syntax for GROUP BYs, and the duckdb folks got it right in the relational API.
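i.e. instead of the paper's form, a hypothetical shorthand where the grouping keys are listed inline with the aggregates:
-- Paper's syntax:
FROM t
|> AGGREGATE sum(x) AS total GROUP BY y

-- Hypothetical shorthand (not in the paper), like duckdb's relational API:
FROM t
|> AGGREGATE y, sum(x) AS total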
This may be the most practical way to make things better for developers, analysts, data scientists...
There have been so many attempts to alleviate the difficulty of writing SQL -- from ORMs, to alternate syntaxes, to alternate databases -- and none have been successful. The authors identify many reasons for this. The approach the authors have taken is incremental and makes only slight demands on people already familiar with elementary SQL -- and even people familiar with advanced SQL will likely find this approach to be easier to use for advanced queries.
The pipeline syntax as presented is nicer than the status quo, but I'd prefer a syntax that models query execution as a directed graph of operations. Doing so would make some of the more complex SQL query constructs much more straightforward to represent:
* Joins can be modelled as a "cross-referencing" operation that consume two (or more) data streams and produce a single data stream
* CTEs can be modelled as producing multiple data streams
* Recursive CTEs can be modelled as cycles in the execution graph
Yes DOT (and the other UML whatever languages) are absolutely the only extant examples that make an attempt. But again, if you look at DOT you'll see it doesn't actually do anything syntactically - it just has syntax for edge lists.
BTW, PRQL is mentioned in the article as well. In sum, they decided that instead of a new language against SQL, extending SQL with pipes would be better for user experience, learning curve, etc.
Yeh, when I saw the example above I thought it was KQL.
But they do sort of acknowledge it in the paper. Eg on the first page it says:
> …we present a solution – adding pipe-structured data flow syntax to SQL. This makes SQL more flexible, extensible and easy to use. This paradigm works well in other languages like Kusto’s KQL[5]
Strange typo though, to say “Kusto’s KQL” instead of “Microsoft’s KQL”
Kusto is allegedly named after (sort of in reference to) Jacques Cousteau, so “Kusto’s” doesn’t make sense.
unfortunately KQL doesn't seem to have INSERT, UPDATE etc. support, it seems to be a pure query language for querying. Unless this strange different .insert syntax is what they intended for their language from the start? I don't know: https://learn.microsoft.com/en-us/kusto/management/data-inge...
> unfortunately KQL doesn't seem to have INSERT, UPDATE etc.
(Disclaimer: I'm an engineer at Microsoft, and I use Kusto basically every day)
This seems to me to be a deliberate design choice. Microsoft doesn't want engineers mutating the databases by hand. There are mechanisms to do that (mostly outside of Kusto, and usually to resolve privacy incidents), but the common case of querying is not supposed to allow for arbitrary changes to the database.
Does KQL still enforce no blank lines in the `let` clauses to the expression?
When I last used KQL, it was infuriating that I couldn't create my `let` clauses in chunks separated by whitespace, because a blank line would be considered a terminated statement (and Kusto would say "Hey, where's your expression?!"). This meant every Kusto file was a sea of text with no clear differentiation between subsequent clauses. I ended up using 3 blank comment lines as a "fake empty line" just to maintain my sanity.
Looks like it's borrowed from Elixir. If the old SQL syntax will still be supported then why not, but this one will introduce a lot less readable queries when multiple JOINs, subqueries and aggregations are involved; those are very easy in plain SQL and here they will probably look bad.
Piping syntax is particularly useful when querying time series too, especially when we need to write complex queries in a one-liner to share with others.
I understand the desire to reorder clauses in a way that makes sense to the user, but they seem to discount the value in the formality of enforcing a particular order to the clauses: you're assured that if the query contains, say an ORDER BY clause, it will always appear at the end, no need to hunt for it.
Also, why the need to match semantic evaluation order when there are far more important things happening under the hood that affect the execution plan (indexes, etc.)?
> Side-effects at a distance...The same columns are cross-referenced in SELECT, GROUP BY and ORDER BY, and corresponding edits are often required in three places
> when there are far more important things happening under the hood that affect the execution plan
It feels like Google is only concerned with large query engines like BigQuery, where you're relying on partitioning and sort order within partitions more than on indexes or complicated multi-table reference semantics.
> Can't this be avoided by using aliases?
In any language where variables are hard to declare and use pipes always feel like a good option; however, just adding some variables in is probably the better solution in most cases.
The first-page of the paper has 13 co-authors listed - but all with the same affiliation ("Google, Inc") - so this is ultimately a single-vendor making a unilateral proposal to break with the past - which means I'm confident this proposal won't be gracing the pages of the ISO/IEC 9075 (ISO SQL) standards in my lifetime - no matter how badly we all need QoL improvements to SQL.
...okay, if I dial-back my feelings of resignation to mediocrity, then I'll admit that Google probably does have enough clout to make this go somewhere - but they'd need to add this to all their database offerings (BigQuery, Spanner, Firebase's SQL mode) and contribute patches to Postgres and MySQL/Maria - maybe after Microsoft relents a decade later to add it to MSSQL we'll maybe start to see Oracle's people refer to it vaguely as a nice-to-have they'll implement only after they start losing more blue-chip customers[1].
Also, it's giving me M (Excel PowerQuery) vibes too.
-------
[1]For context, Oracle's DB lacked a `bit`/`bool` column type for the past 40 years until last year. People had to use `char(1)` columns with CHECK constraints to store '0'/'1' - or worse: 'T'/'F' or 'Y'/'N' (see https://stackoverflow.com/a/3726846/159145 )
And to be fair, it's not like any database implementation implements the entire spec; and the spec itself is nearly as long as the C++ one, yet still very underspecified.
Piped SQL fits in perfectly with the overall SQL pot-luck buffet! I for one welcome Google to the table, enjoy the language that works everywhere and nowhere but is the best there is.
You could implement a family of poorly defined dialects using SQLGlot, sure, in the same way you could define a meta-language targeted at Algol-family languages. That's not a substitute for adding language features.
I don't have time to read this closely or ponder the grammar right now. They have a section on "complex queries" that acknowledges my first concern, but doesn't seem to really address it. Namely, that SQL allows tree-like composition of queries which is more general than a linear composition.
Has anybody figured out whether they are proposing this pipeline syntax to be mixed with regular compositional forms like CTEs and subqueries? Or is it another limited, top-level syntax, similar to how some DB engines do not allow nesting of CTE syntax?
I seriously wonder if the people who are so adamant that SQL is flawed have spent as much time using it as they have trying to "fix" it. After 20 years of sequeling I have come to believe that this language is so far ahead of its time that we're only just beginning to see what proper tooling for it looks like. Azure Data Studio w/Copilot makes starting queries with "select" the most natural thing in the world, and this pipe syntax is barbaric in contrast.
I think that engineers and analysts each have very different relationships with SQL.
When I was doing data science, all the other DS folks would be perfectly content to read and write queries that were hundreds of lines long. There were plenty of minor bits to pick, but it was a great lingua franca for describing data processing.
But engineers hate SQL because they generally only need a tiny little subset of the features to enable transactional data updates. So they write an ORM to do the subset of SQL they need and never get the opportunity to be indoctrinated into the SQuLt.
I think their claim isn't that it's impossible to be efficient in existing SQL but rather that pipe syntax is more natural and approachable to a lot of people?
Great question, answered in section 4.1.5 of the paper (page 6): it turns out that the single character | is already used for bitwise OR, and although that's rarely used and in most cases there would be no confusion, there are at least a few cases, and at this time, “More drastic solutions requiring users to migrate queries did not seem desirable.”
Question for people writing highly complex SQL queries.
Why not write simple SQL queries and use another language to do the transformations?
Are SQL engines really more efficient at filtering/matching/aggregating data when doing complex queries? Doesn't working without reusable blocks / tests / logs make development harder?
Syntax is one thing, but actual performance (and safety/maintenance) is another deal?
If I need to join a million rows to another table and then calculate an aggregate value, and do it all in a single query, it's fast. It might only take 0.01 seconds.
Whereas if I make separate queries to retrieve a million rows, and a million rows from another table, then it's incredibly slow just due to the data transfer, no matter how conceptually easier another language may be. It might take 30 seconds for the repeated sequential serialization, network transfer, and deserialization.
And even if you're looking up individual rows (not a million), with lots of joins that would be lots of round-trips to the database which multiplies latency and load -- so even if you can still get your final result quickly (e.g. 0.01 seconds rather than 0.001 seconds), the load you can handle drops by the same multiple (e.g. you can now only serve 10% as many users).
The general rule of thumb with databases is that they can be super-fast when everything is done on the database machine in a single query and your final result is a small amount of data (assuming everything is indexed properly and the query is written properly). But they become super-slow when you're doing a bunch of queries in a row, and where those intermediate queries can return massive amounts of data, or (even worse) need to send back massive amounts of data.
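As a rough sketch with a made-up schema, the single-query version keeps all that work on the database side, and only the small aggregated result crosses the network:
SELECT c.region, SUM(o.amount) AS total_sales
FROM orders o
JOIN customers c ON c.id = o.customer_id
WHERE o.created_at >= DATE '2024-01-01'
GROUP BY c.region;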
I've worked on a few SQL systems used for analytics and ETL.
My users fell into (for the purposes of this discussion) three categories:
1. Analysts who prefer sheets
2. Data scientists that prefer pandas
3. Engineers who prefer C++/Java/JavaScript/Python
I'm fairly sure SQL isn't the first choice for any of them, but in all three cases a modern vectorized SQL engine will be the fastest option for expressing and executing many analysis and ETL tasks, especially when the datasets don't fit on a single machine. It's also easier to provide a shared pool of compute to run SQL than arbitrary code, especially with low latency.
Even as a query engine developer, I would prefer using a SQL engine. Performing even the basic optimizations a modern engine would perform -- columnar execution, predicate pushdown, pre-aggregation for shuffles, etc -- would be at least a week of work for me. A bit less if I built up a large library to assist.
I've heard great things about Pola.rs performance. To get there, they use lazy evaluation so they can see more of the computation at once, allowing them to implement optimizations similar to those in a SQL engine.
In the early days, even as I appreciated what Pandas could do, I never found its API sane. Pandas has too many special cases and foot-guns. It is a notorious case of poor design.
The problems of the Pandas API were not intrinsic nor unavoidable. They were poor design choices probably caused by short-term thinking or a lack of experience.
On eager vs lazy evaluation -- pytorch defaulting to eager seemed to be part of the reason it was popular. Adding optional lazy evaluation to improve performance later seems to have worked for them.
I'm not interested in loading all the data in memory, or swapping it out if it doesn't fit, but also it may be a lot of data, and just moving it over the network is hard.
I have tests. I have reusable blocks (SQL functions, WITH blocks and views). I don't have logging though.
I can put the result in a non-materialized view and have it update in real time as the data changes. Or I can toggle it to materialized view and now it's snapshotted data.
Finally, views that depend on views that depend on views get automatically optimized by the query planner. You need a lot of very tricky custom code to start approaching that.
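For instance, with hypothetical names:
CREATE VIEW active_users AS
  SELECT * FROM users WHERE active;

CREATE VIEW active_user_orders AS
  SELECT o.* FROM orders o JOIN active_users u ON u.id = o.user_id;

-- a query against the outer view is planned as one flattened query,
-- so filters like this can still be pushed down into the base tables:
SELECT COUNT(*) FROM active_user_orders WHERE order_date >= DATE '2024-01-01';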
> Are SQL engines really more efficient at filtering/matching/aggregating data when doing complex queries?
As others have said, yes. In most cases, the more complex the query the better the result.
> Doesn't working without reusable blocks / tests / logs make development harder?
SQL isn't a programming language, it's a query language. You're effectively writing a single expression that describes the shape of the data that you want, and then the RDBMS goes off and finds the most performant way to retrieve the data that matches that shape. It doesn't compare well with procedural programming.
The closest "languages" that comparable to SQL are HTML and CSS. However you manage HTML and CSS is also how you can manage SQL.
>> Doesn't working without reusable blocks / tests / logs make development harder?
> SQL isn't a programming language, it's a query language. You're effectively writing a single expression that describes the shape of the data that you want...
Exactly this. Generally speaking, your SQL queries won't have "bugs" the way that you can create bugs when writing a function.
The challenging parts of building a complex query are usually 1) getting it to function at all (just being a valid SQL statement for your tables that gives you your desired output fields), and 2) making sure it runs performantly on realistically large table sizes (usually in milliseconds as opposed to seconds), which may involve rewriting things like joins vs. subqueries and/or adding indexes.
A lot of bugs in functions come from edge cases or different combinations of paths through code or unexpected combinations of parameter values or math formula errors or whatever... but a SQL query won't usually really have any of those things. It's just a single transformation that basically either works or doesn't.
One reason SQL has become more popular lately is as an API for map/reduce.
Before, you would write a Java/C++ class that would do the map/reduce job for you, distributed over hundreds of CPUs. And you would feel like you were on the bleeding edge doing innovative stuff.
Turns out that SQL is a perfect API for map/reduce.
Everything you write in SQL can be transformed into a massively parallel job. And you don't even know about it.
This is the secret behind BigQuery and Trino/Presto/Athena.
> Why not write simple SQL queries and use another language to do the transformations?
> Are SQL engines really more efficient at filtering/matching/aggregating data when doing complex queries?
Yes. With knowledge of the data (like indices) and statistics of the data it is usually very much more efficient than piping the data to another process to handle the same job.
Of course that requires you to write a good query and know how to tell what makes a good query, but if your data is in a relational database it is usually more efficient to do your filtering/matching/aggregating there.
SQL doesn't have problems. It's arguably one of the oldest and most successful languages of all times.
Some developers have problems, because they don't understand its abstraction. Periodically they try to re-implement features to "fix" SQL. Every time it's an absolute disaster.
I realize this is how the world works, but it's just a waste of time.
> Some developers have problems, because they don't understand its abstraction.
I think the point of this new syntax is to lower the cost of understanding the abstraction. You keep the familiar clause syntax but let clauses be chained into pipelines in which the actual (semantic) execution order is the same as the written order.
I think this new syntax is more imperative (you tell the DB "how" to get you the data). The old syntax is more declarative (you tell the DB "what" you need).
The big advantage is that the SQL engine can optimize the how if you focus on the what -- especially after so much research has been done building query compilers.
If you want to focus on the "how" then perhaps a "bare bones" DB like Redis is more in line with this kind of thinking.
> I think this new syntax is more imperative (you tell the DB "how" to get you the data). The old syntax is more declarative (you tell the DB "what" you need).
I think this syntax could accidentally give the impression that this is what's happening, but it's still an abstraction thanks to the optimizer. The "how" a pipe syntax describes roughly aligns with how a human would naturally want to construct a mechanical process to sort through a file cabinet. Which may be why it's been such a big hit for Unix and functional programming, despite poorly mapping to the machine model before the optimizations of modern OSes and functional compilers came along.
Any choice whatsoever might give a false impression of the machine model. Maybe SQL is doing the right thing by essentially having its syntax elements in random order, thwarting any attempts by developers to relate what they are typing to what is actually happening, but I think the authors of this paper are right in their desire to make it more ergonomic.
I personally find the syntactical flow of SQL to be quite logical: you SELECT tuples FROM a table, JOINing others as needed, filtering WHERE necessary. It’s much like list comprehensions in Python, which I love, so maybe that's biasing me.
You can tell the authors realized this was a bad idea when they had to add the 'OVER' keyword, which isn't documented and hardly mentioned in the paper.
I disagree that the paper not mentioning ‘OVER’ implies that the paper authors secretly think pipe syntax is a bad idea. They probably just wanted to keep the paper concise, or forgot about that one less-used bit of syntax.
Do you think that ‘OVER’ keyword implies something fundamentally wrong about pipe syntax? If so, how?
I appreciated the paper's effort to highlight data analysis techniques, but I was hoping for a more forward-thinking perspective.
The inclusion of pipe syntax or data processing using pipe-structured data flow syntax is not very novel at all. Splunk's SPL and similar languages have been a longstanding industry practice.
I wish the paper had provided more extensive references to the existing body of work in this domain, as it would have added depth and context to the discussion.
Starting with SELECT was a mistake in SQL, as you need to know where you're selecting from before what you're selecting. What we really need is something like
FROM r JOIN s
on r.Id = s.Id
ORDER BY r.Id
SELECT *
But the thing is, such changes would break pretty much all existing code, so the authors added |> to distinguish. But why not use | instead? Don't make people type one more character, please.
Yes, being able to start with FROM is what makes this so intuitive to use (and autocomplete work better); your hypothetical
FROM r JOIN s on r.Id = s.Id ORDER BY r.Id SELECT *
would indeed be pretty much that (linebreaks optional):
FROM r JOIN s on r.Id = s.Id
|> ORDER BY r.Id
|> SELECT *
The question about “typing one more character” is answered in the paper's section 4.1.5 “Why use ‘|>’ for the pipe character?” (page 6): “The most natural and obvious choice would be to use ‘|’. Unfortunately…” — they don't want to break existing queries that use | for bitwise OR.
(I wonder if one day, if the majority of users are using pipe syntax and don't mind updating their old queries or adding a special directive to enable single pipe…)
This was my first thought as well. I like the language, but I’m hesitant to adopt it without tooling that can extract an AST, ideally interoperable with the hard-won tooling we already have for SQL.
Improved UX at query development time is nice, but data teams need so much more than this. Stuff like lineage, linting, cross-dialect interop - these are no longer niche concerns, they are table stakes, and they come from having a shared syntax for data transformation.
Of all the problems with SQL, this solves one I absolutely don’t care about at all. Kill SQL or don’t kill SQL, but don’t add pointless syntax sugar to make normal queries look alien.
I really wish that browsers had developed first-class support for offline web page bundles. There's no way to share a page that is guaranteed to be self-contained and not hit the network, especially if you want to use javascript. It's particularly frustrating since browsers supported offline mode as far back as the 90s; it just needed to be combined with support for loading from zipped folders.
That simple change would've largely solved the academic paper problem decades ago. It's bizarre that it still isn't a feature.
Mail clients kinda do that (or at least they can, if asked to). Also, why would academic papers need JS anyway? CSS and images, I can get, but beyond that there's no need for anything fancier.
Yes, but it's not guaranteed to be self-contained. I wouldn't want to open a random HTML file knowing that it could phone home, or that the content might break one day without me realizing. There's a practical and psychological aspect to sharing `steves_paper_2014.html` versus `steves_paper_2014.offlinesitebundle`. The latter feels safe and immutable.
What you want is an HTML tag or response header that restricts network access, which the browser can then enforce. Offline or a list of allowed domains, this would be great for security in general. Not so great for advertisers though.
Then you have to verify that the tag is there, right? But if it has another extension like .offlinebundle you can know that browsers will not make any extra requests.
Browsers don't have native support for opening WARC. It doesn't solve the safety problem either: you can still construct a WARC that phones home, AFAIK.
It's a great format for the problem it solves, but if browsers supported offline-only files the container format wouldn't (and shouldn't) need to be that complicated.
Plus I can't use web tools, like "Read this page" in Mobile Safari.
And copying and pasting is harder.
And I can't link to individual sections.
I'm honestly baffled by people who prefer PDFs for this kind of information. Are they printing them out on paper and going at them with a highlighter or something?
Just my personal take, but when I have to read something carefully, I find it easier to do on paper.
For example, I recently wrote an article about taking random samples using SQL. Even though I was writing it for my blog, which is HTML, I proofread the article by rendering it as a PDF doc, printing it out, and reviewing it with a blue pen in hand.
What surprised me is that I also found it easier to review the article on the screen when it was in PDF format. TeX just does a way better job of putting words on a page than does a web browser.
Actually, if you want to do the comparison yourself, I'll put both versions online:
On a mobile phone, as a reader with photophobia, the PDF causes physical pain and is illegible, whereas the HTML is perfectly readable via reader mode (where text can be enlarged and dark-mode settings are respected).
Personally, it's sending it to GoodReader on a 13" iPad.
I don't know that I'd go so far as to say I 'prefer' this, but there are a lot of PDFs out there, this works fine, and it's a nice change of pace given how much time I spend in front of a monitor / laptop screen.
Translating LaTeX to HTML is not a straightforward process, unfortunately. Many people have tried to implement automated translation systems, but nothing has really worked out yet.
I think it's unfair to expect the research team to invest additional hours in learning how to make good websites, so to solve your problem would require hiring additional talent whose only job is to translate academic PDFs into accessible web pages. I don't think that's a bad idea, and certainly Google has the funds to do something like that, but I don't imagine they'd find it to be a good use of money. Accessibility is an afterthought for most major companies these days.