For me the examples on the website https://prql-lang.org/ are the biggest selling point for PRQL, in particular the SQL it generates. It looks clean, straightforward, something I would've written myself.
In general, I like this slightly more careful take on modern database development. 10-15 years ago, people would start a brand new database like Mongo, or Riak, or Influx, or whatever, and would try to convince application developers to select it for new projects. But recently we started getting more conservative options like EdgeDB, TimescaleDB, or even PRQL, which all expect us to run Postgres with some addons and/or query preprocessors. Tech like this is so much easier to adopt!
I'm really liking what the Edge folks are doing with schemas and migrations, but I find PRQL syntax much more intuitive. My application code is littered with data transformation pipelines already: all these map / filter chains in TypeScript, iterators in Rust, enumerables in Ruby, streams in Java, LINQ in .NET, Rx in dozens of languages, etc. So the concept is very, very familiar. Getting nice SQL queries out of PRQL that I can store, inspect later, see the query plans for, and add indices where necessary is just great. It's such a good tool!
To me it seems quite nice, but really just trivially different from SQL - like if Ruby was 'friendlier syntax that transpiles to Python', meh? You'd use whichever you happened to learn first and not bother with the other. (That's often true even though it's more than that of course.)
The examples arbitrarily make SQL look more verbose:
SELECT
id,
first_name,
age
FROM
employees
ORDER BY
age
LIMIT
10
Yes! Of course I'd rather:
from employees
select {id, first_name, age}
sort age
take 10
..but wait, actually the SQL could've been:
select id, first_name, age
from employees
order by age
limit 10
and it's more verbose by a character or two... (no braces, but 'order by' vs 'sort')
It's not the tiny changes in the syntax/order that matter here. It's that if you want to limit the columns or do something else with them afterwards, in PRQL you append that text to the end. In SQL you'd have to wrap it around your original query, or inject joins in the middle. It doesn't read well anymore, because "how" you're doing things eclipses "what" you're doing.
For a simple select, this just doesn't matter. But you also wouldn't bother with PRQL for simple selects.
In Python I use SQLAlchemy Core to build the SQL for me, but I manage the transactions myself. The SQLAlchemy query builder serves a similar role to what you described, in that I can add another select or where clause after I have already added a sort criterion. It's clever enough to order these correctly based on the SQL dialect.
I use CTEs, window functions, and groupings all the time when I write reporting queries. These things tend to be much more verbose in raw SQL, and ORMs / query builders either don't support some of these features at all, do a very poor job (like forcing me to put raw SQL substrings in my code), or make me write DSLs that are even more verbose than raw SQL. Look at the corresponding PRQL samples, and you may see the appeal.
Also, I agree, no one should write SQL like this - screaming keywords, superfluous newlines, etc. I don't think this style ever made sense, even in the 1970s.
While there's some stuff in C#/LINQ/EF that's more verbose (left joins are often a nightmare) or not-supported, I'll always say that I prefer writing queries in EF than in SQL, at least when dealing with SQL features that are supported by EF (which is a lot of them, it's a very expressive dialect).
But EF lets you start with FROM, lets you do whichever keywords you need in whichever order (instead of WHERE -> GROUP BY -> HAVING and you've got to CTE or Subquery if you want another GROUP BY). It also lets you access the members of a group because the objects are still treated as a graph instead of being pulverized into a flat table like SQL does. It also makes your FKs into first-class navigational properties of the table.
Like, if I have an addressID and I want to get its country code?
In MS SQL that's
SELECT CountryCode
FROM Country
INNER JOIN ProvState ON ProvState.CountryID = Country.ID
INNER JOIN Address ON Address.ProvStateID = ProvState.ID
WHERE Address.ID = @AddressIDParam
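For what it's worth, that join chain can be exercised end to end with a throwaway SQLite database; the schema and rows below are invented purely to illustrate the walk from Address to CountryCode:

```python
import sqlite3

# Hypothetical minimal schema mirroring the Country/ProvState/Address example.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Country (ID INTEGER PRIMARY KEY, CountryCode TEXT);
CREATE TABLE ProvState (ID INTEGER PRIMARY KEY, CountryID INTEGER);
CREATE TABLE Address (ID INTEGER PRIMARY KEY, ProvStateID INTEGER);
INSERT INTO Country VALUES (1, 'CA');
INSERT INTO ProvState VALUES (10, 1);
INSERT INTO Address VALUES (100, 10);
""")

# The same two-join walk as the MS SQL snippet, parameterized on address id.
row = conn.execute("""
SELECT CountryCode
FROM Country
INNER JOIN ProvState ON ProvState.CountryID = Country.ID
INNER JOIN Address ON Address.ProvStateID = ProvState.ID
WHERE Address.ID = ?
""", (100,)).fetchone()

print(row[0])  # CA
```

The EF navigation-property version collapses those two explicit joins into `a.ProvState.Country.CountryCode`, which is the readability win being described.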
EF has a hell of a lot of flaws, but linguistically I love it. Yes there's a lot of aliasing boilerplate in EF, but the ability to walk the foreign keys and the fact that you can put the select after the table name pays off so very well.
Also there's a dialect of LINQ that looks more like SQL but it's kind of weird and I don't love it so I prefer to use the lambda syntax above.
In that dialect, it's
from a in db.Addresses
where a.ID == addressIDParam
select a.ProvState.Country.CountryCode
which is even more terse and SQL-y although I find it a weird linguistic mash-up.
The downside though is you have to grab the SQL it's generating somehow to try to optimize it, figure out what crazy CROSS APPLY madness it's doing, or figure out why it's blowing SQL Server's recursion limit.
I prefer to avoid linq syntax now. It’s a false economy.
In my experience, it quite rarely uses cross apply, and typically only for functions where you're doing complicated join filtering and trying to extract full EF entities.
If you don't use LINQ you can't get EF entities anyway, and if you don't need EF entities you can still use LINQ to get a projection and avoid the cross apply!
The worst case scenario often ends up being writing a linq query to get your primary keys, then loading entities just using the list of keys.
I've used linq for years and it makes the basic and intermediate stuff way simpler, and the complex stuff no more complex. In hundreds of thousands of lines of code, we have exactly TWO queries where it ended up simpler to hand write SQL, and EF did not prevent us from doing that!
EF for some time (since the "Core" reboot) has made things like "crazy cross apply" opt-in only; instead, you get runtime exceptions for LINQ query shapes it thinks can only be done that way. Some of those crazy things early-days EF silently supported aren't even possible in today's EF, not even as opt-in behaviors - mostly for the better, occasionally for the worse.
I think current EF, more than ever, generates SQL that looks like what you would straightforwardly expect given the LINQ you've written, with the only surprises being runtime exceptions raised before it ever tries to confuse a SQL database with a horribly deoptimized query.
Today's EF also has more tools than ever to optimize its queries. It logs query text by default in some debug contexts to visible debug consoles, and to certain telemetry providers when errors occur in certain release/Production contexts. It's easy to opt-in to such logging in even more contexts should you desire that. (Thanks to modern .NET's ubiquitous System-owned ILogger infrastructure and the modern "Generic Hosts" approach.) For advanced scenarios, intercepting specific LINQ queries and enhancing/optimizing their output based on domain knowledge of the queries or the specific databases is easier than ever with simple dependency injection.
I drew a different conclusion from similar experience. I avoid navigation properties and other advanced mapping features, so an entity maps flatly to one table.
The LINQ queries will be more verbose as you'll need to write the join and group clauses explicitly, but I find it much easier to predict the performance of queries since the generated SQL will look almost exactly the same as the LINQ syntax. It's also less likely to accidentally pull in half the database with `.Include()` this way.
I have some ideas for how to implement a high performance ORM (yes, that sounds strange, doesn't it?) that makes extensive use of CTEs, window functions, and temporary tables. Writing reports in that ORM would be downright trivial.
From my perspective the JPA specification is downright harmful. There can never be a good JPA implementation. I don't understand how there are no Java ORMs that actually learned anything from the mess that Hibernate etc. are.
I agree that I think the difficulty with the uptake will be "it's really just less annoying SQL", and it's hard to overtake a technology that's so ubiquitous when your fixes are really "nice to haves" vs something truly transformational.
That said, it's not the succinctness that's an improvement, it's that the pipelined nature of PRQL really maps much better to how people should think about queries, and also how the server executes them! Something as trivially simple as putting the from clause first means I can now get much better typeahead/autocomplete support from dev tools. Heck, I already do this now in a slightly more annoying manner: I write "select * from some_table ..." first, then I go and type out the actual columns I want because once the from clause is correct my IDE will autocomplete/show errors for possible columns.
> the pipelined nature of PRQL really maps much better to how people should think about queries
I disagree. Database engines take SQL and transform it into an execution plan that takes into consideration database metadata (size, storage, index analytics, etc.). Queries should be thought of with a _set based_ instead of _procedural_ approach to maximize the benefits of this abstraction - diving into the implementation details to guide the execution plan formation only when necessary.
Also, the pipeline approach could be achieved with common table expressions (CTEs), right?
That said, I think PRQL looks promising because it is a solid attempt to make RDBMS development more approachable. I also like that `from` comes before `select`: it is far more readable. A solid and modern IDE experience for PRQL could be a "killer app".
I disagree. I find it extremely hard to reason about large queries as set transformations, whereas it is much easier to break it down to "first this, then that". And this is long before I've even started writing my first line of SQL.
So let me write it procedurally and have the optimization engine fix it for me, just like how it fixes my SQL.
Even SQL queries are often better understood procedurally. Take this one [1]:
SELECT article, dealer, price
FROM shop s1
WHERE price=(SELECT MAX(s2.price)
FROM shop s2
WHERE s1.article = s2.article)
ORDER BY article;
That inner WHERE clause doesn't make sense in my opinion, unless you think of it procedurally: for each row in s1, do a search for the highest price amongst all items that share its article number.
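To make the "for each row" reading concrete, here is that exact query run against a tiny invented "shop" table in SQLite (the rows are made up for illustration):

```python
import sqlite3

# Tiny "shop" table to exercise the correlated subquery above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE shop (article INTEGER, dealer TEXT, price REAL);
INSERT INTO shop VALUES
 (1, 'A', 3.45), (1, 'B', 3.99),
 (2, 'A', 10.99),
 (3, 'B', 1.45), (3, 'C', 1.69), (3, 'D', 1.25);
""")

# For each s1 row, the subquery finds the max price among rows
# sharing its article; only rows matching that max survive.
rows = conn.execute("""
SELECT article, dealer, price
FROM shop s1
WHERE price = (SELECT MAX(s2.price)
               FROM shop s2
               WHERE s1.article = s2.article)
ORDER BY article
""").fetchall()

print(rows)  # [(1, 'B', 3.99), (2, 'A', 10.99), (3, 'C', 1.69)]
```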
Completely agree, thanks for putting it better than I could have, with an excellent example. Correlated subqueries like the example you give, or similarly lateral joins in postgres, fundamentally are treated like for loops by DB engines anyway.
Semi-related, but the example you give is also why I love Postgres' "DISTINCT ON" functionality (I don't know if other DBs have something similar) - it makes it so much easier to reason about these "give me the 'first' one from each group" type queries without having to resort to correlated subqueries.
Queries like these are best suited to window functions, although I am not sure MySQL supports them:
SELECT article, dealer, price FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY article ORDER BY price DESC) AS rnk
  FROM shop s1
) sub
WHERE sub.rnk = 1
ORDER BY article;
This query will be a single pass over the table, without loops/joins.
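The window-function version gives the same result on the same invented sample data; SQLite (3.25+) supports `ROW_NUMBER`, and MySQL has had window functions since 8.0:

```python
import sqlite3  # window functions require SQLite 3.25 or newer

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE shop (article INTEGER, dealer TEXT, price REAL);
INSERT INTO shop VALUES
 (1, 'A', 3.45), (1, 'B', 3.99),
 (2, 'A', 10.99),
 (3, 'B', 1.45), (3, 'C', 1.69), (3, 'D', 1.25);
""")

# Rank rows within each article by descending price, keep rank 1.
rows = conn.execute("""
SELECT article, dealer, price FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY article ORDER BY price DESC) AS rnk
  FROM shop
) sub
WHERE sub.rnk = 1
ORDER BY article
""").fetchall()

print(rows)  # [(1, 'B', 3.99), (2, 'A', 10.99), (3, 'C', 1.69)]
```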
This is the "set based" approach for the MAX: there does not exist a bigger element:
SELECT article, dealer, price
FROM shop s1
WHERE NOT EXISTS (SELECT 1 FROM shop s2
                  WHERE s2.price > s1.price AND
                        s2.article = s1.article)
ORDER BY article;
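Running the NOT EXISTS formulation (with the correlation written as `s2.article = s1.article`) on the same invented sample rows yields the same result as the correlated-MAX version:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE shop (article INTEGER, dealer TEXT, price REAL);
INSERT INTO shop VALUES
 (1, 'A', 3.45), (1, 'B', 3.99),
 (2, 'A', 10.99),
 (3, 'B', 1.45), (3, 'C', 1.69), (3, 'D', 1.25);
""")

# "Set based" phrasing: keep a row if no row with the same article
# has a strictly greater price.
rows = conn.execute("""
SELECT article, dealer, price
FROM shop s1
WHERE NOT EXISTS (SELECT 1 FROM shop s2
                  WHERE s2.price > s1.price
                    AND s2.article = s1.article)
ORDER BY article
""").fetchall()

print(rows)  # [(1, 'B', 3.99), (2, 'A', 10.99), (3, 'C', 1.69)]
```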
The uncorrelated version should be rewritten with a CTE, aliased as 'article_max_price' (amp) as if it were a computed property, and then filtered with WHERE price = amp.price.
That's a really simple example though. I think the real pitch for PRQL is that the syntax is way more regular: the group operator just runs a normal pipeline on each group, and multiple derivations with different operations between them don't need to all be plotted out in advance, just as two examples. SQL suffers because it's not really composable, especially once you get outside of specific modern versions.
Even though a small example doesn't highlight the advantages well, I still prefer the PRQL syntax in this because that reads like map/filter code I'd write in some programming language.
"Order by" is a much better name, by the way. I'm permanently slightly annoyed by the fact that in programming it's customary to call ordering "sorting" for some unimaginable reason.
I don't get that - to me the examples are much less readable than SQL and I don't understand why I should want to use this. Like, yes, you can reorder the query sections, which seems to be everyone's complaint about SQL, but then you also have multiple types of brackets, colons and other syntax for no reason, all while not really accomplishing anything SQL doesn't already do.
If you're happy with SQL then there isn't much point.
For the folks building and supporting PRQL, SQL just has a few too many warts, and the popularity of tools like Pandas, dplyr, Polars, LINQ, ... shows that for analytical work we often like to work with our data differently. Other frameworks and languages would throw out Relational Algebra as well, but we feel that's throwing the baby out with the bathwater.
PRQL's core tenets are that Relational Algebra is foundational and a great abstraction but what's needed is a modern, ergonomic language that allows us to work with data the way most of us conceive of it - as pipelines of transformations.
Personally I feel much more productive in PRQL. I can start with a dataset and append one transformation at a time, writing from top to bottom as I go along without having to jump back and forth between the SELECT and the WHERE and the GROUP BY clause etc... .
Also, if I want to take out a step, I can just comment out that line and the rest of the pipeline usually still works. This might seem like a minor thing, but in my experience it's those kinds of ergonomics that make the difference in actual day-to-day work, rather than fancy theoretical features you use once in a blue moon. It's therefore worth noting that this was an intentional design decision. Try taking some steps out of your SQL query and see how well the rest of it holds up.
For what it’s worth, it looks really readable to me. I have decades of sql experience at this point so consider myself pretty proficient but I can see the appeal of having a terser syntax for transformations. I especially like the “it just makes sql” approach.
Stepping through the second example on that page I know how I could do the same in sql, and I also know that it would be harder for most people to follow.
Question about that example though - why is there an aggregate without an assignment to a variable (average gross_salary) like the other examples?
I couldn't find that example right now but there is a similar line in the main example on the prql-lang.org homepage:
aggregate {
average total,
...
}
I can't definitively say why it is there, other than perhaps just to show that you can specify aggregations without having to give them an alias. The column name won't be pretty but if you're just interactively trying something out and want to see the results then you probably won't care.
Gotcha. The thing that’s not immediately clear from the syntax is which columns I’m getting out as a final result of my query. I guess the last select you ran + any derived ones since then?
That's right. It can be a bit hard to keep track of but the rules are quite simple. You start with all the columns from a `from`, then any `derive` just adds columns. A `select` restricts the columns to those selected and a `group + aggregate` replaces the columns with those from the `group` + those from the `aggregate`.
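To sketch those rules (table and column names invented for illustration), the column set evolves like this through a pipeline:

```
from employees                          # all columns of employees
derive {total = salary + bonus}         # + total
select {department, total}              # only department, total
group {department} (
  aggregate {avg_total = average total}
)                                       # department, avg_total
```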
I once tabled a proposal on whether each query should be forced to end with a `select`, to make the final column list explicit. This could be generated by the compiler. It was felt that that wasn't necessary though, and would also be somewhat arbitrary, since you also need to know what columns are available at each previous step of the pipeline if you want to be able to make modifications. As the tooling improves, you could perhaps be shown the current list of columns as you hover over each line?
There’s nothing stopping people from using their own convention of selecting early to grab the stuff they need and selecting again right at the end to be explicit about what they want. Like select *, it’s better to just be explicit.
The attraction is that, especially for more complicated queries with complex joins, subqueries, aggregates, etc., that the structure of PRQL much more closely matches the "English", mental-model of the query. I can just read PRQL linearly, from top to bottom, and at each point I know what is being "pipelined" into the next phase of the query.
With SQL I have to look all over the place. And it's not just that FROM should come before SELECT, it's that if I'm doing, say, an aggregation with a group by, normal SQL doesn't really have anything to make me think about the ungrouped rows, and then merge them together to get the aggregate values. With PRQL I can just go top to bottom, and for me it's much easier to reason about (i.e. first get all the rows, then group by some specific columns, then take aggregates, etc.)
And I say this as someone who spends about half my days in SQL at present.
1. The SQL that I've seen output by PRQL isn't monstrous, at least no more than it would be coding by hand.
2. I do agree that I don't think PRQL will have much uptake until it is baked in to DB systems, at least as a plugin. One reason (among many) that I hate ORM systems is that with native SQL it's very easy for me to just capture a SQL statement from my DB logs and then search my code for an offending statement. But given how PRQL is transpiled, I don't see any reason it couldn't be essentially part of the query planner, so that query logs etc. are reported in PRQL and not SQL. E.g. the query planner already takes text (SQL) -> Query Plan, it seems rather straightforward to go text (PRQL) -> text (SQL) -> Query Plan, with the SQL intermediate just being an implementation detail I don't have to care about as a DB user.
The attraction is something that was designed after decades of usage and PL research. Consistency of syntax is a big one for me. A favorite example of mine:
SELECT substring('PostgreSQL' from 8 for 3);
SELECT substring('PostgreSQL', 8, 3); -- PostgreSQL-only syntax
SELECT trim(both 'xyz' from 'yxSQLxx');
SELECT extract(day from timestamp '2001-02-16 20:38:40');
I think the main difference is how PRQL translates into a query execution plan; with SQL you need to read the entire query to get a rough understanding of the order of operations (especially if you join two or three tables and have some nested and lateral queries).
With PRQL, the order is explicit and set up by the developer, so any footguns will be evident.
Things like predicate push-down, optimization fences, variable rewrites, etc. don't need to be relied upon, because the query pipeline is more explicit in PRQL.
Also, since it is a new language, it can naturally be extended into ETL-type data pipelines.
Also, because PRQL can be translated into a query execution plan, it can be converted into non-SQL targets like MongoDB or pandas / Spark / etc., eliminating SQL altogether for distributed NoSQL engines.
Is that true? Doesn’t this just effectively translate into sql to be executed on the db as usual if you’re running on an rdbms?
Looking at the examples and the sql generated in the playground you’re just going to have the query planner doing its thing to decide how to execute eventually anyway.
Maybe you didn't see the more complex examples, but some of the SQL queries it abstracts are very unintuitive to write by hand. PRQL turns it into something more readable and closer to the expected mental model.
For people who can write error-free and complex portable SQL queries by heart this may not be really interesting. From experience, however, that is not a skill most developers have.
Isn't this a fairly simple way of doing this? That said, it is a bit non-obvious if you haven't seen it before.
select earliest.* from employees as earliest
left join employees as earlier on
  earlier.role = earliest.role
  and earlier.join_date < earliest.join_date
where earlier.id is null
order by earliest.join_date
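Here is that anti-join run on a throwaway SQLite table (employee rows invented for illustration); it picks, per role, the row with the earliest join_date:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (id INTEGER, role TEXT, join_date TEXT);
INSERT INTO employees VALUES
 (1, 'dev',   '2019-03-01'),
 (2, 'dev',   '2020-07-15'),
 (3, 'sales', '2018-11-20'),
 (4, 'sales', '2021-01-05');
""")

# A row survives only when no same-role row joined earlier
# (i.e. the LEFT JOIN found no match, so earlier.id IS NULL).
rows = conn.execute("""
SELECT earliest.*
FROM employees AS earliest
LEFT JOIN employees AS earlier
  ON earlier.role = earliest.role
 AND earlier.join_date < earliest.join_date
WHERE earlier.id IS NULL
ORDER BY earliest.join_date
""").fetchall()

print(rows)  # [(3, 'sales', '2018-11-20'), (1, 'dev', '2019-03-01')]
```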
This is indeed a sticky problem, one that usually requires a subselect or other workaround to address the non-determinism of group by + order by; i.e. one cannot simply "select * from employees group by role order by join_date limit 1" and be guaranteed to get the expected ordering.
I mean if you can look at the "expressions" example on the homepage and say the SQL is more readable than PRQL, then more power to you. Hell, actually more power to you, that's extremely impressive. But I think for many people, especially programmers more used to parsing brackets and nesting than pseudo-english, PRQL is easier in the places where you're trying to compose.
Oh, I'm a big fan of raw SQL, too. There's very little out there that can bring me the same joy as a well-written 30-50 line SQL query! Like, I can spend an hour or two on it, but when it runs and produces the results in the desired form, it's sooo rewarding! And realizing that producing similar results in my main language (be it Rust or TypeScript, or whatever) would take me 3-4 days instead (and it would run much longer) makes me appreciate SQL programming more and more. In terms of "usefulness per minute spent" nothing comes close to SQL!
However, the syntax can be a bit clunky from time to time. I'm very fortunate that I've used only Postgres for the past 5-7 years, so CTEs and things like Timescale, JSONB, or PLV8 / PL/Rust for stored procedures are often on the menu. Yet simple things still require repeating fragments of code from time to time, and complex grouping or window function code often looks too convoluted.
And as I wrote in my GP comment: I (and many other developers) already use pipelines for data manipulation in code - beyond databases. Say, we got data from an API endpoint, or we're mixing data from several data sources. SQL is familiar because I see it often; PRQL is familiar because I use similar pipelining concepts often in my non-SQL code. Would I use PRQL for super simple queries? Probably not. Would I be upset if one of my coworkers used it for simple queries? No, why would I? Would I try implementing a complex reporting query using PRQL? Yeah, I would. Partially because I suspect I would get my data into the desired form quicker with pipelines, and partially because I know: if I get stuck I can convert my intermediary query to SQL and pick up from there.
For me PRQL looks better than most DSLs for ORMs or query builders in most languages. Adding a new tool to a project may be annoying, but depending on the project I'd rather pay this price once if it makes my less SQL-savvy teammates more comfortable. "Yet another syntax" is obviously the main point against, but from the application developer's perspective, ORMs / query builders often force one to learn "yet another syntax" anyway. "Don't use an ORM / builder" is an often-voiced opinion, and yet in practice we work with them all the time, too.
So, I view PRQL not as an alternative to SQL but as a third database communication dialect in addition to SQL and my ORM / Query Builder DSLs.
The stiff syntax is a feature. The infamous 1000 line sales report query is already a bear to maintain. If you get a clever developer reordering syntax it will only be worse.
If you have auto-complete it can see what table you are using and complete the column names. Starting with SELECT it could be any column in the database.
Huh. I have been doing this for like 15 years and never have a problem. Text editors like Sublime Text suggest names based on the content of the file. SQL Developer, SQLWorkbench, and DataGrip all seem to handle it just fine.
It is psychic in the sense that it makes educated guesses based on previous entries and the current database and schema. This is the same kind of educated guess it would make with a table name.
I really want this to take off and become a first party supported language for Postgres. Yes, yes, relational algebra is beautiful and all that jazz. SQL is a garbage, first pass design that should have been replaced decades ago. Even Codd has complaints about it. It is amazing what he invented, but we have learned a lot about PL design since then.
People really like to associate relational algebra with SQL, probably because they learned them one alongside another. But SQL is really terrible relational language - it breaks a few core concepts in different places, like relations being unordered sets, that you can ORDER BY. This bubbles up as relations losing ordering after being wrapped into a subquery, which is really unexpected.
PRQL has a data model very similar to the relational one. The only big difference is that relations are ordered - they are defined as arrays of tuples. So let's hope that PRQL gets to be known as "the relational language Mk II"
Not quite what you're asking for but DuckDB has both PRQL [1] and Postgres [2] extensions, so you could probably query your Postgres database with PRQL from there.
There's also a DBeaver plugin [3] which we still need to document better and simplify the usage of but you could potentially also use that to query Postgres with PRQL.
Finally there is pyprql [4] with which you could query Postgres from a Jupyter notebook.
Same. I first learnt SQL 20 years ago as a teenager and even back then I remember thinking how odd and, quite literally, backwards the language was. Back then I probably thought I just didn't understand enough to see why it had to be so. Now I know there is no reason. SQL has become more like a natural language. There's no arguing against it, you have to speak it even if it doesn't make sense. But it's not a natural language and we can do better.
It’s optimized for reading and use: the first time you see a query written by another developer, the interesting bit is what data it returns (the select part). Once you know that, you may be interested in the how, where does that data come from? Same for functions or methods, you first see the signature, input and output structures, before you see the implementation.
Good point, I didn't think of it like that and, oddly, I do often advocate for writing things "backwards" from highest level of abstraction to lowest. Maybe there is a good reason SQL has stuck around.
The main issue with SQL is that you are stuck in a very strict way of writing things, which does not clearly match to how I think. The top-down way of writing PRQL where each step is simply a transformation of the previous one makes way more sense to me. SQL is something I'd need a reference manual for, PRQL is simply writing down what I want the query to do.
I do agree that PRQL's join syntax is extremely bad, though. They should've stuck to explicit "left join"-like keywords, and the alias & join column shorthand could be done better.
Despite it not at all being the right tool for most jobs, this is why I actually enjoy writing spark. It reads top down, and is “sql-ish” for most commands. Enough that you don’t need to go out of your way to learn it if you already know SQL, you just think in terms of serial transformations.
I've had two real gripes with SQL. The rest of it has been, as you said, pretty good.
Complaint 1: Not being able to use selected columns later in the same select.
SELECT
  gnarly_calculation AS some_value,
  some_value * 2 AS some_value_doubled
Instead:
SELECT
  subquery.*,
  some_value * 2 AS some_value_doubled
FROM (
  SELECT gnarly_calculation AS some_value
) AS subquery
Complaint 2: Not being able to specify all columns except. This combines with the above, where I have to pull some intermediate calculations forward from a subquery, but I don't need them in the final output. So I have to then enumerate all the output columns that I actually want, instead of being able to say something like `* EXCEPT some_value`.
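To show the Complaint-1 workaround actually running, here is the subquery pattern in SQLite, with a stand-in table and arithmetic in place of the "gnarly calculation" (all names invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER, b INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(2, 3), (5, 7)])

# The inner SELECT names the calculation once; the outer query
# can then reuse the alias, at the cost of an extra nesting level.
rows = conn.execute("""
SELECT
  subquery.*,
  some_value * 2 AS some_value_doubled
FROM (
  SELECT a * b AS some_value FROM t
) AS subquery
ORDER BY some_value
""").fetchall()

print(rows)  # [(6, 12), (35, 70)]
```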
Some may find your first complaint questionable... But I specifically designed ClickHouse SQL to allow aliases to be used and referenced in every part of SQL query.
So unsurprisingly based on my gripes, we are an OLAP shop. I've never heard of ClickHouse before, and I'll be looking into it. We're doing OK on PostgreSQL, but just OK. A lot of learning and unlearning.
I'm quite aware of your advice regarding using `*` against tables. However, I'm talking about using `*` against subqueries and CTEs where the pieces of the table have already been extracted.
I was unable to edit this earlier due to HN being unavailable after I realized the formatter mangled it.
That's awesome! ClickHouse is a great system by all accounts and I've been meaning to try out ClickHouse Local.
I'm more familiar with DuckDB and they've also been doing some great innovation on the SQL front. I don't know offhand if they can also do the forward referencing thing but they allow putting the FROM first and having GROUP BY ALL etc..
It's great to see all this innovation happening in the SQL and Query Language space more generally at the moment.
It is strange to hear about innovation in DuckDB - I see that they are gradually re-implementing the stuff already existing in ClickHouse. Sometimes they do a better job at promoting it.
They also do a bunch of stuff that ClickHouse can't do, like correlated subqueries [1] or recursive CTEs.
I kinda wish for a database with ClickHouse storage and DuckDB optimizer. At least my experience with ClickHouse is that MergeTree is incredibly good at what it does, but the optimizer hurts.
Maybe it's my familiarity with R and the tidyverse paradigm, but I think in general this paradigm and syntax is pretty readable. The example you chose is perhaps a case where brackets and commas go a long way, but that aside it's really not too bad in context of the rest of the language. `:` seems to consistently be treated as keyword arguments, and `=` for aliasing.
Because now the type of join is an argument on the "join" operation, and join is the first word of that statement making it more obvious what the operation actually is? I also prefer foo(bar) over "bar to foo()" as well, which seems like the equivalent in a more general purpose function call in a language example.
The problem with SQL is the same as the problem with C-style variable declarations. It sounds slightly better than the alternatives when you say the code out loud, but in reality it causes problems with readability and parsing/processing the code
There is a rather large breaking change in the syntax from `[]` to `{}` for tuples. This is because initially it seemed like these were lists but over time we've realised that they are actually tuples. As per the release notes, freeing up `[]` clears the way to start supporting arrays soon.
The limitation of PRQL is that it only does SELECTs, by design. If you want to insert/update/delete data, you're back to SQL.
That means that your team's data scientist might give you a query written in PRQL, but if you want to actually incorporate it into the data pipeline, you'll need to translate it into SQL.
I wish that PRQL would support at least a limited ability to insert -- for example, maybe just the case of inserting into a new temp table. No update or ON CONFLICT logic to worry about. It could look like this:
from tracks
filter artist == "Bob Marley"
save bob_marley_tmp
I don't see why PRQL can't support data mutation. Just have an insert/update/delete operator that must go at the end of a pipeline, and which takes a table name as an argument. An SQL query like this:
UPDATE counters SET value = value + 1 WHERE name LIKE 'prefix.%'
Could then be written in PRQL as something like this:
from counters
filter startswith(name, 'prefix.')
derive {
new_value = value + 1
}
select {
name, new_value
}
update counters
Big fan of this post! We link to it from the PRQL website.
Our goal for PRQL is to be a great language to integrate with and build on, so we don't end up with N languages for N databases. Because PRQL will always be open source and won't ever have a commercial product, we think that's much more feasible than a DB-specific or product-specific language.
As evidenced from the poster's ORM rant, two languages for the same DB seem just as offensive to them as different languages for different DBs. I suspect you're still going to have an uphill battle converting them, unless they were hit with a blinding shaft of sunlight on the road to the SQL History Museum or something.
Syntax doesn't matter much; what's in the optimizer does.
I think that right now a lot of SQL's problems are baggage. Too many people are battle-taught to write hard-to-read, complex, deeply nested queries because older engines were bad at handling modern constructs (e.g. look, somebody used a common table expression (CTE) here, let's materialize it immediately). So the older style was a mess of
SELECT FROM (SELECT FROM (SELECT FROM A JOIN (SELECT ...) B ON A.x = B.x))) etc.
With improved CTE handling by engines, it is getting easier and easier to write understandable, composable queries structured like the example below. Add modern window functions that remove another layer of complexity, plus things like SELECT * EXCEPT, more complex TOP clauses like TOP X FOLLOWING, etc., and SQL will get simpler still.
WITH A AS (
  ...
),
B AS (
  ...
),
C AS (
  SELECT y FROM A
)
SELECT result FROM C
WHERE ...
and in this new-ish world, minor improvements of syntax shape (like PRQL's) don't really bring much benefit, from my POV. Yes, you can rewrite a query as an imperative-looking pipeline, but the optimizer can throw away some of it anyway.
Yes, there are things that can and should be done better in SQL, but I have yet to see a compelling case for cosmetic, syntactical differences.
There seems to be very little support for these SQL challengers despite many people saying SQL sucks and that they want a redo. Why is that?
Some guesses:
- There is some popular non-SQL query language that has gained lots of momentum that I just don't know about.
- People are more effective with SQL because they know it so no new and especially existing databases will switch. This is definitely true to some extent, but if this is the whole reason then I guess SQL is pretty good since people rely on it to be effective.
- There is support for it, but a few key decision makers are to blame (big data or the MySQL lobbyists).
- People actually don't dislike SQL that much and/or it's not that bad, and Hacker News is not representative.
- These query languages look nice but have technical or performance issues compared to SQL
I think the main reason mainline SQLs continue to be popular is that, unlike with general-purpose programming languages, the implementation of the query runner is the most important aspect, not the language spec. Take Python for instance: the official implementation objectively sucks, yet it remains extremely popular because the implementation is only part of a language's success. That won't fly for a DB engine.
The last one is definitely a big factor, combined with the fact that you still have to compile to SQL, i.e. you are limited by its limitations, and there isn't always a workaround.
The rest are also true to some degree. But I think that once a serious contender comes along, people will eventually move to it.
You have to love the honesty: "Why PRQL? For HackerNews enthusiasts: The PRQL compiler is written in Rust [and] We talk about “orthogonal language features” a lot" (https://prql-lang.org/)
So did QUEL, as implemented in Ingres circa 1976. Perhaps this is what happens when you go back to basics and take a good long look at Codd's relational calculus (or, equivalently, relational algebra), to which SQL bears only a passing resemblance.
An admirable ideal, but declarative languages always seem to devolve towards some Frankenstein imperative/declarative hybrid. We need to stop going down this path and embrace Pulumi's pattern: use existing general-purpose imperative languages to generate a declarative structure. Instead, people try to take their not-yet-mature declarative language and fit a weird general-purpose language inside it.
EDIT>> I'm not suggesting that SQL needs to be declarative, only that if a problem space would benefit from declarative structures, generate them imperatively instead.
Pulumi serves as the strongest contender to Terraform when doing IaC (infrastructure as code). Terraform attempts to be a declarative markup language (HCL) but it has a lot of weird imperative quirks due to (understandably) trying to support common complex use cases. In the end they have a clunky custom language that tries to do what general programming languages have done well forever. Pulumi doesn't re-invent the wheel, and lets programming languages do what they do best. Pulumi only really cares that the programming language generates a declarative spec that can be used to compare with your infrastructure. It's the best of both worlds.
> Pulumi only really cares that the programming language generates a declarative spec that can be used to compare with your infrastructure. It's the best of both worlds.
Fully agree! I've called this approach "imperatively generating declarations".
Things might get complicated, though, if you try to nest this approach. E.g. if you imperatively generate the structure of your deployment pipeline, which in turn, contains imperative code to generate your infrastructure spec. :) Or something like that[0].
But that's probably just the nature of the problem.
[0]: I can't come up with a fully realistic scenario right now because it's late. Maybe tomorrow!
I don't really know what you're saying, can you say it another way? "Most realistic evolution"... why is that needed? If the problem is different database engines implementing the SQL spec differently, that's not something that can be papered over with another abstraction without a lot of wrinkles.
SQL does not need to be replaced, in the same way that COBOL never needed to be replaced. It can get the job done, but we have superior tools today. Better abstractions enable more productivity.
That there are N flavors of SQL is annoying, but there are foundational design choices in the language which we are stuck with today. PRQL is quite readable to those with SQL experience and feels like a plausible next language in the space without reinventing paradigms.
To go back to my original post, my main beef is with turning a declarative language into an imperative one. If you've seen enough of these types of languages degrade, you start to see a pattern.
This is tangential, but a new query language is inevitably based on the idea that SQL is deficient in some manner (hard, not ergonomic, whatever). More interestingly, it also implies that the countless alternatives aren't good enough either.
Is there an existing query language that anyone will argue is better than SQL? I have limited exposure on this, but if SQL is really not that good then I'd expect there to be a better one at this point. All the new entrants kinda suggest that maybe SQL isn't so bad after all. Even many noSQL db's use an SQL-like language (see PartiQL). I'm not that familiar with alternatives though.
I think of it in the same way as JavaScript. Clearly, it has issues, but if you want to work on the web, it is what you have available. If you want to speak to an Oracle/MySQL/SQL Server/Postgres database, you are writing SQL. New entrants could make a new language, but then they are fighting a battle on two fronts: the novel technology plus the query language.
> I think of it in the same way as JavaScript. Clearly, it has issues, but if you want to work on the web, it is what you have available.
And, interestingly, the approach they're taking here is similar to how folks have dealt with JS: introduce a transpiled language whose paradigms are close enough to the host language to feel familiar. Reminds me of CoffeeScript, actually (although if we're being honest, I couldn't stand CoffeeScript.)
I counter the argument that just because it is still around it is the best solution. Many things stick around just because of inertia, for example, the QWERTY keyboard.
You're comparing apples to oranges. Typing is much harder to relearn due to muscle memory. New well designed programming languages become quite popular quickly and go viral when they solve the right problems for developers.
SQL is a very sound querying language, originally based on relational algebra and tuple relational calculus. Many attempts have been made in the past to come up with better syntax, but they don't go far, which might have to do with the fact that their authors realize later that SQL is just a thin layer over the mathematical concepts required to retrieve relational data, and makes certain assumptions without breaking set-theory axioms.
One common mistake I see is that developers attempting to create an SQL replacement often approach it in the same mindset as creating yet another general purpose programming language.
Edit: The above is not in reference to PRQL. It's just my anecdotal experience working on several SQL engines, building ones from scratch and working in this domain for 15 yrs.
Besides, I'm not saying it's the best because it's the most common, or even that it's the best at all. It just makes it difficult to understand the argument that SQL sucks when it's so widely used and no one can agree on an alternative. That includes new databases, which could opt for these "new and improved" variants.
I grant that there is a first-mover advantage and it's not easy to shift paradigms. That's a big factor. That's why I am honestly - not rhetorically - asking which query language is clearly better than SQL? And why do people keep making new challengers instead of backing something that's already better than SQL? There seems to be very little support for these query languages despite many people saying SQL sucks and they want a redo.
Last time this showed up on HN, I complained about it not having a CASE WHEN equivalent, which I've seen heavily used in ETL use cases. Now I see they added it!
Pretty cool and responsive. I'll go take it for a spin.
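For anyone who hasn't seen it, PRQL's `case` expression looks roughly like this (a sketch based on the current docs; the table and column names are invented):

```prql
from invoices
derive price_band = case [
  total < 10 => "low",
  total < 100 => "mid",
  true => "high",
]
```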
Forwarding this to some of my team members right now.
We use SQL for a lot of things, almost all of them implementation of customized business rules. When attempting to develop a business rule that needs to manifest as a single SQL query, your brain is forced to work in a very particular crystalline way. Many business rules are defined in procedural terms at requirements time, so having a way to implement them in the same could be beneficial.
BlaBla.io Better Tomorrow Today(tm)
Fullstack Engineer (React, Typescript, PRQL) (100% Remote)
What We're Looking For:
You’ve got 5+ years of experience in designing, building and maintaining large apps written in React, Typescript and PRQL
...
Not just in what it's capable of doing, but in form. It has warts, yeah, but over the years I've been writing it I've realized most of them are there for a reason. To me it's more than useful, it's beautiful. It's the absolute last thing I want a replacement for.
How do we ever hope to differentiate between useful evangelism and Stockholm syndrome? Either way, I'm probably learning a lot of PostgreSQL this year.
I do really wish we had at least settled on one data format instead of having enough differences between PL data and DB data to start building ORMs.
Unsurprisingly, not everything is intuitive. There are language nuances that can surprise and frustrate users but are often well thought out. It’s exceedingly rare to have a SQL query written 20 years ago that couldn’t be run today, and written today it may be the same or very similar to one written long ago. It’s hard to say the same for other programming languages.
It's been really cool seeing PRQL come to life! I'm not involved with it, but I still remember the very first HN post about a year ago where it was just a proposal. And look at it now! It's really gaining steam, I'm really excited about it!
Except for niches, one is not going to unseat SQL as the primary de-facto standard relational query language unless the replacement is significantly better, not just slightly. Unless a fad storm kicks in, being slightly better is rarely enough to unseat an established standard.
As far as SQL competitor candidates, my personal favorite is SMEQL. It's more composable and uses tables for many operations instead of dedicated commands and syntax, making it more uniform where a handful of idioms do all the work instead of lots of specialized stuff for each task, like what SQL does.
Thanks. I hadn't heard of SMEQL before. To me it structurally actually looks very similar to PRQL.
PRQL just always has `srt` as an implicit parameter (now referenceable as `this` since 0.9) and a more linear flow whereas with SMEQL I guess you could have more complex chains. Actually you can do those with PRQL as well, you just need to break them out into separate `let` expressions. On the upside, PRQL can do away with a lot of the parentheses and some punctuation which makes it a bit nicer IMHO but YMMV.
I agree with you that PRQL would need to be *significantly* better to justify the time investment in learning it. For some people that's already the case, but we'll continue working on it to try and hit that sweet spot for even more people.
There is a point in moving the FROM ahead of SELECT and getting sensible auto-complete suggestions from your IDE as you type.
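As a small sketch of what that buys you (names invented): by the time you're typing the `select`, the editor has already seen the relation and can suggest its columns.

```prql
from employees              # the IDE now knows which relation is in scope
select {id, first_name}     # ...so these column names can be auto-completed
```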
LINQ, however, attempts to solve a much bigger problem:
Use consistent code to process data and don't worry whether it is in a remote database or in a local array.
If you manage to do it in a way that
- integrates seamlessly with e.g. Python IDE and I don't have to pass and parse strings
- allows me to access graph, document and relational data
- allows me to pull and process the data from REST APIs, [O/J]DBC and straight from my RAM
- and maybe even includes deductive (PROLOG/DATALOG) features
...then you'd have something far more compelling than a nicer SQL syntax.
Having spent more time in MongoDB aggregations than I'd like to lately, I really wish they'd support this. So much more sensible than the madness they've got going on.
This sounds interesting. Could you provide a link to the MongoDB aggregations that you are referring to? They're not something I'm familiar with.
I was not trolling, I actually dislike MongoDB aggregation syntax and it seems like PRQL is something that could potentially replace it or go on top. Admittedly I only took a quick glance at PRQL though, so perhaps I misunderstood something or it's not applicable to this case.
PRQL is a breath of fresh air. Reporting languages generally miss built-in visualization and drill-down capabilities. An ideal reporting query should define not only how to seek, join, and aggregate data, but also how to visualize output and how to present details in reaction to user clicks. There are some limited efforts, like in Power BI and Splunk, but we need a standard. I wonder if the PRQL folks will address this need in the future.
Has anyone benchmarked it or the bindings? SQL tends to be embedded directly in software programs, so query building would likely happen on the fly in real-world cases.
> PRQL is being actively developed by a growing community
You can see this at https://devboard.gitsense.com/PRQL which shows a good mix of new and returning contributors. Not sure why there was a spike in the last couple of days though. Maybe the maintainers in this thread can comment on that.
Regarding the activity, there's been a lot of preparation for the 0.9 release. As to the new contributors, I don't know the reasons for that but all contributions are very welcome. It's a very welcoming community and a great way to make an impact!
Thanks for the insights. You might also want to add a "good-first-issue" topic and/or labels to your project to signal how open you are to new contributors. In the near future, I'm going to start surfacing projects with these labels/topics.
I am currently building a SQL-only web framework [1], and I was contacted by the prql folks to include the language by default in sqlpage. I have to say, I have mixed feelings about it: on one hand, it indeed feels more natural to query data with a clean pipeline of operators. On the other hand, it makes me think of the famous xkcd comic about standards [2]: there were too many data querying languages, so we made a new one.
I don't personally work with Prefect, so if you have any ideas about what you would like to see here, please comment on the issue or on Discord. We're also very open to Pull Requests and they usually get merged fast ;-)
Thanks! We just did a big refactoring so we do have some links failing in our tests. Just PR-ed a fix to this one: https://github.com/PRQL/prql/pull/3075
Totally agree. This was a high priority goal for us and @maximilianroos did a lot of work putting something together. In the end, I believe it was really fighting against how dbt is set up though as it is so tightly interwoven with Jinja2. Perhaps with the introduction of the Python dbt models, things might have changed now and it's worth taking another look.
Have completions been implemented at all? The language looks like it is designed to take that into account, but I don't see that after a cursory search
This is insanely cool. It'll be interesting to see if this gets decent adoption, and whether the query language ends up being easier to maintain than SQL.
The leak to SQL is through s-strings, which are chunks of SQL that get inlined into the resulting SQL. The long-term plan is to remove them completely, but they are needed as an escape hatch for now.
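For context, an s-string looks roughly like this (a sketch; the `version()` call is just an arbitrary piece of SQL being passed through verbatim, and the table name is invented):

```prql
from my_table
derive db_version = s"version()"
```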
I see that PRQL is compiled to SQL. How do you deal with the different SQL dialects, i.e. Snowflake/Postgres/ClickHouse? Isn't it a potentially impossible abstraction, or at least won't there be places where you'll need to use the natively supported SQL dialect (unless, of course, databases adopt PRQL in the first place, which might be great)?
If you are layering an abstraction on top of SQL to avoid the SQL, you fail that goal once something goes wrong.
The reason you fail the goal is that you end up needing the skills to troubleshoot the SQL anyway. "Wrong" here can be a bad result, bad performance, or even just bad syntax. Unless you're ready to accept being helpless and relying on the knowledge of others, you'll want to look at what's actually hitting the database to know whether it's a database problem, a PRQL problem, or a problem in your own code. To do that, you'd better be well versed in SQL, your database, and how your DB vendor has implemented their flavor of SQL. So to use this DSL truly competently, you need to know everything you'd need to know to write plain SQL, plus PRQL and how it translates into the target SQL of your application. And taken on its own, PRQL doesn't really seem any simpler than SQL from what I can see: what you gain in compositional simplicity you pay back in understanding its own quirks.
To be fair, this approach can be worth it, but I don't think it is for a standalone DSL like PRQL. For example, in .NET land there's C# and LINQ, where you get a query DSL but also a more seamless integration into C# and its paradigms. The same could be said of Elixir and the Ecto query DSL; all the caveats above remain true, but the DSL brings the value of clean integration into the code base of your app.
Needless to say, I understand why they'd try something like this, but I'm skeptical: the marketing here seems to oversell what you end up really getting.
Depends on the situation. Creating tables is a space/CPU tradeoff. Views can be managed the same way as the rest of your data pipeline.
CREATE TEMPORARY VIEW and CREATE TEMPORARY TABLE are both valid SQL.
> CTEs should be the first thing you reach for when trying to clarify the intent of the way you’re chopping up your data.
Sure but they aren't a silver bullet. Hence the "or" in my comment.
I use Redshift a lot and temporary tables are useful there for defining the sort and distribution keys when those are not favorable to the current query. Think joining several demographic tables into a single customer dimension, distributing it on the customer key then filtering the product_id distributed sales and re-distributing it on customer before joining. You can't do that with a CTE.
Views are handy when you have ugly CASE statements or other derived logic and need a convenience column that does something like string concatenation. I have a whole repository of them I deploy on top of the default system views for simplifying admin tasks.
If you are struggling to keep your SQL clean it's because you are doing too much at once. Take a step back and re-evaluate your data model. If your physical model is bad then no number of CTEs will save you.
I’d argue that the extra tables with duplicate data aren’t a space / cpu trade-off as much as they are a data management trade-off. There are cases for duplicating data in the system but I’d leave that to indexes and generated columns unless you really need to materialise it in a different structure.
Not in any system I have ever used. CTEs aren't materialized so they're not tables.
Tables (or materialized views if you prefer) are literally space/CPU tradeoffs. I have a data pipeline that combines website logs into a table before I join it with product data. If you tried to do both in one query then the DB falls over.
You seem to be coming at this from an OLTP perspective, but in OLAP, data "duplication" is the name of the game. What naively looks like wasteful duplication is actually creating coherent analytical models from disparate data sources. This leads to much simpler analytical queries.
I agree with your further example cases (they were added after my comment).
I was thinking more of the original post, where someone was complaining about readability when you have joins and subqueries, which is the use case for CTEs.
Tables, views, and CTEs are all tools that can be used to make SQL more readable. They are all valid alternatives to subqueries in JOINs which is the only thing I would say you should "never" do.
I tend to use all of them. I create tables with an optimized (typically star) schema for my purposes. My date dimensions are almost always views on top of a list of dates. I also use views to create a ~"feature store" that pre-joins the underlying star schema. I can then write "simple" analytical queries that utilize CTEs. Those CTEs tend to not have joins and they tend to only apply relevant filters so the final select is clear and concise.
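As a rough sketch of that layering (all names invented; not a real schema):

```sql
-- view: pre-join the star schema into a "feature store"
CREATE VIEW sales_features AS
SELECT f.amount, d.year, c.segment
FROM fact_sales f
JOIN dim_date d ON f.date_key = d.date_key
JOIN dim_customer c ON f.customer_key = c.customer_key;

-- analytical query: the CTE only filters, so the final select stays concise
WITH recent AS (
  SELECT * FROM sales_features WHERE year = 2023
)
SELECT segment, SUM(amount) AS total
FROM recent
GROUP BY segment;
```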
You might well know this, but one thing that's worth pointing out is that in Postgres you _can_ materialise the CTEs (though it's more to trick the planner when your use case is about making sure everything is exploded and indexed in advance).
> All queries in the WITH list are computed. These effectively serve as temporary tables that can be referenced in the FROM list. A WITH query that is referenced more than once in FROM is computed only once, unless specified otherwise with NOT MATERIALIZED.
Ah, nice to see a mention of M. I don't see it mentioned a lot on HN and get the feeling most folks here probably don't even really know about it because it's so embedded in PowerQuery and PowerBI. It's actually not a bad language, it's just quite verbose in terms of the keywords.
I actually draw on my experience with M a lot in terms of the input that I give to PRQL because for data munging M is quite a good language.
Until we did the [] to {} syntax swap in 0.9, PRQL actually started to look a lot like M because M is the only other language I know that has [] for tuples.
What holds it back most of all I feel is the prison that is the Power Query editor (I have a sizable list of small and large papercuts in mind).
Now that PBI allows experimental export to pbip, change may be on the horizon. Hoping for more fine-grained tools, a better profiler and a better editor.
Will look at PRQL when I find the time. The planned column lineage feature and type checking e.g. seem like a godsend. Good work!
This seems like just a compiler from a better language (PRQL) to SQL, while EdgeDB is actually a database that comes with not just a better language (EdgeQL), but also other great features (I love their TypeScript query builder or how incredibly frictionless the migrations are).
dbt is a tool for defining and orchestrating in-database transformation jobs. Typically you write SQL statements to define the transformations (plus some template language to pull in additional context if needed).
PRQL is an alternative language for writing queries which compiles to SQL. There is a dbt plugin which presumably allows you to define the transformations in PRQL instead of SQL, while everything is still orchestrated by dbt.
The sooner SQL is phased out in favor of something more akin to Hadley Wickham's dplyr the better. Don't particularly like the syntax of this but it's the right direction.
Kind of. dplyr, Pandas, LINQ, Kusto, etc... are big inspirations. The fact that these are so popular and have reinvented the same workflows with slightly different syntaxes to me is a sign that they capture something fundamental about how humans like to think about data transformations.
PRQL is indeed very close to dplyr. In my (biased) opinion, PRQL is actually a bit cleaner than dplyr because it is its own language and doesn't have to work as DSL inside R. The same goes for comparisons with Pandas and Polars having to work as DSLs inside Python.