Hacker News new | past | comments | ask | show | jobs | submit login
Show HN: Csql – Python lib for composeable SQL queries (github.com/akdor1154)
89 points by akdor1154 31 days ago | hide | past | favorite | 38 comments



As an analyst, I often am faced with a choice between

- write a giant unmaintainable SQL query, or

- pull everything to my PC and use pandas, to take advantage of its ability to build up results piece by piece.

I wrote this library to try and enable that piece-by-piece development approach with SQL queries, without resorting to the mental overhead of full on query builders like SQLAlchemy or Linq.

Seeking feedback - is this useful for you? is it at the right level of abstraction?


I like what you've done here. I've personally gone down the route of using SQLAlchemy to build queries. The primary reason is that SQLAlchemy lets me build up reuseable SQL elements, to encapsulate business logic, like metrics and custom dimensions in Python. In particular SQLAlchemy's hybrid expressions, that allow you to tie an SQL expression to a ORM model are super useful for doing this sort of thing. There's also first class support for CTEs, window functions and other advanced SQL features. For me there's just something "dirty" about using raw SQL in code, or even templating it, even with security concerns like injection attacks put aside.

I second the recommendation for dbt. Especially if you're following an ELT architecture where you load your data into a data warehouse and then want to transform it. However dbt is more useful for transforming existing data, or aggregating data in batches. It's not a tool for generating SQL expressions on-the-fly like your library would allow you to do.


Have you tried dbt? Even if you don't want to adapt their "all in SQL" idea, I definitely took inspiration from their Jinja templating idea and have similar libraries as you, except it uses Jinja templating to parameterize the SQL. This allows more logic to be put in the queries than just variable names!


Huh, no I haven't come across dbt. Looks interesting! Jinja'd sql scares me a bit, but to be fair the string interpolation in my lib here scares me as well :)


Practical experience is that while there are kinks, Jinja and SQL marry fairly well together. Combining this idea and snowflake we have been able to create some extremely versatile metrics reporting systems that run with zero intervention I wouldn't have imagined a year or two back!


> pull everything to my PC and use pandas, to take advantage of its ability to build up results piece by piece.

Most difficulties in data processing arise from the need to process multiple tables. It is highly difficult in SQL but it can be also difficult in pandas especially for complex analytical queries. Indeed, in pandas you still have to use the same join and groupby operations as in SQL.

An alternative to SQL, join-groupby, and map-reduce is developed in the Prosto library:

- https://github.com/prostodata/prosto Functions matter! No join-groupby, No map-reduce.

It is a layer over pandas and its main distinguishing feature is that it relies on functions and function operations for data processing as opposed to using only sets and set operations in SQL and other set-oriented approaches


Does the unit of composition have to be a complete query/CTE? Theoretically it seems like it could be any fragment of SQL, but the docs seem to imply it must be a complete query.

I created something similar recently: https://docs.racket-lang.org/plisqin/index.html. At its core, it is also a library for composing fragments of SQL, but it has some novel (as far as I know) ideas. The most notable is that joins are values (or "expressions", if you prefer) that can be returned from a procedure like any other value. I was hoping that the world would realize "that's obviously how query builders should work" and copy the approach when starting new projects, but that hasn't happened.


Interesting approach! I've been working on it from the other angle: having pandas code generate SQL. If you're interested in checking it out, happy to try and show how it would generate the query in your readme!

https://github.com/machow/siuba

One thing I was wondering about the gnarly ast stuff you mention, what about operator overloading? E.g. Q("Select a from" + subquery + "where a < 1")


If you're an analyst, I second the recommendation for dbt. Here's a podcast interview with the CEO of the company behind dbt that explains a lot of the philosophy, and I think will help you even if you don't end up using dbt: https://softwareengineeringdaily.com/2020/03/09/dbt-data-bui...


As someone in a similar position, my first impression is that this is going to be very useful in saving time coming up with big sql monstrosities. I will be following your project with interest and hope to be able to use it in the daily grind.


For those of us who didn't know what a CTE was standing for. Here is a link to the PostgreSQL documentation on Common Table Expressions (CTEs)[1], and an article on Wikipedia on Hierarchical and recursive queries, that explains the concept [2].

[1] https://www.postgresql.org/docs/12/queries-with.html

[2] https://en.wikipedia.org/wiki/Hierarchical_and_recursive_que...

edit: formatting


Also for those who didn't know, and are just getting started with them:

They're incredibly handy for a variety of reasons, but can also have unexpected performance impacts. For example, here are functionally equivalent queries written with a CTE vs an inlined/derived table:

  WITH cte_foo AS 
  (SELECT * FROM bar LIMIT 1000000)
  SELECT * FROM cte_foo LIMIT 1
and

  SELECT * FROM 
  (SELECT * FROM bar LIMIT 1000000) 
  as inlined_foo LIMIT 1
Depending on the database you're using, those two could have wildly different performance due to a concept called an optimization fence[1]. In Postgres versions 11 and below, the CTE would have truly returned/materialized 1 million rows, then the outer query would execute and ultimately return 1 row for the resultset. Whereas the second version would have been optimized such that the outer LIMIT 1 would have been pushed into the subquery and not materialized those extraneous 999,999 rows to begin with.

As mentioned in [1], Postgres 12 (and 13) have started to tackle that optimization fence within Postgres. But it's still a concern/concept to be aware of, since many databases that support CTEs have varying levels of optimization fences, and you'll want to be sure you understand what optimization/performance impacts exist for your particular database before you go down the CTE path.

[1] https://auto1.tech/postgres12-a-precious-release/


Advice on the Parameters class: implement attribute access (__getattr__) in addition to __getitem__. Compare

  f"""... WHERE created_on > {p['created_on']}} ..."""
to

  f"""... WHERE created_on > {p.created_on} ..."""
The latter is a lot more readable and typeable, especially when it's meant to be used in f-strings which place additional restrictions on the usage of quotes.


This project seems nice, but there's some detail I have trouble wrapping my head around: why is a template string represented as a lambda containing an f-string, instead of as a plain string? Namely, wouldn't the code below suffice?

    Q("""select 1 from {otherQuery}""")
What does the following code enable that the code above cannot do?

    Q(lambda: f"""select 1 from {otherQuery}""")


It's impossible to hook into Python's string interpolation system to the degree required for the first to work. JS and Julia can do it, e.g. in JS (with typescript annotations) it'd just be a matter of defining

   function Q(stringBits: string[]): Query {
      for (bit in stringBits) {
         if (bit is string) {
            add to sql
         } else if (bit is Query) {
            add bit to dependencies
            add bit.name to sql
         }
      }
   }
But this cannot be done currently in Python (see PEP-501), so I'm forcing the user to pass a lambda, which I can get the AST of, with which I can implement the machinery to do the above.

Suggestions for improvements are welcome!


I'm pretty confident f-strings use str.format under the hood, so instead of AST mambo-jumbo, you can do just query.format(var1="something") or something.


I think this is really neat. It’s seems a bit more flexible than the composition approach of SQLAlchemy, too.

I also like that the unit of composition is the CTE.


Can you say more about that?

I though SQLAlchemy core goal was exactly that. When does it fails, compared to Csql?


My desire was to be able write real SQL directly in the dialect of my database.

SQLAlchemy can't give me that, instead I'd need to learn a special SQLAlchemy-specific query builder syntax that probably won't support all the analytical functions I want to use anyway. It's really a whole different beast to csql.


From what I understand, not only SQLA API does provide support for all functions you wish to use, but it provides also raw SQL escape hatch.

What kind of limitations did you encounter in the past?


We use dbt and it works really well, but when it comes to complex logic using UDF or something other than sql makes more sense.


Very cool and useful stuff, but aren't CTEs optimization fences in postgres?

If you're dealing with largeish tables and several linked CTEs this might get too slow, and you're still stuck on optimizing your queries manually.

AFAIK subqueries do allow predicate pushdown, etc. Maybe for postgres, composability can be achieved through subqueries.

Nevertheless very cool stuff!


> aren't CTEs optimization fences in postgres?

Apparently "not all the time with PG12".

https://www.depesz.com/2019/02/19/waiting-for-postgresql-12-...


Interesting! I didn't know that, thanks!


Hmm, interesting and good to know, thanks. (and damn, subqueries are awful to read :( ) If there was appetite, it would be straightforward to make `csql` work by generating subqueries instead, might be something to add to the TODO list!


Looks like theyre headed in the right direction. ORMs are tricky to get right, when I wrote ClojureQL I looked more towards combinatorics than SQL: https://clojureql.sabrecms.com/en/examples


Your example of multiple sorts is problematic.

> SELECT * FROM (SELECT users.* FROM users ORDER BY users.id asc) ORDER BY users.id desc

A sort should never (for some value of never) appear in a subquery because it's meaningless; it can't affect the result. In tsql it's explicitly checked for and reported.

Not having a go, just fyi.

Edit: now I'm more confused by that. The bracketed subquery doesn't have a name but it's apparnetly called 'users' because the last clause is 'ORDER BY users.id desc' - but that's illegal. Again tsql correctly rejects that (once I comment out the illegal inner order by).

Edit 2: sorry about this but FYI, I'd expect the inner sort order to be disregarded anyway, and assuming this is mysql it explicitly is.

"If ORDER BY occurs within a parenthesized query expression and also is applied in the outer query, the results are undefined and may change in a future MySQL version"

https://dev.mysql.com/doc/refman/5.7/en/select.html

(Actually, what the heck is it saying? An outer order by in the presence of an inner order by is undefined overall??)


I agree in this specific example the inner sort is meaningless. There are valid cases where a subquery can use an ORDER BY, such as when a LIMIT or OFFSET is specified.

I'd go to far to say that unless a query has an ORDER BY it is likely a bug to use LIMIT or OFFSET. In the absence of an ORDER BY the database is free to sort the rows any way it likes, and this could change between versions or based on any number of implementation details. If it appears to sort determinsitically with no ORDER BY it should not be relied upon.


> valid cases where a subquery can use an ORDER BY, such as when a LIMIT or OFFSET is specified.

Oh, quite true! However the output of that subquery, despite having an order by, will not have a guaranteed order. Ordering is lost as the result set leaves the subquery. This is in the sql standard and in most if not all implementations.

> I'd go to far to say that unless a query has an ORDER BY it is likely a bug to use LIMIT or OFFSET

Wholly agreed.


You're absolutely right, the inner sort example makes no sense. I hope I'll have to time to work on this again in the near future. I use CQL every day but there's still a lot I'd like to do with it.


https://ibis-project.org/ is another great alternative. It provides a fluent/linq style api but doesn't abstract the SQL away too much.


> Ibis uses SQLAlchemy internally, but aims to provide a friendlier syntax for analytics code.

What about SQLAlchemy makes it unfriendly for analytics code? The last time I looked at it, SQLAlchemy Core had a pretty good fluent interface for writing SQL queries


This looks more like what I was expecting from the title: a jOOQ (Java eDSL for SQL with type safety by generating classes based on the schema) like library for Python.

Interesting.


Ooooh, that's going on my 'to play with' list, thankyou.


Cool! Thinking so many times in something like this, but sql it's not easy composable...


PandaSQL allows you to do SQL on Pandas dataframes.[0]

[0] https://github.com/yhat/pandasql


This library is badly abandoned. I've been looking for a better alternative, but ended up writing my own relatively simple and somewhat inefficient code.

There's some serious potential in combining Pandas and DuckDB[1], which has an ability to efficiently transfer query results into DataFrames.

[1] https://duckdb.org/docs/api/python


Trust me. The devil is in the details :)




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: