Windyquery: A non-blocking Python PostgreSQL query builder (pypi.org)
85 points by asiftr on May 10, 2021 | 38 comments



The amount of code duplication in this language community after the introduction of asyncio is staggering. Gevent should have been adopted into the core language, but instead people are re-inventing 'sqlalchemy', network libraries, and entire web frameworks for the sake of two new keywords. Not to mention that a pure query builder should not be concerned with IO mode in the first place.


It's much more than "2 new keywords": asynchronous programming is a different way of thinking about program flow and architecture.

The thing is, your program is only as asynchronous as the lowest I/O level. It doesn't really make much sense to have your business logic take advantage of this, if your program will block on lower levels.

So your DB connection, caching, HTTP requests, disk I/O, etc. must be asynchronous to really reap the benefits. Since Python has been synchronous for something like 20 years, there's a lot of historical code that simply won't play nicely with the asynchronous approach.

Hence the brand-new libraries.


> The thing is, your program is only as asynchronous as the lowest I/O level. It doesn't really make much sense to have your business logic take advantage of this, if your program will block on lower levels.

it's called non-blocking sockets, precisely the thing that 'gevent' brings to the table without breaking the conventions of other existing interfaces, contrary to what 'asyncio' does - https://stackoverflow.com/questions/42448664/async-generator...

And no, it's not only the lowest I/O level, since there are groups, queues, chains, and other primitives [1] that work on higher levels without special keywords too.

[1] https://www.gevent.org/api/index.html
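
For instance, a rough gevent sketch (fetch() and its arguments are made-up stand-ins for real blocking I/O) using a pool as one of those higher-level grouping primitives, with no special keywords anywhere:

    import gevent
    from gevent.pool import Pool

    def fetch(url):
        # ordinary synchronous-looking code; with monkey-patching,
        # real socket calls would yield to other greenlets here
        gevent.sleep(0.1)  # stand-in for network I/O
        return url

    pool = Pool(10)                               # concurrency-limited group
    jobs = [pool.spawn(fetch, u) for u in ("a", "b", "c")]
    gevent.joinall(jobs)                          # wait on the whole group
    print([job.value for job in jobs])            # ['a', 'b', 'c']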


Yes, I remember the discussion before AsyncIO, and how Python could have morphed into the Golang-style CSP model. I don't think it was to simply adopt gevent, but rather to move the core into massive multithreading. Now we seem to have the worst of all worlds. The rejection of stacklets because Python runs on constrained devices didn't seem strong then either.


> but instead people are re-inventing 'sqlalchemy'

SQLAlchemy's latest version is asyncio-compatible now: https://docs.sqlalchemy.org/en/14/orm/extensions/asyncio.htm...

> network libraries and entire web frameworks for the sake of two new keywords

The trend in the community now is SansIO: https://sans-io.readthedocs.io

You create entire libs without the IO parts, dealing only with the business logic.

Then you let the I/O libs deal with the rest. This allows modern projects to provide both a sync and an async API with the same protocol implementation. E.g.: https://github.com/encode/httpx
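
A toy sketch of the sans-IO idea (LineProtocol and the helper functions are made up for illustration, not taken from any real sans-io project): the protocol logic only turns bytes into lines, and thin sync/async shells do the actual I/O.

    class LineProtocol:
        """Pure protocol logic: bytes in, complete lines out. No I/O."""
        def __init__(self):
            self._buf = b""

        def feed(self, data):
            self._buf += data
            *lines, self._buf = self._buf.split(b"\n")
            return lines

    def read_lines_sync(sock, proto):
        while True:
            data = sock.recv(4096)          # blocking I/O lives only in this shell
            if not data:
                return
            yield from proto.feed(data)

    async def read_lines_async(reader, proto):
        while True:
            data = await reader.read(4096)  # e.g. an asyncio StreamReader
            if not data:
                return
            for line in proto.feed(data):
                yield line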

Also, I'm pretty glad gevent has not been adopted:

- it's very hard to know if you are doing something wrong with gevent. Or doing something with gevent at all.

- gevent's magic monkey-patching broke stuff in unpredictable ways, and the solution was to disable part of it: https://stackoverflow.com/questions/8678307/gevent-monkeypat...

- gevent didn't play well with stdlib threads and multiprocessing (it had its own thread pool, its own subprocess wrapper, and no multiprocessing story), while asyncio has a very good interoperability story

- the asyncio pluggable loop meant we didn't need a perfect solution out of the box. Thanks to that design, we have been able to integrate the Node.js loop, as with the uvloop project, giving birth to stuff like ASGI, FastAPI and uvicorn. Asyncio can now reach async performance comparable to Go. (gevent does that, but forces you to, and having libuv in the stdlib is not something the core devs would want.)

- gevent is a 5 MB package. That's 1/6 of the ENTIRE Python installer.

- the coroutine-based design means we have a standard API for async, no matter the backend or implementation. Which means solutions like trio were able to emerge, and the core devs are now considering including trio-like scoping in the next asyncio iteration.

There are plenty of problems with asyncio though:

- create_task is basically a goto. See trio for a solution.

- calling coro() does not run the coroutine. In JS it does, and returns a future you can await, or not. This makes it very confusing to beginners, who get lost in the await/return/create_task dance (see the sketch after this list).

- there is no concurrency-limiting primitive, cancellation is hard, and back-pressure mitigation is non-existent.
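
Here is a minimal sketch of that coro() point, nothing library-specific:

    import asyncio

    async def coro():
        print("running")
        return 42

    async def main():
        c = coro()                        # nothing runs yet: just a coroutine object
        t = asyncio.create_task(coro())   # this one is scheduled on the loop
        print(await c, await t)           # only now do both actually complete

    asyncio.run(main())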

I was very skeptical about asyncio, but after 3.7, it became quite nice. asyncio.run() made all the difference, ergonomically, and when dealing with errors.


> gevent is a 5 MB package. That's 1/6 of the ENTIRE Python installer.

Oh no, that means I'll need 2 more floppies to distribute my software!


If the core devs accepted such a budget for one lib, given that the stdlib is composed of a lot of them, each Python install would be enormous.

Not to mention Python runs on constrained systems such as MicroPython, which features coroutines.

HN, the place where people complain about how Electron is bloated, then ask you to copy that.


> - the coroutine-based design means we have a standard API for async, no matter the backend or implementation. Which means solutions like trio were able to emerge, and the core devs are now considering including trio-like scoping in the next asyncio iteration.

The structured API for concurrency that Trio provides, in principle, doesn't depend on that keyword-based approach - it's just what the author chose to use for the implementation. It's definitely not something that asyncio enabled for its users that wasn't possible without it. Here's another structured API for concurrency that doesn't depend on keywords [1]. It can be implemented with gevent.

[1] https://hackage.haskell.org/package/async-2.2.3/docs/Control...


No, but it does mean that primitives in trio can be awaited in the same way as any async primitive, instead of using a callback.

I get that scoping through nurseries is the trio selling point, but you do need the ability to run something "before" and "after" a lone async call, and I don't want to have to write a "with" block to wrap every single thing that I wish to await.

By having "await" as a universal construct, we allow trio (or anything really) to offer an additional level of granularity - scoping a group of async calls - without losing the smaller one: scoping a single call.


> SQLAlchemy's latest version is asyncio-compatible now.

It's been compatible for 5+ years already; the Core API was driver-independent even in 2015.

> Also, I'm pretty glad gevent has not been adopted:

When I say 'adopted' I mean integrated into the language runtime so that the entire stdlib is aware of it. Some of your points consider the state of gevent 10 years ago; that's not a benchmark for a conclusion on how successful a potential integration into the core language might be, and none of them point at inherent issues with the implementation that make it unusable compared to asyncio.


> that's not a benchmark for a conclusion on how successful a potential integration into the core language might be

That's fair.

I'm also biased by the fact I really like to know when my code is suddenly going into async territory.

I like the JS model for this: it's explicit enough that there is little surprise (promises are now a standard, so if something returns a promise, it's async), but implicit enough that I don't have to care (no loop to set up, no coroutine instantiation between calling and scheduling). And it aged well: futures can now just be awaited, and you can simply keep the callback API in non-async functions.

gevent is too implicit for my taste, I guess. Now you could argue that we could make gevent calls explicit, but then it would be not that much different from asyncio.

I would really be happy if the community would settle in something like:

    @autorun
    async def coro(): ...
Then:

    coro() # this calls create_task(), and binds it to the upper nursery
I think it would strike the right balance from a usability point of view.

As for the interoperability issue, it's a pretty tough nut to crack.
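
A rough sketch of how such an @autorun could be spelled with today's asyncio (the decorator is hypothetical, and it only covers the "schedule on call" part, not the nursery binding):

    import asyncio
    import functools

    def autorun(func):
        # hypothetical: calling the async function schedules it right away,
        # returning an awaitable Task instead of a bare coroutine object
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            return asyncio.create_task(func(*args, **kwargs))
        return wrapper

    @autorun
    async def coro():
        await asyncio.sleep(0.1)
        return "done"

    async def main():
        task = coro()      # already running, JS-promise style
        print(await task)

    asyncio.run(main())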


What's the alternative with Gevent? Passing callbacks everywhere? Is that somehow better than async/await syntax?

Libraries will still have to be re-written and public interfaces will change. Might as well do it right, and not abandon the concept of functions having meaningful return values.


> What's the alternative with Gevent? Passing callbacks everywhere?

You build on callbacks to create futures and queues. That’s how you do async programming in any framework. AsyncIO works on top of this and so does Gevent. The only real difference between the two is that asyncio explicitly types functions that do IO. The fundamental advantage is little more than a typing decorator.
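
As a minimal illustration (delayed_callback_api is a made-up stand-in for some callback-style library), bridging a callback to an awaitable future is a few lines in asyncio:

    import asyncio

    def delayed_callback_api(loop, callback):
        # made-up callback-style API: calls back with a result later
        loop.call_later(0.1, callback, "result")

    async def main():
        loop = asyncio.get_running_loop()
        fut = loop.create_future()
        delayed_callback_api(loop, fut.set_result)  # the callback fills the future
        print(await fut)                            # awaiting suspends until it does

    asyncio.run(main())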

The reason that JavaScript was such a crapshoot of callbacks was that the majority of JavaScript developers had no idea what a future/promise was.


Geez, yes, absolutely. Python went from being my favorite language to being somewhere down there with Javascript, largely thanks to the ecosystem disaster async await brought on the language.


I keep waiting for the Python community to come to its senses and set asyncio on fire and start again...again. Everything that people do with asyncio is simpler and easier to follow using threads. The Thread/Process pool executors from concurrent.futures were almost good interfaces, and then asyncio took a giant step backwards.

"I want this to happen in the background without blocking" should never have turned into this terrible dance of rewriting everything using new magic keywords so that you can then wedge it into an asyncio event loop. It's almost like developers have no sense of UX elegance.


With asyncio plus multiple processes you can achieve more than with multiple processes alone.

> "I want this to happen in the background without blocking"

That is not the problem asyncio is solving.


> That is not the problem asyncio is solving.

It is, actually! All asynchronous code amounts to having some things occur in the background while other things occur. asyncio just does it with a clumsy interface that breaks everything that already exists.


Async is concurrent, but not parallel. Nothing is happening "in the background". In contrast with threads/processes where it is parallel.


> Nothing is happening "in the background".

Your IO is (that's why the library is called "async io")!

The entire reason the library exists is that all IO blocks at the lowest level, even so-called non-blocking IO, because computer science abstractions don't change physics, which is why sockets require buffers. At some point in the chain your process is asking the OS to do some IO, and at that point the process gets a choice between twiddling its thumbs or doing something else and coming back to check on the result later.

The mechanical difference between using coroutines vs threads for IO, under the hood, is the process polling sockets itself vs letting the OS do it. The OS allocating chunks of time to checking different threads that are spinning their wheels waiting for an IO response that may or may not have arrived yet on an OS socket buffer is fundamentally equivalent to the process allocating chunks of time to checking different coroutines that are spinning their wheels waiting for an IO response that may or may not have arrived yet on an OS socket buffer. The only difference is which layer does the spinning: the OS's scheduler or the process's event loop (which of course itself runs at the whim of the OS's scheduler).

Are there performance differences between those approaches? Yes, because creating OS threads has more overhead, since they're fundamentally more powerful, but that's a technical detail that should be hidden, not exposed. Introducing an interface that requires new magic keywords and new behavior in the underlying requests, so that everything needs to be rewritten, is terrible when the actual desire is _always_ some grammatical variation of "I don't want every IO request to prevent using the CPU until it completes".
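
To make that concrete, here is a rough sketch of the same "overlap the waiting" program written both ways (slow_call and the sleeps are stand-ins for real IO round trips); both finish in roughly one round trip instead of five:

    import asyncio
    import concurrent.futures
    import time

    def slow_call(n):
        time.sleep(0.1)            # blocking stand-in for an IO round trip
        return n

    # Thread version: the OS scheduler interleaves the waiting threads.
    with concurrent.futures.ThreadPoolExecutor() as pool:
        print(list(pool.map(slow_call, range(5))))

    # Coroutine version: the event loop interleaves the waiting coroutines.
    async def async_slow_call(n):
        await asyncio.sleep(0.1)   # non-blocking stand-in for the same IO
        return n

    async def main():
        print(await asyncio.gather(*(async_slow_call(n) for n in range(5))))

    asyncio.run(main())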


I have no idea why the builder part should be async. Either that, or it's named badly, which usually comes from an unclear view of the problem domain.


I was thinking the same. It seems like Windyquery isn't really just a query builder; it also executes the query and parses it.


Because this is more than a builder from the looks of it - it’s executing and fetching data too. (I thought the same thing)


So it's a database API?


Yes.


Async ACID DB interfaces are so weird to me. I guess the "async all the things!" trend is frustrating since I'm brought in when things go boom, and folks that couldn't deal with parallelism are all up in concurrency and baffled by the result.

I get going for throughput over transactional performance. But I don’t find that to be a practical goal in most business systems.

In my experience every dev team that goes this route thinks “async is faster”, then gets confused when their latency shifts dramatically under load. Or their CPU usage jumps from 20% to all the CPU they can get.

It’s especially odd with the (overdone) microservices trend. A bunch of variable latency calls with a ton of network I/O overhead required to service 1 customer request. Shove those into tiny containers without any tuning and wooo!

And probably exacerbating this is the trend to tack on afterthought async frameworks to languages like Python and Java, vs systems designed with it originally like Erlang and even Go.

Folks familiar with Java or Python are suddenly lost troubleshooting. Stack traces, thread dumps, etc. become confusing. It’s not impossible, but the majority of developers are procedural themselves. I’ve only known a handful (out of thousands of contractors and employees) that were familiar with syscalls or could find useful info in a JFR.

All that said, I’m looking at “big old company” workloads. Python for analytics, where it’s a friendly wrapper around C/FORTRAN, and Java for most everything else. And most async use is what I would call “YouTube driven development.”

Ok, I’ve gone full “kids these days.”

It’s interesting stuff, for sure. I just wish there were giant disclaimers around it for every library or framework. Troubleshooting async performance issues is my new version of troubleshooting C/C++ memory issues of old.


Avoiding async in order to avoid parallel database calls literally only makes sense if you're choosing between blocking/non-blocking on a single process on a single server instance. The second you scale up to use multiple processes/threads and/or multiple servers, you're going to be dealing with concurrent db calls regardless of whether or not your code is written as blocking or non-blocking.

So not only are you saying that you insist on writing blocking code in order to avoid parallelism, but you're also saying that you'll limit yourself to a single execution process. Assuming a 50ms round trip to the db, you're basically setting a best case scenario of ~20 requests per second.


I'm not sure why the example starts with all those verbose calls to run_until_complete(). Why not do this?

    import asyncio
    from windyquery import DB  # assuming windyquery's documented import path

    async def main():
    
        db = DB()
        await db.connect('db_name', {
            'host': 'localhost',
            'port': '5432',
            'database': 'db_name',
            'username': 'db_user_name',
            'password': 'db_user_password'
        }, default=True)

        await db.connect('other_db_name', {
            'host': 'localhost',
            'port': '5432',
            'database': 'other_db_name',
            'username': 'db_user_name',
            'password': 'db_user_password'
        }, default=False)

        db.connection('other_db_name')

        db.default = 'other_db_name'

        await db.stop()

    asyncio.run(main())
Am I missing something? It's using asyncio.run() in other examples, so it assumes 3.7+ anyway.


They differ slightly in semantics; asyncio.run() always creates a new event loop to run the task, while run_until_complete() doesn’t (unless you call asyncio.new_event_loop() yourself, of course). I don’t know if this is the reason behind the decision, but the two are not equivalent (and there are cases where asyncio.run() wouldn’t work).
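
Roughly, the two shapes look like this (illustrative only):

    import asyncio

    async def work():
        await asyncio.sleep(0.1)

    # asyncio.run(): creates a fresh event loop, runs the coroutine, then closes the loop.
    asyncio.run(work())

    # run_until_complete(): drives whatever loop you hand it, and the loop
    # stays usable afterwards for further run_until_complete() calls.
    loop = asyncio.new_event_loop()
    loop.run_until_complete(work())
    loop.run_until_complete(work())
    loop.close()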


Yes but await doesn't, which is what I replaced run_until_complete() with here.

The asyncio.run() is just there to pretend I'm running the lib snippet in code that has already started the loop, which the doc seems to do.


Is there any point in having a query builder API in a dynamic language?

I can see the point in a typed language where you can ensure the query is correct at compile time.

But in a dynamic language, you never know if the query is correct until the interpreter hits that line in the program at runtime.

This is true with both raw SQL (error returned from server) and the query builder API (error returned from interpreter).


> Is there any point in having a query builder API in a dynamic language?

Yes, it eases the creation of queries, ensuring both escaping of parameters and quoting of columns and fields; it also allows the easy creation of dynamic queries based on lists of parameters, and facilitates the naming of tables and fields.

> I can see the point in a typed language where you can ensure the query is correct at compile time.

Given that most query builders allow arbitrary parameters (such as table and column names), you can't actually ensure the query is correct at compile time, regardless of the type of language.


> Yes, it eases the creation of queries, ensuring both escaping of parameters and quoting of columns and fields; it also allows the easy creation of dynamic queries based on lists of parameters, and facilitates the naming of tables and fields.

But in Python all these are checked at runtime.

Just like it is possible to not quote columns and fields in raw SQL, it is possible to misuse the query builder API and only find out at runtime on the same line as you would with raw SQL.

Perhaps the dynamic queries may be easier with the query builder, but you still find out about issues at the same time as raw sql/string templates.

> Given that most query builders allow arbitrary parameters (such as table and column names), you can't actually ensure the query is correct at compile time, regardless of the type of language.

You cannot check that the query is correct for a given schema (missing tables/cols etc), but I think many typed languages would allow you to ensure the AST is correct at compile time.


> Perhaps the dynamic queries may be easier with the query builder, but you still find out about issues at the same time as raw sql/string templates.

As with everything, it's a tradeoff and a matter of taste. SQL query builders give you the flexibility of SQL, without relying on string concatenation directly, and make sure all the values are passed in order. Also, they usually can be extended to support multiple databases/adapters (imagine specifying values as $1, $2... vs ?,?, quoting identifiers using " or `, etc). They also usually fit better than plain strings in heavily programmatic contexts (imagine a query with different fields and filters depending on the value of a given field).
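
As a toy illustration of that programmatic case (build_where is a hand-rolled, hypothetical helper, not any particular library's API):

    def build_where(filters):
        # filters: dict of column -> value; a filter is applied only when its value is set
        clauses, params = [], []
        for col, val in filters.items():
            if val is not None:
                params.append(val)
                clauses.append(f'"{col}" = ${len(params)}')  # PostgreSQL-style placeholders
        return " AND ".join(clauses) or "TRUE", params

    where, params = build_where({"status": "active", "age": None, "name": "Tom"})
    # -> ('"status" = $1 AND "name" = $2', ['active', 'Tom'])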

> You cannot check that the query is correct for a given schema (missing tables/cols etc), but I think many typed languages would allow you to ensure the AST is correct at compile time.

There are usually 2 types of query builders - SQL query builders, and DSL query builders. DSL ones are usually integrated with an ORM or something similar, and those are usually abstract enough to guarantee that the generated SQL and passed types are syntactically correct, even if invalid (e.g. mentioning a column or a table that doesn't exist). SQL query builders (at least the ones I've seen) don't use an AST, as their purpose is to map methods to SQL chunks. They basically keep a list of different parts to assemble and then just build from those lists when generating the actual SQL. Some may provide object mapper capabilities, and in that case, some type checking may be performed. "May" in the sense that e.g. a 14th-century date is a valid date, but SQL Server will most certainly reject it, so in the end there are no "compile time guarantees" in this case.


> Given that most query builders allow arbitrary parameters (such as table and column names), you can't actually ensure the query is correct at compile time, regardless of the type of language.

JOOQ [0] does a great job of providing a type-safe query builder that can guarantee correct queries at compile time. It does also accept arbitrary strings for names and SQL components, which will break that guarantee, but apart from that you can extract your database's schema into typed classes and use those to run your queries.

I really like it, but it is definitely not a replacement for an ORM.

[0] https://github.com/jOOQ/jOOQ


That is basically the description of an object mapper, with all the guarantees of an object mapper :). It seems that if you actually use the query builder as such, no guarantees exist.

I'm pretty picky regarding query builders and ORMs, to the extent of having written several of them over the years, in different languages (both dynamic and strongly typed, unfortunately closed-source). I'm a strong advocate of schema-first design, and usually a query builder will let you design your queries explicitly while having some internal behaviors (such as string concatenation, identifier quoting and automatic in-order separation of parameters and values to be bound) taken care of. As good examples of this, I'd mention golang's goqu (https://github.com/doug-martin/goqu) and - to some extent - C# SqlKata (https://sqlkata.com/). Following my frustrations with Python ORMs, I built my own toy project, sort-of-in-beta, called rickdb (https://github.com/oddbit-project/rick_db).


Also: composability and reusability.


Kudos, gorgeous. Would be great to have a generic version that can connect to other databases too. Perhaps this is the start of an async interface that can be used instead of dbapi?

One thing that stood out was the overloading: using functions sometimes and SQL snippets other times. For example,

    .select('column').where('id', 1).where('name', 'Tom')
or

    .select('column AS alias').where('id = ? AND name = ?', 1, 'Tom')
This means the framework has to pass through the snippet as-is, rather than applying escaping etc.

Although SQL injection is usually through values, it can be through other parts of the string too, e.g. in many user query builder UIs the column names themselves may be tainted.
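
A contrived illustration of that failure mode (quote_ident is a hand-rolled helper here, not any library's API):

    def quote_ident(name):
        # double-quote the identifier and escape embedded quotes,
        # so whatever the UI sends can never terminate the identifier
        return '"' + name.replace('"', '""') + '"'

    column = 'name; DROP TABLE users; --'   # tainted "column name" from a UI

    print(f"SELECT {column} FROM t")               # snippet passed through: injected SQL survives
    print(f"SELECT {quote_ident(column)} FROM t")  # quoted: just a weird, non-existent column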

I'm generally a fan of all or nothing: either go with building your own string, or else use helper functions for everything, e.g.

    .select('column').as('alias').where('id').equals(1)...
This means the framework can always correctly escape column names etc.

I ran into problems like this when doing an InfluxDB connector for Presto. The Influx Java client library mixed this up so it couldn't sanitise column names. I ended up just writing my own query builder.


Shameless plug - you can have a look at my pet project https://github.com/oddbit-project/rick_db; it's explicitly not async, but the query builder is detached from the DB querying logic, and can be used separately.



