
Thinking Psycopg3 - fanf2
https://www.varrazzo.com/blog/2020/03/06/thinking-psycopg3/
======
jgehrcke
At Mesosphere/D2iQ we used psycopg2-backed SQLAlchemy to interact with
CockroachDB.

Daniele, one point that I'd like to get your opinion on, and that's maybe
worth considering for API development around psycopg3: we found it difficult
to implement timeout control for a transaction context. Consider application
code waiting for a transaction to complete. The calling thread is in a
blocking recv() system call, waiting for the DB to return some bytes over TCP.
My claim is that it should be easy to error out from here after a given amount
of time if the database (for whichever reason) does not respond in a timely
fashion, or never at all (a scenario we sometimes ran into with early versions
of CockroachDB). Certainly, ideally the database always responds timely or has
its internal timeout mechanisms working well. But for building robust systems,
I believe it would be quite advantageous for the database client API to expose
TCP recv() timeout control. I think when we looked at the details back then we
found that it's libpq itself which didn't quite expose the socket
configuration aspects we needed, but it's been a while.
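
For illustration, a rough sketch of the kind of control I mean, built on
psycopg2's green mode (set_wait_callback is real psycopg2 API, but the
deadline callback itself and the timeout value are hypothetical): a wait
callback that polls the socket with a deadline instead of blocking in
recv() indefinitely.

    import select
    import psycopg2
    import psycopg2.extensions

    # Sketch only: a green-mode wait callback that raises instead of
    # blocking forever if the server stops responding.
    def wait_with_deadline(conn, timeout=10.0):
        while True:
            state = conn.poll()
            if state == psycopg2.extensions.POLL_OK:
                return
            elif state == psycopg2.extensions.POLL_READ:
                ready, _, _ = select.select([conn.fileno()], [], [], timeout)
                if not ready:
                    raise TimeoutError("no response from server")
            elif state == psycopg2.extensions.POLL_WRITE:
                _, ready, _ = select.select([], [conn.fileno()], [], timeout)
                if not ready:
                    raise TimeoutError("could not send data to server")
            else:
                raise psycopg2.OperationalError("bad poll state: %r" % state)

    psycopg2.extensions.set_wait_callback(wait_with_deadline)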

On the topic of doing "async I/O" against a database, I would love to share
Mike Bayer's article "Asynchronous Python and Databases" from 2015:
https://techspot.zzzeek.org/2015/02/15/asynchronous-python-and-databases/
-- I think it's still highly relevant (not just to the Python ecosystem)
and it's pure gold. Thanks, Mike!

~~~
dvarrazzo
Hi! Much of the difficulty comes from psycopg2 using blocking libpq calls.
I'm thinking of avoiding them altogether in psycopg3 and only using async
calls, with Python in charge of blocking.

Note that you can obtain a similar result in psycopg2 by going into green
mode and using select as the wait callback (see
https://www.psycopg.org/docs/extras.html#psycopg2.extras.wait_select).
This trick makes it possible, for instance, to stop a long-running query
with Ctrl-C.
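
A minimal sketch of that setup (the DSN is a placeholder):

    import psycopg2
    import psycopg2.extensions
    import psycopg2.extras

    # All libpq waits now go through select(), with Python in charge of
    # blocking, so a KeyboardInterrupt can get in.
    psycopg2.extensions.set_wait_callback(psycopg2.extras.wait_select)

    conn = psycopg2.connect("dbname=test")  # placeholder DSN
    cur = conn.cursor()
    cur.execute("select pg_sleep(60)")  # Ctrl-C now interrupts the wait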

You can also register a timeout on the server to have it terminate a query
after a given time. I guess the two approaches are complementary. In the
first case you no longer know the state of the connection: maybe it should
be cancelled or discarded, and we should work out what to do with it. A
server timeout is easier to recover from: just rollback and off you go again.
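
For instance (a sketch; psycopg2.errors requires psycopg2 >= 2.8, and the
DSN is a placeholder):

    import psycopg2

    conn = psycopg2.connect("dbname=test")  # placeholder DSN
    cur = conn.cursor()

    # Ask the server to abort any statement on this session that runs
    # longer than five seconds.
    cur.execute("SET statement_timeout = '5s'")
    try:
        cur.execute("select pg_sleep(60)")
    except psycopg2.errors.QueryCanceled:
        conn.rollback()  # the connection is still usable afterwards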

------
Hitton
I'm again reminded how much we depend on, and how much we owe, these unnamed
heroes of the modern age who maintain and develop the software we take for
granted in our everyday lives. Kudos to the author.

And it's interesting to find out, after years of using it, that psycopg2
doesn't actually use prepared statements underneath.

------
skeletal88
The one thing I dislike about the Python DBAPI spec is the "cursor". Maybe
the idea was good when the creators of the spec came up with it, but no other
database API exposes a cursor so prominently to its users. When someone new
to databases first learns about the Python DBAPI, they will probably think
that cursors are something really important, but really, they aren't.
Developers just want to fire queries at the database; the cursor is just
something between the db connection and the query result. Maybe it is needed
to keep the state of query objects, but it should have some other name.

~~~
quietbritishjim
I don't understand this comment. A cursor is just an iterator over the
results. Would you prefer "iterator" as a name? A concept like it exists in
basically every database API (for row-oriented databases) so that you can
start processing results as soon as they start coming in rather than having to
load all results into application memory first.
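
In psycopg2 specifically, that streaming behaviour needs a named
(server-side) cursor; a default cursor fetches the whole result set at
execute(). A sketch, with placeholder names:

    import psycopg2

    conn = psycopg2.connect("dbname=test")  # placeholder DSN

    # A named cursor streams rows from the server in itersize-sized
    # batches instead of loading everything up front.
    with conn.cursor(name="stream_cur") as cur:
        cur.itersize = 2000  # rows per network round trip
        cur.execute("select * from mytable")
        for row in cur:
            print(row)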

~~~
luhn
I think OP has a point. He's not complaining about the functionality (of
course a DB client needs to iterate through results) but that there's no
reason a user needs to understand cursors for running simple DB queries. I
should be able to do:

    results = conn.execute("select * from mytable").fetchall()

or

    query = conn.execute("select * from mytable")
    for row in query:
        print(row.id)

You can do the same with cursors, but it's extra mental overhead and a
somewhat awkward API.

Plus the name conflicts with Postgres' CURSOR, which confused me when I first
started. So, yes, I would prefer "iterator" :)

~~~
skeletal88
Yes, that's what I meant, it doesn't have to be exposed to the user, I just
want to execute a query and get the results. If needed then maybe I should be
able to create a cursor to go through the results. It's just that other
database access APIs don't expose and force the developers using the API to
use a cursor, so it's mildly annoying in Python.

------
mangecoeur
If we are doing wishlists: I would love to see better support for fast
creation of Arrow/Dataframes/numpy objects from query results (possibly
through the COPY support, maybe through some kind of plugin or compile-time
optional support). See for example Turbodbc
(https://turbodbc.readthedocs.io/en/latest/pages/advanced_usage.html#apache-arrow-support).

At the moment we are forced to have query data converted from the binary
stream into (boxed) Python objects, then unbox them back into arrays; this
can add a lot of overhead. I did some very rough experiments in Cython and
got a 3x speedup for bulk loading query results into arrays.
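
To make the overhead concrete, a sketch of the double conversion (the DSN,
table, and column names are placeholders):

    import numpy as np
    import psycopg2

    conn = psycopg2.connect("dbname=test")  # placeholder DSN
    cur = conn.cursor()
    cur.execute("select value from measurements")  # placeholder query

    # Step 1: the driver boxes every value as a Python object...
    rows = cur.fetchall()
    # Step 2: ...which we immediately unbox again into a C array.
    arr = np.fromiter((r[0] for r in rows), dtype=np.float64, count=len(rows))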

------
PudgePacket
Off topic:

The comments system on the site is awesome, can't believe I haven't seen it
before!

[https://utteranc.es/](https://utteranc.es/)

~~~
nathancahill
Did a triple-take when I scrolled to the bottom and saw... GitHub. Super cool.

------
willvarfar
The first part, Query parameters adaptation, is perhaps problematic. Postgres
allows only 32K parameters (they use a signed short? really?), which is
relatively few. This limits the number of rows you can INSERT in a batch.
Expanding client-side neatly side-steps this.
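
Client-side expansion, sketched with psycopg2's mogrify (DSN, table, and
data are placeholders): the values end up in the SQL text itself, so no
bound parameters are sent and the limit doesn't apply. This is essentially
what psycopg2's execute_values helper does.

    import psycopg2

    conn = psycopg2.connect("dbname=test")  # placeholder DSN
    cur = conn.cursor()
    rows = [(1, "alice"), (2, "bob")]  # placeholder data

    # Interpolate all rows into one INSERT statement on the client.
    values = ",".join(
        cur.mogrify("(%s, %s)", row).decode("utf-8") for row in rows
    )
    cur.execute("INSERT INTO mytable (id, name) VALUES " + values)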

Also: with copy, is there a way to ignore, or replace, duplicates?

~~~
ants_a
For batch inserts it makes more sense to parse an insert statement for one
row and then send bind-execute messages for each row. That would be much
more efficient than a huge insert statement with interpolated values.

Copy doesn't support on-conflict handling yet, although there doesn't seem
to be a major reason why it couldn't. A workaround is to copy into a
temporary table and do an insert-select from there.
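
Roughly like this (a sketch; the DSN, table, and file names are
placeholders, and mytable is assumed to have a primary key for the
conflict check):

    import psycopg2

    conn = psycopg2.connect("dbname=test")  # placeholder DSN
    cur = conn.cursor()

    # Stage the rows with COPY, then upsert from the staging table.
    cur.execute("CREATE TEMP TABLE staging (LIKE mytable INCLUDING ALL)")
    with open("rows.tsv") as f:  # placeholder data file
        cur.copy_expert("COPY staging FROM STDIN", f)
    cur.execute(
        "INSERT INTO mytable SELECT * FROM staging ON CONFLICT DO NOTHING"
    )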

~~~
willvarfar
What do you mean by bind-execute? I'm guessing you mean sending multiple
statements in a string where the first PREPAREs and the rest EXECUTE? I
haven't benchmarked anything like that.

I've done a lot of exploring, profiling, and comparing of the best ways to
do big upserts into a variety of DBs. For Postgres, I've messed around with
CTEs but found INSERT ... ON CONFLICT ... UPDATE to be fastest. And inserting
large numbers of rows in one statement is a clear win over individual
statements round-tripping.
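
For reference, the multi-row form I mean, via psycopg2's execute_values
(DSN, table, and data are placeholders):

    import psycopg2
    from psycopg2.extras import execute_values

    conn = psycopg2.connect("dbname=test")  # placeholder DSN
    cur = conn.cursor()
    rows = [(1, "alice"), (2, "bob")]  # placeholder data

    # One round trip per page_size rows; ON CONFLICT does the upsert.
    execute_values(
        cur,
        "INSERT INTO mytable (id, name) VALUES %s"
        " ON CONFLICT (id) DO UPDATE SET name = EXCLUDED.name",
        rows,
        page_size=1000,
    )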

~~~
ants_a
The PostgreSQL wire protocol has a concept called the extended query
protocol, which splits query execution into three steps: parse, bind, and
execute
(https://www.postgresql.org/docs/devel/protocol-flow.html#PROTOCOL-FLOW-EXT-QUERY).
If you send one parse message and multiple bind and execute messages, you
can insert multiple rows while parsing the statement only once. Currently
psycopg2 doesn't support that, which is the first thing the blog post was
talking about.

If working within the confines of psycopg2, you could do something like
insert .. on conflict .. select * from unnest(%s) rows(a int, b text, ...)
and pass in an array of row types. Parsing a huge array is cheaper than
parsing a huge SQL statement.
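
Concretely, something like this sketch (DSN, table, and data are
placeholders; it passes parallel arrays rather than an array of row types,
which avoids having to define a composite type):

    import psycopg2

    conn = psycopg2.connect("dbname=test")  # placeholder DSN
    cur = conn.cursor()
    rows = [(1, "alice"), (2, "bob")]  # placeholder data
    ids, names = zip(*rows)

    # One short statement to parse; the bulk data travels as two arrays.
    cur.execute(
        """
        INSERT INTO mytable (id, name)
        SELECT * FROM unnest(%s::int[], %s::text[])
        ON CONFLICT (id) DO UPDATE SET name = EXCLUDED.name
        """,
        (list(ids), list(names)),
    )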

~~~
anarazel
And relevant to the round-trip issue at hand: you can send bind/execute
messages in a pipelined manner.

I'd benchmarked batch upserting at some point, and at that time, for large
amounts of data, the fastest approach was somewhat unintuitive: a separate
view with an INSTEAD OF trigger doing the upserting. That allows the use of
COPY-based streaming (less traffic than separate bind/exec messages, less
dispatch overhead) while still allowing upsert. Not a great solution, but ...
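
A sketch of the shape of it (all names are hypothetical; COPY into a view
with an INSTEAD OF INSERT trigger, and the EXECUTE FUNCTION syntax, need
PostgreSQL 11 or later):

    import io
    import psycopg2

    conn = psycopg2.connect("dbname=test")  # placeholder DSN
    cur = conn.cursor()

    cur.execute("""
        CREATE TABLE target (id int PRIMARY KEY, val text);
        CREATE VIEW target_upsert AS SELECT * FROM target;

        CREATE FUNCTION target_upsert_fn() RETURNS trigger AS $$
        BEGIN
            INSERT INTO target (id, val) VALUES (NEW.id, NEW.val)
            ON CONFLICT (id) DO UPDATE SET val = EXCLUDED.val;
            RETURN NEW;
        END;
        $$ LANGUAGE plpgsql;

        CREATE TRIGGER target_upsert_trg
            INSTEAD OF INSERT ON target_upsert
            FOR EACH ROW EXECUTE FUNCTION target_upsert_fn();
    """)

    # COPY streams through the view; the trigger upserts each row.
    cur.copy_expert("COPY target_upsert (id, val) FROM STDIN",
                    io.StringIO("1\tfoo\n2\tbar\n"))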

