Pg8000 – Pure-Python PostgreSQL driver (github.com/tlocke)
86 points by gjvc on May 14, 2021 | 37 comments



Beware that for strict compliance with Python's DB-API 2.0 the database driver needs to obey weird rules about transactions:

* All statements start a transaction automatically if one isn't already open.

* You need to call .commit() or .rollback() on the connection object to do the commit or rollback.

* The driver can optionally support an "autocommit mode" where, instead of the above rules, every individual statement is wrapped in its own transaction and committed immediately.
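
In code, those default rules look like this (a sketch using the stdlib sqlite3 module, which mostly follows them; see the caveat below):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (x INTEGER)")

    # The INSERT implicitly opens a transaction; nothing is durable yet.
    conn.execute("INSERT INTO t VALUES (1)")
    conn.commit()    # now it is

    # A statement can instead be undone, again via the connection object.
    conn.execute("INSERT INTO t VALUES (2)")
    conn.rollback()  # the second insert is discarded

    print(conn.execute("SELECT x FROM t").fetchall())  # [(1,)]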

That's the theory. Many database drivers, including the built-in SQLite driver, differ in practice (for example, sqlite3 doesn't automatically start a transaction for a SELECT statement).

The biggest common practical difference is an implementation detail of so-called "autocommit mode". Often that is NOT implemented by actually issuing a "BEGIN TRANSACTION" before and "COMMIT" after every statement. Instead, it often just issues the statement without any automatic transaction statements whatsoever, relying on the fact that the underlying database engine will immediately commit any statements outside an explicit transaction (i.e. has its own autocommit mode).

That means if you do want complete manual control of transactions in a Python DB-API driver (including sqlite), then - confusingly - you need to set the autocommit option to true!
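
Concretely, with sqlite3 you get that manual control by passing isolation_level=None (its spelling of the autocommit option) and then issuing BEGIN/COMMIT yourself; a sketch:

    import sqlite3

    # isolation_level=None turns off the driver's implicit BEGINs;
    # statements now run in SQLite's own autocommit mode.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("CREATE TABLE accounts (id INTEGER, balance INTEGER)")
    conn.execute("INSERT INTO accounts VALUES (1, 100), (2, 0)")  # committed immediately

    # Transaction boundaries are now exactly where you put them.
    conn.execute("BEGIN")
    conn.execute("UPDATE accounts SET balance = balance - 10 WHERE id = 1")
    conn.execute("UPDATE accounts SET balance = balance + 10 WHERE id = 2")
    conn.execute("COMMIT")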

It's not clear to me whether/how this applies to the pg8000 driver in the OP. There is a section on autocommit but it doesn't explicitly say what it does.


I maintain pg8000, and it has two APIs: the standard DB-API 2 accessed with pg8000.dbapi and its own native API at pg8000.native.

The native API is a lot simpler, and so I'd use that if you can. The native API is really my suggestion for the next version of DB-API, so DB-API 3. I'd be interested to know people's thoughts on this, and maybe we could put together a draft specification?
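
For anyone who hasn't tried it, here's a quick sketch of the native API (the connection details are placeholders):

    import pg8000.native

    con = pg8000.native.Connection("postgres", password="secret", database="example")

    con.run("CREATE TEMPORARY TABLE book (id SERIAL, title TEXT)")

    # Named parameters use :name; each run() executes straight away.
    con.run("INSERT INTO book (title) VALUES (:title)", title="Ender's Game")

    for row in con.run("SELECT id, title FROM book"):
        print(row)

    con.close()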


Personally I think this was a huge mistake, since almost all of the rest of the world uses autocommit by default.


The existence of the autocommit feature in DB drivers is something I never really understood: why would an application developer want to relinquish direct control of one of the most powerful features of a relational DB?

Unless one's interactions with the database are really basic (i.e. a single SELECT statement), one would want to be as explicit as possible about where the transaction boundaries are.

Having a default that hides this aspect from my code always seemed a false convenience to me.

edit: this is just a tangential reflection inspired by your comment, not an answer, nor did I want to detract from your factual statement.


Scaling for one. If 95% of your workload is non-transactional reads, why introduce transactions?

If your writes are single table and don't require transactional safety, why introduce transactions?

If you need complex, multiple-insert or update writes, transactions are great, but that can also be accomplished in a stored proc.

In a high-velocity environment, avoiding transactions is a great way to get more throughput out of your database server with lower latencies.

Combine Postgres with pgBouncer in statement mode and you can scale it pretty dramatically.


I would instead look at it as: turning autocommit "off" is really turning "automatic transaction begin" "on", which is a strange semantic. I could appreciate a mechanism that insisted "transactions must always be explicit", but that isn't what autocommit controls, on or off. (You also wouldn't want such a mechanism unless you were assured you could put the transaction boundaries inside of multi-statement batches, as you really, really want to minimize round trips there; otherwise you can end up, even on the same machine, drastically harming your lock contention by having transactions that last thousands of times longer than the single statement they wrapped... at which point the API is going to get awkward trying to enforce the transaction boundaries. Autocommit is essentially just what you should want as the solution to all of this.)


I think this illustrates my point ... but I'm not 100% sure what you mean by "autocommit mode" in your comment! You said "autocommit feature in DB drivers" but I think you mean in the DB itself by the sounds of it.

AFAIK autocommit mode just applies if you don't explicitly open a transaction, so what would you propose instead? That a transaction is silently opened but not committed (potentially leaving a dangling transaction open for a long time after SELECT statement) (like Python's odd default behaviour)? Or an immediate error?

I can certainly see that if you are indeed wrapping every individual statement in its own BEGIN/COMMIT pair, you'd wonder why the database can't help you for such simple use cases. You can still selectively open an explicit transaction on those statements that do need to be grouped together, so I don't see that you've given up any control.


There are a lot of people who use databases without really understanding anything about them. They mostly won't understand why it is important to control time and execute multiple statements at the same logical moment in the database world. That is a difficult concept to intuit naturally; people need to be told, and they often don't have time or a teller.

Databases that only do the technically-correct thing are weeded out because they lose the big supportive cloud of novices that experts grow out of. So the good databases support both the intuitive way and the correct way of manipulating data.

There are more computing concepts to learn than there are hours to learn them in. People use systems they don't understand, and systems need to support that.


Up until 10ish years ago, the most popular SQL DB didn't support transactions in the default setup. So lots of apps got written without relying on having transactions, either for compatibility reasons or because the devs grew up on said nontransactional DB. So autocommit was an easy way to emulate transactionless SQL on the rest of the DBs.


I wouldn't blame MySQL for that. JDBC's default mode since 1.0 is autocommit set to true, and surely predates MySQL's success.


In practice autocommit works very well. In many workloads most transactions consist of just one query. Additionally, with autocommit it is harder to accidentally create a long-running transaction, which is a good thing, since long-running transactions can really tank your performance.


Delighted to see this. Installation of psycopg2 can be a nuisance thanks to the native compilation steps which don't always work as planned and rely on specific versions of libssl.


After suffering with this myself, I started using the psycopg2-binary package exclusively: https://pypi.org/project/psycopg2-binary/

Haven't had any issues since!


Good to hear, but beware what it says in the last paragraph here: https://pypi.org/project/psycopg2/ ...

"The binary package is a practical choice for development and testing but in production it is advised to use the package built from sources."

...or use one from your operating system distribution.


To all those wondering _why_ that is the recommendation, I did some digging and the answer is "the binary package is compiled against one version of libssl, and if your system happens to use a different one, in some weird edge cases it will lead to a segfault"


Odd then that Digital Ocean's App Platform for Django deployment requires the binary version in your requirements file.


Exactly my motivation for ferreting it out, and it's worked great for the simple scenarios. The only thing I'm missing is a .mogrify() function.

This guy has put one together though; might see if we can get it migrated into the original version.

https://github.com/realazthat/aiopg8000/blob/master/aiopg800...
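
In the meantime, a rough stand-in can be hacked together; this naive sketch handles only a few types and is for debugging output, never for building SQL you actually execute:

    from datetime import date

    def mogrify(sql, params):
        # Render :name parameters into the SQL string, for logging only.
        # Naive: doesn't understand dollar quoting, comments, or casts.
        def literal(v):
            if v is None:
                return "NULL"
            if isinstance(v, (int, float)):
                return str(v)
            if isinstance(v, date):
                return "'%s'" % v.isoformat()
            return "'%s'" % str(v).replace("'", "''")
        for name, value in params.items():
            sql = sql.replace(":" + name, literal(value))
        return sql

    print(mogrify("INSERT INTO book (title, published) VALUES (:title, :pub)",
                  {"title": "Ender's Game", "pub": date(1985, 1, 15)}))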


You'll want to be careful in "convert_paramstyle" as it does not appear to take python's dollar literals into account: https://www.postgresql.org/docs/13/sql-syntax-lexical.html#i...
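
For example, the :id inside the dollar-quoted body here is part of a string literal, not a placeholder, and has to survive paramstyle conversion untouched:

    # Everything between $body$ ... $body$ is one string literal to
    # PostgreSQL, so the :id below must NOT be treated as a parameter.
    sql = """
    CREATE FUNCTION note() RETURNS text AS $body$
      SELECT 'look up :id in the docs'
    $body$ LANGUAGE sql
    """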


> python's dollar literals

Did you mean postgresql's dollar literals?


Yes, sorry about that, and of course the edit window has expired :-( Now I'm even more glad I looked up the URL, or folks might have been even more confused.


Question: what advantages does a pure-Python driver give you over the competition? Is it multithreading support? (psycopg2 does not support sharing cursors across threads)


Packaging is simpler. psycopg2-binary is actually not recommended for production use, so you have to compile it yourself.


Thanks for the pointer, I wasn't aware of this! I was wondering what the problem is, because I've used psycopg2-binary in production without problems for ages. It seems the danger of psycopg2-binary is that one can end up with segfaults, because the bundled libpq (and hence psycopg2) might use a different libssl than the system one (used by Python, e.g. by urllib and requests): https://www.psycopg.org/articles/2018/02/08/psycopg-274-rele...


I've never heard of that; why would it not be?


I don't know, that's what the developers say (last paragraph): https://pypi.org/project/psycopg2/


Personal opinion, not really inherited from any authoritative wisdom: pure Python is useful because it:

- Simplifies and improves the performance of deployment: it becomes a straightforward case of copying code files

- Largely achieves the same stated goal of Java: write once, run anywhere

- Allows for easy code inspection and debugging in production (it's debatable whether that's a good thing; I think it is, overall)


Agreed. It also unlocks another level of optimisation, the human one: I'm much better at reasoning my way through code improvements when I don't have to switch language/runtime models in the middle of the code :-)


It is also easy to bundle into an executable with PyInstaller.


It is useful if you want a speed boost by running under PyPy.

Take all benchmarking with a pinch of salt, but see some stats here: https://suade.org/dev/12-requests-per-second-with-python/#th...


I once sped up a program by switching from the usual natively-wrapped MySQL drivers to a pure-Python one.

The pure-Python MySQL drivers were marginally slower under CPython.

The native drivers worked fine with PyPy, but PyPy didn't give any real performance gains, as the program spent all its time in the native code.

Then when testing the pure-Python drivers we found PyPy made the whole program massively faster.


That’s surprising if your app spent most of the time in database calls, since 1) native code is still faster than pypy and 2) most of the time spent in database libraries is spent waiting for the database to return.


More details of my use case:

The programs were ETLs that generate big REPLACE INTO statements.

Different databases and drivers do prepared statements in different ways. Postgres, for example, can receive the statement and the parameters and do the interpolation on the server. The params still need to be serialised by the client and the template isn't reused, so it's no big perf win.

MySQL, on the other hand, needs the driver to do the interpolation. The server only knows how to execute sql strings.

It turns out it is far better to prepare the SQL in pure Python, where PyPy knows all about the types etc., than to have the C MySQL driver do it.
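
To make that concrete, here's a simplified sketch (not the actual code) of the kind of client-side work involved; it's pure string manipulation, which is exactly what PyPy's JIT is good at:

    def escape(v):
        # Toy MySQL-style literal escaping, for illustration only.
        if v is None:
            return "NULL"
        if isinstance(v, (int, float)):
            return str(v)
        return "'%s'" % str(v).replace("\\", "\\\\").replace("'", "\\'")

    def build_replace(table, columns, rows):
        # One big multi-row REPLACE INTO: the interpolation below is the
        # pure-Python work that dominates the profile.
        values = ",".join("(%s)" % ",".join(escape(v) for v in row)
                          for row in rows)
        return "REPLACE INTO %s (%s) VALUES %s" % (
            table, ",".join(columns), values)

    sql = build_replace("users", ["id", "name"],
                        [(i, "name-%d" % i) for i in range(10000)])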

Exactly the same thing plays out in Java. The MySQL Java drivers are staggeringly inefficient at large prepared statements; in my profiling, massive amounts of time were spent in Java functions in the MySQL driver that were trying to work out where the next question mark is. Sadly the drivers are the poo they are. IIRC the MariaDB Java drivers are rather more optimized.


> That’s surprising if your app spent most of the time in database calls, since 1) native code is still faster than pypy and 2) most of the time spent in database libraries is spent waiting for the database to return

I would argue a speedup (though maybe not a “massive” one) from pure python over a native extension is not that surprising, specifically because of (2), which makes it possible for the runtime-extension interface overhead to be more significant than (1). Native extensions tend to give a speed boost for computationally intensive tasks, but can’t really wait any faster, and aren’t free of overhead.


Interesting. What sort of optimizations does PyPy do? Could it be that pure python unlocked more options in that department due to not needing to preserve the native bindings?


Wider compatibility than just CPython, potentially at least.


I just managed to get it working in an iPad's iVim Python environment, which is pretty crippled (no compiler etc).


The title was editorialised from pg8000 to Pg8000; the GitHub project clearly shows the former is correct.



