Hacker News new | past | comments | ask | show | jobs | submit login

First, thank you very much for the clarifications and corrections. I wasn't on the engineering team that fixed the problem, and I'm not a Python guy; I just found the problem and explained the consequences of what was happening to the engineers. It seems I was mistaken in the particulars, for which I do apologize, both to you and everyone who's ever contributed to SQLAlchemy, and to anyone who was misinformed by my previous comment.

Nonetheless, there is a problem. It's not open connections that interfere with vacuuming; it's open transactions. While a transaction is open, vacuuming can't reclaim dead tuples, if those tuples were live at the start of the transaction. Those old row versions are still "visible" in the context of that transaction, and can't be reclaimed — and nor can the additional dead tuples created from further updates on those rows, because there is an open transaction that's older than they are.

Simply, if you connect to the db, open a transaction, and leave it open, then every subsequent update to every row in every table in that database will leave a dead tuple that can't be reclaimed until that transaction commits or is rolled back, because at the bottom of each chain of dead tuples, there's one that is still visible in the context of that open transaction.

That's the problem.

Net, in my particular situation, and a couple of others I've encountered first- and second-hand, enabling auto-commit (and only explicitly declaring transactions when multi-statement transactional semantics were specifically needed, and which were promptly committed upon completion) fixed the problem. A 2000 row table with ~1600 byte rows took up a little over 400 (8KB) disk pages, and the autovacuum daemon was able to keep it at that size. Compare that to before, without autocommit: it would be that size after a VACUUM FULL, and weigh gibibytes within days, no matter how many autovacuum workers there were, how aggressively they were tuned, or how often you ran (regular) VACUUM manually. Those dead tuples were not going away until the abuse of transactional semantics that caused them to accumulate was addressed.

If there's anyone to "blame" here, it's the engineers who didn't understand the implications of what they were doing. (To be fair, though, they were also using the database as a work queue, so it's probably reasonable to suggest they liked doing things sub-optimally...) The high volume environments you allude to are probably going about things with more cognizance of the implications of implicitly transactional semantics. They aren't proof that there isn't a problem; they're proof that you can do this sort of thing without shooting yourself in the foot, if you know what you're doing.

Look, it's not that the DBAPI being implicitly transactional is a "bad" thing. It's not. Transactional semantics are awesome, and as someone who gets paid for keeping peoples databases (particularly PostgreSQL) happy, I'm emphatically for them. It just has consequences, particularly in the context of an MVCC-based RDBMS, and doubly so if your application is architected such that it leaves transactions, implicit or otherwise, open for extended periods.




> if your application is architected such that it leaves transactions, implicit or otherwise, open for extended periods.

Can you give any good reason why you need to leave transactions open for extended periods? In my experience, it only happens when the developer in question does not understand the semantics of the DBMS and therefore hasn't suitably designed their solution, in which case it's hardly the DBMS's fault when things go wrong.

I've rewritten plenty of code that "needed" a transaction to be open for a long time into code requiring a transaction for a few milliseconds (e.g. in a bulk import, parsing, caching lookups and reference data outside the transaction, building an XML document to supply into a SP as a parameter which only used a transaction for a single INSERT from the XML).


Can you give any good reason why you need to leave transactions open for extended periods?

Offhand, other than xid-level consistency for a backup as mentioned by the sibling post, no. [1] No, this case was pretty clearly the engineers not understanding what they were doing, as evidenced by it being easier to turn off autocommit, and add explicit BEGIN statements where they were needed, than add explicit ROLLBACKs everywhere they were needed.

That makes an implicitly transactional adapter seem a bit of a foot-gun to me, though. At a minimum, I think it places an unnecessary burden on engineers to have to "roll back" every time they even ask the db something. That's how memory leaks happen, too, and that's why we generally think garbage-collected languages are a Good Idea.

[1] That said, my preferred backup strategy is to take a filesystem-level snapshot of the db volumes, mount that, and start a second Postgres instance against it. It'll "recover" from an "unclean shutdown", and then you can take your backup without incurring the penalty of holding open a transaction that long in the presence of concurrent writes.

You do pay some write-performance penalty for the snapshot while it's open, but upon releasing it, the disk pages are immediately gc-ed, and you're already taking your backup at a low-traffic point anyway, right? Disk pages in a Postgres cluster allocated because vacuuming was hindered by holding an xid for extended periods are merely autovacuumed, however; they're reclaimed for later use, not released.


I'm not familiar with PG but I am planning on using it in my current project, which is why I'm interested in understanding whether these are really DBMS issues.

In the .NET world, for all the database adapters I've used (MSSQL, SQLCE, Oracle, MDB), individual statements are implicitly wrapped in transactions when no explicit transaction has been specified - not sessions/connections.

I agree the "implicit transaction per session/connection" design is confusing and likely to result in bugs. While the standard .NET approach limits bugs, the downside is poorer performance when an unaware developer uses numerous implicit transactions for multiple statements where a single explicit transaction would suffice.

Experts in the area appear to recommend always using explicitly managing transactions (e.g. http://ayende.com/blog/3775/nh-prof-alerts-use-of-implicit-t... ), and this is the approach I follow in my own work now.


Can you point to any open source projects or other examples implementing this pattern? I think I understand where you're coming from on this but I have not seen many examples of this approach in the wild (except religiously using TransactionScope, eg. http://amnesia.codeplex.com/ ).

I'm wondering where it's best to manage the transactions... would they ever be in raw SQL or stored procedures instead of .NET code?


The PostgreSQL official C client library, libpq, (which most language specific wrappers are based on) works the same way as you say the .NET adapters do. Every statement is by default implicitly wrapped in a transaction until you run "BEGIN;".


>In the .NET world, for all the database adapters I've used (MSSQL, SQLCE, Oracle, MDB), individual statements are implicitly wrapped in transactions when no explicit transaction has been specified - not sessions/connections.

That is how the postgresql API works as well. Most high level language modules built on top of it also work that way. Unfortunately python's DB API is not one of them, but you can just set a config option to "act like everyone expects".


The only reason I can see is for taking consistent database backups. But then your transaction should also use a readonly snapshot (pg_dump will do this for you).




Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact

Search: