I read a lot of Postgres code to get this finished, and I'm happy to say that for a codebase with so many authors, the quality of Postgres's code is very high. Like any complex program it can be a little hard to trace through, but most of the naming is pretty self-explanatory, and it comes with some really amazing comments that walk you through the high-level design. I'd definitely recommend taking a look to anyone who's curious.
AFAIK there are also optimizations that allow queries to avoid having to look up the individual heap tuples for visibility information by marking the entire page on the index as visible (useful for index-only scans). Yet other optimizations exist to not require updating indexes when updating existing rows where none of the indexed columns change (HOT).
Maybe somebody with a more in-depth understanding can comment, but hopefully the above is somewhat correct :).
There is one small inaccuracy in your summary, which is that a structure called the visibility map summarizes whether a heap page is all-visible; that's how index-only scans work. At no point do we mark an index page as all-visible, because a HOT update would invalidate that.
That being said, I don't understand why HOT would interfere with a hypothetical mechanism for marking index pages as all-visible. If a tuple gets a HOT update, I think the index values should remain unchanged? And if they do, shouldn't their visibility on an all-visible index page remain unchanged as well?
Unrelated: Thank you so much for your work on "UPSERT" :)
A much simpler way of putting it would be that deletes would have to touch indexes if summarized visibility info was stored in indexes, even in cases where there is hardly any overall benefit from putting per-page visibility information in index pages. It's not impossible to make something like this work, but it's not obviously compatible with the current design.
I'm glad that you found UPSERT useful.
Correctness requires that any insert into an index generate a rw-conflict with a concurrent serializable transaction if, after that insert, re-execution of any index scan of the other transaction would access the heap for a row not accessed during the previous execution.
These diagrams are really beautiful, I love the subtle separation in the lines showing that they're made of hyphens. Is there some software that makes these easy to generate?
I'm excited to use this tool.
Thank you for all the effort you put into writing this. It is clearly the product of a lot of effort and craft.
We were looking through it earlier to see why our nested loop nodes were performing badly — which ultimately made us realise we were just misreading the EXPLAIN output :-)
> We could ensure that any process takes out an exclusive lock
> on a file before reading or writing it, or we could push all
> operations through a single flow control point so that they
> only run one at a time. Not only are these workarounds slow,
> but they won’t scale up to allow us to make our database fully
> ACID-compliant.
> SQLite allows multiple processes to have the database file
> open at once, and for multiple processes to read the database
> at once. When any process wants to write, it must lock the
> entire database file for the duration of its update. But that
> normally only takes a few milliseconds. …
> If your application has a need for a lot of concurrency, then
> you should consider using a client/server database. But
> experience suggests that most applications need much less
> concurrency than their designers imagine.
So it's not really all that different from PostgreSQL in that regard, and the author's point remains valid :).
You might also enjoy this talk by the main author of SQLite: https://www.youtube.com/watch?v=ZvmMzI0X7fE (SQLite: Protégé of PostgreSQL)
First, it talks about concurrency. It says that a naive solution to concurrency is file locking. It says this is slow. So I first trot out SQLite as an example that is not slow.
It says, "Not only are these workarounds slow, but they won't scale up to allow us to make our database fully ACID-compliant." To me it sounded like he said choosing file locking for concurrency will paint you into a corner: that you won't ever become ACID, regardless of what you tack on, like a journal.
One other clarification about concurrency: SQLite says it's fast enough for most websites. Then it turns around and says that if you need a lot of concurrency, you should go with another database. The contradiction evaporated when it dawned on me that when database makers say "concurrency," they're talking about many users at a time connecting to a database directly: picture a bunch of SQL gurus connecting directly to an SQLite file and issuing raw SQL statements. Or replace the gurus with several "application" servers that all write to a database on yet another server. For that, SQLite would be bad, we all agree, including SQLite.
But that's not what most of us, the 99% of app developers, are thinking of when we hear "concurrency." We're thinking of several people hitting our website. In that case, SQLite would still hold up (https://sqlite.org/whentouse.html, see Server-side database).
Anyway, again, I'm not against Postgres. It's what I use. I'm not talking about SQLite because I'm invested in it but because I admire it. It's just an example of a simple solution that meets many people's needs, without resorting to more complexity.
It doesn’t. At all.
The quassel project supports PostgreSQL and SQLite as backends.
And, as expected when you use a database as log storage for an IRC bouncer, you end up with many concurrent threads writing and reading to the same table.
The result is that if a user is reading a table right now, no one can write to it - and that, in turn, means that the thread receiving messages (and responding to PINGs, which is required to keep the connection to the IRC server alive!) gets blocked, and has to wait on the threads currently requesting messages.
So if you connect and set your client to request, on connection to the bouncer, the last 500 messages for each channel, you might actually time out. Yes. When you have enough data, connecting to the bouncer can kill the bouncer's other connections.
If you have more than one thread doing writes or reads – in a usual web application you’ll have dozens or hundreds of worker threads reading and writing concurrently – then SQLite just isn’t useful.
Disclaimer: Not speaking for the quassel project here. This comment is just my opinion.
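For what it's worth, the reader-blocks-writer behavior described above is a property of SQLite's default rollback-journal mode; in WAL mode a writer is not blocked by open readers, who instead keep reading their own snapshot. A small sketch with Python's stdlib sqlite3 (file and table names invented for illustration):

```python
import sqlite3, tempfile, os

path = os.path.join(tempfile.mkdtemp(), "wal.db")

w = sqlite3.connect(path)
mode = w.execute("PRAGMA journal_mode=WAL").fetchone()[0]  # switch to WAL
w.execute("CREATE TABLE backlog (msg TEXT)")
w.commit()

# Autocommit connection; we manage the read transaction by hand.
r = sqlite3.connect(path, isolation_level=None)
r.execute("BEGIN")
held = r.execute("SELECT COUNT(*) FROM backlog").fetchall()[0][0]

# In rollback-journal mode this commit could block behind the open
# reader; in WAL mode the writer proceeds immediately.
w.execute("INSERT INTO backlog VALUES ('PING')")
w.commit()

still = r.execute("SELECT COUNT(*) FROM backlog").fetchall()[0][0]  # snapshot
r.execute("COMMIT")
fresh = r.execute("SELECT COUNT(*) FROM backlog").fetchall()[0][0]  # new data
```

Whether WAL mode would have been enough for quassel's workload is another question, but it removes the specific "a reader stalls the PING-handling writer" failure mode.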
> if a user is reading a table right now,
> no one can write to it
> > 99% [...] SQLite would still hold up
> an IRC bouncer
The issue with those is still that with too high a write load, SQLite becomes a bad fit. It's perfect for low-write, high-read situations, as many websites are, but for more dynamic applications it seems to be less usable.
Also, the FAQ of SQLite still lists
> SQLite supports an unlimited number of simultaneous readers, but it will only allow one writer at any instant in time. For many situations, this is not a problem. Writers queue up. Each application does its database work quickly and moves on, and no lock lasts for more than a few dozen milliseconds. But there are some applications that require more concurrency, and those applications may need to seek a different solution.
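The "writers queue up" part is visible from Python's stdlib sqlite3: the `timeout` parameter controls how long a writer waits for the lock before giving up. A sketch (names invented) where the second writer opts out of queueing, fails, rolls back, and retries after the first writer commits:

```python
import sqlite3, tempfile, os

path = os.path.join(tempfile.mkdtemp(), "queue.db")

a = sqlite3.connect(path)  # default: waits up to 5 s for the lock
a.execute("CREATE TABLE jobs (id INTEGER)")
a.commit()

b = sqlite3.connect(path, timeout=0.0)  # fail immediately instead of queueing

a.execute("INSERT INTO jobs VALUES (1)")  # a now holds the write lock

try:
    b.execute("INSERT INTO jobs VALUES (2)")
    blocked = False
except sqlite3.OperationalError:  # "database is locked"
    blocked = True
b.rollback()  # release b's locks before retrying

a.commit()  # a's lock is released here

b.execute("INSERT INTO jobs VALUES (2)")  # retry succeeds
b.commit()

total = a.execute("SELECT COUNT(*) FROM jobs").fetchall()[0][0]
```

With a nonzero `timeout` the second writer would simply have waited those few milliseconds instead of erroring out, which is the queueing the FAQ describes.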
You're correct that there are some mostly-read-only situations where a file-lock isn't a problem (perhaps a small CMS, blog or similar).
But using coarse-grained locking often does paint you into a corner when it comes to concurrency. Just look at MongoDB (older versions had a global write lock), Python (GIL), etc. The sibling post of this comment gives a real-world SQLite example. Anyway, you're right that coarse-grained locking has nothing to do with atomicity itself. I may have misread your first comment; in retrospect you may have meant to say exactly that.
> in retrospect you may have meant to say exactly that
> there are some mostly-read-only situations where a file-lock
> isn't a problem (perhaps a small CMS, blog or similar)
> The sibling post of this comment gives a real-world SQLite example
Although if I had $100 for every time I've had DB corruption in Postgres over the years...
That being said, since 9.4 (or maybe 9.5) these incidents have mostly stopped happening and it's been remarkably stable.
After a few tough lessons like that, we've put monitoring in place to verify that the standby and/or PITR servers are actually usable, e.g. that PITR servers have uninterrupted WAL sequences.
Most Postgres corruption (on a master) actually happens due to dodgy hardware, but we don't really use bad hardware. In cases where we used VMs, we've suspected something fishy in VMware itself. In one case we suspected the SAN that the VMware host was using.
So mostly not Postgres' fault. There were however some known replication bugs in the 9.x versions that were quite nasty. I think those bugs have now all or mostly been fixed, so 9.6 and 10 (beta) onwards are extremely stable.
But the risk of hardware issues still remains the biggest cause of Postgres corruption.
I cannot emphasize enough how important backups are (both standby servers and PITR) and also to verify that these are indeed valid for that day when you're going to need to use them.
You'd be broke. :)
- PG only added checksums in 9.3
- his disks or RAID controller could be shit or he could be using some 'very cool' filesystem
- Some people turn off PG settings for performance that really shouldn't be turned off
- Some people think ECC ram isn't worth the money
- or overclock their CPU on anything other than a gaming machine
Usually it's shit hardware
There are a few online variants, e.g. http://asciiflow.com/ and https://textik.com/
For extra credit use these fonts: https://int10h.org/oldschool-pc-fonts/fontlist/