
PostgreSQL's Imperfections - chmaynard
https://medium.com/@rbranson/10-things-i-hate-about-postgresql-20dbab8c2791
======
topspin
If I could have one thing on that list fixed, it would be #9 - no planner
hints.

I used Oracle (6 through 11) for both bespoke applications and to back large
third party systems. I never saw widespread abuse of hints. Yet they were
immensely helpful during development and troubleshooting. I put perhaps two
queries into production with hints over 10+ years. No one ever had a reason to
complain about either.

There are no perfect query planners. There is no perfect analysis. The notion
that one must submit entirely to the mercy of PostgreSQL's no-hints dogma
causes me to harbor some resentment. Fortunately you can frequently abuse CTEs
to achieve a desired access pattern because (until recently) CTEs were an
"optimization fence." But I've also resorted to creating functions and other
hacks.
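
A rough sketch of what I mean, with hypothetical tables (on 12+ you now have
to spell out MATERIALIZED to keep the fence; older versions fence implicitly):

    -- Hide the selective filter in a CTE so it is evaluated first,
    -- regardless of what the planner would otherwise choose.
    WITH recent AS MATERIALIZED (
        SELECT * FROM orders
        WHERE created_at >= now() - interval '1 day'
    )
    SELECT c.name, r.total
    FROM recent r
    JOIN customers c ON c.id = r.customer_id;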

So I conclude the policy is simply wrong-headed; there is no legitimate reason
to fear hint abuse, and the premise that hints aren't necessary is false.

~~~
hibikir
While I won't say that the Postgres idiosyncrasies are correct here, I can
tell you that your experience is not necessarily typical. In my career, hint
abuse has always been rampant. From telcos to biotech companies, a high
percentage of the complex queries I had to interact with had hints in them. In
one extremely egregious case, the company decided to purchase an Exadata
server, and since its performance characteristics had little to do with the
previous servers', the vast majority of hinted queries were very wrong, leading
to months of developers rebuilding every query despite there being no schema
changes.

So yes, enough people in the industry have been burned by hints to make the
Postgres perspective understandable. At the same time, this doesn't make it
good: Postgres' default settings over the years have led to specific kinds of
tables requiring extra love and care to make the query planner not do silly
things. The classic failure case is a transaction table with an
always-increasing timestamp, where the vast majority of queries only care
about today: the traditional fix was to convince Postgres that yes, this table
needs very frequent stats recalculation, so as to make it learn that there are
more than 3 rows today and that nested loops will not do. Whether the Postgres
quirks are better or worse than hint hell, I am still not sure: a competent
engineer can handle things either way.
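
A rough sketch of that "extra love and care", assuming a hypothetical
transactions table: lower the per-table analyze thresholds so the stats catch
up with today's rows quickly.

    -- Hypothetical table; the default analyzes after ~10% of rows change,
    -- which is far too lazy for an ever-growing table queried on "today".
    ALTER TABLE transactions SET (
        autovacuum_analyze_scale_factor = 0.01,
        autovacuum_analyze_threshold    = 1000
    );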

~~~
viraptor
> In one extremely egregious case, the company decided to purchase an Exadata
> server, and since its performance characteristics had little to do with the
> previous servers

I don't get this point. If the hinted queries changed performance for the
worse, why wouldn't you expect unhinted queries to also change for the worse
after migration? After all, the hints were there to overcome such issues
already.

It sounds like the lesson should be "with a large enough system, plan for
extended time for query rewriting if you plan to replace your db engine",
rather than anything about hints themselves.

~~~
ric2b
> If the hinted queries changed performance for the worse, why wouldn't you
> expect unhinted queries to also change for the worse after migration?

The hints can introduce wrong assumptions to the planner, which can lead to
worse performance than if those wrong assumptions weren't there.

~~~
viraptor
Same with lack of hints. We're not talking about simple queries which can be
expected to just work. These were previously found to be badly optimised by
the first engine, so we're in unknown territory either way.

------
jeffdavis
I largely agree, but there are a couple of points I'd like to challenge:

Is hardware corruption really happening and making it into the WAL stream with
checksums on?

The next point, on planner hints: it's really just something that hasn't been
done. If a few engineers made plans to tackle the problem, a lot could be done
in a couple of releases' worth of work. In the meantime, people are getting by
with various half-measures anyway, such as extensions[1], planner tunables[2],
and statistics tweaks[3].
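
For example, the tunables route [2] is just session-level settings; a rough
sketch, with hypothetical table names:

    -- Discourage nested-loop joins for this session only, inspect the
    -- plan, then put the setting back.
    SET enable_nestloop = off;
    EXPLAIN ANALYZE
        SELECT o.id, c.name
        FROM orders o
        JOIN customers c ON c.id = o.customer_id;
    RESET enable_nestloop;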

The only dogma is that a half-baked solution isn't wanted. It's got a lot of
architectural impact and long-term supportability implications. And there are a
lot of different use cases to consider, which may drive different technological
solutions. "Make plans stable/manageable" is a different use case than "I know
something the planner doesn't", which is different from "Make the planner do
this thing because I said so".

[1] https://github.com/ossc-db/pg_hint_plan/blob/master/doc/pg_hint_plan.html

[2] https://www.postgresql.org/docs/current/runtime-config-query.html

[3] https://www.postgresql.org/docs/12/sql-createstatistics.html

~~~
DiabloD3
Hardware corruption can happen anywhere to anything.

If you care about your data: use ECC, use a checksumming filesystem like ZFS,
and _also_, on top of all this, export your WALs to a second machine.

~~~
jeffdavis
Well, I'm asking specifically if the author saw corruption that snuck past the
checksumming features that postgres has.

Extra layers are always good, but since I was one of the main authors of
checksums in Postgres, I'd like to know if there's room for improvement.
(Aside: the page checksum is only 16 bits, so if you have frequent corruption
it's entirely believable that a few sneak past. But I haven't seen it
personally.)

------
karyon
There's also one inherent data privacy problem in MVCC that I've been running
into.

Suppose you have an app that lets people anonymously vote or comment on stuff,
but only once. The vote in the DB must not have any connection to the person.
So, you give the person a flag indicating whether or not they voted already,
and store the vote separately.

Now, you'd want to set both values in the same transaction for obvious
reasons. But, since Postgres uses MVCC, the two tuples that are added to the
database both contain the same transaction ID (XID), so there's the connection
between user and vote again.

There seems to be no way to instruct postgres to "clean" those XIDs in any
way. What we're doing now is periodically and manually updating every tuple in
each affected table with dummy changes, essentially duplicating all tuples
with all new XIDs, and then running VACUUM to delete the tuples with the old,
potentially-deanonymizing XIDs. We haven't found anything easier...
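
Roughly, for a hypothetical votes table (the flags table gets the same
treatment):

    -- Dummy update: every tuple gets rewritten under a fresh XID...
    UPDATE votes SET choice = choice;
    -- ...then the old row versions carrying the correlatable XIDs go away.
    VACUUM votes;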

~~~
dragonwriter
> Now, you'd want to set both values in the same transaction for obvious
> reasons. But, since Postgres uses MVCC, the two tuples that are added to the
> database both contain the same transaction ID (XID), so there's the
> connection between user and vote again.

The simple solution is that at each change, you rewrite the whole election,
not just the new votes, and clear out all outdated tuples (basically, you make
the "periodic and manual" process you are currently doing automatic and
integrated with the "real" transactions rather than an additional side
process).

Alternatively, you don't do the changes in the same database transaction but
in the same business domain transaction which is managed outside the database,
and where any database artifacts related to the management of the business
transaction are deleted and vacuumed after the transaction is completed.

~~~
karyon
Re "same business transaction": That would probably still allow for
correlating the votes with the flags because they'll still be added with
_some_ XID, which are monotonically increasing. Of course, it would require
more effort and probably even guesswork, but still :)

Rewriting the election in the same transaction sounds smarter than the
periodic solution. I need to discuss that with the team, thanks!

------
Doctor_Fegg
> The on-disk binary format is incompatible across major versions

This is my major bugbear. If Postgres were able to upgrade its datastore on
the fly (optionally, of course) that would make a massive difference. Instead
I’ve had heart-in-mouth moments when Homebrew has decided that it wants to
upgrade Postgres. (Yes, I do now use brew pin, until I transition off Homebrew
for good.)

#2 for me is inefficient enum storage. Each value takes up 4 bytes. A single-
byte enum would vastly reduce my database size.

~~~
yardstick
Re upgrades, have you tried pg_upgrade for upgrades in place?

Re enums, we had a similar thing and simply went with a smallint column
instead of enum.
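
Something along these lines (hypothetical table; smallint is 2 bytes rather
than 1, but still half the enum's 4):

    CREATE TABLE orders (
        id     bigserial PRIMARY KEY,
        -- 0 = pending, 1 = paid, 2 = shipped
        status smallint NOT NULL CHECK (status IN (0, 1, 2))
    );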

~~~
yjftsjthsd-h
> Re upgrades, have you tried pg_upgrade for upgrades in place?

This only works when you have both versions available at the same time, which
seems likely to break when a package manager jumps major versions, and also
doesn't work when running postgres in Docker
(https://github.com/docker-library/postgres/issues/37).

------
mulander
> Every time an on-disk database page (4KB) needs to be modified by a write
> operation, even just a single byte, a copy of the entire page, edited with
> the requested changes, is written to the write-ahead log (WAL). Physical
> streaming replication leverages this existing WAL infrastructure as a log of
> changes it streams to replicas.

First, the PostgreSQL page size is 8KB and has been that since the beginning.

As for the rest: according to the PostgreSQL documentation[1] (on
full_page_writes, which decides whether those copies are made), a copy of the
entire page is only written to the WAL after the _first_ modification of that
page since the last checkpoint. Subsequent modifications will not result in
full page writes to the WAL. So if you update a counter 3 times in sequence
you won't get 3*8KB written to the WAL; instead you get a single full-page
image, and the remaining two updates only log the row-level change, which is
much smaller[2]. This is further reduced by WAL compression[3] (reducing the
segment usage) and by increasing the checkpointing interval, which reduces how
often those copies happen[4].

This irked me because it sounded like whatever you touch produces an 8KB copy
of data, and that seems not to be the case.
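
For what it's worth, both knobs [3][4] can be set from SQL; a rough sketch:

    ALTER SYSTEM SET wal_compression = on;         -- compress full-page images [3]
    ALTER SYSTEM SET checkpoint_timeout = '30min'; -- fewer checkpoints, fewer FPWs [4]
    SELECT pg_reload_conf();                       -- both apply without a restart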

[1] https://www.postgresql.org/docs/11/runtime-config-wal.html#GUC-FULL-PAGE-WRITES

[2] http://www.interdb.jp/pg/pgsql09.html

[3] https://www.postgresql.org/docs/11/runtime-config-wal.html#GUC-WAL-COMPRESSION

[4] https://www.postgresql.org/docs/11/runtime-config-wal.html#GUC-CHECKPOINT-TIMEOUT

~~~
anarazel
That is correct. And neither is a full page write logged if the page is
initialized from scratch. And even without WAL compression, the "hole" in the
middle of a page that is not full is "compressed" out.

That's not to say that FPWs are not a problem. The increase in WAL volume they
can cause can be seriously problematic.

One interesting thing is that they actually can often very significantly
increase streaming replication / crash recovery performance. When replaying
the incremental records, the page needs to be read from the OS/disk if it is
not in postgres' page cache. But with FPWs we can seed the page cache
contents with the page image. For the pretty common case where the number of
pages written between two checkpoints fits into the cache, that can be a very
serious performance advantage.

------
SigmundA
Great list. I would add one thing to it:

No query plan caching, not even for sprocs or functions.

Was surprised by this one. It looks like the optimizer is much simpler than
other DBs', so it usually takes less time to create the plan, but the overhead
is still there.

This is why you see the recommendation to use prepared statements, and why many
client libraries try to prepare automatically, but a prepared statement cannot
be shared between sessions, so it's only good if you're repeating the same
statement over and over on the same connection.

If your app calls the same statements over and over from different connections,
which most apps tend to do, a shared plan cache can save significant overhead
and reduce response times. Back in the day it was pretty much mandatory to make
sure you were using parameterised SQL or sprocs so that a cached plan was used
properly.
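
For illustration, with a hypothetical users table; the plan lives and dies
with this one connection:

    PREPARE get_user (bigint) AS
        SELECT * FROM users WHERE id = $1;
    EXECUTE get_user(42);   -- planned once for this session...
    EXECUTE get_user(43);   -- ...but no other connection can reuse the plan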

------
jedberg
I agree with everything on this list. Been bitten by most of these at one point
or another. XID wraparound was the worst.

Ironically we avoided a lot of the replication bugs by accidentally deciding
to use logical replication from the start, but that of course brought in a
whole different set of bugs instead.

I'm surprised there wasn't a complaint about the vacuumer. That was probably
my biggest single pain of running a large active cluster. There was never a
good time to vacuum, but if you skipped it, it eventually happened
automatically, usually at the worst possible time, when the database was most
active.

To be fair I haven't managed postgres since 8.3, so maybe that got better?

~~~
Tostino
It's pretty darn different than 8.3 today.

------
KingOfCoders
The 'best database' article recently linked here says

"PostgreSQL picks a method of concurrency control that works best for high
INSERT and SELECT workloads. [...] tracking overhead for UPDATE and DELETE."

This one says "INSERT and UPDATE operations create new copies (or “row
versions”) of any modified rows, leaving the old versions on disk until they
can be cleaned up."

I do think this one is wrong, but that is my wild guess, as I am no expert in
any way. INSERT should be fine. Or how would INSERT "create new copies"?

[Edit] See authors comment.

~~~
rbranson
Author here. This should not have included INSERT. I updated the post to
reflect that. Thanks for picking this up.

~~~
KingOfCoders
Thanks for your fast reply, much appreciated.

------
DeathArrow
I'm a humble web developer and I'm not very knowledgeable about databases.

I am glad I deal with an ORM for both personal and work projects instead of
relying on database specifics. That way, the app is DB agnostic and I can
switch the database with ease. If your resources are limited, I think that is
good.

When you have the resources, it's better to hire an architect and a DBA to
tell you what DB to use and maintain it.

~~~
goatinaboat
_I can switch the database with ease_

In my experience an organisation is far, far more likely to switch operating
systems or hardware platforms or programming languages than they are the
database. But no programmer bothers to code in a clever but restricted syntax
that would be a valid program in both C# and Java. Or restricts themselves to
a core set of OS features or hardware instructions just in case. It really is
quite bizarre to watch.

~~~
karatestomp
Plus it’s fairly common to end up with more than one program, in more than one
language, reading from and writing to a DB. The more you’ve avoided using DB
features (in featureful databases, anyway) the slower (in performance and dev
time) and riskier (you _will_ have more bugs, and maybe some pretty bad ones)
this is.

------
mamcx
Ok, I have mine:

When you declare a custom type with a check, the error does not show the
row/table that caused the problem, only that something happened:

    CREATE DOMAIN TEXTN AS TEXT
    CONSTRAINT non_empty CHECK (length(VALUE) > 0);

However, doing the check inline shows the error in full.

This caused me to rewrite all the tables twice (once adding the new type,
thinking it would help, and once again inlining everything).
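
For contrast, the inline version, which does name the table and constraint in
the error (hypothetical table):

    CREATE TABLE items (
        -- same rule as the domain, but written inline
        name TEXT NOT NULL CHECK (length(name) > 0)
    );
    -- INSERT INTO items VALUES ('');
    -- roughly: ERROR: new row for relation "items" violates check
    --          constraint "items_name_check"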

------
rossmohax
The criticism is valid, but he talks about cases of millions of connections to
a single db, which is a scale many companies will never see. In addition to
that, probably no database can serve under significant load without careful
tuning, preferably with an understanding of DB internals and of the compromises
the DB authors took when designing it.

PostgreSQL is constantly improving. At least some of the problems with scaling
the number of connections have more to do with locking than with the
process-per-connection architecture, and it is being worked on, with impressive
results doubling the number of transactions per second at 200 connections:
https://www.postgresql.org/message-id/20200301084638.7hfktq4sh2mcnuen%40alap3.anarazel.de

~~~
scurvy
I'll trust postgresql more when it can support a few thousand connections
without resorting to running middleware (pgbouncer) all over. That was his
point. PostgreSQL is just abysmally bad in this area.

The process per connection model works great for "my first rails project" so
every developer brings it to $dayjob. Then they are caught off guard when they
start getting real traffic. It's terrifying to watch a couple hundred
connections take a moderately sized server (~100 threads) down the
native_queued_spin_lock_slowpath path to ruin. That's just sad.

~~~
anarazel
> I'll trust postgresql more when it can support a few thousand connections
> without resorting to running middleware (pgbouncer) all over. That was his
> point. PostgreSQL is just abysmally bad in this area.

Depending on your workload it's entirely possible to run PG with 2000
connections. The most important thing is to configure postgres / the operating
system to use huge pages, that gets rid of a good bit of the overhead.
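
A rough sketch of that config; the OS-side huge page reservation is a sysctl,
noted in the comment:

    -- Requires the kernel to have huge pages reserved first, e.g.
    -- vm.nr_hugepages sized to cover shared_buffers (outside postgres).
    ALTER SYSTEM SET huge_pages = 'on';   -- default is 'try'; needs a restart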

If the workload has a lot of quick queries it's pretty easy to hit scalability
issues around snapshots (the metadata needed to make visibility
determinations). It's not that bad on a single-socket server, but on 2+
sockets with high core counts it can be significant.

We're working on it (I'm polishing the patch right now, actually :)). Here's
an example graph
https://twitter.com/AndresFreundTec/status/1234621534364241920

My local 2 socket workstation doesn't have enough cores to show the problem to
the same degree unfortunately, so the above is from an azure VM. The odd dip
in the middle is an issue with slow IPIs on azure VMs, and is worse when the
benchmark client and server run on the same machine.

> It's terrifying to watch a couple hundred connections take a moderately
> sized server (~100 threads) down the native_queued_spin_lock_slowpath path
> to ruin. That's just sad.

Which spinlock was that on? I've seen a number of different ones over time.
I've definitely hit ones in various drivers, and in both the generic parts of
the unix socket and tcp stacks.

~~~
scurvy
>> native_queued_spin_lock_slowpath path

> Which spinlock was that on? I've seen a number of different ones over time.
> I've definitely hit ones in various drivers, and in both the generic parts of
> the unix socket and tcp stacks.

Not sure yet. It was on a server with 1000 stable connections. Things were
fine for a while, then suddenly the system would jump to 99% on all 104 threads
and native_queued_spin_lock_slowpath was indicated by perf.

Ironically we cleared it up by having sessions disconnect when they were done.
Boggled the mind that increasing connection churn improved things, but it did.

~~~
anarazel
That sounds like you could have hit transparent hugepage / compaction related
issues. They, IME, tend to hit more often with lots of long-running processes
than when there's a lot of churn. It has gotten a lot better in more recent
kernel versions, but if you're on an older kernel, it can be really bad.

~~~
scurvy
THP => never. I thought about that, too.

------
phamilton
Moving to PostgreSQL on Amazon Aurora simplifies all the replication issues
listed. We (Remind) use an autoscaled PostgreSQL Aurora cluster and have been
pretty happy with it.

~~~
rossmohax
How much knowledge is transferable? Isn't Aurora just protocol- and
SQL-dialect-compatible, while underneath it has nothing to do with postgres?

~~~
vp8989
Each flavor (and version) of Aurora is compatible with a corresponding version
of the open source software. For example, Aurora MySQL 1.* is compatible with
MySQL 5.6.

At my current gig we use it in Prod, but we are also able to run our software
during development pointing to locally installed open-source versions of MySQL
just fine. I imagine it's the same for Postgres.

~~~
setr
I believe he was asking whether an understanding of the "under-the-hood" of
postgres transfers to Aurora -- that is, does your tuning knowledge transfer
as well, or is it just that you can migrate your codebase transparently?

Though now that I think about it, I think Aurora gives relatively little in
tuning access, so it's more a question of whether the heuristics transfer (eg
the ol' avoid-all-joins, which I've always been suspicious of, but still don't
know if it's a useful saying).

------
gregn610
I'm surprised that no-one is complaining about the lack of Active Directory
integration for authorization.

#disclaimer - Author of an AD integration solution that never got off the
ground.

[https://github.com/gregn610/padnag](https://github.com/gregn610/padnag)

~~~
LunaSea
I think it's a bad idea for a database to start implementing features related
to a third-party vendor.

That's the type of feature that should be implemented as a plugin.

~~~
nijave
You could just call it "ldap authentication". AD comes with an LDAP interface.

~~~
anarazel
Postgres does have ldap based auth, and also can authenticate against AD using
sspi/gssapi.

The problem with that is that it requires users to have been created inside
postgres first, and that you can't manage group membership inside AD.
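
Roughly what that looks like; the pg_hba.conf line is only an example, and the
roles are hypothetical:

    -- pg_hba.conf delegates authentication to AD, e.g. something like:
    --   host all all 0.0.0.0/0 ldap ldapserver=ad.example.com ldapprefix="EXAMPLE\"
    -- ...but the role and its group memberships still live in postgres:
    CREATE ROLE reporting_users NOLOGIN;   -- "group" role, maintained by hand
    CREATE ROLE "jdoe" LOGIN;
    GRANT reporting_users TO "jdoe";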

~~~
gregn610
Yup, that's what I meant by authorisation: keeping the roles and groups in
pgsql up to date.

------
mrfusion
Can I bounce this idea off you guys.

Would there be a market for a DBA who charges maybe 100-200 to just come in,
listen to your DB use cases, and recommend various config/setting changes,
hardware, etc.?

It seems so much better than having a team of programmers study Postgres
settings for a week. That was my last experience with it at least.

~~~
codegladiator
Why wouldn't an automated interface be able to give similar recommendations?

~~~
no-s
> Why wouldn't an automated interface be able to give similar recommendations?

Are you being sarcastic? If only we had an automated interface to make decisions
about what code to write, then we wouldn't need programmers. Look how well
that turned out IRL. We don't really need many assembly language programmers
any more, but now we have all these nifty new programming languages...

~~~
codegladiator
I am under the impression that all the combinations of all configurations
would still be small enough to present neatly, maybe in a wizard.

------
jakearmitage
Does anyone know any good book with PostgreSQL tips and tricks and cool
snippets? Coming from MySQL, I'm always finding exciting things: checks, date
ranges and GIST, to_tsvector(title) @@ websearch_to_tsquery('foobar')...
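
The last one in action, roughly, on a hypothetical articles table:

    CREATE INDEX ON articles USING gin (to_tsvector('english', title));
    SELECT * FROM articles
    WHERE to_tsvector('english', title) @@ websearch_to_tsquery('foobar');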

------
codegladiator
Wow, this article and this thread are so full of useful information for getting
started venturing into DB management/engineering.

------
osrec
Not sure if this makes sense, however, I've always preferred MariaDB to
Postgres because it _feels lighter_. And I haven't really come up against any
significant limitations in MariaDB that would make me want to switch to
Postgres.

~~~
edoceo
Heavy writes and multistep transactions are where PG has always won for me vs
Maria and MySQL - it was even more pronounced when I decided I was all in on
PG, circa 2003.

~~~
redis_mlc
1) 2003 was 17 years ago.

2) PG has a write amplification problem with multiple indexes that MySQL
Innodb doesn't have.

~~~
asah
I'm curious how InnoDB solves this, given that the theory of indices pretty
much demands write amplification by definition. Is Postgres' WA worse?

~~~
edoceo
It's from an Uber post - more story and follow-up in this post:
https://dba.stackexchange.com/questions/226879/did-postgresql-write-amplification-reduction-warm-efforts-get-released-in-10-1

~~~
djd20
Yeah - they indexed every single column... Also, what may not have worked for
Uber will be just fine for 99% of use cases. How often do you work with that
kind of load? What they moved to isn't so much MySQL as MySQL used as a
key-value store with zero relational database usage. Not really a Postgres
issue, rather one of being too big for relational databases.

------
rishav_sharan
As a not-very-tech-savvy person, I find object/hierarchical data querying in pg
horrible. The query syntax is alien and just feels tacked on.

~~~
Tostino
Mind giving an example?

------
Mountain_Skies
Anyone else getting a 500 error for the link?

~~~
andy_ppp
Hopefully a database problem

:-/

------
thayne
Why does postgresql use a process per connection? Is there some advantage to
doing this over using threads for each connection (or even asynchronous
connection handling)?

~~~
int_19h
Postgres dates back to the era when multiple processes were the normal way of
doing this kind of thing on Unix-like OSes, and POSIX threads were often
poorly supported on free ones.

------
animalnewbie
Anyone got postgres (or general db) tips for a CoW-fs? I'm using postgres on
zfs.

~~~
p_l
Ensure that the dataset containing your postgres data is configured with a
record size equal to the postgres page size, or close enough (lots of places
use 8kB ZFS records for 4kB pages).

This will reduce write amplification due to excessive read-modify-write
cycles.
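
A rough sketch; the zfs command itself lives outside the database:

    -- Confirm the compiled-in page size, then match the dataset to it,
    -- e.g.:  zfs set recordsize=8k tank/pgdata   (hypothetical dataset)
    SHOW block_size;   -- 8192 by default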

~~~
Tostino
Pg uses 8kb pages by default. You really do want your fs/db page sizes to
match, though, except in very, very specific scenarios.

~~~
animalnewbie
Why doesn't pg query the fs for the default scenario, I wonder.

~~~
Tostino
Because it's a compile-time flag, not an init or config parameter.

I agree it would be nice if the page size were more adaptive, to just not have
FS page size alignment issues.

------
rb808
It's scary that my company is replacing old Oracle DBs and teams of specialist
DBAs with developers running their own PostgreSQL instances. Sure, it saves
money, but it's gonna blow up one day soon...

~~~
Tostino
Getting rid of Oracle, fine... But getting rid of specialists is probably not
the smartest move. I don't think Postgres will have anything to do with the
problems your company may face.

