

PostgreSQL when it is not your job - wahnfrieden
http://reinout.vanrees.org/weblog/2012/06/04/djangocon-postgres.html

======
rosser
As someone whose job it _is_ to keep people's PostgreSQL instances happy:
this list is fairly comprehensive, and much of it is good. His advice about
configuration directives towards the top of the article, however, is
terrible.

In particular, work_mem: the article suggests setting it to 2-3x the size of
the largest temp file you see. The thing you need to be mindful of with
work_mem is that the limit is _per sort_. I have a process on one of my
masters that periodically regenerates a materialized view. Each run leaves
hundreds of mebibytes of temp files. Configured per the article's advice, it's
eminently possible to exhaust physical memory on sorts (100 connections each
doing 10 sorts, for example). Unfortunately, the Linux OOM-killer is naïve
about postgres; it tends just to thump the postmaster. Fun times.

On the contrary, something low like 16MB is the general recommendation. You
can tweak that per session if you know you'll be doing larger sorts and don't
want to spill to disk ("SET work_mem = $desired_value"), but there's no need
to allocate hundreds of megabytes to sort tens of tuples.
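
For instance, a minimal sketch of scoping a bigger work_mem to a single
transaction (the value and table name here are invented):

    BEGIN;
    SET LOCAL work_mem = '256MB';  -- reverts automatically at COMMIT/ROLLBACK
    SELECT customer_id, sum(total)
      FROM orders                  -- hypothetical table
     GROUP BY customer_id
     ORDER BY sum(total) DESC;
    COMMIT;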

EDIT: Heed the advice about transactions under Django. At a previous gig,
correcting the default behavior of leaving a transaction open for sometimes
_days_ at a time reduced the bloat on some hot but small tables by three
orders of magnitude. VACUUM can't do its job if there are open transactions
to which the dead tuples it's trying to reclaim might still be visible.
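
For the curious, a quick sketch for spotting those open transactions in
pg_stat_activity (column names as of 9.1; 9.2 renames procpid/current_query
to pid/query):

    SELECT procpid, usename, xact_start, current_query
      FROM pg_stat_activity
     WHERE xact_start < now() - interval '1 hour'
     ORDER BY xact_start;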

Also important: the bit about IN() clauses. A few months ago, I was given a
query that hadn't completed overnight and was asked to make it go faster. It
contained a moderately sized (but not massive) IN() clause, which I
refactored into a JOIN. That was the only change I made, after which it ran
in 3.7s.
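
The general shape of that refactor, as a hedged sketch (all names invented;
not the actual query):

    -- Before: a literal list inside IN ()
    SELECT * FROM orders WHERE customer_id IN (1, 2, 3 /* ...thousands more */);

    -- After: the same ids as a relation the planner can hash and join against
    SELECT o.*
      FROM orders o
      JOIN (VALUES (1), (2), (3)) AS ids(customer_id)
        ON ids.customer_id = o.customer_id;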

EDIT: clarification.

~~~
hennk
Just out of interest: how large was the "moderately sized" IN() clause? I'm
asking because we are looking at PostgreSQL as an alternative to MySQL, and
we currently have some queries with up to 5000 values inside IN().

~~~
rosser
I don't remember specifically, but I believe it was in the thousands to low
tens of thousands of rows. It's also not consistent. I've seen larger IN()
clauses that never have a problem, and smaller ones that consistently do. It's
been on my very low priority to-do list to put together some demo cases for
the mailing lists, because overnight to < 4s just from that little refactor
isn't the greatest...

~~~
tmoertel
Are we talking about IN clauses that contain a correlated subquery or
something the optimizer would have a hard time determining was independent of
outer context?

~~~
jeltz
PostgreSQL is able to convert correlated subqueries with IN () clauses into
joins in most cases. My guess is that it could have been two queries
refactored into one.

------
justin_vanw
This advice is just copypasta.

It's also pretty dangerous and wrong.

Example: "shared-buffers. below 2GB: set it to 20% of full memory, below 32GB:
25% of your full memory." -- Don't do this. Set it to around 20% of your
memory if you have a small machine, such as a VPS or desktop. If you have lots
of memory, set it between 2GB and 4GB. Anything above 8GB exceeds what it is
designed to handle and can cause major performance problems, such as the
database becoming unresponsive for 1-2 minutes.

"work_mem. Start low at 32/64MB. Look for temporary file lines in logs. Then
set it to 2-3x the largest temp file that you see. This setting can give a
huge speed boost (if set properly)." -- This is a great way to cause your
database machine to swap to death. A single query can cause many times
work_mem to be allocated (it can allocate this much for every sort or hash).
So this really depends on how many connections you have and what the queries
are. No silver bullet here, but 16-24MB usually works pretty well if you have
enough memory.

maintenance_work_mem: 2GB is plenty. This is used whenever you create an
index, and I think autovacuum also uses this setting. 10% is way too high.

checkpoint_timeout: the higher the better here, but keep in mind that if your
db goes down and has to be restarted, recovery can take this long before it
is available to accept queries. 5 minutes is probably what I would use unless
I knew I could accept more downtime than that.
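
Pulling those suggestions together, a hedged postgresql.conf sketch
(starting points only, not gospel; measure on your own workload):

    shared_buffers = 4GB           # 2-4GB on big boxes; ~20% of RAM on small ones
    work_mem = 16MB                # per sort/hash node; raise per session as needed
    maintenance_work_mem = 2GB     # index builds, VACUUM
    checkpoint_timeout = 5min      # longer = cheaper checkpoints, longer recovery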

Also, changing linux kernel settings can make a huge difference, but tuning
disk performance and dirty buffer sizes is a whole topic I won't get into
here.

You can learn basically everything you need here:
http://www.2ndquadrant.com/en/postgresql-90-high-performance/

One thing not mentioned, but which can have a HUGE performance advantage,
especially on virtualized disks or spinning disks without a battery-backed
RAID controller, is asynchronous commit:
http://www.postgresql.org/docs/9.1/static/wal-async-commit.html
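
The setting itself is one line, per session or globally; per the linked
docs, you risk losing the last few transactions in a crash, but not
corruption:

    SET synchronous_commit = off;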

~~~
masklinn
> This advice is just copypasta.

> It's also pretty dangerous and wrong.

It's live notes from a presentation; I expect Reinout van Rees noted the
parts that interested him, and the actual presentation has what you'd
consider "right":

> If you have lots of memory, set it between 2GB and 4GB. Anything above 8GB
> exceeds what it is designed to handle

From the slides:

> Above 32GB (lucky you!), set to 8GB.

> This is a great way to cause your database machine to swap to death.

from the slides:

> But be careful: It can use that amount of memory per planner node.

> maintenance_work_mem: 2GB is plenty. [...] 10% is way too high.

from the slides:

> 10% of system memory, up to 1GB. Maybe even higher _if you are having VACUUM
> problems_.

(emphasis mine)

~~~
reinout
Hi, I'm the Reinout that made the notes. Yes, it was a live summary and his
slides went by very fast. I'm surprised how much I got written down. I had to
leave stuff out, for instance the >32GB comment (which seemed applicable to
only a few people).

Masklinn, thanks for doing this bit of checking! For many people (including
myself :-) the summary will be enough, but checking the actual presentation is
a good idea if you run into problems.

------
einhverfr
Part of the reason why PostgreSQL has so many knobs is that these are not
always things that cookie-cutter approaches work with. Although if you do
need to worry about them, it probably is your job, and you are probably at
least going to learn it.

I am not convinced about his list of "stupid db tricks you should not do." For
example:

1) Sessions in the db are sometimes a good thing and sometimes not. They do
have a real performance cost (we have a query in LedgerSMB that takes 20x
longer on large dbs because of this, but the cost is necessary and still
worth it, though we will probably later offer non-web-based alternatives
where it won't be necessary). Of course, that query is actually using another
table to provide "discretionary locks" of rows to a session... and those
locks MUST persist across transactions because we are mapping them to a
series of HTTP requests...

2) I don't know about Celery queues, but with listen/notify you can do some
really cool message queuing in PostgreSQL (a minimal sketch follows this
list).

3) When my app has to store files attached to data, I usually find it's
simpler to put them in the db than in the filesystem. That guarantees the
files are in the backups, among other things. Also, for larger files,
PostgreSQL's lob interface (up to 2GB) provides seek operations and more.
Performance issues here end up occurring outside PostgreSQL.
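
That minimal listen/notify sketch (channel name and payload invented; NOTIFY
payloads need 9.0+):

    LISTEN task_queue;            -- in the consuming session
    NOTIFY task_queue, 'job 42';  -- in the producing session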

I am with him on very long-running transactions.

COPY is good for some things, but if you are trying to write more portable
code you probably want to do something like INSERT INTO foo (...) VALUES
(...), (...), (...) instead.
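
Both forms, as a hedged sketch (table and file path invented):

    COPY foo (a, b) FROM '/tmp/foo.csv' WITH (FORMAT csv);        -- fast, Pg-flavored
    INSERT INTO foo (a, b) VALUES (1, 'x'), (2, 'y'), (3, 'z');   -- portable multi-row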

I would also suggest it is important to know the difference between LIKE
'%this%' and full text searching in PostgreSQL. These are not simple drop-in
equivalents. Additionally, LIKE '%this%' cannot use an index (though LIKE
'this%' can, and you can do full text indexing on Pg).
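
A hedged sketch of both sides (table invented; under most locales the prefix
case needs a text_pattern_ops btree index):

    CREATE INDEX docs_body_prefix ON docs (body text_pattern_ops);
    CREATE INDEX docs_body_fts ON docs USING gin (to_tsvector('english', body));
    SELECT * FROM docs WHERE body LIKE 'this%';                          -- btree
    SELECT * FROM docs WHERE to_tsvector('english', body)
                         @@ to_tsquery('this');                          -- FTS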

~~~
jeltz
Actually, since PostgreSQL 9.1, LIKE '%this%' is indexable with the pg_trgm
contrib module. Since it is based on the trigrams in your search string, it
obviously has its caveats; generally, the longer the search string, the more
effective the index lookup is. I would imagine, for example, that '%th%'
requires a full table/index scan since it contains zero trigrams.

http://www.postgresql.org/docs/9.1/static/pgtrgm.html#AEN137099
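
As a sketch of what that looks like in practice (9.1+, table invented):

    CREATE EXTENSION pg_trgm;
    CREATE INDEX docs_body_trgm ON docs USING gin (body gin_trgm_ops);
    SELECT * FROM docs WHERE body LIKE '%this%';  -- can now use the trigram index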

~~~
einhverfr
But pg_trgm isn't really the same either, is it? I have looked at pg_trgm
primarily for handling misspellings and suggesting alternatives.

Also, it wasn't clear to me how "%this%" would be differentiated from "his
thin snake."

~~~
jeltz
Yes, pg_trgm was built for that, but someone figured out how LIKE could be
hacked to use the trigram indexes (gist_trgm_ops, gin_trgm_ops). So if you
have a *_trgm_ops index on the column, normal LIKE and ILIKE queries may use
that index.

I assume your example would be a false index hit, which then has to be
verified against the real value. The same applies to making sure 'This' is
not a hit when doing a case-sensitive search for '%this%'. So index-only
LIKE scans are not possible with *_trgm_ops indexes.

EDIT: I just realized how awesome the extensibility of PostgreSQL is. An
extension can make a core operator such as LIKE indexable in an entirely new
way without touching the core code.

~~~
einhverfr
The extensibility is really cool. And yeah, that looks like a very useful
thing. Thanks for the pointer.

------
__mharrison__
Here's an attempt [0] (by the author of the Postgres high performance book,
with some patches by yours truly) to automate PG configuration. It's not
perfect, but it's certainly better than OOB settings.

0 - <https://github.com/gregs1104/pgtune>

------
antihero
One question on avoiding giant IN clauses with Django:

Say I have a class called Fridge, and classes called Vegetables and
Condiments.

Both of these have ManyToMany relationships with Fridge.

So something like:

    class Fridge(models.Model):
        condiments = models.ManyToManyField(Condiments)
        vegetables = models.ManyToManyField(Vegetables)

And here we have a QuerySet that represents our white fridges:

    qs = Fridge.objects.filter(color='white')

First query:

"Given a list of condiment IDs, get me all the fridges that have ANY of those
condiments in them (modifying the original QuerySet).""

Second query:

"Given a list of vegetable IDs, get me all the fridges that have ALL of those
vegetables in them (modifying the original QuerySet)."

How on earth would I do that without building a list of fridge IDs and adding
an IN clause to my queryset?

Here are solutions that do it with IN clauses:

First query:

    condiment_ids = [...]  # list of condiment IDs
    condiments = Condiments.objects.filter(id__in=condiment_ids)
    condiment_fridges = None
    for condiment in condiments:
        fridges = condiment.fridge_set.all()  # reverse side of the M2M
        if condiment_fridges is None:
            condiment_fridges = fridges
        else:
            condiment_fridges = condiment_fridges | fridges  # union: ANY match
    qs = qs.filter(id__in=[f.id for f in condiment_fridges])

Second query:

    vegetable_ids = [...]  # list of vegetable IDs
    vegetables = Vegetables.objects.filter(id__in=vegetable_ids)
    vegetable_fridges = None
    for vegetable in vegetables:
        fridges = vegetable.fridge_set.all()  # reverse side of the M2M
        if vegetable_fridges is None:
            vegetable_fridges = fridges
        else:
            vegetable_fridges = vegetable_fridges & fridges  # intersection: ALL match
    qs = qs.filter(id__in=[f.id for f in vegetable_fridges])

These solutions seem horrible and hackish, and I was wondering if there is a
better way to do them with Django. Something like "object HAS ANY of these
objects" or "object HAS ALL of these objects".

Should I just post this on StackOverflow instead?

~~~
ashray
Might be a good idea to take that to StackOverflow. My suggestion, though,
would be to use that little sauce that we can't achieve programmatically yet:
human intelligence. How often do you think you're going to need this data?
How often does it change?

Cache it based on that. Whatever way you do it, queries like this are
expensive. If you need it often enough, throw it into memory and invalidate as
necessary. More often than not, your cache invalidation will use less lines
than what you wrote up there.

Also, don't be afraid to drop into raw sql if you really want to do something
a certain way. The example you posted above clearly looks like the ORM working
against you.

~~~
antihero
Ok, here's my question on StackOverflow:
http://stackoverflow.com/questions/10930169/how-do-i-get-djangos-orm-to-check-if-model-has-or-has-all-of-a-certain-relation

------
davyjones
If you don't configure your shmmax first, you might have a problem when
tweaking the memory settings in postgresql.conf.

Relevant doc:
http://www.postgresql.org/docs/9.1/static/kernel-resources.html

~~~
rosser
TL;DR: Postgres will just fail to start if shmmax is too small, and the logs
will reflect that. On Linux, you'd say:

    # sysctl -w kernel.shmmax=N

Where N is the desired maximum shm segment size. (You'll also want to increase
shmall, to accommodate shmmax, and any other shared memory requirements you
may have.)

EDIT: Of course, to make the change persist across reboots, you'll also want
to add it to /etc/sysctl.conf. I've forgotten that part more than just in this
comment...
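
That is, something like this (N and M are placeholders; kernel.shmall is
counted in pages, not bytes):

    # /etc/sysctl.conf
    kernel.shmmax = N
    kernel.shmall = M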

~~~
stefantalpalaru
Try this strategy instead of using sysctl to set individual values:

  - edit /etc/sysctl.conf
  - run "sysctl -p"

~~~
james4k
Yes, definitely. I just learned about -p recently, and had been editing the
config AND running sysctl -w for way too long.

------
seunosewa
Some of the suggestions still make PostgreSQL seem less mature than InnoDB:
"[don't put] sessions in the DB", "[don't put] constantly-updated counters in
the database", and "[don't put] task queues in the database."

My forum gets almost a million page views daily; we store all our data in an
InnoDB database, including sessions, task queues, and constantly updated
counters. They work just fine and aren't even bottlenecks when the isolation
level is REPEATABLE READ.

We use SERIALIZABLE though, as a matter of principle, so we had to implement
a fancier schema, but everything happens in the InnoDB database. We don't
have to worry about running VACUUM periodically either. I feel PostgreSQL may
be a wee bit overhyped.

~~~
einhverfr
Sessions in the db are fine, depending on what you are doing with them.
LedgerSMB for example uses them to track who is doing what right now in the
db, as well as maintain per-user locks that have to persist across db
transactions.

In most workflows there is no significant performance penalty here. The only
problem is where we are checking those locks and trying to obtain them if they
are not held by someone else. This is a significant problem and currently
makes a query in a large db take about 20x as long.

It all depends on what you are doing. But yeah, tying extra-transactional
locks to the session so you can do reliable locking across HTTP requests
sucks :-)

------
xxiao
For the memory portion: if what's stated is true, why doesn't PostgreSQL just
use those values directly? They're fixed values based on the hardware it runs
on.

------
raverbashing
THIS

Thanks

Even better if it were a quick guide to all of PSQL's quirks.

Dear DBAs, PostgreSQL may be great and all, but if I need to spin up a DB for
testing/proof of concept, you bet I'm going to use MySQL 20 times out of 10.

"Go RTFM"? Sorry, I've lost count of how many times I've had to set up MySQL
or PSQL, and MySQL is much more intuitive and easy to work with.

PSQL is sincerely a waste of time and energy for small things. If I need
scalability, etc., sure, get a DBA and go for PSQL.

"Stupid DB tricks you should not do": don't forget no logs in the DB. Or just
put them in a totally separate DB. But hey, an append-only file is OK.

"Don’t use gigantic IN clauses. Django generates them a lot. JOINs are fine,
but IN isn’t well-supported in postgres."

Oh really?! And then they say Django on PSQL is faster?! Maybe for something
really simple (not in my experience; maybe that's why some things are faster
in MySQL).

And don't forget PgAdmin3 _rocks_

~~~
wahnfrieden
Is MySQL faster at handling IN than PSQL? Just because the article says IN
isn't well-supported in PSQL doesn't mean it is well-supported in MySQL. I
don't know either way, but I'd like to.

~~~
raverbashing
I'm not sure how Django constructs the query, but it certainly does it
differently between MySQL and PostgreSQL (and Oracle), since this is in the
Django driver.

