

Top Missing PostgreSQL Features - someotheridiot
http://blog.kimiensoftware.com/2011/04/top-10-missing-postgresql-features/

======
btilly
We've discussed query hints before at
<http://news.ycombinator.com/item?id=2179433> along with why they don't exist
in PostgreSQL. My opinion is clear from that discussion.

The other feature in this list that I really, really like is session tracing.
When I used to work with Oracle I implemented a feature that allowed any web
page to have the database session logged if you passed in a magic parameter.
This was unbelievably useful in debugging performance problems on the live
system as they were happening. In fact it proved so useful that we would turn
tracing on for one session for some period of time each day, inserting
pointless marker queries at the start and end of each web page. We then ran a
script through the log to be able to break it up by web page, looked at the
slow web pages, what they did, and thought about how we could make it faster.
This exercise suggested optimizations that resulted in something like doubling
the capacity of our website on that hardware.

This is invaluable data that you simply cannot collect in any other way. You
certainly can't turn on logging for the whole database - the website will
crash. Sure you can log things at the application level - I did that as well.
But it just says, "We ran these queries and got this timing." It doesn't shed
light on what is happening inside of the database.
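
For concreteness, the log-splitting script could look something like this (a minimal sketch; the `PAGE_START`/`PAGE_END` marker format and the page names are hypothetical illustrations, not what we actually used):

```python
import re
from collections import defaultdict

# Hypothetical marker convention: the app brackets each web page with
# no-op queries like  SELECT 'PAGE_START:home'  and  SELECT 'PAGE_END:home'
# so the database trace can later be split up per page.
START = re.compile(r"PAGE_START:(\w+)")
END = re.compile(r"PAGE_END:(\w+)")

def split_trace_by_page(lines):
    """Group trace lines between start/end markers by page name."""
    pages = defaultdict(list)
    current = None
    for line in lines:
        m = START.search(line)
        if m:
            current = m.group(1)  # entering a page
            continue
        if END.search(line):
            current = None        # leaving a page
            continue
        if current is not None:
            pages[current].append(line)
    return dict(pages)
```

Once the trace is bucketed per page, you can sum timings per bucket and go straight to the slowest pages.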

~~~
someotheridiot
Yes, Oracle's tracing capabilities are awesome. I too have regularly traced
individual production sessions. A quick tkprof later (well, maybe some dev time
too) and that 10-minute report now runs in 10 seconds.

------
dougabug
You can restore data to a table from a dump using pg_restore without
recreating indexes. If you have a couple extremely large tables, you should
consider partitioning them. Partitions are true tables in Postgres, and you
should be able to use parallel restore on them.

Bruce Momjian told me on Monday that Robert Haas was leading the index only
scan effort, and that this (important) feature should appear in the 9.2 time
frame.

Most hints are a bad idea; they represent leaks in the relational abstraction.
Further, the DBA's knowledge of the performance characteristics of various
tasks is transient, and often quickly mooted by changes in technology, data
scale/statistics, or even simple hardware/software reconfiguration. FIRST_ROWS
is a reasonable hint, since it speaks to the logical intent of the query, and
says something about how much of the data it intends to use and how it
prioritizes the results. The planner can then work towards the stated
optimization goal.

~~~
someotheridiot
I think you forgot to finish your sentence. You were about to say how to do
it?

~~~
dougabug
Pg_restore is pretty flexible. You can have it generate a list of the objects
in the dump, and then edit that file to create/restore only the objects of
interest. You can omit/change permissions, schema objects, restore order, etc.
You can also specify data only restore as a command line option. If you don't
want the index, just remove it from the table definition and do a data only
restore on that table.
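
A sketch of that workflow as command construction (the dump name, list file, and database name below are hypothetical; the flags `-l`, `-L`, `--data-only`, and `-d` are pg_restore's):

```python
# Three-step pg_restore workflow:
#   1. generate a list of the objects in the dump (pg_restore -l),
#   2. hand-edit the list to drop unwanted objects, e.g. an index,
#   3. restore only what remains, data-only if desired (pg_restore -L).

def restore_commands(dump="app.dump", listfile="restore.list", dbname="app"):
    """Build the argv lists for steps 1 and 3; step 2 is manual editing."""
    gen_list = ["pg_restore", "-l", dump]       # writes the TOC to stdout
    restore = ["pg_restore", "-L", listfile,    # restore per the edited list
               "--data-only",                   # skip DDL, load data only
               "-d", dbname, dump]
    return gen_list, restore
```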

------
rbranson
The same argument he uses to advocate query hints rests on a situation that is
actually much more likely to occur because of query hint use: hard-coded
assumptions can't use runtime statistics or cost configuration to alter a
query plan.

~~~
pyre
You're also making assumptions here. You're assuming that the query planner
will always choose the right plan unless a human tells it otherwise (therefore
bad plans are the fault of the humans).

Assuming that in the author's example it's really the fault of the human who
wrote the query with a hint in it, it's not an argument against allowing
hints. Just because someone might misuse a tool isn't a reason for not
creating it. It's the same argument that the RIAA/MPAA uses when they want to
ban P2P technologies (i.e. It can be used to infringe copyright, therefore it
should be illegal).

Most of the 'hints-vs-no-hints' arguments devolve into two camps:

1) People that want to have control over the query because they don't fully
trust the planner (this mistrust could be logical or illogical -- i.e. people
that experienced a need for hints, or people that just don't trust the planner
even though it's never given them a bad plan)

2) People that put 100% trust in the planner and say that since the planner is
deterministic, then it's a human's fault when it comes up with a bad plan.
(i.e. They love The Planner. They trust The Planner. They are careful not to
taunt or anger The Planner.)

------
ibejoeb
I'm pretty much on board with these points. Here's an interesting explanation
of the absence of hints: [http://it.toolbox.com/blogs/database-soup/why-
postgresql-doe...](http://it.toolbox.com/blogs/database-soup/why-postgresql-
doesnt-have-query-hints-44121?rss=1).

Also: FIRST_VALUE and LAST_VALUE really need IGNORE NULLS to do proper, high-
performance rollforwards for things like time series data and periodic
analysis.
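
For what it's worth, the rollforward itself is just carry-last-observation-forward; a minimal sketch of the logic (the usual PostgreSQL workaround partitions by `count(val) OVER (ORDER BY ts)`, since `count()` skips NULLs and so increments only at each new observation):

```python
# Pure-Python equivalent of LAST_VALUE ... IGNORE NULLS over an ordered
# series: replace each missing value with the most recent observation.

def rollforward(values):
    """Replace each None with the last non-None value seen so far
    (or None if no observation has been seen yet)."""
    last = None
    out = []
    for v in values:
        if v is not None:
            last = v
        out.append(last)
    return out
```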

~~~
gaius
OK here is an example off the top of my head. Let's say you have a join
between two tables and the query planner wants to do a hash join, because
given the statistics on all the tables, that makes perfect sense. Now let's
say that you know that you only actually want the first 10 rows. Well a hash
join is cheap per row, but expensive to set up. Nested loops are expensive per
row, but cheap to set up. So, maybe, actually, as a DBA and a developer I _do_
know what I want, and I can communicate my intent to the planner, and then let
it decide.
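
That setup-versus-per-row tradeoff can be sketched with a toy cost model (the constants below are made up for illustration, nothing like Postgres's actual costing):

```python
def join_cost(setup, per_row, rows):
    """Toy linear cost model: total = setup + per_row * rows."""
    return setup + per_row * rows

# Illustrative constants only: a hash join pays a big setup cost (building
# the hash table) but is cheap per row; a nested loop is the opposite.
HASH_SETUP, HASH_PER_ROW = 1000.0, 0.1
LOOP_SETUP, LOOP_PER_ROW = 1.0, 5.0

def cheaper_join(rows_needed):
    """Pick the cheaper strategy for the number of rows actually wanted."""
    hash_cost = join_cost(HASH_SETUP, HASH_PER_ROW, rows_needed)
    loop_cost = join_cost(LOOP_SETUP, LOOP_PER_ROW, rows_needed)
    return "nested loop" if loop_cost < hash_cost else "hash join"
```

With only 10 rows wanted, the nested loop wins despite its worse per-row cost; fetch the whole table and the hash join wins.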

Anyway, I'd be curious to know what flags the Postgres guys pass into GCC when
compiling it. Or do they know better than its "planner"?

~~~
wulczer
To answer your question about getting the first 10 rows, you can use the LIMIT
clause, which will influence the planner's decisions. Or you can declare a
cursor for your query and only fetch the first 10 rows.

As for GCC, it's usually up to the packagers, but the source tree after
running ./configure gets compiled with:

    
    
      -O2 -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wformat-security -fno-strict-aliasing -fwrapv

~~~
gaius
That's kinda my point. Aren't those "hints" to the compiler?

~~~
wulczer
Not in the sense that query hints are "hints" to the planner.

-O2 means "take more time compiling but produce faster code". This hint will always be valid and useful, regardless of how the codebase changes. Query planner hints can become obsolete as the data distribution changes.

-fwrapv and -fno-strict-aliasing are correctness settings.

-Wxxx flags are just help for the developers.

------
wiredfool
Multicore aware dump and restore would be a big win. I've hacked something up
to do it in Python, with queues and threads calling out to pg_dump. The drawback is
that when I do it, I'm not getting a consistent, one transaction dump. But
then usually, I'm shutting down the system to do an upgrade when I thwack it
like this, so it's a reasonable tradeoff. With a fast set of drives, I can
generally dump and reload with a concurrency of roughly O(#cores) in
approximately the time it takes for the largest table, rather than the sum of
the times for all the tables.

(there's a caveat there, and that's that my biggest tables tend to be toasted,
so there's a decent processor load uncompressing the toasted bits. Also, Power
law governs table size, and the biggest table is ~ 50+ % of the database, with
~ 100 tables. YMMV)

I wish there was a way to transparently parallelize operations over more than
one core. I've got enough operations that are actually processor bound that I
could benefit (some) from it. I could partition the database, but that's a
management headache.

~~~
random42
Are you looking for something like the -j/--jobs flag of
<http://www.postgresql.org/docs/8.4/static/app-pgrestore.html>, which specifies
the _number of jobs_ running in parallel (a sensible value usually depends on
the number of CPU cores)?

Granted, this only exists for pg_restore (8.4 onwards); there is no equivalent
functionality in postgres for pg_dump.

~~~
wiredfool
Something like that. The nice thing there is that the index creation is done
in parallel, which is not split well with my method.

What I do is dump the schema, then split it between the table creation and the
indexes. The first part gets loaded before the data, the second, after. Then
all the tables and all the sequences are dumped, optionally teed to disk, and
loaded in a psql process to the new location.
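
A minimal sketch of the parallel data-dump part, using concurrent.futures instead of hand-rolled queues; the table names, paths, and database name are hypothetical, and the commands are only constructed here rather than executed:

```python
from concurrent.futures import ThreadPoolExecutor
import os

def dump_command(table, dbname="app", outdir="/tmp/dump"):
    """Build a per-table pg_dump invocation as an argv list.
    --data-only matches loading the split-out schema separately."""
    return ["pg_dump", "--data-only", "-t", table, dbname,
            "-f", os.path.join(outdir, table + ".sql")]

def parallel_dump_commands(tables, workers=None):
    """Fan the per-table dumps out over a worker pool, roughly one per
    core. A real run would subprocess.run() each command; note each dump
    is its own session, so there's no single-transaction consistency."""
    workers = workers or os.cpu_count()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(dump_command, tables))
```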

I've been doing this since the 7.x series, so it way predates 8.4. (It also
predates the in place upgrades, which is a big thing that they've added since
I was coming up to speed)

------
Roboprog
This was an interesting article. In some ways, I think I learned more about
Oracle than PostgreSQL, but it's all good. My perspective is developer, rather
than DBA, so it was enlightening to delve into another's specialty.

I've occasionally used Postgres at home, including doing some work at home
long ago for a (small batch oriented) project to be hosted on Oracle at work.
Irony on that project: my crap hardware at home outperformed the server at
work, due to neglect by the Oracle DBAs at work. The DBAs had to really be
pushed to index several of the tables, as we had to convince them that we
really did read some tables and columns many times, rather than generate
"write once, read never" type data :-) In contrast, the default Postgres setup
did well at this particular job running at home with my "2 cans on a string"
setup: 200 MHz Athlon client & 166 MHz P1 "server", joined by a 96 kbps serial
PPP connection! (I forgot why I used 2 machines instead of 1, but computers
were small back then -- my "Atom" laptop now has more computrons than those
boxes)

I found that MySQL was not even close (at least 10 years ago) to emulating
Oracle. At least Postgres was close enough to make the effort, and the
poster's comparison, worthwhile.

------
bensummers
If you have DTrace on your platform, there are some interesting probes
available which can be handy for tracking down what's going on. Not quite one
of the requested features, but it can be used in system wide explorations of
performance issues as well as just looking inside the database.

[http://www.postgresql.org/docs/current/static/dynamic-
trace....](http://www.postgresql.org/docs/current/static/dynamic-trace.html)

------
tsewlliw
I really liked the article, it made me think about interesting problems, but I
only count 7 features in it.

 _> 1. Query progress_ _> 3. Session Wait Data_ _> 4. Session Tracing_

These are all the same feature. Granted, it's an awesome feature, and that's a
reasonable division.

 _> 2. Index Only Scans_ _> 5. Multiple Block I/O_

These are real valid (and separate) points. There's a good argument for
leaving 5 to the OS, but it does mean accepting that the OS might suck at
predicting what you want.

 _> 6. Tablespace Quotas_

It doesn't seem like this should be so hard -- use a user-space FS with a cap.

 _> 7. No Logging_

Why are you creating indexes all the time? Large data loads I can see.

 _> 9. Query Hints_

Flamebait, but acknowledged as such.

 _> 8. Better Export/Import Tools_ _> 10. More Backup/Restore Options_

Seriously, you just used the flamebait between these two so you'd make it to
10. Nice tactic :)

------
wulczer
Nice writeup. Point #7 (no logging) should be addressed by CREATE UNLOGGED
TABLE ([http://developer.postgresql.org/pgdocs/postgres/sql-
createta...](http://developer.postgresql.org/pgdocs/postgres/sql-
createtable.html#AEN67235)) which is coming in 9.1

------
rbanffy
I was fully expecting a very troll-ish article. Instead, I got a list of very
interesting features I agree would be very nice to have and an excellent
discussion here on HN.

I can sum it up with a smiley :-)

------
brown9-2
_"They would rather spend time improving the optimiser in future versions". OK
nice idealistic thinking. Meanwhile on planet Earth, I’m trying to convince
management that they need to redesign their app that hasn’t been touched in
years because of some gradual growth in the data that screwed up PostgreSQL
into thinking it should run the queries in a sub-optimal way._

Wouldn't re-ANALYZE-ing the tables in question help fix this problem, where
the statistics that Postgres has about the table access patterns no longer fit
reality?

~~~
lobster_johnson
That presumes that Postgres will always make good decisions based on up-to-
date statistics, which is unfortunately not true. The statistics are just a
sample across a subset of your data, collecting things like value frequencies,
for a single table. It does not track inter-tabular correlations, which means
that it can never truly plan joins optimally because it does not know at the
outset how many rows will match any given row in a join. I have hit many cases
where a query will use a nested loop and take 5-10 seconds to run where an
index scan would have taken just a few milliseconds. Some discussion here:
[http://archives.postgresql.org/message-
id/603c8f070902180908...](http://archives.postgresql.org/message-
id/603c8f070902180908j3ae46774g535d96ece2c90e74@mail.gmail.com)

------
agentultra
I think 9.0 might be addressing it, but the lack of built-in replication has
been something I've missed a fair bit.

Shipping WAL files over rsync seems a little.. hacky to me. Worse still,
Postgres has to load the entire WAL on the slave when it wakes (whereas with
Oracle you can query the slave before it has finished loading the WAL files
since it can pause and continue).

Disclosure: I'm not a DBA or specialist in this area. I know enough to pick
sensible architectures, optimize queries, and maintain good indexes.

~~~
amock
9.0 has asynchronous replication and 9.1 will have synchronous replication
[http://developer.postgresql.org/pgdocs/postgres/warm-
standby...](http://developer.postgresql.org/pgdocs/postgres/warm-
standby.html#SYNCHRONOUS-REPLICATION) .

------
cafard
How about packages in PgSQL? Package state can be so damn handy.

------
tszming
Seriously, get a decent bug tracker for the project.

------
jtchang
Wow databases have gotten a lot more complicated in the past few years.

