

PostgreSQL 9.1 released - wulczer
http://www.postgresql.org/about/news.1349

======
saurik
The only thing I feel is wrong with this (really: the /only thing/, which is
fricken awesome... I love PostgreSQL from the bottom of my heart, and have
been using it for almost all of my database needs since the late 90s) is "per-
column collation": collation is not a problem "per-column", it is a problem
"per-index", which means you really want it "per-operator class".

Here's the use case: you have a website, and you have users using it in
English and French. With per-column collation, the approach you are being
pushed toward is to have two fields, english_name and french_name, that have
/the same content/ but are defined with different collations, so that the
ordering on them becomes language-dependent.

The effect of that is actually terrible: it means that the size of your row
(and yes, this may end up in TOAST, but there is still a massive penalty to
going that route) becomes ginormous, and it only gets larger with every
language you want to support as a first-class citizen in your app.

Instead, what you /want/ is to just have an index english_ordered and an index
french_ordered, and you want to be able to select which index you use for any
specific query. If you "do it right", you'd also want to be able to support
ordering the data using German collation, but it would "just be irritatingly
slower".

Now, if you don't use PostgreSQL much, this may seem like a pipe dream of
extra standards and complex interactions ("how will you specify that?!",
etc.). However, it turns out there is already a feature that does 99% of this:
"operator classes", which is how PostgreSQL lets you define custom collations
for user-defined types.

Only, PostgreSQL operator classes are slightly more general than that, as you
can specify an operator class to be used when performing order operations for
your index; and, even more importantly: they are already being used to work
around a specific case of operator-specific collation.

Here's the example: let's say that your database is set up for UTF-8
collation, and yet you have this one field you want to do a "prefix-match" on:
WHERE a LIKE 'B%'. The problem with this is that you cannot use a Unicode
collation to index this search: it might be that 'B' and 'b' and even 'Q' all
index "exactly the same" for purposes of this collation (and there are some
other corner cases with the other mapping direction as well).

So, to get index performance for this field, without changing your entire
database to collate using "C" collation (which works out to "binary
ordering"), you have a few choices, with one of them being to create a index
that uses the special "operator class" called text_pattern_ops ("text" in this
case as the field is likely a "text" field: there is also varchar_pattern_ops,
etc.).

Once specified in your index, PostgreSQL knows to use it for the
aforementioned LIKE clause. You set this up when creating the index, by naming
the operator class after the column:

    
    
        CREATE INDEX my_index ON my_table (a text_pattern_ops);
    
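
With that in place, a prefix query like the one above should be able to use
the index; given a table big enough to make it worthwhile (and ANALYZEd),
EXPLAIN should show an index scan:

    
    
        EXPLAIN SELECT * FROM my_table WHERE a LIKE 'Bob%';
    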

The next piece of the puzzle is that an ORDER BY clause can take a USING
parameter to pass it a custom operator, and you can always (obviously) use a
custom operator for purposes of comparison. So, you should now be in a
position to do this (the operator classes and the <<< operator here are
hypothetical):

    
    
        CREATE INDEX my_index_en ON my_table (a english_collation_ops);
        CREATE INDEX my_index_fr ON my_table (a french_collation_ops);
        
        SELECT * FROM my_table
            WHERE english_collation_less(a, 'Bob')
            ORDER BY a USING <<<;  -- <<< being a hypothetical operator that
                                   -- compares using english_collation_less
    

So, really, the only thing missing is the ability to have "parameterized
operator classes": as in, we really need a "meta operator class" that itself
takes an argument, the string name of the collation, and returns an operator
class. With this one general technique defined, we not only drastically
increase PostgreSQL's user-defined type abilities, but we also better solve
this whole class of collation problem.
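
To make this concrete, here is a strawman of what it could look like (to be
clear: none of this syntax exists; it is purely hypothetical):

    
    
        -- hypothetical: an operator class template that takes the collation
        -- name as a parameter and yields a concrete operator class
        CREATE INDEX my_index_en ON my_table (a collation_ops('en_US'));
        CREATE INDEX my_index_fr ON my_table (a collation_ops('fr_FR'));
    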

(Unfortunately, I suck at e-mail, or I'd get on the PostgreSQL mailing list
and try to argue for this in a more well-defined way; maybe someone else who
cares will eventually see it and become this feature's champion; or, of
course, come up with an even better solution than mine ;P.)

~~~
jvdongen
For my multi-lingual content needs I work with a main table which holds only
the non-translatable data items and a subtable which contains the translatable
content, one row per language per row of the main table. Then I do a join to
get a single full 'object' in the language of choice. And per the postgres
docs I can choose the collation at query time. Thus the following query should
do the trick, if I'm correct:

    
    
        SELECT * FROM main_table JOIN sub_table ON main_id = sub_id
            WHERE sub_table.language = 'fr_FR'
            ORDER BY my_translatable_col COLLATE "fr_FR.utf8";
    

No bloating tables, just a join and correct collation.
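
For concreteness, a minimal sketch of the layout I mean (column names made
up):

    
    
        CREATE TABLE main_table (
            main_id serial PRIMARY KEY
            -- ... non-translatable columns ...
        );
        
        CREATE TABLE sub_table (
            sub_id integer REFERENCES main_table (main_id),
            language text,
            my_translatable_col text,
            PRIMARY KEY (sub_id, language)
        );
    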

~~~
saurik
Most content is not "translated", it just needs to be differently collated. If
you have a company directory, you have the names of every user in the company
in a table, and you need to display that information in a collation based on
the locale of the viewing user. Having to store, for every language, another
row containing the same data as the main table is just pointless overhead.
Also, while you can choose the collation at query time, the point of per-
column collation is to let you have an index over that collation. Can you
please demonstrate how the use case of "company directory" would be cleanly
and efficiently implemented using per-column collation?

EDIT: I've been looking more into this COLLATE keyword that they have added,
and I'm actually somewhat curious to see if I can make this work (where the
optimizer manages to choose the right index) by something like the following,
in which case I'm going to be seriously happy... ;P.

    
    
        CREATE TABLE my_table (a text);
        CREATE INDEX my_index_en ON my_table ((a COLLATE "en_US"));
        CREATE INDEX my_index_de ON my_table ((a COLLATE "de_DE"));
        SELECT * FROM my_table ORDER BY a COLLATE "de_DE";
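
If the planner does pick the right index, EXPLAIN on that last query should
show a scan of the "de_DE" index instead of a separate sort step:

    
    
        EXPLAIN SELECT * FROM my_table ORDER BY a COLLATE "de_DE";
    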

~~~
simcop2387
Simply out of curiosity for this same topic, do you happen to know of a good
resource for finding out even just some of the less trivial differences that
this solves? I'm sure there are some, but offhand I don't know them (I'm not all that
multilingual).

I understand it'll bring in glyph orderings that don't exist in en_US or
whatever you've got the default set to, such as 'Ç' in French, among others.

~~~
saurik
I do not have a good resource, however, I know a few off the top of my head:
1) characters with modifiers, like umlauts, sometimes collate the same, and
sometimes collate differently; 2) multiple characters may collate as a single
character, such as "ll" (I just did a search to verify that this was the case
in Spanish, and found the Collation page on Wikipedia, which you might find
interesting); and 3) different locales may choose to collate numbers using
different algorithms (in English we usually expect "1,000" to sort after
"200", but if "," is a decimal point, then you might not).

------
pilif
(Mostly) like clockwork: a new year, a new release. And like every year
before, we find a beautiful collection of new stuff to play with.

Even better: it's looking as if the next release of Ubuntu will get 9.1
packaged, which spares me from manually packaging or using a PPA this time
around.

The new features each release introduces are too sweet to skip just because a
distribution is lagging. And ever since I began using PostgreSQL back in the
7.1 days I have _never_ experienced a bug that really affected me. No byte of
data has ever been lost, and not a single time did it crash on me due to
circumstances beyond my control ( _cough_ free disk space _cough_ ).

Congratulations to everybody responsible for yet another awesome release!

Yes. I am a fanboy. Sorry.

------
rbranson
For a trivial, synthetic write benchmark that I usually use to evaluate
hardware and/or config changes, I'm seeing slight slowdowns for non-concurrent
loads and solid improvements for concurrent loads.

For 1-2 clients, I'm seeing ~8% slower.

For 4 clients, 7.2% faster.

8 clients, 15% faster.

16 clients, 16.4% faster.

32 clients, 11% faster.

64 clients, 10% faster. Aggregate performance starts to level off here, so I
stopped.

These are just cycling super simple INSERTs/DELETEs against the same table;
the column data is a 1K string, a 100-byte string, and a concatenation of the
pid and the current iteration count. No indexes or primary keys. Each client
is just a fork that performs 10,000 INSERTs, then 10,000 DELETEs, in a loop of
10,000 iterations.
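
For reference, the table and statements are roughly of this shape (names here
are illustrative):

    
    
        CREATE TABLE bench (big text, small text, tag text);  -- no indexes/PKs
        
        -- each client loops over pairs like:
        INSERT INTO bench
            VALUES (repeat('x', 1024), repeat('y', 100), '4242-17');
        DELETE FROM bench WHERE tag = '4242-17';
    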

For the record, that's around 6,729 writes per second with 32 clients. If I
set synchronous_commit = OFF in each client before running the benchmark, it's
27,157/sec. Then, if I reduce the first column size to 100 bytes, it's
50,592/sec. Impressive. I'm sure the synchronous_commit improvement would be
much more drastic on disks without BBU write caches.
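
(synchronous_commit is a per-session setting, so each client just flips it
before the loop; with it off, commits return before the WAL is flushed, so a
crash can lose the last few transactions but cannot corrupt the database:)

    
    
        SET synchronous_commit = off;
    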

Database server is a 4-core Nehalem-based Xeon with 16GB RAM and a SAS disk
array. PostgreSQL configuration has been decently tuned and full write
durability is retained all the way down to the disks.

------
ConceitedCode
I recently switched to PostgreSQL from MySQL and I couldn't be happier about
it.

~~~
pbreit
It seems that in general this decision should be made more frequently than it
is, at least for new projects. Why don't more people select Postgres? Is it
comfort level with MySQL? Tools and extensions?

~~~
goodside
I have an unusual reason, but a strong one: I primarily use relational DBs for
large-scale analysis of frozen snapshots of data, rather than transactional
loads. PostgreSQL has MVCC features built into it at the most fundamental
levels where they cannot be disabled, and as such it's not suitable for the
types of queries that I frequently run against MySQL. In the most extreme (and
trivial) case, you can't run a "SELECT COUNT(*)" on a table in PostgreSQL
without a full sequential scan of your data, which can be a huge expense when
your row counts are in the tens or hundreds of millions. MySQL, in contrast,
can return a cached answer instantly, which isn't possible under PostgreSQL.
Less trivially, MySQL is still drastically faster for aggregates on low-
cardinality fields.

Yes, I know you could set something up with triggers, but the point is that it
needs to be simple enough for frequent, ad-hoc usage. Usually the time I need
to know how big a table is comes right after I've made it, and I might well
drop it five minutes later. I'm not going to set up an elaborate network of
meta-data
tables when MySQL will just do it for me for free.
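
(For reference, the kind of trigger setup I mean, sketched with made-up
names:)

    
    
        CREATE TABLE row_counts (tbl text PRIMARY KEY, n bigint NOT NULL);
        INSERT INTO row_counts VALUES ('my_table', 0);
        
        CREATE FUNCTION track_count() RETURNS trigger AS $$
        BEGIN
            IF TG_OP = 'INSERT' THEN
                UPDATE row_counts SET n = n + 1 WHERE tbl = TG_TABLE_NAME;
            ELSE  -- DELETE
                UPDATE row_counts SET n = n - 1 WHERE tbl = TG_TABLE_NAME;
            END IF;
            RETURN NULL;  -- return value is ignored for AFTER triggers
        END;
        $$ LANGUAGE plpgsql;
        
        CREATE TRIGGER my_table_count AFTER INSERT OR DELETE ON my_table
            FOR EACH ROW EXECUTE PROCEDURE track_count();
    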

Not trying to hate on Postgres, by the way. I fully get that it's superior in
most regards, and I use it frequently just for the much stronger support of
user-defined functions. But MySQL does still have a few tricks left in it.

~~~
jeltz
If I remember correctly, it is only with MyISAM that you get an instant
count(*), while InnoDB, just like PostgreSQL, has to look at every row of the
table (or every row of an index, since InnoDB also supports index-only
scans). This is due to
both PostgreSQL and InnoDB being MVCC (multi-version concurrency control)
database engines.

And MyISAM is only fast in very specialized workloads with low write and
read/write concurrency.

------
hyperrail
Finally! True serializability! Never let anyone (including the docs for older
postgres versions) tell you that predicate locking is too hard to implement.
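
The classic case this now handles is write skew (schema here is illustrative):
two transactions each check an invariant, then make a write that is fine on
its own but not in combination. Under 9.1's serializable isolation, one of
them aborts with a serialization failure (SQLSTATE 40001) and can simply be
retried:

    
    
        BEGIN ISOLATION LEVEL SERIALIZABLE;
        SELECT count(*) FROM doctors WHERE on_call;  -- check "someone on call"
        UPDATE doctors SET on_call = false WHERE name = 'alice';
        COMMIT;  -- may fail with "could not serialize access..."
    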

~~~
clarkevans
Kevin Grittner will be talking on Serialization this Friday, September 16th,
11:30 a.m. – 12:30 p.m. His talk will be recorded... so those who aren't in
Chicago can see/hear it later next week.
<http://postgresopen.org/2011/schedule/presentations/61/>

------
socratic
Should I as a web app developer at a startup be looking for an RDBMS beyond
PostgreSQL, probably a commercial one?

I have somewhat of a database background, so I see the obvious advantages of
PostgreSQL over MySQL. In particular, things like more procedural language
support, a better query optimizer, better concurrency control, etc. (Though
things like Amazon RDS are compelling from a deployment perspective.)

I fundamentally believe that using an RDBMS, rather than a NoSQL data store is
the right approach for rapid development of web apps. (Though, I primarily
mean this as an attack on, e.g., MongoDB, since I think Redis is great, just
not a replacement for an RDBMS.)

However, I have almost no experience with the more advanced end of the RDBMS
spectrum, primarily because they tend to cost money (for the real, non-free
versions). Should I be learning/looking at DB2? Should I be learning/looking
at Oracle?

Or do the additional features of these more advanced RDBMS options require
such specialized scenarios (a bank, a big enterprise) or such specialized
hardware (weird clustered setups) that MySQL/PostgreSQL will always be just as
good?

~~~
wulczer
A short, probably biased and not 100% precise answer would be: no, you don't
need to look for a commercial RDBMS.

The real answer is: always evaluate options. Make sure the solution you choose
supports everything you need and that you will be able to learn how to use it
or hire people who can. Without knowing what your requirements are it's
impossible to say, but in the vast majority of cases, PostgreSQL will be as
good as Oracle, MS SQL Server or DB2.

If your evaluation indicates that you need to pay for one of these, double and
triple check it... and only if you're 100% sure that's the case, shell out for
a commercial RDBMS.

~~~
socratic
I think that's probably true, though I do wonder from a purely engineering
standpoint what features Oracle/DB2/MS have at this point that PostgreSQL does
not. Special index types? Query hints? Suggestions for physical layout on
table creation?

~~~
megaman821
Materialized views and query hints are the big ones that PostgreSQL is
lacking.

~~~
simcop2387
Query hints are not likely to happen, at least not in the way they exist in
other databases. There have been a number of discussions about it and you can
see
some of the aftermath at
<http://wiki.postgresql.org/wiki/OptimizerHintsDiscussion>

As for Materialized Views, there are some options currently, though it looks
like native support would be a lot nicer:
[http://tech.jonathangardner.net/wiki/PostgreSQL/Materialized...](http://tech.jonathangardner.net/wiki/PostgreSQL/Materialized_Views)
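
The simplest ("snapshot") flavor described there is just a table you rebuild
on demand, something like this sketch (names are illustrative):

    
    
        -- build the "materialized view" as a plain table
        CREATE TABLE sales_summary AS
            SELECT day, sum(amount) AS total FROM sales GROUP BY day;
        
        -- refreshing means rebuilding the contents
        BEGIN;
        TRUNCATE sales_summary;
        INSERT INTO sales_summary
            SELECT day, sum(amount) AS total FROM sales GROUP BY day;
        COMMIT;
    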

------
clarkevans
If you happen to be in or near Chicago this week, be sure to come to
PostgreSQL Open, running Wednesday to Friday (Sept 14th-16th).

<http://postgresopen.org/2011/>

------
pvh
Congrats to the team for a really exciting release. We can't wait to ship it
here at Heroku.

------
jgavris
KNN indexing!

~~~
samstokes
Can anyone give an example use case for this? I'm not sure I fully understand
what it means.

~~~
netghost
In spatial queries, find the 5 restaurants closest to me. In full text search,
find the 5 words closest to "fuschia" so you can spell check, or try to search
for what you _think_ the user meant to type.
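
For the spatial case, a GiST index can now satisfy the distance ordering
directly instead of sorting every row (schema here is illustrative):

    
    
        CREATE TABLE restaurants (name text, location point);
        CREATE INDEX restaurants_location_idx
            ON restaurants USING gist (location);
        
        SELECT name FROM restaurants
            ORDER BY location <-> point '(41.88,-87.63)'  -- <-> is distance
            LIMIT 5;
    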

