
Why Postgres - jashmenn
http://www.craigkerstiens.com/2012/04/30/why-postgres/
======
davidw
I love Postgres, and I usually get grumpy when forced to use MySQL, but ...
rather than lots of fluffy features (most of the stuff listed here is not that
compelling taken individually), _the_ thing for me about Postgres is that it's
solid, reliable, and dependable. It behaves correctly. It has transactions. It
has DDL transactions. It can, by and large, be counted on to get things right,
rather than take shortcuts. It's a well-crafted tool that you can get a lot of
mileage out of if you know what you're doing.

~~~
pjscott
In other words, it is what you find yourself sorely wanting when you use MySQL
for a non-trivial length of time.

~~~
davidw
Well, yeah, that or some inanimate object to vent my frustrations on by
kicking it.

------
23david
Interesting articles. I read part I and part II.

As a long-term MySQL user who has dabbled with production Postgres machines, I
still don't see how Postgres really has any huge advantages that would cause
me to want to run it for any new projects and deal with a new and possibly
daunting learning curve. Postgres seems to do some things better, perhaps, and
in general seems to have more flexibility than MySQL. But it also seems to
lack focus as an end-to-end DB solution.

The clustering / replication complexity of Postgres is a huge issue for me,
since scaling databases is not a trivial thing in the best of cases. With
postgres, there seem to be 10+ different cluster / replication systems (both
open-source and commercial?) and it's just a mess.

Giving people tons of options is great, but ease of use for developers and an
amazing & simple out of box experience is so important. MySQL won the hearts
and minds of developers this way... and now MongoDB is doing the same thing by
improving on what MySQL was able to accomplish. The Postgres team should
consider implementing something similar.

~~~
twerquie
By choosing Pg over MySQL or Mongo, you are selecting a high-quality, stable
product that is on a very slow and steady development trajectory with very few
missteps along the way.

If you don't want to be surprised by the interesting and unexpected ways your
database functions, choose Postgres.

~~~
taligent
> slow and steady development trajectory

Slow being the operative word. Sharding and JSON support should have been
added as basic features years ago.

Now if you need either (very common) PostgreSQL is a non-starter.

~~~
rosser
Patently false.

Skype was supporting tens of millions of _concurrent_ active users on a
sharded PostgreSQL setup _years_ ago (like, 2006-07 or so). Yes, the support
wasn't native, and they had to write PgBouncer and PL/Proxy to enable that
kind of scalability, but they Open Sourced both projects, and they're pretty
widely used in many environments for exactly that purpose.

As far as JSON goes, the HStore extension has been available since 8.3, which
was released in _2008_. Again, not native (though that's being addressed in
9.2, which should drop any time now), but not particularly difficult to use. A
trivial web search shows people using JSON with HStore at least as far back as
2010, if not earlier.
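
For illustration, a minimal sketch of what HStore usage looks like (the table
and keys are hypothetical; the CREATE EXTENSION syntax is from 9.1+ -- on
8.3-9.0 you'd run the bundled SQL install script instead):

```sql
-- Load the hstore contrib module (9.1+ syntax)
CREATE EXTENSION hstore;

-- A table with a schemaless key/value column
CREATE TABLE docs (id serial PRIMARY KEY, attrs hstore);
INSERT INTO docs (attrs) VALUES ('color => "blue", size => "large"');

-- "->" looks up a value by key; "?" tests whether a key exists
SELECT attrs -> 'color' FROM docs WHERE attrs ? 'size';
```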

~~~
taligent
Not false.

Skype added sharding support themselves. And it requires you to use stored
procedures instead of SQL making it unusable for most users.

And I am talking about native JSON support as a first class citizen. Nobody is
going to lock their entire data structure to a third party extension that may
or may not disappear.

~~~
jeffdavis
"And I am talking about native JSON support as a first class citizen."

Wait a week for 9.2, please.

------
dude_abides
Another cool feature: recursive queries using WITH common table expressions.

Say you have a table fs <id, name, parent_id> representing a hierarchical
filesystem and you want to print the full path of each file. Here's how you
can do it in a single query:

      WITH RECURSIVE path(id, name, parent_id, path, parent) AS (
        SELECT id, name, parent_id, '/', NULL FROM fs WHERE id = 1 -- base case
        UNION
        SELECT fs.id, fs.name, fs.parent_id, parentpath.path ||
          CASE parentpath.path WHEN '/' THEN '' ELSE '/' END ||
          fs.name AS path, parentpath.path AS parent
          FROM fs INNER JOIN path AS parentpath ON fs.parent_id = parentpath.id
      ) SELECT id, name, path FROM path;
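
For anyone who wants to try it, here's a hypothetical bit of sample data
(table layout from above; the rows are invented for illustration):

```sql
-- Hypothetical sample rows for the fs table above
CREATE TABLE fs (
  id        integer PRIMARY KEY,
  name      text NOT NULL,
  parent_id integer REFERENCES fs(id)
);

INSERT INTO fs VALUES
  (1, '/',    NULL),  -- root (the query's base case starts at id = 1)
  (2, 'etc',  1),
  (3, 'motd', 2);

-- Against this data the recursive query builds paths such as
-- '/', '/etc' and '/etc/motd'.
```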

~~~
aidos
I don't know the details but won't that be quite inefficient? (it's not a
construct I've seen before - never used Postgres in anger)

You're effectively working with a tree and there are much more relational
friendly ways of doing that in SQL.

I know I'm just picking on this specific use but I can't help imagining that
recursive querying will always be slow. Would love to hear how it's
implemented if that's not the case.

~~~
dude_abides
All that WITH RECURSIVE does is allow a WITH query to refer to its own output.
Beyond that there is no overhead.

Here is the query plan I got for my above query:

      QUERY PLAN
      CTE Scan on path  (cost=292.79..304.41 rows=581 width=104)
        CTE path
          ->  Recursive Union  (cost=0.00..292.79 rows=581 width=72)
              ->  Index Scan using fs_pkey on fs  (cost=0.00..8.27 rows=1 width=40)
                    Index Cond: (id = 1)
              ->  Hash Join  (cost=0.33..27.29 rows=58 width=72)
                    Hash Cond: (public.fs.parent_id = parentpath.id)
                    ->  Seq Scan on fs  (cost=0.00..21.60 rows=1160 width=40)
                    ->  Hash  (cost=0.20..0.20 rows=10 width=36)
                          ->  WorkTable Scan on path parentpath  (cost=0.00..0.20 rows=10 width=36)

~~~
aidos
But that could end up being quite deep couldn't it? Is it not like stacking up
an unknown number of correlated queries? Could you even screw it up and have
an infinite joining condition?
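
(A depth cap is one common guard against exactly that: a cyclic parent chain
would make the recursion run forever, since the growing path column keeps the
rows distinct even under UNION's duplicate elimination. A hypothetical sketch:)

```sql
-- Cap recursion depth so a cycle in fs can't loop forever (the limit is arbitrary)
WITH RECURSIVE path(id, depth) AS (
  SELECT id, 0 FROM fs WHERE id = 1
  UNION ALL
  SELECT fs.id, path.depth + 1
    FROM fs JOIN path ON fs.parent_id = path.id
   WHERE path.depth < 100
) SELECT id, depth FROM path;
```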

------
thaumaturgy
Could someone that's really familiar with Postgres give me the skinny on
replication? I like Postgres a lot, but finally moved all of our data storage
to MySQL because, according to the documentation I read just a couple months
ago, Postgres couldn't really do multi-master replication.

This post, dated prior to the documentation I read before the switch, seems to
suggest it would do it just fine as long as it was asynchronous. I'm not clear
though on just how ugly its async replication might be
([http://www.postgresql.org/docs/8.4/static/high-availability.html](http://www.postgresql.org/docs/8.4/static/high-availability.html)).

I'd like to use dbmail to store mail in a database replicated across multiple
servers (which would need fairly reliable replication), as well as syslog and
other logging facilities to SQL (where reliability isn't quite so big of a
deal).

Is Postgres up to this now, or not?

~~~
saurik
You should use the "this page in other versions" mechanism at the top of the
page you linked to so you can see more recent documentation: most of what
PostgreSQL has with regards to replication is from versions more recent than
8.4 (the version you were reading).

[http://www.postgresql.org/docs/9.2/static/high-availability.html](http://www.postgresql.org/docs/9.2/static/high-availability.html)

(I'm curious why you want to do replication for dbmail: it would seem like
what you really need/want is partitioning; e-mail metadata, especially with
effective indexes, requires a lot of write throughput, and synchronous
replication is going to largely cause the same write load on all systems.)

~~~
thaumaturgy
Thanks, I hadn't even noticed that link at the top.

(To answer your other question: at the moment our mail load isn't severe
enough to pose a problem for synchronous replication, and I'm just aiming for
the ability to have multiple mail hosts share the same mail data with no
single point of failure. As the mail load outgrows current infrastructure,
I'll start partitioning -- although I haven't figured out how to do that just
yet. I've considered a distributed file system, but HAMMER has only just
recently started to look like it's up to the task.)

~~~
saurik
You can accomplish that with synchronous multi-slave replication with
failover: as far as I can work out in my head right now, the result will be
similar to the behavior you will get from a synchronous multi-master setup
assuming you can work things out so the slave can be used for read-only
queries.
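
For concreteness, a sketch of the primary-side settings for that kind of
synchronous streaming replication (9.1/9.2-era parameter names; the standby
name is hypothetical):

```
# primary's postgresql.conf (sketch)
wal_level = hot_standby                  # generate WAL a hot standby can replay
max_wal_senders = 2                      # allow replication connections
synchronous_standby_names = 'standby1'   # commits wait for this standby's ack
```

The standby can then serve read-only queries while staying in sync with the
primary.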

~~~
thaumaturgy
I don't think I can do that without writing my own mail daemons for pop and
imap, unfortunately, which I'd rather not do at this stage. All of the ones
that I know of expect to be able to read/write metadata over the same
connection to the same database or filesystem.

Failover setups aren't my favorite option. They seem to be easy to get wrong,
and when you get them wrong, they only do wrong things at the exact moment
that you most need them to be doing right things.

~~~
saurik
So, if you are doing synchronous multi-master you will also need to do
explicit failover (in this case, not from server A to server B, but from
server A/B to only server B): otherwise, a partition between the servers would
be catastrophic to your data integrity (as there would be no way for the
servers to know whether they should start accepting data that might be
different from its buddy, as both think the other is offline).

Once you start thinking in terms of multiple servers (whether it be based on
replication or partitioning) you have to start thinking about these kinds of
complex corner case issues, as you have moved from working with "a server" to
"a distributed system", with all of the associated theoretical limits (such as
CAP).

~~~
thaumaturgy
You're right, although the MySQL binary log used for replication handles this
fairly gracefully for reasonable outage periods. Collisions are still possible
after a resync, but it tries pretty hard to resolve them using timestamps on
the transactions.

I'm working on some software to automatically manage outages, deploying new
server instances, re-syncing databases, etc., but that's quite a few steps
away from where we're at right now. For near-term purposes, anything that
could do as good of a job at multi-master replication as MySQL can would be
just fine.

~~~
saurik
I can then only wish you luck in your attempt to take an off-the-shelf system
(dbmail) that was designed to be used with an ACID database and plop it on top
of what is now an only eventually consistent data store without first
rewriting it to tolerate those semantics ;P.

------
craigkerstiens
Original author here. After the original post there was so much great feedback
that it was worthwhile to publish a part two -
<http://craigkerstiens.com/2012/05/07/why-postgres-part-2/>

Additionally there will probably be a third post when Postgres 9.2 releases
soon highlighting some of the great new features such as the JSON datatype.

------
jpdoctor
I'd add "business risk" into the discussion. Mysql has an interesting road
ahead of it, and there are certain people that just won't put open-source
Oracle in their critical path.

~~~
falcolas
Fortunately, there are at least two drop-in replacements if Oracle should drop
the ball: Percona Server and MariaDB. Both are based on the well-vetted core
of MySQL, and both are continuing to make huge strides in the capabilities and
stability of MySQL.

Of course, Oracle hasn't dropped the ball yet; 5.5 was a good, solid release
from them.

------
hopeless
A few weeks ago I discovered Postgres schemas (worst-named feature ever --
nothing to do with SQL schema definitions). Basically, schemas allow you to
have multiple databases within a single database. So, you only need to manage
(or host/pay for) a single database, but each company can have its own schema,
which gives you complete separation of data across all the tables.

These, together with the Apartment gem for Rails, make building isolated
multi-tenant applications really easy. It also makes migrating to a true multi-
database/multi-server/sharded setup much simpler.
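
A minimal sketch of how the separation works (the tenant names are
hypothetical):

```sql
-- One schema (namespace) per tenant, inside a single database
CREATE SCHEMA acme;
CREATE SCHEMA globex;
CREATE TABLE acme.users   (id serial PRIMARY KEY, email text);
CREATE TABLE globex.users (id serial PRIMARY KEY, email text);

-- Per connection, point unqualified table names at one tenant's schema
SET search_path TO acme;
SELECT count(*) FROM users;  -- resolves to acme.users
```

Tools like the Apartment gem do essentially this search_path switching per
request.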

~~~
sitharus
If you were using Oracle or MS SQL you'd be quite familiar with what Postgres
calls a schema.

The SQL standard as I understand it is very vague on how schemas should be
implemented beyond the information schema.

------
tesmar2
I don't value the underlying database as much anymore since I have been
working with Rails apps. The database is just an API. I want to write my code
in such a way that I could unplug Postgres and plug in SQL Server tomorrow
without skipping a beat. Using Arel you should largely be able to do that.

~~~
emmett
1. No app of significant size will actually allow this. If you've ever tried
to switch over a large data set with a high query volume you'll know that it
never "just works".

2. Just because there's an API behind it, doesn't mean that it doesn't matter
which one you use. As a trivial example, Linux and OSX are both POSIX but it's
hard to make an argument it doesn't matter which you use.

~~~
tesmar2
I bet after you switched your data over you wrote the conflicting SQL lines in
a much more generic way.

~~~
emmett
No, the issue is not conflicting SQL. An ORM solves that problem extremely
easily.

It's the fact that query planners and performance optimizations implemented by
each database are different, and thus directly porting your current schema and
queries doesn't work.

------
pbreit
Is Oracle's ownership of MySQL becoming an issue? I realize it's open source,
but still it seems like things have begun to deteriorate slightly. Is that
causing any alarm?

~~~
cagenut
Yes and no. Oddly enough, Oracle has owned the most important part of mysql
(innodb) for years, so them buying the rest of it was almost a benefit to have
it all under one roof. They then went on to put out a very good release (5.5)
that didn't suffer from the "don't use it till its in the teens" birthing
pains that most previous releases did.

The thing is, most of the improvements in that release were really the work of
a bunch of companies that had given up getting their patches accepted upstream
and were trading around and maintaining their own patchsets. "The Google
patches", "The Facebook patches", and "The Percona patches", etc. Mysql/Oracle
brought their long stagnant release up to where the rest of the community had
already gotten.

The nice thing is that means it's already been proven that if they go dark and
just squat on a stale code base for years, it's not actually going to hold
anything back.

The two things they have done are raise prices, and stop providing source code
for new unit tests. Neither is a good thing, but they're pretty trivial in the
overall scheme of things.

~~~
army
That isn't entirely fair to the InnoDB/MySQL developers at Oracle - from what
I've heard there has in general been an increase in development activity
within Oracle. Some of that is merging patches from downstream, but most of
the work has been on other improvements.

Performance in 5.6, for example, has improved a lot even over patched versions
- e.g. [https://www.facebook.com/notes/mysql-at-facebook/can-innodb-do-100k-iops/10150907613720933](https://www.facebook.com/notes/mysql-at-facebook/can-innodb-do-100k-iops/10150907613720933)

------
jbarham
One thing I appreciate about using Django and its ORM is that it hardly
matters what database I use.

If I'm writing SQL by hand I much prefer Postgres to MySQL since it is more
standards conforming and has fewer random quirks. But given that Django
generates the SQL for me, which database to use is largely a matter of ease of
deployment, and on AWS that's MySQL (via RDS) so that's what I use.

------
nigma
Here's a link to the previous discussion
<http://news.ycombinator.com/item?id=3910743>

------
heretohelp
To answer his question: yes, Virginia, we (people that use MySQL) use
replication a lot. Especially at mid-grade levels of scaling, where custom
sharding and replication middleware hasn't become necessary yet but vertical
scaling is out of the question.

I'd say the bigger issue with Postgres now is that middleware like
<http://code.google.com/p/vitess/> for it hasn't been deployed in the large
yet. Most Postgres scaling anecdotes I've heard were:

"Well we put 64 gb of ram in the server and installed a RAID array of SSDs and
stopped writing dumb unindexed queries."

Well that's just dandy, but what if my indices don't even fit in the ram of a
single machine?

~~~
twerquie
> I'd say the bigger issue with Postgres now is that middleware like
> <http://code.google.com/p/vitess/> for it hasn't been deployed in the large
> yet.

That simply isn't true. Take a look at PgPool
(<http://www.pgpool.net/mediawiki/index.php/Main_Page>) - it provides many of
the same core features as Vitess, goes beyond to provide replication and
flexible sharding/partitioning and is a good number of years more mature than
Vitess.

PL/Proxy is similar middleware (though implemented as stored procedures)
developed by and used at Skype for massive Pg partitioning.

There are dozens more packages like this at Pgfoundry
(<http://pgfoundry.org/>). What specifically do you feel is missing?

~~~
heretohelp
>What specifically do you feel is missing?

A very large deployment of it that has shaken out the bugs.

~~~
dazzawazza
You get the feeling that Postgres needs some marketing. It's a great piece of
software but if people can't easily find out if it's been used at massive
scale they aren't going to trust it.

Like it or not, most deployments of MySQL probably happen because people can
easily hear about it being used for massive databases. The decision isn't
based on technical merits (or lack thereof).

