
Why Use Postgres - Part 2  - craigkerstiens
http://craigkerstiens.com/2012/05/07/why-postgres-part-2/
======
boyter
Having recently been bitten by a 20-hour schema change to a very large table
in MySQL, I have been looking at PostgreSQL more seriously.

My biggest issue, having used it for other projects, is getting the
damned thing set up and working to begin with. I could never find a decent
tutorial (or rather, one that fits my mindset) on how to do the following:

1. Install

2. Set up users, including how to set it up on a local development machine
with a 'root' user who can do anything (saves time in dev)

3. Import/export SQL/backup files

I managed to do all 3. Once.

I'm sure it's out there, and I would switch in a heartbeat were these steps
much easier to work out. As it is I can just copy the table, run the alter in
the background and rename the tables the next day. Slower, but doesn't cost me
any time and works.

~~~
moe
The bootstrap is not really hard when you've done it once, just jot it down
somewhere.

In my experience most people stumble because the network security in postgres
is pretty tight by default.

This is easily fixed and needs to be done only once.

First: Find your pg_hba.conf. It's in the database directory, which is often
symlinked from /etc/postgresql.

    
      # Back up the original
      $ cp pg_hba.conf pg_hba.conf_orig
    
      # Now replace it with our desired security settings.
      # pg_hba.conf is matched top-down (first match wins), so the
      # trusted localhost rules must come before the catch-all rule.
      $ cat >pg_hba.conf <<EOF
      # localhost and local socket are trusted
      local all all trust
      host  all all 127.0.0.1/32 trust
      # Require password auth from remote hosts
      host  all all 0.0.0.0/0 md5
      EOF
    
      # restart
      $ /etc/init.d/postgresql restart
    

From now on you can connect as _any_ user from localhost without a password.
Thus, we can now just go about our business.

    
    
      # connect as user postgres (the super-user)
      $ psql -U postgres
      postgres=# create database dummy;
      postgres=# create user bob with password 'pony';
      postgres=# grant all on database dummy to bob;
      postgres=# ^D
    
      # Now you have a database that user bob can use.
      # From remote he will have to use the password 'pony'.
      # From localhost no password needed because of our pg_hba settings above.
      $ psql -U bob dummy
      bob=> create table ...

~~~
justinjlynn
I would be wary about trusting all connections from localhost. Vulnerabilities
are many, getting a local unprivileged shell isn't exactly hard, and opening a
socket connection isn't exactly a privileged operation, so you could expose
yourself to a nasty bootstrap attack. It's probably better to trust only the
local unix domain socket, and to make it accessible by root only... if you
really need it.
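
That more restrictive setup would look something like this in pg_hba.conf (a
sketch; `peer` auth admits a local socket connection only when the connecting
OS user matches the database user, so super-user access to the database then
requires a shell as the postgres system user):

    
      # Local socket connections: OS username must match the database user
      local all all peer
      # Everything over TCP, including localhost, requires a password
      host  all all 0.0.0.0/0 md5
    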

~~~
moe
_I would be wary about trusting all connections from localhost._

The only place where this is an issue would be hosts with multiple _untrusted_
users, and these are becoming very rare.

 _getting a local unprivileged shell isn't exactly hard_

Sorry, that's nonsense. The entire internet relies on the fact that this _is_
relatively hard, unless you neglect basic security precautions.

~~~
justinjlynn
You make a good point. Conversely, part of basic security policy is to present
as little attack surface as possible, and ensuring that you're not trusting
local connections with root-equivalent access to your database is a good way
of doing that.

Modern GNU/Linux distributions have many thousands of packages installed, some
of them with fairly open Internet access. Could you put up a default-deny
firewall to prevent that? Yes. But in most systems there will always be holes
punched in those firewalls which allow attacks through, as those machines are
typically used to serve traffic and do actual work. If you are exploited
through a vulnerability in some random package you have installed, then you
/will/ have potentially multiple /untrusted/ and /hostile/ users on your
system. Now, assuming there isn't a local privilege escalation attack to which
your kernel is vulnerable, they'll be looking for services which treat local
sockets as trusted and attempt to bootstrap from there.

A vital concept of basic security is that one doesn't just prevent attacks
through secure defaults and front-line precautions (i.e. firewalls); one must
also contain and detect successful penetrations when they do occur. You must
have defence in depth, and not just a hard shell of security, because you can
never really tell which one of those protective safeguards will fail due to
bugs, human error, or technical incompetence.

~~~
moe
_Now, assuming there isn't a local privilege escalation attack to which your
kernel is vulnerable_

That's an invalid assumption (extremely unlikely).

 _You must have defence in depth, and not just a hard shell of security_

When you can't trust localhost anymore then your defenses have long failed.

~~~
justinjlynn
I believe the null hypothesis should be that you assume there are
vulnerabilities in those systems. Your assertion that it's extremely unlikely
that local privilege escalations exist is demonstrably false. In fact, a quick
query of the CVE database shows a large number of /known/ vulnerabilities (
<http://web.nvd.nist.gov/view/vuln/search-results?cves=true&#...> ).

Regarding the statement that if you can't trust localhost then your defences
have failed... well, duh. The point is to limit the damage by having multiple
layers of defence. If a script kiddie manages to use UltraPWN2000 to get an
unprivileged shell on your box, you don't want them to be able to just drop
all your tables if they're not smart enough to use a local privilege
escalation (assuming they can compile locally anyway).

You won't be able to prevent all damage from a skilled attacker, but you
should attempt to mitigate it and give yourself time to respond. You don't
trust your locked front door to protect your valuables; you put them in a
safe.

~~~
moe
_Your assertion that it's extremely unlikely that local privilege escalations
exist_

You misread me. I meant the opposite.

Your argument was based on the premise "assuming there isn't a local privilege
escalation attack to which your kernel is vulnerable". I said this premise is
invalid (as you just confirmed yourself).

 _You don't trust your locked front door to protect your valuables, you put
them in a safe._

The front door is your firewall. The safe is your host. When someone breaks
into your host, it's game over. If you have too much spare time you can
attempt to layer further at the host level, but that's usually an exercise in
futility.

~~~
justinjlynn
I can see that we don't fundamentally disagree on many issues. Attempting to
protect against all on-host attacks is pointless. Though I don't disagree with
you on the majority of what we've discussed, I will always stand up for simple
configuration changes that make it just that much harder for unauthorised
users to cause trouble. That said, those changes must not impose too many
onerous access requirements on authorised users, and I don't believe requiring
sudo access to the machine in order to get super-user access to the database
is one of them.

------
sheff
One of the other nice things about Postgres is the pretty much annual release
of a new version. 9.2 should apparently hit beta in the next week or so, which
will bring some cool new features:
<http://archives.postgresql.org/pgsql-hackers/2012-05/msg00344.php>

~~~
diminish
Are there any reasons not to use Postgres as a replacement for MySQL?

~~~
vidarh
My big ones would be: if you use multi-master replication or MySQL Cluster, or
if you depend on properties of one of the alternative storage engines.

~~~
jeltz
Hopefully PostgreSQL-XC will fix this lack of good multi-master replication
in a couple of years. PostgreSQL-XC 1.0 will be released soon.

------
gtaylor
I had no clue that Postgres supported LISTEN and NOTIFY. Granted, it's a new
addition, but this is really slick:

<http://www.postgresql.org/docs/9.1/static/sql-listen.html>
<http://www.postgresql.org/docs/9.1/static/sql-notify.html>

~~~
pilif
New addition?

listen and notify go back to at least 6.4, which came out around 1998:
<http://www.postgresql.org/docs/6.4/static/sql-listen.html>
<http://www.postgresql.org/docs/6.4/static/sql-notify.html>

What might have changed is the implementation, though. I seem to remember that
at some point in the past I was investigating listen/notify and the driver
(libpq) forced you to handle LISTEN by polling (calling PQnotifies
periodically), which doesn't really help compared to polling on your own.

This has not changed so far:

<http://www.postgresql.org/docs/9.1/static/sql-listen.html>

states

> With the libpq library, the application issues LISTEN as an ordinary SQL
> command, and then must periodically call the function PQnotifies to find out
> whether any notification events have been received.

It might be possible for drivers that work without libpq, talking the postgres
protocol directly on the wire, to get real asynchronous behavior, though I
don't know how this is being handled on the server right now (it might still
be polling internally on the server end).

~~~
wulczer
I'm not sure how it was long ago, but recent versions of libpq have NOTIFY
working asynchronously (you can stick the connection's file descriptor in a
select(2) and you'll get woken up when a notification is available).
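
That select(2) pattern (block on a file descriptor until the server has
something for you, instead of calling PQnotifies in a loop) can be illustrated
with a plain socket pair; this is just a sketch of the wake-up mechanism, with
the actual libpq/driver calls left out:

```python
import select
import socket
import threading
import time

# A pair of connected sockets standing in for the client/server connection.
server_end, client_end = socket.socketpair()

def notifier():
    # Simulates the server sending a NOTIFY some time later.
    time.sleep(0.1)
    server_end.sendall(b"payload")

t = threading.Thread(target=notifier)
t.start()

# Block until the fd is readable -- no periodic polling needed.
readable, _, _ = select.select([client_end], [], [], 5.0)
msg = client_end.recv(1024) if readable else None
t.join()
print(msg)  # b'payload'
```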

~~~
jeltz
Hmm, pretty certain nothing has changed recently there either. At least as far
back as 7.1 the documentation has remained the same about this.

<http://www.postgresql.org/docs/7.1/static/libpq-notify.html>

What has happened recently though is that more language bindings to libpq have
started exposing the NOTIFY functionality in more convenient ways.

------
aidos
Can anyone point me at a how-to guide for Postgres? I know my way around MySQL
and SQL Server pretty well, but Postgres is fairly new to me. I did one
project where I migrated several websites from one server to another, but I
kind of fumbled my way through it. I would love to see a getting-started guide
for Postgres aimed at those who are already familiar with MySQL.

~~~
RowanH
The official docs are fantastic. The newer PacktPub books are also pretty good
<http://www.packtpub.com/postgresql-90-high-performance/book> and
<http://www.packtpub.com/postgresql-9-admin-cookbook>

~~~
timc3
It should be said that the older books were never very good. I have read
probably all of them, and none come close to the documentation on the website.

------
scotty79
Has the slowness of SELECT COUNT(*) on large datasets been addressed in any
way? Last time I checked, the responses were "it's slow because of how we do
things here" and, if you want to count records fast, "count them yourself"
(not actual quotes from anyone, just a tl;dr of some answers I've seen).

~~~
fusiongyro
I kind of hate this "tl;dr" culture. I can explain the reason why this is
never going to change (though it may get better with index scans) but I expect
you'll just see a wall of text and decide not to read it. But if you never
read long answers, you have no right to expect solutions to your problems or
to understand anything. So I hope you'll read this, it may help.

PostgreSQL and most modern, ACID-compliant RDBMSes use MVCC instead of locks.
MVCC has a lot of upsides (chiefly, you almost never need table-level locking
for writes), but it means you have to look at the whole table to do a COUNT.
That's because MVCC is basically multiple parallel universes for the data:
what you see depends critically on which transaction you're in when you look.

For example, suppose you have a process that needs to know the number of users
in some user table. You start your transaction, and then another process
starts a transaction that goes through and deletes certain users. While this
is happening, you need the count. In MVCC, deletion doesn't delete; it just
marks rows with the transaction ID range in which they're visible. The rows
being deleted by the other process are marked dead as of that process's
transaction ID, but your transaction ID is below that one, so they're not dead
to you. Later on, once no live transaction remains that can still see those
rows, PostgreSQL actually expunges the data (VACUUM does this).

Because of MVCC, there is more than one legitimate answer to the question of
COUNT; potentially, one for each transaction in progress. The answer is
changing constantly. People tend not to think in these terms; we tend to
think, well, the table has so many rows in it, right? Why don't you just
increment a count when you write one? But that doesn't work, because with ACID
compliance, uncommitted transactions must not have visible effects until they
commit, and transactions in progress must not see anything change during their
operation. So there really are multiple right answers at any given moment. The
expectation that SELECT COUNT(*) will be fast is built on the assumption that
the database has some sort of master count of the rows in the table that it
can just glance at. But it doesn't, and to have one would require removing
MVCC altogether and using locks instead.
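
The "parallel universes" idea can be sketched in a toy model (illustration
only, nothing like actual Postgres internals): each row carries the ID of the
transaction that created it (xmin) and, once deleted, of the transaction that
deleted it (xmax), and a snapshot counts only the rows visible to it:

```python
# Toy MVCC visibility model: each row records the txn that created it
# (xmin) and, if deleted, the txn that deleted it (xmax).
rows = [
    {"name": "alice", "xmin": 1, "xmax": 5},     # deleted by txn 5
    {"name": "bob",   "xmin": 1, "xmax": None},  # still live
    {"name": "carol", "xmin": 7, "xmax": None},  # created by txn 7
]

def visible(row, snapshot_xid):
    """A snapshot sees rows created at or before it and not yet deleted."""
    created = row["xmin"] <= snapshot_xid
    deleted = row["xmax"] is not None and row["xmax"] <= snapshot_xid
    return created and not deleted

def count(snapshot_xid):
    # COUNT(*) must scan every row version and apply the visibility check.
    return sum(visible(r, snapshot_xid) for r in rows)

print(count(2), count(6), count(8))  # 2 1 2 -- three snapshots, three answers
```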

~~~
ataggart
I'm not sure I buy your argument. It is certainly true that with MVCC "there
is more than one legitimate answer to the question of COUNT", but that's true
for _any_ query, not just COUNT. Given that we can maintain transactional
correctness for the data, it's not at all clear to me why the same mechanisms
cannot be used to maintain transactional correctness for the _meta_ data.

~~~
ralph
I agree. Given that count(*) is a common need, you'd think there would be
multiple count(*) results in the metadata for the table, each with a
generation associated with it.

~~~
jeltz
I do not see how this could be implemented without introducing extra locking,
which would harm concurrency and remove some of the performance advantage of
having MVCC.

That said, it could be worth taking that penalty on specified tables, just
like you do for indexes. And some databases do support this in a more general
form: materialized views.

~~~
fusiongyro
I think you could fake this proposal with triggers. It might be worth seeing
what the additional cost is.
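
As a rough sketch of the trigger idea (SQLite is used here only because it's
self-contained; in Postgres the same thing would be PL/pgSQL trigger
functions, and note that the counter row becomes a contention point for every
concurrent writer; the table and column names are made up):

```python
import sqlite3

# Keep a per-table row count up to date with INSERT/DELETE triggers.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE row_counts (tbl TEXT PRIMARY KEY, n INTEGER);
INSERT INTO row_counts VALUES ('users', 0);
CREATE TRIGGER users_ins AFTER INSERT ON users
  BEGIN UPDATE row_counts SET n = n + 1 WHERE tbl = 'users'; END;
CREATE TRIGGER users_del AFTER DELETE ON users
  BEGIN UPDATE row_counts SET n = n - 1 WHERE tbl = 'users'; END;
""")
conn.executemany("INSERT INTO users (name) VALUES (?)",
                 [("alice",), ("bob",), ("carol",)])
conn.execute("DELETE FROM users WHERE name = 'alice'")

# Reading the count is now a single-row lookup instead of a table scan.
n = conn.execute("SELECT n FROM row_counts WHERE tbl = 'users'").fetchone()[0]
print(n)  # 2
```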

~~~
scotty79
A proper solution (not a fake with triggers) should be at most as costly as
having a partial index on the table.

