
Is PostgreSQL good enough? - richardboegli
http://renesd.blogspot.com/2017/02/is-postgresql-good-enough.html
======
brightball
I've been attempting to preach the PostgreSQL Gospel
([http://www.brightball.com/articles/why-should-you-learn-
post...](http://www.brightball.com/articles/why-should-you-learn-postgresql))
for a few years now about this exact same thing.

When you look at your database as a dumb datastore, you're really selling
short all of the capabilities that are in your database. PG is basically a
stack in a box.

When I started getting into Elixir and Phoenix and realized that the entire
Elixir/Erlang stack was also basically a full stack on its own...and that by
default Phoenix wants to use PG as its database...I may have gone a little
overboard with excitement.

If you build with Elixir and PostgreSQL you've addressed almost every need
that most projects can have with minimal complexity.

~~~
josho
This reminds me of the days of J2EE containers where _we_ tried to write
applications that never took advantage of vendor specific features so that our
applications could be portable across application servers and databases. It
was extremely rare to swap out from IBM to Oracle or vice versa but
conceptually we could.

The upside was that when a vendor's licensing fees became prohibitive we could
conceptually swap platforms. The downside was, well, plenty: dealing with
generic errors, building your own versions of services the platform stack
already provided.

So, I love that we now have an OSS stack that folks are taking full advantage
of rather than trying to abstract away the platform.

~~~
brightball
Totally agree with you there. I worked for a telecom right out of school and
they hired a couple of consultants to implement some horrible Java single sign
on system (I think it was called JOSSO?) backed by our Oracle database. The
system was really buggy and as we were navigating the issues with the Java
code, we ended up porting most of the core functionality to Oracle Stored
Procedures.

Eventually, the Java code continued giving us so much trouble that I mouthed
off to our VP that I could rewrite this entire system in PHP in one night. He
said, "Okay, do it."

So we did it...and because the core functionality was in stored procedures it
was really easy to switch from Java to something else. The system never had a
single issue the next 2 years that I was there after the move either.

Opting not to use vendor-specific features just lets you choose where you want
to be locked in harder.

------
yorhel
There was a similar talk[1] at FOSDEM, where the speaker describes how, as an
experiment, he replaced a full ELK stack plus other monitoring tools with
PostgreSQL. He even goes as far as implementing a minimal logstash equivalent
(i.e. log parsing) in the database itself.

It wasn't a "we do this at scale" talk, but I'd love to see more experiments
like it.

For the impatient: Skip to 17 minutes into the video, where he describes the
previous architecture and what parts are replaced with Postgres.

1.
[https://fosdem.org/2017/schedule/event/postgresql_infrastruc...](https://fosdem.org/2017/schedule/event/postgresql_infrastructure_monitoring/)

~~~
dozzie
> It wasn't an "we do this at scale" talk, but I'd love to see more
> experiments like it.

Well, I will be conducting such an experiment in the near future. From the ELK
stack, I never used Logstash in the first place; I used Fluentd instead (and
now I'm using a mixture of my own data forwarder and Fluentd as a hub). I'm
planning mainly to replace Elasticsearch, and will probably settle on a
command line client for reading, searching, and analyzing (I dislike writing
web UIs).

All this because I'm tired of elastic.co. I can't upgrade my Elasticsearch
1.7.5 to the newest version, because then I would need to upgrade this
small(ish) 4MB Kibana 3.x to a monstrosity that weighs more than the whole
Elasticsearch engine itself, for no good reason at all. And now that I'm stuck
with ES 1.x, it's only somewhat stable; it can hang for no apparent reason at
unpredictable intervals, sometimes three times per week, and sometimes working
with no problem for two months. And to add insult to injury, processing logs
with grep and awk (because I store the logs in flat files as well as in ES) is
often faster than letting ES do the job. I only keep ES around because Kibana
gives a nice search interface and ES provides a declarative query language,
which is easier to use than building an awk program.

> He even goes as far as implementing a minimal logstash equivalent (i.e. log
> parsing) into the database itself.

As for parsing logs, I would stay away from the database. Logs should be
parsed earlier and be available for machine processing as a stream of
structured messages. I have implemented such a pipeline using Rainer Gerhards'
liblognorm and I'm very happy with the results, to the point that I now derive
some monitoring metrics, and was collecting inventory data, from logs.
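
The "parse early into structured messages" idea can be illustrated with a few
lines of Python (a toy sketch, not liblognorm: the log format and field names
here are invented):

```python
import json
import re

# Toy pattern for a syslog-ish line; a real pipeline would use a rule
# engine such as liblognorm with a database of patterns.
LINE_RE = re.compile(
    r"(?P<ts>\w{3} +\d+ [\d:]+) (?P<host>\S+) "
    r"(?P<prog>[\w./-]+)(?:\[(?P<pid>\d+)\])?: (?P<msg>.*)"
)

def parse_line(line):
    """Turn one raw log line into a structured dict, or None on no match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    return {k: v for k, v in m.groupdict().items() if v is not None}

if __name__ == "__main__":
    raw = "Feb 12 06:25:01 web1 cron[1234]: (root) CMD (run-parts /etc/cron.hourly)"
    print(json.dumps(parse_line(raw)))
```

Downstream consumers (metrics, inventory, storage) then see uniform dicts
instead of raw strings.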

~~~
awj
> I can't upgrade my Elasticsearch 1.7.5 to the newest version, because then I
> would need to upgrade this small(ish) 4MB Kibana 3.x to a monstrosity that
> weighs more than the whole Elasticsearch engine

...is that really a good reason to reinvent this whole solution, though?
You're basically saying you're going to spend the time to replace your entire
log storage/analysis system because you object to the disk size of Kibana.
(Which, without knowing your platform specifically, looks like it safely sits
under 100 megs).

The rest of your complaints seem to stem from not having upgraded
elasticsearch, aside from possibly hitting query scenarios that _continue_ to
be slower-than-grep after the upgrade.

Maybe I'm misunderstanding your explanation, but if I'm not this sounds like a
lot of effort to save yourself tens of megs of disk space.

~~~
dozzie
> ...is that really a good reason to reinvent this whole solution, though?

The system being dependency-heavy and pulling in an operationally awful stack
(Node)? Yes, this alone is enough of a reason for me. And I haven't even
mentioned other important reasons, like memory requirements and processing
speed (less than satisfactory), elasticity of processing (ES is mostly a
query-based tool, and whatever pre-defined aggregations it has, it's too
constrained a _paradigm_ for _processing_ streams of logs), and my wanting to
take a shot at log storage, because our industry actually doesn't have any
open source alternative to Elasticsearch.

> Kibana. (Which, without knowing your platform specifically, looks like it
> safely sits under 100 megs).

Close, but missed. It's 130MB unpacked.

> Maybe I'm misunderstanding your explanation, but if I'm not this sounds like
> a lot of effort to save yourself tens of megs of disk space.

I'm fed up with the outlook of the whole thing. Here, ridiculous disk space
for what the thing does; there, slower-than-grep search speed; elsewhere, a
system that barely keeps up with the rate I'm throwing data at it (a single ES
instance _should not_ lose its breath under just hundreds of megabytes per
day); an upgrade that didn't make things faster or less memory-consuming, but
did break on my data stream (I was ready to patch Kibana 3.x for ES 5.x, but
then I got bitten _twice_ in surprising, undocumented ways and gave up,
because I lost my trust that it won't bite me again).

Sorry, but no, I don't see Elasticsearch as a state-of-the-art product. I
would gladly see some competition for log storage, but all our industry has
now is SaaS or paid software. I'm unhappy with this setting and that's why I
want to write my own tool.

------
cel1ne
I do use PostgreSQL wherever possible. Add
[http://postgrest.com/](http://postgrest.com/) and nginx as a URL-rewriting
proxy and you have a performant, highly adaptable REST server.

~~~
vosper
I looked at Postgrest and liked what I saw, until it came to authentication
and authorization - the docs mention that this is an area that tends to blur
the line between application and database layer, and then just sort of punts
on the whole thing, suggesting that maybe you want to use the auth0 SaaS. Have
you had to deal with this?

~~~
icot
Yes, we found the same issue. We worked around it by organizing our
application (an admittedly basic REST API) in two layers: the external one,
exposed to API consumers and developed in a language/framework you may like
(Python/Tornado for us), and the internal data access layer, which uses
Postgrest and is only accessed locally from within the REST API application
servers.

~~~
rubber_duck
Why not just query the database using standard drivers then? What am I missing
here?

~~~
spdionis
Maybe the exposed application acts like a proxy for the internal APIs, only
checking authentication info and passing the request on to the postgrest
servers.

------
einhverfr
Nice writeup though I would add a few things.

Listen/Notify works great for short-term job queues. For longer-term ones, you
run into some serious difficulties on PostgreSQL which require care and
attention to detail to solve. In those cases, of course, you can solve them,
but it takes people who know what they are doing.
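
For reference, a minimal listener loop might look like this with psycopg2 (a
sketch under assumptions: the `jobs` channel name, the DSN, and the JSON
payloads are invented, and a durable worker would also scan a job table, since
notifications sent while it is offline are lost):

```python
import json
import re
import select

def quote_channel(name):
    """LISTEN/NOTIFY channels are identifiers, not values, so they can't be
    parameterized; validate rather than interpolate blindly."""
    if not re.fullmatch(r"[a-z_][a-z0-9_]*", name):
        raise ValueError("unsafe channel name: %r" % name)
    return name

def listen_forever(dsn, channel="jobs"):
    """Block on a Postgres NOTIFY channel and yield decoded JSON payloads."""
    import psycopg2              # imported here so the helper above
    import psycopg2.extensions   # works without the driver installed
    conn = psycopg2.connect(dsn)
    conn.set_isolation_level(psycopg2.extensions.ISOLATION_LEVEL_AUTOCOMMIT)
    with conn.cursor() as cur:
        cur.execute("LISTEN %s;" % quote_channel(channel))
    while True:
        # Wait up to 5 seconds for the connection socket to become readable.
        if select.select([conn], [], [], 5) == ([], [], []):
            continue  # timeout; a durable worker would also poll a job table
        conn.poll()
        while conn.notifies:
            note = conn.notifies.pop(0)
            yield json.loads(note.payload)
```

The fallback table scan on timeout is exactly the "care and attention to
detail" the parent alludes to: NOTIFY alone gives at-most-once delivery.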

Also, whether to store images in the database is something that really depends
on what you are doing, what your database load is, and what your memory
constraints are. At least when working with Perl in the middleware, decoding
and presenting the image takes several times the RAM that loading it off the
filesystem does. That may not be the end of the world, but it is something to
think about.

Also, TOAST overhead in retrieved columns doesn't show up in EXPLAIN ANALYZE
because the items never get detoasted. Again, by no means a deal breaker, but
something to think about.

In general, PostgreSQL can be good enough, but having people who know it
inside and out is important as you scale. That's probably true with any
technology, however.

------
Aeyris
Is anyone actually utilising a recent version of PostgreSQL for full-text
searching beyond a hobby project? How do you find the speed and accuracy
versus Elasticsearch?

~~~
mrmondo
Yes, we've used full text search for quite some time for one of our products.
It's great; we really haven't had any issues, performance or otherwise, with
it. For one of our other products we index data from PostgreSQL / PostGIS into
elasticsearch, and that's also good, in other ways.

~~~
Aeyris
At a guess, stemming from my ignorance of Elasticsearch, I'm going to assume
you'd pair PostgreSQL and Elasticsearch like that in situations where you need
to perform more complex analysis over the queries to provide results; is this
correct? I've seen this particular architecture before and my biggest question
was whether or not it was a piece that could be eliminated.

For, say, a tag-based search system (think any forum or booru-style
imageboard), is Elasticsearch completely overkill?

I realise I could probably just google it™, but it's a lot easier to
understand a product's strengths when they're put in a situational context.

~~~
dguaraglia
Your guess is pretty much it: Postgres full text search is great for simpler
scenarios, but elasticsearch provides more powerful search capabilities and
simpler implementation of complex features (such as "search all shops in this
geographical area with products that match these words the user typed"). That
said, I've had good results with both systems, though I've always found the
elasticsearch query syntax pretty arcane and difficult to put together from
the docs. The docs essentially assume you are very familiar with Lucene
concepts and that you'll be able to figure out what goes where in the JSON
object from your past experience. If you are implementing a simple tag system,
I'd use Postgres first and only move to elasticsearch when performance becomes
an issue.
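
To put the "Postgres first" advice in concrete terms, a tag/text search might
be wired up like this (a sketch: the `articles` table, its `body` column, and
the naive input sanitizer are all invented for illustration; newer Postgres
ships `websearch_to_tsquery()` which handles user input properly):

```python
import re

def to_tsquery_text(user_input):
    """Reduce free-form user input to a safe tsquery string by keeping
    word characters and ANDing the terms together."""
    terms = re.findall(r"\w+", user_input.lower())
    return " & ".join(terms)

# Expression index so the tsvector doesn't need its own column
# (works on 9.x-era Postgres as well as current versions).
SETUP_SQL = """
CREATE INDEX articles_fts_idx
    ON articles USING gin (to_tsvector('english', body));
"""

# Parameterized ranked search; pass to_tsquery_text(...) as the parameter.
SEARCH_SQL = """
SELECT id, ts_rank(to_tsvector('english', body), query) AS rank
FROM articles, to_tsquery('english', %s) AS query
WHERE to_tsvector('english', body) @@ query
ORDER BY rank DESC
LIMIT 20;
"""
```

For a tag-style search this plus a GIN index is often all the "engine" a forum
or imageboard needs.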

------
jaequery
is it good enough? yes. in fact, it's probably overkill for most. i think the
question of "good enough" would've been perfect for sqlite.

~~~
rwmj
I love sqlite but it's really difficult to recommend it in a situation where
you need parallel r/w access to the database (i.e. any web-type situation,
even for light loads).

~~~
marktangotango
If care is taken to cache query results and clear the cache on insert/update,
one can get stupendous performance from sqlite. I used Apache, mod_lua, and
redis to implement a system that does this, for example.
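
The pattern is easy to sketch with Python's stdlib sqlite3 module (a toy
stand-in: a dict plays the role redis played in the parent's setup, and the
table is invented):

```python
import sqlite3

class CachingDB:
    """Memoize read queries; flush the whole cache on any write."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.cache = {}

    def query(self, sql, params=()):
        key = (sql, params)
        if key not in self.cache:
            self.cache[key] = self.conn.execute(sql, params).fetchall()
        return self.cache[key]

    def execute(self, sql, params=()):
        self.conn.execute(sql, params)
        self.conn.commit()
        self.cache.clear()  # writes invalidate every cached result

if __name__ == "__main__":
    db = CachingDB()
    db.execute("CREATE TABLE t (n INTEGER)")
    db.execute("INSERT INTO t VALUES (1)")
    print(db.query("SELECT COUNT(*) FROM t"))  # [(1,)]
```

Clearing the entire cache on every write is crude but correct; it sidesteps
the hard problem of knowing which cached queries a given write affects.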

~~~
einhverfr
SQLite is _wonderful_ for some things but write concurrency is not one of
them. This means you really need to segregate reads from writes, and use a
different path for that.

I actually think SQLite in general is fairly underrated, but there are
important limits.

PostgreSQL is my go-to db, but I _love_ the ability to have callbacks to SQL
functions live in my main code. For lots of data analysis work, the way custom
functions and custom aggregates work in SQLite is wonderful.
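
Those custom-function and custom-aggregate hooks can be shown with Python's
stdlib sqlite3 module (the function names and the tiny table are invented for
illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Scalar UDF written in the host language, callable from SQL.
conn.create_function("reverse", 1, lambda s: s[::-1])

# Custom aggregate: sqlite3 expects a class with step() and finalize().
class GeoMean:
    def __init__(self):
        self.product, self.count = 1.0, 0
    def step(self, value):
        self.product *= value
        self.count += 1
    def finalize(self):
        return self.product ** (1.0 / self.count) if self.count else None

conn.create_aggregate("geomean", 1, GeoMean)

conn.execute("CREATE TABLE m (v REAL)")
conn.executemany("INSERT INTO m VALUES (?)", [(2.0,), (8.0,)])
print(conn.execute("SELECT reverse('abc'), geomean(v) FROM m").fetchone())
# → ('cba', 4.0)
```

This is the "callbacks in my main code" point: the aggregate runs inside the
query but is plain application-language code, no extension packaging needed.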

But again, a chasing hammer is not the same as a machine hammer.

------
mattferderer
Very nice article for those of us who have never used PostgreSQL much. I've
been starting to use it with Elixir & this gives me a good understanding of
why someone would use it, especially when starting a new app.

Out of curiosity, does anyone have a favorite article saved that does a great
comparison of when to use certain databases?

------
seibelj
Yes, for all projects and small businesses I start, Postgres and Redis is what
I use from the beginning. Then if it ever gets to the point where I need a
different DB for something, I replace components with the new tool. People get
fascinated with these fly-by-night data stores and put their operations at
serious risk. Start with the tried and tested technologies, then carefully
augment your stack as needed.

------
rosser
Using an RDBMS as a work queue is an anti-pattern, but if you're going to do
it, you probably can't do much better than LISTEN/NOTIFY.

~~~
smilliken
Listen/notify doesn't queue messages if the listener isn't online, so it isn't
usually appropriate as a work queue.

I encourage everyone to challenge the conventional wisdom that using
postgresql as a work queue is an anti-pattern. I've never found a queuing
software, out of countless options, that I was satisfied with. In concept, a
queue is very simple. In practice, I've never seen the end of the messy real
world constraints that need to be imposed on queues.

We have somewhere around 50-100 queues in postgresql databases at MixRank,
nearly all of them with millions of records or more, being hammered
constantly. We get transactional consistency, automatic insertion with
triggers, expression and partial indexes for prioritization and partitioning,
joins, constraints and foreign keys, and everything else that comes with
postgresql. It works great.
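
For concreteness, the usual claim query for a Postgres-backed queue (available
since the SKIP LOCKED clause landed in 9.5) looks something like this; the
`jobs` table, its columns, and the helper are invented for illustration:

```python
# FOR UPDATE SKIP LOCKED lets competing workers claim rows without
# blocking on each other's in-flight transactions.
CLAIM_SQL = """
UPDATE jobs
SET state = 'running', claimed_at = now()
WHERE id = (
    SELECT id FROM jobs
    WHERE state = 'queued'
    ORDER BY priority DESC, id
    LIMIT 1
    FOR UPDATE SKIP LOCKED
)
RETURNING id, payload;
"""

def claim_one(conn):
    """Atomically claim one job, or return None when the queue is empty.
    `conn` is assumed to be a DB-API connection (e.g. psycopg2)."""
    with conn.cursor() as cur:
        cur.execute(CLAIM_SQL)
        row = cur.fetchone()
    conn.commit()
    return row
```

Because the claim is a single transaction, a crashed worker's row lock is
released automatically and the job becomes claimable again, which is the
transactional consistency the parent is pointing at.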

~~~
einhverfr
Also, if your queue takes a month to clear because you bulk-inserted a few
hundred million jobs... Listen/Notify isn't going to help you.

------
mamcx
Now imagine if we understood that the relational model is not just for "data
storage" but can also be used for everything.

The closest thing (1) was dBase/FoxPro. You could actually build a full app
with it. Send email from the database? Yes. Isn't that wrong? It's wrong only
because RDBMSs (2) made it wrong, not because it is _actually wrong_. Why is
it better to split an app across separate languages/runtimes/models than to
have one integrated thing?

(1): Taking into consideration that neither Fox nor any "modern" rdbms has
taken the relational model to its full extent.

(2): An RDBMS is a full package with a defined role and limited capabilities.
A relational-style language would not be an exact replica of that; it isn't
even required to implement a full storage solution.

The biggest mistake the relational guys have committed is to always think in
terms of full databases instead of micro-databases. Ironically, kdb+ (or lisp?
or rebol?) could be the closest thing to the idea (where data and code are not
enemies but friends).

~~~
discreteevent
Have a look at Eve

[http://witheve.com](http://witheve.com)

~~~
mamcx
Yeah, it looks interesting. I wonder how it would fare for a full realistic app.

------
api
From my experience with both PostgreSQL and RethinkDB (and other NoSQL
stores):

For SQL, complex queries, and data warehousing: yes. It's an excellent
database and I'm not sure why you'd pick another SQL DB unless it were a lot
better on point two.

For high availability and scaling: no, absolutely not.

The problem with the latter is an arcane deployment process and arcane error
messages that provide constant worry that you're doing something wrong. It's a
many week engineering project to deploy HA Postgres, while HA RethinkDB takes
hours -- followed by some testing for prudence... our testing revealed that it
does "just work" at least at our scale. We were overjoyed.

The docs for Postgres HA and clustering are also horrible. There are like five
different ways to do it and they're all in an unknown state of completion or
readiness.

Of course if/when we _do_ want complex queries and more compact storage, we
will probably offload data from the RethinkDB cluster to... drum roll... a
PostgreSQL database. Of course that will probably be for analytics done
"offline" in the sense that if the DB goes down for a bit we are fine. HA is
not needed there.

TL;DR: everything has its limitations.

~~~
jeltz
I do not get what you refer to by arcane error messages that give constant
worry. While some error messages need to be googled, I think PostgreSQL is
pretty good at only printing error messages when something is actually wrong.
So if you are getting strange errors from PostgreSQL, it is almost certainly
an issue which should be fixed.

------
anko
I love Postgres, but the one thing I think sucks is its COUNT() performance.

I've read all sorts of hacks but I would love for someone to solve this for
me!

~~~
falcolas
One trick with counts is that you very rarely need a perfectly accurate count
for that exact moment in time; doing an EXPLAIN on an appropriate SELECT and
capturing the estimated number of rows it reports is usually good enough in
98% of cases.

When you do need an accurate count, phrasing the query so the results can be
pulled exclusively from the table index is also usually good enough.
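
The EXPLAIN trick above can be sketched in a few lines (the sample plan line
below is fabricated; for whole-table counts the same estimate is also
available from `pg_class.reltuples`):

```python
import re

def estimated_rows(plan_line):
    """Pull the planner's row estimate out of an EXPLAIN output line, e.g.
    'Seq Scan on users  (cost=0.00..458.00 rows=10000 width=244)'."""
    m = re.search(r"rows=(\d+)", plan_line)
    return int(m.group(1)) if m else None

# With a live connection you would do something like:
#   cur.execute("EXPLAIN SELECT * FROM users WHERE active")
#   estimate = estimated_rows(cur.fetchone()[0])

sample = "Seq Scan on users  (cost=0.00..458.00 rows=10000 width=244)"
print(estimated_rows(sample))  # 10000
```

The estimate is only as fresh as the table's statistics, so it pairs well with
regular autovacuum/ANALYZE.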

~~~
ishi
Never thought of using EXPLAIN to get an estimated count... nice trick!

------
agentgt
We had to learn this the hard way. We run many of the data stores/services the
article mentions, and while we still use them, whenever something gets massive
we actually go back to PostgreSQL.

For example, for our internal analytics/logs/metrics we use ELK and Druid,
but believe it or not these tools, despite their purported scaling abilities,
are actually damn expensive. This new cloud "elastic" stuff cheats and uses
lots and lots of memory. As a bootstrapped, solvent, self-funded startup, we
do care about memory usage.

For customer analytics we use... yes Postgresql.

For counters and stream-like things we don't use Redis; we use PipelineDB (a
PostgreSQL fork). For Cassandra-like stuff we use Citus (a PostgreSQL
extension).

Some of our external search uses SOLR (for small fields) but Postgresql text
search is used for big fields.

The only part of our platform where we don't really leverage PostgreSQL is the
message queue, and this is because RabbitMQ has done a damn good job so far
(that, and the JDBC driver isn't asynchronous, so LISTEN/NOTIFY isn't really
useful).

~~~
ledgerdev
Here's an async JDBC driver. [https://impossibl.github.io/pgjdbc-
ng/](https://impossibl.github.io/pgjdbc-ng/)

~~~
agentgt
Interesting. There was a project a while back that offered async through
netty, but it was trying to do everything async (i.e. not follow the JDBC
spec).

This project appears to follow the JDBC spec and thus is synchronous, with the
exception of Listen/Notify (for which you have to work with the driver
directly, as there is no JDBC analog).

------
mooneater
Postgres' awesome extensible type system means it will continue to increase in
functionality much more easily than most comparable DBs.
[https://www.postgresql.org/docs/9.6/static/extend-
how.html](https://www.postgresql.org/docs/9.6/static/extend-how.html)

------
TheAceOfHearts
It's really amazing how far you can get with a relational database. If you
have very minimal constraints, keeping everything in a single place can make
life so much easier. Configuration hell is real. I hadn't considered using PG
for storing binary data, but I've hacked together a few toy projects where I
used mongo and just shamelessly shoved everything in there.

I have some slightly tangential questions, which I'd love to hear people's
thoughts on: How do you decide where to draw the line between what's kept and
defined in the application versus the database? For example, how strict would
you make your type definitions and constraints? Do you just accept that you'll
end up duplicating some of it in both places? Also, how do you track and
manage changes when you have to deal with multiple environments?

~~~
barrkel
In a heterogeneous language environment (Java, Ruby and Coffeescript where I
am), defining types in the application doesn't give sufficient coverage
because not all languages are statically typed. Referential integrity is
important, but sometimes has to be sacrificed for performance reasons (ideally
foreign keys are only temporarily disabled for bulk actions). Other column-
level constraints are nice to have but less critical.

For the kind of work we do at my company - involving bulk upload, bulk
inserts, bulk joins - thinking relationally is much more productive than
thinking in objects, and really fast key-based lookups are far less important
than really fast joins. Only configuration data lives in objects; other data
is served up using a lightweight translation of the result set to JSON, with
no business entity intermediary; one of the primary reasons being that the
user gets to define their own schema, so any object would have a variable
number of fields and types anyway.

Writing the same logic in different places is definitely a problem; you want
to avoid that. We've taken to putting some logic in JRuby just to be able to
use the same source in Rails and Java. For other logic, we have a Java service
API endpoint that Rails can call. For maintenance purposes, it's really
worthwhile not duplicating very complex logic in multiple places.

Relational databases - or rather, relational algebra, even if computed over
something like Spark or Impala - is, to me, worth far more than something more
suited for storing objects. Thinking in terms of relations is just more
productive for efficient code than navigating object graphs.

------
crudbug
Has anyone played with the threading model within Postgres?

I was reading the documents, and it looks like for every client request
Postgres forks a new process and uses a shared memory model.

Using multi-processor threads/coroutines might be useful for scaling it
further.

~~~
falcolas
I'm more familiar with MySQL, but if PostgreSQL does not use threads, I'd eat
my hat.

They're just too useful in helping to spread the load of all the IO required
to return even one result.

~~~
dboreham
PG pre-dates (working) threads in Unix by a long time. Processes and threads
are much the same thing as far as concurrency and parallelism are concerned.
PG maintains a pool of pre-forked processes so there is typically no process
startup delay. Shared memory regions and SysV cross-process locks facilitate
IPC. Back in the day this is how threads were done in user-land:

[https://www.postgresql.org/docs/9.6/static/kernel-
resources....](https://www.postgresql.org/docs/9.6/static/kernel-
resources.html)

It does have parallel scans though:
[http://rhaas.blogspot.co.uk/2015/11/parallel-sequential-
scan...](http://rhaas.blogspot.co.uk/2015/11/parallel-sequential-scan-is-
committed.html)

~~~
falcolas
Guess I should get some seasoning for my hat.

> Back in the day

This seems somehow problematic: a DB trying to meet modern performance
requirements by relying on how it was done over 12 years and 2 major kernel
versions ago.

Even MySQL makes judicious use of threads for managing periodic tasks and
parallel data parsing.

Per-connection processes do make some sense to me, but it seems wasteful when
most connections to a DB are idle most of the time. Having to coordinate locks
cross-process also seems wasteful; more syscalls and context switches than
should be necessary.

~~~
anarazel
> Per-connection processes does make some sense to me, but it seems wasteful
> when most connections to a DB are idle most of the time.

Per-process vs per-thread overhead isn't that different in e.g. linux. Some
things are more expensive with multiple processes (more page tables/more
wasted space/increased process switch cost), others are cheaper (e.g. memory
allocation, although that's getting better over the last few years).

> Having to coordinate locks cross-process also seems wasteful; more syscalls
> and context switches than should be necessary.

I don't think there's a meaningful difference here. We use atomic operations
for the non-sleeping lock paths (which would be no different with threads),
and for sleeping locks, when we need to sleep, we use semaphores for directed
wakeups; you'd need something similar for threads as well.

Really, the majority of the cost of not having threads shows up when you
explicitly want to share more state _after_ processes have initialized. It's
e.g. a lot harder to dynamically scale the size of the buffer pool up/down.
It's also one of the things that made intra-query parallelism harder.

------
mooneater
How do you get a "hotstandby replica for $5/month"?

------
mooneater
Another big plus for postgres: PL/Python, PL/R, etc

------
ckdarby
Why is the font so small on this site?

------
ausjke
still using php+mysql here as I can find so many documents about their various
usage easily.

------
hartator
I am more on the MongoDB bandwagon. Schemaless makes prototyping so much
easier. And no migrations!

~~~
madhadron
Just make sure that you replace it before you try to run it with replication
and sharding. Mongo's replication is not atomic, despite their claims
otherwise. This includes versions from the last six months.

Mind you, you should be very cautious about loading it in a single instance as
well, since it has unpleasant behaviors there, too, but you simply cannot
trust it when replication is involved.

------
treve
Or just MySQL. Popular choice, unpopular opinion. I trust it more because it
gives me a tried and tested path when I need replication (which tends to
happen rather early). My understanding is that Postgres replication is not
nearly as battle-tested.

~~~
qaq
PostgreSQL replication is very well tested and is used by a very large number
of very prominent projects.

~~~
api
Can you give me a link? Searching for how to deploy it turns up dozens of
guides of varying age that refer to different projects of different levels of
maturity.

~~~
qaq
[https://www.youtube.com/watch?v=n5yt67uZdgQ](https://www.youtube.com/watch?v=n5yt67uZdgQ)
is fairly decent talk.

