
Creating a Document-Store Hybrid in Postgres 9.5 - craigkerstiens
https://blog.andyet.com/2016/02/04/postgres-9.5-document-store-hybrid/
======
SwellJoe
I'm often surprised at how many of the new web frameworks only support
something like MongoDB. There's nothing wrong with MongoDB (as far as I know),
but when I'm building something new, I'm planning for a long haul, and I
expect to want/need real database features in the future, even if I don't need
them today.

Given that PostgreSQL has JSON data types, and performance when using those
data types is competitive with the dedicated JSON or document databases, I
honestly can't think of any reason to use something else. Especially given the
long reliability history, the large pool of expertise, the tooling, etc. of
PostgreSQL vs. the other options.
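For what it's worth, the hybrid approach is just a jsonb column sitting next to ordinary relational columns. A minimal sketch (table, column, and index names here are all invented):

```sql
-- Hypothetical hybrid table: relational columns plus a jsonb document
CREATE TABLE products (
    id         serial PRIMARY KEY,
    name       text NOT NULL,
    attributes jsonb            -- free-form document data
);

-- A GIN index keeps containment queries on the document fast
CREATE INDEX idx_products_attributes ON products USING gin (attributes);

-- Query into the document with the containment operator
SELECT name FROM products WHERE attributes @> '{"color": "red"}';
```

You get document-style storage and indexing in the same table as your relational columns, with transactions included.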

Am I missing something in this assessment of the situation? I see so many
smart people deploying web applications (in particular) on MongoDB and
others...why are they choosing it over PostgreSQL?

~~~
l8again
For us, it boiled down to multi-master writes, and global replication (multi-
datacenters). Cassandra worked best for our use case. Moreover, as others have
pointed out, high availability with fault tolerance is another major reason.
CAP theorem is real, and there are trade-offs to be made in several use cases.

~~~
merb
Still, with Cassandra you lose transactions, and you can't start as "small" as
you can with postgresql. A 10,000-user service works well with a database
running on 512 MB of RAM, while Cassandra needs at least 1 GB (running on less
might work, but mostly you will run into longer GC pauses; it's really hard to
trim the JVM down below that. There are some ways to do it, but it's much
harder than just using something different).

And most people start with as few as 10,000 users. Scaling means starting from
1 and going to X, and postgresql is good enough for most stuff. You don't need
multi-master replication for most workloads; most people don't need write
scalability, and the things that do need some writes can often be cached (see
Instagram).

------
ahachete
Most of this work, and much more, including compatibility with the MongoDB
protocol (including replication) is already done in ToroDB
([https://github.com/torodb/torodb](https://github.com/torodb/torodb)).

Hope this self-plug is helpful :)

------
Sanddancer
Neat little tutorial. One thing I'd suggest to the author is perhaps adding
something on creating standard and materialized views for some of the more
complicated queries, to show more of the interesting data manipulation options
that postgresql gives you.
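To sketch the suggestion (names below are made up, not from the article), a materialized view can cache an expensive aggregation over the jsonb data:

```sql
-- Hypothetical materialized view precomputing a per-color count
-- from a jsonb column, so the expensive scan runs only on refresh
CREATE MATERIALIZED VIEW color_counts AS
SELECT attributes->>'color' AS color, count(*) AS n
FROM products
GROUP BY 1;

-- Cheap to query; refresh it when the base data changes
SELECT n FROM color_counts WHERE color = 'red';
REFRESH MATERIALIZED VIEW color_counts;
```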

------
orf
Interesting post, but I don't understand his way of deleting rows that aren't
in data.json. Creating a separate table seems overly complex: why not just
store a list of IDs in memory and delete any that aren't in that set?
Creating a list of 100,000,000 integers in Python takes 900 MB (4.2 GB for a
set!), which isn't that much. Stick it in one big 'DELETE WHERE id NOT IN
(...)' query.

Or better yet just TRUNCATE beforehand.

~~~
empthought
> Stick it in one big 'DELETE WHERE id NOT IN (...)' query.

I _think_ that's a terrible idea, because it will scan the entire NOT IN (...)
array for each row. Since it's NOT IN, it will even have to scan the whole
thing every time. A temporary table is generally the way to go for unbounded
lists of things; there's even support for it in PostgreSQL and the SQL
standard.
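A rough sketch of the temp-table approach (table names invented, assuming the ids to keep get bulk-loaded from the application):

```sql
-- Load the surviving ids into a temporary table with a primary key
CREATE TEMP TABLE keep_ids (id integer PRIMARY KEY);
COPY keep_ids (id) FROM STDIN;  -- or a bulk INSERT from the app

-- Anti-join delete: the planner can hash keep_ids instead of
-- rescanning a huge literal list for every row
DELETE FROM docs d
WHERE NOT EXISTS (SELECT 1 FROM keep_ids k WHERE k.id = d.id);
```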

~~~
orf
I always assumed pg would be smart and use something with constant lookup
time. This[1] is a query plan 9.5 produced for a table containing 10,000,000
sequential rows, executing "DELETE WHERE id IN (select * from
generate_series(1, 1000000));". Maybe you're right: it spends 5.8 seconds on
the nested loop. I've only used this for smaller volumes, where an extra table
might be overkill, but I guess with many more rows a temporary table is
better.

1. [http://tatiyants.com/pev/#/plans/plan_1454709615696](http://tatiyants.com/pev/#/plans/plan_1454709615696)

~~~
empthought
PG is doing the smart thing; it's optimizing for the common case of short
IN/NOT IN literals in SQL queries. Simple arrays have much better constant
factors.

If the clause is a subselect (like yours) and is also an actual table (unlike
yours), it will be smart and rewrite it as a JOIN, which will generally have
better complexity if the indexes are correct.
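To illustrate (with invented table names), the two forms the planner treats differently:

```sql
-- Literal list: the array is scanned for each candidate row;
-- fine for a handful of ids, O(n*m) for huge lists
DELETE FROM docs WHERE id NOT IN (1, 2, 3);

-- Subselect over a real table: the planner can turn this into a
-- hashed anti-join, which scales far better with good statistics
DELETE FROM docs WHERE id NOT IN (SELECT id FROM keep_ids);
```

(One caveat worth knowing: NOT IN over a subselect column that can contain NULLs behaves surprisingly, which is another reason NOT EXISTS is usually preferred.)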

------
jarjoura
The only thing I would caution here is that Postgres HA is notoriously
difficult to set up and get right. Something like CouchDB is designed from the
ground up for high availability and scalability. So if your design calls for
only a document store, it's probably best to avoid Postgres.

~~~
empthought
This is flat-out misleading. PostgreSQL is also designed "from the ground up
for high availability and scalability"; it's just optimized for a different --
and far more common and useful -- set of system assumptions and constraints.
Unfortunately those assumptions and constraints don't make for good blog posts
or product marketing copy: "We have a typical LOB app," "our data needs are
nothing extraordinary," etc.

~~~
coderzach
Postgres isn't designed from the ground up for high availability or
scalability. Most of the in-built functionality for setting up HA wasn't
introduced until 9.0, and the process of setting up HA postgres with automatic
failover requires third-party tools. As for scalability, postgres can only
scale vertically.

If you need a solution that can scale writes to more than one node, or a
solution that has first-party support for HA with automatic failover, you
shouldn't be using postgres.

As an aside, I find the dogmatic "just use postgres for everything" chorus
just as bad as the marketing BS associated with noSQL databases.

~~~
empthought
"High availability" and "high availability with automatic failover" are not
the same thing. (The tricky bit is actually the un-failing-over part.) I
assure you that people were setting up high availability systems with
PostgreSQL 7.x in 2004 or so.

Very few organizations need to scale their database in any other direction
than vertically. If it works for Stack Overflow, it's certainly likely to work
for your application.

~~~
threeseed
> Very few organizations need to scale their database in any other direction
> than vertically

Almost every major organisation, and plenty of startups/SMEs, are investing
heavily in analytics programs. And whilst they don't have massive data sets,
they expect real-time performance, so being able to scale horizontally is
important.

If you could vertically scale I/O that would be one thing, but you can't.

~~~
empthought
That's a bunch of read-only replicas though; RDBMSes can do that. Or if it's
really big data, then we are talking about Redshift, Greenplum, or Teradata.

There's no need for something like CouchDB.

