
Document Storage Gymnastics with Postgres - mrmondo
http://rob.conery.io/2015/03/01/document-storage-gymnastics-in-postgres/
======
rcoder
PostgreSQL displays a combination that's amazing and rare in _any_ software,
much less a free (beer/libre) product: reliable, predictable performance and
usability for standard workloads, combined with an extensible, flexible
platform for experimentation and development.

Rock-solid "classic" OLTP database? Check. Rich geospatial data platform? You
betcha. Structured document store? That too. Pluggable storage engines and
column stores? Why not!

Don't get me wrong: there are definitely workloads for which Postgres is a
poor choice. Its replication features are far behind MySQL's, much less those
of its pricey commercial competitors and the more reliable NoSQL options
(Cassandra, HBase, etc.). There's also been little work that I know of* to
optimize the storage engine for SSDs. (* I'd love to learn this is just my
ignorance and someone really is focused on large-scale SSD-backed Postgres
deploys and tuning. Links/references appreciated.)

If you can work within those constraints, though, it's an awesome product.

~~~
s_kilk
> predictable performance

Hmm, maybe.

Where I work we use Postgres a lot, and the performance (for our workload) can
only be described as "brittle". We often see problems whereby making slight
syntactic changes to queries will cause confusion in the query planner and
performance will go through the floor.

Getting around this often involves trying out a few semantically equivalent
ways of phrasing a query until we find one that doesn't suck, then just
remembering that for next time.
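
For example (a hypothetical sketch, not one of our real queries), an IN
subquery and its EXISTS equivalent are semantically the same but can get very
different plans, so we compare them with EXPLAIN ANALYZE:

    -- Two semantically equivalent queries the planner may treat very
    -- differently (table/column names hypothetical):
    explain analyze
    select * from orders
    where customer_id in (select id from customers where region = 'EU');

    explain analyze
    select o.* from orders o
    where exists (select 1 from customers c
                  where c.id = o.customer_id and c.region = 'EU');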

Another problem we've had is that after a while the query planner can go nuts
suddenly, making bad decisions based on accumulated stats. It's fun watching
what is usually a two-second query last multiple hours because the query
planner has lost its mind.
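
When that happens, one obvious first step is refreshing the statistics by hand
(a minimal sketch; table name hypothetical):

    -- Re-collect the planner statistics the bad decisions are based on.
    analyze orders;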

Anywhere else I'd claim that this is down to incompetence, but we have two
senior engineers with multiple decades of DB experience, and even they end up
tearing their hair out over postgres performance.

Overall, I like postgres, a lot, and I'd choose it over MongoDB and other
NoSQL solutions for most workloads, but its performance can be very hit-or-
miss compared to other (proprietary) relational databases.

TLDR: postgres performance is usually pretty good, but in some cases it can be
brittle, resulting in a lot of effort spent needling the query planner into
doing the right thing; much more effort than some other (unfortunately
proprietary) databases require.

~~~
tomiko_nakamura
The only response I have to that is "talk to developers on the mailing list"
(either pgsql-performance or pgsql-hackers).

Maybe it's possible to improve the planning - maybe your queries are uncommon
/ difficult to estimate / running into a thinko in the planner. I don't know.

~~~
fleetfox
Try asking on IRC, #postgresql on Freenode.

~~~
tomiko_nakamura
Yeah, IRC is a good place to ask too. The obvious problem is that not all
developers hang there all the time, so for longer discussions the mailing
lists are better (and you can just attach test cases, for example).

------
munro
Postgres could already index JSON fields, since it can index arbitrary
expressions!

    
    
    create index on invoice_docs (((body ->> 'created_at')::timestamptz));
    

Love hearing new Postgres features though :) -- jsonb for less space, new
jsonb operators [1], & jsonb_path_ops for fast arbitrary querying [2].

[1] [http://www.postgresql.org/docs/9.4/static/functions-json.html#FUNCTIONS-JSONB-OP-TABLE](http://www.postgresql.org/docs/9.4/static/functions-json.html#FUNCTIONS-JSONB-OP-TABLE)

[2] [http://www.postgresql.org/docs/9.4/static/datatype-json.html#JSON-INDEXING](http://www.postgresql.org/docs/9.4/static/datatype-json.html#JSON-INDEXING)

~~~
ddorian43
I thought JSONB was MORE space, since it had a little issue with TOAST (it's
easier to compress text JSON than binary JSONB)?

~~~
tomiko_nakamura
JSONB had an issue with compression, because compression is decided on the fly
by checking whether the first part of the serialized data compresses well, and
in the original format that first part was incompressible (thus no compression
was used). But this was fixed in beta3, and so 9.4.0 works fine.

Regarding size, it's more complicated. For example after loading the
'delicious-1250k' dataset from

[http://randomwalker.info/data/delicious/](http://randomwalker.info/data/delicious/)

I get 1322MB for JSON and 1415MB for JSONB, so yes, in this case it's slightly
larger.
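
If you want to reproduce a comparison like this, something along these lines
works (a sketch; table names hypothetical):

    -- Total on-disk size, including TOAST and indexes.
    select pg_size_pretty(pg_total_relation_size('delicious_json'));
    select pg_size_pretty(pg_total_relation_size('delicious_jsonb'));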

FWIW, both JSON and JSONB use TOAST, because that's how PostgreSQL works
around the 8kB page size (or whatever page size you use).

------
angersock
So, I love Postgres to death, but I'm a bit let down by the json and jsonb
stuff.

It's really great for document storage and retrieval, but the fact that there
doesn't seem to be any story for partial updates kind of kills it for me. If
it just had that, it'd cover every use case I could possibly want (except for
maybe some of the Cassandra stuff).

It's annoying, though, because clearly that's what people often want--just
look at Mongo.
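
For what it's worth, the kind of per-key update I mean (which eventually
landed in PostgreSQL 9.5 as jsonb_set) would look something like this;
table/column names hypothetical:

    -- Not available in 9.4; a sketch of the later jsonb_set function.
    update invoice_docs
    set body = jsonb_set(body, '{status}', '"paid"')
    where id = 42;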

~~~
tomiko_nakamura
PostgreSQL does not have in-place updates in the first place: every UPDATE
effectively does a DELETE+INSERT. So a "partial update" does not make much
sense, because you'll create a new copy of the row anyway.

~~~
NickNameNick
That's not quite how PostgreSQL's Multi-value-Concurrency system works.

~~~
tomiko_nakamura
MVCC means "Multiversion Concurrency Control" and yes, that's pretty much
exactly how it works (creating copies of the rows, so that different snapshots
see different rows).
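
You can actually watch this happen through the hidden xmin system column (a
minimal sketch; table name hypothetical):

    -- xmin identifies the transaction that created a row version;
    -- it changes after UPDATE because a new version was written.
    select xmin, * from t where id = 1;
    update t set val = val + 1 where id = 1;
    select xmin, * from t where id = 1;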

In practice it's more complicated of course. A nice intro to MVCC is available
here:
[http://momjian.us/main/writings/pgsql/mvcc.pdf](http://momjian.us/main/writings/pgsql/mvcc.pdf)
(and in Bruce's other talks
[http://momjian.us/main/presentations/internals.html](http://momjian.us/main/presentations/internals.html)).

Or maybe you mean something else by Multi-value-Concurrency. In that case
please elaborate.

------
PaulHoule
Just about anything blows away mongodb.

~~~
takeda
Yep, and even in scaling out [1].

The main problem with MongoDB is that they tried to make a general-purpose
NoSQL database. The thing is, this is the problem RDBMSs have been trying to
solve from the beginning, and there are decades of research and
problem-solving behind them. When you're using MongoDB, you're essentially
going back in evolution to the '60s and '70s, when hierarchical databases were
used [2], and you're running into problems that were already addressed in
RDBMSs decades ago (for example, locking vs. MVCC [3]).

The original NoSQL systems were created to solve a very specific type of data
storage problem (originally in the terabytes-and-up range: data you can
process using highly parallelizable map/reduce operations), and of course you
trade something in return (most of the time, database consistency). There's no
silver-bullet solution.

The worst reason for picking MongoDB, though, is storing data as JSON. Many
companies that do that (including mine) believe DBAs are just slowing them
down, so by picking a schema-less database they can release new features
faster. They don't realize that they will still need to enforce a schema in
their applications, which is actually much harder and can get very ugly.
Otherwise the data will resemble a pig sty.

Personally I'm not too enthusiastic about JSON in PostgreSQL, but I guess it
gives you flexibility over which parts of the schema are enforced and which
are not. And there's always the option to migrate to a proper schema later.
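
As a sketch of what "migrating to a schema" can look like, here's a JSON key
promoted to a typed column (table/column names hypothetical):

    -- Promote a JSON key to a real, typed, constrainable column.
    alter table events add column created_at timestamptz;
    update events set created_at = (body ->> 'created_at')::timestamptz;
    alter table events alter column created_at set not null;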

[1] [http://www.datastax.com/wp-content/uploads/2013/02/WP-Benchmarking-Top-NoSQL-Databases.pdf](http://www.datastax.com/wp-content/uploads/2013/02/WP-Benchmarking-Top-NoSQL-Databases.pdf)

[2] [https://en.wikipedia.org/wiki/Hierarchical_database_model](https://en.wikipedia.org/wiki/Hierarchical_database_model)

[3] [https://en.wikipedia.org/wiki/Multiversion_concurrency_control](https://en.wikipedia.org/wiki/Multiversion_concurrency_control)

------
uberneo
JSOB indexing looks good performance-wise, but all non-commercial DBs,
including MonetDB, lack UPSERTs (as built-in functionality), apart from MySQL.
If we could UPSERT the data in JSOB based on a key, that would be awesome.
Also, since MonetDB doesn't have UPSERTs, that's the only reason for me not to
use it; otherwise it's an awesome database for analytical queries.

~~~
tomiko_nakamura
I really wonder what you mean by UPSERTs, because the traditional meaning of
UPSERT is "UPDATE if exists, INSERT otherwise", aka MERGE. And you're right,
PostgreSQL is missing that, but hopefully we'll get something like it in
PostgreSQL 9.5 at the end of 2015. There are ways to do that in PostgreSQL
using writable CTEs, in some cases.
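
A minimal sketch of that writable-CTE pattern (not fully safe under concurrent
inserts without a retry loop; table/column names hypothetical):

    -- UPDATE if the row exists, INSERT otherwise.
    with updated as (
        update counters set value = value + 1
        where name = 'hits'
        returning name
    )
    insert into counters (name, value)
    select 'hits', 1
    where not exists (select 1 from updated);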

But that's completely unrelated to JSONB, so maybe you mean something else?

Nitpick: it's JSONB, not JSOB.

------
KaiserPro
The JSON appears to be highly structured, so why isn't it stored in a
structured way?

I mean, invoices are, I'd assume, by definition made up of structured data:

The buyer (and associated data), the items ordered (and associated data),
total volume and price.

As all the relationships are fairly simple and rigid, surely you'd want them
in SQL?

~~~
tomiko_nakamura
I think that if you're using JSONB for data that can be easily stored in a
proper tabular form (aka relation), you're doing it wrong. No doubt there will
be people using it this way, though, and maybe there are cases where it really
makes sense.

What JSONB makes much easier is storing data that is not a good fit for the
relational paradigm. For example, if you don't know the exact structure (e.g.
you don't know what keys to expect), you can't reasonably decompose it into a
relational schema.

This is the case with mail headers, for example - they're highly variable,
every client/MTA can define its own headers, and so on. You could decompose
that into an EAV model (yuck), use tables with generic PARAM_1, ..., PARAM_N
columns (yuck!), or you can just stuff it all into a JSONB column.
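
A minimal sketch of the mail-header case (names hypothetical):

    create table messages (
        id       bigserial primary key,
        received timestamptz not null default now(),
        headers  jsonb not null  -- arbitrary, client-defined headers
    );

    -- Query any header without knowing the full set in advance:
    select id from messages where headers ->> 'X-Mailer' = 'mutt';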

------
jamescun
Just today I released a small Go package which wraps JSONB in an API similar
to `database/sql` for the purpose of rapidly prototyping apps.

[https://github.com/jamescun/jsonb](https://github.com/jamescun/jsonb)

~~~
alimoeeny
Looks very handy. How complex can queries be?

------
acabajoe
Is there a guide somewhere for someone who has used MongoDB before but is new
to Postgres, on how to map the former to the latter? Would love to try some of
this out, but the complexity is a little daunting.

~~~
takeda
SQL can look a bit complex at first, and advanced queries can get involved,
but MongoDB is quite limited in what operations you can do with it, so the SQL
queries that correspond to MongoDB operations are fairly simple.

Here's documentation from MongoDB which can be used the other way around as
well: [http://docs.mongodb.org/manual/reference/sql-comparison/](http://docs.mongodb.org/manual/reference/sql-comparison/)

For normal use you really only need four operations (rough SQL equivalents are
sketched after this list):

SELECT - query data

INSERT - inserting data

UPDATE - modifying existing data

DELETE - remove data

Everything else is either used for database maintenance or DDL (Data
Definition Language, defining how the data is stored) commands.
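
To make the mapping concrete, here are rough SQL equivalents of the four
operations (collection/table and field names hypothetical):

    -- db.users.find({age: {$gt: 21}}, {name: 1})
    select name from users where age > 21;

    -- db.users.insert({name: 'ann', age: 30})
    insert into users (name, age) values ('ann', 30);

    -- db.users.update({name: 'ann'}, {$set: {age: 31}})
    update users set age = 31 where name = 'ann';

    -- db.users.remove({name: 'ann'})
    delete from users where name = 'ann';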

The DDL is the main reason some organizations consider MongoDB: they believe
that by removing the schema (the definition of how data is stored) they will
get more flexibility. The problem is that if you use MongoDB you will still
have a schema; it will just be enforced by your application. And if you don't
enforce the structure, you'll end up with quite a mess and very complex code.

Also, if you use Java (or a JVM-based language), there are many frameworks
that are supposed to help with interacting with the database, each with its
own strengths and weaknesses. There's one called JOOQ [1] that allows you to
write SQL queries in Java in a way very similar to MongoDB.

Edit: There's also an article about MongoDB aggregate functions and their
PostgreSQL equivalents: [http://tapoueh.org/blog/2014/02/17-aggregating-nba-data-PostgreSQL-vs-MongoDB](http://tapoueh.org/blog/2014/02/17-aggregating-nba-data-PostgreSQL-vs-MongoDB)

[1] [http://www.jooq.org/](http://www.jooq.org/)

------
ExpiredLink
10 years ago people put XML into the database. Now it's JSON. Same mistake
again.

~~~
andybak
Funnily enough, I found this Steve Yegge piece from 2004:
[https://sites.google.com/site/steveyegge2/ten-predictions](https://sites.google.com/site/steveyegge2/ten-predictions)

Check out the first prediction.

~~~
ExpiredLink
Thanks for the link. Funny indeed. Replace 'XML' with 'JSON' and you have an
up-to-date blog snippet.

