
The Durable Document Store You Didn't Know You Had, But Did - joshuacc
http://robots.thoughtbot.com/post/13829210385/the-durable-document-store-you-didnt-know-you-had-but
======
pilif
Please also have a look at this presentation the original article linked to:
<http://pgeu-plv8.herokuapp.com>

Ever since we began storing document data as JSON instead of serialized PHP
arrays (yes, PHP - I can't port this behemoth to a more convenient language),
I've been thinking that maybe Postgres would eventually get some way of
peeking inside these blobs.

Now that pl/v8 exists, we can. Combined with the right functional indexes,
this can make our life much easier and go a long way toward reducing joins or
the amount of data transmitted between database and application server - both
of which concern me in particular places; nothing really bad yet, but
potentially.
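To sketch what that looks like (table and field names here are hypothetical,
not from the article): a plv8 function that pulls one key out of the JSON
text, marked IMMUTABLE so it can back a functional index:

```sql
-- Hypothetical schema: docs(id serial, body text), body holding JSON.
CREATE EXTENSION plv8;

CREATE OR REPLACE FUNCTION json_string(body text, key text)
RETURNS text AS $$
  -- Runs inside Postgres on V8: parse the blob, return one key.
  var doc = JSON.parse(body);
  return doc[key];
$$ LANGUAGE plv8 IMMUTABLE STRICT;

-- The functional index lets a matching WHERE clause use an index scan
-- instead of re-parsing every row's JSON:
CREATE INDEX docs_name_idx ON docs (json_string(body, 'name'));
-- SELECT id FROM docs WHERE json_string(body, 'name') = 'foo';
```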

Thanks a lot for posting this article. You might have just made my day.

~~~
underwater
PHP's serialization is pretty straightforward; you don't even have to worry
about escape characters. You could write encoding/decoding functions in a
couple of dozen lines.

~~~
pilif
I probably could, but in this case there were other issues which led me to
move to JSON. Speed is one of them: json_*code is about twice as fast as
un/serialize.

I know igbinary exists to solve that, but it would make debugging something
using psql even worse.

Also, there is no formal specification for the output of serialize() and there
is some talk about changing the format sometime in the future. While they
could have unserialize() understand both formats and serialize only write the
new one, it would mean that I have to play catch-up and fix my code as the
format changes.

As I would probably never really have time to do it, this would create a fixed
dependency on an old version of PHP which a) would preclude me from using new
language features and b) would get me into trouble as distros update and the
last version which writes the current format falls out of support.

Then there are also some encoding issues which are too complicated to explain
here but basically force you to store the output of serialize() in a bytea
column which again sucks for quick psql-based inspection and for PDO access
which converts bytea fields into streams. See
<http://pilif.github.com/2011/09/serialize-mistake/> if you are really
interested in how I f'ed up.

No. JSON only had advantages for me and now I get one more huge additional
advantage without having to write even a single line of C code.

On a related note: do have a look at the source code of unserialize() - it was
generated by some parser generator and is completely unreadable at a casual
glance, forcing me to dive even deeper into PHP's build system were I to
rebuild unserialize() as a Postgres extension, which I don't want to do
anyway for all the other reasons.

PS: no, looking at the output of serialize() and kinda sorta inferring the
format and then building a quick parser is a recipe for disaster. Think
escaping. Think arbitrary data types, think nesting. Are you sure you can hack
together a parser which is guaranteed to be correct in all cases?
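To make the escaping point concrete: serialize() writes strings with a
declared length prefix and no quote escaping, so a naive quote-delimited
parser silently truncates. A small illustration (in JavaScript, since that's
what plv8 runs; the array contents are made up for the example):

```javascript
// PHP: serialize(['note' => 'he said "hi"']) produces:
//   a:1:{s:4:"note";s:12:"he said "hi"";}
// The s:12 is a byte length; the embedded quotes are NOT escaped.
const serialized = 'a:1:{s:4:"note";s:12:"he said "hi"";}';

// A naive parser that scans to the next '"' gets the wrong value:
const naive = serialized.match(/s:12:"([^"]*)"/)[1];
// naive is 'he said ' -- truncated at the first embedded quote.

// A correct parser must honor the declared length instead
// (byte length; equal to character length for this ASCII example):
const start = serialized.indexOf('s:12:"') + 's:12:"'.length;
const correct = serialized.slice(start, start + 12);
// correct is 'he said "hi"'
```

The same length-prefix scheme applies recursively to nested arrays and
objects, which is why a quick regex-based parser is so hard to get right.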

------
rbranson
For those wondering, the real power comes when you use the support for
functional indices in combination with XML. You can create indexes on
fragments of the XML and perform some very nice queries against large sets of
XML documents. Hey, I'm not advocating this, but if you've got a ton of XML
data and also an existing swath of relational, it's hard to map them onto each
other without this level of in-database support.
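As a sketch of the pattern (the table and XPath here are hypothetical, not
from the article): index an XPath fragment so that queries over large sets of
XML documents don't have to evaluate the XPath against every row:

```sql
-- Hypothetical schema: pages(id serial, doc xml).
CREATE INDEX pages_title_idx
  ON pages (((xpath('/page/title/text()', doc))[1]::text));

SELECT id
  FROM pages
 WHERE (xpath('/page/title/text()', doc))[1]::text = 'Home';
```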

It's also nice because it supports partial updates of the XML documents, so
that concurrent updates don't clobber each other during some kind of horrible
read-then-write dance.

It would be really great to have first class JSON support at the level of this
XML support built-in though. Certainly could convince some potential MongoDB
converts to stay put.

------
tom_b
I have stored XML in Oracle XMLType columns off and on as part of a
reporting/data warehouse infrastructure with some success.

It is a really cheap way of adding in XML data documents that are small-ish
and infrequently used. We typically extract a field or two from the XML doc.

I wouldn't do it for really big XML docs (for some definition of really big)
but it is handy for thousands of little XML docs.

------
euroclydon
Aside from querying it, my biggest problem with storing XML in a DB column has
been, in SQLite, the sheer size of it. I've had to move to compressing some of
it, but then it's not possible to query it anymore.

Do other databases handle compression and querying simultaneously?

~~~
skymt
PostgreSQL does some compression automatically & transparently for large
values: <http://www.postgresql.org/docs/current/static/storage-toast.html>

------
dmitriy_ko
It feels wrong to store a whole XML document in a single field of a relational
database. It makes more sense to parse the XML and then store it in an
appropriate relational manner.

~~~
jacques_chester
The problem is that for a while there, XML was hot hot hot and so DB vendors
added "XML support!" as a tickbox feature.

There's a case to be made that, by storing XML documents in a field, you've
managed to violate first normal form. And it's generally downhill from there.

~~~
dmitriy_ko
Exactly! So you are going to store data as an XML blob, and then use XPath to
query it. You get to use XPath, how cool is that! I'm sure there are cases
when storing XML documents in a DB is useful -- e.g. if you need to store a
complex document whose structure you don't know ahead of time. But the example
they provided -- storing beer info -- is exactly the wrong case to do it.

~~~
jacques_chester
> So you are going to store data as XML blob, and then use XPath to query it.
> You get to use XPath, how cool is that!

I think you've missed my point.

Storing XML in a field defeats normalisation and the benefits flowing
therefrom. Being able to query it with XPath just doubles the number of query
languages that I need to know and keep track of in the database.

If you want to store XML blobs, use the file system and XPath to your heart's
content.

~~~
dmitriy_ko
I did get your point. I was being sarcastic when I said that using XPath is
"cool." I guess I should've marked it with <sarcasm></sarcasm> to make it
clear ;). I was trying to convey exactly what you just said: it doesn't make
sense to use XPath to query a relational DB.

~~~
jacques_chester
My mistake then.

------
miketuritzin
One thing this article doesn't address (as far as I can see) is the efficiency
of updates to a document in a key-value store in Postgres. If I'm using
hstore/PLV8/XML, how efficient is it to add to or update the document
incrementally? (Say I am dealing with large, multi-megabyte documents.)

~~~
saurik
PostgreSQL uses multiversion concurrency control, which means it will not
update in place (a strategy that makes concurrency very difficult). The
result is that no part of an individual row - including the entire hstore/xml
field, but also any other field of a type such as integer - can be
"incrementally updated"; the row must be replaced "whole hog" (by which I
mean a new version is written and the old one is effectively garbage
collected at a later time).

------
flyingyeti
Are there any plans to add support for a native JSON data type?

~~~
olefoo
The article mentions PLV8, which transparently converts everything to JSON
internally and can be used to output JSON directly. And there are various
other extensions to Postgres that will allow you to move JSON into and out of
the database.

If you want to talk to an arbitrary webservice as though it were a local table
you can use www_fdw (Foreign Data Wrapper)
<https://github.com/cyga/www_fdw/wiki/Examples>

And if you're doing geo stuff, PostGIS speaks GeoJSON.

------
danso
Wow, xpath in a query call. As a cheap and sloppy compromise ( _NOT_ for
production, obviously), I'll sometimes just throw raw HTML into a database
text field, and eat the cost of retrieving it, putting it in memory, and then
running a parser on it. I don't know if parsing it in a database call would
save much time, but it sure would simplify the ORM.

That is, if Postgres integrity checks would even allow HTML, which I'm
guessing it probably wouldn't....

~~~
colanderman
You can disable DTD checking and then use XHTML (unless of course you're going
for strict SGML compliance as opposed to something that can be parsed by a
browser).

