
Hstore development for 9.4 release - andreypopp
http://obartunov.livejournal.com/175235.html
======
twic
This is wonderful work, and it is the foundation for yet more wonderful work
in the future. However, as much as I am a huge PostgreSQL fanboy and a
dedicated MongoDB peanut-thrower, I worry that the caption applied to this
post is misleading.

Once this work is released, PostgreSQL will be faster than current versions of
MongoDB at carrying out queries for documents which look for a given value at
a given path (or perhaps more generally, values satisfying a given predicate
at paths satisfying a given predicate).

But that has never been one of MongoDB's important strengths, has it? The
operations for which MongoDB is optimised are inserting documents, retrieving
documents by ID, and conducting map-reduces across the whole database. Yes, it
has the ability to query by values in leaves, and it can use indices to make
that faster, but that's a bit of a second-tier, bag-on-the-side feature, isn't
it? If that was the main thing you needed to do, you wouldn't have chosen
MongoDB, right?

~~~
rpedela
You can insert and retrieve by ID just as fast with PostgreSQL.

Map/reduce? Isn't that slow in general? Can't PostgreSQL's JOINs and
aggregate functions do most of what map/reduce is for, but faster?

~~~
yummyfajitas
_You can insert and retrieve by ID just as fast with PostgreSQL._

How can Postgres possibly be faster on insert? Unlike Mongo, Postgres
actually needs to write bytes to a disk (with a WAL!).

~~~
clubhi
Postgres also has a "stupid mode" option: unlogged tables.

~~~
pilif
The feature works per table and is off by default. Unlogged tables have
another nice property: if the data gets corrupted somehow, the table will be
completely empty. This makes it clear to users that unlogged tables are for
caching or other data that you have no problem losing.

It also means that you can be sure the data you read from an unlogged table
is the data you put there. I much prefer this to the possibility of reading
corrupt data without any means of detecting it.

In my case, the use case for an unlogged table is a table that practically
works like a materialized view (until Postgres learns to refresh those
without an exclusive lock, and maybe even automatically as the underlying
data changes). If the table is empty, I can easily recreate the data, so I
don't care about WAL logging or replication. If the table's there, perfect:
I'll use it instead of a ton of joins and an equally big amount of
application logic.

If the data isn't there, well, then I run that application logic, which
causes the joins to happen. That's way, way slower, but it works the same.

The advantage of using an unlogged table here is that it's cheaper to update
it and it doesn't use replication traffic (unlogged tables are not
replicated).
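
The caching pattern described above can be sketched roughly like this (the
table and column names are illustrative, not from the thread):

    -- Unlogged: skips the WAL, so writes are cheap, the table is not
    -- replicated, and it is truncated to empty after a crash.
    CREATE UNLOGGED TABLE report_cache (
        report_id integer PRIMARY KEY,
        payload   hstore
    );

    -- Readers try the cache first; if the row (or the whole table, after
    -- a crash) is missing, run the expensive joins and re-insert.
    SELECT payload FROM report_cache WHERE report_id = 42;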

------
andreypopp
This is impressive but not really surprising given the quality of PostgreSQL
database. Quote:

"We added performance comparison with MongoDB. MongoDB is very slow on loading
data (slide 59) - 8 minutes vs 76s, seqscan speed is the same - about 1s,
index scan is very fast - 1ms vs 17 ms with GIN fast-scan patch. But we
managed to create new opclass (slides 61-62) for hstore using hashing of full-
paths concatenated with values and got 0.6ms, which is faster than mongodb !"

~~~
baudehlo
But note that this comes at the expense of a very large index file: 800MB
(vs Mongo's 100MB). On the other hand, Pg's index covers the entire JSON
structure, whereas Mongo's index only covers the leaves being searched, so
the Pg index will optimize a larger variety of queries.

~~~
eCa
Space is cheap. Time is expensive.

~~~
falcolas
It takes a non-zero amount of time to load large indexes into memory
(upwards of 5 seconds, depending on your choice of disks and arrays), and to
iterate through large indexes once they are in memory.

Such large indexes also limit how many of them can be held in memory at the
same time, making transactions against those indexes slower over time if
they have to be loaded and evicted frequently.

You should always choose indexes with care, because indexes cost not only
space but also time. And as you said, time is expensive.

~~~
mason55
_> It takes a non-zero amount of time to load large indexes into memory
(upwards of 5 seconds, depending on your choice of disks and arrays), and to
iterate through large indexes once they are in memory._

This is why MongoDB recommends that all indexes fit in memory.

------
jeffdavis
Great work and research; horrible submission headline.

The news here is that, in addition to the huge market for traditional DBs,
Postgres is going to compete in a serious way on MongoDB's home turf. As
that becomes more apparent, it will validate Postgres's
flexibility/adaptability and cast doubt on special-purpose database systems
and NoSQL.

MongoDB still has a story around clustering, of course. But that story is
somewhat mixed (as all DB clustering stories are); and postgres is not
standing still on that front, either.

(Disclaimer: I'm a member of the postgres community.)

~~~
craigching
Hey Jeff, I've been a PostgreSQL fan for a long time, but clustering is my
number one issue. What is coming down the pipe for clustering/replication in
PostgreSQL?

~~~
jeffdavis
Logical replication will be a big deal over the next couple of releases. I'm
not sure exactly what will be available when, but I know that good
foundations are being built now.

~~~
craigching
Jeff, first, thanks for the response and I'm heartened by what you say.

If I could make a suggestion: please look at how easy it is to configure
replication in MongoDB. If you make it that easy (even if it's just for the
JSON stores; I know it may not be feasible for relational tables), I will
move my product to PostgreSQL in a heartbeat.

I have used and promoted PostgreSQL throughout my career, but I'm currently
stuck with having to deliver HA in a product for redistribution to
customers, and Slony can't cut it for our use cases; it's too complicated. I
think the key is that we are a vendor and we need to provide a database with
our products, so it's not me who is responsible for configuring HA, it's our
users. And replication is what I'm concerned about 90% of the time, not
necessarily horizontal scale-out (i.e. we're not using sharding in 90% of
our cases; we just need HA).

~~~
jeffdavis
"I'm currently stuck with having to deliver HA in a product"

Have you looked at streaming replication in postgres?

[http://www.postgresql.org/docs/9.3/static/high-
availability....](http://www.postgresql.org/docs/9.3/static/high-
availability.html) [http://www.postgresql.org/docs/9.3/static/warm-
standby.html#...](http://www.postgresql.org/docs/9.3/static/warm-
standby.html#STREAMING-REPLICATION)

I'm not sure what level of expertise your customers are at, so this might
not work for you. But it seems like a better fit than Slony, at least.
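
For reference, a minimal 9.3 streaming-replication setup looks roughly like
this (the host and user names are placeholders):

    # postgresql.conf on the primary
    wal_level = hot_standby
    max_wal_senders = 3

    # postgresql.conf on the standby (allows read-only queries)
    hot_standby = on

    # recovery.conf on the standby
    standby_mode = 'on'
    primary_conninfo = 'host=primary.example.com user=replicator'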

------
integraton
See also the PostgreSQL as a Schemaless Database slides (includes several
benchmarks vs MongoDB): [http://thebuild.com/presentations/pg-as-nosql-pgday-
fosdem-2...](http://thebuild.com/presentations/pg-as-nosql-pgday-
fosdem-2013.pdf)

------
gibybo
Can I index fields within a JSON document in PostgreSQL? Can I query them?

I.e. can I do something like SELECT json_field FROM data WHERE
json_field.age > 15?

~~~
rpedela
That query would look like this (the ->> operator extracts the value as
text, so it needs a cast to compare numerically):

    SELECT json_field FROM data WHERE (json_field->>'age')::int > 15

Part of the performance increase for hstore comes from improvements to GIN
indexes, which according to the author can also be applied to json. So yes,
you can use indexes on your hstore or json documents.
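
As a rough sketch of how that looks with hstore in 9.3 (table and index
names are illustrative): a GIN index accelerates containment/equality
lookups, while a range predicate like age > 15 wants an expression btree
index instead:

    CREATE EXTENSION IF NOT EXISTS hstore;
    CREATE TABLE data (doc hstore);

    -- GIN index over all keys and values; used for containment queries:
    CREATE INDEX data_doc_gin ON data USING gin (doc);
    SELECT doc FROM data WHERE doc @> 'age=>15';

    -- For range predicates, index the extracted, casted value instead:
    CREATE INDEX data_age_idx ON data (((doc->'age')::int));
    SELECT doc FROM data WHERE (doc->'age')::int > 15;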

------
desireco42
Which is not surprising, as this is a 'real' database. I read a review of
MongoDB's internals and it was not very positive. While you can create a
NoSQL database quickly, doing it well takes time, which PostgreSQL has
definitely had.

Just to be clear, MongoDB is a fine db for certain scenarios and I am using
it in production.

~~~
fennecfoxen
The internals are so very slim... so I like this guy's advice for a Mongo
replacement that's almost as slim: [http://nyeggen.com/blog/2013/10/18/the-
genius-and-folly-of-m...](http://nyeggen.com/blog/2013/10/18/the-genius-and-
folly-of-mongodb/)

"ZFS has a heckuva caching strategy, and you _know when it accepts a write_."

:P

------
craigkerstiens
If you want more of the detail and background around this, which came from a
talk by the author of the blog post, you can check out his slides at
[http://www.sai.msu.su/~megera/postgres/talks/hstore-
dublin-2...](http://www.sai.msu.su/~megera/postgres/talks/hstore-
dublin-2013.pdf?utm_source=postgresweekly&utm_medium=email). One of the key
things not in the blog post but addressed in the slides is that there will
be a new index type that pushes the performance gains even further.

------
tete
We actually use both of them in our system.

I really wish PostgreSQL wasn't such an enormous challenge to scale
horizontally.

~~~
asdasf
What is challenging about it? If you make the same sacrifices you make to
use Mongo, it should be pretty straightforward.

~~~
tete
Well, if I make the same sacrifices I make to use Mongo, I don't really have
a reason to use PostgreSQL.

Seriously, though, I really like the recent enhancements in PostgreSQL.
Failover is nearly as easy as in MongoDB; however, all of this doesn't play
so nicely yet if you are using things like extensions (PostGIS) or your own
functions, and it still isn't really an out-of-the-box experience.

I agree that if you use it with the same limits and the performance gain is
worth it, you might as well be using Postgres. However, a lot of this only
changed in recent releases. It's all still fairly new.

Also, there are still a number of things that are basically missing, like
out-of-the-box upserts (we are using a function for this, but it's more of a
hack), and if you are still somewhat in development, a lot of little changes
get really hard in PostgreSQL. Converting your data structure, even with
things like CTEs and the surrounding functionality, can become really
challenging, especially when you think there must be an easier way.

While it is easier to modify structures in MongoDB, it is sometimes harder
to aggregate them. Using things like map/reduce (or even the lighter-weight
aggregation framework) frequently feels like overkill.

I think it really depends on the kind of data you are dealing with, though.
That's why we are using a hybrid system right now. Both systems are evolving
really quickly, and if you have the joy of using their most recent versions,
there is always something to be excited about in new releases.

~~~
joeshaw
About the upsert function: have you found a way to do this generically, or
are you generating (at least) one function per table from a template? I've
found the hackiness of this to be easily the worst part of using Postgres.
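
For what it's worth, the per-table variant is usually a retry loop like the
example in the Postgres docs; this sketch assumes a hypothetical
counters(key, value) table:

    CREATE FUNCTION upsert_counter(k text, v integer) RETURNS void AS $$
    BEGIN
        LOOP
            UPDATE counters SET value = v WHERE key = k;
            IF FOUND THEN
                RETURN;
            END IF;
            BEGIN
                INSERT INTO counters (key, value) VALUES (k, v);
                RETURN;
            EXCEPTION WHEN unique_violation THEN
                -- another transaction inserted concurrently; retry the UPDATE
            END;
        END LOOP;
    END;
    $$ LANGUAGE plpgsql;

Making it generic means building the statements dynamically with EXECUTE,
which is where the hackiness really sets in.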

------
eksith
Implementing our doc store on Postgres a while ago in lieu of switching to
MongoDB suddenly seems like a very good call.

We had nothing against Mongo (frankly, we hadn't done a deep analysis of how
it would have turned out). It was simply the db we already had at hand, and
we knew it well and trusted it.

------
gosukiwi
Postgres is quite scary, I feel that if I use it I'll miss 90% of the features
@_@

~~~
wmil
Don't worry about it, it's not like features have feelings.

There's always a fight between sticking to what your ORM supports and trying
to use every feature of the database.

Either can simplify your code under certain circumstances.

------
pstuart
According to the roadmap page, 9.4 is due out in the 3rd quarter of 2014.

------
roncohen
Fantastic!

No mention of arrays in the post, but in the slides: you can use {1, 2} syntax
for arrays and hstore now eats it \o/

------
andrewcooke
context: [http://www.postgresql.org/docs/9.1/static/textsearch-
indexes...](http://www.postgresql.org/docs/9.1/static/textsearch-indexes.html)
(GIN and GiST are index types)

------
willvarfar
It would be great if they could run these benchmarks against MongoDB using
TokuMX.

I think Tokutek's fractal trees would make an exciting data store for
PostgreSQL, but benchmarks are difficult to find... Who is perceived to be
faster these days, generally: MySQL+InnoDB, MySQL+TokuDB, or PostgreSQL?

------
lucisferre
Now if I could only get Mongoid to work with Postgres instead of Mongo I would
be happy.

------
andyl
Postgres Array/Hstore/JSON datatypes have made my Rails apps way simpler.
Great combination.

~~~
eksith
I'd love to know how you've set things up [ if you don't mind me prying ;) ].
It would be interesting to see someone else's approach.

~~~
andyl
What I did: 1) use the latest Postgres (9.3); 2) use Rails 4, which provides
String/Integer arrays, HStore, and UUIDs out of the box; 3) use post_json
([https://github.com/webnuts/post_json](https://github.com/webnuts/post_json)).

These tools made it simpler to do things like tagging, user-defined data
attributes, hierarchical data, and flexible configuration data.

It was always possible to store much of this data using text fields and
JSON.parse (or some such), but the new Postgres features make the flexible
datatypes queryable. And as built-in support for flexible data types comes
online, you can throw away your custom serializers.

There are other Postgres datatypes that I haven't used, for
geocoding/geosearching, network addresses, etc.

------
joshguthrie

        $ brew remove mongodb
    

Okay guys, now we're talking!

------
dkhenry
So a database developer is saying that his database is faster than a
competitor's database. Shocking.....

