Hacker News new | comments | ask | show | jobs | submit login

This is wonderful work, and it is the foundation for yet more wonderful work in the future. However, as much as i am a huge PostgreSQL fanboy and a dedicated MongoDB peanut-thrower, i worry that the caption applied to this post is misleading.

Once this work is released, PostgreSQL will be faster than current versions of MongoDB at carrying out queries for documents which look for a given value at a given path (or perhaps more generally, values satisfying a given predicate at paths satisfying a given predicate).

But that has never been one of MongoDB's important strengths, has it? The operations for which MongoDB is optimised are inserting documents, retrieving document by ID, and conducting map-reduces across the whole database. Yes, it has the ability to query by values in leaves, and it can use indices to make that faster, but that's a bit of a second-tier, bag-on-the-side feature, isn't it? If that was the main thing you needed to do, you wouldn't have chosen MongoDB, right?

You can insert and retrieve by ID just as fast with PostgreSQL.

Map/reduce? Isn't that slow in general? Can't PostgreSQL JOIN and aggregate functions do most of what map/reduce is for but faster?

You can insert and retrieve by ID just as fast with PostreSQL.

How can postgres possible be faster on insert? Unlike Mongo, Postgres actually needs to write bytes to a disk (with a WAL!).

Even modest db work usually involves concurrency.

Mongo locks. Postgres rocks. [1]

1: http://www.postgresql.org/docs/current/static/mvcc-intro.htm...

> Unlike Mongo, Postgres actually needs to write bytes to a disk (with a WAL!).

Not with unlogged tables it does not!

Or just a single transaction with synchronous_commit!

> Specifies whether transaction commit will wait for WAL records to be written to disk before the command returns a "success" indication to the client.

> This parameter can be changed at any time; the behavior for any one transaction is determined by the setting in effect when it commits.


And you can set fsync off for extra yolo!

Postgres also has stupid mode option. (unlogged)

The feature works per table and is off by default. Unlogged tables have another nice property: If the data gets corrupted somehow, then the table will be completely empty. This makes it clear to users that you'd use unlogged tables for caching or other data that you have no problem with losing.

It also means that you can be sure that the data you read from an unlogged table is that data that you put there. I much prefer this over the possibility of maybe reading corrupt data without any possibility of detection.

In my case, the use-case of an unlogged table is a table that practically works like a materialized view (until postgres learns to refresh them without an exclusive lock and maybe even automatically as the underlying data changes). If the table is empty, I can easily recreate the data, so I don't care about any kind of WAL logging or replication. If the table's there, perfect. I'll use it instead of a ton of joins and an equally big amount of application logic.

If the data isn't there, well. Then I run that application logic which will cause the joins to happen. So that's way, way slower, but it'll work the same.

The advantage of using an unlogged table here is that it's cheaper to update it and it doesn't use replication traffic (unlogged tables are not replicated).

I think it's been referred to as 'running with scissors' mode actually.

You can also tune the fsync option to get mongo-like performance and reliability.

It isn't turned on by default just like it was in MongoDB initially.

It is not stupid it actually makes sense for ETL workflows where you load raw data into unlogged tables and transform them in the database into the final (logged) representation.

>Unlike Mongo, Postgres actually needs to write bytes to a disk

I'm now immune to MongoDB bashing now but jesus keep up with updates. This isn't true as of maybe two years ago.

> I'm now immune to MongoDB bashing now but jesus keep up with updates.

We do want Jesus to keep up with the updates, for sure, but this isn't just about a random slip-up, goes a bit deeper than that. This wasn't an unfortunate bug, "oh we meant to do all we could to keep your data safe" it was a deliberate design decision.

Dishonesty and shadiness was the problem. When finally they started shipping with better defaults, did we see an apology? Was anyone responsible terminated? Nope. Don't remember.

Now, it could all be a statistics fluke and people randomly all decided to bash MongoDB, say, more than Basho, Cassandra, PostgreSQL, HBase and other database products? Odd isn't it. Or, isn't it more rational, that it isn't quite random, there is a reason for it.

When developers sell to developers, it is expected a certain level of honesty. If I claim I am building a credit card processing API but I deliberately add a feature that randomly double charges to gain extra performance, you should be very upset about. Yes, it is in the documentation, on page 156, that says in fine print "oh yeah enable safe transaction, should have read that". And you as a result should probably never buy or use my product and stop trusting me with credit card processing from them on.

The bottom line is, they are probably good guys to have beers with but I wouldn't trust them with storing data.

Your analogy rests on the absurd assumption that mission critical components don't need to be fully vetted and understood.

Shady? Yes. Dishonest? No. To be sure, both of these attributes suck, but beyond that there is an absolute world of difference between the two. One passively leaves knowledge gaps that even novice consumer programmers would uncover (the source is wide open after all, and speaking from personal experience, any claims of hyper performance should arouse enough suspicion to inspire some basic research), while the other asserts no knowledge gaps and actively inhibits knowledge discovery.

That said, they fell very far form the ideal, and for that they should be faulted. They're now paying the price by living in a dense fog of distaste and disgust. I don't think anyone believes it's a "statistics fluke" but it's a likely that a mob/tribal mentality plays some role (this is an industry notoriously susceptible to cargo cults) and that many individuals on the burn-MongoDB-at-the-stake bandwagon do not actually have deep--or moderate!--experience with it and competing products. Also, would you chalk it up to statistical anamoly that they were able to close funding and that many large scale systems use their product?

Isn't it possible that the company had shameful marketing practices that repel the programmer psyche, but that the product itself has a unique enough featureset (schemaless yet semi-structured querying, indexing, dead simple partitioning and replication, non-proprietary management language, simple backups, all at the cost of RAM/SSD) to make it compelling for some problem classes?

Well now you are assuming the documentation is telling the truth. The reality is, things still don't go to disk as often as you think.


Unacknowledged writes are disturbing but this can happen in any system: data hits the disk, and just before the server can send an ack to the client, it crashes.

That same blog notes Postgres is susceptible to the same issue: http://aphyr.com/posts/282-call-me-maybe-postgres

When map/reduce comes into play, speed isn't usually the main issue. It's a way to do work in parallel - which may be done to increase speed, but more often than not it's done because there's too much work to be done any other way.

Put another way, I see map/reduce as a background task - building indexes, reporting, etc. I don't think speed is as important there as it is for a foreground task like GET, PUT and DELETE.

For these operations, especially by key, Postgres has always been right up there - what's happening now is that Postgres is getting really good at NoSQL features - documents, key value pairs, deep querying etc - good enough that if you have Postgres you can do pretty much whatever you could have done with Mongo - while still making use of everything else it has to offer.

I don't think Mongo map-reduce is parallel for a given instance.

Mongo clusters distribute map reduce so it can be reasonably fast.

You mischaracterize one of MongoDB's main draws. It's not just indexing, but indexing over a schemaless dataset. Mongo isn't unique in providing this, but combined with other features it is compelling for certain scenarios.

That said, if Postgres matches/exceeds enough Mongo features, it could certainly outshine Mongo for those scenarios.

AFAIK, map-reduce is relatively expensive on Mongo... typically recommended only for background execution, nothing synchronous.

I've harped on this many times in the past. I love PostgreSQL, but where is its easy to replicate feature that matches MongoDB for ease of replication? Please answer "you must not be keeping up with PostgreSQL lately" and I'll eat my words!

Easy replication ... that's why I use MongoDB.

Setting up a new PostgreSQL slave is just a one-liner with pg_basebackup in 9.3 (no outside tools needed):


IMHO this seems like building a better apple by making it taste like an orange. This could definitely be useful for storing json from javascript frameworks like backbone/angular/ember, but I don't see how this is going to improve the lives of PostgreSQL users in general. Hopefully this leads to some weird cool emergent use case that make PostgreSQL way better than other databases, but I'm not sure at this point that it will. Does anybody have more context on the long-game for this approach? I'm not sure what the goal is for adding this feature.

Look at some of the linked PDFs. It's the indexing technique and on-disk format that is an enabler for several notable use cases; personally, I'm looking forward to a time when I can use pg for full-text search without having to resort to another external system like Lucene or Sphinx. What it won't have is the query "smarts" like Lucene or Sphinx do today, but you need speed first.

Pg's built in full text search works amazingly well and has all the smarts. It's also faster than lucene.

Huh? Please come up with about ten references (one isn't enough) to back up those claims.

I do not think so, but it might be after the GIN optimization work is committed. After it they beat Sphinx in several benchmarks.


I didn't say it was faster than sphinx (yet) but it is definitely faster than lucene in all the benchmarks I've thrown at it.

I'm very excited to put my 3m documents into hstore, as they are currently in a horrible, unqueryable, unwritable (mostly) blobs of compressed, b64encoded serialized ruby hash structures (long story, predating nosql).

This makes me very, very happy.


I should point out why: -my document structures pretty much match 1:1 with hstore now (we have nested arrays of hashes) -i can query & report on the data with normal tools -i can update individual fields on my data (currently impossible) -i can do relational joins & such-like in the db (currently impossible) -i can have sensible transactions

The title will be changed soon. Don't worry.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact