Hacker News new | past | comments | ask | show | jobs | submit login
PostgreSQL as Schemaless Database [pdf] (postgresql.org)
247 points by xenator on Mar 31, 2013 | hide | past | web | favorite | 89 comments

Where is taligent when we need him? MongoDB is pretty impressive considering its age to the market, but I'd still go with the relational version of PostgreSQL anyday. Not because of the benchmarks, but in general, most use-cases are covered well enough by our relational db's like MySQL and PostgreSQL. The most expensive mistake one could make is using a NoSQL db where an SQL db fits perfectly. I admittedly made this mistake and learnt a LOT. Now, these benchmarks are just some extra spice for sticking with my lovely PostgresSQL :)

I think it's easy to forget that mongo did hit an under-served sweetspot when it turned up in 2008ish. But SSDs and the explosion in per-core RAM density have steadily rendered those advantages less compelling.

I imagine that as time goes on the PostgreSQL team will do their usual tortoise impression, steadily closing the distance on the NoSQL pioneers (and they deserve credit for reopening and re-exploring this chapter in database history).

You managed to choose the worst new database out there to call impressive :) Cassandra is impressive. Mongo? Not so much.

Why do you need me ? I am just that crazy troll who believes that no database is perfect for all use cases and that making blind generalisations like "all NoSQL is bad" is stupid. Especially since MongoDB is as different from Cassandra or Riak as it is from Spanner as that is from the other 120+ odd NoSQL databases.

And if benchmarks impress you I suggest you take a look at Cassandra sometime.


I keep hearing Cassandra touted as fast, but I just don't see it. Instacluster brags about 17ms at the 99th percentile being fast, but where I come from that gets you thrown out of the running.[1] The Netflix benchmark linked above sees 10,000 ops/sec/node. Again, not impressive.

Skip to slide 22 of this comparative benchmark with Couchbase Server[2] and you'll see that <5ms latencies at 99% is totally achievable. We can't take too much credit for these stats, most of the blame goes to the fact we use memcached at the front end, and it's been optimized for consistent low-latency for a long time.

I don't want to make this a speed contest, as there are a lot of good things about Cassandra. It just strikes me as odd that people sell Cassandra based on low latency. If you want consistent high-performance, don't use a database that runs on a language virtual machine. There are lots of great reasons to use Cassandra, standout benchmarks aren't one of them.

[1] https://www.instaclustr.com/solution#tco [2] http://www.slideshare.net/renatko/couchbase-performance-benc...

What else would you say is good about Cassandra?

I find the datamodel inflexible and often hard to work with for real world use cases. I think the tooling is immature and the hector API leaves me feeling depressed (pycassa and astyanx are quite a lot better).

All that said, I find cassandra to be fast, incredibly fast. 99th percentile on our 9Tb time series store is around 20ms on below par hardware.

Another thing i like is we've never had downtime, not even scheduled. We've had nodes fail, we've had datacentres isolated during disaster recovery tests, yet our cluster has continued on regardless (2 x sub clusters per region). Whatever way you stack it up, that's impressive. Remember with cassandra there's no load balancers or other shenanigans involved.

Garbage collection / virtual machine concerns are a red herring due to the mechanics of a cassandra query (certainly for CL < RF)

I'm glad you are happy with 20ms, to me it is slow...

Probably the best thing about Cassandra is that it is written in Java, so it is easy to fork and add custom behavior. Of course this is orthogonal to performance.

I only tagged you because you seem to have experience running Mongodb in production (as claimed by you in one of the previous threads) and I honestly wanted your feedback about this. Also I was curious to see if you would evaluate this nifty PostgreSQL feature :)

So, the problem that I have with PostgreSQL isn't that it doesn't have every datastore under the sun, but rather the lack of automated distribution, and fault-tolerance.

This is not a hard thing to build though (continue reading please). I don't mean that it's an easy task, but that the semantics by which you may be able to build a fault-tolerant, distributed database on top of Postgresql are pretty straightforward.

1. Partitioning: So, Postgresql has multiple notions of partitions. Ideally, this would be done at the database level in Postgres, but the idea is that instead of having your entire database fail at once, a transparent layer in which you SET PARTITION_KEY=X before querying would make this pretty straightforward. It would be nice if there was a semi-transparent proxy layer that this.

2. Replication / Fault-tolerance: Postgresql today has built-in synchronous master/slave replication. It would be nice if someone built a automated failover system on top this. Alternatively, it would be really interesting if someone had multiple postgresql instances, and then build a WAL log per partition, and performed 3PC between N replicas, and you could have a fully peer-to-peer postgres. I imagine these transactions would either need to be PL/SQL, or fully written-out, staged serializable SQL queries.

3. Co-processors: One of the biggest benefits a distributed, fault-tolerant Postgresql setup would give you is the ability to collocate your application with the data. Instead of having to do read, modify, write, you can write complex business logic that can run on whichever Postgres partition's master node, and the data doesn't have to go over the network.

I feel by introducing some basic wrappers around postgres to support partitioned, master-slave topologies, and perhaps a slightly different query interface to support this transparently, PostgreSQL could effectively replace most modern NoSQL databases.

Hot Standby in postgres is pretty damn good for replication (like you said making failover occur properly takes a little bit of effort but its not too hard).

Also pgpool2 helps a lot with partitioning and parallel queries.

For a relational db, postgres does a pretty good job at distribution and fault tolerance.

Postgres won't let you perform any sufficiently long (read-only) queries on slaves. For example, I was unable to run pg_dumpall on a slave to have DB snapshots.

This is [well-explained](http://www.postgresql.org/message-id/201102272005.00234.jens...), but still inconvenient. This way, slaves are only useful for failover and possibly offloading very short read-only transactions, which is pretty limited.

This used to be the case, but it's not anymore. You can run long-living queries, run pg_dump, etc on slaves.

Ah, that's really good news to hear. Seems it's time to upgrade. :)

Could you, please, provide a reference to a documentation or a changelog? Maybe I'm looking for a wrong keywords, but I can't find any mentions of such changes by myself.

Do you have a reference in which pg version this got fixed? We still pg_xlog_replay_pause() our slaves, it would be nice to drop that.

PostgreSQL today has synchronous and asynchronous master/slave replication built in. The dev team is working on master/master replication (or peer-to-peer, as you called it).

Then you may be interested in PostgreSQL-XC: http://postgres-xc.sourceforge.net/

It is transparent, consistent, load-balancing and fault-tolerant cluster. Though some of the features of PostgreSQL are not supported yet, eg. triggers.

This is pretty much the reason we're using Postgres for our document storage. Admittedly, our doc store isn't really extensive and it's mostly for internal use (with limited client access), but we just didn't see why it was necessary to go with a new database when the existing stack worked well.

YMMV, but we've yet to come across an instance where a pure document DB was necessary or desired, even to store documents (which from my perspective seems nothing more than a flattened EAV model).

Do you have any thoughts on expression indexes? Assumed I would have to pull out fields from the document to index reasonably on pgsql, but this presentation at least shows that simple expression indexes perform well. I wonder about compound, nested, etc.

I'm not too familiar with the current schema, but I do know we make extensive use of partial indexes (WHERE (vnnk IS NOT NULL), WHERE (fhs > 0) etc...) and compound indexes in btree. In GIN, for example, this isn't possible, but GIN itself works reasonably well for partial matches (since it's also btree) so it's not too much of a deal. We use a combination of hstore, GIN, conventional fields and multiple indexes with the schema abstracted out on our app to provide a document storage and search engine with a separate app for client login. I believe some metadata is stored in conventional fields (varchar integer etc...).

Performance compared to the conventional schema is of course noticably slower, but not unacceptably so for a few thousand docs and growing (this is compared to hundereds of millons of rows on the conventional schema).

But such is the sacrifice for reliability. When NASA hit a spat of failed missions to Mars (explosions, collisions, miscalculations etc...) people were joking about the creed at the time, "Faster, Better, Cheaper; Pick 2".

This kind of engineering is a turn off for a lot of folks, so I can certainly see the appeal of a document DB. I've also seen people fall into the trap of "I don't know what kind of data I'll be getting so I'll go with a doc store". Well, there has to be some hint of what kind of data could be arriving so engineering for that with some wiggle room early is a good idea.

You shouldn't take my word or the slides at face value, though. Really think through, experiment, throw stones at it etc... and see if it works for you.

Reliability is gold.

Thanks for the pointers. I will read up on GIN and hstore.

If nothing else, duplicating an element from a doc column to a first class field seems reasonable. Also, even in mongo one must define indexes which is a schema of sorts and can be non trivial to change.

In addition to this, we have found that the JSON field - PLV8 - JavaScript user-defined function - function-based index combination to work and perform really well. As the parent said, is not as fast as a full relational schematic but it's entirely acceptable. We try to stick to relational where speed and/or referential integrity is crucial and just use JSON fields for everything else.

One approach to expression indexes on JSON would be a multicolumn index. Column A is the path, e.g. "foo.bar[2].baz" and column B is the value at that path... of course you'd need to use GIN -- which as mentioned already is slow to build -- since one row would map to multiple index entries. But this ought to be a fairly workable approach.

Really great presentation/slidedeck.

For me I still feel MongoDB is a semi niche solution. I rather go with a standard SQL database unless there is a specific reason why you need NoSQL.

I've tried both MongoDB+Mongoid and Postgresql+hstore as a way to do better inheritance than STI in ActiveRecord.

Both have deficiencies: Mongo is not mature enough for an old lag like me to trust (and is incompatible with some Rails extensions), and hstore had problems with key deletion and cannot do object references or indeed types other than strings.

Can you elaborate on what you mean by key deletion problems?

First thing I remembered after reading the title is http://pgsnake.blogspot.in/2010/04/postgres-91-release-theme...

haha, thanks for posting that. i enjoyed:

"It's time for us to switch to something fresher. I personally would have preferred XSLT, but QUEL is almost as good."

The new JSON functions have landed in the development version, for anyone interested in seeing the future syntax: http://www.postgresql.org/docs/devel/static/functions-json.h...

Damn. The "XML. It seemed a good idea at the time" slide caused me to spray coffee on my laptop.

See also "XML literals as a native syntax in Scala."

I want to admit that conclusion in presentation can be interpreted vise versa: MongoDB almost is fast as PostgreSQL. Also, notice that comparsion was done with Mongo 2.2, in 2.4 JS engine was changed to V8, so there is a lot of speed improvements. And compare PG have 18 years of development vs 5 of MongoDB. So what I want to say? Difference is not so big after all.

PostgreSQL hasn't spent 18 years pitching itself as a pure performance play, though. It started as a research project into the object-relational model and was adapted to be a safe and featuresome database first and performant second.

Mongo's performance will converge downwards as people demand more features and safety guarantees.

If however you want the original premise -- memory-only, schemaless, auto-sharded document database -- Mongo is still the easier pick.

It's a cliche, but it's easy to:

1. overestimate the size of your data, (especially since relational models by design squeeze out all duplicated statements of data)

2. overestimate how much traffic they need to serve

3. underestimate what can be achieved by stock RDBMSes with modern hardware

4. underestimate the deferred development costs of schemaless data and

5. underestimate the costs of avoiding durable storage.

It's a really hard field to make generalisations about right at the moment. The algo-economics of hardware (especially SSDs) and software have been changing so fast that any generalisation formed even 2 years ago is unreliable.

Speaking only for myself, I need a lot of talking down from my data-safety/data-integrity tree. I don't trust myself. That's why I use tools that slow me down.

"1. overestimate the size of your data, (especially since relational models by design squeeze out all duplicated statements of data)"

Of course, this is opt-in in a SQL world, more's the pity. I've seen some nightmarish ActiveRecord spawned Postgres monstrosities.

I wonder how much of the attraction of schema-less DBs comes from the lack of friction between the largely guarantee-free idea of data storage precipitated by Rails misfeatures -- there's much less distance between a MySQL or Postgres database as "designed" by ActiveRecord and a NoSQL instance. Having already opted out of your database's particular guarantees, there's very little keeping you there.

Here's a question for people who prefer to put their guarantees (i.e. unique/check constraints, foreign keys etc) in their database. How do you present these constraints to the user? How do you show when they've been violated? Just present them with the PG error message? Maintain a mapping of PG errors to more domain specific messages? Bite the bullet and just code them twice (once for the DB, and once for the user)?

Don't think if these guarantees as your primary data entry validation (though they are a stopgap for mistakes there). Your app will (ideally) never encounter violations; if it does, it means it's doing something wrong and should probably bail out with a "I am confused, please have my developers fix me" type of error.

Good question, and one that could use some discussion (e.g. in blog posts). Perhaps I'll write one.

I think some redundancy is a practical necessity, regardless of your approach (using database constraints or only application code). Consider registering a unique username: even if you check for availability beforehand, you still need to handle the race condition. That's guaranteed to be a somewhat awkward user interaction because they thought it was available but you found out that it was taken by a concurrent user.

So, your application always needs to have a nice way to avoid errors when they can be caught early (e.g. javascript check while filling out the form), and a less-nice way of handling errors when they can't (e.g. "sorry, that name has already been registered" after submitting the form).

There are SQL-standard codes for many constraint errors, so those can be turned into exceptions.

For things like CHECK constraints, think of those more like an assert: intentionally redundant.

Other code-generating tools build from the schema outwards.

So for example, my professional life is currently about an obscure system called Oracle Application Express. You point ApEx at a schema and it will auto generate forms, reports, validations and so forth by introspecting on the database.

Rails used to do something like this: you would point it at a database and it would generate code based on a superficial reading of the schema. Certainly nothing to do with validations etc.

One could build out from the database schema -- Postgres certainly provides sufficient type information to build consistency double-checks in other layers of the application. As an added bonus, other consumers of the data won't be flying blind.

Yeah, I'm trying to build something like that right now. Nice to know others have similar ideas.

Don't remember exactly when, probably is was in time when multicore hosting servers becomes relatively cheap. We made measurement MySQL vs PostgreSQL and surpisely found that PG is visible faster. Don't remember exactly numbers. So even if team never push server as supa-pupa-fast on the market they was neat choice for people who know.

But the same time, even if PG documentation can be used as textbook for RDB theory in high schools I always hear from other people that it is boring. And I agree with that. SQL in general and PG interfaces in particular is just "so old school". MongoDB looks more friendly.

Right, so MongoDB goes back to my pet theory about how reward schedules shape technology selection.

It takes very little upfront to get data into or out of a MongoDB instance. Import the relevant module, name the server, persist. Blam, done, no need to worry about how to map an object model to a relational model or vice versa. Here kid, have a dopamine hit.

Meanwhile, a relational database imposes a schema on you upfront.

It seems harder because there's a minimal standard you must meet first.

It continues to seem harder because the database will insist on conformance to that schema rather than flexing to accomodate new cases.

And lastly, it seems harder because when you try to change the model, a relational database will insist that the new schema is consistent with data stored under the old schema (or that you provide a consistent way to update it).

Mongo and other document DBs don't do this. They just let you do what you want to do.

The problem is that eventually, stuff is done wrong. And you wind up with a document database full of corner cases. And you need to write complex query and load code to deal with all the quirks and historical mistakes and errors that crept in previously.

To summarise:

Loosey-goosey systems reward you now and punish you later.

Strict systems punish you now to prevent future punishment.

Humans are bad are trading off such different reward schedules. We hyperbolically discount future rewards and we are dreadful at accounting for avoided costs.

The only reason relational systems got a foothold in the first place was because the people who chose them knew from personal experience how much hierarchical-model and network-model databases could suck.

Care to elaborate on your last sentence? I'm really curious about some corners of software history and this seems like a hint of a corner I never knew existed...

Basically, hierarchical and network databases were around before relational databases.

Both of them have the problem that they require a lot of up-front planning. When you build a hierarchy, you need to know how it's going to be used as that will determine the structure. Turning hierarchies into networks -- ie trees into graphs -- eases this a bit but you still have the problem of arranging things for easy navigation to nodes.

Say you build a project management system. Under a hierarchical system you might build it around the central concept of a Work Breakdown Structure. There are standard ways to produce these and they have a strictly hierarchical structure.

But here comes trouble, because now you've privileged the WBS over all other views into the model. So if for instance you're interested in an activity-based breakdown (percentage $ spent on documentation vs manufacturing etc), you're going to have to walk the entire WBS to work it out. Lucky you.

If you've ever queried an XML document in a way it wasn't originally set up for, this will seem familiar.

When relational databases introduced joins, lots of heads exploded. Pretty much anything that can be logically derived from relations can be expressed with joins, projections etc, without having to modify the database structure to accomodate those queries.

Now you can create a schema that relates WBS items to activities to cost accounts to whatever else. Now you can query them in any direction you like. Relational algebra means that the database can provably answer questions without having to waste time walking through a graph or visiting, potentially, ever node.

The historical systems to look for are IMS (the original hierarchical system) and CODASYL (a standard for network database programming).

Though interestingly, for quite a while the one thing hierarchical database were good at was the one thing RDBMS was bad at: reconstructing an arbitrary-depth hierarchy. Take the classic "bill of materials"... easy in a hierarchical database, but in an RDBMS you have to recursively query to walk the hierarchy, or use special support like Oracle's "CONNECT BY PRIOR"

I actually read a paper a while back describing how to do trees in an RDBMS by encoding the parent hierarchy of a row in a string primary key or something like that. To get everything in a subtree, just check for keys with a certain prefix. I can't seem to find it now, though.

EDIT: should have gone to HNSearch first. Here's the comment where I found it: https://news.ycombinator.com/item?id=4921541 . Specifically, http://www.tetilab.com/roberto/pgsql/postgres-trees.pdf

EDIT 2: tetilab link seems broken. Try this: http://darwin.zoology.gla.ac.uk/~rpage/MyToL/www/downloads/t...

Alternatively, you can use nested sets [1], optionally storing the hierarchy in a different table. Unless you're really interested in good write performance.

1: http://en.wikipedia.org/wiki/Nested_set_model

Kudos for the history lesson. I can't believe IBM still has a market for IMS (!?!), with all the nice ways of combining SQL and "NoSQL" nowadays... Are they hypnotizing their clients or what?

I think you missed a word. "Strict systems punish you now to prevent POTENTIAL future punishment". and that makes a big difference.

I'm not sure why I'm getting downvoted. Many of us are doing startups. Punishing me now can have significant business impact, because it basically means slower development. In many startups, that future punishment may never come!

Also, I've spent over two years working with a schema-less database and we don't have the corner cases the OP is mentioning. Most of them are rather easy to avoid.

I suppose the takeaway is meant to be: why bother having two separate data storage systems (relational + schemaless) if only one will do? I mean, if Mongo isn't offering much (if any) performance gain or schemaless flexibility... what's the advantage? With Postgres, you can have your cake and eat it: your transactional data gets the benefits of a strict schema and referential integrity, but you also get an "escape hatch" out of the relational model for when that is needed. (And 18 years of proven code, deployed at massive scale, that definitely isn't going to lose your data.)

Side note: what's with Mongo using so much storage? Perhaps an append-only log without garbage collection?

I'd always head the performace pitch for Mongo being "slightly slower on a single machine, but way easier to scale out horizontally"

>I want to admit that conclusion in presentation can be interpreted vise versa: MongoDB almost is fast as PostgreSQL.

Which given how little MongoDB does, and how it can lose data unless you cripple the performance even more (ie. with waiting for successful write status), is impressive in itself.

>Also, notice that comparsion was done with Mongo 2.2, in 2.4 JS engine was changed to V8, so there is a lot of speed improvements.

Not really. The JS engine was never the bottleneck in MongoDB. The demanding stuff is done in the C/C++ core.

>And compare PG have 18 years of development vs 5 of MongoDB.

Not really relevant. PG had 18 years of development implementing hundreds of things MongoDB doesn't do, not merely doing the same schema-less stuff and tuning it for speed.

It's only one of several interesting comparisons in the slides, but I think we need more data points on the disk space to make use of the data. Does it grow linearly? probably, but is there a +x in there? Mongo has some big upfront disk allocations. Were those considered somehow?

When the benchmark shows postgres can do the same as mongo, only faster, then why aren't we seeing more hybrid solutions? Online retailers would be better off using relational storage for monetary transactions and time series, and json storage for document like data.

The great thing about PostgreSQL is that it is hybrid.

With http://www.postgresql.org/docs/9.1/static/hstore.html you can put schemaless documents into the same database, even into the same table row as your usual relational data.

You could start with a schema that's only a primary key and hstore for all other data (like a fully schemaless database), but over time extract important fields from the hstore into columns to add indexes, constraints or joins.

That is... I am not sure, either madness or genius.

But an intriguing idea either way.

Maybe sharding?

I am always wondering why no-sql has better performance than SQL if we have same simple use case (no join etc.).

If the data is well structured and no join is needed, can I assume that relational database should have better performance than no-SQL?

Depends what you mean by "NoSQL". MongoDB is pretty terrible overall, so don't even bother.

Systems like Dynamo/Riak/Cassandra sacrifice consistency (most of the time) for performance, largely because a distributed setup (sharding in particular) is easier when you don't guarantee consistency. Note that none of them sacrifice durability.

It is possible to have a distributed and consistent database, but you either have a bottleneck (master nodes in BigTable/HBase), require significant infrastructure (Google Spanner) or make schema changes very hard (Hyperdex). It's a hard problem.

Thanks for your reply.

But I have one more question, if I need to manage bank account, then the database must be consistent. Lets say that I just need to keep balance and social security number for the account, and only need to query the balance. Can No-SQL like redis have better performance than relational database? (assume same on logs and transactions).

Redis is consistent, it's just less durable than Postgres. A query cache is a very good usage for it. It's also likely to be faster than Postgres for the simple operations it is capable of.

The only downside is that (for now at least) it is single node, with optional replica slave(s).

There is still some way to go to support better indexing of json fields, but I'm pretty excited about these developments. In the future you might see schemaless and schema driven data living side by side on postgres.

Would be interesting to know which version of MongoDB was used (2.4 also uses V8).

The original slides are at: https://wiki.postgresql.org/images/b/b4/Pg-as-nosql-pgday-fo...

There is also another presentation on the use of PL/V8 (on Heroku) at the PostgreSQL wiki - from last week's NYC talks: http://plv8-talk.herokuapp.com/#1

In one slide it says he used MongoDB 2.2 on his mac.

Sadly it's not a fair fight, of course it'll beat Mongo (even though Postgres is ACID).

I would be more interested in comparing the biggest machine you can get on AWS running Postgres with a cluster of smaller machines running Cassandra. I expect Cassandra to win, but of course it's not consistent.

Well that's Monday's afternoon decided !

Almost 6,000 synchronous commits per second to a 7,200 rpm disk seems high to me.

Anyone seeing those numbers for any database that synchronously writes the log to disk?

While very impressive, I don't see how your post is applicable in this case. The OP at the very least implies that the transaction are done end to end and not batched. I find 6,000 distinct disk I/O operations per second, to append the log, very very high for a 7.2k disk.

that post, also discusses using a 7200rpm disk.

I think where you may be confused is considering them as distinct IO operations, vs seeing it as a point which must have been flushed past ( sequentially )

No, I don't think I'm confused. The OP says postgresql is used in its default configuration:

Stock PostgreSQL 9.2.2, from source. No changes to postgresql.conf.

Given that statement then either the writes are serialized or postgresql is not ACID compliant in its default configuration. I'm not an expert on postgresql but I assume that it is ACID compliant in its default configuration. Therefore my skepticism on the 6K writes per second.

from the link: "In Postgres 9.2, this improvement automatically becomes available without any further configuration."

"Essentially, it accomplishes this by reducing the lock contention surrounding an internal lock called WALWriteLock. When an individual backend/connection holds this lock, it is empowered to write WAL into wal_buffers, an area of shared memory that temporarily holds WAL until it is written, and ultimately flushed to persistent storage."

"With this patch, we don’t have the backends queue up for the WALWriteLock to write their WAL as before. Rather, they either immediately obtain the WALWriteLock, or else queue up for it. However, when the lock becomes available, no waiting backend actually immediately acquires the lock. Rather, each backend once again checks if WAL has been flushed up to the LSN that the transaction being committed needs to be flushed up to. Oftentimes, they will find that this has happened, and will be able to simply fastpath out of the function that ensures that WAL is flushed (a call to that function is required to honour transactional semantics). In fact, it is expected that only a small minority of backends (one at a time, dubbed “the leader”) will actually ever go through with flushing WAL. In this manner, we batch commits, resulting in a really large increase in throughput..."

I am sorry to cut and paste so much of the article, but I hope this is helpful?

Looking at your linked posting more closely I do not believe it is applicable to this situation for two reasons. Again I'm not an expert on postgresql but:

1) It appears to me that the technique discussed in your linked post is about ganging unrelated transactions together into a single flush to disk. I do not see that to be the case in the OP. Since all of the writes are going to the same table they are related.

2) Looking at the second graph in the linked post pgbench transactions/sec insert.sql. The number of clients is high. I got the impression from the OP that there was only a single client. Indeed if there were more than a few clients the benchmark would have been subject to the deficiencies of the client libraries used.

you are good to be suspicious into any benchmark(as even the original slides note). I dont think we really have any information as it relates to transaction boundaries present, nor clients used:

From the slides: "Scripts read a CSV file, parse it into the appropriate format, INSERT it into the database. • We measure total load time, including parsing time. • (COPY will be much much much faster.) • mongoimport too, most likely."

Postgres does its best to use intelligent defaults, but it is only a part of the system, it is generally up to practioners to be wary. Tools like: http://www.postgresql.org/docs/current/static/pgtestfsync.ht... Assist in this goal, but generally have to watch out for things like raid controllers that are not battery backed etc ( depending upon your environment).

There is no mention of the number of clients used directly.

I was simply trying to highlight that with the (little) information available, it is very possible.

It is also possible, that the session(s) should themselves choose to pursue an asynchronous commit strategy ( on a session level see:http://www.postgresql.org/docs/9.2/static/wal-async-commit.h... ) which would also not require modifying the configuration, I do not know as I have not seen the scripts, but it is similar to how a library could interact with:



Nice discussion, thank you.

The problem with this scheme is that Postgres does not support JSON indexing and partial updates which are important for high transaction databases.

When I checked hstore's functionality the last time (~6 months ago) it was rather limited compared to Mongo's full-blown API. Any new observations?

MongoDB's native storage type is not JSON, but BSON. One should know this if they're about to wade in a deep river, trying to argue against it.

would it be possible to build an extension to expose a mongo compatible interface in postgres - essentially being able to use all the ORM/code built around mongo for postgres, seamlessly.

Yes, but why bother? There are already so many good SQL clients out there, like SQLAlchemy.

there are many apps built around mongodb that it would be interesting to do a drop-in swap with postgres (assuming protocol compatibility).

hell, ill personally pay a lot to have a mysql compliant interface just for WordPress!

I see, that motivation makes more sense to me :-)

I guess it should be possible to write a wrapper that executed at isolation level "single".

Yes, I don't think it would even be that difficult if you used plv8js, but it would (probably?) be relatively slow.

Oh. This is legit?

I skipped this story because I assumed it was an April Fools' prank.

What I get from these slides is that MongoDB has better average performance (when both using indexes or doing full scans) compared to all schemaless PostgreSQL options.

sorry but the first thing I readt was "PostgreSQL the shemales database". Confusing ..

Same here, don't feel bad.

And to the downvoters, the correct word is aschematic or schema-less with a dash.

If a word ends in a vowel, you must use a dash.

Hatless is fine, but scrupleless should be scruple-less. The more you know, you scruple-less bastards!

> If a word ends in a vowel, you must use a dash.

Feeling prescriptive these days?

Registration is open for Startup School 2019. Classes start July 22nd.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact