While I understand it is tempting to conclude that this is apples-to-oranges, or that "obviously a relational database is better at data relationships", I recommend diving deeper into the topic.
First of all, almost all data models you can think of are relational in one way or another (call it "related data", if you prefer). So any database, RDBMS or NoSQL or otherwise, will need to deal with data that is related to other bits of data. From this perspective, the comparison is very much apples-to-apples, IMHO.
Second, and regardless of the above, the exercise here is to model a dataset that necessarily consists of two (or three) pieces of related data. And we considered exactly the two models that MongoDB themselves propose for data modelling: embedded and referenced. This is explained in the post, and references to the official MongoDB documentation are provided.
Third: the benchmark is open and public. The repository also contains all the query examples used in the post. If anyone believes there is a better way to represent this data model (problem statement) that still answers all the "business" questions (queries) considered in the post, please send a Merge Request to the repository. We will be very happy to analyze it.
All in all, this is an exercise in data modeling for a given dataset. MongoDB requires data modeling; it is not "schema-less", and this is pervasive throughout their official documentation. The referenced model, and its $lookup operator, are neither a corner case nor a hidden, downplayed option: $lookup is a first-class, well-advertised operation. What is not advertised, based on our observations, is its complexity, its performance, and the query rewriting required to adapt to changes.
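For illustration, a minimal sketch of the referenced model and a $lookup join, using pymongo and made-up collection and field names (employees referencing departments), not the exact schema from the post:

    # Referenced model: each employee stores a department_id; the join happens
    # at query time with $lookup. Collection and field names are assumptions.
    from pymongo import MongoClient

    db = MongoClient()["hr"]

    pipeline = [
        {"$lookup": {
            "from": "departments",         # collection to join against
            "localField": "department_id",
            "foreignField": "_id",
            "as": "department",
        }},
        {"$unwind": "$department"},        # one joined department per employee
    ]

    for doc in db.employees.aggregate(pipeline):
        print(doc["name"], doc["department"]["name"])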
If your data is relatively freeform with good indices on stable fields, mongo is hugely convenient, and can be easily structured to scale horizontally. For non-join-based aggregation (i.e., aggregation within a collection), mongo is very performant, and works fine for a lot of use cases. I'm using it to store large amounts of simple time-series data and it's great, largely because my use case doesn't require more complex data structures.
As soon as you want to deal with more complex structured data, mongo is worse than any RDBMS. The only viable noSQL solution to complex structures is denormalization, which is sometimes a legit choice. But unless you know your primary use is just retrieving a set of documents based on indexed fields (and many are), get an RDBMS.
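As an aside, a rough sketch of the kind of within-collection aggregation mentioned above, with made-up time-series fields (readings with a sensor_id, timestamp and value); no $lookup involved:

    # Single-collection aggregation: hourly averages per sensor.
    # Collection and field names are assumptions for illustration.
    from pymongo import MongoClient

    db = MongoClient()["metrics"]

    pipeline = [
        {"$match": {"sensor_id": "s-42"}},
        {"$group": {
            "_id": {"$dateToString": {"format": "%Y-%m-%dT%H:00", "date": "$ts"}},
            "avg_value": {"$avg": "$value"},
            "samples": {"$sum": 1},
        }},
        {"$sort": {"_id": 1}},
    ]

    for bucket in db.readings.aggregate(pipeline):
        print(bucket)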
MongoDB has built-in multi-instance sharding support. Again, this is possible in PostgreSQL with table partitioning and FDWs, but it is hard to maintain.
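For reference, a rough sketch of what that partitioning + FDW approach looks like in PostgreSQL (server, table and column names are made up, and the matching table must already exist on the remote shard):

    # Each hash partition of a local partitioned table is a postgres_fdw
    # foreign table, so its rows physically live on a remote server.
    import psycopg2

    ddl = """
    CREATE EXTENSION IF NOT EXISTS postgres_fdw;

    CREATE SERVER shard1 FOREIGN DATA WRAPPER postgres_fdw
        OPTIONS (host 'shard1.internal', dbname 'app');
    CREATE USER MAPPING FOR CURRENT_USER SERVER shard1
        OPTIONS (user 'app', password 'secret');

    CREATE TABLE events (
        tenant_id int NOT NULL,
        payload   jsonb
    ) PARTITION BY HASH (tenant_id);

    -- This partition is a foreign table: a matching 'events_p0' must exist on shard1.
    CREATE FOREIGN TABLE events_p0
        PARTITION OF events FOR VALUES WITH (MODULUS 4, REMAINDER 0)
        SERVER shard1 OPTIONS (table_name 'events_p0');
    """

    with psycopg2.connect("dbname=app") as conn:
        conn.cursor().execute(ddl)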
PostgreSQL documentation makes it quite clear how to set up HA, I don't think you need any third-party tools other than general OS-level support for failover.
From the official docs:
>> PostgreSQL does not provide the system software required to identify a failure on the primary and notify the standby database server. Many such tools exist and are well integrated with the operating system facilities required for successful failover, such as IP address migration. 
Postgres HA is far from easy. I gave a talk 2 years ago about this (video and slides: https://aht.es/#talks-modern_postgresql_high_availability). This is an operational disadvantage.
(1) What's faster than a fast join? No join. With nosql you have more flexibility to store the data organized in the same way it is accessed. You don't need to join to another table if the data your app needs is already directly part of the main data the app requests. You do need to understand your access patterns well, and develop migration plans when they change (a sketch follows below).
(2) scalability. nosql databases generally let you scale horizontally more easily/gracefully than relational databases, though there are trade offs.
Put it together and you get performance at scale, though you generally need to understand your data access patterns and usually need to be more resilient to inconsistencies.
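A minimal sketch of point (1), with made-up collection and field names: the department data is embedded in the employee document, so one indexed read returns everything and no join/$lookup is needed.

    from pymongo import MongoClient

    db = MongoClient()["hr"]
    db.employees.create_index("email")

    db.employees.insert_one({
        "email": "ada@example.com",
        "name": "Ada",
        # Embedded, not referenced: stored the way it is accessed.
        "department": {"name": "Engineering", "cost_center": 42},
    })

    doc = db.employees.find_one({"email": "ada@example.com"})
    print(doc["department"]["name"])   # no join needed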
I've found that being able to change quickly is more valuable than the marginal performance boost that mongo would give. Denormalization is possible in postgres.
Where have you seen this trade off go well?
You can definitely do this in PostgreSQL via materialized views. They're directly supported in recent releases, and unlike NoSQL they don't require you to denormalize the underlying data model.
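A minimal sketch of that approach, with made-up table names: the normalized tables stay as they are, and the denormalized, document-shaped result is precomputed as a materialized view and refreshed on demand.

    import psycopg2

    sql = """
    CREATE MATERIALIZED VIEW department_employees AS
    SELECT d.id,
           d.name,
           jsonb_agg(jsonb_build_object('id', e.id, 'name', e.name)) AS employees
      FROM departments d
      JOIN employees e ON e.department_id = d.id
     GROUP BY d.id, d.name;
    """

    with psycopg2.connect("dbname=app") as conn, conn.cursor() as cur:
        cur.execute(sql)
        # Later, whenever the underlying tables have changed enough:
        cur.execute("REFRESH MATERIALIZED VIEW department_employees")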
I agree with the general point that Postgres and other relational databases can do nosql things (and nosql databases can do relational things).
This sounds like a minor issue, but it's a huge operational pain. Migrating a data structure to accommodate new queries involves both operational and application changes, and both are substantial. Migrating all data in a consistent way is either extremely hard or requires significant downtime.
In my opinion, for most cases, this problem is significantly worse than doing joins in the first place.
I don't know what you mean by migrations are extremely hard though.
Either you incur significant downtime, or you run the risk of breaking data consistency in the face of concurrent applications.
Mongo cannot do a TB-sized migration atomically (see the limitations on transactions), so you are faced with a potential incremental backfill. That will very likely break consistency and create more data problems than it is trying to solve.
Last but not least, you need to heavily recode your application. Worse, you have to make it work with two potential schemas at the same time (while the backfill happens), with data potentially in two places at once.
Or incur hours or days of downtime.
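To make the incremental-backfill point concrete, a rough sketch with made-up fields and collections (the actual old-to-new rewrite is just a placeholder): documents are migrated in small batches because a single TB-sized atomic transaction is not an option, and old-shape and new-shape documents coexist until the loop finishes.

    from pymongo import MongoClient

    db = MongoClient()["hr"]
    BATCH = 1000

    def to_new_shape(doc):
        # Placeholder for the real old-shape -> new-shape rewrite.
        doc["schema_version"] = 2
        return doc

    while True:
        # Documents still in the old shape (no schema_version marker yet).
        batch = list(db.employees.find({"schema_version": {"$ne": 2}}).limit(BATCH))
        if not batch:
            break
        for doc in batch:
            db.employees.replace_one({"_id": doc["_id"]}, to_new_shape(doc))
        # Concurrent writers may keep producing old-shape documents here:
        # this is exactly the consistency window being discussed.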
Anyway, from my perspective, the ability to atomically update a system, whatever kind of database you're using, isn't something you necessarily get. Either you don't have reliable control of all clients, middleware components and data stores, or you can't pay the price of downtime. So for data-access-pattern changes between components we're talking about incremental migration in any case, which needs to maintain compatibility until all dependent clients/components are updated.
> Last but not least, you need to recode your application heavily.
I don't think that's generally true. At a high level, the point of nosql is to store your data more directly in the way your app wants to consume it, so the data reorg follows the needs of the app, not the other way around. I guess you're talking about the need for code somewhere to take a more explicit role in managing storage. That generally does exist with nosql and is a maintenance burden, but it should not be deeply embedded in your app. It should be a layer with clear separation of concerns, the implementation should be shared so you don't have redundant implementations, etc.
I'm not disputing that relational databases are more flexible, and mostly simpler when a single instance does the job (though you generally pay back some of that complexity transforming relational data into the form your clients need). So relational is usually the default/general choice if you don't need performance at scale.
Postgres supports fully atomic DML and DDL operations. You can do transactions as big as you want (I have seen single transactions with multiple TB of changes).
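A small sketch of what that buys you, with made-up table and column names: schema changes and data rewrites run in one transaction and commit or roll back together.

    import psycopg2

    conn = psycopg2.connect("dbname=app")
    with conn, conn.cursor() as cur:       # the `with conn:` block is one transaction
        cur.execute("ALTER TABLE employees ADD COLUMN department_id int")
        cur.execute("""
            UPDATE employees e
               SET department_id = d.id
              FROM departments d
             WHERE d.name = e.department_name
        """)
        cur.execute("ALTER TABLE employees DROP COLUMN department_name")
    # Exiting the block without an error commits the DDL and DML atomically;
    # any failure rolls back all of it.
    conn.close()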
> I don't think that's generally true
When you do a data migration of the kind we're discussing here, say you are reversing who embeds whom (you had employees embedding departments, and now you want departments embedding employees), you need to do a whole-collection rewrite.
Since you want this to be incremental (without downtime), you necessarily need to deal with, among other things:
* Changing all your queries to do a "union" between the results you get from the $old model (those documents not migrated yet) and the $new model (those already migrated). This can be really hard (see the sketch after this list).
* Changing all your updates. They are not trivial either, as they now need to support cases in which the pre-update document can be in either of two forms. Worse, depending on your migration code, you probably need to update the document in the $old form if it is in the $old form, but in the $new form if it is in the $new form.
* Once the migration is done, you need to undo all of these "duplicated" code paths.
* At all times, you are dealing with potentially significant consistency issues. Due to the limitations of ACID operations in MongoDB --and the lack of them in other NoSQL systems--, you might see spurious effects of documents in one form appearing or disappearing while being migrated.
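A minimal sketch of that dual-model read path, with made-up collections and fields: the $old shape has employees embedding a department sub-document, the $new shape has departments embedding an employees array, and a single "business" question has to union both until the migration completes.

    from pymongo import MongoClient

    db = MongoClient()["hr"]

    def employees_in_department(dept_name):
        results = {}
        # $old model: employee documents that still embed their department.
        for emp in db.employees.find({"department.name": dept_name}):
            results[emp["_id"]] = emp
        # $new model: the department document embeds its migrated employees.
        dept = db.departments.find_one({"name": dept_name})
        if dept:
            for emp in dept.get("employees", []):
                results[emp["_id"]] = emp
        return list(results.values())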
> So relational is usually the default/general choice if you don't need performance at scale.
Define scale. One of my customers is gitlab.com. It handles millions of users, with dozens to hundreds of thousands of queries per second and some TBs of daily changes, on a single Postgres cluster, which still has significant headroom for vertical scaling.
Not everybody needs to get significantly bigger than this.
Please elaborate on this; it would be instructive.
Personally, having inherited a database that was absolutely chock full of bad data due to the lack of schema, I'd be very reluctant to go that route unless I really needed to.
Split off the relational data and make it fast with relational DBs.
Everything else goes in MongoDB, and if fields always occur, they move into the relational DB.
If your Postgres JSON schema is modified for new application requirements, the table is locked until the existing data is copied into the new schema, requiring applications to be quiesced during the schema migration.
That's a key difference between the two, and one area where Mongo performs best.
Are you saying that if you INSERT a row with a new JSONB property that wasn't previously included in any of the rows of that table, then the entire table gets locked while all rows are copied? And if so, do you have a source for that? I can't find anything through Google.
Useful links for erudition:
When do I usually need to scale up / out in either database? How do I assess that? And how do I actually go about scaling?
Specifically I would love to know how to scale up and scale out (what are the different multi-node options and what are the consistency / replication delay guarantees)?
The last time I used Mongo in production (2013) most startups eventually had to switch databases or hire a consultant to scale. This has always stuck with me.
Most recently I have been using Citus and it has been ridiculously easy to shard data.
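For what it's worth, a rough sketch of what the Citus setup looks like (hypothetical table and distribution column); the application keeps speaking plain SQL to the coordinator node.

    import psycopg2

    with psycopg2.connect("dbname=app") as conn, conn.cursor() as cur:
        cur.execute("CREATE EXTENSION IF NOT EXISTS citus")
        cur.execute("""
            CREATE TABLE events (
                tenant_id   bigint NOT NULL,
                occurred_at timestamptz NOT NULL,
                payload     jsonb
            )
        """)
        # Citus hash-shards the table across worker nodes on tenant_id.
        cur.execute("SELECT create_distributed_table('events', 'tenant_id')")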
To be fair I have only been exposed to workloads of a few Terabytes of data ingest a day so YMMV.
Unfortunately, I can't compare to Mongo as I have had somewhat limited experience with it.
Doesn’t seem to stop most people here.
My previous company stored coupon codes (referrals) in it. I'd prefer a hash-based NoSQL store for that though (Cassandra), but for a one-man shop it's overkill.
IIRC PostgreSQL has its own document storage, like MongoDB. I just never had to use such a thing because I guess I'm too OCD about how my data is supposed to look.
I use MongoDB for everything where I don't know the data structure beforehand.
MongoDB is very good at indexing unknown data structures and making them searchable.
Also, when I build prototypes I often use MongoDB.
To me, this almost never happens. You almost always need JOINs; otherwise your data is sterile and useless. What point is there in talking about entities devoid of relationships to other entities? Data is always relational; we don't live in a 1-dimensional world.
The only way I see it being useful is for scaling purposes. But there are way better, more stable, more failure-resistant KV stores out there.
you're comparing the feature that RDBs do best to the feature that NoSQL DBs do worst...
Is there any real reason why a document store can't have good joins between collections? If you index the lookup keys, I don't understand what about the underlying document model would make joining impossible to optimize as well as a relational model.
When you say difficult do you mean from a developer ergonomics standpoint (knowing there's data integrity), or from a database optimizability standpoint?