While I understand it is tempting to conclude that this is apples-to-oranges, or that "obviously a relational database is better at data relationships", I recommend diving deeper into the topic.
First of all, almost all data models you can think of are in one way or another relational (call it "related data", if you prefer). So any database, RDBMS or NoSQL or whatever, will need to deal with data that is related to other bits of data. From this perspective, it is very apples-to-apples IMHO.
Second, and regardless of the above, the exercise here is to model a dataset that necessarily consists of two (or three) pieces of related data. And we actually considered the two models that MongoDB themselves propose for data modelling: embedded and reference. This is explained in the post, and references to the official MongoDB documentation are provided.
Third: the benchmark is open and public. The repository also contains all the query examples used in the post. If anyone thinks there is a better way to represent this data model (problem statement) that would still answer all the "business" questions (queries) considered in the post, please send a Merge Request to the repository. We will be very happy to analyze it.
All in all, this is an exercise in data modeling for a given dataset. MongoDB requires data modeling, it is not "schema-less", and this is pervasive throughout their official documentation. The reference model, and the $lookup operator, are neither a corner case nor a hidden, downplayed option. $lookup is a first-class, well-advertised operation. What is not advertised, based on our observations, is its complexity, its performance, and the rewriting needed to adapt to changes.
Ok everyone, calm down. I think it is obvious that the author knows that he is comparing apples to oranges here... but that isn't the point. The point of this article was to quantify the difference. His work accomplishes that for these specific tests very well. Yes, we know relational systems are better in these cases, but it is useful for those considering no-sql solutions to see this kind of hard data.
While I think it's easy to poke fun at this article - relational database better at handling relational data, more at 10 - it does make me want to understand why one would choose mongo over postgres?
There's really only one reason to choose mongo over a competent RDBMS, and that's schemalessness at the database level--which most users of mongo then toss away by adding a schema package at the code layer.
If your data is relatively freeform with good indices on stable fields, mongo is hugely convenient, and can be easily structured to scale horizontally. For non-join-based aggregation (i.e., aggregation within a collection), mongo is very performant, and works fine for a lot of use cases. I'm using it to store large amounts of simple timeseries data and it's great, largely because my use case doesn't require more complex data structures.
As soon as you want to deal with more complex structured data, mongo is worse than any RDBMS. The only viable noSQL solution to complex structures is denormalization, which is sometimes a legit choice. But unless you know your primary use is just retrieving a set of documents based on indexed fields (and many are), get an RDBMS.
MongoDB has built-in auto-failover with replica sets. It is possible to set this up for PGSQL with third-party tools but it's hard to get right.
MongoDB has built-in multi-instance sharding support. Again this is possible for PostgreSQL with table partitioning and FDW but it is hard to maintain.
> MongoDB has built-in auto-failover with replica sets. It is possible to set this up for PGSQL with third-party tools but it's hard to get right.
The PostgreSQL documentation makes it quite clear how to set up HA; I don't think you need any third-party tools other than general OS-level support for failover.
I don't think PostgreSQL supports out-of-the-box auto-failover as MongoDB does. It either needs custom scripting or tools like repmgr.
From the official docs:
>> PostgreSQL does not provide the system software required to identify a failure on the primary and notify the standby database server. Many such tools exist and are well integrated with the operating system facilities required for successful failover, such as IP address migration. [1]
Or Patroni (https://github.com/zalando/patroni), a better option. But it is complex to set up, requires other external tools too (DNS, virtual IPs, Envoy proxy, etc.), and none of that is included in Postgres.
I can’t address MongoDB specifically, but for nosql vs relational generally, there are two reasons:
(1) what’s faster than a fast join? No join. With nosql you have more flexibility to store the data organized in the same way it is accessed. You don’t need to join to another table if the data your app needs is already directly part of the main data the app requests. You do need to understand your access patterns well, and develop migration plans when they change.
(2) scalability. nosql databases generally let you scale horizontally more easily/gracefully than relational databases, though there are trade offs.
Put it together and you get performance at scale, though you generally need to understand your data access patterns and usually need to be more resilient to inconsistencies.
> You do need to understand your access patterns well,
I've found that being able to change quickly is more valuable than the marginal performance boost that mongo would give. Denormalization is possible in postgres.
> what’s faster than a fast join? No join. With nosql you have more flexibility to store the data organized in the same way it is accessed.
You can definitely do this in PostgreSQL via materialized views. They're directly supported in recent releases, and unlike NoSQL they don't require you to denormalize the underlying data model.
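For instance, something along these lines (table and column names are purely illustrative):

    -- precompute the joined/aggregated shape once
    CREATE MATERIALIZED VIEW order_summaries AS
    SELECT o.id,
           c.name        AS customer_name,
           sum(i.amount) AS total
    FROM   orders o
    JOIN   customers   c ON c.id = o.customer_id
    JOIN   order_items i ON i.order_id = o.id
    GROUP  BY o.id, c.name;

    -- a unique index lets you refresh without blocking readers
    CREATE UNIQUE INDEX ON order_summaries (id);
    REFRESH MATERIALIZED VIEW CONCURRENTLY order_summaries;

Reads hit the precomputed view with no join at query time, while the normalized tables remain the source of truth.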
Well, a materialized view is denormalized data. You distinguish it from the “underlying data model”, but you can do that in nosql as well. Also, materialized views still tend to be oriented to rectangular data, which isn’t always how you want to access data.
I agree with the general point that Postgres and other relational databases can do nosql things (and nosql databases can do relational things).
> You do need to understand your access patterns well, and develop migration plans when they change.
This sounds like a minor issue, but it's a huge operational pain. Migrating a data structure to accommodate new queries involves operations and application changes, both substantial. Migrating all data in a consistent way is either extremely hard or requires significant downtime.
In my opinion, for most cases, this problem is significantly worse than doing joins in the first place.
A migration in this context may involve rewriting substantial parts of the database. Imagine you need to rewrite a several TB collection. How would you do it?
Either you incur significant downtime, or you risk breaking data consistency in the face of concurrent applications.
Mongo cannot do a TB-sized migration atomically (see the limitations for transactions), so you are faced with a potentially incremental backfill. That will very likely break consistency and create more data problems than it is trying to solve.
Last but not least, you need to recode your application heavily. And worse, make it work with two potential schemas at the same time (while the backfill happens), with data potentially in two places at once.
Yes, I'm thinking of incremental migrations. You don't need to break things though.
Anyway, from my perspective, the chance you can atomically update a system, whatever kind of database you're using, isn't something you necessarily get. You either don't have reliable control of all clients, middleware components and data stores or can't pay the price of downtime. So for data access pattern changes between components we're talking about incremental migration in any case, that needs to maintain compatibility until all dependent clients/components are updated.
> Last but not least, you need to recode your application heavily.
I don't think that's generally true. At a high level, the point of nosql is to store your data more directly in the way your app wants to consume it, so data reorg follows the needs of the app, not the other way around. I guess you're talking about the need for code somewhere to take a more explicit role in managing storage. That generally does exist with nosql and is a maintenance burden, but it should not be deeply embedded in your app. It should be a layer with clear separation of concerns, the implementation should be shared so you don't have redundant implementations, etc.
I'm not disputing relational databases are more flexible, and mostly simpler when a single instance does the job (though you generally pay back some complexity transforming relational data into the form your clients need). So relational is usually the default/general choice if you don't need performance at scale.
> the chance you can atomically update a system, whatever kind of database you're using, isn't something you necessarily get
Postgres supports fully atomic DML and DDL operations. You can do transactions as big as you want (I have seen single transactions with multiple TB of changes).
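As a sketch (hypothetical employees/departments tables, but the mechanics are plain Postgres), a schema change plus its backfill can run in a single transaction:

    BEGIN;
    -- add the new column and backfill it from the related table
    ALTER TABLE employees ADD COLUMN department_id bigint REFERENCES departments(id);
    UPDATE employees e
    SET    department_id = d.id
    FROM   departments d
    WHERE  d.name = e.department_name;
    -- drop the old denormalized column in the same transaction
    ALTER TABLE employees DROP COLUMN department_name;
    COMMIT;  -- everything above becomes visible atomically, or rolls back as a unit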
> I don't think that's generally true
When you do a data migration of the kind we're talking about here, let's say you are reversing who embeds whom (you had employees embedding departments, and now you want departments embedding employees), you need to do a whole-collection rewrite.
Since you want this to be incremental (without downtime) you necessarily need to deal with, among other things:
* Changing all your queries to do a "union" between the results you get from the $old model (those documents not migrated yet) and the $new model (those already migrated). This can be really hard (a rough sketch follows this list).
* Changing all your updates. They are not trivial either, as they now need to support cases in which the pre-update document can be in two possible forms. Worse, depending on your migration code, you probably need to update the document in the $old form if it is still in the $old form, but in the $new form if it is in the $new form.
* Once the migration is done, undoing all these "duplicated" code paths.
* At all times, you are dealing with potentially significant consistency issues. Due to the limitations of ACID operations in MongoDB --and the lack of them in other NoSQL systems-- you might see spurious effects of documents in one form appearing or disappearing while being migrated.
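To make the first point concrete, here is a rough SQL analogue of that "union" (imagine the collection as a hypothetical docs table with a JSONB body); every read has to cover both shapes until the backfill finishes:

    -- $old form: each doc is an employee embedding its department
    SELECT d.body->>'name'               AS employee,
           d.body->'department'->>'name' AS department
    FROM   docs d
    WHERE  d.body ? 'department'
    UNION ALL
    -- $new form: each doc is a department embedding its employees
    SELECT e->>'name', d.body->>'name'
    FROM   docs d,
           jsonb_array_elements(d.body->'employees') e
    WHERE  d.body ? 'employees';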
> So relational is usually the default/general choice if you don't need performance at scale.
Define scale. One of my customers is gitlab.com. It handles millions of users, with dozens to hundreds of thousands of queries per second and some TBs of daily changes, on a single Postgres cluster, which still has significant headroom for vertical scaling.
Not everybody needs to get significantly bigger than this.
You can do that by using a JSONB column in postgres (including indexes), and that way you can upgrade to typed columns at your leisure (and on a per-column basis).
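Roughly like this (illustrative names):

    -- start with everything in a JSONB column
    CREATE TABLE events (payload jsonb NOT NULL);
    CREATE INDEX ON events USING GIN (payload);

    -- later, promote one field to a typed column and backfill it
    ALTER TABLE events ADD COLUMN user_id bigint;
    UPDATE events SET user_id = (payload->>'user_id')::bigint;
    -- once every row has it, you can even enforce it
    ALTER TABLE events ALTER COLUMN user_id SET NOT NULL;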
Personally, having inherited a database that was absolutely chock full of bad data due to the lack of schema, I'd be very reluctant to go that route unless I really needed to.
This is a side feature for PostgreSQL and other relational DBs.
MongoDB focuses on exactly this.
I would not mix the two concepts.
Split off the relational data and make it fast with relational DBs.
Everything else goes in MongoDB, and if fields always occur, they move into the relational DB.
I agree with you on the original premise of "Why would you use Mongo over Postgres?": managing two DBs is terrible. But if you replace Mongo with something like DynamoDB, I think it gets more complicated. You don't really "manage" DynamoDB as much as just define what you're trying to do, as it's a managed service. I quite enjoyed using DynamoDB for some specific tasks at work, much more than I've enjoyed PostgreSQL's JSON columns.
I would use the best tools even if that requires two DBs.
When I have many transactions with many joins but also user documents (maybe from an online editor which is currently still in heavy development), I would use two DBs.
In an ideal world yes, you only manage one database but no such "jack of all trades" DB exists for the same reason the CAP theorem remains true.
If your Postgres JSON schema is modified for new application requirements, the table is locked until the existing data is copied into the new schema, requiring applications to be quiesced during the schema migration.
That's a key difference between the two, and one area Mongo performs best in.
> If your Postgres JSON schema is modified for new application requirements, the table is locked until the existing data is copied into the new schema, requiring applications to be quiesced during the schema migration.
Are you saying that if you INSERT a row with a new JSONB property that wasn't previously included in any of the rows of that table, then the entire table gets locked while all rows are copied? And if so, do you have a source for that? I can't find anything through google.
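To be clear about the scenario I mean (hypothetical events table with a payload jsonb column):

    INSERT INTO events (payload) VALUES ('{"user_id": 1}');
    -- a later insert introduces a brand-new key, with no ALTER TABLE issued
    INSERT INTO events (payload) VALUES ('{"user_id": 2, "new_field": "hello"}');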
Sure there are situations where a Postgres/Mongo hybrid is the best but my preferred default is Postgres only. In most cases this will work fine, is simple and straightforward. Why make things more complicated than necessary?
I've been using jsonb columns in PostgreSQL recently and have used CouchDB extensively in the past. I find the built-in support for JSON superior in PostgreSQL. There are a lot of nice built-in operators for querying, and creating indexes is much more concise. I wouldn't call it a side feature at all.
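For example, the containment and existence operators plus a single GIN index cover most lookups (names are illustrative):

    CREATE TABLE items (doc jsonb NOT NULL);
    CREATE INDEX items_doc_gin ON items USING GIN (doc);

    -- containment: documents whose doc includes this sub-document
    SELECT * FROM items WHERE doc @> '{"status": "failed"}';
    -- key existence and field extraction
    SELECT doc->>'user_id' FROM items WHERE doc ? 'user_id';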
when you need a geo-replicated and highly available data store in a way that doesn't make you lose sleep. (also, if you don't need to join relations in it. or have relations at all.)
PostgreSQL has the JSONB data type which is essentially the same as the BSON type except slightly better optimized for reading partial documents. And PostgreSQL also provides good indexing support for them, and did so before MongoDB.
Postgres can also be made to shard any table to multiple hosts via the PARTITION BY RANGE/PARTITION OF feature, combined with the FOREIGN DATA WRAPPER (FDW) core extension.
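A minimal sketch of that setup (hostnames, credentials and ranges are made up):

    CREATE EXTENSION IF NOT EXISTS postgres_fdw;

    CREATE SERVER shard_b FOREIGN DATA WRAPPER postgres_fdw
      OPTIONS (host 'shard-b.internal', dbname 'app');
    CREATE USER MAPPING FOR CURRENT_USER SERVER shard_b
      OPTIONS (user 'app', password 'secret');

    CREATE TABLE measurements (
      id          bigint      NOT NULL,
      recorded_at timestamptz NOT NULL,
      value       double precision
    ) PARTITION BY RANGE (id);

    -- one partition stays local...
    CREATE TABLE measurements_0 PARTITION OF measurements
      FOR VALUES FROM (0) TO (1000000);

    -- ...another lives on a different host
    CREATE FOREIGN TABLE measurements_1 PARTITION OF measurements
      FOR VALUES FROM (1000000) TO (2000000)
      SERVER shard_b OPTIONS (schema_name 'public', table_name 'measurements_1');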
Only one person in this comment thread mentioned CAP theorem, which is the (highly informal*) rationale for NoSql's existence and also its deliberate aversion to join dependencies (which more or less violate the core principles of eventual consistency modeled distributed data stores.)
I would really love to see a direct comparison for scaling MongoDB and PostgreSQL.
When do I usually need to scale up / out in either database? How do I assess that? And how do I actually go about scaling?
Specifically I would love to know how to scale up and scale out (what are the different multi-node options and what are the consistency / replication delay guarantees)?
The last time I used Mongo in production (2013) most startups eventually had to switch databases or hire a consultant to scale. This has always stuck with me.
Can someone who is using MongoDB and an RDBMS at the same time tell me the exact use cases where MongoDB is a better fit? Structured logs? Caching? Sessions? And why MongoDB does it better?
Clustering is very easy in MongoDB, and it works well for unstructured data. I've looked into clustering postgresql a few times and it seems like a pain in the butt. TBH, storing data without thinking about relations is just kicking the can down the road.
My previous company stored coupon codes in it (referrals). I'd prefer a hash-style nosql for that though (cassandra), but for a one-man shop it's overkill.
---
edit:
IIRC postgresql has its own document storage like MongoDB. I just never had to use such a thing because I guess I'm too OCD about how my data is supposed to look.
Relational DBs for everything that needs joins.
MongoDB is horrible with that and should not be used.
One would not put an off-road vehicle on a race track and expect it to win.
I use MongoDB for everything where I don't know the data structure beforehand.
MongoDB is very good at indexing unknown data structures and making them searchable.
Also, when I build prototypes I often use MongoDB.
The biggest one is searching JSONs. You essentially have a database that stores JSONs and lets you search them. Just dumping a JSON and searching it goes a long way. My problems are more with the atomicity guarantees, which are not very consistent.
A basis for the apples to oranges argument is that you don't use MongoDB when you need JOINs.
To me, this almost never happens. You almost always need JOINs, otherwise your data is sterile and useless. What point is there to talk about entities devoid of relationships to other entities? Data is always relational; we don't live in a 1-dimensional world.
The only way I see it being useful is for scaling purposes. But there are way better, more stable, more failure-resistant kv stores out there.
The most common use case for Mongo is storing relational data for web apps, and the most common outcome for those apps is complete dysfunction. How do I know that’s the most common use case? Aside from my own anecdotal experience, you just need to look at the features those DBs implement. They’re constantly implementing new RDBMS-like and ACID-like features. They wouldn’t be doing that if that wasn’t their market. Why does this happen? Because people who don’t know better keep thinking that they can somehow work around the constraints and make mongo (or dynamo) work as an RDBMS somehow. It is an apples to oranges comparison. But it’s worth making, because of the large number of people who think it’s ok to substitute oranges into apple pie recipes.
None of that matters because your data needs to be modeled around your application usage patterns. If you maintain inter-collection relationships via Joins in Mongo just as if you were in Postgres, you're not doing it right.
Your schema needs to be modelled around your domain, because that will dictate what your application usage patterns actually are. That will determine whether using a document database is feasible or not. If not, then there isn’t any design cleverness that can save you.
This was quite the lopsided comparison - a document store just doesn't get fair competition in a relational setting. Additionally, both the SQL and the MongoDB queries appear unnecessarily complicated (or inefficient) for the examples they illustrate.
Is this actually an objective truth? "Relational" just refers to the relations within the tables; it's less about having a database system with good inter-table relationship handling. It happens to be that a lot of the things we think about as relational are OLTP and have good fkeys, etc.
Is there any real reason why a document store can't have good joins between collections? If you index the lookup keys, I don't understand what about the underlying document model would make joining impossible to optimize as well as a relational model.
I don't think most document stores enforce foreign key relationships (and hence consistency) like RDBMS do, so in practice that would make joins difficult.
Is joining on an fkey any more performant than without it assuming the same indexing is applied?
When you say difficult do you mean from a developer ergonomics standpoint (knowing there's data integrity), or from a database optimizability standpoint?
The former. You have a guarantee of consistency, whereas otherwise you must adjust for this in your application code. Database performance wise I agree they are equivalent.
This does feel like an article that would have made more sense in 2013 or so, back when the habit of mistaking a document store for a database was much more widespread.
Because lots of people are still using it. In Q1 2020 they had $89m in revenue, up from $50m in Q1 2019. Just because something is bad technically doesn't mean people will immediately abandon it. See: php.