This might sound disparaging, and we all like to think our own data matters to the Nth degree, but consider: how important is that banner at the bottom of an Amazon or eBay page, "people who looked at this thing bought this stuff"? It has to be there, but how accurate does it have to be? Are you willing to wait an extra five seconds for it to be right?
In the overwhelming majority of places where you might want a database to help you manage the data involved, the majority of the data does not need to be 100% right if getting it right would make the main operations slower. For those parts of the job, MongoDB in its default configuration with default options is good enough.
Many of us with what we like to think of as higher standards prefer to work where everything matters, and leave the dancing penguins to others. But those others do need a database, and they have the money to buy one.
(Disclosure: I was fired from MongoDB right after their IPO, perhaps in part for saying stuff like the above.)
The stock market demands that the product have all the same stickers on it as everybody else's. But nobody is obliged to believe them. It may take quite a lot before anybody does again.
The people working on the product are very smart and hard-working, but the problem they have set themselves is extremely complex, and possibly not solvable; or, even if ever solved, maybe not demonstrably so.
And, anyway, it is not what the customers actually want, although they might often need to convince their own management that they are getting it.
That is a very interesting take on the situation, especially because the tort of fraud starts from the opposite first principle: that customers may reasonably rely on the promises made by sellers.
The stickers come from organizations that are largely captured by vendors. Kudos to Jepsen for maintaining independence.
Would you say the same thing for a company that sells parachutes?
There's a market for cheap, fast, easy, and good-enough reliability.
There are plenty of data stores like redis or memcached which promise and deliver exactly that, with known parameters, and they're awesome. I'm building an application right now where long-term data integrity doesn't matter too much (a dashboard will be down for a few minutes if we lost data), and they're the right way to go.
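For that kind of ephemeral dashboard data, the pattern is just a cache with a TTL, which Redis provides natively (`SETEX`/`EXPIRE`). Here's a toy in-process stand-in sketching the semantics; the class and names are hypothetical for illustration, not a real Redis client:

```python
import time

# Toy stand-in for the pattern Redis gives you natively with SETEX:
# values expire after a TTL, and a miss just means the caller recomputes.
class TTLCache:
    def __init__(self):
        self._store = {}  # key -> (value, expires_at)

    def set(self, key, value, ttl_seconds):
        self._store[key] = (value, time.monotonic() + ttl_seconds)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            del self._store[key]
            return None  # expired: dashboard is briefly stale, and that's fine
        return value

cache = TTLCache()
cache.set("dashboard:latest", {"active_users": 1234}, ttl_seconds=60)
```

If the process dies and the cache is lost, the dashboard repopulates on the next refresh; that's the whole contract, and it's a reasonable one for this class of data.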
What I do grapple with is that MongoDB is none of those. When I've seen it used in projects, it was slow, hard, and expensive.
Because Mongo overpromises, it's hard to tell what its real limits are. You're quite likely to step on a landmine where something was promised but didn't come through. When that happens, you either have a huge amount of deep voodoo debugging, or perhaps it's a problem without a solution. That's almost always a lot more expensive than designing around psql in the first place.
And it's not particularly fast either.
The ideal use case for MongoDB is when you are a) collecting a lot of data, b) only need to retain a large enough sample for statistical purposes, and c) don't need historical archives.
In those cases, turn all the safeties off and go for it. It really does shine in that scenario. You want to get a heat map of clicks on a high-traffic website? Perfect. You'll get a good-enough picture of what's happening. And if you lose some or get some conflicting data, that's okay because we only care about a statistical approximation. Archive the results of the analysis instead of worrying about the underlying data.
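For what it's worth, "turning the safeties off" in MongoDB mostly means lowering the write concern. A sketch in the mongo shell, assuming a hypothetical `clicks` collection: `w: 0` tells the driver not to wait for any acknowledgment from the server, so inserts are fast but can be silently lost.

```javascript
// Fire-and-forget click logging: with writeConcern { w: 0 } the driver
// does not wait for the server to acknowledge the write. Some inserts
// may be dropped, which is acceptable for statistical sampling.
db.clicks.insertOne(
  { page: "/product/42", x: 311, y: 852, ts: new Date() },
  { writeConcern: { w: 0 } }
)
```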
People often get bogged down in the details of the CAP theorem too early in their decision-making process. The reality is that database systems occupy a one-dimensional spectrum with raw speed at one end and safety at the other. The CAP theorem is a rigorous study of three dimensions of trade-offs you have to understand as you move that slider from one side to the other. A lot of hand-wringing and debate could be shut down if people made the simpler decision first. Most use cases are binary and occupy one extreme or the other: you either need the data to be safe, or you don't. If you don't need it to be safe, then you want the store to be as fast as possible. If you do need it to be safe, your options are limited.
If you phrase the question only on the fundamental speed/safety spectrum, your product and technology managers will almost always come up with an answer. If you start at the CAP level of analysis, you end up with endless discussions about what "safe enough" means--and it's usually a discussion among people who don't really understand the theorem at all.
There are tons of cases where any specific item of data doesn't matter; it's only the aggregate that matters. Great. Perfect. Use whatever; it doesn't matter. And there are other times when you absolutely cannot afford to lose anything. These seem like obviously different things, and I can't for the life of me imagine why anyone would try to use the same tool for both purposes.
Anyway, I'll shut this down now. To me, the bottom line is to identify what the business case is for data first. Fast or Safe. Pick one. Okay, now we basically know what our options are. Then we can get into the weeds among ourselves.
If you are really at the point where you need to scale massively, then you can look into specialized databases.