There are definitely use cases for NoSQL, and I clicked on this thread hoping for information and war stories about Cockroach, Mongo, Redis, CouchDB, the current state of NoSQL in Postgres, and a few names I'd maybe never heard of.
Let me derail the conversation from berating this technology. I happen to have a requirement that needs a dynamically changing DB structure (saving lots of JSON data from a dynamically changeable form). Which NoSQL DB would you recommend for me?
Any pitfalls? I'm primarily looking for self hosted solutions.
The OP has harped quite a bit on connection limit issues, and this is a valid concern, but it is also something you can mitigate with connection pooling. Geographic replication is an issue, and it's one of those things where I'm not really convinced any SQL or NoSQL DB offers a really good solution. For example, if you run MongoDB as a replica set you must take extra steps to ensure your replicas do not suffer from split brain during a network partition. And as another commenter has pointed out, MongoDB's transactions are flawed; I would not trust it for anything transactional beyond single-document transactions in a non-replicated configuration.
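For what it's worth, this is roughly what a multi-document transaction looks like with the Node.js mongodb driver, i.e. the kind of thing I'd be wary of relying on; the connection string, database and collection names below are made up, and transactions require a replica set in the first place:

```typescript
import { MongoClient } from "mongodb";

// Placeholder connection string; multi-document transactions need a replica set.
const client = new MongoClient("mongodb://localhost:27017/?replicaSet=rs0");

export async function transferCredits(fromId: string, toId: string, amount: number) {
  await client.connect();
  const session = client.startSession();
  try {
    // withTransaction retries on transient errors and handles commit/abort.
    await session.withTransaction(async () => {
      const accounts = client
        .db("app") // placeholder database name
        .collection<{ _id: string; credits: number }>("accounts"); // placeholder collection
      await accounts.updateOne({ _id: fromId }, { $inc: { credits: -amount } }, { session });
      await accounts.updateOne({ _id: toId }, { $inc: { credits: amount } }, { session });
    });
  } finally {
    await session.endSession();
    await client.close();
  }
}
```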
DynamoDB is pretty decent for a document DB but I personally dislike working with it. It requires you to give up many things that RDBMSs like Postgres offer out of the box. For example, if you want to fetch all records you have to do it in a loop because there are limits to how many records can be fetched at a time. But of course you can't self-host it.
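To illustrate the fetch-everything point: with the AWS JavaScript SDK (v3) you typically loop on LastEvaluatedKey, roughly as in this sketch (table name and region are placeholders):

```typescript
import { DynamoDBClient, ScanCommand } from "@aws-sdk/client-dynamodb";

const client = new DynamoDBClient({ region: "us-east-1" }); // placeholder region

// Each Scan call returns at most 1MB of data, so "fetch all records"
// means looping until LastEvaluatedKey is no longer returned.
export async function scanAll(tableName: string) {
  const items: any[] = [];
  let startKey: Record<string, any> | undefined;
  do {
    const page = await client.send(
      new ScanCommand({ TableName: tableName, ExclusiveStartKey: startKey })
    );
    items.push(...(page.Items ?? []));
    startKey = page.LastEvaluatedKey;
  } while (startKey);
  return items; // raw DynamoDB AttributeValue maps
}
```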
Another hosted option is Datomic. I've heard great things but have never used it so I can't really comment.
Can you do connection pooling to Postgres from cloud functions?
I know there is a Node driver that does it for MySQL but I've never seen one for Postgres.
For what it is worth, the popular Postgres and MySQL libraries for Node.js support connection pooling out of the box. Cloud functions break this functionality because you are spinning up an instance per connection rather than handling multiple connections with a single instance of Node.js. In this case I think that longer-lived containers or VMs that can autoscale may be a better solution. That is not to say you should not use cloud functions; I understand they offer many benefits, and managing a fleet of VMs or containers introduces its own challenges. In general I think this is a good example of how simplifying one thing can complicate other things. There are no silver bullets. It's all about determining the compromises you can and cannot accept.
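As a rough sketch of what that out-of-the-box pooling looks like with the node-postgres (pg) library (connection details are placeholders):

```typescript
import { Pool } from "pg";

// One long-lived pool per process; `max` caps how many Postgres
// connections this particular instance will ever hold open.
const pool = new Pool({
  connectionString: "postgres://app:secret@db.example.com:5432/app", // placeholder
  max: 10,
});

export async function getUser(id: number) {
  // pool.query checks a client out of the pool, runs the query, and returns it.
  const { rows } = await pool.query("SELECT id, name FROM users WHERE id = $1", [id]);
  return rows[0];
}
```

The catch described above is that each cold-started cloud function instance gets its own pool, so the per-instance `max` does nothing to cap the total number of connections across all instances.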
I'm using Cloudflare Workers which are distributed on the CF network and run at the edge to reduce latency. Achieving the same thing with containers would be probably complicated and expensive.
Right now I'm using Fauna DB which is also distributed and doesn't have connection limits, but I want to explore all my options.
If anyone wants to know more about PGBouncer and pooling with postgres I found this article very informative:
OP, try https://www.arangodb.com
It is the best NoSQL option IMHO. It's multi-model, extremely performant, has fantastic distributed/replication capabilities and good documentation. They even have a hosted offering of it.
It looks good but their cloud offering seems quite expensive starting at $0.20 per hour or about $150 per month.
Firestore is better than the RTDB but still very limited compared to say Mongo or Fauna.
That being said, it's still not really a database engine in itself, and it would also require a slight paradigm change in how you think about your data and how you create your schema, so YMMV. But I've personally used it across a few non-data-heavy projects as the primary datasource and have been quite happy with it. It was also famously used as the primary datasource for a well-known adult website generating 200M pageviews/day even back in 2012, although I don't know if that is still the case.
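If I'm reading this right and we're talking about Redis as the primary store, the paradigm change mostly means denormalizing everything into keys and hashes and maintaining your own lookups. A rough sketch with the ioredis client (connection string, key names and fields are all made up):

```typescript
import Redis from "ioredis";

// Placeholder connection; point this at your actual Redis host in production.
const redis = new Redis("redis://127.0.0.1:6379");

// Instead of a `users` table, each user becomes a hash under its own key,
// and any lookup you need becomes an extra key you maintain by hand.
export async function createUser(id: string, name: string, email: string) {
  await redis.hset(`user:${id}`, "name", name, "email", email);
  await redis.set(`user:email:${email}`, id); // hand-rolled secondary "index"
}

export async function getUser(id: string) {
  return redis.hgetall(`user:${id}`);
}
```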
I'm already using Cloudflare Workers so it would make more sense to simply use Workers KV, whose data is stored at the edge, instead of Redis (a minimal usage sketch follows the comparison below):
Heroku Redis:
- $15 per month
- 50MB of memory
- 40 connections limit

Cloudflare Workers KV:
- $5 per month
- 1GB of storage (then $0.50 per GB)
- 10M reads (then $0.50 per 1M reads)
- Unlimited connections (via API or REST)
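Reading and writing KV from a Worker is just a binding on the environment, with no connections to manage. A minimal sketch assuming a KV namespace bound as MY_KV (the binding name is my own, and the KVNamespace type comes from @cloudflare/workers-types):

```typescript
// Module Worker with a KV namespace bound as MY_KV (binding name is an assumption).
export interface Env {
  MY_KV: KVNamespace;
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const url = new URL(request.url);
    const key = url.pathname.slice(1) || "default";

    if (request.method === "PUT") {
      // Store the request body under the key; KV writes are eventually consistent.
      await env.MY_KV.put(key, await request.text());
      return new Response("stored\n");
    }

    const value = await env.MY_KV.get(key);
    return new Response(value ?? "not found\n", { status: value ? 200 : 404 });
  },
};
```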
On a side note, for the $15 a month Heroku charges for a 50MB Redis (wtf?) one could, for example, get a 4 vCPU / 8GB RAM / 160GB SSD server on Hetzner Cloud with 20TB of traffic, or multiple smaller servers, and have a much better infrastructure and room to grow without any added costs. Oh well, I guess "managed" is where the big bucks are :) /rant
I wish I could manage a VPS. I've dabbled with DO et al but I wouldn't sleep at night having a self configured VPS in production.
Being used to configuring my own servers with relatively low traffic/processing needs on the cheap, I just get weirded out sometimes by the very expensive nature of these managed services like Heroku, AWS, etc. and their "charge for every little thing" mentality.
Still, the "elastic" functionalities and auto growth features they offer are usually pretty awesome and not trivial to setup on a VPS so it's a great and worry free option in that regard.
Hope you can find the right solution for your case, good luck :)
If that still doesn't convince you, then you may actually need NoSQL: go for Mongo.
Typically the problem with an RDBMS is that it's very expensive to handle thousands of concurrent connections. NoSQL doesn't have that issue. For example FaunaDB is designed for serverless and has no practical connection limits, Mongo Atlas gives you 500 concurrent connections on the free tier, etc.
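Part of why FaunaDB sidesteps the limit, as far as I understand, is that the driver talks to it over stateless HTTPS requests rather than holding a connection open. A rough sketch with the faunadb JS driver (the secret and collection name are placeholders):

```typescript
import faunadb from "faunadb";

const q = faunadb.query;
// Every query is an independent HTTPS request; there is no persistent
// connection to pool or exhaust. Secret and collection name are placeholders.
const client = new faunadb.Client({ secret: "fnAD-example-secret" });

export async function listThings() {
  return client.query(
    q.Map(
      q.Paginate(q.Documents(q.Collection("things"))),
      q.Lambda("ref", q.Get(q.Var("ref")))
    )
  );
}
```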
In comparison Postgres on Heroku only gives you 500 connections on the most expensive plans. Even the $50 per month Postgres plan only gives you 50 concurrent connections.
A pain to distribute geographically? What do you think big enterprises and banks use? Oracle or Mongo? If by “a pain” you mean “not free” then you’re right. Depends on how valuable your data is and how much you care about integrity.
Also I’m not a big enterprise nor a bank.
IIRC I read that each connection to Postgres consumes about 10MB of RAM.
Maybe the proposal above by itself isn't enough, but it's going to solve some of the many problems that put a restriction on max allowable connections in the first place. For example, I am not sure if there is anything in there to reduce the baseline per connection memory consumption, but maybe that will come up as the next problem to solve and will hopefully be solved sooner than later.
To me, data has "live" somewhere. When one computer is storing that and serving it, it is by definition a server.
Is there something I've missed??
Cloud functions are triggered on demand so these scale up and down as needed.