One could argue that, for the sake of consistency, you could run PG in K8S, but to me that is just a hammer-and-nail argument.
But if you have really good shared storage, then it is worth considering. Still, I don't know whether any network-attached storage can beat a locally attached RAID of solid-state disks in terms of performance and/or latency. And there is/was the fsync bug, which is terrible in combination with somewhat unreliable network storage.
For me, any database is like etcd and the other components of the k8s masters: they are the backbone. Inside the cluster I run my apps/microservices. These apps are subject to frequent changes and upgrades and thus profit most from automatic recovery, failover, (auto)scaling, etc.
All of this means that I don't worry about most of the things you mention. PG upgrade? Fail over and upgrade the pod. Upgrade fails? wal-g clone from object storage and rejoin the cluster. Scaling up? Adjust the resource claims. If the resource claims necessitate node migration, see the failover scenario. It's so resilient. And this is all with RAID 10 NVMe direct-attached storage, just as fast as any other setup.
You mention etcd but people don't run etcd the way you're describing postgres. You run a redundant cluster that can achieve quorum and tolerate losses. If you follow that paradigm, you end up with postgres on k8s.
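To put numbers on "achieve quorum and tolerate losses": a minimal sketch of the majority-quorum arithmetic that etcd-style clusters rely on. The function names are made up for illustration, and this says nothing about how any particular Postgres HA tool picks a leader.

```python
def quorum(members: int) -> int:
    """Smallest number of members that forms a strict majority."""
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """How many members can be lost while a majority is still reachable."""
    return members - quorum(members)


if __name__ == "__main__":
    for size in (1, 3, 4, 5):
        print(size, quorum(size), tolerated_failures(size))
```

Note that going from 3 to 4 members does not buy any extra tolerance, which is why odd cluster sizes are the usual choice.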
This is fine if you have a small single-node database that can tolerate some downtime. Once you need replicas, enough RAM to fit your database in memory for performance, or a hot standby for reliable failover, it becomes a lot more complicated.
You should consider what (if anything) you miss out on by just running in a single VM, which you can scale out much more easily down the road should you need to. Alternatively, pay extra for a hosted solution that simplifies those steps further.
I’m not sure how experience could lead you to this conclusion. This wouldn’t work for any of our production needs.
This is great for the cloud provider, as those are high-margin services. I for one would rather have a single orchestration API to interact with, namely the one offered by k8s. All the benefits of running workloads in k8s apply equally to data stores. Cattle, not pets. If your data store needs coddling then it's a pet and you're doing it wrong.
PVC = Persistent Volume Claim
HA = High availability
PG = Postgres
Generally, though, in production you're not going to be taking down DBs on purpose. If it's not supposed to be ephemeral, it doesn't fit the model.
Anecdotally, keeping stateful components outside of K8s makes running your cluster and application so much simpler, and it is much easier to maintain and troubleshoot. The burden is increased configuration friction, though, so often you don't want to do it for your ephemeral deployments (e.g. dev environments, integrated test runners, temporary staging instances).
You can use tools like kustomize to keep your configuration as clean as possible for each deployment type. Only bring in the configurations for the stateful services when needed.
I feel like this is the "right" way for smaller teams to do K8s, assuming it's already a good fit for the application.
Has that changed? (It may well have, but once burned, twice shy and all that).
I've never had a problem with Postgres either in Docker or in k8s. Docker Compose local volumes, and k8s persistent volume claims work really well. But I'm no veteran at this so I can only speak for what little time I've used them.
The whole reason I do this is because it lets you put your entire stack in one config, and then spin up local dev environment or deploy to remote with zero changes. And that's really magical.
In production I don't use an in-cluster Postgres, and it's a bit of a pain in the ass to do so. I would rather use an in-cluster service, but the arguments you hear about being responsible in the event of a failure, and about the vendor assuming the work for HA and such, seem hard to refute.
Probably you could run Postgres in production k8s and be fine though. If I knew what I was doing I likely wouldn't be against it.
Why oh why did I ever leave silicone...
Here’s a different point to think about: is your use of Postgres resilient to network failures, communication errors or one-off issues? Sometimes you have to design for this at the application layer and assume things will go wrong some of the time...
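One common way to "design for this at the application layer" is to retry transient failures with exponential backoff and jitter. A rough sketch only: `TransientError` and `with_retries` are hypothetical names, not part of any Postgres driver, and in real code you would catch whatever connection/timeout exceptions your driver actually raises.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical stand-in for a dropped connection or a timeout."""


def with_retries(operation, attempts=5, base_delay=0.05, max_delay=1.0,
                 sleep=time.sleep):
    """Call operation(), retrying on TransientError with jittered backoff.

    Re-raises the last error once `attempts` calls have failed.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except TransientError:
            if attempt == attempts - 1:
                raise
            # Exponential backoff capped at max_delay, with full jitter so
            # many clients do not all retry at the same instant.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))
```

This is only safe to wrap around operations that are idempotent (or run inside a transaction that rolled back), otherwise a retry can apply the same write twice.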
As with anything, it could vary with your particular workload... but if I knew my very-stable-yet-cloud-hosted copy of Postgres wasn’t configured with high availability, well, you might have local performance and no update lag, but you also have a lot of risk of downtime and data loss if it goes down or gets corrupted. The advantage of cloud storage is not having to replay as much WAL: you just reattach the disk the instance was using before it went down, initialize as if PG had just crashed, and keep going... even regular disks have failures, after all...
A container is just a collection of namespaces.
We transitioned to k8s with PG and other data stores in the cluster, specifically RabbitMQ and Mongo, which runs surprisingly well in k8s. In any case, after the whole adoption period and a great deal of automation work against the k8s APIs, we were able to get new dev environment provisioning down to 90 seconds.
There was clearly some pent up demand for development resources as we went from a few dev environments to roughly 30 in one month's time.
Following that, the team added the ability to "clone" any environment including production ones, that is, the whole data set and configuration was replicated into a new environment. One could also replicate data streaming into this new environment, essentially having an identical instance of a production service with incoming data.
This was a huge benefit for development and testing and further drove demand for environments. If a customer had a bug or an issue, a developer could fire up a new environment with a fix branch, test the fix on the same data and config, and then commit that back to master making its way into production.
These are the benefits of running data stores governed by one's primary orchestration framework. Sounds less threatening when put that way, eh?
This particular database was not a system of record. The database stored the results of a stream processing system. In the event of a total loss of data, the database could be recovered by re-streaming the original data, making the operation of PG in a Kubernetes cluster a fairly low-risk endeavor. As such, HA was not implemented.
This setup has been running in production for over two years now. In spite of having no HA, each instance of this application backed by this DB had somewhere between four and five nines of availability while being continuously monitored on one minute intervals from some other spot on the Internet.
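For a sense of scale, here is the arithmetic behind "four and five nines", as a small sketch. It assumes a 365-day year and that "N nines" means an availability of 1 - 10^-N; the function name is just for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a non-leap year


def allowed_downtime_minutes(nines: int) -> float:
    """Yearly downtime budget for an availability of `nines` nines."""
    unavailable_fraction = 10.0 ** -nines
    return MINUTES_PER_YEAR * unavailable_fraction


if __name__ == "__main__":
    for n in (3, 4, 5):
        print(f"{n} nines: {allowed_downtime_minutes(n):.2f} minutes/year")
```

Four nines works out to roughly 52.6 minutes of downtime a year and five nines to roughly 5.3 minutes, so with probes on one-minute intervals that is somewhere between about five and about fifty failed checks per year.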
During the course of my tenure, there was only one data loss incident in which an engineer mistakenly dropped a table. Recovery was painless.
I've since moved on to another role and can't imagine having to run a database without having the benefits of Kubernetes. I'm forced to, however, as some folks aren't as progressive, and damn does it feel archaic.
kubectl exec, attach and cp all make this trivial. Whatever type of inspection you want should be relatively doable.
Putting stuff in kubernetes also lets you take advantage of the ecosystem, including network policies, monitoring and alerting systems, health and liveness testing, load balancing, deployment and orchestration, etc.
This is true in video games (current job) and e-commerce (what became part of Oracle NetSuite).
Database CPU usage is negligible.
It was easier to put a replica on every host than try to rearchitect things to tolerate single-digit millisecond RTT to a cloud database.
Especially when you look at NVMe benchmarks using small files, they can be over 10 times faster than regular SSDs in those scenarios and have beefier firmware too.
I have a strong feeling that the reason databases crap out in containers is mainly that you're likely using a dynamically allocated volume with them. While your cloud provider's storage solution will handle all errors and redundancy for you, it won't provide consistent latency or even bandwidth. The hardware of the storage server will usually consist of old, worn-out SATA SSDs (which, despite being advertised at 250-500 MB/s, can drop to 10-20 MB/s under high load).
When you combine this with noisy neighbours all sharing the same pathetic storage bandwidth, then yeah, your database is gonna have trouble dealing with all the tiny errors in your I/O chain.
Whereas NVMe drives, especially ones local to the machine, running at 2000-4000 MB/s and over 200 MB/s even in the most demanding benchmarks, won't have any of these issues. Their firmware is usually a lot beefier and faster at dealing with bad cells.
This is probably half the reason why GraphQL exists. A single request can execute multiple queries asynchronously, and that's the default mode of operation, not something you have to explicitly work on. An extreme example: round-trip latency is 100ms, your GraphQL server is on the same node as the database, and it sends four SQL queries at the same time. The entire response arrives after 100ms.
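The 100ms example can be simulated with plain asyncio. This is a toy model, not GraphQL or a real database: `asyncio.sleep` stands in for the network round trip and for a query running next to the database, and the 1ms local query cost is an assumption picked for illustration.

```python
import asyncio
import time

RTT = 0.100          # assumed client <-> server round trip, in seconds
LOCAL_QUERY = 0.001  # assumed cost of one query on the database's own node


async def local_query(name: str) -> str:
    """A query issued from the same node as the database."""
    await asyncio.sleep(LOCAL_QUERY)
    return f"{name}-rows"


async def sequential(names):
    """Client sends each query itself, paying one round trip per query."""
    results = []
    for name in names:
        await asyncio.sleep(RTT)
        results.append(await local_query(name))
    return results


async def batched(names):
    """Client sends one request; the server fans out all queries at once."""
    await asyncio.sleep(RTT)
    return list(await asyncio.gather(*(local_query(n) for n in names)))


def timed(coro):
    """Run a coroutine to completion and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    names = ["users", "orders", "items", "prices"]
    for fn in (sequential, batched):
        _, elapsed = timed(fn(names))
        print(f"{fn.__name__}: {elapsed * 1000:.0f} ms")
```

In this model the sequential version pays the round trip four times while the batched version pays it once, which is the whole point of the example above.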