Hacker News new | past | comments | ask | show | jobs | submit login

Unless you have a really good shared storage, I don't see any advantage for running Postgres in Kubernetes. Everything is more complicated without any real benefit. You can't scale it up, you can't move pod. If pg fails to start for some reason, good luck jumping into container to inspect/debug stuff. I am neither going to upgrade PG every 2 weeks nor it is my fresh new microservice that needs to be restarted when it crashes or scaled up when I need more performances. And PG has high availability solution which kind of orthogonal to what k8s offers.

One could argue that for sake of consistency you could run PG in K8S, but that is just hammer & nail argument for me.

But if you have a really good shared storage, then it is worth considering. But, I still don't know if any network attached storage can beat local attached RAID of Solid state disks in terms of performance and/or latency. And there is/was fsync bug, which is terrible in combination with somewhat unreliable network storage.

For me, I see any database the same way I see etcd and other components of k8s masters: they are the backbone. And inside cluster I run my apps/microservices. This apps are subject to frequent change and upgrades and thus profit most from having automatic recovery, failover, (auto)scaling, etc.

You don't run shared/network storage. You run PVCs on local storage and you run an HA setup. You ship log files every 10s/30s/1m to object storage. You test your backups regularly which k8s is great for.

All of this means that I don't worry about most things you mention. PG upgrade? Failover and upgrade the pod. Upgrade fails? wal-g clone from object storage and rejoin cluster. Scaling up? Adjust the resource claims. If resource claims necessitate node migration, see failover scenario. It's so resilient. And this is all with raid 10 nvme direct attached storage just as fast as any other setup.

You mention etcd but people don't run etcd the way you're describing postgres. You run a redundant cluster that can achieve quorum and tolerate losses. If you follow that paradigm, you end up with postgres on k8s.

For dummys like me who are interested but not familiar with infra.

PVC = Persistent Volume Claim HA = High avalability PG = Postgres

It sounds pretty simple and enticing. Any problems convincing the team of this route?

Speaking from experience..

This is fine if you have a small single node database that can have some downtime. Once you need replicas, to fit your database in ram for performance, or reliability with a hot standby for failovers it becomes a lot more complicated.

You should consider what (if anything) you miss out on by just running in a single vm that you can scale out much easier down the road should you need to. Alternatively pay extra for a hosted solution that simplifies those steps further.

> single vm that you can scale out much easier

I’m not sure how experience could lead you to this conclusion. This wouldn’t work for any of our production needs.

Historically speaking, convincing anyone to run data stores on k8s is a giant pain. This isn't helped by all the mythology around "k8s is great for stateless" further promoted by folks like Kelsey Hightower who firmly believes that one should rely on the cloud provider's data stores.

This is great for the cloud provider as those are high margin services, I for one would rather have a single orchestration API that I have to interact with, that being the one offered by k8s. All the benefits of running workloads in k8s apply equally to data stores. Cattle, not pets. If your data store needs coddling then it's a pet and you're doing it wrong.

How big are your DBs in terms of size?

50gb - multi TB

The nice thing about running a DB inside a cluster is running your entire application, end to end, through one unified declarative model. It's really really easy to spin up a brand new dev or staging environment.

generally though in production, you're not going to be taking down DBs on purpose. If it's not supposed to be ephemeral, it doesn't fit the model

We run dev environments and some staging entirely inside Kubernetes, but in prod we run all of our stateful components outside K8s (generally using things like RDS and ElastiCache).

Anecdotally, keeping stateful components outside of K8s makes running your cluster and application so much simpler and it is much easier to maintain and troubleshoot. The burden is increased configuration friction though, so often you don't want to do it for your ephemeral deployments (eg. dev environments, integrated test runners, temporary staging instances).

You can use tools like kustomize to keep your configuration as clean as possible for each deployment type. Only bring in the configurations for the stateful services when needed.

I feel like this is the "right" way for smaller teams to do K8s, assuming it's already a good fit for the application.

My biases are probably 5+ years old, but it used to be that running PostGres on anything other than bare metal (for example, running it in Docker) was fraught with all sorts of data/file/io sync issues that you might not be able to recover from. So, I just got used to running the databases on metal (graph, postgres, columnar) with whatever reliability schemes myself and leaving docker and k8s outside of that.

Has that changed? (It may well have, but once burned, twice shy and all that).

My "run-of-the-mill" setup for local toy apps is Postgres + [some API service] in Docker Compose, and for non-trivial apps (IE more than 3-4 services) usually Postgres in k8s.

I've never had a problem with Postgres either in Docker or in k8s. Docker Compose local volumes, and k8s persistent volume claims work really well. But I'm no veteran at this so I can only speak for what little time I've used them.

The whole reason I do this is because it lets you put your entire stack in one config, and then spin up local dev environment or deploy to remote with zero changes. And that's really magical.

In production I don't use an in-cluster Postgres, and it's a bit of a pain in the ass to do so. I would rather use an in-cluster service, but the arguments you hear about being responsible in the event of a failure, and the vendor assuming the work for HA and such seems hard to refute.

Probably you could run Postgres in production k8s and be fine though. If I knew what I was doing I likely wouldn't be against it.

Ha! That's probably the problem - I don't know what I am doing.

Why oh why did I ever leave silicone...

I might have lost count but I think they reimplemented the file storage twice in local Docker installs and twice in Kubernetes... it’s at a point now where if you trust cloud network storage performance guarantees, you can trust Postgres with it. Kubernetes and Docker don’t change how the bits are saved, but if you insist on data locality to a node, you lose the resilience described here for moving to another node.

Here’s a different point to think about: is your use of Postgres resilient to network failures, communication errors or one-off issues? Sometimes you have to design for this at the application layer and assume things will go wrong some of the time...

As with anything, it could vary with your particular workload... but if I knew my very-stable-yet-cloud-hosted copy of Postgres wasn’t configured with high availability, well, you might have local performance and no update lag but you also have a lot of risk of downtime and data loss if it goes down or gets corrupted. The advantage to cloud storage is not having to read in as many WAL logs, and just reconnect the old disk before the instance went down, initialize as if PG had just crashed, and keep going... even regular disks have failures after all...

That's great to know. Thank you. The stuff in your 3rd paragraph is something I have always tried to do, but it largely depended on the largely reliable single node instances. I tend to (currently) handle single disk failures with master-slave configs locally, and it has worked at my volumes to date, but I am trying to learn how to grow without getting increasingly (too) complicated or even more brittle.

Slack, YouTube, and lots of huge tech companies use Vitess for extreme scale in mysql it’s started to move towards being much less of a headache than dealing with sharding and proxies

Assuming you run kubelet/Docker on bare metal, containers will of course run on bare metal as well.

A container is just a collection of namespaces.

There's only network overhead when running Postgres in Docker container (and obv data is mounted, not ephemeral), but it's a performance hit. This in itself can be enough reason not to run pg in Docker/k8s in production.

It hasn't, if your DB is sufficiently large, or under a lot of writes, the shutdown time will be larger than the docker timeout before killing pg. Data loss occurs in that case.

In my previous gig (2017 timeframe), we moved from AWS instances with terraform plus chef which took a couple hours of focused engineer time to generate a new dev environment. This was after some degree of automation, but there were still a great deal of steps one had to get through and if you wanted to automate all that how would you glue it all together? Bash scripts?

We transition to k8s, with PG and other data stores in cluster, specifically RabbitMQ, and Mongo, which runs surprisingly well in k8s. In any case, after the whole adoption period and a great deal of automation work against the k8s APIs, we were able to get new dev environment provisioning down to 90 seconds.

There was clearly some pent up demand for development resources as we went from a few dev environments to roughly 30 in one month's time.

Following that, the team added the ability to "clone" any environment including production ones, that is, the whole data set and configuration was replicated into a new environment. One could also replicate data streaming into this new environment, essentially having an identical instance of a production service with incoming data.

This was a huge benefit for development and testing and further drove demand for environments. If a customer had a bug or an issue, a developer could fire up a new environment with a fix branch, test the fix on the same data and config, and then commit that back to master making its way into production.

These are the benefits of running data stores governed by one's primary orchestration framework. Sounds less threatening when put that way, eh?

I ran ~200 instances of Postgres in production for a SaaS product. This was on top of GCP persistent disk, which qualifies as quite good network storage, all of it backed up by what is now called Velero.

This particular database was not a system of record. The database stored the results of a stream processing system. In the event of a total loss of data, the database could be recovered by re-streaming the original data, making the operation of PG in a kubernetes cluster a fairly low risk endeavor. As such, HA was not implemented.

This setup has been running in production for over two years now. In spite of having no HA, each instance of this application backed by this DB had somewhere between four and five nines of availability while being continuously monitored on one minute intervals from some other spot on the Internet.

During the course of my tenure, there was only one data loss incident in which an engineer mistakenly dropped a table. Recovery was painless.

I've since moved on to another role and can't imagine having to run a database without having the benefits of Kubernetes. I'm forced to, however, as some folks aren't as progressive, and damn does it feel archaic.

GCP volumes are over network already. You can deploy stateful workloads using StatefulSets. We've run an HBase workloads for development purposes (about 20-30x cheaper than BigTable) and it worked great (no issues for over 12 months). While Postgres is hardly a distributed database, there may be some advantages to ensure availability and perhaps even more in replicated setup.

Why are you comparing HBase with Postgres? They are very different technologies with completely different architectural constraints?

They both require persistence semantics. I forgot to mention, I was referring to a single node, no-HDFS setup for HBase, solely relying on Kubernetes StatefulSets for data availability, in the simplified persistene sense, not much different than a single Posgres server.

Under those conditions I'd agree theres likely a high overlap of similarity. I have a use case on my home playpen that would likely be served well by that. Thanks for describing your approach.

> If pg fails to start for some reason, good luck jumping into container to inspect/debug stuff.

kubectl exec, attach and cp all make this trivial. Whatever type of inspection you want should be relatively doable.

Putting stuff in kubernetes also lets you take advantage of the ecosystem, including network policies, monitoring and alerting systems, health and liveness testing, load balancing, deployment and orchestration, etc.

Ime most pg deployments don’t need insanely high iops and become cpu bound much quicker. So running ebs or gcp pd ssd or even ceph pd is usually enough.

this is very contra to my own experience, we're IOPS bound far more than we're CPU bound.

This is true in Video Games (current job) and e-commerce (what became part of Oracle Netsuite)

IOPS limitations and network latency are the reason I want to run my Postgres replicas in k8s. Every machine gets NVMe and a replica instance. They're still cattle, but they can serve a lot of requests.

Database CPU usage is negligible.

It was easier to put a replica on every host than try to rearchitect things to tolerate single-digit millisecond RTT to a cloud database.

I think you just hit upon the most overlooked factor here, storage hardware. You can get an order of magnitude of benefit from moving to NVMe.

Especially when you look at nvme benchmarks using small files , they can be over 10 times faster than regular SSDs in those scenarios and have beefier firmare too.

I have a strong feeling that the reason databases crap out in containers is mainly because you're likely using a dynamically allocated volume with them. While your cloud providers storage solution will handle their all errors and redundancy for you, it wont provide consistent latency or even bandwidth. The hardware of the storage server will usually consist of old worn out SATA ssds ( which albeit being advertised as 250-500mbps, can drop down to 10-20 mbps in high loads).

When combine this along with noisy neighbours all sharing the same same pathetic storage bandwidth then yeah your database is gonna have trouble dealing with all the tiny errors in your I/O chain.

Whereas nvmes, especially ones that are local to the machine, running at 2000-4000 Mbps, and over 200mbps even at the most demanding benchmarks, wont have any of the issues. Their firmware is usually a lot befier and faster at dealing with bad cells.

This is the thing that keeps my databases off the cloud. No amount of managed database convenience can convince me to give up the performance of hard metal with modern NVMe SSDs. And if you really want that kind of performance in the cloud, the cloud providers are just barely starting to catch up, and it's gonna cost you dearly. I've seen managed database setups costing $3-4k per month that could be had with an $8k server.

One thing I dislike about SQL databases is the synchronous protocol. It really feels like I should be able to send a batch of queries all at once and then process the response in bulk instead of waiting for each individual query to respond incurring a round trip every single time. You can obviously rewrite your queries so that they fetch everything upfront in one single huge query but that is a lot of work and the end result isn't straight forward code that everyone will understand.

This is probably half the reason why GraphQL exists. Your request can execute multiple queries asynchronously and it's the default mode of operation, not something you have to explicitly work on. Extreme example. Round trip latency is 100ms. Your GraphQL server is on the same node as the database and sends four SQL queries at the same time. The entire response arrives after 100ms.

How normalized your tables and do you use indexes? I’m guessing if it’s really denormalized kv style schema it will be more heavy on io. Same if it’s really write heavy

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact