Running Postgres in Kubernetes [pdf] (sched.com)
232 points by craigkerstiens on June 29, 2020 | 100 comments

Unless you have really good shared storage, I don't see any advantage to running Postgres in Kubernetes. Everything is more complicated without any real benefit. You can't scale it up, you can't move the pod. If pg fails to start for some reason, good luck jumping into the container to inspect/debug stuff. I am not going to upgrade PG every 2 weeks, nor is it a fresh new microservice that needs to be restarted when it crashes or scaled up when I need more performance. And PG has its own high availability solutions, which are kind of orthogonal to what k8s offers.

One could argue that for the sake of consistency you could run PG in K8S, but that is just a hammer & nail argument for me.

But if you have really good shared storage, then it is worth considering. Still, I don't know if any network-attached storage can beat a locally attached RAID of solid-state disks in terms of performance and/or latency. And there is/was the fsync bug, which is terrible in combination with somewhat unreliable network storage.

For me, I see any database the same way I see etcd and other components of k8s masters: they are the backbone. And inside the cluster I run my apps/microservices. These apps are subject to frequent change and upgrades and thus profit most from having automatic recovery, failover, (auto)scaling, etc.

You don't run shared/network storage. You run PVCs on local storage and you run an HA setup. You ship log files every 10s/30s/1m to object storage. You test your backups regularly which k8s is great for.

All of this means that I don't worry about most things you mention. PG upgrade? Failover and upgrade the pod. Upgrade fails? wal-g clone from object storage and rejoin cluster. Scaling up? Adjust the resource claims. If resource claims necessitate node migration, see failover scenario. It's so resilient. And this is all with raid 10 nvme direct attached storage just as fast as any other setup.

You mention etcd but people don't run etcd the way you're describing postgres. You run a redundant cluster that can achieve quorum and tolerate losses. If you follow that paradigm, you end up with postgres on k8s.
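
To put numbers on "achieve quorum and tolerate losses": in a majority-quorum system such as etcd, a cluster of n members needs floor(n/2) + 1 of them to agree, so it survives losing the rest. A minimal sketch of that arithmetic (the function names are made up for illustration, not taken from any library):

```python
def quorum_size(members: int) -> int:
    """Smallest number of members that forms a strict majority."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """How many members can be lost while a majority can still be formed."""
    return members - quorum_size(members)


for n in (1, 2, 3, 4, 5):
    print(f"{n} members: quorum={quorum_size(n)}, "
          f"tolerated failures={tolerated_failures(n)}")
```

This is why three or five members are the usual choices: going from three to four members raises the quorum size without tolerating any additional failure.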

It sounds pretty simple and enticing. Any problems convincing the team of this route?

Speaking from experience..

This is fine if you have a small single node database that can have some downtime. Once you need replicas, to fit your database in ram for performance, or reliability with a hot standby for failovers it becomes a lot more complicated.

You should consider what (if anything) you miss out on by just running in a single vm that you can scale out much easier down the road should you need to. Alternatively pay extra for a hosted solution that simplifies those steps further.

> single vm that you can scale out much easier

I’m not sure how experience could lead you to this conclusion. This wouldn’t work for any of our production needs.

Historically speaking, convincing anyone to run data stores on k8s is a giant pain. This isn't helped by all the mythology around "k8s is great for stateless" further promoted by folks like Kelsey Hightower who firmly believes that one should rely on the cloud provider's data stores.

This is great for the cloud provider, as those are high-margin services. I, for one, would rather have a single orchestration API that I have to interact with, that being the one offered by k8s. All the benefits of running workloads in k8s apply equally to data stores. Cattle, not pets. If your data store needs coddling then it's a pet and you're doing it wrong.

For dummies like me who are interested but not familiar with infra:

PVC = Persistent Volume Claim

HA = High Availability

PG = Postgres

How big are your DBs in terms of size?

50gb - multi TB

The nice thing about running a DB inside a cluster is running your entire application, end to end, through one unified declarative model. It's really really easy to spin up a brand new dev or staging environment.

generally though in production, you're not going to be taking down DBs on purpose. If it's not supposed to be ephemeral, it doesn't fit the model

We run dev environments and some staging entirely inside Kubernetes, but in prod we run all of our stateful components outside K8s (generally using things like RDS and ElastiCache).

Anecdotally, keeping stateful components outside of K8s makes running your cluster and application so much simpler and it is much easier to maintain and troubleshoot. The burden is increased configuration friction though, so often you don't want to do it for your ephemeral deployments (eg. dev environments, integrated test runners, temporary staging instances).

You can use tools like kustomize to keep your configuration as clean as possible for each deployment type. Only bring in the configurations for the stateful services when needed.

I feel like this is the "right" way for smaller teams to do K8s, assuming it's already a good fit for the application.

My biases are probably 5+ years old, but it used to be that running Postgres on anything other than bare metal (for example, running it in Docker) was fraught with all sorts of data/file/IO sync issues that you might not be able to recover from. So I just got used to running the databases (graph, Postgres, columnar) on metal, handling whatever reliability schemes were needed myself, and leaving Docker and k8s out of it.

Has that changed? (It may well have, but once burned, twice shy and all that).

My "run-of-the-mill" setup for local toy apps is Postgres + [some API service] in Docker Compose, and for non-trivial apps (IE more than 3-4 services) usually Postgres in k8s.

I've never had a problem with Postgres either in Docker or in k8s. Docker Compose local volumes, and k8s persistent volume claims work really well. But I'm no veteran at this so I can only speak for what little time I've used them.

The whole reason I do this is because it lets you put your entire stack in one config, and then spin up local dev environment or deploy to remote with zero changes. And that's really magical.

In production I don't use an in-cluster Postgres, and it's a bit of a pain in the ass to do so. I would rather use an in-cluster service, but the arguments you hear about being responsible in the event of a failure, and the vendor assuming the work for HA and such, seem hard to refute.

Probably you could run Postgres in production k8s and be fine though. If I knew what I was doing I likely wouldn't be against it.

Ha! That's probably the problem - I don't know what I am doing.

Why oh why did I ever leave silicone...

I might have lost count but I think they reimplemented the file storage twice in local Docker installs and twice in Kubernetes... it’s at a point now where if you trust cloud network storage performance guarantees, you can trust Postgres with it. Kubernetes and Docker don’t change how the bits are saved, but if you insist on data locality to a node, you lose the resilience described here for moving to another node.

Here’s a different point to think about: is your use of Postgres resilient to network failures, communication errors or one-off issues? Sometimes you have to design for this at the application layer and assume things will go wrong some of the time...

As with anything, it could vary with your particular workload... but if I knew my very-stable-yet-cloud-hosted copy of Postgres wasn’t configured with high availability, well, you might have local performance and no update lag but you also have a lot of risk of downtime and data loss if it goes down or gets corrupted. The advantage to cloud storage is not having to read in as many WAL logs, and just reconnect the old disk before the instance went down, initialize as if PG had just crashed, and keep going... even regular disks have failures after all...

That's great to know. Thank you. The stuff in your 3rd paragraph is something I have always tried to do, but it largely depended on the largely reliable single node instances. I tend to (currently) handle single disk failures with master-slave configs locally, and it has worked at my volumes to date, but I am trying to learn how to grow without getting increasingly (too) complicated or even more brittle.

Slack, YouTube, and lots of huge tech companies use Vitess for extreme scale in MySQL. It’s started to move towards being much less of a headache than dealing with sharding and proxies.

Assuming you run kubelet/Docker on bare metal, containers will of course run on bare metal as well.

A container is just a collection of namespaces.

It hasn't. If your DB is sufficiently large, or under a lot of writes, the shutdown time will be longer than the Docker timeout before it kills pg. Data loss occurs in that case.

There's only network overhead when running Postgres in a Docker container (and obviously the data is mounted, not ephemeral), but it's a performance hit. This in itself can be enough reason not to run pg in Docker/k8s in production.

In my previous gig (2017 timeframe), we moved from AWS instances with terraform plus chef which took a couple hours of focused engineer time to generate a new dev environment. This was after some degree of automation, but there were still a great deal of steps one had to get through and if you wanted to automate all that how would you glue it all together? Bash scripts?

We transitioned to k8s, with PG and other data stores in cluster, specifically RabbitMQ and Mongo, which run surprisingly well in k8s. In any case, after the whole adoption period and a great deal of automation work against the k8s APIs, we were able to get new dev environment provisioning down to 90 seconds.

There was clearly some pent up demand for development resources as we went from a few dev environments to roughly 30 in one month's time.

Following that, the team added the ability to "clone" any environment including production ones, that is, the whole data set and configuration was replicated into a new environment. One could also replicate data streaming into this new environment, essentially having an identical instance of a production service with incoming data.

This was a huge benefit for development and testing and further drove demand for environments. If a customer had a bug or an issue, a developer could fire up a new environment with a fix branch, test the fix on the same data and config, and then commit that back to master making its way into production.

These are the benefits of running data stores governed by one's primary orchestration framework. Sounds less threatening when put that way, eh?

I ran ~200 instances of Postgres in production for a SaaS product. This was on top of GCP persistent disk, which qualifies as quite good network storage, all of it backed up by what is now called Velero.

This particular database was not a system of record. The database stored the results of a stream processing system. In the event of a total loss of data, the database could be recovered by re-streaming the original data, making the operation of PG in a kubernetes cluster a fairly low risk endeavor. As such, HA was not implemented.

This setup has been running in production for over two years now. In spite of having no HA, each instance of this application backed by this DB had somewhere between four and five nines of availability while being continuously monitored on one minute intervals from some other spot on the Internet.
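
For a sense of scale, the downtime budget implied by "four and five nines" is simple arithmetic. A rough sketch, assuming a 365-day year (the function name is illustrative):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def allowed_downtime_minutes(nines: int) -> float:
    """Yearly downtime budget for an availability of `nines` nines.

    Four nines means 99.99% available, i.e. 10**-4 of the year unavailable.
    """
    unavailability = 10 ** (-nines)
    return MINUTES_PER_YEAR * unavailability


for nines in (4, 5):
    minutes = allowed_downtime_minutes(nines)
    print(f"{nines} nines: about {minutes:.1f} minutes of downtime per year")
```

That works out to roughly 52.6 minutes per year at four nines and about 5.3 minutes at five nines, so with one-minute probes, five nines leaves room for only about five failed checks a year.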

During the course of my tenure, there was only one data loss incident in which an engineer mistakenly dropped a table. Recovery was painless.

I've since moved on to another role and can't imagine having to run a database without having the benefits of Kubernetes. I'm forced to, however, as some folks aren't as progressive, and damn does it feel archaic.

GCP volumes are over the network already. You can deploy stateful workloads using StatefulSets. We've run HBase workloads for development purposes (about 20-30x cheaper than BigTable) and it worked great (no issues for over 12 months). While Postgres is hardly a distributed database, there may be some advantages in ensuring availability, and perhaps even more in a replicated setup.

Why are you comparing HBase with Postgres? They are very different technologies with completely different architectural constraints?

They both require persistence semantics. I forgot to mention, I was referring to a single node, no-HDFS setup for HBase, solely relying on Kubernetes StatefulSets for data availability, in the simplified persistence sense, not much different from a single Postgres server.

Under those conditions I'd agree there's likely a lot of overlap. I have a use case on my home playpen that would likely be served well by that. Thanks for describing your approach.

> If pg fails to start for some reason, good luck jumping into container to inspect/debug stuff.

kubectl exec, attach and cp all make this trivial. Whatever type of inspection you want should be relatively doable.

Putting stuff in kubernetes also lets you take advantage of the ecosystem, including network policies, monitoring and alerting systems, health and liveness testing, load balancing, deployment and orchestration, etc.

Ime most pg deployments don’t need insanely high iops and become cpu bound much quicker. So running ebs or gcp pd ssd or even ceph pd is usually enough.

this is very contra to my own experience, we're IOPS bound far more than we're CPU bound.

This is true in Video Games (current job) and e-commerce (what became part of Oracle Netsuite)

IOPS limitations and network latency are the reason I want to run my Postgres replicas in k8s. Every machine gets NVMe and a replica instance. They're still cattle, but they can serve a lot of requests.

Database CPU usage is negligible.

It was easier to put a replica on every host than try to rearchitect things to tolerate single-digit millisecond RTT to a cloud database.

I think you just hit upon the most overlooked factor here, storage hardware. You can get an order of magnitude of benefit from moving to NVMe.

Especially when you look at NVMe benchmarks using small files, they can be over 10 times faster than regular SSDs in those scenarios, and they have beefier firmware too.

I have a strong feeling that the reason databases crap out in containers is mainly because you're likely using a dynamically allocated volume with them. While your cloud provider's storage solution will handle all errors and redundancy for you, it won't provide consistent latency or even bandwidth. The hardware of the storage server will usually consist of old, worn-out SATA SSDs (which, despite being advertised at 250-500 MB/s, can drop down to 10-20 MB/s under high load).

When you combine this with noisy neighbours all sharing the same pathetic storage bandwidth, then yeah, your database is gonna have trouble dealing with all the tiny errors in your I/O chain.

Whereas NVMe drives, especially ones that are local to the machine, running at 2000-4000 MB/s, and over 200 MB/s even in the most demanding benchmarks, won't have any of these issues. Their firmware is usually a lot beefier and faster at dealing with bad cells.

This is the thing that keeps my databases off the cloud. No amount of managed database convenience can convince me to give up the performance of hard metal with modern NVMe SSDs. And if you really want that kind of performance in the cloud, the cloud providers are just barely starting to catch up, and it's gonna cost you dearly. I've seen managed database setups costing $3-4k per month that could be had with an $8k server.
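
The cost claim is easy to sanity-check. A back-of-the-envelope sketch using only the figures quoted above ($3-4k per month managed, $8k for a server); it deliberately ignores colocation, power, bandwidth, spare hardware and staff time, all of which would push the break-even point later:

```python
def break_even_months(server_cost: float, managed_monthly_cost: float) -> float:
    """Months of managed-service fees that add up to the one-off server price."""
    if managed_monthly_cost <= 0:
        raise ValueError("monthly cost must be positive")
    return server_cost / managed_monthly_cost


SERVER_COST = 8_000  # one-off hardware price quoted in the comment

for monthly in (3_000, 4_000):  # managed setup, per month
    months = break_even_months(SERVER_COST, monthly)
    print(f"${monthly}/month managed: hardware pays for itself in {months:.1f} months")
```

On those numbers alone the hardware is paid off in two to three months; whether that survives the ignored costs depends entirely on who is operating the machine.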

One thing I dislike about SQL databases is the synchronous protocol. It really feels like I should be able to send a batch of queries all at once and then process the response in bulk instead of waiting for each individual query to respond incurring a round trip every single time. You can obviously rewrite your queries so that they fetch everything upfront in one single huge query but that is a lot of work and the end result isn't straight forward code that everyone will understand.

This is probably half the reason why GraphQL exists. Your request can execute multiple queries asynchronously and it's the default mode of operation, not something you have to explicitly work on. Extreme example. Round trip latency is 100ms. Your GraphQL server is on the same node as the database and sends four SQL queries at the same time. The entire response arrives after 100ms.
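
The latency arithmetic in that example can be simulated without a database. The sketch below stands in for each query with a 100 ms sleep (`fake_query` is a placeholder, not a real driver call) and compares issuing four queries one after another with issuing them at the same time:

```python
import asyncio
import time

ROUND_TRIP_SECONDS = 0.1  # the 100 ms round trip from the example above
QUERIES = ["users", "orders", "invoices", "settings"]  # hypothetical names


async def fake_query(name: str) -> str:
    """Pretend to run a query: all of the cost is one network round trip."""
    await asyncio.sleep(ROUND_TRIP_SECONDS)
    return f"rows from {name}"


async def run_sequentially(names):
    # Each query waits for the previous one to return before it is sent.
    return [await fake_query(name) for name in names]


async def run_concurrently(names):
    # All queries are in flight at once; results keep the input order.
    return list(await asyncio.gather(*(fake_query(name) for name in names)))


def timed(coroutine):
    """Run a coroutine to completion and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coroutine)
    return result, time.perf_counter() - start


sequential_rows, sequential_elapsed = timed(run_sequentially(QUERIES))
concurrent_rows, concurrent_elapsed = timed(run_concurrently(QUERIES))
print(f"sequential: {sequential_elapsed:.2f}s, concurrent: {concurrent_elapsed:.2f}s")
```

Sequentially the four round trips add up to about 400 ms; issued together they finish in about 100 ms. With a real database the concurrent version would need several connections or some form of protocol-level pipelining, which this simulation does not model.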

How normalized are your tables, and do you use indexes? I’m guessing if it’s a really denormalized kv-style schema it will be heavier on IO. Same if it’s really write heavy.

That looks very interesting and super complex.

I wonder how many companies really need this complexity. I bet 99.99% of companies could get away with scaling writes vertically and scaling read-only replicas horizontally, which would reduce the number of moving parts a lot.
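
As a rough illustration of the "one writer, many read replicas" layout, here is a deliberately naive router. The host names are hypothetical, and classifying statements by their first keyword is an assumption made for brevity: it ignores things like `SELECT ... FOR UPDATE`, functions with side effects, and replication lag.

```python
from itertools import cycle


class ReadWriteRouter:
    """Send writes to the single primary and spread reads over replicas."""

    def __init__(self, primary, replicas):
        self.primary = primary
        # With no replicas configured, reads simply go to the primary too.
        self._read_targets = cycle(list(replicas) or [primary])

    def target_for(self, statement: str) -> str:
        words = statement.split(None, 1)
        # Naive assumption: only statements starting with SELECT are reads.
        is_read = bool(words) and words[0].upper() == "SELECT"
        return next(self._read_targets) if is_read else self.primary


router = ReadWriteRouter("pg-primary", ["pg-replica-1", "pg-replica-2"])
for sql in ("SELECT 1", "INSERT INTO t VALUES (1)", "select 2", "SELECT 3"):
    print(f"{sql!r} -> {router.target_for(sql)}")
```

Writes scale only as far as the one primary machine does, while read capacity grows by adding entries to the replica list.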

I have yet to play much with kubernetes but when I see those diagrams it just baffles me how people are OK with running so much complexity in their technical stack.

I generally work with smaller companies, but early on (Kubernetes 1.4 ish) I found that hosting mission-critical stateful services inside Kubernetes was more trouble than it was worth. I now run stand-alone Postgres instances in which each service has its own DB. I’ve found this very reliable.

That being said, I think Kubernetes now has much better support for this kind of thing. But given my method has been so stable, I just keep on going with it.

> stateful services

yeah either these services support natively partitioning, fail over and self recovery or you have to be extremely careful not to cause any eviction or agent crash ever.

even something born for the cloud like cockroachdb can fail in interesting ways if the load order varies and you can't just autoscale it because every new node has to be nudged into action with a manual cluster-init, and draining nodes after the peak means manually telling the cluster not to wait for the node to come alive ever again for each node, wait for the repartitioning and then repeat for as many nodes as you need to scale back

This is the kind of work an operator is supposed to manage, just like if one were dealing with standard HA deployments of any stateful service that doesn’t ship with built-in orchestration (like PostgreSQL).

I've come to the conclusion that, much like how purchasing decisions seem irrational until you realize that different kinds of purchases come out of different budgets, there are different "complexity budgets" or "ongoing operational maintenance burden" budgets in an organization, and some are tighter than others.

It actually is not that complex. I'm using Crunchy Postgres Operator at my current employer. You get an Ansible playbook to install an Operator inside Kubernetes, and after that you get a command-line administration tool that lets you create a cluster with a simple

pgo create cluster <cluster_name>


Most administrative tasks like creating or restoring backups (which can be automatically pushed to S3) are just one or two pgo commands.

The linked pdf looks complex, because it:

a. compares 3 different operators

b. goes into implementation details that most users are shielded from.

And I'm actually not sure which one of the three operators the author is recommending :)

Not the author of the slides, but know him well. A number of things to chime in on. First, thanks for the kind words on the Crunchy operator.

Second, on the earlier question higher in the thread about why you would choose to run a database in K8s. In my experience and from what I've observed, it's not so much that you choose explicitly to run a database in K8s. Instead you've decided on K8s as your orchestration layer for a lot of workloads and it's become your standardized mechanism for deploying and managing apps. In some sense it's more that it's the standard deployment mechanism than anything else.

If you're running and managing a single Postgres database and don't have any K8s anywhere setup, I can't say I'd recommend going all in on K8s just for that. That said if you are using it then going with one of the existing operators is going to save you a lot.

I agree that the k8s ecosystem isn't quite as complex as it seems at first, but specifically running stateful apps does come pretty close to earning the bad reputation.

(Disclaimer: I've tried and failed several times to get pgsql up and running in k8s with and without operators, so that either makes me unqualified to discuss this, or perfectly qualified to discuss this)

If the operator were simple enough to be installed/uninstalled via a helm chart that Just Worked, I'd feel better about the complexity. But running a complicated, non-deterministic ansible playbook scares me. The other options (installing a pgo installer, or installing an installer to your cluster) are no better.

Also, configuring the operator is more complicated than it should be. Devs and sysadmins alike are used to `brew install postgresql-server` or `apt install postgresql-server` working just fine for 99% of use cases. I'll grant that it's not apples-to-apples since HA pgsql has never been easy, but if the sales pitch is that any superpower-less k8s admin can now run postgres, I think the manual should be shorter.

I run multi terabyte, billions of rows HA postgres in kubernetes using a helm chart and Patroni (baked into the chart), which uses the native k8s API for automatic failover and pgbackrest for seamlessly provisioning new replicas. It's a single helm chart and is by far the easiest DBA I've ever done in many years.

I realise this is possibly asking you to give away secret sauce, but is this written up anywhere? Having an example to point at to be able to say "Look, this isn't scary, we can contemplate retiring that nasty lump of tin underneath that Oracle instance after all" would be quite a valuable contribution.

agreed re: configuring the operator. cockroach labs (full disclosure: I'm an employee) are building a HA pgsql alternative that Just Works with k8s to solve exactly this problem: https://www.cockroachlabs.com/blog/kubernetes-orchestrate-sq...

spencer kimball and alex polvi deploy a scalable stateful application on cockroachdb and k8s in 3 min: https://www.youtube.com/watch?v=PIePIsskhrw

Usually things become complex once something isn't going as planned. Like if your database slows down because the pods get scheduled on a weird node with some noisy neighbour, your backups failed because the node went down or other more hidden issues that take a lot longer to debug compared to some Postgres running on a normal compute instance somewhere.

It's just additional layers to dig through if something goes wrong, if everything works even the most complex systems are nice to operate so I wouldn't call it less complex just because someone wrote a nice wrapper for the happy path.

Building systems on top of complexity doesn't shield anyone from it. The author acknowledges this explicitly:

> High Effort - Running anything in Kubernetes is complex, and databases are worse

By definition, it's more stuff you need to know.

Even if the K8s operator saves time for 95% of the use cases, the last 5% is required. For instance, how do these operators handle upgrading extensions that require on-disk changes? Can you upgrade them concurrently with major version PG upgrades? When the operator doesn't provide a command line admin tool that fits your needs, how do you proceed?

Crunchy PGO is super cool but I'm not sure how we got to the idea that it's not that complex compared to a managed service like RDS.

Coming from someone at Crunchy I don't disagree on the notion of a managed service being easier than running/managing it yourself inside Kubernetes. Clicking a button and having things taken care of for you is great.

Though personally I do feel like much of the managed services have not evolved/changed/improved since their inception many years ago. There is definitely some opportunity here to innovate, though that's probably not actually coupled with running it in K8s itself.

I don't think anyone would argue that RDS isn't vastly simpler. If it weren't, there'd be no reason to pay such a premium for it.

btw. the zalando operator is rougher, but still pretty easy to use. the crunchy operator does not work in every environment but is extremely simple (btw. the crunchy operator uses the building blocks of zalando). I've used the zalando operator since k8s 1.4: no data loss, everything just works. ok, major upgrades are rough, but they are rough even without the zalando operator.

"But it has to run in Kubernetes!"

Please don't. Just because it's possible doesn't mean it's a good idea. The PDF itself clearly shows how it can get complex quickly. The great majority of people won't ever be able to do this properly, securely and with decent reliability. Of course, I may have to swallow my words in the future if a job requires it, but unless you REALLY REALLY REALLY need PostgreSQL inside Kubernetes, IMHO you should just stick with private RDS or Cloud SQL and point your Kubernetes workloads at it inside your VPCs, all peered etc. Your SREs' mental health, your managers and your company's costs will thank you.

I've done MySQL RDS, and I've seen k8s database setups. (But not w/ PG.)

RDS is okay, but I would not dismiss the maintenance work required; RDS puts you at the mercy of AWS when things go wrong. We had a fair bit of trouble with failovers taking 10x+ longer than they should. We also set up encryption, and that was also a PITA: we'd consistently get nodes with incorrect subjectAltNames. (Also, at the time, the certs were either for a short key or signed by a short key, I forget which. It was not acceptable at that time, either; this was only 1-2 years ago, and I'm guessing hasn't been fixed.) Getting AWS to actually investigate, instead of replying "have you tried upgrading" (and there's always an upgrade, it felt like), was its own struggle. RDS MySQL's (maybe Aurora? I don't recall fully) first implementation of spatial indexes was flat-out broken, and that was another lengthy support ticket. The point is that bugs will happen, and cloud platform support channels are terrible at getting an engineer in contact with an engineer who can actually do something about the problem.

I'm a DBA, so I'll do a deep dive into your comment:

- RDS is awesome overall - there's no maintenance work required on your part. If you think encryption is a problem, don't use it until later. Since RDS is a managed service, I just tell compliance auditors, "It's a managed service."

- Aurora had issues (um, alpha quality) for the first year or so. So don't use new databases in the first 5 years for production, as recommended.

> there's no maintenance work required on your part.

The post of mine you are replying to outlines maintenance work we had to do on an actual RDS instance. My point is that you shouldn't weigh managed solutions as maintenance-free: they're not (and examples of why and how they are not are in the first post). They might win out, and they do have a place, but if you're evaluating them as "hassle-free", you will be disappointed.

> If you think encryption is a problem, don't use it until later.

We had compliance requirements that required encryption, so waiting until later was not an option.

> Since RDS is a managed service, I just tell compliance auditors, "It's a managed service."

I'm not a big fan of giving compliance auditors half-truths that mislead them into thinking we're doing something we're not.

> So don't use new databases in the first 5 years for production, as recommended.

You mean we should run our own? (/s… slightly.) We were exploring Aurora as the performance of normal RDS was not sufficient. Now, there was plenty we could have done better in other areas, particularly in the database schema department, but Aurora was thought to be the most pragmatic option.

I bet your life is tough, being as dumb as a box of rocks.

The above answers are from somebody who doesn't know what they're talking about, either about database administration or compliance.

Ok, that's enough. Given that https://news.ycombinator.com/item?id=23670678 was just a couple days ago, we've banned this account.

I have no compliance requirements and I use encryption even when the database is on the same node as the application just to eliminate that excuse. There is no need to justify encryption. Just use it. Setting it up won't take more than an hour.

In my personal opinion, there are three database types.

'Small' databases are the first, and are easy to dump into kubernetes. Any DB with a total storage requirement of 100GB or less (if I lick my finger and try to measure the wind), really, can be easily containerized and dumped into kubernetes, and you will be a happy camper because it makes prod / dev testing easy, and you don't really need to think too much here.

'Large' databases are too big to seriously put into a container. You will run into storage and networking limits for cloud providers. Good luck transferring all that data off bare metal! Your tables will more than likely need to be sharded to even start thinking about gaining any benefit from kubernetes. By my own rubric, my team runs a "large" MySQL database with large sets of archived data that uses more storage than managed cloud SQL solutions can provide. It would take us months to re-design to take advantage of the MySQL clustering mechanisms, along with climbing the learning curve that comes with it.

'Massive' databases need to be planned and designed from "the ground up" to live in multiple regions, and leverage respective clustering technologies. Your tables are sharded, replicated and backed up, and you are running in different DCs attempting to serve edge traffic. Kubernetes wins here as well, but, as the OP suggests, not without high effort. K8s gives you the scaling and operational interface to manage hundreds of database nodes.

It seems weird to me that Vitess and the OP belabour their monitoring, pooling, and backup story, when I think the #1 reason you reach for an orchestrator in these problem spaces is scaling.

All that being said, my main point here is that orchestration technologies are tools, and picking the right one is hard, but can be important :) Databases can go into k8s! Make it easy on yourself and choose the right databases to put there.

So, a bit OT, but I'm looking for some advice on building a Postgres cluster, and I'm pretty sure k8s is going to add a lot of complexity with no benefit.

I'm a Postgres fan, and use it a lot, but I've never actually used it in a clustered setup.

What I'm looking at clustering for is not really scalability (we're still at the stage where we can scale vertically), but high availability and backup - if one node is down for an update, or crashes, the other node can take over, and I'd also ideally like point-in-time restore.

There seems to be a plethora of OSS projects claiming to help with this, so it looks like there isn't "one true way" - I'd love to hear how people are actually setting up their Postgres clusters in practice.

Compared to many databases, postgres HA is a mess. It has built-in streaming replication, but no failover of any kind; all of that has to be managed by another application.

We've had the best luck with Patroni, but even then you'll find the documentation confusing, have weird issues, etc. You'll need to set up etcd/Consul to use it. That's right, you need a second database cluster to set up your database cluster... Great...

I have no clue how such a community favorite database has no clear solution to basic HA.

Very true. My sentiments exactly as Spilo/Patroni users. One benefit to k8s is you can use it as the DCS for Patroni
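
For anyone wondering what the DCS is actually for: the general idea is a leader key with a time-to-live that the current primary must keep renewing. The toy model below keeps that key in memory, and the class and node names are invented for illustration. It is a sketch of the concept, not Patroni's implementation or the etcd/Consul/Kubernetes API.

```python
class LeaseStore:
    """Toy stand-in for a DCS key that expires unless it is renewed."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self.holder = None
        self.expires_at = 0.0

    def try_acquire(self, node: str, now: float) -> bool:
        """Take or renew the lease; True means `node` is leader afterwards."""
        expired = now >= self.expires_at
        if self.holder is None or expired or self.holder == node:
            self.holder = node
            self.expires_at = now + self.ttl
            return True
        return False  # someone else holds a lease that is still valid


store = LeaseStore(ttl_seconds=30)
print(store.try_acquire("pg-0", now=0))   # pg-0 becomes leader
print(store.try_acquire("pg-1", now=10))  # refused, pg-0's lease is valid
print(store.try_acquire("pg-0", now=20))  # pg-0 renews until t=50
# pg-0 crashes here and stops renewing.
print(store.try_acquire("pg-1", now=49))  # still refused
print(store.try_acquire("pg-1", now=50))  # lease expired, pg-1 takes over
```

The time between a crash and a takeover is bounded by the TTL, which is the trade-off between fast failover and false positives during a brief network blip.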

Patroni might be interesting: https://github.com/zalando/patroni

The main advantage with Kubernetes (especially in low-ops environments like GKE) is not scalability, but availability and ease of development (spinning things up and down is super-easy). The learning curve to stand something up is not very high and pays off over time compared to SSH-ing into VMs.

I'm very comfortable with containers (less so specifically with k8s), but generally for stateless or stateless'ish services. What are the advantages of k8s specifically for a database?

High availability, deployment consistency and, if needed, the ability to scale out on demand, to name a few. You can have an all-inclusive environment for sandboxing, development or production use-cases. Easy to spin up and tear down.

A good alternative is to use a database service and not have to worry about it at all. However if you do have to operate one, orchestration systems give you the tools to handle it properly WRT your availability and performance SLA/SLOs.

Kubernetes can’t change any database’s HA or durability features; there’s no magic k8s can apply to make a database that does e.g. asynchronous replication have the properties of one that does synchronous replication. So you’ll never gain any properties your underlying database is incapable of providing.

However, if I had to run Postgres as part of something I deployed on k8s AND for some reason couldn’t use my cloud provider’s built in solution (AWS RDS, Cloud SQL, etc.) I would probably go with using/writing a k8s operator. The big advantage of this route is that it gives you good framework for coordinating the operational changes you need to be able to handle to actually have failover and self-healing from a Postgres cluster, in a self-contained and testable part of your infrastructure.

When setting up a few Postgres nodes with your chosen HA configuration you’ll quickly run into a few problems you have to solve:

* I lose connectivity to an instance. Is it ever coming back? How do I signal that it’s dead and buried to the system so it knows to spin up a fresh replica in the cases where this cannot be automatically detected?

* How do I safely follow the process I need to when upgrading a component (Postgres, HAProxy, PGBouncer, etc.)? How do I test this procedure, in particular the not-so-happy paths (e.g. where a node decides to die while upgrading).

* How do I make sure whatever daemon that watches to figure out if I need to make some state change to the cluster (due to a failure or requested state change) can both be deployed in a HA manner AND doesn’t have to contend with multiple instances of itself issuing conflicting commands?

* How do I verify that my application can actually handle failover in the way that I expect? If I test this manually, how confident am I that it will continue to handle it gracefully when I next need it?

A k8s operator is a nice way to crystallize these kinds of state management issues on top of a consistent and easily observable state store (namely the k8s API’s etcd instance). Operators also give you a great way to run continuous integration tests in which you throw the situations you’re trying to prepare for at the failover logic (and at your application code), which gives you some confidence that your HA setup deserves the name.
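To make the "multiple instances issuing conflicting commands" bullet concrete, here is a minimal sketch of lease-based leader election. It assumes an in-memory `LeaseStore` as a stand-in for a consistent store such as etcd; the class, its methods, and the controller names are hypothetical and not the API of any real operator framework.

```python
import threading
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Lease:
    holder: str        # identity of the controller that currently leads
    expires_at: float  # time after which the lease may be taken over
    version: int       # bumped on every successful acquire or renew


class LeaseStore:
    """In-memory stand-in for a consistent store (hypothetical, for illustration)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._lease: Optional[Lease] = None

    def try_acquire(self, candidate: str, now: float, ttl: float) -> bool:
        """Atomically take or renew the lease; True means `candidate` may act."""
        with self._lock:
            current = self._lease
            free = current is None or current.expires_at <= now
            if free or current.holder == candidate:
                version = 1 if current is None else current.version + 1
                self._lease = Lease(candidate, now + ttl, version)
                return True
            return False

    def leader(self, now: float) -> Optional[str]:
        """Return the holder of an unexpired lease, or None if nobody leads."""
        with self._lock:
            if self._lease is not None and self._lease.expires_at > now:
                return self._lease.holder
            return None
```

Only the instance for which `try_acquire` returns True issues commands to the cluster. A second instance started while the lease is live gets False and stays passive until the lease expires without renewal.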

But again, I wouldn’t bite this off if you can use a managed service for the database. Pay someone else to handle that part, and focus on making your app actually not shit the bed if a failover of Postgres happens. The vast majority of applications I’ve worked on that were pointed at a HA instance would have (and in some cases did) broken down during a failover due to things like expecting durability but using asynchronous replication. You don’t get points for “one of the two things that needed to work to have let us avoid that incident worked”.
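The durability mismatch can be shown with a toy model. This is only a sketch: `Node`, `commit`, `replicate_pending` and `lost_after_failover` are made-up names, and real Postgres replication is far more involved than appending to a list.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    log: List[str] = field(default_factory=list)


def commit(primary: Node, replica: Node, record: str, synchronous: bool) -> str:
    """Write a record and return it as the acknowledgement sent to the client."""
    primary.log.append(record)
    if synchronous:
        # Synchronous mode: the ack is only sent once the replica has the record.
        replica.log.append(record)
    return record


def replicate_pending(primary: Node, replica: Node) -> None:
    """Asynchronous catch-up: ship whatever the replica has not seen yet."""
    replica.log.extend(primary.log[len(replica.log):])


def lost_after_failover(acked: List[str], replica: Node) -> List[str]:
    """Acknowledged records that are missing once the replica is promoted."""
    return [record for record in acked if record not in replica.log]
```

In asynchronous mode, any record acknowledged after the last `replicate_pending` call shows up in `lost_after_failover` if the primary dies at that moment. In synchronous mode the list stays empty, at the price of waiting for the replica on every commit.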

> AND for some reason couldn’t use my cloud provider’s built in solution (AWS RDS, Cloud SQL, etc.)


> Pay someone else to handle that part

Aside from the money (managed Postgres is expensive), I'd actually like to understand what good, high-availability Postgres solutions look like.

Totally valid if you’re reaching the larger instance sizes. However I’d caution you to not underestimate what running a good HA Postgres setup costs in engineering/operations time (particularly if you’re not familiar with running one). Be ready to get a clearheaded TCO that you won’t be happy with.

In general, you need some sort of separate strictly serializable store run in an HA configuration to manage the state changes, and whatever is managing the state changes needs to be run in multiple instances across your fault domains as well. Others have mentioned Patroni; I’ve used it before (though with etcd, not k8s) and been quite happy with it. Be aware that (as it cautions you in the README) it’s not a total point-and-shoot tool: you do need to read through its caveats and understand the underlying Postgres replication features.

The documentation is pretty good, if you want to get an idea of what the logic looks like they have a nice state diagram: https://github.com/zalando/patroni/blob/master/docs/ha_loop_...

Google Cloud blog, gently dissuading you from running a traditional DB in K8s: https://cloud.google.com/blog/products/databases/to-run-or-n...

K8s docs explaining how to run MySQL: https://kubernetes.io/docs/tasks/run-application/run-replica...

You could also run it with Nomad, and skip a few layers of complexity: https://learn.hashicorp.com/nomad/stateful-workloads/host-vo... / https://mysqlrelease.com/2017/12/hashicorp-nomad-and-app-dep...

One of the big problems of K8s is it's a monolith. It's designed for a very specific kind of org to run microservices. Anything else and you're looking at an uphill battle to try to shim something into it.

You can also skip all the automatic scheduling fanciness and just build system images with Packer, and deploy them however you like. If you're on a cloud provider, you can choose how many instances of what kind (manager, read-replica) you deploy, using the storage of your choice, networking of choice, etc. Then later you can add cluster scheduling and other features as needed. This gradual approach to DevOps allows you to get something up and running using best practices, but without immediately incurring the significant maintenance, integration, and performance/availability costs of a full-fledged K8s.

> One of the big problems of K8s is it's a monolith

while I pretty much agree with everything else you mention, I think it's kind of the opposite; since k8s is fundamentally an API, it's very modular and extensible and this is why it's being successful (I agree it wants you to do things its way and things like databases need to be shimmed at the moment, so the conclusion is similar... for now)

It looks like microservices from the high level. But then you dig into each component, and realize that other than using APIs, they still require all the other components, use a shared storage layer, sometimes use non-standard protocols. The kube-controller-manager alone is literally a tiny monolith: a single binary with 5 different controllers in it. K8s operates like a monolith because you mostly can't just remove one layer and have it keep running.

Compare that to HashiCorp's tools. You can run a K8s-like system composed mostly of Hashi's tools, but you can also run each of those tools as a single complete unit of just itself. Now, each of those tools is actually multiple components in one, like mini-monoliths. But in operation, they can work together or as independent services. The practical result is truly stand-alone yet reusable and interoperable components. That's the kind of DevOps methodology I like: buy a wheelbarrow today and hitch it to an ATV next week, rather than going out and buying a combine harvester before you need it.

Yes, these are very good points, all true, thanks.

I still think that while this monolith has its drawbacks, the fact that any component can be substituted as long as it conforms to the official API is really powerful. For example k3s uses sqlite instead of etcd.

Having small components that do one thing well (Unix philosophy) is certainly one way to go (I still haven't found somebody who doesn't love Hashicorp tooling, myself included) but the k8s idea of having one (big, possibly bloated for many cases) "standard" way of doing things while being customizable/extensible is really powerful. If Hashi came up with some (extendible) glue/package tooling then a lot of people doing/looking at k8s right now will seriously look at them (myself included).

I much prefer just using RDS Aurora. Far fewer headaches. If I don't need low latency I'd use RDS Aurora no matter which cloud I'm hosted on. Otherwise I'll use hosted SQL.

The reason I mention this is that Kubernetes requires a lot of management to run so the best solution is to use GKE or things like that. If you're using managed k8s, there's little reason to not use managed SQL.

The advantages of k8s are not that valuable for a SQL server cluster. You don't even really get colocation of data because you're realistically going to use a GCE Persistent Disk or EBS volume and those are network attached anyway.

For the MySQL folks, see Vitess as an example of how to run MySQL on Kubernetes: https://vitess.io

There are also MySQL operators from Oracle, Presslabs, and Percona. Vitess is much more than just MySQL in k8s, and not everyone will be able to switch to it easily (if at all).

To all the commenters in this thread.

If kubernetes cannot run a database then what good is it? (And I suppose the same issues pop up for things like a persistent queue or a full text indexer.)

The end goal of Kubernetes is to be able to create and recreate environments and scale them up and down at will, all based on a declarative configuration. But if you take databases out of it, then you are not really achieving that goal and are just left with the flipside of kubernetes: a really complex setup and a piece of technology that is very hard to master.

> The end goal of Kubernetes is to be able to create and recreate environments and scale them up and down at will, all based on a declarative configuration.

PG already has its own clustering solution to scale up and down, which is orthogonal to Kubernetes. So running PG in Kubernetes does not add anything. Also, you are much more likely to mess them up when trying to mix two orthogonal technologies.

And the DB is not meant to be created and recreated often unless you want to purge the data. So my take is this: Kubernetes is for managing and configuring microservices, and DBs are not microservices.

Some say that running stateful applications on K8S is not a good idea anyways, and K8S is best used for stateless applications. Sure you can connect to a stateful DB but the app itself is stateless.

Postgres + stolon + k8 is the easiest time I've ever had bootstrapping a DB for high availability. I'm not sure I'd use it for extremely high throughput apps, but for smallish datasets that NEED to be online, it was amazing. The biggest reason it's amazing? The dev, staging, and prod environments look exactly the same from a coder's perspective, and bringing a fresh one up is always a single command, because that's just how you work in kube-land.

ooh! I've been running the Zalando operator in production on Azure for ~ a year now, nothing crazy but a couple thousand qps and a tb of data spread across several clusters. It's been a little rough since it was designed for AWS, but pretty fun. At this point, I'm 50/50: our team is small and I'm not sure that the extra complexity added by k8s solved any problems that Azure's managed postgres product doesn't also solve. We also weren't sure we were going to stay on Azure at the time we made the decision -- if I was running in a hybrid cloud environment I would 100% choose postgres on k8s.

The operator let us ramp up real quickly with postgres as a POC and gave us mature clustering and point-in-time restoration, and the value is 100% there for dev/test/uat instances, but depending on our team growth it might be worth it to switch to managed for some subset of those clusters once "Logical Decoding" goes GA on the azure side. Their hyperscale option looks pretty fun as well, hopefully some day i'll have that much data to play with.

I can also say that the Zalando crew has been crazy responsive on their github, it's an extremely well managed open source project!

I have been running my own postgres helm chart with read replication and pgpool2 for three years and never had major trouble. If you're interested check out https://github.com/sspies8684/helm-repo

Thanks for sharing!

Curious after quick look: how come the primary container (pg upstream image) has a volume for the replica which doesn't seem used and the replica (custom image which wipes $PGDATA at start) has no host volume (hence data is in the container)? (very possible I've missed something)

Looks interesting but difficult to get the details from just the slides.

Also, not sure why Azure Arc still gets mentioned. I would have expected something more cloud independent.

Our approach, for now, is to use Kubernetes Postgres for dev, test, and even stage, but cloud Postgres for prod. We have one db.yaml that in production just becomes an endpoint, so that the services do not even have to know whether it is an internal or external Postgres.
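The db.yaml itself isn't shown here, so the following is only a guess at the shape of that trick: a Kubernetes Service whose name stays the same while its spec switches between an in-cluster selector and an `ExternalName` alias. The function name, the `app` label and the host name in the usage below are assumptions for illustration.

```python
from typing import Any, Dict, Optional


def db_service(name: str, external_host: Optional[str] = None,
               port: int = 5432) -> Dict[str, Any]:
    """Build a Service manifest (as a dict) that hides where Postgres runs."""
    manifest: Dict[str, Any] = {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name},
    }
    if external_host is not None:
        # Production: a DNS alias pointing at the managed database.
        manifest["spec"] = {"type": "ExternalName", "externalName": external_host}
    else:
        # Dev/test/stage: route to in-cluster Postgres pods.
        # The `app` label is a hypothetical convention, not a requirement.
        manifest["spec"] = {
            "selector": {"app": name},
            "ports": [{"port": port, "targetPort": port}],
        }
    return manifest
```

Calling `db_service("orders-db")` yields the in-cluster variant and `db_service("orders-db", external_host="orders.example.internal")` the production one. Applications connect to the same service name in both cases.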

Another interesting use of Kubernetes Postgres would be for some transient but bigger than memory store that needs to be queryable for a certain amount of time. It's probably a very niche use-case, but the deployment could be dramatically more straightforward since HA is not performance bound.

Why? So you pay more money to AWS? Deploying databases is a solved problem. What's the point of the overhead?

What's the use-case for running databases in k8s, is this a widely accepted best practice?

I guess I look at it the opposite way - why wouldn't you run everything in k8s once you have made the basic investment in it? It lets you spin up new environments, vertical scaling becomes trivial, and disaster recovery/business continuity is automatic along with everything else in your k8s environment.

I don't think it's a widely accepted best practice yet, mainly because it's hard to do well, and by itself it's hard to take advantage of the benefits of using k8s. The company I work for has been building out the tools required to run databases well in k8s (fully automated, fully managed, survivable, and scalable) and we are seeing people come around to it. Once you have all the tools in place you can have a system that scales right alongside your applications on heterogeneous hardware, isn't dependent on any single server, can be deployed and managed exactly like your applications, and can be transported everywhere. If you want to take a look check out planetscale.com

If you are running Kubernetes, happen to be a fairly large organization and use microservices, you probably have many databases. Hundreds of them. Most of them are going to be small, using few resources of any kind.

In that context running postgres on K8S makes a lot of sense. You already have K8S and experience running it. Running postgres there makes it possible to share resources between databases and other applications. That improves utilization which means you can reduce costs significantly.

Another advantage is that unlike managed solutions such as RDS, you can use a more recent postgres version and postgres extensions that RDS doesn't support. Extensions such as PgQ or TimescaleDB or ...

Having said all of this. In a large organization, you have the benefit of economies of scale. Large fixed costs (such as developing the expertise required to run (postgres on) K8S reliably) can be amortized. It's possible for this to be a good idea, even a best practice for large organizations while at the same time a terrible idea for smaller ones. Most of the time, using a managed service like RDS is probably a better choice. In other words: You are not Google. You are not Facebook. You are probably not even Zalando. Figure out what's right for you.

Losing your data.

JK, sort of.

My first go to is something like RDS, but I’ve run Postgres in k8s for pretty much one use case: everything else is already in k8s _and_ I need a PG extension/functionality not present in RDS.

Conway's Law. The hardware team deals with the lower parts of the stack: hardware, OS, up to Kubernetes. The applications team(s) deal exclusively with Kubernetes payloads.

I remember running mongodb one socket had so many gotchas and stuff that it wasn’t worth it.

I think cockroachDB is designed for this.

They've thought about the use case. But it still ends up being a cluster inside a cluster, which sounds potentially pretty bad to me. Clusters of different types, mostly unaware of each other. Schema changes and database version upgrades would be complicated.

There certainly are pain points. I don't work on this myself, but one of our other engineers wrote this blog post [0] that discusses the experience of running CockroachDB in k8s and why we chose to use it for our hosted cloud product. Another complication mentioned in there is about how to deal with the multi-region case.

[0] https://www.cockroachlabs.com/blog/managed-cockroachdb-on-ku...

Instead of messing around with Kubernetes, I would rather advocate for something like Amazon RDS.

I would agree that the complexity is compounded, having gone through the work to automate various operators in kubernetes and the requisite deploy projects for the actual app/service (database) clusters, etc.

The problem is often that the actual costs of maintaining solutions like this aren't always clear and easy to budget for, or perhaps more importantly, to explain to management. This includes the continued costs of engineering time to architect H/A solutions, maintain them, research solutions, etc. Add to this the abstraction and compounding of complexity and the plethora of hand-waving blogs, etc.

IMHO, the real problems arise when you deploy PostgreSQL (via a Kubernetes operator) into an existing multi-AZ cloud-based kubernetes cluster without knowing and understanding all of the requisite requirements and restrictions. At the time when I was working on deploying postgres clusters with the operator (mid 2019 AIR) there was not a lot (not much at all) in the strimzi kafka operator docs about handling multi-AZs in kube with the Kubernetes autoscaler and using cluster ASGs, etc. Note that persistentvolumes and persistentvolumeclaims in the cloud cannot span multiple AZs--this is a critical concern, especially when you throw in Kubernetes and an ASG (autoscaling group). What this means is that if you have some app/service running in a specific AZ that has persistentvolumes and claims in that AZ, you must ensure that the app/service stays in that AZ and that all of its requisite storage resources also remain available in that AZ. The complexity that is required to manage this is not trivial for most teams. E.g. some helm charts that I installed (after `helm templating` in our IAC code) configured node labels on the existing kube cluster's worker nodes--but note that this was not documented in the helm chart BTW. So, when we later did a routine upgrade of the Kubernetes version and the ASGs spawned new worker nodes, that left those apps/services effectively hard-coded to use nodes that had been terminated by the ASG (as they were older versions that were replaced by the newer versions during the upgrade), and their PVs were in a specific AZ, as noted above.
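The AZ constraint boils down to a simple check, sketched below. `topology.kubernetes.io/zone` is the well-known zone label on nodes; the function name, node names and zone values are made up for illustration, and a real scheduler takes many more factors into account.

```python
from typing import Dict, List

# Well-known Kubernetes node label that carries the availability zone.
ZONE_LABEL = "topology.kubernetes.io/zone"


def schedulable_nodes(volume_zone: str,
                      nodes: Dict[str, Dict[str, str]]) -> List[str]:
    """Return the names of nodes that sit in the same zone as the volume."""
    return sorted(
        name for name, labels in nodes.items()
        if labels.get(ZONE_LABEL) == volume_zone
    )
```

If an upgrade replaces all workers and none of the new ones lands in the volume's zone (or the new nodes lack the expected labels), the result is an empty list: the pod has nowhere to run that can still reach its data.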

To do it right, I think you'd need to define AZ-specific storage classes and then make sure you manage those whenever you deploy apps/services into kubernetes. Again, from my past experience, when you have Kubernetes in the cloud, with the kubernetes autoscaler and cloud-based ASGs (autoscaling groups), running in an H/A (high-availability, i.e. multi-AZ) setup, and you add in stateful requirements using PVs, and then add in very resource-intensive apps and services, this starts to get a bit tricky to maintain--again--despite what the "experts" might be blogging about. Keep in mind that the companies sponsoring the experts might have teams of 10-15 DevOps Kubernetes engineers managing a cluster. This is something we definitely don't have.

I'm sure it will get better with time, but for now, we are doing all we can to maintain stateful apps/services externally--i.e.: and per your initial post, this would be PostgreSQL with RDS. IMHO, RDS does a fantastic job and allows us to abstract all of this, and we simply deploy our clusters with IAC and forget about them to some degree. For the cost point and specifically regarding resource contention, I think it's an ideal ROI to have the cloud provider worry about failover, H/A database internals, scaling with multi-AZ storage, etc.
