Hacker News new | past | comments | ask | show | jobs | submit login
Deploying PostgreSQL Clusters Using StatefulSets (kubernetes.io)
164 points by TheIronYuppie on Feb 26, 2017 | hide | past | favorite | 65 comments

If you are interested in deployments of databases on top of cluster orchestrators, here are some interesting projects:

* Postgres automation for Kubernetes deployments https://github.com/sorintlab/stolon

* Automation for operating the Etcd cluster:https://github.com/coreos/etcd-operator

* Kubernetes-native deployment of Ceph: https://rook.io/

There are a number of projects and storage solutions that you can use depending on what is important. Using persistent volumes and external storage like EBS is one way. Using distributed file-systems with local or network attached storage like Portworx or Ceph is another. I’m not sure using NFS for the post in this example demonstrates what you would be doing in the real world however.

Thanks for the links. Do you know about a project that Targets PostgreSQL as a operator, like etcd?

Yes, we are currently developing a PostgreSQL operator for Kubernetes. We are heavy PostgreSQL users. Spilo[0] supports all kinds of slave configurations, backups, monitoring tools[1] and all advanced capabilities.

[0] https://github.com/zalando/spilo [1] https://github.com/zalando/PGObserver

Edit: https://github.com/zalando/patroni is a better link

Thank you! And Zalando as a bit company behind it is even better.

Are you going to use this as is in production, or do you have an internal version it?

We don't have a separate internal version. We are already running some hundreds PostgreSQL clusters in AWS + etcd with Spilo/Patroni. The Kubernetes variant is not used yet in production but we plan to migrate the first datasets soonish (probably some weeks).

This is great to hear, will definitely try it.

This one: https://github.com/sorintlab/stolon

It is not called the operator, but in many ways it is similar to what Etcd Operator does

If your pattern requires you to connect to etcd, then you're not taking the best advantage of Kubernetes.

Yes, you can connect to an etcd instance outside of Kubernetes, or deploy etcd on Kubernetes. CoreOS developed the etcd operator for the purposes of self-hosting the system KV store, but you can of course run a "user" etcd.

A better pattern is relying on the objects provided by Kubernetes, which are themselves backed by a lock service (which happens to be etcd, but that's an implementation detail you shouldn't worry about.)

You could, for example, use an Annotation on an object to do leader election. Look at how HA masters are implemented on Kubernetes, for example:


That's a great point. But doesn't this make it mandatory for the applications to use Go? It would have been more useful if there was a REST interface for leader election.

Not at all - it's just a language for which there is a Kubernetes client. The API is 100% REST. We have a Python library underway also: https://github.com/kubernetes-incubator/client-python

Right, but I couldn't find a REST API for leader election. If my app is in Go, I can use the leaderelection package here https://github.com/kubernetes/kubernetes/blob/master/pkg/cli.... But if my app is in Python, I won't be able to use that package. In that case I'd need to use etcd or zookeeper on top of k8s. Am I missing something?

You can reimplement that logic in Python. There's nothing magic about it: you can see in https://github.com/kubernetes/kubernetes/blob/master/pkg/cli... that it just uses a Kubernetes annotation.

There's some more information (and a sidecar pod prebaked) at http://blog.kubernetes.io/2016/01/simple-leader-election-wit...

Yes, one can reimplement it in Python, but don't forget:

// This implementation does not guarantee that only one client is acting as a // leader (a.k.a. fencing). A client observes timestamps captured locally to // infer the state of the leader election. Thus the implementation is tolerant // to arbitrary clock skew, but is not tolerant to arbitrary clock skew rate.

makes sense, this was written before TPR though, so there is a room for improvement.

Here's a quick question - isn't StatefulSets and PersistentVolumeClaim the exact opposite of how fault tolerant database systems are generally built ?

The kubernetes StatefulSets project is actually trying to build a usecase for Postgresql-on-s3. I think its the wrong abstraction to go after in most usecases for database systems including things like ElasticSearch. I dont think that even Aurora is trying to solve this using persistent storage.

You dont really care about solving underlying persistence... you make the whole system fault tolerant : which includes high availability, failover, etc.

IMHO kubernetes still has problems in orchestrating an ElasticSearch/Zookeeper cluster for high availability. The problem of discovery, load balancing, etc are bottlenecks.

> Here's a quick question - isn't StatefulSets and PersistentVolumeClaim the exact opposite of how fault tolerant database systems are generally built ?

In some cases it is perfectly valid to rely on underlying storage to get HA. E.g. RDS relies on EBS https://aws.amazon.com/message/65648/

> IMHO kubernetes still has problems in orchestrating an ElasticSearch/Zookeeper cluster for high availability

I don't think that is the case any more. It is not trivial to build a proper automation on top of K8s, but possible, here is an example of a complex distributed DB deployed on top of Kubernetes:


There are many challenges, but they are not specific to Kubernetes, rather than just automating the deployment of such complex systems is generally hard.

As your link illustrates, Multi-AZ is the only real HA option for RDS PostgreSQL, and it doesn't share storage between nodes. So even though RDS uses EBS, you can't really say it is the underlying storage layer that provides RDS's high availability.

Thanks for pointing out the Operator framework. Please correct me if I am wrong, but Operator is not a Kubernetes thing - its a custom functionality built by coreos. It is undoubtedly awesome, but it is very hard to get it working on non coreos distro.

I personally believe that orchestration primitives like operators, linkerd, ingress, etc are what the k8s project should focus on. Im actually unsure if StatefulSets are solving production problems for anybody.

IMHO they are a fairly Googley concept and hard to apply outside of Google for anybody (without solving the networked filesystem problem first).

Small correction: Operators aren't a CoreOS-specific thing, and they're easy to install on any cluster.

Ultimately, the "Operator framework" is just a pattern: you represent some class of software (e.g. etcd, Elasticsearch, PostgreSQL) as a ThirdPartyResource, and you have some controller that monitors those TPRs, driving other resources toward the desired state.

sounds good - my bad then. I think they are awesome and not very well utilized.

It might be positioning - Operators vs Helm for example (even though the blog post tries to differentiate)

Think of Helm like a Linux package manager. You tell it you want etcd, and it makes it happen for you.

The way it makes it happen might be by using an operator, a Deployment/ReplicaSet, or a StatefulSet.

Think of operators as a "replica set that is aware of what it is replicating", and can do work to get things in the correct state.

Slightly OT, but, from a quick glance at the post, it looks like the postgres data is persisted on a network file system.

Does anybody have any experience with a big enough and this kind of configuration, and what is the performance compared with SSD local disks? Intuitively I wouldn't expect a networked disk to be a viable solution for postgres, but seeing it proposed like this makes me think that maybe my intuition was wrong...

People have run Postgres on network disks — Elastic Block Storage (EBS) volumes on AWS, or Persistent Disks on Google Cloud — for a long time.

They're implemented like a software SAN, with dedicated network paths, and throughput/latency is very good. Not as good as local, but still good enough for most apps.

On Google Cloud, you do have to option of using a local SSD [1], but it comes with a bunch of limitations. It maxes out at 375GB (though you can allocate up to 8 per VM). They're somewhat expensive. They're not durable; if you stop the VM, the disk is lost. They're classified as "scratch disks" in the UI, which is a good indicator of their intended purpose.

Edit: Using a networked file system, however, is a bad idea.

[1] https://cloud.google.com/compute/docs/disks/local-ssd

> People have run Postgres on network disks — Elastic Block Storage (EBS) volumes on AWS, or Persistent Disks on Google Cloud — for a long time.


The former is a shared file system accessible by multiple machines simultaneously.

The latter is a block device accessed over the network by a single machine at a time.

It's generally a bad idea to run a database on NFS as it'll be fine until it's not. Network block devices are fine though as it's just like a disk except it "plugs in" through your network jack rather than something like SATA.

The parent poster referred to both "networked disk" and "networked file system". I didn't catch the implied reference to NFS, so I assumed he was talking about the former, so that's what my comment is about.

I agree that NFS is a terrible idea for databases.

I err.. never thought I'd be talking about this seriously.

We (3 companies ago) used to use postgresql databases backed by a huge netapp filer.

Performance was as you would expect, horrible but tolderable. Why was it tolerable? Because high availability was more important than performance at the time and native HA wasn't in PostgreSQL yet. (8.1 had just come out)

The first DBA ever on staff was an absurdly knowledgable guy (and paid handsomely for being so knowledgeable) and his first and strongest recommendation was, in this order:

1) Upgrade to the latest PostgreSQL to get a better vacuum implementation.

2) Get the data off the filer.

I highly advise against it unless your application caches tremendously well.

Forget performance, I'd be more worried about file locking, file fsync, and directory fsync. All are critical for a database and an "optimized" NFS driver that lies about the latter two can lead to corruption or data loss.

Looking at the other comments here : how quickly people forget everything!

Those who don't learn the lessons of what's wrong trying to jam everything atop NFS are doomed to repeat them.

If you can maintain network quality (latency, throughput, packet loss) the tradeoffs of using a SAN/NAS solution can be well worth it. For example it makes moving services+storage between compute a trivial affair.

To (inexactly) translate SAN/NAS into AWS terms for those that may not be familiar with these types of on-prem deployments: S3/EFS/Glacier is NAS–files are stored and retrieved over the network. EBS is SAN–the storage is exposed as a block level device over the network and appear as locally attached devices.

Depends on your performance tolerance. Networked IO can be an order of magnitude faster or more than a local spinning disk, but sure, network latency will likely never top the performance of cutting edge SSD used locally.

As a case in point, I've pushed greater than 600mb/s of simultaneous heavy read and write traffic to an AWS EBS volume with acceptably low mean latency, which is more than enough for many Postgres use cases, but there's definitely cases where even that level IO would be a major bottleneck.

EBS isn't a network file system.

QQ - how do you deal with the situation where a node goes away?

GCP offers networked SSD, and performance is fairly good, but, as always, YMMV.

Disclosure: I work at Google on Kubernetes

I think the posters point is that the article talks about using NFS or other shared network filesystems, not a block storage protocol?

Assuming you mean a network block device rather than a network filesystem (be very, very vary about network filesystems that claim sufficient POSIX compliance to run a database; it may work, but test very carefully and see what people say about your specific setup as it's very easy to get it wrong and end up with corrupted data), it'll work quite well, but it will typicall be awfully performance wise unless you spend a fortune working around bottlenecks that won't be there in most local setups.

It's a performance vs. availability/reliability tradeoff, and it's not a straightforward one.

The post explicitly says "Network File System":

> The example in this blog uses NFS for the Persistent Volumes, but any shared file system would also work (ex: ceph, gluster).

I must admit I just skimmed over the initial parts of the post.

NFS is possible, but it's very easy to get the settings wrong, and it will depend on your specific NFS server whether or not you risk losing data in corner cases... I'd not consider NFS for a database for that reason without knowing precisely which NFS server and which version and testing that specific combination extensively (or verifying others have experience with exactly that combination).

It will be as slow as a block store, but most of the block stores offer up much simpler semantics that are much easier to verify if makes sense for a database workload (they need to store the blocks they claim to have stored, and that's pretty much what we care about; for the rest it's down to trusting a well tested local filesystem).

As an added note, the "any shared file system would also work" is patently false and horrible, dangerous advice, for a database deployment as it depends entirely on the file systems crash semantics in terms of what data is guaranteed to hit disk in a durable way, and whether or not there's any chance of writes being reordered etc.

If one has Ceph in place, of course, the wisest choice is to the Ceph block device service, which would be akin to using EBS or a SAN.

There would no doubt be some performance differences between local and networked but it is very normal in enterprises to deploy databases on a SAN to take advantage of the array's data services. On the NAS side I believe Oracle also are often deployed on NetApp arrays, they use their own NFS client to ensure the DB isn't compromised by bad NFS client implementations.

Generally you would get a network attached block storage disk (Google Cloud Persistent Disk, EBS volume, etc).

The blog writes from the perspective of an on-prem deployment and suggests you could use NFS to provide a PersistentVolume to Kubernetes. While this is possible, it's probably not best practice, especially if a failure in your NFS will render all your nodes offline at once.

Networked disks are fairly normal for production databases - fibre channel SANs are networked disks - but this is quite different to network file systems.

When running PostgreSQL in any cloud provider most of your data is served from network. I would say performance varies, but quite good. In comparison to magnetic disks much better. Most workloads you should be able to handle with network mounted storage.

Stateful Sets went to beta in Kubernetes 1.5 (in December) and we've been really excited to see so many applications get ported over. Please let me know if you have any questions!

Disclosure: I work at Google on Kubernetes.

Once this goes out of beta, would you say that running PostgreSQL on Kubernetes is now considered good practice?

I've been wary of running anything stateful on K8s for other reasons than the stable identity stuff. Historically, Docker's I/O has not been particularly robust — well, Docker has not been particularly robust. You also have to be very careful about resource limits and requests.

In particular, you have to ensure that the QoS class "Guaranteed" [1] kicks in, because the last thing you want is for Kubernetes to think that PostgreSQL is something that is happily rescheduled.

[1] https://github.com/kubernetes/community/blob/master/contribu...

I think I saw you on this thread, but I'd defer to Clayton Coleman of Red Hat/OpenShift in this post: https://news.ycombinator.com/item?id=13680244

tl;dr: do you want to be a DBA? do you want to be a cluster admin? If you want to be both you can run databases on Kubernetes. If you don't want to be a DBA, use a managed service.

Of course. But there's a difference between knowing that there are failure modes, and actually knowing them.

Non-containerized environments (VMs, bare metal) are well-understood and fairly predictable; Docker and Kubernetes are moving targets. There are "known unknowns".

As an example, I am involved in a project that uses continuous delivery for deploys, and we recently had an incident where the GKE node ran out of inodes because Docker was keeping all the layers of dead containers around. Kubernetes handles this correctly: Best-effort pods were evicted and the node went back to being ready once the inode use went down. PostgreSQL would have stayed up in this case, but it was still a little surprising that GKE, which is nominally managed, does not have a process in place for cleaning up Docker's garbage. (Edit: Apparently it's a bug [1].)

When discussing Docker and Kubernetes, there are very often edge cases that come up. For example, until a couple of months ago, Kubernetes had a serious bug that would often cause the wrong network volume to be mounted to a new pod. You can be the world's best containerized-database admin and still be surprised by things like this, which have nothing to do with the nature of containers, and everything to do with the specifics of the software that controls them.

[1] https://github.com/kubernetes/kubernetes/issues/41033

You sound like you might be first in line to use a rkt-powered GKE :)

I'm indeed hopeful that rkt is the way forward. There are too many competing, overlapping concerns between K8s and Docker.


"containerd is all we wanted from @docker in @kubernetesio and none of what we didn't need: kudos to the team!"

May I ask why you want to run pg inside kubernetes? I'm trying to wrap my head around why that's a thing that people want to do.

If you have to maintain Kubernetes _and_ classical VMs, you have two overlapping systems. I'd rather just use one, and get the flexibility it brings over classical VMs.

For example, I'd be able to upgrade by just starting another container and replicating into it, then routing traffic over. I'd retain the option of rolling back by routing traffic back if anything goes wrong. I'd be able to spin up/down redundant read-only replicas simply by changing the number of replicas. And other apps would be able to piggyback on Kubernetes' discovery system (typically you let service names resolve into IPs/SRV records through DNS). And so on.

You can do anything you want with classical VMs, of course, it's just more static and slower: Provisioning another VM, configuring it (Puppet, Salt, Chef, etc.), connecting persistent storage, reconfiguring everything to point to the right places, configuring DNS, etc.

In short: If you use Kubernetes as your orchestrator, why would you not want to use it for everything?

I totally get the attraction of learning one set of tooling, rather than k8s and ami/homebrew. Blue/green deploys are nice, particularly because we deploy for most PRs.

But it just seems like you're trying to shove a very stateful management problem (the schema, the associated storage) into a system designed for mostly stateless stuff. And that way lies a giant hassle.

Plus we're certainly not at huge scale, but db rollbacks are nontrivial, and we have to be somewhat careful about schema modifications.

Thanks for your comment!

But Kubernetes _is_ designed to support stateful containers; that's what StatefulSets are about.

There's nothing about containers that says they need to be stateless, it's just more convenient that way because you can treat every container as expendable, redundant, fungible commodities. But the real world needs state, and Kubernetes is capable of managing it. In particular, the pattern pretty much required for stateful apps on Kubernetes is that of using an exclusive-access network block device. Since only one container (or pod) can mount it at any given time, you can ensure that the state follows the container around.

I understand it can be done; it just seems like a hassle (viz this thread!) and I don't understand the upside. Discovery is an upside, but I bet there's a way to get discovery with manual updates.

It may just be our use case -- we have enough data (low TB) that cloning is not a fast thing to do.

Thanks again!

Sure, with Kubernetes you can create a "headless" service that points statically at some external service instead of a container.

This is unrelated to StatefulSets, but I'm going to take the opportunity to ask a Kubernetes engineer for help, since the the kubernetes-users Slack channel sort of feels like shouting into a void.

We deploy a small cluster (1 master, 6 nodes) at our startup that started misbehaving last week. All of a sudden three of the nodes went down - one became unresponsive and two had the error "container runtime is down." We couldn't ssh into the unresponsive one, but according to AWS the machine was fine, still receiving network requests and using CPU.

Since we couldn't diagnose the issue, we spun up an entirely new cluster using kops, but started seeing the exact same behavior later that night, and again over the weekend. Three nodes were in a not ready state, for the same reasons (unresponsive and container runtime is down). Right now our only solution to solve this issue is to manually terminate the EC2 instances and rely on the Auto-Scaling Group to create new ones. In the mean time, Kubernetes tells us that it can't schedule all of our desired pods, so half of our jobs aren't running, obviously an undesirable situation.

A handful of questions I have about the situation: Why are these nodes going down? What causes a node to go unresponsive? Why does the container runtime go down on a node and why doesn't it get restarted? Why doesn't Kubernetes destroy these nodes when they've been out of commission for 3-4 hours?

Any help would be appreciated!!! I've been looking through half a dozen log files and gotten zero answers.

So first, sorry about the problem. Please come hang out in the sig-aws or kops channels - we're a bit smaller and more focused than kubernetes-users, and can typically get these problems solved pretty quickly together.

IIRC we improved garbage collection settings in the latest kops (1.5.1), so if you were running out of disk, using the latest kops should fix everything. It's also easy to reconfigure to use a bigger root disk if you're churning through containers faster than GC can keep up. But if it's something else we can try to diagnose it as well!

> Why doesn't Kubernetes destroy these nodes when they've been out of commission for 3-4 hours?

We should, I believe. I actually thought we had an issue for this very problem, though I can't find it. I'll open a new one if I can't track it down. There is maybe an argument that we should fix the root cause, but there's an unlimited number of things that can go wrong, so we need to do both.

(edit: Gave up on finding the existing issue and opened https://github.com/kubernetes/kops/issues/2002 )

I ran into something very similar with a cluster almost identical to you. Turns out the default disk size for kops is 25G and when your masters run out of space things start to die with almost no way of telling why.

I rerolled with 100G and I've seen zero problems since.

> Why doesn't Kubernetes destroy these nodes when they've been out of commission for 3-4 hours?

Kubernetes isn't responsible for the lifecycle of its nodes. It can run in a DC where "destroying a node" might mean paging a tech to turn off a server. Something external - in your case, kops & your ASG - is responsible for the nodes that Kubernetes runs on. That's a deliberate design choice.

It should make a correct decision not to schedule work there, which it sounds like it did.

Given that, your other questions are hard to answer. kubelet is a process that runs on the nodes. So is docker. If you can't get into the machine to diagnose the fault, I'd encourage you to set up some monitoring/log shipping off the node so you can see what the state was when it failed.

There's nothing inherently "Kubernetes" about this diagnosis - it's more EC2, node/kernel/OS and Docker troubleshooting, in that order.

Correct, Kubernetes is not responsible for the nodes. I would build a health check into your Autoscale Group (I don't know exactly how to do this on AWS, but am happy to show you an example on GCP - aronchick (at) google).

If you can't get to the machine, there are a million reasons why this would be the case - but ssh is a totally separate process, it's way outside of Kubernetes. VERY commonly, you've run out of memory and processes are fighting among themselves (especially since EVERYTHING seems to be failing), but this is total speculation. OS issues are common too - I've spun up clusters switching from one distro to another, same config, and everything worked great.

Disclosure: I work at Google on Kubernetes.

Speaking of distros and considering your background. What would be the "best" Distro for running Kubernetes ?

If all the OS does is provide a minimal surface for running containers, I'd focus on whatever gives me the best security, manageability and updates.

The Container Optimised OS is what GKE uses on Google Cloud Platform https://cloud.google.com/container-optimized-os/docs/

It's conceptually very similar to CoreOS' Container Linux, so I might try that if I were looking at Kubernetes elsewhere and wanted a container-only OS.

If I am running an environment with multiple purposes - some container hosts, some regular machines - I'd err on the side of "who is my current vendor/what does my ops team support and know best".

Great thanks for the valuable infos. We are running SLES12 and also a Suse Openstack Cloud on bare metal and only recently Suse has announced their container strategy (SLE MicroOS Distro) but we haven't had time to evaluate it yet. At a recent DevConf I saw some interesting talks about immutable container hosts such Fedora Atomic. Seems that there is a lot of work done in this area.

There's also https://github.com/staples-sparx/Agrajag which I've been working on. It is an abstraction on top of repmgr and repmgrd. It relies on repmgrd for quorum and uses zookeeper as a distributed store for who the new master is. This way your apps/bouncer/HAProxy can read off of it. It does not use zk for leader election and relies on repmgrd for that. It's a move away from the normal repmgrd style of push notifications to a pull + broadcast mechanism. This allows many nodes to broadcast the same information (increasing the surface area of success) and fencing old masters.

repmgr and repmgrd are written by core contributors of postgres and this is just an orchestration layer (with built-in monitoring) on top of it. It's very much under development at the moment.

More details are here: https://github.com/staples-sparx/Agrajag/blob/dev/doc/tradeo...

There are plans to add more lines of defense in cases where repmgrd dies. I'll be speaking about it a bit this week at pgconf.in if anyone's interested: http://pgconf.in/schedule/a-postgres-orchestra/

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact