There are a number of projects and storage solutions that you can use depending on what is important. Using persistent volumes and external storage like EBS is one way. Using distributed file systems with local or network-attached storage like Portworx or Ceph is another. I'm not sure that using NFS, as this post does, demonstrates what you would be doing in the real world, however.
Yes, we are currently developing a PostgreSQL operator for Kubernetes. We are heavy PostgreSQL users. Spilo[0] supports all kinds of slave configurations, backups, monitoring tools[1], and other advanced capabilities.
We don't have a separate internal version. We are already running some hundreds of PostgreSQL clusters in AWS + etcd with Spilo/Patroni. The Kubernetes variant is not used in production yet, but we plan to migrate the first datasets soon (probably within a few weeks).
If your pattern requires you to connect to etcd, then you're not taking the best advantage of Kubernetes.
Yes, you can connect to an etcd instance outside of Kubernetes, or deploy etcd on Kubernetes. CoreOS developed the etcd operator for the purposes of self-hosting the system KV store, but you can of course run a "user" etcd.
A better pattern is relying on the objects provided by Kubernetes, which are themselves backed by a lock service (which happens to be etcd, but that's an implementation detail you shouldn't worry about).
You could, for example, use an Annotation on an object to do leader election. Look at how HA masters are implemented on Kubernetes, for example:
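To make the annotation idea concrete, here is a toy Python sketch of the pattern. The annotation key matches the one the HA-master code uses, but the "object" below is an in-memory stand-in that simulates the API server's resourceVersion compare-and-swap; a real client would GET and PUT an Endpoints or ConfigMap object instead.

```python
import json
import time

class FakeObject:
    """In-memory stand-in for a Kubernetes object's metadata."""
    def __init__(self):
        self.resource_version = 0
        self.annotations = {}

    def update(self, expected_version, annotations):
        # Mimics the API server's optimistic concurrency check: the
        # write is rejected if someone else updated the object first.
        if expected_version != self.resource_version:
            return False
        self.annotations = dict(annotations)
        self.resource_version += 1
        return True

LEASE_SECONDS = 15
ANNOTATION = "control-plane.alpha.kubernetes.io/leader"

def try_acquire(obj, identity, now=None):
    """Acquire or renew the leader lease stored in an annotation."""
    now = time.time() if now is None else now
    version = obj.resource_version
    record = json.loads(obj.annotations.get(ANNOTATION, "{}"))
    holder = record.get("holderIdentity")
    renew = record.get("renewTime", 0)
    # Take over only if the lease is free, expired, or already ours.
    if holder not in (None, identity) and now - renew < LEASE_SECONDS:
        return False
    new_record = {"holderIdentity": identity, "renewTime": now}
    return obj.update(version, {ANNOTATION: json.dumps(new_record)})
```

Two candidates racing for the same object illustrates the behavior: the first acquires the lease, the second fails until the lease expires, and a conflicting concurrent write loses the compare-and-swap.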
That's a great point. But doesn't this make it mandatory for the applications to use Go? It would have been more useful if there was a REST interface for leader election.
Right, but I couldn't find a REST API for leader election. If my app is in Go, I can use the leaderelection package here https://github.com/kubernetes/kubernetes/blob/master/pkg/cli.... But if my app is in Python, I won't be able to use that package. In that case I'd need to use etcd or zookeeper on top of k8s. Am I missing something?
Yes, one can reimplement it in Python, but don't forget:
// This implementation does not guarantee that only one client is acting as a
// leader (a.k.a. fencing). A client observes timestamps captured locally to
// infer the state of the leader election. Thus the implementation is tolerant
// to arbitrary clock skew, but is not tolerant to arbitrary clock skew rate.
Here's a quick question - aren't StatefulSets and PersistentVolumeClaims the exact opposite of how fault-tolerant database systems are generally built?
The Kubernetes StatefulSets project is actually trying to build a use case for PostgreSQL-on-S3. I think it's the wrong abstraction to go after in most use cases for database systems, including things like Elasticsearch. I don't think that even Aurora is trying to solve this using persistent storage.
You don't really care about solving underlying persistence... you make the whole system fault tolerant, which includes high availability, failover, etc.
IMHO kubernetes still has problems in orchestrating an ElasticSearch/Zookeeper cluster for high availability. The problems of discovery, load balancing, etc. are bottlenecks.
> Here's a quick question - aren't StatefulSets and PersistentVolumeClaims the exact opposite of how fault-tolerant database systems are generally built?
> IMHO kubernetes still has problems in orchestrating an ElasticSearch/Zookeeper cluster for high availability
I don't think that is the case any more. It is not trivial to build proper automation on top of K8s, but it is possible; here is an example of a complex distributed DB deployed on top of Kubernetes:
There are many challenges, but they are not specific to Kubernetes; automating the deployment of such complex systems is generally hard.
As your link illustrates, Multi-AZ is the only real HA option for RDS PostgreSQL, and it doesn't share storage between nodes. So even though RDS uses EBS, you can't really say it is the underlying storage layer that provides RDS's high availability.
Thanks for pointing out the Operator framework. Please correct me if I am wrong, but Operator is not a Kubernetes thing - it's custom functionality built by CoreOS. It is undoubtedly awesome, but it is very hard to get it working on a non-CoreOS distro.
I personally believe that orchestration primitives like operators, linkerd, ingress, etc. are what the k8s project should focus on. I'm actually unsure if StatefulSets are solving production problems for anybody.
IMHO they are a fairly Googley concept and hard for anybody to apply outside of Google (without solving the networked-filesystem problem first).
Small correction: Operators aren't a CoreOS-specific thing, and they're easy to install on any cluster.
Ultimately, the "Operator framework" is just a pattern: you represent some class of software (e.g. etcd, Elasticsearch, PostgreSQL) as a ThirdPartyResource, and you have some controller that monitors those TPRs, driving other resources toward the desired state.
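A minimal sketch of that controller pattern, in Python, with an in-memory stand-in for the watch/reconcile machinery. The resource shape and names here are purely illustrative; a real operator would watch the API server for changes to its ThirdPartyResources.

```python
def reconcile(desired, actual):
    """Return the actions needed to drive `actual` toward `desired`."""
    actions = []
    want = desired["spec"]["replicas"]
    have = len(actual["pods"])
    if have < want:
        # Not enough pods: schedule creations.
        actions += [("create-pod", desired["name"])] * (want - have)
    elif have > want:
        # Too many pods: schedule deletions of the surplus.
        actions += [("delete-pod", pod) for pod in actual["pods"][want:]]
    return actions

def apply(actual, actions):
    """Naively apply the actions to the in-memory 'cluster' state."""
    for verb, target in actions:
        if verb == "create-pod":
            actual["pods"].append(f"{target}-{len(actual['pods'])}")
        elif verb == "delete-pod":
            actual["pods"].remove(target)
    return actual
```

The controller loops forever doing exactly this: observe, diff, act, until `reconcile` returns no actions. The value of an operator is that the diffing logic can encode domain knowledge (e.g. "remove an etcd member before deleting its pod").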
Slightly OT, but, from a quick glance at the post, it looks like the postgres data is persisted on a network file system.
Does anybody have any experience with a big enough and this kind of configuration, and what is the performance compared with SSD local disks? Intuitively I wouldn't expect a networked disk to be a viable solution for postgres, but seeing it proposed like this makes me think that maybe my intuition was wrong...
People have run Postgres on network disks — Elastic Block Storage (EBS) volumes on AWS, or Persistent Disks on Google Cloud — for a long time.
They're implemented like a software SAN, with dedicated network paths, and throughput/latency is very good. Not as good as local, but still good enough for most apps.
On Google Cloud, you do have the option of using a local SSD [1], but it comes with a bunch of limitations. It maxes out at 375GB (though you can allocate up to 8 per VM). They're somewhat expensive. They're not durable: if you stop the VM, the disk is lost. They're classified as "scratch disks" in the UI, which is a good indicator of their intended purpose.
Edit: Using a networked file system, however, is a bad idea.
> People have run Postgres on network disks — Elastic Block Storage (EBS) volumes on AWS, or Persistent Disks on Google Cloud — for a long time.
NFS != SAN
The former is a shared file system accessible by multiple machines simultaneously.
The latter is a block device accessed over the network by a single machine at a time.
It's generally a bad idea to run a database on NFS as it'll be fine until it's not. Network block devices are fine though as it's just like a disk except it "plugs in" through your network jack rather than something like SATA.
The parent poster referred to both "networked disk" and "networked file system". I didn't catch the implied reference to NFS, so I assumed he was talking about the former, so that's what my comment is about.
I agree that NFS is a terrible idea for databases.
I err.. never thought I'd be talking about this seriously.
We (3 companies ago) used to use postgresql databases backed by a huge netapp filer.
Performance was as you would expect: horrible but tolerable. Why was it tolerable? Because high availability was more important than performance at the time, and native HA wasn't in PostgreSQL yet (8.1 had just come out).
The first DBA ever on staff was an absurdly knowledgeable guy (and paid handsomely for being so knowledgeable), and his first and strongest recommendation was, in this order:
1) Upgrade to the latest PostgreSQL to get a better vacuum implementation.
2) Get the data off the filer.
I highly advise against it unless your application caches tremendously well.
Forget performance, I'd be more worried about file locking, file fsync, and directory fsync. All are critical for a database and an "optimized" NFS driver that lies about the latter two can lead to corruption or data loss.
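For anyone unfamiliar with the directory-fsync point: after writing and renaming a file, the directory entry itself must also be fsynced, or the rename can be lost on crash. A minimal Python illustration of the write-fsync-rename-fsync pattern databases rely on (POSIX semantics assumed; the function name is mine):

```python
import os
import tempfile

def durable_write(path, data):
    """Write `data` to `path` so it survives a crash: write a temp file,
    fsync the file, rename it into place, then fsync the directory."""
    dirname = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dirname)
    try:
        os.write(fd, data)
        os.fsync(fd)          # flush file contents to stable storage
    finally:
        os.close(fd)
    os.rename(tmp, path)      # atomic replace within the same directory
    dfd = os.open(dirname, os.O_RDONLY)  # on POSIX, a directory fd can be fsynced
    try:
        os.fsync(dfd)         # persist the directory entry (the rename itself)
    finally:
        os.close(dfd)
```

An NFS client or server that acknowledges either fsync without actually persisting silently breaks this contract, which is exactly how "it'll be fine until it's not" corruption happens.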
If you can maintain network quality (latency, throughput, packet loss) the tradeoffs of using a SAN/NAS solution can be well worth it. For example it makes moving services+storage between compute a trivial affair.
To (inexactly) translate SAN/NAS into AWS terms for those that may not be familiar with these types of on-prem deployments:
S3/EFS/Glacier is NAS: files are stored and retrieved over the network.
EBS is SAN: the storage is exposed as a block-level device over the network and appears as a locally attached device.
Depends on your performance tolerance. Networked I/O can be an order of magnitude or more faster than a local spinning disk, but sure, network latency will likely never top the performance of a cutting-edge SSD used locally.
As a case in point, I've pushed greater than 600 MB/s of simultaneous heavy read and write traffic to an AWS EBS volume with acceptably low mean latency, which is more than enough for many Postgres use cases, but there are definitely cases where even that level of I/O would be a major bottleneck.
Assuming you mean a network block device rather than a network filesystem (be very, very wary of network filesystems that claim sufficient POSIX compliance to run a database; it may work, but test very carefully and see what people say about your specific setup, as it's very easy to get it wrong and end up with corrupted data), it'll work quite well, but it will typically perform awfully unless you spend a fortune working around bottlenecks that won't be there in most local setups.
It's a performance vs. availability/reliability tradeoff, and it's not a straightforward one.
I must admit I just skimmed over the initial parts of the post.
NFS is possible, but it's very easy to get the settings wrong, and it will depend on your specific NFS server whether or not you risk losing data in corner cases... I'd not consider NFS for a database for that reason without knowing precisely which NFS server and which version and testing that specific combination extensively (or verifying others have experience with exactly that combination).
It will be as slow as a block store, but most block stores offer much simpler semantics that are much easier to verify as sensible for a database workload (they need to store the blocks they claim to have stored, and that's pretty much all we care about; for the rest it's down to trusting a well-tested local filesystem).
As an added note, the claim that "any shared file system would also work" is patently false and horribly dangerous advice for a database deployment, as it depends entirely on the file system's crash semantics: what data is guaranteed to hit disk in a durable way, whether or not there's any chance of writes being reordered, etc.
There would no doubt be some performance differences between local and networked storage, but it is very normal in enterprises to deploy databases on a SAN to take advantage of the array's data services. On the NAS side, I believe Oracle databases are also often deployed on NetApp arrays; Oracle uses its own NFS client to ensure the DB isn't compromised by bad NFS client implementations.
Generally you would get a network attached block storage disk (Google Cloud Persistent Disk, EBS volume, etc).
The blog writes from the perspective of an on-prem deployment and suggests you could use NFS to provide a PersistentVolume to Kubernetes. While this is possible, it's probably not best practice, especially since a failure in your NFS server would take all your nodes' storage offline at once.
Networked disks are fairly normal for production databases - fibre channel SANs are networked disks - but this is quite different to network file systems.
When running PostgreSQL in any cloud provider, most of your data is served over the network.
I would say performance varies, but it is quite good; in comparison to magnetic disks, much better.
You should be able to handle most workloads with network-mounted storage.
Stateful Sets went to beta in Kubernetes 1.5 (in December) and we've been really excited to see so many applications get ported over. Please let me know if you have any questions!
Once this goes out of beta, would you say that running PostgreSQL on Kubernetes is now considered good practice?
I've been wary of running anything stateful on K8s for other reasons than the stable identity stuff. Historically, Docker's I/O has not been particularly robust — well, Docker has not been particularly robust. You also have to be very careful about resource limits and requests.
In particular, you have to ensure that the QoS class "Guaranteed" [1] kicks in, because the last thing you want is for Kubernetes to think that PostgreSQL is something that is happily rescheduled.
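For reference, a pod gets the Guaranteed QoS class when every container sets both requests and limits and they are equal for each resource. A sketch of such a spec (image and sizes are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: postgres
spec:
  containers:
  - name: postgres
    image: postgres:9.6
    resources:
      requests:        # requests == limits for every resource ...
        cpu: "2"
        memory: 4Gi
      limits:          # ... puts the pod in the Guaranteed QoS class
        cpu: "2"
        memory: 4Gi
```

Guaranteed pods are the last to be evicted under node resource pressure, which is what you want for a database.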
tl;dr: do you want to be a DBA? do you want to be a cluster admin? If you want to be both you can run databases on Kubernetes. If you don't want to be a DBA, use a managed service.
Of course. But there's a difference between knowing that there are failure modes, and actually knowing them.
Non-containerized environments (VMs, bare metal) are well-understood and fairly predictable; Docker and Kubernetes are moving targets. There are "known unknowns".
As an example, I am involved in a project that uses continuous delivery for deploys, and we recently had an incident where the GKE node ran out of inodes because Docker was keeping all the layers of dead containers around. Kubernetes handles this correctly: Best-effort pods were evicted and the node went back to being ready once the inode use went down. PostgreSQL would have stayed up in this case, but it was still a little surprising that GKE, which is nominally managed, does not have a process in place for cleaning up Docker's garbage. (Edit: Apparently it's a bug [1].)
When discussing Docker and Kubernetes, there are very often edge cases that come up. For example, until a couple of months ago, Kubernetes had a serious bug that would often cause the wrong network volume to be mounted to a new pod. You can be the world's best containerized-database admin and still be surprised by things like this, which have nothing to do with the nature of containers, and everything to do with the specifics of the software that controls them.
If you have to maintain Kubernetes _and_ classical VMs, you have two overlapping systems. I'd rather just use one, and get the flexibility it brings over classical VMs.
For example, I'd be able to upgrade by just starting another container and replicating into it, then routing traffic over. I'd retain the option of rolling back by routing traffic back if anything goes wrong. I'd be able to spin up/down redundant read-only replicas simply by changing the number of replicas. And other apps would be able to piggyback on Kubernetes' discovery system (typically you let service names resolve into IPs/SRV records through DNS). And so on.
You can do anything you want with classical VMs, of course, it's just more static and slower: Provisioning another VM, configuring it (Puppet, Salt, Chef, etc.), connecting persistent storage, reconfiguring everything to point to the right places, configuring DNS, etc.
In short: If you use Kubernetes as your orchestrator, why would you not want to use it for everything?
I totally get the attraction of learning one set of tooling, rather than k8s and ami/homebrew. Blue/green deploys are nice, particularly because we deploy for most PRs.
But it just seems like you're trying to shove a very stateful management problem (the schema, the associated storage) into a system designed for mostly stateless stuff. And that way lies a giant hassle.
Plus we're certainly not at huge scale, but db rollbacks are nontrivial, and we have to be somewhat careful about schema modifications.
But Kubernetes _is_ designed to support stateful containers; that's what StatefulSets are about.
There's nothing about containers that says they need to be stateless, it's just more convenient that way because you can treat every container as expendable, redundant, fungible commodities. But the real world needs state, and Kubernetes is capable of managing it. In particular, the pattern pretty much required for stateful apps on Kubernetes is that of using an exclusive-access network block device. Since only one container (or pod) can mount it at any given time, you can ensure that the state follows the container around.
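Concretely, a StatefulSet expresses this with volumeClaimTemplates: each replica gets its own PersistentVolumeClaim with ReadWriteOnce access (mountable by one node at a time), and the claim follows the replacement pod wherever it is rescheduled. A sketch, with illustrative names and sizes (apps/v1beta1 was the StatefulSet API version as of Kubernetes 1.5):

```yaml
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
  name: postgres
spec:
  serviceName: postgres
  replicas: 1
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
      - name: postgres
        image: postgres:9.6
        volumeMounts:
        - name: pgdata
          mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:
  - metadata:
      name: pgdata
    spec:
      accessModes: ["ReadWriteOnce"]   # exclusive: one node at a time
      resources:
        requests:
          storage: 100Gi
```

If the pod dies, its replacement gets the same stable network identity (postgres-0) and the same claim, so the state travels with it.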
I understand it can be done; it just seems like a hassle (viz this thread!) and I don't understand the upside. Discovery is an upside, but I bet there's a way to get discovery with manual updates.
It may just be our use case -- we have enough data (low TB) that cloning is not a fast thing to do.
This is unrelated to StatefulSets, but I'm going to take the opportunity to ask a Kubernetes engineer for help, since the kubernetes-users Slack channel sort of feels like shouting into a void.
We deploy a small cluster (1 master, 6 nodes) at our startup that started misbehaving last week. All of a sudden three of the nodes went down - one became unresponsive and two had the error "container runtime is down." We couldn't ssh into the unresponsive one, but according to AWS the machine was fine, still receiving network requests and using CPU.
Since we couldn't diagnose the issue, we spun up an entirely new cluster using kops, but started seeing the exact same behavior later that night, and again over the weekend. Three nodes were in a NotReady state for the same reasons (unresponsive and "container runtime is down"). Right now our only solution is to manually terminate the EC2 instances and rely on the Auto Scaling Group to create new ones. In the meantime, Kubernetes tells us that it can't schedule all of our desired pods, so half of our jobs aren't running, which is obviously an undesirable situation.
A handful of questions I have about the situation: Why are these nodes going down? What causes a node to go unresponsive? Why does the container runtime go down on a node and why doesn't it get restarted? Why doesn't Kubernetes destroy these nodes when they've been out of commission for 3-4 hours?
Any help would be appreciated!!! I've been looking through half a dozen log files and gotten zero answers.
So first, sorry about the problem. Please come hang out in the sig-aws or kops channels - we're a bit smaller and more focused than kubernetes-users, and can typically get these problems solved pretty quickly together.
IIRC we improved garbage collection settings in the latest kops (1.5.1), so if you were running out of disk, using the latest kops should fix everything. It's also easy to reconfigure to use a bigger root disk if you're churning through containers faster than GC can keep up. But if it's something else we can try to diagnose it as well!
> Why doesn't Kubernetes destroy these nodes when they've been out of commission for 3-4 hours?
We should, I believe. I actually thought we had an issue for this very problem, though I can't find it. I'll open a new one if I can't track it down. There is maybe an argument that we should fix the root cause, but there's an unlimited number of things that can go wrong, so we need to do both.
I ran into something very similar with a cluster almost identical to yours. It turns out the default disk size for kops is 25G, and when your masters run out of space, things start to die with almost no way of telling why.
I rerolled with 100G and I've seen zero problems since.
> Why doesn't Kubernetes destroy these nodes when they've been out of commission for 3-4 hours?
Kubernetes isn't responsible for the lifecycle of its nodes. It can run in a DC where "destroying a node" might mean paging a tech to turn off a server. Something external - in your case, kops & your ASG - is responsible for the nodes that Kubernetes runs on. That's a deliberate design choice.
It should make a correct decision not to schedule work there, which it sounds like it did.
Given that, your other questions are hard to answer. kubelet is a process that runs on the nodes. So is docker. If you can't get into the machine to diagnose the fault, I'd encourage you to set up some monitoring/log shipping off the node so you can see what the state was when it failed.
There's nothing inherently "Kubernetes" about this diagnosis - it's more EC2, node/kernel/OS and Docker troubleshooting, in that order.
Correct, Kubernetes is not responsible for the nodes. I would build a health check into your Auto Scaling Group (I don't know exactly how to do this on AWS, but am happy to show you an example on GCP - aronchick (at) google).
If you can't get to the machine, there are a million reasons why this would be the case - but ssh is a totally separate process, it's way outside of Kubernetes. VERY commonly, you've run out of memory and processes are fighting among themselves (especially since EVERYTHING seems to be failing), but this is total speculation. OS issues are common too - I've spun up clusters switching from one distro to another, same config, and everything worked great.
It's conceptually very similar to CoreOS' Container Linux, so I might try that if I were looking at Kubernetes elsewhere and wanted a container-only OS.
If I am running an environment with multiple purposes - some container hosts, some regular machines - I'd err on the side of "who is my current vendor/what does my ops team support and know best".
Great, thanks for the valuable info.
We are running SLES 12 and a SUSE OpenStack Cloud on bare metal, and SUSE has only recently announced their container strategy (the SLE MicroOS distro), but we haven't had time to evaluate it yet.
At a recent DevConf I saw some interesting talks about immutable container hosts such as Fedora Atomic. It seems a lot of work is being done in this area.
There's also https://github.com/staples-sparx/Agrajag which I've been working on. It is an abstraction on top of repmgr and repmgrd. It relies on repmgrd for quorum and uses ZooKeeper as a distributed store for who the new master is. This way your apps/bouncer/HAProxy can read off of it. It does not use ZK for leader election and relies on repmgrd for that. It's a move away from the normal repmgrd style of push notifications to a pull + broadcast mechanism. This allows many nodes to broadcast the same information (increasing the surface area of success) and enables fencing of old masters.
repmgr and repmgrd are written by core contributors of postgres and this is just an orchestration layer (with built-in monitoring) on top of it. It's very much under development at the moment.
There are plans to add more lines of defense in cases where repmgrd dies.
I'll be speaking about it a bit this week at pgconf.in if anyone's interested: http://pgconf.in/schedule/a-postgres-orchestra/
* Postgres automation for Kubernetes deployments https://github.com/sorintlab/stolon
* Automation for operating an etcd cluster: https://github.com/coreos/etcd-operator
* Kubernetes-native deployment of Ceph: https://rook.io/