* Postgres automation for Kubernetes deployments https://github.com/sorintlab/stolon
* Automation for operating the etcd cluster: https://github.com/coreos/etcd-operator
* Kubernetes-native deployment of Ceph:
Edit: https://github.com/zalando/patroni is a better link
Are you going to use this as is in production, or do you have an internal version of it?
It's not called an operator, but in many ways it is similar to what the etcd operator does.
Yes, you can connect to an etcd instance outside of Kubernetes, or deploy etcd on Kubernetes. CoreOS developed the etcd operator for the purposes of self-hosting the system KV store, but you can of course run a "user" etcd.
A better pattern is relying on the objects provided by Kubernetes, which are themselves backed by a lock service (which happens to be etcd, but that's an implementation detail you shouldn't worry about.)
You could, for example, use an Annotation on an object to do leader election. Look at how HA masters are implemented on Kubernetes, for example:
There's some more information (and a sidecar pod prebaked) at http://blog.kubernetes.io/2016/01/simple-leader-election-wit...
// This implementation does not guarantee that only one client is acting as a
// leader (a.k.a. fencing). A client observes timestamps captured locally to
// infer the state of the leader election. Thus the implementation is tolerant
// to arbitrary clock skew, but is not tolerant to arbitrary clock skew rate.
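A minimal sketch of the lease pattern that comment describes (in Python, with illustrative names, not the real client-go API): each candidate times the lease against its own local clock, which is why constant clock skew is tolerable while a varying clock *rate* is not.

```python
LEASE_SECONDS = 15.0

class Store:
    """Stands in for an annotation on a Kubernetes object that
    candidates update with compare-and-swap semantics."""
    def __init__(self):
        self.record = None  # e.g. {"holder": ..., "renew": ...}

class Candidate:
    def __init__(self, name, store):
        self.name = name
        self.store = store
        self._seen = None    # last record we observed
        self._seen_at = 0.0  # *local* clock when we observed it

    def tick(self, now):
        """Try to acquire or renew the lease; return True if we lead."""
        rec = self.store.record
        if rec != self._seen:
            # The record changed, so restart our local timer; we never
            # trust the timestamp the leader wrote, only our own clock.
            self._seen, self._seen_at = rec, now
        if rec is not None and rec["holder"] != self.name:
            if now - self._seen_at < LEASE_SECONDS:
                return False  # holder looks alive on our local clock
        # Lease is free, expired, or already ours: write/renew it.
        self.store.record = {"holder": self.name, "renew": now}
        return True
```

Note that a follower only takes over after a full lease duration passes on its own clock with no observed renewal, which is exactly why the guarantee above is "at most one leader, usually" rather than true fencing.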
The Kubernetes StatefulSets project is actually trying to build a use case for PostgreSQL-on-S3. I think it's the wrong abstraction to go after in most use cases for database systems, including things like Elasticsearch. I don't think that even Aurora is trying to solve this using persistent storage.
You don't really care about solving underlying persistence; you make the whole system fault tolerant, which includes high availability, failover, etc.
IMHO kubernetes still has problems in orchestrating an ElasticSearch/Zookeeper cluster for high availability. The problems of discovery, load balancing, etc. are bottlenecks.
In some cases it is perfectly valid to rely on underlying storage to get HA. E.g. RDS relies on EBS https://aws.amazon.com/message/65648/
> IMHO kubernetes still has problems in orchestrating an ElasticSearch/Zookeeper cluster for high availability
I don't think that is the case any more. It is not trivial to build proper automation on top of K8s, but it is possible; here is an example of a complex distributed DB deployed on top of Kubernetes:
There are many challenges, but they are not specific to Kubernetes; automating the deployment of such complex systems is just generally hard.
I personally believe that orchestration primitives like operators, linkerd, ingress, etc. are what the k8s project should focus on. I'm actually unsure if StatefulSets are solving production problems for anybody.
IMHO they are a fairly Googley concept and hard for anybody to apply outside of Google (without solving the networked-filesystem problem first).
Ultimately, the "Operator framework" is just a pattern: you represent some class of software (e.g. etcd, Elasticsearch, PostgreSQL) as a ThirdPartyResource, and you have some controller that monitors those TPRs, driving other resources toward the desired state.
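As a rough sketch of that pattern (Python, with made-up field names rather than a real Kubernetes API), the core of such a controller is a reconcile function that diffs the desired state in the TPR against the observed state and emits the actions needed to converge:

```python
def reconcile(desired, actual):
    """One pass of an operator-style control loop: compare the spec
    from the third-party resource with what actually exists, and
    return the actions needed to converge toward the desired state.
    The dict shapes and action names here are illustrative."""
    actions = []
    want = desired["replicas"]
    have = len(actual["members"])
    if have < want:
        # Scale up: create the missing members.
        actions += [("create-member", i) for i in range(have, want)]
    elif have > want:
        # Scale down: remove the surplus members.
        actions += [("remove-member", i) for i in range(want, have)]
    if actual.get("version") != desired.get("version"):
        actions.append(("upgrade", desired["version"]))
    return actions
```

The controller runs this in a loop (typically on a watch), applies the actions, and the next pass verifies the result; returning an empty list means the system is converged.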
It might be positioning: Operators vs. Helm, for example (even though the blog post tries to differentiate them).
The way it makes it happen might be by using an operator, a Deployment/ReplicaSet, or a StatefulSet.
Think of operators as a "replica set that is aware of what it is replicating", and can do work to get things in the correct state.
Does anybody have experience with a big enough database in this kind of configuration, and how does the performance compare with local SSD disks? Intuitively I wouldn't expect a networked disk to be a viable solution for Postgres, but seeing it proposed like this makes me think that maybe my intuition was wrong...
They're implemented like a software SAN, with dedicated network paths, and throughput/latency is very good. Not as good as local, but still good enough for most apps.
On Google Cloud, you do have the option of using a local SSD, but it comes with a bunch of limitations. It maxes out at 375GB (though you can allocate up to 8 per VM). They're somewhat expensive. They're not durable; if you stop the VM, the disk is lost. They're classified as "scratch disks" in the UI, which is a good indicator of their intended purpose.
Edit: Using a networked file system, however, is a bad idea.
NFS != SAN
The former is a shared file system accessible by multiple machines simultaneously.
The latter is a block device accessed over the network by a single machine at a time.
It's generally a bad idea to run a database on NFS as it'll be fine until it's not. Network block devices are fine though as it's just like a disk except it "plugs in" through your network jack rather than something like SATA.
I agree that NFS is a terrible idea for databases.
We (3 companies ago) used to use postgresql databases backed by a huge netapp filer.
Performance was, as you would expect, horrible but tolerable. Why was it tolerable? Because high availability was more important than performance at the time, and native HA wasn't in PostgreSQL yet. (8.1 had just come out.)
The first DBA ever on staff was an absurdly knowledgeable guy (and paid handsomely for being so knowledgeable), and his first and strongest recommendations were, in this order:
1) Upgrade to the latest PostgreSQL to get a better vacuum implementation.
2) Get the data off the filer.
I highly advise against it unless your application caches tremendously well.
To (inexactly) translate SAN/NAS into AWS terms for those who may not be familiar with these types of on-prem deployments:
S3/EFS/Glacier are NAS: files are stored and retrieved over the network.
EBS is SAN: the storage is exposed as a block-level device over the network and appears as a locally attached device.
As a case in point, I've pushed greater than 600 MB/s of simultaneous heavy read and write traffic to an AWS EBS volume with acceptably low mean latency, which is more than enough for many Postgres use cases, but there are definitely cases where even that level of IO would be a major bottleneck.
GCP offers networked SSD, and performance is fairly good, but, as always, YMMV.
Disclosure: I work at Google on Kubernetes
It's a performance vs. availability/reliability tradeoff, and it's not a straightforward one.
> The example in this blog uses NFS for the Persistent Volumes, but any shared file system would also work (ex: ceph, gluster).
NFS is possible, but it's very easy to get the settings wrong, and whether you risk losing data in corner cases will depend on your specific NFS server... I'd not consider NFS for a database for that reason without knowing precisely which NFS server and which version, and testing that specific combination extensively (or verifying that others have experience with exactly that combination).
It will be as slow as a block store, but most block stores offer much simpler semantics that are much easier to verify for a database workload (they need to store the blocks they claim to have stored, and that's pretty much all we care about; for the rest it comes down to trusting a well-tested local filesystem).
As an added note, "any shared file system would also work" is patently false and horribly dangerous advice for a database deployment, as it depends entirely on the file system's crash semantics: what data is guaranteed to hit disk in a durable way, whether or not there's any chance of writes being reordered, etc.
The blog post writes from the perspective of an on-prem deployment and suggests you could use NFS to provide a PersistentVolume to Kubernetes. While this is possible, it's probably not best practice, especially since a failure in your NFS server would render all your nodes offline at once.
Disclosure: I work at Google on Kubernetes.
I've been wary of running anything stateful on K8s for other reasons than the stable identity stuff. Historically, Docker's I/O has not been particularly robust — well, Docker has not been particularly robust. You also have to be very careful about resource limits and requests.
In particular, you have to ensure that the "Guaranteed" QoS class kicks in, because the last thing you want is for Kubernetes to think that PostgreSQL is something that can happily be rescheduled.
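For reference, a pod is classed as Guaranteed only when every container's resource requests equal its limits; a minimal sketch of such a spec (the name, image, and sizes are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: postgres          # illustrative name
spec:
  containers:
  - name: postgres
    image: postgres:9.6
    resources:
      requests:           # requests == limits for every resource
        cpu: "2"
        memory: 4Gi
      limits:
        cpu: "2"
        memory: 4Gi
```

If requests are lower than limits (Burstable) or absent (BestEffort), the pod becomes a candidate for eviction under node pressure before Guaranteed pods do.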
tl;dr: do you want to be a DBA? do you want to be a cluster admin? If you want to be both you can run databases on Kubernetes. If you don't want to be a DBA, use a managed service.
Non-containerized environments (VMs, bare metal) are well-understood and fairly predictable; Docker and Kubernetes are moving targets. There are "known unknowns".
As an example, I am involved in a project that uses continuous delivery for deploys, and we recently had an incident where the GKE node ran out of inodes because Docker was keeping all the layers of dead containers around. Kubernetes handles this correctly: Best-effort pods were evicted and the node went back to being ready once the inode use went down. PostgreSQL would have stayed up in this case, but it was still a little surprising that GKE, which is nominally managed, does not have a process in place for cleaning up Docker's garbage. (Edit: Apparently it's a bug.)
When discussing Docker and Kubernetes, there are very often edge cases that come up. For example, until a couple of months ago, Kubernetes had a serious bug that would often cause the wrong network volume to be mounted to a new pod. You can be the world's best containerized-database admin and still be surprised by things like this, which have nothing to do with the nature of containers, and everything to do with the specifics of the software that controls them.
"containerd is all we wanted from @docker in @kubernetesio and none of what we didn't need: kudos to the team!"
For example, I'd be able to upgrade by just starting another container and replicating into it, then routing traffic over. I'd retain the option of rolling back by routing traffic back if anything goes wrong. I'd be able to spin up/down redundant read-only replicas simply by changing the number of replicas. And other apps would be able to piggyback on Kubernetes' discovery system (typically you let service names resolve into IPs/SRV records through DNS). And so on.
You can do anything you want with classical VMs, of course, it's just more static and slower: Provisioning another VM, configuring it (Puppet, Salt, Chef, etc.), connecting persistent storage, reconfiguring everything to point to the right places, configuring DNS, etc.
In short: If you use Kubernetes as your orchestrator, why would you not want to use it for everything?
But it just seems like you're trying to shove a very stateful management problem (the schema, the associated storage) into a system designed for mostly stateless stuff. And that way lies a giant hassle.
Plus we're certainly not at huge scale, but db rollbacks are nontrivial, and we have to be somewhat careful about schema modifications.
Thanks for your comment!
There's nothing about containers that says they need to be stateless, it's just more convenient that way because you can treat every container as expendable, redundant, fungible commodities. But the real world needs state, and Kubernetes is capable of managing it. In particular, the pattern pretty much required for stateful apps on Kubernetes is that of using an exclusive-access network block device. Since only one container (or pod) can mount it at any given time, you can ensure that the state follows the container around.
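Concretely, that exclusive-access pattern is what a PersistentVolumeClaim with the ReadWriteOnce access mode gives you; a minimal sketch (the name and size are illustrative):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-data     # illustrative name
spec:
  accessModes:
  - ReadWriteOnce         # mounted read-write by a single node at a time
  resources:
    requests:
      storage: 100Gi
```

The pod references this claim as a volume, and when the pod is rescheduled the claim (and the block device behind it) follows it to the new node.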
It may just be our use case -- we have enough data (low TB) that cloning is not a fast thing to do.
We run a small cluster (1 master, 6 nodes) at our startup, and it started misbehaving last week. All of a sudden three of the nodes went down: one became unresponsive and two had the error "container runtime is down." We couldn't ssh into the unresponsive one, but according to AWS the machine was fine, still receiving network requests and using CPU.
Since we couldn't diagnose the issue, we spun up an entirely new cluster using kops, but started seeing the exact same behavior later that night, and again over the weekend. Three nodes were in a not-ready state for the same reasons (unresponsive and "container runtime is down"). Right now our only fix is to manually terminate the EC2 instances and rely on the Auto Scaling Group to create new ones. In the meantime, Kubernetes tells us that it can't schedule all of our desired pods, so half of our jobs aren't running, which is obviously an undesirable situation.
A handful of questions I have about the situation: Why are these nodes going down? What causes a node to go unresponsive? Why does the container runtime go down on a node and why doesn't it get restarted? Why doesn't Kubernetes destroy these nodes when they've been out of commission for 3-4 hours?
Any help would be appreciated!!! I've been looking through half a dozen log files and gotten zero answers.
IIRC we improved garbage collection settings in the latest kops (1.5.1), so if you were running out of disk, using the latest kops should fix everything. It's also easy to reconfigure to use a bigger root disk if you're churning through containers faster than GC can keep up. But if it's something else we can try to diagnose it as well!
> Why doesn't Kubernetes destroy these nodes when they've been out of commission for 3-4 hours?
We should, I believe. I actually thought we had an issue for this very problem, though I can't find it. I'll open a new one if I can't track it down. There is maybe an argument that we should fix the root cause, but there's an unlimited number of things that can go wrong, so we need to do both.
(edit: Gave up on finding the existing issue and opened https://github.com/kubernetes/kops/issues/2002 )
I rerolled with 100G and I've seen zero problems since.
Kubernetes isn't responsible for the lifecycle of its nodes. It can run in a DC where "destroying a node" might mean paging a tech to turn off a server. Something external - in your case, kops & your ASG - is responsible for the nodes that Kubernetes runs on. That's a deliberate design choice.
It should make a correct decision not to schedule work there, which it sounds like it did.
Given that, your other questions are hard to answer. kubelet is a process that runs on the nodes. So is docker. If you can't get into the machine to diagnose the fault, I'd encourage you to set up some monitoring/log shipping off the node so you can see what the state was when it failed.
There's nothing inherently "Kubernetes" about this diagnosis - it's more EC2, node/kernel/OS and Docker troubleshooting, in that order.
If you can't get to the machine, there are a million reasons why this could be the case, but ssh is a totally separate process, way outside of Kubernetes. VERY commonly, you've run out of memory and processes are fighting among themselves (especially since EVERYTHING seems to be failing), but this is total speculation. OS issues are common too; I've spun up clusters switching from one distro to another, same config, and everything worked great.
The Container Optimised OS is what GKE uses on Google Cloud Platform https://cloud.google.com/container-optimized-os/docs/
It's conceptually very similar to CoreOS' Container Linux, so I might try that if I were looking at Kubernetes elsewhere and wanted a container-only OS.
If I am running an environment with multiple purposes - some container hosts, some regular machines - I'd err on the side of "who is my current vendor/what does my ops team support and know best".
repmgr and repmgrd are written by core contributors of Postgres, and this is just an orchestration layer (with built-in monitoring) on top of them. It's very much under development at the moment.
More details are here: https://github.com/staples-sparx/Agrajag/blob/dev/doc/tradeo...
There are plans to add more lines of defense in cases where repmgrd dies.
I'll be speaking about it a bit this week at pgconf.in if anyone's interested: http://pgconf.in/schedule/a-postgres-orchestra/