
Deploying PostgreSQL Clusters Using StatefulSets - TheIronYuppie
http://blog.kubernetes.io/2017/02/postgresql-clusters-kubernetes-statefulsets.html
======
alexk
If you are interested in deployments of databases on top of cluster
orchestrators, here are some interesting projects:

* Postgres automation for Kubernetes deployments [https://github.com/sorintlab/stolon](https://github.com/sorintlab/stolon)

* Automation for operating etcd clusters: [https://github.com/coreos/etcd-operator](https://github.com/coreos/etcd-operator)

* Kubernetes-native deployment of Ceph: [https://rook.io/](https://rook.io/)

~~~
foxylion
Thanks for the links. Do you know of a project that targets PostgreSQL as an
operator, like etcd?

~~~
sarnowski
Yes, we are currently developing a PostgreSQL operator for Kubernetes. We are
heavy PostgreSQL users. Spilo[0] supports all kinds of slave configurations,
backups, monitoring tools[1] and other advanced capabilities.

[0] [https://github.com/zalando/spilo](https://github.com/zalando/spilo) [1]
[https://github.com/zalando/PGObserver](https://github.com/zalando/PGObserver)

Edit: [https://github.com/zalando/patroni](https://github.com/zalando/patroni)
is a better link

~~~
foxylion
Thank you! And Zalando as a big company behind it is even better.

Are you going to use this as-is in production, or do you have an internal
version of it?

~~~
sarnowski
We don't have a separate internal version. We are already running several
hundred PostgreSQL clusters in AWS + etcd with Spilo/Patroni. The Kubernetes
variant is not used in production yet, but we plan to migrate the first
datasets soonish (probably in a few weeks).

~~~
foxylion
This is great to hear, will definitely try it.

------
sandGorgon
Here's a quick question - aren't StatefulSets and PersistentVolumeClaims the
exact _opposite_ of how fault-tolerant database systems are generally built?

The Kubernetes StatefulSets project is actually trying to build a use case for
Postgresql-on-S3. I think it's the wrong abstraction to go after in most use
cases for database systems, including things like Elasticsearch. I don't
think that even Aurora is trying to solve this using persistent storage.

You don't really care about solving underlying persistence... you make the
whole system fault tolerant, which includes high availability, failover, etc.

IMHO Kubernetes still has problems orchestrating an Elasticsearch/ZooKeeper
cluster for high availability. The problems of discovery, load balancing, etc.
are bottlenecks.

~~~
alexk
> Here's a quick question - aren't StatefulSets and PersistentVolumeClaims the
> exact opposite of how fault-tolerant database systems are generally built?

In some cases it is perfectly valid to rely on underlying storage to get HA.
E.g. RDS relies on EBS
[https://aws.amazon.com/message/65648/](https://aws.amazon.com/message/65648/)

> IMHO Kubernetes still has problems orchestrating an
> Elasticsearch/ZooKeeper cluster for high availability

I don't think that is the case any more. It is not trivial to build proper
automation on top of K8s, but it is possible; here is an example of a complex
distributed DB deployed on top of Kubernetes:

[https://github.com/coreos/etcd-operator](https://github.com/coreos/etcd-operator)

There are many challenges, but they are not specific to Kubernetes; automating
the deployment of such complex systems is generally hard.

~~~
sandGorgon
Thanks for pointing out the Operator framework. Please correct me if I am
wrong, but Operator is not a Kubernetes thing - it's custom functionality
built by CoreOS. It is undoubtedly awesome, but it is very hard to get it
working on a non-CoreOS distro.

I personally believe that orchestration primitives like operators, linkerd,
ingress, etc. are what the k8s project should focus on. I'm actually unsure if
StatefulSets are solving production problems for anybody.

IMHO they are a fairly Googley concept and hard for anybody to apply outside
of Google (without solving the networked filesystem problem first).

~~~
xnxn
Small correction: Operators aren't a CoreOS-specific thing, and they're easy
to install on any cluster.

Ultimately, the "Operator framework" is just a pattern: you represent some
class of software (e.g. etcd, Elasticsearch, PostgreSQL) as a
ThirdPartyResource, and you have some controller that monitors those TPRs,
driving other resources toward the desired state.
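The reconcile loop at the heart of that pattern can be sketched in a few lines. Everything below is illustrative (no real Kubernetes client; a real controller would watch the API server and create/delete pods), but the observe-diff-act shape is the point:

```python
# Toy sketch of the operator/controller pattern: compare a declared
# spec (the TPR) against observed state, and emit the actions needed
# to drive the system toward the desired state. All names here are
# invented for illustration, not a real Kubernetes API.

def reconcile(spec, observed):
    """Return the actions needed to move observed state toward spec."""
    actions = []
    want = spec["replicas"]
    have = len(observed["members"])
    if have < want:
        actions += [("create-member", i) for i in range(have, want)]
    elif have > want:
        actions += [("remove-member", i) for i in range(want, have)]
    return actions

# The controller loop: observe, diff, act -- repeated forever in practice.
spec = {"replicas": 3}
observed = {"members": ["etcd-0"]}
for action, idx in reconcile(spec, observed):
    if action == "create-member":
        observed["members"].append(f"etcd-{idx}")

print(observed["members"])  # ['etcd-0', 'etcd-1', 'etcd-2']
```

The etcd operator is essentially this loop plus all the domain knowledge about what "create a member" safely means for etcd.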

~~~
sandGorgon
Sounds good - my bad then. I think they are awesome and not very well
utilized.

It might be positioning - Operators vs. Helm, for example (even though the
blog post tries to differentiate them).

~~~
crb
Think of Helm like a Linux package manager. You tell it you want etcd, and it
makes it happen for you.

The way it makes it happen might be by using an operator, a
Deployment/ReplicaSet, or a StatefulSet.

Think of operators as a "replica set that is aware of what it is replicating",
and can do work to get things in the correct state.

------
danmaz74
Slightly OT, but, from a quick glance at the post, it looks like the postgres
data is persisted on a network file system.

Does anybody have experience with a big enough database and this kind of
configuration? How does the performance compare with local SSD disks?
Intuitively I wouldn't expect a networked disk to be a viable solution for
postgres, but seeing it proposed like this makes me think that maybe my
intuition was wrong...

~~~
lobster_johnson
People have run Postgres on network disks — Elastic Block Storage (EBS)
volumes on AWS, or Persistent Disks on Google Cloud — for a long time.

They're implemented like a software SAN, with dedicated network paths, and
throughput/latency is very good. Not as good as local, but still good enough
for most apps.

On Google Cloud, you do have the option of using a local SSD [1], but it comes
with a bunch of limitations. It maxes out at 375GB (though you can allocate up
to 8 per VM). They're somewhat expensive. They're not durable; if you stop the
VM, the disk is lost. They're classified as "scratch disks" in the UI, which
is a good indicator of their intended purpose.

Edit: Using a networked _file system_ , however, is a bad idea.

[1] [https://cloud.google.com/compute/docs/disks/local-ssd](https://cloud.google.com/compute/docs/disks/local-ssd)

~~~
koolba
> People have run Postgres on network disks — Elastic Block Storage (EBS)
> volumes on AWS, or Persistent Disks on Google Cloud — for a long time.

NFS != SAN

The former is a shared file system accessible by multiple machines
simultaneously.

The latter is a block device accessed over the network by a single machine at
a time.

It's generally a bad idea to run a database on NFS as it'll be fine until it's
not. Network block devices are fine though as it's just like a disk except it
"plugs in" through your network jack rather than something like SATA.

~~~
lobster_johnson
The parent poster referred to both "networked disk" and "networked file
system". I didn't catch the implied reference to NFS and assumed he was
talking about the former, so that's what my comment is about.

I agree that NFS is a terrible idea for databases.

------
TheIronYuppie
StatefulSets went to beta in Kubernetes 1.5 (in December) and we've been
really excited to see so many applications get ported over. Please let me know
if you have any questions!

Disclosure: I work at Google on Kubernetes.
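For anyone who hasn't tried one yet, here's a minimal sketch of what a single-replica Postgres StatefulSet with a per-pod volume claim might look like (image, sizes and names are illustrative, not taken from the blog post):

```yaml
# Minimal illustrative sketch, not the manifest from the blog post.
apiVersion: apps/v1beta1        # StatefulSets are beta as of Kubernetes 1.5
kind: StatefulSet
metadata:
  name: postgres
spec:
  serviceName: postgres         # headless Service providing stable DNS names
  replicas: 1
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
      - name: postgres
        image: postgres:9.6
        ports:
        - containerPort: 5432
        volumeMounts:
        - name: pgdata
          mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:         # each replica gets its own PersistentVolumeClaim
  - metadata:
      name: pgdata
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 10Gi
```

The volumeClaimTemplate is what gives each pod a stable, re-attachable volume across reschedules, which is the part plain Deployments can't do.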

~~~
PaulJulius
This is unrelated to StatefulSets, but I'm going to take the opportunity to
ask a Kubernetes engineer for help, since the kubernetes-users Slack
channel sort of feels like shouting into a void.

We deploy a small cluster (1 master, 6 nodes) at our startup that started
misbehaving last week. All of a sudden three of the nodes went down - one
became unresponsive and two had the error "container runtime is down." We
couldn't ssh into the unresponsive one, but according to AWS the machine was
fine, still receiving network requests and using CPU.

Since we couldn't diagnose the issue, we spun up an entirely new cluster using
kops, but started seeing the exact same behavior later that night, and again
over the weekend. Three nodes were in a not ready state, for the same reasons
(unresponsive and container runtime is down). Right now our only way to
solve this issue is to manually terminate the EC2 instances and rely on the
Auto Scaling Group to create new ones. In the meantime, Kubernetes tells us
that it can't schedule all of our desired pods, so half of our jobs aren't
running - obviously an undesirable situation.

A handful of questions I have about the situation: Why are these nodes going
down? What causes a node to go unresponsive? Why does the container runtime go
down on a node and why doesn't it get restarted? Why doesn't Kubernetes
destroy these nodes when they've been out of commission for 3-4 hours?

Any help would be appreciated!!! I've been looking through half a dozen log
files and gotten zero answers.

~~~
crb
> Why doesn't Kubernetes destroy these nodes when they've been out of
> commission for 3-4 hours?

Kubernetes isn't responsible for the lifecycle of its nodes. It can run in a
DC where "destroying a node" might mean paging a tech to turn off a server.
Something external - in your case, kops & your ASG - is responsible for the
nodes that Kubernetes runs on. That's a deliberate design choice.

It should make a correct decision not to schedule work there, which it sounds
like it did.

Given that, your other questions are hard to answer. kubelet is a process that
runs on the nodes. So is docker. If you can't get into the machine to diagnose
the fault, I'd encourage you to set up some monitoring/log shipping off the
node so you can see what the state was when it failed.

There's nothing inherently "Kubernetes" about this diagnosis - it's more EC2,
node/kernel/OS and Docker troubleshooting, in that order.
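As a starting point, the Ready condition in each node's status (what `kubectl get nodes -o json` returns) already encodes the kubelet's view. Here's a small sketch that flags not-ready nodes; the sample data is invented, but the type/status/reason fields mirror the real Node API:

```python
# Sketch: flag NotReady nodes from Kubernetes Node condition objects.
# The condition shape (type/status/reason) mirrors the real Node API;
# the sample node data below is invented for illustration.

def not_ready_nodes(nodes):
    """Return (name, reason) for every node whose Ready condition isn't True."""
    flagged = []
    for node in nodes:
        for cond in node["status"]["conditions"]:
            if cond["type"] == "Ready" and cond["status"] != "True":
                flagged.append((node["metadata"]["name"],
                                cond.get("reason", "unknown")))
    return flagged

nodes = [
    {"metadata": {"name": "node-1"},
     "status": {"conditions": [{"type": "Ready", "status": "True"}]}},
    {"metadata": {"name": "node-2"},
     "status": {"conditions": [{"type": "Ready", "status": "False",
                                "reason": "KubeletNotReady"}]}},
    {"metadata": {"name": "node-3"},
     "status": {"conditions": [{"type": "Ready", "status": "Unknown",
                                "reason": "NodeStatusUnknown"}]}},
]

print(not_ready_nodes(nodes))
# [('node-2', 'KubeletNotReady'), ('node-3', 'NodeStatusUnknown')]
```

A status of "Unknown" usually means the node controller stopped hearing from the kubelet at all, which matches the "can't even ssh in" symptom.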

~~~
TheIronYuppie
Correct, Kubernetes is not responsible for the nodes. I would build a health
check into your Auto Scaling Group (I don't know exactly how to do this on
AWS, but am happy to show you an example on GCP - aronchick (at) google).

If you can't get to the machine, there are a million reasons why this could be
the case - but ssh is a totally separate process; it's way outside of
Kubernetes. VERY commonly, you've run out of memory and processes are fighting
among themselves (especially since EVERYTHING seems to be failing), but this
is total speculation. OS issues are common too - I've spun up clusters
switching from one distro to another, same config, and everything worked
great.

Disclosure: I work at Google on Kubernetes.

~~~
timeu
Speaking of distros, and considering your background: what would be the "best"
distro for running Kubernetes?

~~~
crb
If all the OS does is provide a minimal surface for running containers, I'd
focus on whatever gives me the best security, manageability and updates.

Container-Optimized OS is what GKE uses on Google Cloud Platform:
[https://cloud.google.com/container-optimized-os/docs/](https://cloud.google.com/container-optimized-os/docs/)

It's conceptually very similar to CoreOS' Container Linux, so I might try that
if I were looking at Kubernetes elsewhere and wanted a container-only OS.

If I were running an environment with multiple purposes - some container hosts,
some regular machines - I'd err on the side of "who is my current vendor/what
does my ops team support and know best".

~~~
timeu
Great, thanks for the valuable info. We are running SLES12 and also a SUSE
OpenStack Cloud on bare metal, and SUSE has only recently announced their
container strategy (the SLE MicroOS distro), but we haven't had time to
evaluate it yet. At a recent DevConf I saw some interesting talks about
immutable container hosts such as Fedora Atomic. It seems that there is a lot
of work being done in this area.

------
kitallis
There's also [https://github.com/staples-sparx/Agrajag](https://github.com/staples-sparx/Agrajag),
which I've been working on. It is an abstraction on top of repmgr and repmgrd.
It relies on repmgrd for quorum and uses ZooKeeper as a distributed store for
who the new master is. This way your apps/bouncer/HAProxy can read off of it.
It does not use zk for leader election and relies on repmgrd for that. It's a
move away from the normal repmgrd style of push notifications to a pull +
broadcast mechanism. This allows many nodes to broadcast the same information
(increasing the surface area of success) and to fence old masters.
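The pull + broadcast idea can be sketched in a few lines: several nodes each report who they believe the master is, and a reader only trusts an answer a strict majority agrees on. Names here are illustrative, not Agrajag's actual API:

```python
from collections import Counter

# Sketch of the pull + broadcast idea: many nodes broadcast their view
# of the current master, and a reader only accepts an answer a strict
# majority agrees on. Illustrative only, not Agrajag's actual API.

def agreed_master(reports):
    """Return the master a strict majority of reports agree on, else None."""
    if not reports:
        return None
    (master, votes), = Counter(reports).most_common(1)
    return master if votes > len(reports) / 2 else None

print(agreed_master(["pg-2", "pg-2", "pg-2", "pg-1"]))  # pg-2
print(agreed_master(["pg-1", "pg-2"]))                  # None: no majority
```

Having multiple broadcasters means a single stale or dead observer can't convince clients that a fenced old master is still primary.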

repmgr and repmgrd are written by core contributors of postgres and this is
just an orchestration layer (with built-in monitoring) on top of it. It's very
much under development at the moment.

More details are here: [https://github.com/staples-sparx/Agrajag/blob/dev/doc/tradeoffs.md](https://github.com/staples-sparx/Agrajag/blob/dev/doc/tradeoffs.md)

There are plans to add more lines of defense in cases where repmgrd dies. I'll
be speaking about it a bit this week at pgconf.in if anyone's interested:
[http://pgconf.in/schedule/a-postgres-orchestra/](http://pgconf.in/schedule/a-postgres-orchestra/)

