
Experiences with running PostgreSQL on Kubernetes - craigkerstiens
https://gravitational.com/blog/running-postgresql-on-kubernetes/
======
vasco
So they say it's hard to run it in Kubernetes because they weren't running it
in StatefulSets. Then they say you can actually run it properly in a
StatefulSet but hand-wave it away with "but people run it on top of Ceph which
has issues with latency". That sounds like a shitty excuse to just dismiss the
whole thing as too hard. If one particular underlying storage system wasn't
good, use a better one. Kubernetes lets you abstract the actual storage
implementation to whatever you want, so it's not like there's a lack of options.

Also I'd be curious to see if "most people" actually use Ceph vs network
storage like EBS volumes where AWS guarantees me that I won't have data
corruption issues in exchange for money.

~~~
twakefield
"That sounds like a shitty excuse to just dismiss the whole thing as too hard"

Disclosure: I work at the company that published this post.

I read it differently (albeit, I have much more context). I read it as a
cautionary tale that Kubernetes makes it easier to get into trouble if you
don't know what you are doing - so you'd better have deep knowledge of your
stateful workloads and technology. Perhaps obvious to some, but still a good
reminder when dealing with a well-hyped technology like Kubernetes.

~~~
toomuchtodo
I appreciate pragmatic replies like this, providing opportunities to
gracefully get off the hype train for those who bought in.

~~~
tyingq
I agree. Kubernetes seems clear that "pets" aren't its strong suit. I'm
surprised at the surprise here.

They will likely get better at it over time. Until then, either run it outside
of K8S or deal with the warts.

------
dijit
I am highly skeptical of container systems for persistence. Docker does not
treat disk I/O as a first-class citizen for a reason: they're focusing
(rightfully) on isolated compute and deployment/dependencies, and Kubernetes,
to my mind, builds on that quite nicely.

I generally avoid abstractions for persistence layers, but I believe I'm in
the minority and I believe I'm going against what the industry desires me to do.

I'm not convinced this is a good idea /yet/. I did, however, see some
interesting docker/kubernetes integration with ScaleIO (clustered filesystem)
which cut out huge chunks of the disk I/O pipeline (for performance) and was
highly resilient.

The demo I saw was using postgresql; the dude yanked the cord out of the host
running the postgresql pod.

Quite impressive in my opinion.

[https://github.com/thecodeteam/rexray](https://github.com/thecodeteam/rexray)

[https://github.com/kubernetes/examples/blob/master/staging/v...](https://github.com/kubernetes/examples/blob/master/staging/volumes/scaleio/README.md)

------
llama052
Reminds me of the Jurassic Park quote:

"Your scientists were so preoccupied with whether or not they could, they
didn’t stop to think if they should."

~~~
old-gregg
Yes, they should. It's nice to have an (ultimately) data-center-wide
scheduling platform with unified role-based access control, monitoring, utilization
optimization and reporting, cost control and even exotic features like
hardware validation. Just because earlier Kubernetes versions weren't good at
everything at once doesn't mean it's doomed forever to be unsuitable for
something.

Moreover, the use case he's discussing is quite unusual: Gravitational takes a
snapshot of an existing Kubernetes cluster (including all applications inside
of it) and gives you a single-file installer you can sell into on-premise
private environments; basically it's InstallShield + live updating for cloud
software. So, running everything on Kubernetes opens entirely new markets for
a SaaS company to sell to.

This level of zero-effort application introspection hasn't been possible prior
to Kubernetes, so that's another reason to use it for everything: it promises
true infrastructure independence (i.e. developers do not have to even touch
AWS APIs) that actually works.

~~~
ris
> basically it's InstallShield + live updating for cloud software

Sounds like a nightmare to me.

~~~
tomsthumb
Honestly, kube is a no-brainer thank-you-sweet-baby-jesus improvement in ops
posture over an easy handful of shops I’ve come across.

------
antoncohen
I think the core point of this article is that running replicated DBs on
Kubernetes requires deep knowledge of the DBs in question. You can't just use
a StatefulSet and expect it to work well.

You need to override Kubernetes' built-in controllers to customize them to the
details of the database, for example when it is safe to fail over, or when it
is safe to scale down. Outside of Kubernetes these decisions are made by cloud
services like RDS, or manually by people with that knowledge, like DBAs.

To put that knowledge into Kubernetes you need Custom Resource Definitions
(CRDs).

I don't know if it is any good, but I just found a project called KubeDB that
has CRDs for various DBs: [https://kubedb.com/](https://kubedb.com/) and
[https://github.com/kubedb](https://github.com/kubedb)
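
A CRD by itself just registers a new resource type with the API server; the
controller/operator watching that type is what actually encodes the failover
and scale-down logic. A minimal sketch of the registration half, using a
made-up "PgCluster" type rather than KubeDB's actual schema:

    # Registers a PgCluster resource type; an operator watching these objects
    # is what applies the DBA knowledge (when to fail over, scale down, etc.).
    apiVersion: apiextensions.k8s.io/v1beta1
    kind: CustomResourceDefinition
    metadata:
      name: pgclusters.example.com     # must be <plural>.<group>
    spec:
      group: example.com               # illustrative group, not KubeDB's
      version: v1alpha1
      scope: Namespaced
      names:
        kind: PgCluster
        plural: pgclusters
        singular: pgcluster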

~~~
joshberkus
I wouldn't say that you need _deep_ knowledge at this point. However, you do
need at least journeyman level knowledge. Lots of folks (including me, on
Patroni) are working to make more things automatic and lower the knowledge
barrier, but we're not there yet.

A big part of the obstacle is that preserving state in a distributed
environment is just _hard_ , no matter what your technology, and the failure
cases are generally catastrophic (lose all data, everywhere). This is true
both for the new distributed databases, and for the retrofits of the older
databases. So building DBMSes which can be flawlessly deployed by junior
admins on random Kubernetes clusters requires a lot of plumbing and hundreds
of test cases, which are hard to construct if you don't have a $big budget for
cloud time in order to test things like multi-zone netsplits and other Chaos
Monkey operations.

Making distributed databases simple and reliable is a lot like writing
software for airplanes, but clearly that's possible, it's just hard and will
take a while.

~~~
joshberkus
Also, the article does show us the kind of knowledge that admins will _always_
need to have, such as the tradeoffs between asynchronous and synchronous
replication.

------
UK-Al05
I don't get it, all the failure modes talked about are not Kubernetes
specific. They happen if you're running any HA database cluster?

If you have async replication with a large amount of lag, well, of course
you're gonna lose data if the master goes down and you're not careful,
regardless of whether you're using Kubernetes or not...

Can anyone explain why these failure modes are Kubernetes-specific? They just
sound like things you have to think about whenever you're running an HA
cluster...

~~~
_jal
They're not kubernetes-specific, but apparently there are people who think
deploying PG in it will magically Just Work(tm).

What I took away from TFA is two-fold: clustered-DB administration is hard to
automate (i.e., there's a reason DBAs exist), and a lot of the tooling DBAs
use now has to be rebuilt to function in a k8s environment, or replaced.

The first struck me as blindingly obvious; the second I find highly relevant,
personally.

~~~
zokier
> clustered-DB administration is hard to automate (e.g., there's a reason DBAs
> exist)

This is why I'm excited about CockroachDB; hopefully it will make operating a
DB cluster easier.

~~~
bonesss
Without pretending I've ever done it in production: I'd have to assume that
distributed databases will make the hop into a Kubernetes environment easier
than databases that started monolithic and added replication over time.

Most of the issues presented in the article relate to data loss during network
segregation and leader election. Those are important, but distributed systems
are generally a bit more explicit in their CAP compromises.

------
majewsky
Note that all those difficulties in the article only apply when you run a
high-availability setup. While HA is usually appropriate for services with a
defined service level, you can get away with single-replica DBs for a lot of
things.

At work, we run OpenStack on Kubernetes with Postgres for persistence, and
it's entirely okay if Postgres fails for, say, an hour, because we don't have
a defined SLA on the OpenStack API. The important thing is that the customer
payloads (VMs, SDN assets, LBs) keep working when OpenStack is down, which
they do.

~~~
zzzcpan
You will probably still encounter data loss.

There are two main things to understand. First is that your application is
running in a distributed environment and will encounter data loss if it is not
designed for it. Second, even if it is properly designed to run in a
distributed environment, it also has to be aware that it's running on
Kubernetes and be configured specifically for it.

~~~
majewsky
> First is that your application is running in a distributed environment and
> will encounter data loss if it is not designed for it.

The data itself is on self-replicating storage (and Postgres obtains the
proper exclusive locks when using it), so I don't care if it runs on
Kubernetes or a Raspberry Pi.

------
qaq
There is a Patroni Helm chart for K8s:
[https://github.com/kubernetes/charts/tree/master/incubator/p...](https://github.com/kubernetes/charts/tree/master/incubator/patroni)

~~~
aberoham
Josh Berkus gave a really good overview of Patroni at KubeCon Austin:
[https://youtu.be/Zn1vd7sQ_bc](https://youtu.be/Zn1vd7sQ_bc)

~~~
jcastro
Semi-related, Josh just gave some great advice during the last k8s office
hours regarding running postgres databases:
[https://youtu.be/Aj0yozuQ0ME?t=50m39s](https://youtu.be/Aj0yozuQ0ME?t=50m39s)

------
kureikain
I have run Kafka, Postgres, and Cassandra on K8S, but eventually I moved
Postgres and Cassandra off to normal servers.

To be fair, StatefulSets do help a bit. With a StatefulSet you get stable DNS
and hostnames like pod-0.service.namespace, etc. And you can change the
StatefulSet manifest without the pods updating themselves; pods only get
updated when we delete them (the OnDelete update policy) or via a rolling
update with a staged partition, which mimics real server behaviour where we
can pause/restart a process on a server.
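
For reference, the relevant StatefulSet knobs look roughly like this (a
minimal sketch, not a production manifest - it assumes a headless Service
named "pg" for the per-pod DNS and omits volumes entirely):

    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      name: pg
    spec:
      serviceName: pg            # stable per-pod DNS: pg-0.pg.<namespace>.svc
      replicas: 3
      selector:
        matchLabels:
          app: pg
      updateStrategy:
        type: RollingUpdate      # OnDelete = pods only change when deleted
        rollingUpdate:
          partition: 2           # only pods with ordinal >= 2 get the new spec
      template:
        metadata:
          labels:
            app: pg
        spec:
          containers:
          - name: postgres
            image: postgres:10   # illustrative image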

However, what I ran into were problems with resource scheduling and EBS volumes.

1\. Soon I realized that nodes running DB pods should only run DBs, so I made
a dedicated node pool for them (see the sketch below point 2).

When this occurs, it feels like eventually I'm just provisioning a server to
run this workload.

2\. An EBS volume cannot be mounted in another availability zone. So it's
really annoying when I kill a pod and it cannot start because its volume
cannot attach to a node in another zone.
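
The dedicated node pool in point 1 usually boils down to a taint on the DB
nodes plus a matching toleration and nodeSelector on the DB pods, something
like this (a sketch; the "dedicated=db" taint and "nodepool: db" label are
made up):

    # On the nodes:   kubectl taint nodes <db-node> dedicated=db:NoSchedule
    # In the DB pod spec (e.g. the StatefulSet's pod template):
    spec:
      nodeSelector:
        nodepool: db             # label applied to the dedicated nodes
      tolerations:
      - key: dedicated
        operator: Equal
        value: db
        effect: NoSchedule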

And when we need to upgrade Kubernetes itself in an immutable way, meaning
killing old nodes and bringing up new ones, it's a pain to control that
process carefully enough to avoid cascading node re-balancing/replication.

Moreover, the ability to easily go into a server and edit/tweak config is
lost with K8S. We have to use a ConfigMap, some init container trick, an
entrypoint script to generate a custom configuration file, etc.

An example is the broker id in the case of Kafka or the slave id in the case
of MySQL. In other words, I feel like running stateful services on K8S is no
longer a joy.
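
To make the ConfigMap/init container trick concrete: the usual pattern is to
derive the per-pod id from the StatefulSet ordinal in the hostname, roughly
like this (a sketch; the image, file path and shared "config" emptyDir volume
are all illustrative):

    # Inside the StatefulSet pod template:
    initContainers:
    - name: gen-config
      image: busybox
      command:
      - sh
      - -c
      - |
        ordinal="$(hostname)"    # e.g. kafka-2
        echo "broker.id=${ordinal##*-}" > /config/broker.properties
      volumeMounts:
      - name: config             # shared with the main container
        mountPath: /config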

Once I moved these stateful services out of K8S, suddenly everything was so
smooth. Running and upgrading K8S itself became a walk in the park.

Also, on AWS, when you have a large number of servers, AWS notifications about
node replacement/rebooting (old hardware, host migration) come very
frequently. Dealing with these when all of your nodes have stateful services
running on them is not easy.

------
solatic
So it seems like their TL;DR is: 1) Either go with a prebuilt solution like
Citus, or be prepared to build an external service locator that allows a human
admin to define the leader and manually trigger the failover; 2) Don't forget
that it's a DB and it needs fast storage; 3) Nothing is new under the sun and
you need to beware of leaky abstractions.

This doesn't seem to me like a reason not to run PG on K8s?

~~~
joshberkus
Yes, and (1) no longer applies because service locators which locate the
master are now easy to do.

------
brugidou
Clearly running stateful services is hard on something like Kubernetes without
delegating volume management to something like ceph.

Would something like Mesos' resource reservation mechanism with persistent
volumes do the job? When you run on-premises you usually want to recover from a
temporary failure or reboot and maybe run a special admin script if you feel
like the node is not going to come back anytime soon.

------
manigandham
At this point, my bet is on CockroachDB and other database systems built from
the ground up to be natively distributed. It will be far easier for them to
build functionality (especially the 80% that most people ever use) than it
will be to bolt on and coerce the same distributed behavior onto a single-node
RDBMS.

~~~
mikekchar
The thing about stuff that is hard is that it's usually hard no matter which
way you look at it. I don't really know anything about CockroachDB, but I use
Couch on a daily basis. I _like_ the way it works, but it's a massive foot-gun
if you don't realise that you have to do things completely differently. In
our shop we use a variety of DBs (Couch, Postgres, MySQL and even MSSQL). If
my only
concern was scaling, I would _not_ choose Couch. While it makes scaling and
replication easier, it does so by making a whole ton of things harder (and
basically unscalable, ironically). You still need to choose the tool most
appropriate for your problem. (In case you are wondering, Couch is awesome for
problems where you need versioned, immutable data -- for example financial
systems. But there are a lot of trade offs with respect to querying).

------
bechampion
I'm still not a big fan of running stateful services on a platform mostly
built for stateless ones. Also I'm not a huge psql user, but replication and
failover in psql have always been a half-done job; kube isn't gonna fix that.

------
jontro
Wouldn't you have to wait until all replicas are in sync before a commit can
be completed? Even though there might be low replica lag, I would be really
careful about depending on transactions in application code using this.

~~~
maga_2020
In the video referenced by @aberoham, at the end there is a question to the
presenter (Josh Berkus) about the 2 types of replication their deployment
tool supports.

This was at about the 33rd minute of the presentation.

He suggested that if you need synchronous replication, you should use
PostgreSQL 10 with quorum synchronous replication, so that it will not block
transactions if one of the replicas fails.

~~~
joshberkus
Yes, that's correct. If you are going to use sync rep because you can't lose
transactions, you really want to use the latest version of PostgreSQL, which
supports quorum sync (i.e. "one of these three replicas must ack"), even in
complex configurations ("one replica from each availability zone must ack").
Note, though, that the existing HA automation solutions (Patroni, Stolon)
don't currently have support for complex topologies, so you'd need to do
some hacking.

It _is_ a tradeoff though. With synch rep, you are _at a minimum_ adding the
cost of two network trips to the latency of each write (distributed consistent
databases like Cassandra pay this cost as well, which is why they tend to have
relatively high write latency). It turns out that a lot of users are willing
to lose a few transactions in a combination failure case instead of having
each write take three times as long.

Postgres also has some "in between" modes because write transactions can be
individually marked as synch or asynch, so less critical writes can be faster.
I believe that Cassandra has something similar.
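
Concretely, the knobs being discussed look like this (the standby names and
the table are illustrative):

    -- PostgreSQL 10 quorum sync: any 1 of the 3 named standbys must ack.
    ALTER SYSTEM SET synchronous_standby_names = 'ANY 1 (s1, s2, s3)';
    SELECT pg_reload_conf();

    -- Per-transaction opt-out: this write does not wait for a standby ack.
    BEGIN;
    SET LOCAL synchronous_commit = off;
    INSERT INTO page_views (url, seen_at) VALUES ('/pricing', now());
    COMMIT;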

------
vermaden
A whole team just to 'manage' ONE virtualization 'technology'?

No thanks.

One admin is more than enough to cover and master FreeBSD Jails and many other
technologies ... not just Kubernetes.

------
collyw
I don't know much about Kubernetes, but isn't the conventional wisdom that the
database should not be containerised? Has that changed?

~~~
parasubvert
No, but Kubernetes is at peak hype right now. It’s a floor wax and a dessert
topping.

