
Running Postgres in Kubernetes [pdf] - craigkerstiens
https://static.sched.com/hosted_files/ossna2020/fc/Running%20Postgres-as-a-Service%20in%20Kubernetes.pdf
======
sasavilic
Unless you have really good shared storage, I don't see any advantage to
running Postgres in Kubernetes. Everything is more complicated without any
real benefit. You can't scale it up, you can't move the pod. If PG fails to
start for some reason, good luck jumping into the container to inspect and
debug things. I am neither going to upgrade PG every 2 weeks, nor is it a
fresh new microservice that needs to be restarted when it crashes or scaled
up when I need more performance. And PG has its own high-availability
solutions, which are kind of orthogonal to what k8s offers.

One could argue that for the sake of consistency you could run PG in K8S, but
that is just a hammer-and-nail argument to me.

But if you have really good shared storage, then it is worth considering.
Still, I don't know if any network-attached storage can beat locally attached
RAID of solid-state disks in terms of performance and/or latency. And there
is (or was) the fsync bug, which is terrible in combination with somewhat
unreliable network storage.

For me, I see any database the same way I see etcd and the other components
of the k8s masters: they are the backbone. Inside the cluster I run my
apps/microservices. These apps are subject to frequent changes and upgrades
and thus profit most from automatic recovery, failover, (auto)scaling, etc.

~~~
qeternity
You don't run shared/network storage. You run PVCs on local storage and you
run an HA setup. You ship WAL files every 10s/30s/1m to object storage. You
test your backups regularly, which k8s is great for.

All of this means that I don't worry about most things you mention. PG
upgrade? Fail over and upgrade the pod. Upgrade fails? wal-g clones from
object storage and the node rejoins the cluster. Scaling up? Adjust the
resource claims. If resource claims necessitate node migration, see the
failover scenario. It's so resilient. And this is all with RAID 10 NVMe
direct-attached storage, just as fast as any other setup.

You mention etcd, but people don't run etcd the way you're describing
Postgres. You run a redundant cluster that can achieve quorum and tolerate
losses. If you follow that paradigm, you end up with Postgres on k8s.
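
Concretely, the "upgrade fails" path is just a re-clone. A minimal sketch,
assuming wal-g is already pointed at your bucket (the paths and the
restore_command are illustrative, not exact):

    rm -rf /var/lib/postgresql/data/*                    # discard the broken data dir
    wal-g backup-fetch /var/lib/postgresql/data LATEST   # pull the latest base backup
    touch /var/lib/postgresql/data/standby.signal        # PG 12+: start as a standby
    # on startup, restore_command (e.g. 'wal-g wal-fetch %f %p') replays archived
    # WAL until the node catches up and rejoins the cluster as a replica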

~~~
coding123
It sounds pretty simple and enticing. Any problems convincing the team of this
route?

~~~
sunshinekitty
Speaking from experience...

This is fine if you have a small single-node database that can tolerate some
downtime. Once you need replicas, a database that fits in RAM for
performance, or a hot standby for failover reliability, it becomes a lot more
complicated.

You should consider what (if anything) you miss out on by just running in a
single VM that you can scale out much more easily down the road, should you
need to. Alternatively, pay extra for a hosted solution that simplifies those
steps further.

~~~
qeternity
> single vm that you can scale out much easier

I’m not sure how experience could lead you to this conclusion. This wouldn’t
work for any of our production needs.

------
aprdm
That looks very interesting, and super complex.

I wonder how many companies really need this complexity. I bet 99.99% of
companies could get away with vertically scaling the writes and horizontally
scaling the read-only replicas, which would reduce the number of moving parts
a lot.

I have yet to play much with Kubernetes, but when I see those diagrams it
just baffles me how people are OK with running so much complexity in their
technical stack.

~~~
mamon
It actually is not that complex. I'm using the Crunchy Postgres Operator at
my current employer. You get an Ansible playbook that installs the Operator
inside Kubernetes, and after that you get a command-line administration tool
that lets you create a cluster with a simple

    pgo create cluster <cluster_name>

command.

Most administrative tasks like creating or restoring backups (which can be
automatically pushed to S3) are just one or two pgo commands.
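
For example (the cluster name is hypothetical, and flags vary by pgo version,
so treat this as a sketch):

    pgo backup mycluster                    # take an ad-hoc pgBackRest backup
    pgo restore mycluster --pitr-target="2020-06-28 11:00:00.000000+00"
                                            # point-in-time restore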

The linked pdf looks complex, because it:

a. compares 3 different operators

b. goes into implementation details that most users are shielded from.

And I'm actually not sure which of the three operators the author is
recommending :)

~~~
Townley
I agree that the k8s ecosystem isn't quite as complex as it seems at first,
but specifically running stateful apps does come pretty close to earning the
bad reputation.

(Disclaimer: I've tried and failed several times to get pgsql up and running
in k8s with and without operators, so that either makes me unqualified to
discuss this, or perfectly qualified to discuss this)

If the operator were simple enough to be installed/uninstalled via a helm
chart that Just Worked, I'd feel better about the complexity. But running a
complicated, non-deterministic Ansible playbook scares me. The other options
(installing a pgo installer, or installing an installer into your cluster)
are no better.

Also, configuring the operator is more complicated than it should be. Devs and
sysadmins alike are used to `brew install postgresql-server` or `apt install
postgresql-server` working just fine for 99% of use cases. I'll grant that
it's not apples-to-apples since HA pgsql has never been easy, but if the sales
pitch is that any superpower-less k8s admin can now run postgres, I think the
manual should be shorter.
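
For what it's worth, the UX I'd want looks roughly like this (Zalando's
operator does ship a chart along these lines; repo URL from memory, so verify
it):

    helm repo add postgres-operator-charts https://opensource.zalando.com/postgres-operator/charts/postgres-operator
    helm install postgres-operator postgres-operator-charts/postgres-operator
    helm uninstall postgres-operator    # and removal should be just as clean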

~~~
jjjensen90
I run a multi-terabyte, billions-of-rows HA Postgres in Kubernetes using a
helm chart and Patroni (baked into the chart), which uses the native k8s API
for automatic failover, plus pgBackRest for seamlessly provisioning new
replicas. It's a single helm chart, and it's by far the easiest DBA work I've
done in many years.
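
Roughly, the Patroni config that drives this looks like the following excerpt
(cluster and stanza names are hypothetical):

    kubernetes:
      use_endpoints: true        # store cluster state in k8s Endpoints objects
      labels: {application: patroni, cluster-name: mydb}
    postgresql:
      create_replica_methods:
        - pgbackrest             # provision new replicas from backups first
        - basebackup             # fall back to pg_basebackup off the primary
      pgbackrest:
        command: pgbackrest --stanza=mydb --delta restore
        keep_data: true
        no_params: true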

~~~
regularfry
I realise this is possibly asking you to give away secret sauce, but is this
written up anywhere? Having an example to point at to be able to say "Look,
this isn't scary, we can contemplate retiring that nasty lump of tin
underneath that Oracle instance after all" would be quite a valuable
contribution.

------
caiobegotti
Please don't. Just because it's possible doesn't mean it's a good idea. The
PDF itself clearly shows how quickly it can get complex. The great majority
of people won't ever be able to do this properly, securely, and with decent
reliability. Of course I may have to swallow my words in the future if a job
requires it, but unless you REALLY REALLY REALLY need PostgreSQL inside
Kubernetes, IMHO you should just stick with private RDS or Cloud SQL and
point your Kubernetes workloads to it inside your VPCs, all peered etc. Your
SRE mental health, your managers and your company's costs will thank you.

~~~
deathanatos
I've done MySQL RDS, and I've seen k8s database setups (but not with PG).

RDS is okay, but I would not dismiss the maintenance work required; RDS puts
you at the mercy of AWS when things go wrong. We had a fair bit of trouble
with failovers taking 10x+ longer than they should. We also set up
encryption, and _that_ was also a PITA: we'd consistently get nodes with
incorrect subjectAltNames. (Also, at the time, the certs were either for a
short key or signed by a short key, I forget which. It was not acceptable at
that time either; this was only 1-2 years ago, and I'm guessing it hasn't
been fixed.) Getting AWS to actually investigate, instead of asking "have you
tried upgrading" (and there was _always_ an upgrade, it felt like), was a
struggle. RDS MySQL's (or maybe Aurora's, I don't recall fully) first
implementation of spatial indexes was flat-out _broken_, and that was another
lengthy support ticket. The point is that bugs will happen, and cloud
platform support channels are _terrible_ at getting an engineer in contact
with an engineer who can actually do something about the problem.

~~~
redis_mlc
I'm a DBA, so I'll do a deep dive into your comment:

- RDS is awesome overall - there's no maintenance work required on your part.
If you think encryption is a problem, don't use it until later. Since RDS is
a managed service, I just tell compliance auditors, "It's a managed service."

- Aurora had issues (um, alpha quality) for the first year or so. So don't
use new databases in the first 5 years for production, as recommended.

~~~
deathanatos
> _there's no maintenance work required on your part._

The post of mine you are replying to outlines maintenance work we had to do on
an actual RDS instance. My point is that you shouldn't weigh managed solutions
as maintenance-free: they're not (and examples of why and how they are not are
in the first post). They _might_ win out, and they do have a place, but if
you're evaluating them as "hassle-free", you will be disappointed.

> _If you think encryption is a problem, don't use it until later._

We had compliance requirements that required encryption, so waiting until
later was not an option.

> _Since RDS is a managed service, I just tell compliance auditors, "It's a
> managed service."_

I'm not a big fan of giving compliance auditors half-truths that mislead them
into thinking we're doing something we're not.

> _So don't use new databases in the first 5 years for production, as
> recommended._

You mean we should run our own? (/s… slightly.) We were exploring Aurora
because the performance of normal RDS was not sufficient. Now, there was
_plenty_ we could have done better in other areas, particularly in the
database schema department, but Aurora was thought to be the most pragmatic
option.

~~~
redis_mlc
I bet your life is tough, being as dumb as a box of rocks.

The above answers are from somebody who doesn't know what they're talking
about, either about database administration or compliance.

~~~
dang
Ok, that's enough. Given that
https://news.ycombinator.com/item?id=23670678 was just a couple of days ago,
we've banned this account.

------
IanGabes
In my personal opinion, there are three database types.

'Small' databases are the first, and are easy to dump into Kubernetes. Any DB
with a total storage requirement of 100GB or less (if I lick my finger and
try to measure the wind) can easily be containerized and dumped into
Kubernetes, and you will be a happy camper because it makes prod/dev testing
easy. You don't really need to think too much here.

'Large' databases are too big to seriously put into a container. You will run
into storage and networking limits for cloud providers. Good luck
transferring all that data off bare metal! Your tables will more than likely
need to be sharded to even start thinking about gaining any benefit from
Kubernetes. From my own rubric: my team runs a "large" MySQL database, with
large sets of archived data, that uses more storage than managed cloud SQL
solutions can provide. It would take us months to re-design it to take
advantage of the MySQL clustering mechanisms, along with climbing the
learning curve that comes with them.

'Massive' databases need to be planned and designed from the ground up to
live in multiple regions and leverage the respective clustering technologies.
Your tables are sharded, replicated and backed up, and you are running in
different DCs attempting to serve edge traffic. Kubernetes wins here as well,
but, as the OP suggests, not without high effort. K8S gives you the scaling
and operational interface to manage hundreds of database nodes.

It seems weird to me that Vitess and the OP belabour their monitoring,
pooling, and backup story, when I think the #1 reason you reach for an
orchestrator in these problem spaces is scaling.

All that being said, my main point here is that orchestration technologies
are tools, and picking the right one is hard, but can be important :)
Databases can go into k8s! Make it easy on yourself and choose the right
databases to put there.

------
GordonS
So, a bit OT, but I'm looking for some advice on building a Postgres cluster,
and I'm pretty sure k8s is going to add a lot of complexity with no benefit.

I'm a Postgres fan, and use it a lot, but I've never actually used it in a
clustered setup.

What I'm looking at clustering for is not really scalability (we're still at
the stage where we can scale vertically), but high availability and backup -
if one node is down for an update, or crashes, the other node can take over -
and I'd also ideally like point-in-time restore.

There seems to be a plethora of OSS projects claiming to help with this, so
it looks like there isn't "one true way" - I'd love to hear how people are
actually setting up their Postgres clusters in practice.

~~~
penagwin
Compared to many databases, Postgres HA is a mess. It has built-in streaming
replication, but no failover of any kind; all of that has to be managed by
another application.

We've had the best luck with Patroni, but even then you'll find the
documentation confusing, have weird issues, etc. You'll also need to set up
etcd/Consul to use it. That's right: you need a second database cluster to
set up your database cluster... Great...

I have no clue how such a community-favorite database has no clear solution
for basic HA.
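
To make the dependency concrete: the first thing a minimal patroni.yml wants
is a DCS to talk to. A trimmed sketch (host names and addresses are
hypothetical):

    scope: mydb                          # cluster name
    etcd:
      host: etcd.internal:2379           # the "second database cluster" in question
    restapi:
      listen: 0.0.0.0:8008
      connect_address: 10.0.0.5:8008
    postgresql:
      data_dir: /var/lib/postgresql/data
      connect_address: 10.0.0.5:5432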

~~~
qeternity
Very true. My sentiments exactly, as Spilo/Patroni users. One benefit of k8s
is that you can use it as the DCS for Patroni.

------
peterwwillis
Google Cloud blog, gently dissuading you from running a traditional DB in
K8s:
https://cloud.google.com/blog/products/databases/to-run-or-not-to-run-a-database-on-kubernetes-what-to-consider

K8s docs explaining how to run MySQL:
https://kubernetes.io/docs/tasks/run-application/run-replicated-stateful-application/

You could also run it with Nomad, and skip a few layers of complexity:
https://learn.hashicorp.com/nomad/stateful-workloads/host-volumes and
https://mysqlrelease.com/2017/12/hashicorp-nomad-and-app-deployment-with-mysql/

One of the big problems of K8s is that it's a monolith. It's designed for a
very specific kind of org to run microservices. Anything else and you're
looking at an uphill battle to shim something into it.

You can also skip all the automatic scheduling fanciness and just build
system images with Packer and deploy them however you like. If you're on a
cloud provider, you can choose how many instances of which kind (manager,
read-replica) you deploy, using the storage of your choice, networking of
your choice, etc. Later you can add cluster scheduling and other features as
needed. This gradual approach to DevOps lets you get something up and running
using best practices, but without immediately incurring the significant
maintenance, integration, and performance/availability costs of a
full-fledged K8s.

~~~
lazyant
> One of the big problems of K8s is that it's a monolith

While I pretty much agree with everything else you mention, I think it's kind
of the opposite: since k8s is fundamentally an API, it's very modular and
extensible, and this is why it's being successful. (I agree it wants you to
do things its way, and things like databases need to be shimmed at the
moment, so the conclusion is similar... for now.)

~~~
peterwwillis
It _looks_ like microservices from the high level. But then you dig into each
component and realize that, other than using APIs, they still require all the
other components, use a shared storage layer, and sometimes use non-standard
protocols. The kube-controller-manager alone is literally a tiny monolith: a
single binary with 5 different controllers in it. K8s operates like a
monolith because you mostly can't just remove one layer and have it keep
running.

Compare that to HashiCorp's tools. You can run a K8s-like system composed
mostly of Hashi's tools, but you can also run each of those tools as a single
complete unit of just itself. Now, each of those tools is actually multiple
components in one, like a mini-monolith. But in operation they can work
together or as independent services. The practical result is truly
stand-alone yet reusable and interoperable components. That's the kind of
DevOps methodology I like: buy a wheelbarrow today and hitch it to an ATV
next week, rather than going out and buying a combine harvester before you
need it.

~~~
lazyant
Yes, these are very good points, all true, thanks.

I still think that while this monolith has its drawbacks, the fact that any
component can be substituted as long as it conforms to the official API is
really powerful. For example, k3s uses sqlite instead of etcd.

Having small components that do one thing well (the Unix philosophy) is
certainly one way to go (I still haven't found somebody who doesn't love
HashiCorp tooling, myself included), but the k8s idea of having one (big,
possibly bloated for many cases) "standard" way of doing things while being
customizable/extensible is really powerful. If Hashi came up with some
(extensible) glue/packaging tooling, then a lot of people doing/looking at
k8s right now would seriously look at them (myself included).

------
renewiltord
I much prefer just using RDS Aurora. Far fewer headaches. If I don't need low
latency I'd use RDS Aurora no matter which cloud I'm hosted on. Otherwise I'll
use hosted SQL.

The reason I mention this is that Kubernetes requires a lot of management to
run, so the best solution is to use GKE or something like it. And if you're
using managed k8s, there's little reason not to use managed SQL.

The advantages of k8s are not that valuable for a SQL server cluster. You
don't even really get colocation of data, because you're realistically going
to use a GCE Persistent Disk or EBS volume, and those are network-attached
anyway.

------
caniszczyk
For the MySQL folks, see Vitess as an example of how to run MySQL on
Kubernetes: https://vitess.io

~~~
arronax
There are also MySQL operators from Oracle, Presslabs, and Percona. Vitess is
much more than just MySQL in k8s, and not everyone will be able to switch to
it easily (if at all).

------
chvid
To all the commenters in this thread.

If Kubernetes cannot run a database, then what good is it? (And I suppose the
same issues pop up for things like a persistent queue or a full-text
indexer.)

The end goal of Kubernetes is to be able to create and recreate environments,
and to scale them up and down at will, all based on a declarative
configuration. But if you take databases out of it, then you are not really
achieving that goal and are just left with the flip side of Kubernetes: a
really complex setup and a piece of technology that is very hard to master.

~~~
kbumsik
> The end goal of Kubernetes is to be able to create and recreate
> environments, and to scale them up and down at will, all based on a
> declarative configuration.

PG already has its own clustering solutions for scaling up and down, which
are orthogonal to Kubernetes. So running PG in Kubernetes does not add
anything. Also, you are much more likely to mess things up when trying to mix
two orthogonal technologies.

And a DB is not meant to be created and recreated often, unless you want to
purge the data. So my take is this: Kubernetes is for managing and
configuring microservices, and DBs are not microservices.

------
how_gauche
Postgres + Stolon + k8s is the easiest time I've ever had bootstrapping a DB
for high availability. I'm not sure I'd use it for extremely high-throughput
apps, but for smallish datasets that NEED to be online, it was amazing. The
biggest reason it's amazing? The dev, staging, and prod environments look
exactly the same from a coder's perspective, and bringing a fresh one up is
always a single command, because that's just how you work in kube-land.
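
(To illustrate the "single command" point: with the since-archived
stable/stolon chart, bringing a cluster up was roughly a one-liner - release
name hypothetical:

    helm install my-db stable/stolon

with environments differing only in their chart values.)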

------
blyry
Ooh! I've been running the Zalando operator in production on Azure for about
a year now - nothing crazy, but a couple thousand QPS and a TB of data spread
across several clusters. It's been a little rough, since it was designed for
AWS, but pretty fun. At this point I'm 50/50: our team is small, and I'm not
sure that the extra complexity added by k8s solved any problems that Azure's
managed Postgres product doesn't also solve. We weren't sure we were going to
stay on Azure at the time we made the decision, though -- if I were running
in a hybrid cloud environment I would 100% choose Postgres on k8s.

The operator let us ramp up really quickly with Postgres as a POC and gave us
mature clustering and point-in-time restoration, and the value is 100% there
for dev/test/UAT instances. But depending on our team's growth, it might be
worth switching to managed Postgres for some subset of those clusters once
"Logical Decoding" goes GA on the Azure side. Their Hyperscale option looks
pretty fun as well; hopefully some day I'll have that much data to play with.

I can also say that the Zalando crew has been crazy responsive on their
GitHub; it's an extremely well managed open source project!
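
For reference, the manifest you feed the operator is pleasantly small; a
rough sketch (names and sizes hypothetical):

    apiVersion: "acid.zalan.do/v1"
    kind: postgresql
    metadata:
      name: acid-minimal-cluster
    spec:
      teamId: "acid"
      numberOfInstances: 2     # one primary + one replica, managed by Patroni/Spilo
      volume:
        size: 10Gi
      postgresql:
        version: "12"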

------
sspies
I have been running my own Postgres helm chart with read replication and
pgpool2 for three years and have never had major trouble. If you're
interested, check out https://github.com/sspies8684/helm-repo

------
jeremychone
Looks interesting but difficult to get the details from just the slides.

Also, not sure why Azure Arc still gets mentioned. I would have expected
something more cloud-independent.

Our approach, for now, is to use Kubernetes Postgres for dev, test, and even
stage, but cloud Postgres for prod. We have one db.yaml that in production
just becomes an endpoint, so that none of the services even have to know
whether it is an internal or external Postgres.
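
One way to do that swap (a sketch; the host name is hypothetical) is an
ExternalName service, so in-cluster DNS stays identical while prod resolves
to the cloud database:

    # db.yaml, prod variant
    apiVersion: v1
    kind: Service
    metadata:
      name: db
    spec:
      type: ExternalName
      externalName: mydb.postgres.database.azure.com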

Another interesting use of Kubernetes Postgres would be for some transient
but bigger-than-memory store that needs to be queryable for a certain amount
of time. It's probably a very niche use case, but the deployment could be
dramatically more straightforward, since the HA is not performance-bound.

------
zelly
Why? So you pay more money to AWS? Deploying databases is a solved problem.
What's the point of the overhead?

------
kgraves
What's the use case for running databases in k8s? Is this a widely accepted
best practice?

~~~
ghshephard
I guess I look at it the opposite way, which is: why _wouldn't_ you run
everything in k8s once you have made the basic investment in it? It lets you
spin up new environments, vertical scaling becomes trivial, and disaster
recovery/business continuity is automatic, along with everything else in your
k8s environment.

------
m3kw9
I remember running mongodb one socket had so many gotchas and stuff that it
wasn’t worth it.

------
nightowl_games
I think CockroachDB is designed for this.

~~~
tyingq
They've thought about the use case. But it still ends up being a cluster
inside a cluster, which sounds potentially pretty bad to me. Clusters of
different types, mostly unaware of each other. Schema changes and database
version upgrades would be complicated.

~~~
rafiss
There certainly are pain points. I don't work on this myself, but one of our
other engineers wrote a blog post [0] that discusses the experience of
running CockroachDB in k8s and why we chose to use it for our hosted cloud
product. Another complication mentioned in there is how to deal with the
multi-region case.

[0] https://www.cockroachlabs.com/blog/managed-cockroachdb-on-kubernetes/

------
pjmlp
Instead of messing around with Kubernetes, I would rather advocate for
something like Amazon RDS.

------
cmcc123
I would agree that the complexity is compounded, having gone through the work
to automate various operators in Kubernetes and the requisite deploy projects
for the actual app/service (database) clusters, etc.

The problem is often that the actual costs of maintaining solutions like this
aren't always clear and easy to budget for or, perhaps more importantly, to
explain to management: this includes the continued cost of engineering time
to architect H/A solutions, maintain them, research fixes, etc. Add to this
the abstraction and compounding of complexity, and the plethora of
hand-waving blogs, etc.

IMHO, the real problems arise when you deploy PostgreSQL (via a Kubernetes
operator) into an existing multi-AZ, cloud-based Kubernetes cluster without
knowing and understanding all of the requisite requirements and restrictions.
At the time I was working on deploying Postgres clusters with the operator
(mid-2019, as I recall), there was not a lot (much at all) in the Strimzi
Kafka operator docs about handling multiple AZs in kube with the Kubernetes
autoscaler and cluster ASGs (autoscaling groups), etc.

Note that PersistentVolumes and PersistentVolumeClaims in the cloud cannot
span multiple AZs - a critical concern, especially when you throw in
Kubernetes and an ASG. What this means is that if you have some app/service
running in a specific AZ that has PersistentVolumes and claims in that AZ,
you must ensure that the app/service stays in that AZ, and that all of its
requisite storage resources remain available in that AZ. The complexity
required to manage this is not trivial for most teams. E.g., some helm charts
that I installed (after `helm template`-ing them into our IaC code)
configured nodeLabels on the existing kube cluster's worker nodes - which was
not documented in the helm chart, BTW. So when we later did a routine upgrade
of the Kubernetes version and the ASGs spawned new worker nodes, that left
those apps/services effectively hard-coded to nodes that the ASG had
terminated (they were older versions replaced during the upgrade), while
their PVs sat in a specific AZ, as noted above.

To do it right, I think you'd need to define AZ-specific storage classes and
then ensure that, when you deploy apps/services into Kubernetes, you manage
those assignments. Again, from my past experience: when you have Kubernetes
in the cloud, with the Kubernetes autoscaler and cloud ASGs, running H/A
(i.e. multi-AZ), and you then add stateful requirements using PVs, and then
very resource-intensive apps and services, it starts to get a bit tricky to
maintain - again, despite what the "experts" might be blogging about. Keep in
mind that the companies sponsoring the experts might have teams of 10-15
DevOps/Kubernetes engineers managing a cluster. That is something we
definitely don't have.
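
For illustration, pinning storage to an AZ looks something like this (the
provisioner and zone are AWS examples; the class name is hypothetical):

    apiVersion: storage.k8s.io/v1
    kind: StorageClass
    metadata:
      name: gp2-us-east-1a                    # one class per AZ
    provisioner: kubernetes.io/aws-ebs
    volumeBindingMode: WaitForFirstConsumer   # bind only once the pod is scheduled
    allowedTopologies:
      - matchLabelExpressions:
          - key: topology.kubernetes.io/zone
            values:
              - us-east-1a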

I'm sure it will get better with time, but for now we are doing all we can to
keep stateful apps/services external - i.e., per your initial post,
PostgreSQL on RDS. IMHO, RDS does a fantastic job and lets us abstract all of
this away; we simply deploy our clusters with IaC and forget about them to
some degree. On cost, and specifically regarding resource contention, I think
it's an ideal ROI to have the cloud provider worry about failover, H/A
database internals, scaling with multi-AZ storage, etc.

