
Stateful Apps on Kubernetes: A quick primer - loiselleatwork
https://www.cockroachlabs.com/blog/kubernetes-state-of-stateful-apps/
======
bajsejohannes
I would recommend against running stateful apps in kubernetes. It's not really
ready for it. Big problems include routing (it works fine for http requests,
but not for DBs, message brokers, etc) and just the pain of setting up
stateful sets.

If you don't believe me, take it from someone who should know what they're
talking about:
[https://twitter.com/kelseyhightower/status/96341350830081229...](https://twitter.com/kelseyhightower/status/963413508300812295)

~~~
lobster_johnson
We run stateful apps on Kubernetes. There are obvious rough areas (the lack of
persistent volume resizing, for example, which is scheduled for 1.11), but
overall, it's great.

What a lot of naysayers leave out, or choose to ignore, is that the challenges
running stateful apps on Kubernetes mirror those of running stateful apps
anywhere. If you run Postgres on a VM, for example, you're completely reliant
on that VM staying up -- this is no different from Kubernetes. Some will also
point out the dangers of co-locating lots of software (such as Postgres) on
the same machine as many other containers, as they will compete for CPU and
I/O; but this is also no different than on Kubernetes, which provides plenty
of tools (affinities/anti-affinities, node selectors) to isolate containers to
machines. And so on. Containers bring some new challenges, but Kubernetes
meets them quite well.
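The isolation tools mentioned (anti-affinity, node selectors) are just fields on the pod spec. A minimal sketch, expressed as a Python dict mirroring the Kubernetes YAML; the labels and selector values here are hypothetical:

```python
# Pod-spec fragment (as a Python dict mirroring Kubernetes YAML) that keeps a
# "postgres" pod off nodes already running one, and pins it to database nodes.
# The "app" and "workload" label values are made up for illustration.
pod_spec = {
    "affinity": {
        "podAntiAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": [
                {
                    # don't schedule next to another pod labeled app=postgres
                    "labelSelector": {"matchLabels": {"app": "postgres"}},
                    # "one per node": spread across distinct hostnames
                    "topologyKey": "kubernetes.io/hostname",
                }
            ]
        }
    },
    # only consider nodes labeled for database workloads
    "nodeSelector": {"workload": "database"},
}
```

With `requiredDuringScheduling...` the rule is hard; swapping in `preferredDuringSchedulingIgnoredDuringExecution` makes it a soft preference instead.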

What specific issues do you have? I'm not sure I understand the point about
routing. I also don't understand what the "pain" of stateful sets refers to.

~~~
FBISurveillance
Not the original commenter, but I'll jump in:

1. While "we already rely on the VM staying up", with k8s we rely on both the
VM staying up and the Kubernetes infra on top of that VM staying up.

2. Maintaining a complex stateful system on k8s _requires_ having and
maintaining an operator for that system.

3. You reduce your options when it comes to tweaking systems, e.g. local SSDs
on GCP are available in SCSI and NVMe flavors, while GKE supports only SCSI;
it's harder to perform fine-tuning and other tasks on the underlying VMs that
would have been trivial with Chef or similar.

4. Enterprise systems like Splunk explicitly state that their support does not
cover Splunk clusters running on Kubernetes.

5. As mentioned, you can't even resize a disk without going through a dance of
operations that would take days or weeks when you're working with something
like Kafka at scale.

6. Some stateful services, like ZooKeeper, require stable identities, and this
is far from perfect on Kubernetes.

7. More complex traffic routing that involves additional fees, because to
achieve (6) you sometimes need to expose things publicly.

That's just off the top of my head.

Disclaimer: We run 10+ stateful services on Kubernetes.
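On the stable-identities point: the mechanism Kubernetes offers is a StatefulSet paired with a headless service, which gives each pod a sticky ordinal name and DNS entry. A rough sketch as a Python dict mirroring the YAML; all names (`zk`, `zk-headless`, the image tag) are hypothetical:

```python
# Rough sketch of a StatefulSet (as a Python dict mirroring Kubernetes YAML)
# giving ZooKeeper pods stable identities zk-0, zk-1, zk-2.
# The names and image here are assumptions for illustration.
statefulset = {
    "apiVersion": "apps/v1",
    "kind": "StatefulSet",
    "metadata": {"name": "zk"},
    "spec": {
        # headless service that provides per-pod DNS records
        "serviceName": "zk-headless",
        "replicas": 3,
        "selector": {"matchLabels": {"app": "zk"}},
        "template": {
            "metadata": {"labels": {"app": "zk"}},
            "spec": {"containers": [{"name": "zk", "image": "zookeeper:3.8"}]},
        },
    },
}

# Each pod is addressable as <statefulset-name>-<ordinal>.<serviceName>:
dns_names = [
    f"zk-{i}.zk-headless" for i in range(statefulset["spec"]["replicas"])
]
print(dns_names)  # ['zk-0.zk-headless', 'zk-1.zk-headless', 'zk-2.zk-headless']
```

Those names are stable across rescheduling inside the cluster, which is exactly why exposing the same stable identities *outside* the cluster (point 7) takes extra routing work.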

------
rrdharan
> Because Kubernetes itself runs on the machines that are running your
> databases, it will consume some resources and will slightly impact
> performance. In our testing, we found an approximately 5% dip in throughput
> on a simple key-value workload.

5% seems like a surprisingly large overhead. What is k8s doing in this
situation that would have that kind of impact?

~~~
smarterclayton
CPU cache contention, network overhead introduced by the Kubernetes service
proxy model, even the liveness checks.

We haven’t yet evolved Kubernetes services to prefer specific cores and avoid
app workloads (although CPU management is getting closer).

Docker is also somewhat hefty memory wise and you may contend on disk if not
careful.

5% seems pretty reasonable to me in general, just as a consequence of having
something heavier weight on the same node managing workloads.

~~~
a-robinson
Yeah, it appeared to just be general resource contention from having to share
the machine -- CPU interrupts, less memory available, etc.

I'll note, though, that the 5% number is when using host networking for both
Cockroach and the client load generator. Using GKE's default cluster
networking through the Docker bridge is closer to 15% worse than running
directly on equivalent non-Kubernetes VMs.
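The host-networking setup being compared against the bridge default comes down to a couple of pod-spec fields. A minimal sketch (fields from the core Pod API, shown as a Python dict mirroring the YAML):

```python
# Pod-spec fragment switching a pod from the default bridge/overlay network
# to the node's own network namespace, as used in the 5% benchmark setup.
host_net_spec = {
    # share the node's network namespace (no bridge/NAT hop)
    "hostNetwork": True,
    # keep cluster DNS working for pods on the host network
    "dnsPolicy": "ClusterFirstWithHostNet",
}
```

The trade-off: ports now bind directly on the node, so two pods on the same machine can't use the same port.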

------
stefanatfrg
I'd like to know how to solve the storage dilution problem with stateful apps
in k8s where you have to buy 3-18x more raw capacity than desired to meet
availability & durability guarantees.

For example if you ran CDB on a baremetal cluster of 3 nodes with 30TB of raw
capacity, 15TB is lost to RAID10, 10TB is lost to running a replicated
database such as cockroach DB, leaving you with 5TB effective capacity which
is a 1/6 dilution of your initial capacity.

If you ran cockroach DB on a replicated network volume, with a replication
factor of three, it gets worse. If you bought 30 TB of disks, you'd lose 20 TB
to volume replication, ~6.67TB to CDB replication leaving you with 3.3TB of
effective capacity or a 1/9 dilution. If those disks were configured with RAID
your effective capacity would drop to a 1/18 dilution.

You could achieve a 1/3 dilution which is the effective minimum for a
replicated database if you didn't configure RAID, but you increase the impact
of disk failure, in that it would take much much longer to recover a cluster.
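The dilution arithmetic above can be checked with a small calculator; the function name and defaults are made up, but the math follows the 30TB examples:

```python
# Hypothetical capacity-dilution calculator for the scenarios above:
# RAID10 mirroring halves raw capacity, then each replication layer
# (network volume, database) divides what's left by its factor.
def effective_capacity(raw_tb, raid_mirror=False, volume_replicas=1,
                       db_replicas=3):
    cap = raw_tb
    if raid_mirror:
        cap /= 2           # RAID10: half the disks are mirrors
    cap /= volume_replicas # replicated network volume (e.g. factor 3)
    cap /= db_replicas     # database replication (e.g. CockroachDB RF=3)
    return cap

print(effective_capacity(30, raid_mirror=True))                    # 5.0  (1/6)
print(effective_capacity(30, volume_replicas=3))                   # ~3.33 (1/9)
print(effective_capacity(30, raid_mirror=True, volume_replicas=3)) # ~1.67 (1/18)
```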

------
lowbloodsugar
>Given its pedigree of literally working at Google-scale

I understood that a team at google developed k8s but google doesn't actually
run it for their "google-scale" workloads. Am I misinformed?

~~~
bajsejohannes
You are correct.

> [kubernetes is] a simplified clone of Google’s internal borg system

[https://medium.com/@steve.yegge/honestly-i-cant-stand-k8s-48...](https://medium.com/@steve.yegge/honestly-i-cant-stand-k8s-48c9a600e405)

~~~
outside1234
That said, Google, to my understanding, does run a completely containerized
infrastructure internally, including databases and other stateful things, so
it is not wildly off to suggest running a database on Kubernetes.

~~~
puzzle
Before it ran on F1/Spanner, Adwords ran on sharded MySQL on Borg. Not for its
entire early life, but for quite a few years. Later in life, Checkout and
maybe Wallet ran on MySQL on Borg, too. So did YouTube, which used Vitess, a
sharding layer (now ported to Kubernetes and open sourced).

Bigtable, Spanner and even Colossus/D run in containers on Borg.

~~~
tayo42
How is Colossus running in a container on Borg? Isn't that the network file
system that makes running databases in containers possible in the first place?

~~~
puzzle
There's a bootstrapping issue, of course, but that's how it works.

And it's even crazier than what you're picturing. What if I told you that
Bigtable runs on top of Colossus, but Colossus itself stores metadata about
files, including Bigtable's files, in... Bigtable? It's really turtles all the
way down, the last of which is, luckily, Chubby.

Some of the details here: [http://www.pdsw.org/pdsw-discs17/slides/PDSW-DISCS-Google-Ke...](http://www.pdsw.org/pdsw-discs17/slides/PDSW-DISCS-Google-Keynote.pdf)

------
daxfohl
Has anyone looked at Service Fabric (Microsoft tech) for things like this?
That has offered stateful services for years now. I'm pretty sure it runs on
Linux, and I've seen that it's Docker compatible. I know it's kinda in the
same space as K8s but I don't really know the details. Would SF be able to do
something like this in a similar (or better?) way?

~~~
zapita
It's complicated, because the definition of Service Fabric seems to be in
flux.

The "original" Service Fabric is a high-level framework which requires
invasive source code changes (you can't just drop an existing app on top of
it), but gives you lots of benefits (scale, reliability etc) if you make the
effort.

Recently container-based platforms - Docker, Kubernetes, etc - have come along
with a different tradeoff: better compatibility with existing applications in
exchange for less magical benefits. That approach is getting much more
traction, and I think internally at Microsoft there is some infighting between
the "Service Fabric camp" and the "Containers camp". One consequence of the
infighting is that Service Fabric is extending its scope to include features
like "container support". It's not clear to what extent that is done in
collaboration with the "container people", or as a way to bypass them. I think
they are still trying to decide whether to embrace Kubernetes, or replicate
the functionality in-house. My prediction is that the container-based approach
will win, but it will take time for the politics to fully play out. In the
meantime things will continue to be confusing.

Bottom line: when evaluating Service Fabric, watch out for confusing and
inconsistent use of the brand. It's a common pattern with large vendors - for
example IBM with "Bluemix", SAP with "Hana", etc.

~~~
daxfohl
Okay that's about what it looked like to me too. There's only so many magic
words you can throw at a tech and expect it to work together happily. Looking
into it, the stateful service side of SF doesn't seem particularly compatible
with the container side of it. A stateful service is a stateful SF service,
and a container service is its own thing. _Maybe_ there's a way to plug them
together but unfortunately I didn't see it.

------
tapirl
Are there any cloud providers offering remote disks without replication? It
seems there's real demand for this when deploying databases that maintain
replication themselves.

~~~
jen20
That's effectively what EBS is, no?

~~~
tapirl
I have the impression that EBS does replication automatically. Is that wrong?

