
Why Is Storage on Kubernetes So Hard? - kiyanwang
https://softwareengineeringdaily.com/2019/01/11/why-is-storage-on-kubernetes-is-so-hard/
======
jrockway
I don't really think this is a Kubernetes-specific problem. If you have a
million machines, want your database to run on one of them that is selected by
some upstream orchestrator, and want the physical SSDs holding that data to
be in the same machine, you're going to have to do some work. But at the same
time, you have to realize that you are doing this to get that tiny last bit of
performance (most likely that 99.9%-ile latency) and that that last 0.1% is
always the most expensive. Sometimes you need it, and I get that, but it's not
a problem that everyone has.

Most applications are not so IOPS-limited that they depend on the difference
between PCIe request latency and going out over the network to get data
that's actually stored on a nearby rack. And in that case what Kubernetes
offers (with the help of cloud providers) is fine. You make a storage class.
You make a persistent volume claim. Your pod mounts that. Not all that hard.
If the performance isn't good enough, though, then you have to build something
yourself.
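
For reference, that whole flow is just a few manifests. A rough sketch,
assuming GCE persistent disks (the provisioner and parameters are
cloud-specific, and names like "fast-ssd" and "my-data" are made up):

    apiVersion: storage.k8s.io/v1
    kind: StorageClass
    metadata:
      name: fast-ssd
    provisioner: kubernetes.io/gce-pd   # swap for your cloud's provisioner
    parameters:
      type: pd-ssd
    ---
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: my-data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: fast-ssd
      resources:
        requests:
          storage: 100Gi
    ---
    apiVersion: v1
    kind: Pod
    metadata:
      name: my-db
    spec:
      containers:
      - name: db
        image: postgres:11
        volumeMounts:
        - name: data
          mountPath: /var/lib/postgresql/data
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: my-data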

I am used to a completely different model that we had at Google. You could not
get durable storage in your job allocation. Everything went through some other
controller that did not give you a block device or even POSIX semantics, and
you designed your app around that. If you needed more IOPS you talked to more
backends and duplicated more data.

Meanwhile in the public cloud world, you get to have a physical block device
with an ext4 filesystem that can magically appear in any of your 5
availability zones, provisioned on the type of disk you specify with a
guaranteed number of IOPS. It's honestly pretty good for 90% or even 99% of
the things people are using disks for. (In my last production environment I
actually ran stuff like InfluxDB against EFS, the fully-managed POSIX
filesystem that Amazon provides. It did fine.)

~~~
SomeHacker44
For databases, local storage beats SAN storage massively. I can get better
performance and IOPS from an Intel NUC with a decent PCIe SSD than I can get
from an AWS RDS instance that costs as much per month as the NUC did to buy.
If I optimize a proper rack-mount server for database work I can get an insane
amount of storage and performance compared to even a few months of RDS. Even
a pair of 40 Gbps SAN links would not compare.

~~~
YZF
I'm not sure this is necessarily true. 2 x 40Gbps is bandwidth, which is
typically not the limiting factor; if it were, you could go with
higher-bandwidth links like FC. RDS is a database service, not a SAN. Look at
SAN boxes from storage companies: you can get things like 60 SSDs in a single
box. To match the IOPS of that you'd need a lot of servers with local storage.
I think in general disaggregating compute and storage is the optimal approach;
whether some particular solution is better or more price-effective is a
different question. Having all your drives in one box means they're easier to
share, easier to service, and easier to replace the servers as well ... lots
of wins there.

~~~
latch
If you're doing it locally, you'd ideally use RAID with a battery-backed write
cache, and your fsyncs essentially end up going to memory.

~~~
zbentley
Until a battery relearn takes your app down in the middle of the night. I
don't miss those.

------
hardwaresofton
Kubernetes has a few good storage solutions that are built on well-known
technology...

[https://rook.io](https://rook.io) (built on ceph)

[https://www.openebs.io](https://www.openebs.io) (built on Jiva/cStor)

The _distributed storage_ problem is difficult on its own; this isn't a
Kubernetes-specific issue. IMO Kubernetes is improving on the state of the art
by dealing with both the distributed storage problem _and_ dynamic
provisioning, which is why the situation can seem somewhat chaotic.

What isn't shown here is how effortless it feels when you _do_ have a solution
like rook or openEBS in place -- we've never seen ergonomics like this before
for deploying applications. Also, the CSI (Container Storage Interface) that
they're building and refining as they go will be extremely valuable to the
community going forward.

BTW, if you want to do things in a static provisioning sense, support for
local volumes (and hostPaths) has been around for a very long time -- just
use those and handle your storage how you would have handled it before
Kubernetes existed.
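
To illustrate, a minimal statically provisioned local volume might look
something like this (the hostname, device path, and sizes are placeholders,
and you're expected to have formatted and mounted the disk yourself):

    apiVersion: storage.k8s.io/v1
    kind: StorageClass
    metadata:
      name: local-storage
    provisioner: kubernetes.io/no-provisioner
    volumeBindingMode: WaitForFirstConsumer
    ---
    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: node1-ssd
    spec:
      capacity:
        storage: 200Gi
      accessModes: ["ReadWriteOnce"]
      persistentVolumeReclaimPolicy: Retain
      storageClassName: local-storage
      local:
        path: /mnt/disks/ssd0            # pre-formatted and mounted by you
      nodeAffinity:
        required:
          nodeSelectorTerms:
          - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values: ["node-1"]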

Shameless plug: I've also written about this on a fair number of occasions:

[https://vadosware.io/post/kicking-the-tires-on-openebs-for-c...](https://vadosware.io/post/kicking-the-tires-on-openebs-for-cluster-storage/)

[https://vadosware.io/post/disassembling-raid-on-hetzner-with...](https://vadosware.io/post/disassembling-raid-on-hetzner-without-rescue-mode/)

I've gone from a hostPath -> Rook (Ceph) -> hostPath (undoing RAID had issues)
-> OpenEBS, and now I have easy-to-spin-up dynamic storage & resiliency on my
tiny Kubernetes cluster.

That all of this is achievable by someone who is not a sysadmin by trade
should not be understated. The bar is being lowered -- I learned enough about
Ceph to be dangerous, enough about OpenEBS to be dangerous, and got resiliency
and ease of use/integration from Kubernetes.

~~~
justinclift
OpenEBS looked interesting, but it turns out they're spammers. :(

Starred the GitHub repo... and a few minutes later received email spam from
them to my personal email (it's in my GitHub profile). :(

Completely lost interest in their project at that point.

Probably best to skip it, as rewarding spammers doesn't lead to good things.
:(

~~~
hardwaresofton
I had no idea they did that, just made an issue about it[0].

Also, I wouldn't be so quick to write them off -- their solution is based on
Container Attached Storage (CAS), and is the only relatively mature solution
I've seen so far (I haven't seen any others that do CAS) that sort of takes
the Ceph model and turns it inside out -- pods talk to volumes over iSCSI via
"controller" pods, and writes are replicated amongst these controller pods
(controller pods have anti-affinity to ensure they end up on separate
machines).
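
For a sense of the anti-affinity part, here's the generic Kubernetes mechanism
involved -- not OpenEBS's actual manifests, and the "storage-replica" label
and image are made up:

    apiVersion: v1
    kind: Pod
    metadata:
      name: storage-replica-0
      labels:
        app: storage-replica
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: storage-replica
            topologyKey: kubernetes.io/hostname  # never co-schedule two replicas on one node
      containers:
      - name: replica
        image: example/storage-replica:latest    # placeholder image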

I have yet to do any performance testing of Ceph vs OpenEBS, but I can tell
you OpenEBS was easier to wrap my head around (though of course Ceph is a
pretty robust system), and way easier to debug/trace through.

[0]:
[https://github.com/openebs/openebs/issues/2345](https://github.com/openebs/openebs/issues/2345)

~~~
justinclift
Thanks for doing that. I have a strong personal aversion to spammers, to the
point where their product doesn't matter; I'll just use/help a competitor
instead.

Reaching out to the occasional person who stars a project, if there's some
strong overlap, then maybe sure. An automatic spam approach though... that's
not on. That makes starring projects a "dangerous" thing for end users, as
they'd have to be open to emails from every one. :/

And yeah, it did look useful up until this point. Let's see what their
response to the GH issue is like. :)

~~~
umamukkara
Thank you for the feedback. Feedback like this will help us understand the
best practices in community building. We have disabled the email trigger on
starring. Thanks again for this open feedback.

~~~
justinclift
No worries Uma. The project looks interesting, and it's even written in Go, so
I'll probably take it for a spin in the next few weeks. :)

------
notacoward
Here's the root of the problem.

> Static provisioning also goes against the mindset of Kubernetes

Then the mindset of Kubernetes is wrong. Or at least incomplete. Persistent
data is an essential part of computing. In some ways it's the most important
part. You could swap out every compute element in your system and be running
exactly the way you were very quickly. Now try it with your storage elements.
Oops, screwed. The data is the identity of your system.

The problem is that storage is not trivially relocatable like compute is, and
yet every single orchestration system I've seen seems to assume otherwise. The
people who write them develop models to represent the easy case, then come
back to the harder one as an afterthought. A car that's really a boat with
wheels bolted on isn't going to have great handling, but that's pretty much
where we are with storage in Kubernetes.

~~~
nemothekid
AFAICT, the problem is that a separate, generic, distributed storage layer
that is also performant is something very few (like Google) have; the rest of
us have tried to hack things together with StatefulSets, GlusterFS, NFS,
Persistent Volumes, etc.

The Borg/Omega model kind of assumes you have a Google-like storage tier
interconnected with 10GbE links that is automatically replicated and
accessible from potentially anywhere. Once you have that, then storage on k8s
is "easy".

~~~
candiodari
Unless I totally misread the "colossus" papers, Google does not in fact have
an "accessible from anywhere" storage layer. GFS/Colossus is a per-cluster
filesystem capable of read and append only. It is very much NOT generic and
only supports custom applications.

Presumably that means that on every Google machine there is a directory that
is shared for the entire cluster BUT:

1) It is per-cluster, not global (given Google's cluster sizes one imagines
that's not that much of a limitation, but then again, it seems like that is a
clear limit)

2) you can only create or append to files (or delete them I guess) (and this
is therefore non-posix, and does not support things like mysql or postgres)

3) this is very different from what GlusterFS, NFS, persistent volumes, etc
provide. Therefore disks on google cloud are presumably very much not just
files on this GFS/colossus thing.

4) it was single-master at least until 2004. Maybe until 2010. Apparently that
can work.

[https://cloud.google.com/files/storage_architecture_and_chal...](https://cloud.google.com/files/storage_architecture_and_challenges.pdf)

[https://static.googleusercontent.com/media/research.google.c...](https://static.googleusercontent.com/media/research.google.com/en/us/archive/gfs-
sosp2003.pdf)

~~~
philsnow
> GFS/Colossus is a per-cluster filesystem capable of read and append only. It
> is very much NOT generic and only supports custom applications.

For the most part, Google only has custom applications. Just about everything
is written in-house, and takes advantage of things like being able to open a
file on local disk just as easily as opening one in Colossus or their
equivalent of Zookeeper.

------
huffmsa

      Why is it so hard to hammer
      a nail with a screwdriver?
    

Because stateful storage isn't the problem the system was developed to solve.
The author is conflating what K8s is (a stateless container orchestrator) with
what he wants it to be (a full service devops guy).

Now, if you simply must have stateful storage, I've had a pretty good time
with PVCs pointing to plain-Jane NFS volumes, combined with node tainting and
local replication of high-need data on the pods as needed.
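
Something along these lines, as a sketch (the server address, export path, and
sizes are placeholders):

    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: shared-nfs
    spec:
      capacity:
        storage: 500Gi
      accessModes: ["ReadWriteMany"]
      nfs:
        server: 10.0.0.5             # your NFS server
        path: /exports/app-data
    ---
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: shared-nfs-claim
    spec:
      accessModes: ["ReadWriteMany"]
      storageClassName: ""           # bind to the pre-created PV, no dynamic provisioning
      resources:
        requests:
          storage: 500Gi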

If your pods are scaling up and down or dying so quickly that this seems
untenable, you have other fish to fry.

------
halbritt
I think the premise of this article is misleading. It's true that creating a
storage subsystem in Kubernetes is exceedingly difficult; however, using
Kubernetes for stateful applications isn't any more difficult than using
instances with persistent disks in any cloud provider.

For applications deployed in cloud infrastructure, most folks are using
persistent disk simply for the ease of management. Some folks end up going to
local disk for scale and performance reasons, at which point they end up
having exactly the same problem one would have trying to do the same with
Kubernetes.

------
ToFab123
Microsoft had the same problems when launching their microservice platform
(Service Fabric). They realized most of their problems went away if they gave
their stateless services the notion of state, and so they did. Most MS
services, including Azure itself, now run on something similar to K8s but
stateful.

Service Fabric powers many Microsoft services today, including Azure SQL
Database, Azure Cosmos DB, Cortana, Microsoft Power BI, Microsoft Intune,
Azure Event Hubs, Azure IoT Hub, Dynamics 365, Skype for Business, and many
core Azure services.

[https://blogs.msdn.microsoft.com/azuredev/2018/08/15/service...](https://blogs.msdn.microsoft.com/azuredev/2018/08/15/service-fabric-and-kubernetes-comparison-part-1-distributed-systems-architecture/)

------
kthejoker2
NetApp has a Kubernetes-as-a-Service offering that includes their own
persistent volumes. As a consultant I had it foisted on me for a project, but
it worked pretty well - I never had to worry about storage - it just worked -
so I guess I never realized it was "hard"? Definitely something I'm glad I let
someone else manage.

[https://cloud.netapp.com/kubernetes-service](https://cloud.netapp.com/kubernetes-service)

------
ohiovr
I could not figure out cluster storage, with Docker Swarm at least. With
containers springing up and going away on multiple computers, I could not
figure out where the volumes were supposed to “live”.

The only thing that seemed possible without additional software was NFS
mounts, but it does not help the high-availability cause to see the one and
only storage node go down. Nor could I see any benefit to a cluster that used
the same network for file access as for external network requests. Say I
wanted to build a horizontally scalable video sharing site; I don’t see how I
could test its real-world performance before it went live.

~~~
outworlder
You could use multiple network interfaces, one for storage, the other for the
rest.

Not sure if this is in a datacenter or not. In a datacenter, you could use
something like 3-PAR to have network attached storage.

But storage is hard. This is one of the advantages of using cloud providers,
they have this part figured out for you. AWS's EBS volumes are network-
attached storage.

Then there's another layer of abstraction when you are using containers. OK,
so you have this volume accessible from the network now, in whatever form. How
do you attach it to running containers? That's mostly what this article is
about, not the underlying storage mechanisms.

~~~
ohiovr
I am leery of Amazon because of the potential for sudden ruinous expenses. I
looked over Digital Ocean and they say they are working on Kubernetes, but I
don’t see how block storage is going to make it work. I’m just developing this
on my home server for now, just trying to learn the basics. It seems like
cluster computing isn’t something you can roll your own the way ordinary
Docker can be.

Thanks for the lead though.

~~~
LukeShu
_> I looked over digital ocean and they say they are working on kubernetes_

FYI, Digital Ocean Kubernetes opened to everyone on December 11th.

------
xfactor973
I ended up making this:
[https://github.com/gluster/piragua](https://github.com/gluster/piragua) to
make connecting glusterfs to kubes easier/scalable.

------
KaiserPro
A few observations here:

o managing state is hard. Kubernetes makes it look simple because we have been
moving the state from apps to databases/messaging queues

o storing data durably at scale, with speed and consistency, is a CAP problem

o distributed block storage is slow, complex and resource hungry

o distributed storage with metadata almost always has a metadata speed problem
(again CAP)

o managing your own high-performance storage system is bankruptingly hard

Ceph/Rook is almost certainly not the answer. Ceph has been pushed with
OpenStack, which is a horrid mess of complexity. Ceph is slow, hard to manage,
and eats resources. If you are running k8s on real steel, then use a SAN. If
you're on AWS/Google, use the block primitives provided.

Firstly, there is no general storage backend that is a good fit for all
workloads. Some things need low latency (either direct-attached, or local
SAN); some can cope with single instances of small EBS volumes. Some need
shared block storage, some need shared POSIX.

With AWS, there is no shared block storage publicly available. Mapping volumes
to random containers is pretty simple. Failing that there is EFS, which,
despite terrible metadata performance, kicks out a boatload of data. If your
app can store all its state in one large file, this might be for you.

Google has the same, although I've not tried their new EFS/NFS service.

------
fh973
While storage is generally not a simple problem, Kubernetes' design makes it
even harder. Why Kubernetes went this way is not entirely clear to me. One of
the reasons could be its owners' focus on cloud infrastructure and
tutorial-level developer use cases, where you want to show how to dynamically
create and consume a file system backed by GCE persistent disks or EBS. Also,
getting a secure and usable Kubernetes setup running is currently a lucrative
business model.

But as the OP writes, the standard case it should support is secure access to
existing storage, and Kubernetes fails there on all levels. Simple things
should be simple, hard things should be possible.

In its API design, it hides storage under abstractions like PVC, PV,
StorageClass that most users have not seen before in their career (where do
they come from btw?) and which do not re-use general system architecture
concepts. Worse, these abstractions do not have exact definitions as their
semantics can be massaged with access control rules.

Consider the simplest case: mounting an existing file system and
authenticating with a Kubernetes secret. You should be able to do this in 3-4
lines in a pod definition. Instead you need to create several yaml files,
whose content is strongly dependent on how your Kubernetes was set up, so no
general tutorial will help you (though, as noted above, someone profits from
that).
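
To make that concrete, here is roughly the ceremony involved today, sketched
with the in-tree CephFS driver as one example of a filesystem that
authenticates via a secret (addresses and keys are placeholders) -- and the
pod spec still has to reference the claim on top of all this:

    apiVersion: v1
    kind: Secret
    metadata:
      name: fs-secret
    type: Opaque
    data:
      key: BASE64_ENCODED_KEY          # placeholder credential
    ---
    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: existing-fs
    spec:
      capacity:
        storage: 1Ti
      accessModes: ["ReadWriteMany"]
      cephfs:
        monitors: ["10.0.0.10:6789"]   # placeholder monitor address
        user: admin
        secretRef:
          name: fs-secret
    ---
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: existing-fs-claim
    spec:
      accessModes: ["ReadWriteMany"]
      storageClassName: ""
      resources:
        requests:
          storage: 1Ti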

And this is just for getting the basics going. Secure access / user
authentication is still unsolved after two years
([https://www.quobyte.com/blog/2017/03/17/the-state-of-secure-...](https://www.quobyte.com/blog/2017/03/17/the-state-of-secure-storage-access-in-container-infrastructures/)),
and does not seem to be high on the agenda of either Kubernetes or CSI.

There are other basics missing (like the fact that file systems do not
necessarily have a "size"), but let's not go into detail there.

------
parasubvert
Storage itself is a complicated problem domain, one that is easily dismissed
until you actually have to deal with it in gory detail.

K8s grew organically here, and thus the seams show. It’s getting better every
release, particularly in being able to dynamically provision and schedule
storage to pods in a “zone aware” manner, which has been tricky.

The deeper issue is that we are spoiled for choice on storage engines. Using
Ceph for random-access r/w low-latency block storage is something I wouldn’t
wish on my worst enemy, for example. But it’s hard to come by hard numbers for
comparative purposes.

------
pkphilip
The old-fashioned approach of using a database server in a co-hosted setup
with locally hosted SSD storage, and another couple of servers in active or
standby replication, is the best option for the db side.

The rest of the stateless compute cores can run on kubernetes.

------
kkaffes
There are solutions like Arrikto's Rok [1] that solve the problem of handling
stateful workloads in Kubernetes.

[1] www.arrikto.com

------
linkmotif
Kind of feels like the author doesn’t know Kubernetes? Stateful sets make all
this very easy. I only have end-user knowledge of k8s from GCE and it seems
like k8s makes storage very easy.

EDIT K8s makes storage very easy. No “seems.”
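
For anyone who hasn’t used them: the pattern is a StatefulSet with
volumeClaimTemplates, which gives each replica its own PVC and a stable
identity. A rough sketch (the image and sizes are just placeholders):

    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      name: db
    spec:
      serviceName: db
      replicas: 3
      selector:
        matchLabels:
          app: db
      template:
        metadata:
          labels:
            app: db
        spec:
          containers:
          - name: db
            image: postgres:11       # placeholder image
            volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
      volumeClaimTemplates:
      - metadata:
          name: data
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi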

~~~
cat199
> k8s from GCE and it seems like k8s makes storage very easy.

or, using storage in an environment where the real storage is completely
managed for you makes storage seem easy..

------
wora
I saw many comments about stateful workloads. I am not sure statefulness is a
necessary issue in a cloud environment.

Within a zone or a cluster, the latency is about 1ms, which is faster than
most hard disks. The network bandwidth is on par with disk throughput. What we
really need is a faster database and a faster object storage that can match
the network performance (1ms and 10Gbps); then all workloads can be stateless.

If one uses a VM on GCP, the VM has no local storage besides the small local
SSDs. Practically, even the VM is stateless apart from some cache.

~~~
lostmyoldone
From a database perspective, 1ms to disk is an eternity.

A good disk subsystem had less write latency than that in the early 90’s.

~~~
0xEFF
I don’t think that’s true. I recall ~5ms seek times being top of the line.

~~~
cat199
he said 'subsystem', not 'disk' --

what was the latency to the controller with RAM cache?

using seek time as a measure is also somewhat worst-case --
controllers/filesystems also queue(d) requests according to drive geometry.

