Hacker News

Running production databases in Docker last year: https://thehftguy.com/2016/11/01/docker-in-production-an-his...

Performance issues should be the least of your concerns. The Docker daemon and containers simply hung because of filesystem issues on CentOS 6.

I worked at a company that was dockerizing its stateless services and planning to dockerize its Cassandra databases next. Multiple contractors were involved.

Stateless services failed periodically because of the above issue. Load balancers fail over automatically and broken nodes are rebooted from time to time, so the impact was limited. No one cared; it was just part of the daily deployment routine.

I feared the day the Cassandra dockerization would happen. They'd have lost their entire customer dataset (hundreds of millions of customers) the moment two nodes failed simultaneously, which happened a lot on the stateless services.

Thankfully the project never started, and the company didn't go bankrupt. I'm pretty sure employees moved around and the plans got canceled.

Expect a lot of instability in Docker around filesystems, performance, and race conditions. Low-volume stateless web servers rarely trigger these issues, but databases do.




I can't possibly hope to change your mind, but stability issues with the union filesystem drivers in Docker (part of which was not even Docker's problem) and persistent volumes in Kubernetes are two very different things. Cassandra running standalone on the host (and crashing) is no different from Cassandra crashing while running on a PV inside a container.

Moreover, most Linux distros have switched to overlay2 as the default storage driver. If you are running the latest version of RHEL/CentOS/Fedora/Ubuntu, that is most likely the driver you are using.
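As a quick check (assuming a running Docker daemon), `docker info` reports which storage driver is in use:

```shell
# Print only the storage driver name (e.g. overlay2, devicemapper, aufs)
docker info --format '{{.Driver}}'

# Or show the full storage section, including the backing filesystem
docker info | grep -A 3 'Storage Driver'
```

On an up-to-date distro this should report overlay2; seeing devicemapper or aufs is a sign you're on the older, more troublesome drivers discussed in this thread.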


Don't get me wrong: I know it's not a bug in Kubernetes, it's a bug in the filesystem. Kubernetes is only as stable as its weakest part, and the weakest part is the container engine (Docker and what's underneath).

Containers require volumes/filesystems to run and some implementations are buggy as fuck.

Docker abandoned CentOS 6 years ago, whether or not they stated it officially; the last Docker package and the kernel/drivers there are unstable. Similar story on some other distributions.

It wasn't production-ready at all back then, and it's still not a good idea to containerize databases now. Besides bugs that come and go, there are other challenges around lifecycle, performance, and permissions that are not trivial to deal with.


>"I can't possibly hope to change your mind but stability issues with union filesystem driver in docker(part of it was not even docker's problem)"

Can you outline what those stability issues are/were? Was the non-Docker part of the problem kernel-related? Genuinely curious.


See RHEL and Debian sections: https://thehftguy.com/2017/02/23/docker-in-production-an-upd...

The filesystem drivers are buggy as fuck. You would experience kernel panics on Debian Jessie (OverlayFS), or containers and the Docker daemon hanging on CentOS 6 (devicemapper). The fix in both cases is a reboot.

You might not notice it if you barely use Docker, but it becomes glaring at scale. I briefly consulted at a major web company that was deploying their web services to 5-20 nodes, daily. On every service deployment, up to 3 nodes would die.


For sure it is a very different thing. Local SSD or remote drive? That matters a lot for Cassandra.


Kubernetes supports local volumes. With GKE you get local SSDs.
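For context, on GKE local SSDs are attached when a cluster or node pool is created. A minimal sketch (the cluster and pool names here are made up for illustration):

```shell
# Create a node pool whose nodes each get one local SSD attached.
# The SSDs surface on the nodes and can be consumed by pods,
# e.g. through Kubernetes local PersistentVolumes.
gcloud container node-pools create db-pool \
  --cluster my-cluster \
  --local-ssd-count 1 \
  --machine-type n1-standard-8
```

Note local SSDs on GKE are ephemeral per node: if the node is recreated, the data is gone, so the database itself must handle replication (which Cassandra does).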


It doesn't make sense to use GKE for this. Eventually you will just have a bunch of VMs that run only your DB (since you need to avoid interference from other workloads), and there is no support for multi-DC mode... And what are the benefits? Restarting SQL or Cassandra is not a cheap operation and can cause large data migrations.


In the Cassandra case, you would not write the persistent data into the Docker image (that's the part of the file system mounted as a layered file system, using AUFS or OverlayFS). Instead, you would write it to a volume. For a local volume, that's just part of the "normal" file system (ext4, XFS, ...) exposed to the Docker container through a bind mount.

Volumes are quite stable and reliable when based on a stable file system.

So while you could lose the container due to the bug described, you would not lose the persistent data.

It's best practice not to write to the Docker image at all during runtime (no log files, no PID file, etc.), but to write only to volumes or tmpfs mounts. I'm a little suspicious about the crashes you described: are you sure you followed that best practice?
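A minimal sketch of that practice for Cassandra (assuming a running Docker daemon; the official `cassandra` image keeps its data under `/var/lib/cassandra`):

```shell
# Persistent data lives in a named volume on the host filesystem,
# scratch files go to a tmpfs mount; nothing is written into the
# image's layered filesystem at runtime.
docker volume create cassandra-data

docker run -d --name cassandra \
  -v cassandra-data:/var/lib/cassandra \
  --tmpfs /tmp \
  cassandra:3.11
```

With this layout, even if the container or daemon hangs and the node has to be rebooted, the data directory survives on the host and a new container can pick it up.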



