
Why Databases Are Not for Docker Containers - user5994461
https://myopsblog.wordpress.com/2017/02/06/why-databases-is-not-for-containers/
======
XorNot
This article is terrible. It's a lot of wishy-washy explanations devoid of
technical detail - because there _isn't_ a technical explanation or
justification for this list.

I've run extensive benchmarks of Hadoop/HBase in Docker containers, and there
is no performance difference. There is no stability difference (oh, a node
might crash? Welcome to something that happens every day across a 300-machine
cluster).

Any clustered database setup should recover from failed nodes. Any regular
relational database should be pretty close to automated failover with
replicated backups and an alert email. Containerization doesn't make this
better or worse, but it helps _a lot_ with testing and deployment.

~~~
alexnewman
Yeah, volumes skip the union filesystem. This article is full of FUD. The
author demonstrates they don't really have enough experience to make these
claims. I wonder if Google has database nodes in containers? Kubernetes is
adding the features for this now. I think it is stable now.

~~~
jdc0589
> I wonder if google has database nodes in containers?

I've been wondering this for a while. I'm sure some of the big players do it,
but I'd really like to see a case study from one of them.

~~~
mondoshawan
Google runs MySQL on Borg internally.

~~~
puzzle
Same for Bigtable and Spanner.

~~~
ldoguin
I assume these guys have their own network controllers and kickass fiber-optic
links. Network-attached storage in poor cloud environments leads to issues.

------
derefr
> But what about Configuration Management systems? They’re designed to solve
> this kind of routine by running one command.

The problem with this, for most of the developers you see praising containers,
is that with a containerized setup you've already got the _rest_ of your
deployment process down to `docker service update --image
myorg/myservice:1.3.0 myservice`.

(And, in fact, maybe you're even running that command against an
immutable-infrastructure container-host OS like CoreOS.)

And _now_ , you're suggesting that these developers would have to add this
whole other process just for managing the deployment of the DBMS package—and
probably the OS it's running on, too. (Maybe they would even have to add a
process to manage the _VM_ it's running on, if they were doing everything else
until now using autoscaling + swarm auto-join.)

Developers put DBMSes in containers because they're developers, not DBAs. If
you _are_ a DBA, then obviously this will seem wrong to you. A DBA wants to
manage a DBMS using the DBMS's tooling. Developers, meanwhile, essentially
want to manage DBMSes as part of the same "release" as their apps—being able
to "pin a dependency" to a specific DBMS version; update the DBMS as part of a
commit and see the whole updated stack go through CI to integration-test it;
etc. These are development-time concerns that—at small scale—usually override
operation-time concerns.
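As a concrete (hypothetical) illustration of that "pin a dependency" workflow:
the DBMS tag can live in the same compose file as the app, so a DBMS version
bump is just a commit that goes through the same CI/review flow as an app
change. Image names and tags below are made up:

```yaml
# docker-compose.yml - illustrative stack file, names are placeholders.
# Bumping the postgres tag here rides the same commit -> CI ->
# integration-test pipeline as any application change.
version: "3"
services:
  myservice:
    image: myorg/myservice:1.3.0
  db:
    image: postgres:9.6.2   # the "pinned" DBMS version
    volumes:
      - db-data:/var/lib/postgresql/data
volumes:
  db-data:
```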

~~~
sqldba
This sounds right to me. As a DBA, containers look like a nightmare. I'm
employed as an absolute expert in my product. A developer may know how to use
Docker, but are they an expert? Now...

* Who is going to look after the middle ground when the database is in the container?

* Who is going to be responsible for rewriting enterprise tools to discover those instances to gather metrics? Because none of the traditional methods (WMI, registry keys, etc) are going to work. You've just broken SCCM, ServiceNow, and everything else under the sun.

* Who owns the patching? Because WSUS can't discover it and isn't going to be able to patch inside a container.

* Who owns the backups? You know backups are complicated right and not just an on/off switch? You have to schedule the backup, but also make sure you're scheduling the backups in a standard way across your hundreds of hosts (now containers), and then validate those backups are actually being taken, and test those backups regularly. Developers couldn't care less about this stuff - it's someone else's problem - my problem - until it's in a container and then nobody is going to do it.

* And when something breaks in between, and the business suffers a massive loss of data, who are they going to sue? A liability-free open source project? I don't think so.

There's more to being a DBA than just stuffing it in a container and saying,
"she'll be right mate".

~~~
kiallmacinnes
These are all valid concerns, but none of them seem specific to databases,
and - for companies that have moved even part of their infrastructure to
containers - they have evidently been deemed acceptable.

(Also, while you mentioned only Microsoft tools here, the same issues apply to
Linux-based containers.)

------
raarts
Alright, all you brave people who run databases in docker: where do you store
the actual data?

- Host mounts? So what happens when a container gets rescheduled to another
node?

- Docker volumes? What happens when a container gets rescheduled to another
node?

- External SAN? Congratulations on your budget. That's not easily doable in
public cloud, I guess?

- A shared filesystem like NFS or Ceph? How's the performance for you? And the
filesystem itself runs outside of your Docker cloud, I guess?

- In the containers themselves? So you probably run some clustered database.
What if disaster strikes and all your containers go down? And get moved
around?

Also, databases in many cases need to be tuned for performance. How do you do
that in a cloud?

Maybe most of you are not running container scheduling - which really is only
taking containers halfway.

~~~
jazoom
Mount a volume on the host and make sure only one database container runs on
that host. Done.

Then it's exactly like a traditional setup except better because if the
process crashes or config gets corrupted it can be replaced automatically in a
few seconds, good as new.

There's nothing brave about that.

~~~
hmottestad
This is what we do. We use Docker Swarm and label our database host, then add
a constraint to the service/container in our docker-compose file so that the
database is always scheduled to that host.

The great thing about this setup is that we can now run the entire production
system locally on our laptops and test servers, with or without Swarm.
Everything is deployed in the same way, and everyone uses the same versions
and configurations (even for our database).
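For readers unfamiliar with this pattern, a minimal sketch (hypothetical
names; Compose v3 syntax for Docker Swarm) might look like:

```yaml
# Label the host once with:  docker node update --label-add db=true <node>
version: "3"
services:
  database:
    image: postgres:9.6.2
    volumes:
      - /srv/pgdata:/var/lib/postgresql/data   # host mount on the pinned node
    deploy:
      placement:
        constraints:
          - node.labels.db == true   # always schedule onto the labelled host
```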

------
activatedgeek
I am so surprised and disappointed that such a shallow article has made it to
the top. It provides absolutely no value.

Most of the upvotes (judging by the comments) are not because people believe
Docker is the wrong tool, but because they have been frustrated by the ops
side of things.

I'm also surprised at how many think that one tool will come and solve all
their problems. Guys, it doesn't work that way. Docker had one job and it does
it fairly well - process isolation for humans (OK, the engine goes crazy
sometimes, but hey, everything does). For all the other things, you need to
set up your own workflows, tools and processes.

Tomorrow another rant will come along about how apt sucks, when in fact apt
and friends are an amazing example of how packaging should be done. It is not
ideal, but it works! Everything fails sometimes, and that is when new things
come up to replace it.

If every immature developer started posting rants on the internet, we would
pretty much be disregarding half of all software. To the OP: if you are not
able to achieve something, please don't rant just because you couldn't do it.

~~~
user5994461
There are few to no rants against apt, yet there are a lot about Docker.

Trends always have an explanation...

------
mstump
Part of the problem is that you're using databases that can't cope with
failure. In large scale production systems things fail all the time. If you've
got tech that can cope with failure it's not an issue.

Additionally, Docker is pretty handy when you're attempting to manage clusters
consisting of thousands of nodes. In that instance enforcing best practices,
automating workflows, scaling teams, auditing and preventing configuration
drift are much bigger problems than a single server failing.

~~~
user5994461
There is no tech in the universe that can cope with cascading failures, like
ALL instances of a Docker container crashing on ALL hosts one by one in quick
succession. This usually happens because an app hits an unexpected bug in the
Docker disk or network stack, and it is the major source of concern I have
with Docker.

~~~
mstump
Some systems cope with failure better than others. Everything you've said is
also true of a DB running on top of a uniform Linux stack. From my experience
(500+ large-scale production deployments) this doesn't happen very often.

Does it solve all problems? No. Does it make the world a little better and is
it better than monolithic single points of failure? Yes.

~~~
carterehsmith
>> 500+ large scale production deployments

This needs to be qualified....

Did you deploy a single system 500 times, or 500 different systems? Or some
combination thereof.

~~~
mstump
It's a mix. I'm a consultant that specializes in large scale distributed
systems. I have some customers that have >100k production database nodes. I
manage probably >50PB of data. I have designed large distributed systems for
more than 100 customers.

~~~
user5994461
consultant = charges > £600 a day to bring Docker to the company, yet doesn't
care when shit hits the fan 3 months later because he's already gone. In fact,
he will never know about it.

By the way, how to have 100 customers => leave right after the design phase
every single time. Clients add up quickly.

~~~
mstump
I do a mix of pure consulting but also managed services. I typically have a 12
hour SLA for issues, and 1 hour SLA for some customers. 24/7 support for
mission critical, revenue generating systems. So no, I'm not just a talking
head. It's usually me in the NOC on the hook in case things go wrong. I'm the
world expert in this field, if you want things to work at scale people call
me.

------
wstrange
Kubernetes StatefulSets [1] are intended to address this kind of use case.
They provide stable network identity and stable storage.

Kubernetes enables a container to declare the resources that it requires,
including things like dedicated CPU and memory requirements. There are still
some rough edges (example: how do you set the amount of kernel shared memory
you need), but those issues are being ironed out.

[1] [https://kubernetes.io/docs/tutorials/stateful-application/basic-stateful-set/](https://kubernetes.io/docs/tutorials/stateful-application/basic-stateful-set/)
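A minimal sketch of the shape of such an object (abbreviated, placeholder
names; the `apps/v1beta1` apiVersion is per the then-current k8s 1.5):

```yaml
apiVersion: apps/v1beta1      # StatefulSets were beta as of Kubernetes 1.5
kind: StatefulSet
metadata:
  name: db
spec:
  serviceName: db             # matching headless Service gives each pod a stable DNS name
  replicas: 3
  template:
    metadata:
      labels:
        app: db
    spec:
      containers:
      - name: mysql
        image: mysql:5.7
        volumeMounts:
        - name: data
          mountPath: /var/lib/mysql
  volumeClaimTemplates:       # stable storage: one PersistentVolumeClaim per replica
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 10Gi
```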

~~~
cookiecaper
StatefulSets are a new feature marked as "beta". They were first available in
the newest k8s release, 1.5.

~~~
anonfunction
They were known as PetSets beforehand.

------
ofrasergreen
As others have mentioned, Docker doesn't really do any magic that might harm
the smooth running of a database; it just leverages process isolation built
into the Linux kernel and provides a convenient way to package and distribute
bundles of software. In the case of a database, the former is handy any time
you want to run the database on a host alongside other software. As for the
latter, being able to run the same configuration locally as on production
servers, to replicate configuration over many nodes in a cluster, and to
distil both configuration and software into atomic units is as advantageous
for databases as for any other software.

At my company we have been using PostgreSQL on Docker for over two years
without incident, and we have been sufficiently satisfied with the results
that we're in the process of turning our setup into a product in its own
right: [http://containable.co/](http://containable.co/)

~~~
tmikaeld
I'd suggest skipping the sign-up questions, they might scare people off.

------
ender7
I still can't really figure out what a container is. Every time I think of a
use case for one, I read something like this which says that's a terrible
idea.

The use-case I need solved most often is the following:

Create a standalone "server" that accepts and responds to network traffic, has
some way to store data, and whose dependencies (i.e. system packages,
frameworks, etc) I can manage independently of any of the other "servers" I
have running. Do I just want a bunch of VMs? Or docker instances that all
point to some other DB (that's apparently not in a docker instance...?). But
then they're no longer independent from one another because they all use the
same DB. So do I need a separate DB for each serverlet? Which lives where? On
its own VM?

~~~
NikolaeVarius
There is nothing special to understand about containers.

Containers are a lightweight way of sandboxing a process. Think a level lower
than a VM. You can run multiple containers on a single VM in the same way you
can run multiple VMs on a single host.

Ideally a container should be stateless. If a container crashes, you should be
able to bring it up again without anything actually caring that it is
technically a different process.

A container doesn't solve a "real problem"; it mostly makes it easier to
manage applications and processes by abstracting dependencies away from the
host VM and keeping everything packaged into a single thing.

A container can run any application it is configured to run, on any VM,
regardless of the state of the VM (assuming the VM's kernel supports
containers).

~~~
metaphorm
> Containers are a lightweight way of sandboxing a process. Think a level
> lower than a VM.

Can you go into a little more depth? My understanding of a VM is that it
installs the OS in a dedicated memory partition and allocates hardware
resources separately from those of the host machine, such that resource
contention between host and VM never happens. The VM-allocated resources just
go dark for the host machine while the VM is running.

What is a lower level than that? I've understood containers to be thin
wrappers around VMs, which would make them higher level, not lower level. Do I
have this wrong?

~~~
NikolaeVarius
I'm doing a bit of a simplification here so please someone correct me if I'm
not saying this correctly

Every single instance of a VM has its own kernel. When a VM boots up, it gets
allocated a portion of the hardware, boots a kernel, and allocates memory to
itself. VMs are isolated from each other in that they don't share resources,
and each VM is free to do whatever it wants with the hardware it is given.
Like you said, to the host machine that hardware is no longer available for
any other VM to use.

Containers, by contrast, all live on a SINGLE kernel. They share resources
with each other, and the kernel handles the multiple processes much as it
would handle any other multi-process workload.

If you have 3 VMs that each require a specific set of resources to run an
application, you need 3x that hardware. This is not true for containers. You
can get away with less, because the containers will share the resources that
the kernel has access to.

I call them "lower level" in the sense that they do so much less than VMs. You
CAN use a container like a VM, in that a container can run a full OS userland
with its own init, but generally you don't do this.

~~~
user5994461
VMs share resources from the host: disk, network, memory. Just like a
container shares resources from the host.

Containers re-use the running operating system from the host, which saves
memory, but they can only run a single OS.

A VM can run any operating system, and each VM runs its OS independently. VMs
are memory intensive, there is a base 100-500MB to pay to run any VM because
of the independent OS. (Note that the advanced VM managers have evolved to
have memory deduplication and COW across VMs.)

Containers exist to save memory. That was the critical pain point at the time
they came into existence: memory was expensive, and it was a major problem
when you wanted to run 10 hello-world applications as 10 separate VMs.

~~~
ec109685
Containers give the OS the ability to optimally schedule processes among them.
VMs are black boxes to the hypervisor, limiting its ability to optimize.

~~~
user5994461
The hypervisor can see the processes inside the VMs. It has limited control,
but it is far from blind.

------
k2xl
I've been using Docker in production with Elasticsearch and MySQL for 3 years
at the PB scale and have never had data corruption issues.

Corruption occurs on data drives even without docker - you still have to plan
for it. This is why you enable replication. This is why you snapshot/backup
your data daily and have disaster recovery plans.

There are some major reasons why I actually think running databases in Docker
containers is a good idea, even if you are mounting a volume for the data:

1) Development environments can be similar to production. Ensures everyone
runs the same version that is running in prod.

2) You don't have to worry as much about what is installed on the host
machine.

3) In a clustered setup, it's easier to ensure each node is running the same
configuration, version, etc...

One of my issues with all the gripes about Docker is the assertion that it
causes issues. In all my time using Docker, 99% of the time when there is an
issue it has nothing to do with Docker itself. Everyone loves to blame it when
things go wrong, though.

This article doesn't really back up any of the claims about any of its issues.
It just makes blanket statements without backing them up. Don't like docker's
networking? Use host networking then.

What people don't think about is the countless issues that will never come up
when using containerization. I never have to worry about whether or not python
2.7 is installed on a server that I'm going to deploy a python 3 app on. I
also have MUCH higher confidence that if things work on my local development
env (which runs the same containers), then there is a high chance it will work
in production.

YMMV

~~~
Bombthecat
Wow, three years ago? Balls of steel!

~~~
tinco
People make it seem as if Docker is some bleeding edge magical technology, but
in reality its most useful features are just thin wrappers around stable linux
kernel features and some nice automation.

We have also been running databases in Docker (at the TB scale, though) for
around 3 years. We've had the odd issue here and there, but nothing terrible,
and certainly nothing fundamental or resulting in data loss.

If your data is corrupted by a single process dying in an unclean fashion then
you have other operational problems.

~~~
eeZah7Ux
> People make it seem as if Docker is some bleeding edge magical technology,
> but in reality its most useful features are just thin wrappers around stable
> linux kernel features and some nice automation.

That's one of the things to dislike: the company and the community are trying
to sell it as the best thing since sliced bread, and usually forget to credit
the kernel developers.

On top of that, its 180k lines of code are unwarranted for a "thin" layer.

~~~
tinco
Well obviously it has a bunch more features than just being the thin layer,
but Docker really is the best thing since sliced bread.

------
gjkood
In the ancient times of databases, we had to make sure we wrote to 'raw' disk
devices, bypassing any possibility of the operating system's file
caching/buffering failing to 'flush' our writes all the way to the disk. There
was an implied guarantee that what we thought we wrote was actually written to
disk.

In today's world of 'virtual' everything, which may sometimes be many levels
removed from the raw disk devices, how do we ensure that a write to a database
is still a write to a physical device, as opposed to an incomplete write that
merely looks complete to a higher-level virtual disk? Is there a guarantee
that everything is flushed to a physical disk?
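The mechanism is still fsync(2): the application asks the kernel to flush, and
every layer below (filesystem, volume manager, hypervisor, disk cache) must
honor it. A minimal sketch of the pattern in Python (path and payload are made
up; this guarantees only what the layers beneath actually honor):

```python
import os

def durable_write(path, data):
    """Write data and ask the kernel to push it all the way to the device."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        os.write(fd, data)
        os.fsync(fd)          # flush file contents and metadata to the device
    finally:
        os.close(fd)
    # Also fsync the parent directory, so the directory entry itself
    # (the file's existence) survives a crash.
    dfd = os.open(os.path.dirname(path) or ".", os.O_RDONLY)
    try:
        os.fsync(dfd)
    finally:
        os.close(dfd)

durable_write("/tmp/fsync-demo", b"committed")
```

If any layer below acknowledges the flush without persisting it - a lying
virtual disk, a write cache without barriers - no application-level code can
recover the guarantee.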

------
fabian2k
This article mentions offhand that the storage drivers are unreliable, even
for data volumes.

Is that actually the case? Is there a serious risk that a database will be
corrupted by a container crash, as the article claims? A regular crash of the
computer should not be able to corrupt a database, is a container more
dangerous in this regard?

~~~
mstump
I've been running a couple of petabytes in production with Docker and
Cassandra for a couple years (around 2k nodes). I've rarely seen FS
corruption, however I must qualify that this is on bare metal. They could be
running into issues with the interaction with EBS? This is more of a screed
than an argument backed by specific details and facts.

~~~
tayo42
How do you handle making updates to configuration with all your nodes in
containers? Do you blue green deploy the cluster or something? Run config
management in the container?

~~~
mstump
Configuration comes from the environment. We store the configuration per
cluster in a centralized store (C*, etcd, SimpleDB). We bake images that
contain everything else.

Depending on the customer and the tech involved we'll do blue-green by doing a
controlled rolling push of the config or image after it makes it through the
dev/test cycle. Also depending on the type of tech we'll store actual data on
network or host volumes.

------
memracom
The basic problem is that a database server is not what Docker is trying to
containerize. Docker wants to containerize applications, which used to be
clear-cut things: single-purpose programs built by compiling and linking some
source code into a single binary. They did not contain everything including
the kitchen sink, plus a plumbing kit complete with an automatic
drain-clearing snake.

Today's database server - and this includes more than just the RDBMSes - is
actually a collection of applications. Even if the developers have the bright
idea of integrating it all into one binary, it is still not a traditional app.
It is a collection of apps built into one binary, like busybox.

To truly dockerize a DB server, it would need to be built differently: as a
collection of separate, semi-independent apps with clearly defined interfaces
between them. Until that day arrives, it's better to use something like
Ansible to manage your monolithic DB server on its own instance.

Docker doesn't buy you anything with DB servers because you generally want
stability. I know some people are experimenting with highly scalable clusters
of DB servers, using things like MySQL in ways that were never intended, but
they know they are on the bleeding edge. I also expect them to soon start
hacking away the bits of the RDBMS monolith that they do not need, and
building single-purpose cluster members that each do only one of the jobs of a
normal RDBMS. It might work; give them time.

But if you have to run a DB to support your business, don't do it with Docker.
I run PostgreSQL, and right beside the monolithic RDBMS there is a Docker host
that runs miscellaneous support stuff: serving up pgBadger data, a REST
interface to the data, an app that listens for PostgreSQL NOTIFY events using
Camel pgconnect, and some other admin tools (simple webapps for DB-related
chores). Docker has a role, but running the main RDBMS is not it.

------
gourao
Docker is a packaging tool. How you choose to deploy your software package has
nothing to do with your operational and DB-administration procedures. Mixing
the two topics is very confusing.

~~~
lobster_johnson
Well, it's both. It's both packaging and deployment. Which is convenient, but
probably a mistake. Docker is decent at being a packager, but rather terrible
at deploying stuff, which is why we have better, high-level orchestration
systems like Kubernetes that handle deployment the way it should be done, and
reduce the Docker runtime to a mere container runner.

~~~
gourao
That is a fair point, and most of the end users we work with are using k8s and
Mesos to deploy and run their applications. The issue I have with the article
is that it assumes a packaging tool defines the rest of your operational
procedures. They are two different things, as they always have been in Linux.

------
bborud
"You may corrupt the data in case of container crash where database didn’t
shutdown correctly. And lose the most important part of your service."

This is the point in the article where you know you don't need to read the
rest. If your data integrity hinges on your database being able to shut down
cleanly, you will be disappointed.

The author believes in fairytales.

~~~
takeda
It's kind of interesting how knowledge works with some subjects.

If you don't know anything, you will agree with the statement. When you know a
bit more, you will disagree; but when you learn more than that, you will once
again agree.

It's true that real databases need to guarantee data won't disappear even on
power loss, so you would think a container crash should be comparable to a
power loss, if not more trivial.

The thing is that the database can only provide such guarantees (writing
things in the correct order, writing to disk when the database says so, etc.)
if the underlying system provides specific guarantees to the database.

The storage drivers are quite buggy, so the reliability of your data is still
in the hands of those drivers.

~~~
bborud
You stopped one iteration short. You assume you can, and will, know that your
OS and your disks do not lie to you.

Yes, persisting data in a consistent and durable manner is hard. It is damn
hard. It was hard 20 years ago when systems required to store obscene amounts
of data started to become more common and it is hard today.

(This reminds me of a discussion a couple of years ago on how to kill
processes. There are two schools of thought. One is that you should go through
the SIGTERM - wait - SIGKILL dance, because that's "being nice". The other is
that you always send SIGKILL immediately and instead engineer systems that can
deal with it)

~~~
takeda
Yes, it does lie, especially with fsync(), because that call's purpose is to
flush all caches to disk, which is an expensive operation.

Since then, NCQ/TCQ have been added to disks, and systems like Linux have
implemented write barriers to enable more control[1].

[1] [https://monolight.cc/2011/06/barriers-caches-filesystems/](https://monolight.cc/2011/06/barriers-caches-filesystems/)

------
blackss2
All systems can fail, and failed systems should be recovered. That's why we
have disaster recovery plans, backups, replication, etc.

(I have only shallow knowledge of Docker volumes, so please reply if anything
here is incorrect.) I understand Docker with a local volume to be just an
abstracted filesystem: the mounted path the volume specifies is linked to a
host path. So file I/O is probably not a problem with a local volume.

And if the network had a bug that corrupted data and created differences
between nodes, Docker could not be used for any system. So we can assume a
network bug may cause a partition, but not data corruption.

I am now building on-premise automated deployment software using Kubernetes as
an outsourcing job, so I tried to find a SAN for handling stateful data. After
much searching, I realized that only a local path will guarantee the stability
of the database filesystem. So we mark a storage node, and every kind of
stateful app (a limited set, by platform) is deployed on that node. That way
we can easily back up and manage the storage.

As a deployment manager and backup-automation tool, containers work great for
databases. All files produced by the container are jailed where I specified
and can be copied or backed up. (For stability, replication comes first:
pause-backup-resume, or copying while running, creates operational
instability. Use both for backup - replicate first, then make backups from the
replication node.)

------
cazorla19
Hi everyone. I'm the immature developer who made this blog post. Thanks for
the feedback; I didn't expect so many people to care about my post, since I
mostly keep the blog for myself. About 20,000 users saw this today, while I
had only 3,000 over the whole last year. So, if you have any questions for me
as the author, please write them down in this thread. I'll try to deal with
all the comments soon.

------
peterhunt
We have been running stateful services, including DBs, inside Kubernetes for a
while with some success. Outside of a few Docker bugs it's definitely been
worth it: [https://medium.com/the-smyte-blog/counting-with-domain-specific-databases-73c660472da#.uffwxpp8y](https://medium.com/the-smyte-blog/counting-with-domain-specific-databases-73c660472da#.uffwxpp8y)

------
briffle
It's like I'm reading blogs about running databases in VMware all over again,
just 10 years later...

------
throw2016
There is actually little difference between a process running in a container
and one running on the host. They use the same network and ideally the same
filesystem, so there should be no difference whether you run the database or
the app on the host or in the container.

Every container article on HN seems to perpetuate more confusion about
containers, and often arbitrary, misguided rules about what a container should
be, confusing new users even more.

The problem is that Docker has taken fundamental technologies developed by
other people and wrapped them, and since they don't seem to want to give
credit and pretend to be more than what they are, they obfuscate things behind
words like Docker filesystem drivers, networking drivers, container drivers,
etc. Until users get familiar with the underlying technologies, this sorry
state of affairs will persist.

A container is simply a process launched in its own namespace, thanks to the
kernel namespaces introduced in Linux 2.6. It's got nothing to do with
cgroups; cgroups 'can' be used to limit the container process's CPU, memory or
network resources if you want. If you launch this process by chrooting (or
pivot_root-ing) into a basic Linux rootfs, you have a container. If you launch
an init in this process, you have an LXC container. If you don't, and prefer
to launch the process directly from the host, you have the Docker version of
the LXC container, which is a fussy hack: now your container is not contained,
runs an app process not designed to run as PID 1 as PID 1, and needs to be
managed from the host. Kudos. You can also add a network namespace to the
container process so it has its own network layer.
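On Linux you can see this membership directly: every process, containerized or
not, belongs to a set of namespaces listed under /proc/<pid>/ns. A quick
sketch (Linux-only):

```python
import os

# Each entry is a symlink encoding the namespace type and an inode number.
# Two processes are "in the same container" for a given resource exactly
# when these inodes match; a container runtime just gives a process fresh ones.
for ns in ("pid", "net", "mnt", "uts", "ipc"):
    print(ns, "->", os.readlink(f"/proc/self/ns/{ns}"))
```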

The biggest problem currently is that a lot of Linux subsystems are not
namespace-aware, so you can't really do proper isolation. Even cgroups only
recently got namespace support. Does anyone know who the folks doing all this
fundamental work are?

The second biggest problem is that layers are oversold; their practical use is
marginal at best. They are also complicated and buggy, with multiple issues
when running overlayfs or aufs on xfs, with databases, and on btrfs. The third
biggest problem is that a lot of the projects and teams working on Linux
containers are pushed into the background, or marginalized and misrepresented
- as LXC was by the Docker devs - instead of being given proper credit and
explanations.

The talented developers of overlayfs and aufs, for instance, are virtual
unknowns in the container ecosystem, in spite of Docker being fundamentally
dependent on their work. These folks can solve a lot of the problems with
containers, but first users must know about them and support them so that bugs
can be fixed, rather than have the Docker team create more workarounds and
hacks.

------
old-gregg
If you limit your understanding of "containers" by never advancing past the
single-page tutorials produced by content-marketing folks at orchestration
startups, you may _feel_ like the author is right. But as is almost always the
case with damn computers and generalized topics, there's no right or wrong.
The world is boring and full of "it depends", but that was conveniently left
out of the article because the goal, I suspect, was to back a sensationalist
title and produce clicks/views. But I'll bite:

> 1\. Data insecurity

The author is mixing up the Docker image store with the database's own data.
It is true that Docker graph drivers have issues, but they don't store any
data; those are binaries you distribute, and you're welcome to start Docker
containers from a plain old directory on disk. Layers are sexy but optional,
and they have nothing to do with your database data.
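
Concretely, a bind mount keeps the data on a plain host directory and
bypasses the graph driver entirely; a sketch (the image version and host path
are just an example):

```shell
# Data lives in /srv/pgdata on the host, not in any Docker layer;
# the graph driver only holds the read-only image binaries.
docker run -d --name db \
  -v /srv/pgdata:/var/lib/postgresql/data \
  postgres:9.6
```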

> 2\. Specific resource requirements

The author talks about running additional processes on a database machine. Why
is this an argument against containers? Maybe because containers make it
somehow easier? I dunno... Yeah, don't overload your database servers with
other stuff, containers aren't forcing you to do it.

> 3\. Network problems

This one is the most bizarre, with statements from all over the map, basically
saying "networks are hard". Riding unicycles is also hard, but that's not used
as an argument against containers. Here's an obvious conclusion: if you don't
feel like learning software-defined networks (or don't need the benefits they
provide), then don't use them and run containers with native host networking.
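
A sketch of that, assuming a stock image (name and version are illustrative):

```shell
# No bridge, no NAT, no veth pair: the container shares the host's
# network stack, so the database listens on the host's interfaces directly.
docker run -d --net=host --name db mysql:5.7
```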

> 4\. State in computing environment

This part is just rambling; I do not see anything specific to reply to. If the
point to make was that containers don't play nice with state, it's like saying
"processes do not play nice with state" because that's what a container is: a
Linux process. You have full control over where (pin it to DB machines only)
and how it runs, use features you need (and understand) and don't use others.
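
With Swarm mode, for instance, that pinning is a one-line constraint (the
label name here is made up):

```shell
# Only nodes an operator has labeled role=db are eligible to run this service.
docker node update --label-add role=db db-host-1
docker service create --name db \
  --constraint 'node.labels.role == db' \
  mysql:5.7
```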

> 5\. They just don’t fit major Docker features

In this part the author is basically saying that it's easy (or easier) to
install a database using configuration management tools instead of using
something like Docker. True, there is more than one way to skin a cat and
frankly you can use both a configuration management system and the containers.
I just can't see how this can be used as an argument AGAINST anything.

> 6\. Extra isolation is critical at the database layer

The author again claims that containers bring in significant overhead. That's
simply not true. I would recommend mentally replacing "container" with
"process" when you read the orchestration blogs to see right through the FUD.
Again, you can run a container from a directory on your filesystem using host
networking and it will be no different from any other process on the box.
Using a network namespace does not add any measurable difference to
performance. [1]

> 7\. Cloud platform incompatibility

The title doesn't match the paragraph of the text that follows. The author
basically claims that being provider-agnostic (one of the benefits of
containers) is not valuable. Well, he's a database administrator and it's not
valuable to _him_. But there's a huge business value of being able to run on
different infrastructures: selling $100/mo SaaS subscriptions is nice, but
when the stream of early adopters dries up and you set your aim at those nice
six-figure enterprise license contracts, you may find out that you will need
to be able to run on a VMware cluster in a corporate colo. And containers can
help.

Containers are big not because they make developers happy, they're big because
they let sophisticated companies significantly consolidate their workloads
(via dynamic scheduling) and shrink their infrastructure footprints. I
constantly get shocked by the AWS bills people share with me, and something
like Kubernetes provides quite significant material value in shrinking them.
But
another less obvious advantage is the ability to run [1] the same SaaS stack
on public and private infrastructure, opening up entirely new markets for your
company. What's your revenue from China? Ever thought about containers being
the perfect tool to penetrate The Firewall and run on your Chinese customer's
servers? Anyway, those are good reasons to finally learn and use containers.
And the reason not to? Well, not this blog post.

[1] We are [https://gravitational.com](https://gravitational.com) and some of
our customers ARE database vendors, happily running their mission critical
(everyone is mission critical in our biz) workloads on containers / Kubernetes
and deploying them into behind-firewall corporate clouds. So yes I am biased
but I'm also qualified to respond.

------
alsadi
Since you mentioned unionfs, you are not using an upstream kernel; most
likely you are blaming your non-enterprise distribution choices on container
technology.

Fact #1: Red Hat does have enterprise Docker container based solutions. Check
Project Atomic and OpenShift.

Fact #2: cloud providers like Google, Azure and Amazon do have container
based solutions.

Fact #3: CoreOS does have production grade Docker based solutions.

Fact #4: Kubernetes does support pet pods, aka stateful pods, and can get
data volumes from reliable EBS or Ceph.

------
snowwolf
Upgrading database versions is mentioned briefly in the article, but is pretty
much a show stopper at the moment for some databases.

e.g. Postgres ([https://github.com/docker-
library/postgres/issues/37](https://github.com/docker-
library/postgres/issues/37))
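
The common workaround is a dump-and-restore between containers rather than an
in-place upgrade; a sketch with illustrative container names, versions, and
paths:

```shell
# Dump everything from the container running the old major version...
docker exec old-pg pg_dumpall -U postgres > dump.sql
# ...start a fresh container on the new major version with a new data dir...
docker run -d --name new-pg \
  -v /srv/pg-9.6:/var/lib/postgresql/data postgres:9.6
# ...and restore; there is no in-place upgrade path for the old data volume.
docker exec -i new-pg psql -U postgres < dump.sql
```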

------
xen2xen1
Docker is an app store. Once you reach that point of understanding, everything
gets easier. On your Windows box or Ubuntu box you can just install whatever
you want. Docker is more like iOS or Android. When was the last time you
edited an .ini file on Android? It works or it doesn't; take it or leave it.
Or you can clear the local data or reinstall. Not much else. Docker is the
same way, at least by intent. No wonder they don't want you to store data on
it: if you had a SQL database running on Android, how much would you expect
out of it? Would you really expect it to be persistent? It's easy to install,
that's the point, but it takes away a lot of freedom, just like the app
store(s).

------
atemerev
Scaling everything else (stateless non-persistent services) is nearly trivial,
with or without Docker. It's scaling databases where things get interesting,
but there we are back to the dark ages of DBAs and manual deployment :(

------
outside1234
Of course the counter example of this is Google, which, as I understand it,
runs everything on containers. It seems like if it's good enough for Google,
then it's good enough for the rest of us.

~~~
user5994461
Google runs NOTHING on Docker. They have their internal secret proprietary
container technology.

The concept of containers is fine. The trouble lies in the implementation
that is available.

~~~
gkop
Google does publish
[https://github.com/google/lmctfy](https://github.com/google/lmctfy) at least,
though.

Edit: _did_ publish

~~~
user5994461
You realize that the first line of the README says that this project is
obsolete and they stopped developing it?

~~~
ec109685
The line before that says they are working on taking those learnings into
libcontainer.

~~~
user5994461
Meaning it's not done, it's work in progress, that will take a while.

------
tracker1
I make an exception to this: something like Redis or memcached as a localized
(or perhaps sharded) cache cluster serving the systems running on those
Docker hosts.

One of the things that bugs me about cache as a service in AWS/Azure etc. is
that you're dedicating compute nodes mostly to their memory... the big win for
memcached early on was utilizing unused memory on existing systems. You lose
that when you don't have the caching services on the same nodes as
compute/data/etc.
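
That co-location is easy to express as an explicit quota on the app hosts; a
sketch (sizes and the container name are made up):

```shell
# Cap the cache at the memory left over on the app host; memcached's -m
# flag (in MB) should sit a bit below the cgroup limit to leave headroom.
docker run -d --name cache --memory=2g memcached memcached -m 1800
```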

------
Jdam
So glad I didn't read that article before I set up a Cassandra cluster on
Docker to handle 1M requests/min in production. It might have discouraged me.

~~~
ec109685
How has that worked out? Which container orchestration system did you use, if
any?

------
jinjin2
We are considering using Realm to replicate our data out to all our Docker
instances. They would essentially work like databases local to each instance,
and if the instance dies we simply spin up another one which will replicate
out the same data again.

So far it is only at the experimentation stage for us, but it looks very
promising. It is almost like having the ultimate cache (no network latency)
right within each instance.

------
n3m8tz
Hyper-converged scale-out object storage with the Docker engine on the same
nodes could be your solution. Docker volumes support native NFS, and are
rather stable IMO; there is no magic with Docker volumes at all, and your NAS,
SAS, or distributed storage already implements all sorts of redundancy. And
if you don't want to deal with anything, just pay for DBaaS.
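
For the NFS case, that is just the local volume driver's nfs options; a
sketch with a placeholder server address and export path:

```shell
# The 'local' driver can mount an NFS export directly; Docker adds no
# extra layer, and the NAS handles redundancy as usual.
docker volume create --driver local \
  --opt type=nfs \
  --opt o=addr=10.0.0.5,rw \
  --opt device=:/exports/dbdata \
  dbdata
docker run -d --name db -v dbdata:/var/lib/mysql mysql:5.7
```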

------
hyperknot
Why is running a DB in Docker so different from running in an OpenVZ VPS?
There are millions of Wordpress websites and other PHP CMS-es hosted in OpenVZ
VPS-es, running MySQL reliably.

Also, Discourse.org's default setup is PostgreSQL hosted in a Docker
container, it also has probably 1-10 thousand live forums and is a reliable
platform.

~~~
memracom
Because OpenVZ has a philosophy of the container as a simple VM running many
processes just like a server does, but Docker is being developed as a
container which runs one process that only communicates with other processes
through defined interfaces which you configure when you create the container.
Docker makes it easier to scale up and down, and to move stuff between
servers, but you need to do more work up front.

There never is one true solution that works for everything. Typically, there
are solutions which work well for all the small stuff but not so good for a
few big things. Docker works great for all the small stuff. A mission critical
database is often the one big thing that is the exception.

Also, Docker is not the only way to handle all the small things. LXD works
well. Some companies can live on AWS Lambda Functions. Looking for the Holy
Grail of one ring to do it all and in the darkness bind them makes for an
interesting lifelong quest, but IMHO you will never get there.

~~~
hyperknot
You can use Docker as a VM. Discourse is doing so, and Baseimage Docker is the
#1 unofficial image on Docker Hub, so it means a lot of people believe it
makes sense to use Docker like this.

------
derefr
> I’ve seen DBMS containers running on the same host with service layer
> containers. But these service layers are not compatible according to
> hardware requirements.

> Putting your database inside the container, you’re going to waste your
> project’s budget. Why? Because you’re putting a lot of extra resources to
> the single instance. And it’s going out of control. In cloud case you have
> to launch the instance with 64GB memory when you need a 34. In practice some
> of this resources will stay unused.

For some software, resource consumption is of "fixed" size, plus temporary
workload-dependent growth (e.g. application-layer processes, most of the
time.) Whereas some other software will take up all the space available to it
(like DBMSes.) The latter are what resource quotas are for.

Containers are not meant to be treated like "Unix binaries but more easy to
deploy." Containers are just lightweight VMs that don't have to do screwy
things with memory balloon drivers to efficiently pack many of those "fixed
plus temp growth" workloads onto a host.

But like VMs, containers still need resource quotas to ensure they don't
thrash one-another. You _can_ avoid specifying quotas for your fixed-with-
temp-growth workloads, to "oversubscribe" a host, and it'll work (similarly to
oversubscribing memory-ballooed VMs.) But the "all the space available"
workloads _need_ quotas.
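
Setting such a quota is a couple of flags at run time; a sketch (the numbers
and image are illustrative):

```shell
# The DB gets an explicit ceiling instead of "all the space available",
# so it can't starve its neighbors on the host.
docker run -d --name db --memory=34g --memory-swap=34g --cpus=8 postgres:9.6
```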

The author might be used to public clouds, where VMs have a "size" in vCPUs +
memory and that "size" is charged for, and so might not think of picking an
instance size for a VM as explicitly setting a quota. But when you set up your
own hypervisor cluster, you still have to decide how big each VM should be,
regardless of the fact that a bigger VM doesn't "cost" anything: a VM's "size"
is the compromise you make between the needs of that workload, and the ability
to "fit" other workloads alongside it on a host.

But, to go further: if you're designing "instances" and running dedicated
workloads on them, you're very likely "doing containers wrong." (This is
probably a provocative statement; stay with me.)

Containers are to container hosts as VMs are to hypervisors: in both cases,
their architecture assumes that if you want resource-efficient deployment,
you've got a big generic _cluster_ of hosts, and your guests are loaded onto
them using a bin-packing algorithm (taking into account which guests need what
extra resources that are only available on certain hosts, etc.)
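
That bin-packing can be sketched in a few lines (first-fit decreasing over a
single memory dimension; real schedulers are multi-dimensional, and this toy
function is my own illustration, not any scheduler's actual code):

```python
def first_fit_decreasing(workloads, host_size):
    """Pack memory requests (in GB) onto the fewest hosts of host_size GB."""
    hosts = []  # free capacity remaining on each open host
    for need in sorted(workloads, reverse=True):
        for i, free in enumerate(hosts):
            if free >= need:          # first host with room wins
                hosts[i] = free - need
                break
        else:
            hosts.append(host_size - need)  # no room anywhere: open a new host
    return len(hosts)

# Six workloads (122 GB total) fit on two 64 GB hosts instead of six
# dedicated instances.
print(first_fit_decreasing([34, 16, 30, 8, 24, 10], 64))  # → 2
```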

If you don't _have_ a big generic cluster of hosts, then your only packing
options will be necessarily sub-optimal. If your container hosts are real
hardware, you're out of luck; if your container hosts are _themselves_ VMs,
running on some cloud provider, then costs will be heavily in favor of taking
advantage of the _cloud-provider 's_ bin-packing by wrapping each of your
containers in a separate VM and then deploying those VMs.

(Which is, coincidentally, what Amazon's Elastic Beanstalk does for you, and
why it's not the same as Amazon ECS. ECS is for setting up your own "big
generic cluster" of container hosts to bin-pack across; Elastic Beanstalk is
for wrapping containers in VMs so that AWS will bin-pack at their abstraction
level.)

------
halayli
Isn't that what was said when VMs came out?

But I'd say it's not ripe yet to run a DB in a container for prod use. It's
not an architecture issue but rather one of code maturity. DBs hold our
beloved data and are more sensitive to hardware/system glitches, which
exercise much less frequently used code paths in the DB.

------
holydude
We have been running databases in some form of containers for ages: Oracle DB
on Solaris Zones, DB2 in WPARs. But yeah, people have to re-implement these
ideas poorly, so now we have to deal with the consequences.

------
amelius
So let your database write a journal. Data corruption problem solved.

~~~
illumin8
And when your journal is corrupted due to a bug in the Docker volume driver?

~~~
amelius
Then you let Docker write to a virtual network drive, streaming the journal
out of the container.

~~~
KaiserPro
I doubt that's going to be fast.

However, if you are going to be running Docker on real tin (because that's
where the value/speed comes in; if you're on AWS that's a whole 'nother
issue), then you might as well use device mapper for what it was originally
designed for: mapping fibre channel (or iSCSI, or SAS [another SCSI]).

That is assuming you want speed, and have paid enough cash to overcome SPoF
in your storage layer (it'll be cheaper and faster than trying to software
your way out of it.)

