
Dockerizing MySQL at Uber Engineering - Walkman
https://eng.uber.com/dockerizing-mysql/
======
falcolas
This article shows so little understanding of the software they are using that
it frankly makes me a bit mad. Based on my experience as a MySQL DBA, I feel I
can safely say that this method of running databases does not scale, and Uber
will need to do even more engineering here before too long. Of course, that
might be mitigated by the extreme amount of data sharding Uber is doing, but
their data will only grow, and this approach will quickly start coming apart
at the seams.

1) MySQL requires one file to configure it: my.cnf. This is not exactly a huge
amount of configuration which needs to occur. Installs via Puppet, Chef or
Ansible tend to consist of two commands - one to install the package, and one
to write the my.cnf (templates are good so you can use the same command on any
sized server). You can add one more command to set up the initial users,
should you so desire.
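The templated-my.cnf idea mentioned above can be sketched in a few lines; the file layout and the 75%-of-RAM buffer pool sizing below are common illustrative conventions, not anything mandated by MySQL or the config-management tools named:

```python
# Rough sketch of the "one template, any server size" idea: render a my.cnf
# scaled to the host's memory. Variable names mirror real MySQL settings;
# the 75%-of-RAM buffer pool ratio is a popular starting point, not a rule.

MY_CNF_TEMPLATE = """[mysqld]
datadir = /var/lib/mysql
innodb_buffer_pool_size = {buffer_pool_mb}M
max_connections = {max_connections}
"""

def render_my_cnf(host_ram_mb, max_connections=500):
    """Render a my.cnf sized for a host: ~75% of RAM for the buffer pool."""
    buffer_pool_mb = int(host_ram_mb * 0.75)
    return MY_CNF_TEMPLATE.format(buffer_pool_mb=buffer_pool_mb,
                                  max_connections=max_connections)
```

A config-management tool then just installs the package and writes this rendered file, which is the two-command install being described.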

2) Multiple MySQL processes on the same host waste that host's resources. A
single MySQL instance is perfectly capable of running multiple databases, and
will respond faster because it can properly allocate the box's memory
according to each DB's usage. Multiple processes will each chomp up their
configured slice of memory, not allowing individual databases to use the
resources they need. It's always faster to serve data from memory than from
disk (even if that disk is an SSD). Worse, under-utilized DB instances will be
swapped out to disk, causing even more load and delay when they are swapped
back in.
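A back-of-envelope illustration of that memory point, with entirely made-up numbers: N fixed-size instances strand memory in the cold databases, while one shared buffer pool could shift it all to the hot one.

```python
# One hot DB and several idle ones: a per-process split strands memory that
# a single shared buffer pool would have given to the hot database.

def stranded_memory_mb(total_pool_mb, per_db_demand_mb):
    """Memory stranded by splitting one pool into equal fixed slices.

    Each of the N processes gets an equal slice; demand above a slice
    cannot borrow the unused memory sitting in the other processes.
    """
    n = len(per_db_demand_mb)
    slice_mb = total_pool_mb // n
    return sum(max(0, slice_mb - d) for d in per_db_demand_mb)

# Example: 64 GB pool, one DB wanting 40 GB of cache, three nearly idle DBs.
# The 4-way split strands ~46 GB while the hot DB is starved of ~24 GB.
```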

3) Transferring data from one host to another when you need to bring up a
stateful process does not scale. Above a few gigs, the transfer process
creates significant load on both the source and destination host, and will
easily saturate the link between the two. Neither will respond to requests
with any alacrity, meaning you typically want to take both hosts out of the
active DB pool.
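A naive estimate makes the point concrete: reseed time grows linearly with data size, and the link stays pinned for the whole duration. (The numbers and the efficiency fudge factor here are made up.)

```python
def transfer_hours(dataset_gb, link_gbit_per_s, efficiency=0.7):
    """Naive copy-time estimate for reseeding a replica over one link.

    efficiency is a fudge factor for protocol overhead and disk stalls.
    """
    gbits = dataset_gb * 8
    seconds = gbits / (link_gbit_per_s * efficiency)
    return seconds / 3600

# A 2 TB dataset over a 10 Gbit link at 70% efficiency is ~40 minutes of a
# saturated NIC on *both* hosts -- hence pulling both from the active pool.
```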

4) The DB restoration process from copying over the raw files can take 10+
minutes, depending on how many dirty pages existed on the source. The
restoration process will go faster with logical dumps, but logical dumps will
take longer to generate, transfer, and load.

To reiterate, these are problems for those companies running at scale, with
terabytes of data and dozens (or more) DB servers. When you're running a few
GB of data, a DB container is probably going to work fine. Just don't believe
for a second that you can scale it the same way you do your web frontend
services.

It pays to hire experts. How much time and money has Uber sunk into working
around their DB, instead of with it?

~~~
liveoneggs
5) configuration and version drift is guaranteed since you have to restart
databases to make any changes

5a) ignores MySQL's ability to make _many_ runtime config changes without a
restart
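Point 5a can be sketched as a tiny planner: split a desired config into statements that can be applied online via `SET GLOBAL` versus settings that genuinely need a restart. The variable classification below is illustrative, not an exhaustive copy of MySQL's dynamic-variable list:

```python
# Many MySQL server variables are dynamic and changeable at runtime with
# SET GLOBAL -- no restart (and no new container image) required. The set
# below names a few real dynamic variables but is not exhaustive.

DYNAMIC_VARS = {"max_connections", "innodb_buffer_pool_size", "slow_query_log"}

def runtime_changes(desired):
    """Split a desired config into online statements vs. restart-only vars."""
    online, needs_restart = [], []
    for name, value in sorted(desired.items()):
        if name in DYNAMIC_VARS:
            online.append(f"SET GLOBAL {name} = {value};")
        else:
            needs_restart.append(name)
    return online, needs_restart
```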

6) hard-codes master/slave relationships into the boot process (at least if
I'm understanding the config json)

7) adds additional risk (noted as docker crashes) and client issues (noted as
"userland proxy" comment)

8) requires additional ops overhead (noted as special care needed for masters)

---

But if you already run _everything_ in docker and really want to get rid of
puppet, then why not?

---

this article links to Ansible but not Puppet and is the only blog post tagged
with either one; I think this really boils down to Puppet vs $x

~~~
agentgt
I find it ironic that one of Uber's original motivations for migrating off
Postgres (which I still have serious doubts about) was not being able to
_"trust"_ the system [1].

Given my limited, failed experience playing with Docker + Postgres (and
various volume stores, including Flocker), I have continued doubts about
Uber's choices.

I suppose they are having success but I wonder at what cost.

Specifically the quote on [1]:

 _Finally, our decision ultimately came down to operational trust in the
system we’d use, as it contains mission-critical trip data. Alternative
solutions may be able to run reliably in theory, but whether we would have the
operational knowledge to immediately execute their fullest capabilities
factored in greatly to our decision to develop our own solution to our Uber
use case. This is not only dependent on the technology we use, but also the
experience with it that we had on the team._

[1]: [https://eng.uber.com/schemaless-part-one/](https://eng.uber.com/schemaless-part-one/)

------
Ruphin
Maybe this will help dispel some of the myth that Docker is useless for any
scenario where you need persistent state. There are still many advantages to
wrapping your database in a container, and this post by Uber explains really
well how and when to use this technique.

We have been running our databases using this same methodology for almost two
years now, including automatic initialization for new clusters, auto-
configuration of master-slave and federation setup, and automatic
failover/recovery. It doesn't solve all the problems, and we still need to
manually solve issues (mostly because we use MongoDB and ran into several bugs
and 'features' over the years), but for standard operational work it saves a
ton of work. Expanding our cluster by adding a new replicated shard is no more
than 2 minutes of work in changing the master config, everything else is fully
automated.
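The automatic failover/recovery being described usually reduces to: when the primary is unhealthy, promote the most caught-up replica. A toy sketch, with all names and the lag metric hypothetical:

```python
# Minimal failover policy: promote the replica with the lowest replication
# lag when the primary is down. Real systems also fence the old primary and
# repoint the surviving replicas; that is omitted here.

def pick_new_primary(replicas):
    """replicas: {name: replication_lag_seconds}. Lowest lag wins."""
    return min(replicas, key=replicas.get)

def failover(primary_healthy, replicas):
    """Return the replica to promote, or None if no action is needed."""
    if primary_healthy or not replicas:
        return None
    return pick_new_primary(replicas)
```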

~~~
user5994461
> Maybe this will help dispel some of the myth that Docker is useless for any
> scenario where you need persistent state.

Docker is useless for anything that isn't stateless.

You can run a database in Docker... until Docker's assumptions come back to
bite you.

> There are still many advantages to wrapping your database in a container,
> and this post by Uber explains really well how and when to use this
> technique.

Well. Run databases in Docker if you've got 3000 clusters and your only
concern [that you created yourself] is to run multiple of them on the same
hosts.

------
memracom
Database servers, whether MySQL or PostgreSQL or NoSQL ones like CouchDB
should never be dockerized. This is a use case where it is inappropriate to
use Docker at all. This is a case where the db server should use the entire
resources of a single server and for managing that server (and its replicas or
other cluster members) you use a tool like Ansible or Chef or Puppet. And you
need to learn that management tool well, because you will be pushing the
limits of server automation with a database server.

If you think that a database server is where you put your "application's
database" then I question why you have a db server at all. That is the use
case SQLite is for. Database servers are for the use case of storing the
business's mission critical data so that all the applications of the business
can access needed data.

I love Docker and use it a lot, but not for everything.

~~~
otterley
I wouldn't go so far as to say they should "never be containerized," but doing
so doesn't solve any of your problems other than a packaging problem, and it
arguably creates problems you might not have had before.

It's my experience that most of the impetus behind Docker is driven by
developers who need it on their Mac/Windows workstations to run Linux
processes, and who (for good reason) want to ensure that their development
environment is as similar to the production environment as possible. In other
words, the developers are driving the production environment, instead of the
production engineers driving the development environment. This leads to
friction when the production environment has its own unique set of constraints
and needs that Docker and containers aren't quite yet a match for.

~~~
memracom
I also develop on OSX or Windows systems, targeting Linux as the deployment
system. And I also want to test the code in an environment as close to
production as possible. That is why I install Virtualbox and set up virtual
Linux servers using the exact same server management scripts (in my case
Ansible) as are used to configure production servers. It works great with no
Docker. But I do use Docker as well for servers that are more cookie cutter
than a db. For instance a JVM app server that has two Docker images, one with
3 layers culminating in the JVM app, and one with 2 layers culminating in an
NGINX SSL proxy.

And I don't run Docker on OSX at all, because it is just a wrapper for running
Virtualbox with a Linux server inside. It is simpler to just set up Virtualbox
directly and use Linux virtual servers directly.

------
drej
I love these posts for several reasons.

1. People think of companies like Uber primarily as providers of a regular
service to customers (similarly to how Airbnb or Netflix is perceived), but
it's interesting to see the engineering chops needed to maintain this
operation.

2. Given the relative youth of the company, the stack employed is quite
modern and often uses cutting-edge technologies in production, with real
impact on customers. Everyday users of these technologies usually only get
'textbook' explanations, but blog posts like these allow them to actually
find out how these can be used in production, what the caveats are, how they
interplay with other parts of the stack, etc.

3. Some people complain that when a startup gets bought or implodes, its
technology virtually disappears instead of benefitting the wider developer
community for learning purposes (I'm not here to argue what the best practice
should be, just stating an observation of a common complaint). By continually
describing their engineering practices (similarly to Airbnb in their
engineering blog) and open sourcing their non-business technology (e.g.
go-torch), their experience lives on beyond the life of the company.

Last but not least, it builds a positive ethos about this company in the
developer community.

~~~
gaius
_but it's interesting to see the engineering chops needed to maintain this
operation_

No. Uber has clearly hired too many engineers and has too little for them to
do, leaving them free to come up with Rube Goldberg contraptions like this
one. This is not anything any company that isn't swimming in money should even
contemplate (and neither is slapping a layer on MySQL and claiming to have
invented a "new datastore"). The kids are in charge of the nursery over at
Uber and they are in desperate need of a good, adult CTO...

~~~
privateprofile
> The kids are in charge of the nursery over at Uber and they are in desperate
> need of a good, adult CTO...

That's a proper description of most startups I've had contact with recently,
namely (well) funded startups.

It's a consequence of having inexperienced (and, sometimes, untalented) people
running functions at a company.

It also applies to other areas like sales ("my inside sales script is killer,
even though I haven't done professional sales anywhere before and just found
out what inside sales is") or hiring ("I'll just hire my equally inexperienced
buddies from college, they were great there").

But when it comes to kids building technology with "college playground"
quality, it's too evident not to notice. The product ran fine when their
buddies were testing, but it's unstable, unmaintainable and in need of a
complete redesign when real world traffic comes along. All things that could
have been prevented with a little bit of competence and experience.

It's a kind of mantra for people that take VC blogs as gospel and engineering
blogs as the 10 commandments. Don't go to college, don't get any real world
experience: start a company and, if it actually survives more than a couple of
years, someone will inherit your technical debt.

This investor speak will quickly change when you get funded and have to attend
board meetings to tell investors that your product is still not working, or
you need twice the team that would be required for your goals, or you need
months to produce basic business data...

------
pmlnr
> Running containerized processes makes it easier to run multiple MySQL
> processes on the same host in different versions and configurations.

You're doing it wrong. One doesn't simply run multiple DB servers on the same
iron.

~~~
saturn_vk
Why is that? Each process is using a separate disk on the machine, and if
that's the case, where would you get potential problems?

~~~
merb
And what does Docker give you that cgroups would not?

I mean they write:

    Initially, all our clusters were managed by Puppet

Of course Docker won't actually replace such systems. Uber has the worst
engineering practices I've ever seen.

They replaced Puppet with a handcrafted tool that uses Docker and call that
Dockerizing MySQL, just wow. I wonder when they'll run out of money. (And the
best thing is probably that they run on AWS and could just use an AMI, but
since that isn't written in the article I would not create such assumptions.)

~~~
ec109685
> I would not create such assumptions

But you assume they have the worst engineering practices based on a few blog
posts?

------
eggie5
A lot of what they are describing in the first half sounds like Kubernetes. I
wonder, if they could do it again, would they adopt it...?

~~~
hectaman
agreed. The newer "Pet Set" construct in k8s might simplify this quite a bit:

[http://kubernetes.io/docs/user-guide/petset/](http://kubernetes.io/docs/user-guide/petset/)

~~~
eggie5
Or, as it was renamed last month: "StatefulSet"
[https://github.com/kubernetes/kubernetes/issues/35534](https://github.com/kubernetes/kubernetes/issues/35534)

Mostly what I was commenting on was how they are building some type of
declarative infrastructure where they define the topology they want and then
the system will build it. This was screaming k8s to me.
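The declare-a-topology pattern being pointed at reduces to a reconcile loop, which is also the core idea behind k8s controllers. A toy version, with all state simulated:

```python
# One pass of a reconcile loop: compare declared topology against observed
# state and emit the actions needed to converge. Real controllers run this
# repeatedly, tolerating partial failure between passes.

def reconcile(desired, actual):
    """desired/actual: sets of instance names. Returns ordered actions."""
    to_start = sorted(desired - actual)
    to_stop = sorted(actual - desired)
    return [("start", n) for n in to_start] + [("stop", n) for n in to_stop]
```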

------
VLM
Most of the assumptions in the comments boil down to: it's a bad idea if
you're resource constrained, because you'll get much better performance when
everything's on one big server with shared RAM.

However, possibly, their engineering goal was detecting and isolating hotspots
and minimizing debug/downtime effort. And possibly they have an infinite pile
of cash or at least they're not resource constrained. In that case it might
make sense to fragment massively.

For example, let's say database #235 is using 75% of a shared server. Then at
least in the very short term migrate the numerous other databases at the
container (or image) level to other servers so the overloaded database doesn't
flood out every other business system you support.
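That hotspot-isolation idea as a sketch: if one container dominates a host, evacuate the others so the hot one cannot starve them. The 50% threshold and all numbers here are arbitrary illustrations.

```python
# Pick the containers to migrate away from a host where one database has
# become a hotspot. Container-level (or image-level) migration is what makes
# this cheap compared to moving bare-metal installs.

def evacuation_plan(loads, hot_threshold=0.5):
    """loads: {container: fraction_of_host_resources}.

    If any container exceeds the threshold, return the sorted list of the
    *other* containers to migrate to different hosts."""
    hot = [c for c, f in loads.items() if f > hot_threshold]
    if not hot:
        return []
    return sorted(c for c in loads if c not in hot)
```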

Now the argument is that if you ran MySQL on bare metal then you'd have more
memory and maybe only use 25% of that shared server, but sooner or later
you'll get a big enough flood of traffic that you'll want to segment off, and
your operations team understands Docker at 2am but not MySQL DBA stuff quite
as much. Sometimes it's nice to have everything use the same standard, and
"everything lives in Docker" might work.

This also has fascinating forensic and QoS implications where you can do
interesting snapshot and cloning tricks affecting precisely and exactly one DB
at a time.

~~~
korzun
It's almost 2017. The scaling choices are not limited to Docker vs. bare
metal.

With their budget, they should be able to clone and scale metal servers
within minutes if they wanted to.

This is like Tumblr spending $50,000 a month on their developer AWS instances
while not making any money. Cool, but not very practical.

~~~
VLM
Bare metal is operationally very expensive because someone has to monitor it,
back it up, patch it, test it, fix the software when it breaks at 2am, fix the
hardware when it breaks at 2am, explain to security that it's either full of
confidential stuff or not, decommission it years later, and none of that is as
standardized as virtualized or containerized stuff.

Also "in the olden days" we had prod servers but shared dev and test servers.
Nowadays you just spin up resources, assuming your ops is flexible enough.
I'll spin up a test image to eliminate one bug and then destroy it, no sweat.
In theory you can do that with metal but the accounting must be weird. You
could make a pool of bare metal test boxes for people to use as they please, I
guess...

I also like spinning up new images for software upgrades. Oh, a new version of
the database, here's a new image, test it out.

Some of it is organizational hacking. After a few legendary disasters
procedures will be formulated where standing up new iron takes
interdepartmental meetings and signing off with the network and security and
ops teams and the data center guy has to sign off on the thermal and
electrical loads and power points all over the place. In comparison, you
wouldn't make someone changing a cell in a spreadsheet go thru all that,
right? So you deploying a virtual image is just clicking a harmless little
button, as long as you operate under a blanket agreement with ops, infosec,
networking, etc... At least until enough legendary disasters inevitably happen
that clicking "create" on a virtual image requires weeks of time and at least
4 signatures and 3 departmental meetings of micromanagement. Hopefully we'll
invent something new by then.

------
amenod
> The MySQL data directory is mounted from the host file system, which means
> that Docker introduces no write overhead

If I understand correctly, Docker doesn't introduce any write overhead,
mounted from host or not. This is the main difference from VMs - syscalls are
just being "forwarded" to kernel. Or am I missing something?

Still, I am only arguing with the stated reason. Of course you don't want to
save state inside an (ephemeral) container.

~~~
otterley
There's not even a forwarding process. Containers are simply processes running
in a set of one or more different "namespaces." A mount point in a container
is just an ordinary mount point from the kernel's perspective. In this case,
the mount point is relative to the container's root filesystem set via
chroot(2).

------
kennethh
It does not say so in the article, but I guess they have a database cluster
for each city or something like that, which makes sense since a user does not
care about Uber cars in a different city. Do they use GPS to put the Uber car
in the right cluster? The users move around more, but the cars are more
static, so they are centralized somehow?

~~~
elcct
Disclaimer: I have never used Uber.

What if a user orders an Uber from one city to another? If the driver is then
in another city, wouldn't it be nice to give him some rides in that city, or
for when someone wants to go back to the original city? What about cities
that are near a border? What if a user would like to go shopping in another
country? (Not uncommon in Europe. A long time ago I used to take a local bus
from Poland to Germany, do some shopping, and go back.)

~~~
icebraining
Sounds fairly simple (which is not to say easy to implement); you keep the car
tied to the database of origin for the duration of the ride (other users don't
need to know about it while it's occupied), then you can migrate the car to
the other DB when it's idle.
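That scheme as a toy state machine (the data model is entirely hypothetical): a car stays pinned to its origin database while occupied, and is re-homed only once idle.

```python
# A car keeps its home database for the duration of a ride; once idle in a
# different city, it is migrated to that city's database.

def rehome_if_idle(car):
    """car: dict with 'home_db', 'current_city', 'occupied' keys."""
    if not car["occupied"] and car["current_city"] != car["home_db"]:
        car["home_db"] = car["current_city"]
    return car
```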

------
throwbsidbdk
It sounds like Uber has a database cluster for roughly every employee?!! Uber
has a single product that largely centers around 1 app. How can this be
necessary?

I have a feeling... That they're dumping realtime GPS data into a bunch of
these when they should be using something like Cassandra...

~~~
Pawka
"Single product" which consists of more than X thousands of micro-services.

~~~
throwbsidbdk
>The entire trip store, which receives millions of trips every day, now runs
on Dockerized MySQL databases together with other stores

It really does sound like this is part of the issue. Millions of trips means
billions of DB rows per day. A document store is much more amenable to that
kind of workload than MySQL.

~~~
simonw
Uber essentially built their own custom document store on top of MySQL. They
explain their design and reasons (and why they didn't use Cassandra etc.) in
this post: [https://eng.uber.com/schemaless-part-one/](https://eng.uber.com/schemaless-part-one/)

~~~
throwbsidbdk
Okay thanks, that explains the why but doesn't make it sound less terrible.
They essentially built a nosql database on top of MySQL. Forgivable years ago
but this was in 2014...

------
user5994461
I guess we should train the new generation better so they don't end up like
Uber... so, a tip for any company that gets incredibly successful:

The first thing one has to do is to drop SQL databases as the main data
source.

The usual choice is to move to Cassandra. It has built-in sharding AND backup
AND multi-master replication AND multi-datacenter support, AND performance
scales linearly with the number of servers.

The [only] other option is ElasticSearch (which has slightly different
properties regarding data format and data consistency).

You don't need to craft complex custom sharded distributed WTF software to
abstract hundreds of clusters of hundreds of databases. Use the right tool
for the job, one that has that built in.
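The "built-in sharding" being invoked here is, at its core, consistent hashing of the partition key onto a token ring. A stripped-down illustration (real Cassandra uses Murmur3 tokens and virtual nodes; MD5 and this tiny ring are for demonstration only):

```python
import hashlib

def token(key):
    """Hash a string onto the ring (MD5 here purely for illustration)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

def owner(key, nodes):
    """Map a key to the node owning the next token clockwise on the ring."""
    t = token(key)
    ring = sorted((token(n), n) for n in nodes)
    for node_token, node in ring:
        if t <= node_token:
            return node
    return ring[0][1]  # wrap around past the highest token
```

The payoff is that every client can compute a key's owner locally, with no central routing tier, which is exactly the "complex custom sharded software" this scheme avoids hand-building.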

~~~
pg314
Dropping the RDBMS and replacing it with something like Cassandra is
fashionable, but it sounds like bad advice to me.

A quick google gives these stats for Uber[1]:

- 8 million users
- 160 thousand drivers
- 1 million daily rides
- 2 billion rides so far

The biggest data, as stated in [2], is the trip info. They are storing the
trip info as a 20kB JSON blob. If you add a custom data type [3] for a trip to
PostgreSQL and encode it efficiently (binary, using deltas), you should be
able to do much better than the ~3kB they get by using messagepack and zlib,
say 1kB.
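The delta-encoding suggestion works because GPS traces are smooth: storing differences between consecutive fixed-point coordinates yields many small integers that compress far better than absolute values. The encoding details below are made up for illustration:

```python
# Delta-encode a GPS trace stored as fixed-point integers (1e-5 degrees).
# Consecutive fixes differ by tiny amounts, so the deltas are small integers
# that a varint or general-purpose compressor shrinks dramatically.

def delta_encode(points):
    """points: list of (lat, lon) fixed-point integer pairs."""
    out, prev = [], (0, 0)
    for lat, lon in points:
        out.append((lat - prev[0], lon - prev[1]))
        prev = (lat, lon)
    return out

def delta_decode(deltas):
    """Inverse of delta_encode: running sums recover the absolute trace."""
    out, lat, lon = [], 0, 0
    for dlat, dlon in deltas:
        lat += dlat
        lon += dlon
        out.append((lat, lon))
    return out
```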

All that data should easily fit into 4TB. At one million rides per day, one
reasonably beefy RDBMS server should be able to easily handle that. You can
partition the trips table and move trips that are older than, say, 1 month to
a slower disk to save money on disks. That helps with schema migration too:
new data uses the new schema, and you use a view on the old data to adapt it
to the new schema.

I very much doubt you need anything much fancier than that. It would help if
people really learned their tools (in this case PostgreSQL) instead of jumping
to a different technology at the first problem they run into.

[1] [http://expandedramblings.com/index.php/uber-statistics/](http://expandedramblings.com/index.php/uber-statistics/)

[2] [https://eng.uber.com/trip-data-squeeze/](https://eng.uber.com/trip-data-squeeze/)

[3]
[https://www.postgresql.org/docs/current/static/xtypes.html](https://www.postgresql.org/docs/current/static/xtypes.html)

~~~
user5994461
As a rule of thumb, putting 5TB of data - that is growing exponentially - in
a single box is always a terrible idea.

You're gonna hit all kinds of limits with the RDBMS software and the special
hardware that will be required.

Cassandra will handle the sharding automatically, it will have multiple
instances with always one available for your applications, it will handle
replication across datacenters all around the world, it will have dependable
performance that scales linearly with the specs you give it, it will allow
you to do maintenance while online, and it will let you add, remove, and
refresh nodes.

A bigger RDBMS has none of these qualities. It's a one trick pony that will
die when either the hardware or the software reaches a limit, and then your
whole site will be down. Even if you know what needs to be done to avoid the
disaster, you can't do it, because the RDBMS is a SPOF and every maintenance
you perform is called "downtime".

Short term = Postgres/MySQL because it's easier and it gets the job done.
Long term = Cassandra because it's dependable.

~~~
pg314
> As a rule of thumb, putting 5TB of data - that is growing exponentially -
> in a single box is always a terrible idea.

Storage capacity is _also_ growing exponentially: Samsung is shipping a 15TB
SSD (albeit at $10000), Seagate has previewed a 60TB SSD.

> You're gonna hit all kind of limits with the RDBMS software and the special
> hardware that will be required.

The limits of RDBMSs are well understood. You don't need any fancy hardware
for this use case. A couple of Xeons, as much RAM as you can afford, and a
RAID of SSDs (or possibly spinning disks; it doesn't look like Uber is doing
anything too fancy).

SPOF: you have slave replicas running that can take over if something goes
wrong with the master.

You don't have to take down an RDBMS to do maintenance. DDL statements are
transactional in PostgreSQL.

The automatic sharding of Cassandra is nice, when you need it. Of course,
Uber's use case seems like it lends itself to easy geographic sharding when
you're using an RDBMS, if needed.

In the end, I'd rather deal with a mature, well-understood technology like an
RDBMS than a 5-year-old technology like Cassandra (release 1.0 in 2011). You
obviously prefer the opposite. To each their own.

~~~
user5994461
If you can't get the hardware on AWS, Google, or SoftLayer, I'd consider that
exotic enough.

Don't get me wrong. I know vertical scaling and I've done it before. I'd take
an old school DBA who understands Oracle over a random junior speaking only
NoSQL to everything.

For the majority of use cases (including where I am now), it's easier to pick
the right technology (Cassandra), even if we have to learn it and later teach
it around, than it is to find someone who can really do 10TB PostgreSQL and
spread the knowledge.

Of course, if you have extensive experience with PostgreSQL, that may skew the
choice heavily to the other direction ;)

~~~
pg314
> Of course, if you have extensive experience with PostgreSQL, that may skew
> the choice heavily to the other direction ;)

I have more experience with PostgreSQL than Cassandra, for sure. And I've had
negative experiences with people trying to push Cassandra where it was totally
not appropriate (small problem and no need for high availability). They had no
experience with Cassandra themselves beyond watching a video and doing some
tutorials and couldn't answer basic questions about the underlying technology.
That might be skewing my perception too.

------
praveenster
Is this somewhat similar to the containerized MySQL used at YouTube that was
open sourced by Google? Vitess
([http://vitess.io/overview/](http://vitess.io/overview/))

------
didip
I think the back story here is that they are running MySQL as sharded key-
value storage.

And at 2300 instances, it makes sense that there would be a few of them dying
at any given time.

With this setup, using MySQL is pretty much arbitrary; any database could
fill the role as well. E.g. it got me thinking that Redis+Sentinel could do
the job too (if they don't need super durable transactions).

------
neuroid
_Screenshot from our management console._

This looks like an amazing tool.

