Dockerizing MySQL at Uber Engineering (uber.com)
117 points by Walkman on Nov 28, 2016 | 102 comments



This article shows so little understanding of the software they are using that it frankly makes me a bit mad. Based on my experience as a MySQL DBA, I feel I can safely say that this method of running databases does not scale, and Uber will need to do even more engineering here before too long. Of course, that might be mitigated by the extreme amount of data sharding Uber is doing, but their data will only grow, and this approach will quickly start coming apart at the seams.

1) MySQL requires one file to configure it: my.cnf. This is not exactly a huge amount of configuration which needs to occur. Installs via puppet, Chef or Ansible tend to consist of two commands - one to install the package, and one to write the my.cnf (templates are good so you can use the same command on any sized server). You can add one more command to set up the initial users, should you so desire.
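To make the template point concrete, here's a minimal sketch in Python rather than an actual Jinja2/ERB template; the option names are real MySQL settings, but the 75%-of-RAM heuristic, the file layout, and the defaults are illustrative assumptions, not anyone's production config.

```python
from string import Template

# Illustrative my.cnf template -- values and the 75%-of-RAM buffer pool
# heuristic are assumptions for the example, not a recommended config.
MY_CNF = Template("""\
[mysqld]
datadir=/var/lib/mysql
innodb_buffer_pool_size=${buffer_pool_mb}M
max_connections=${max_connections}
server-id=${server_id}
""")

def render_my_cnf(host_ram_mb: int, server_id: int, max_connections: int = 500) -> str:
    """Render the same template for any sized server: the only per-host
    sizing input that matters here is how much RAM the box has."""
    return MY_CNF.substitute(
        buffer_pool_mb=int(host_ram_mb * 0.75),  # leave headroom for the OS
        max_connections=max_connections,
        server_id=server_id,
    )

print(render_my_cnf(host_ram_mb=64 * 1024, server_id=3))
```

This is the whole trick behind "the same command on any sized server": the template is fixed, and only a couple of host facts vary.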

2) Multiple MySQL processes on the same host waste that host's resources. A single MySQL instance is perfectly capable of running multiple databases, and will respond faster because it will properly allocate the box's memory according to each DB's usage. Multiple processes will each chomp up their configured bit of memory, not allowing individual databases to use the resources they need. It's always faster to serve data from memory than from disk (even if that disk is an SSD). Worse, under-utilized DB instances will be swapped off to disk, causing even more load and delay as they are swapped back in.

3) Transferring data from one host to another when you need to bring up a stateful process does not scale. Above a few gigs, the transfer process creates significant load on both the source and destination host, and will easily saturate the link between the two. Neither will respond to requests with any alacrity, meaning you typically want to take both hosts out of the active DB pool.
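Rough numbers make the saturation problem concrete. This sketch assumes a hypothetical 500 GB data set and a dedicated 10 Gbit/s link at ~80% effective throughput; none of these figures come from the article.

```python
def transfer_seconds(dataset_gb: float, link_gbit_per_s: float,
                     efficiency: float = 0.8) -> float:
    """Seconds to stream a data set over a link, assuming the copy gets
    `efficiency` of the raw link rate (protocol overhead, etc.)."""
    bits = dataset_gb * 8 * 1e9  # decimal GB -> bits
    return bits / (link_gbit_per_s * 1e9 * efficiency)

# A 500 GB copy over a 10 Gbit/s link at ~80% utilization:
secs = transfer_seconds(500, 10)
print(f"{secs / 60:.1f} minutes")  # both hosts busy serving the copy the whole time
```

Minutes of a saturated NIC per copy is tolerable for the occasional replacement, but not as a routine scaling mechanism.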

4) The DB restoration process from copying over the raw files can take 10+ minutes, depending on how many dirty pages existed on the source. The restoration process will go faster with logical dumps, but logical dumps will take longer to generate, transfer, and load.

To reiterate, these are problems for those companies running at scale, with terabytes of data and dozens (or more) DB servers. When you're running a few GB of data, a DB container is probably going to work fine. Just don't believe for a second that you can scale it the same way you do your web frontend services.

It pays to hire experts. How much time and money has Uber sunk into working around their DB, instead of with it?


I feel like a lot of their engineering blog posts are patting themselves on the back a bit prematurely. Here's an article I wrote last year about a blog post they put up about their approach to geofencing which I was not particularly impressed with https://medium.com/@buckhx/unwinding-uber-s-most-efficient-s...


Interesting article - thanks for sharing it. I've had some limited involvement as a GIS expert brought in to assist the IT department in dealing with a geofencing solution they agreed to use with various ride sharing companies that use the facility I am employed at.

I can't say much but there were multiple issues that could have been easily avoided with a basic understanding of GIS - I have no doubt that some of it was due to the IT department at my employer but am also incredibly surprised that any GIS staff at the ride share companies would have agreed to what was implemented.


Really interesting article. I'm also a bit naive when it comes to geofencing (but, then again, I'm not engineering at a $50bn company).

Waste of existing talent is unfortunately a way of life at large corporations :( .


5) configuration and version drift is guaranteed since you have to restart databases to make any changes

5a) ignores MySQL's ability to make many runtime config changes without a restart

6) hard-codes master/slave relationships into the boot process (at least if I'm understanding the config json)

7) adds additional risk (noted as docker crashes) and client issues (noted as "userland proxy" comment)

8) requires additional ops overhead (noted as special care needed for masters)

---

But if you already run everything in docker and really want to get rid of puppet, then why not?

---

this article links to ansible but not puppet and is the only blog post tagged with either one; I think this really boils down to puppet vs $x


I find it ironic that one of Uber's original motivations for migrating off Postgres (which I still have serious doubts about) was not being able to "trust" the system [1].

Given my limited (and failed) experience playing with docker + postgres (and various volume stores, including flocker), I have continued doubts about Uber's choices.

I suppose they are having success but I wonder at what cost.

Specifically the quote on [1]:

Finally, our decision ultimately came down to operational trust in the system we’d use, as it contains mission-critical trip data. Alternative solutions may be able to run reliably in theory, but whether we would have the operational knowledge to immediately execute their fullest capabilities factored in greatly to our decision to develop our own solution to our Uber use case. This is not only dependent on the technology we use, but also the experience with it that we had on the team.

[1]: https://eng.uber.com/schemaless-part-one/


> But if you already run everything in docker and really want to get rid of puppet, then why not?

And you know, I wouldn't argue too much at the thought of running MySQL out of a container. It does avoid a lot of basic issues around upgrades, versioning, stack split, etc. It's the act of treating it like all other containerized software which gets my panties in a twist.

Stateful systems can't be treated like stateless systems if you want to maintain anything resembling reasonable uptime and performance.


> But if you already run everything in docker and really want to get rid of puppet, then why not?

Docker does not replace puppet, and they clearly wrote that they developed a system just for that.


I think there is a cyclicality to software development. At some point it moves toward greater abstraction, and then when it reaches a certain point it becomes more about getting closer to the metal to improve performance, and then the cycle skews back toward abstraction again.


Typically exacerbated by lack of measurement for either developer productivity or application performance. Without measurement, you're just firing away at targets in the mist.


> 3) Transferring data from one host to another when you need to bring up a stateful process does not scale. Above a few gigs, the transfer process creates significant load on both the source and destination host, and will easily saturate the link between the two...

It depends. @falcolas, could you point out a few solutions other than that? (for resharding live, other than replication / copy / rsync / ...)


Copying data is always going to be expensive, but it can't be avoided.

The lightest weight solution I've seen is restoring from a daily backup in something like S3, then setting up as a slave from a live master to catch up on the day's binlogs. Still a lot of data to move and load, but at least it's not the entire contents of the DB.

The best you can do is be in control of when data transfers happen so you're doing it when it makes sense and not in the middle of your highest traffic period (which is what frequently happens when attempting to automatically scale DBs in response to load).


That sounds a bit like what Joyent is doing in their Autopilot Pattern implementation for MySQL: https://www.joyent.com/blog/dbaas-simplicity-no-lock-in


I'd add that one can use throttling to bring new databases in live, while sustaining peak traffic.

Some databases have a configurable limit in MB/s for replication, or it's possible to assign disk/CPU quota on the slave to slow it down.

Combine that with good planning and monitoring, and you'll be fine =)
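As a sketch of the trade-off (the shard size and rate cap below are made-up numbers, not from the thread):

```python
def copy_hours(dataset_gb: float, rate_mb_per_s: float) -> float:
    """Hours to bring a new replica's data over when the copy runs at
    `rate_mb_per_s` (e.g. a replication or backup-stream rate limit)."""
    return (dataset_gb * 1000) / rate_mb_per_s / 3600

# Hypothetical 2 TB shard: unthrottled on a ~1 GB/s disk vs. capped
# to protect the donor host's live traffic.
print(f"unthrottled: {copy_hours(2000, 1000):.1f} h")  # fast, but hammers the source
print(f"at 40 MB/s:  {copy_hours(2000, 40):.1f} h")    # slow, but live queries unaffected
```

The throttle buys you predictability: the copy takes most of a day, but you never knock the donor out of the serving pool.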


Additionally, the article appears to falsely assume that one can start a new container with a different and arbitrary MySQL version and reuse the data volumes from the previous container. From experience I know that this is not always the case: upgrades might work, but downgrades often will not. Even in the case of upgrades, various DB schema changes will be needed.


In my experience it's generally safe to upgrade (and usually even downgrade) across post-GA point releases of the same major version, i.e. 5.6.22 -> 5.6.31 or vice versa.

You should always run mysql_upgrade to be safe, and I'd assume Uber's automation does this. But you probably won't encounter problems if you don't.

mysql_upgrade typically just makes some quick changes to the system schema ("mysql" database), not to other data. And it typically only needs to change things across major version upgrades (5.5->5.6) or pre-GA releases; it's rare to make a change in a point release of a major version much after the GA release.
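The point-release rule above can be sketched as a tiny helper; this is an illustrative heuristic, not an official MySQL compatibility guarantee, and you'd still run mysql_upgrade and test either way.

```python
def same_ga_series(a: str, b: str) -> bool:
    """True when two MySQL versions are point releases of the same major
    series (e.g. 5.6.22 and 5.6.31), where moving between them in either
    direction is usually safe per the comment above."""
    return a.split(".")[:2] == b.split(".")[:2]

assert same_ga_series("5.6.22", "5.6.31")      # point releases: usually fine
assert not same_ga_series("5.5.40", "5.6.22")  # major upgrade: expect schema changes
```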

That all said, I don't ever advocate using Docker for MySQL in production at this time; see my longer comment in another subthread below for reasoning.


> In my experience it's generally safe to upgrade (and usually even downgrade) across post-GA point releases of the same major version

My experience with this mirrors the parent's. The more data you have, and the more obscure engine features you use, the more likely breakage is to happen. It only takes one major failure to make you want to test the daylights out of any upgrades or downgrades.


I agree; I wasn't advocating not testing. My point was more that the parent was overstating the risk, particularly around things like required schema changes, or data volumes no longer being usable. Those particular things aren't issues in my experience (for post-GA point releases), and I've worked at crazy scale.

The more common classes of problem are performance degradation, config options being added or renamed, new features being buggy in rare edge cases, etc. Not things particular to InnoDB's storage format or things that would prevent a volume from being usable with a different version.


> 1) MySQL requires one file to configure it: my.cnf. This is not exactly a huge amount of configuration which needs to occur. Installs via puppet, Chef or Ansible tend to consist of two commands - one to install the package, and one to write the my.cnf (templates are good so you can use the same command on any sized server). You can add one more command to set up the initial users, should you so desire.

Really? The mysql System Database is filled with tables that hold important configuration information. You can say "add one more command to set up the initial users," but if you have a lot of grants, you've got six levels of database privileges -- global, database, table, host, stored procedure/functions and proxies -- that all need to be configured. You've got the event scheduler. If you have UDFs that you expect on every server, those need to be configured. None of this is in my.cnf. Maybe at the scale you've worked at, you've never had issues with any of those things in a three-command puppet/Chef/Ansible setup, but maybe Uber has problems at a scale that you don't have, rather than them not understanding how MySQL is configured?
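A sketch of why user setup is more than one statement; the account and schema names below are hypothetical, and real deployments multiply this across many accounts.

```python
# Hypothetical grants spanning several of MySQL's privilege scopes.
# Names are made up for illustration; the point is that "one command
# for users" understates the surface area.
def grant(privs: str, scope: str, user: str) -> str:
    return f"GRANT {privs} ON {scope} TO {user};"

statements = [
    grant("REPLICATION SLAVE", "*.*", "'repl'@'10.0.%'"),        # global level
    grant("SELECT, INSERT, UPDATE", "trips.*", "'app'@'%'"),      # database level
    grant("SELECT", "trips.audit_log", "'analyst'@'%'"),          # table level
    grant("EXECUTE", "PROCEDURE trips.close_trip", "'app'@'%'"),  # routine level
]
print("\n".join(statements))
```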


If you're storing the `mysql` DB as part of your docker container, you're in for a bad time as well. There's a tremendous amount of mutating metadata stored in that DB which needs to be monitored.

Much better to store it as its own version controlled SQL file which is run against a newly started DB (that third command I referenced). You could break out every statement into its own provisioner command, but then you are beginning to muddle the line between machine-level configuration and database level configuration.

So yes, I still stand by my original assertion that you can stand up a fully functional DB with three Ansible/puppet/chef commands.

> maybe Uber has problems at a scale that you don't have

I will admit that it is entirely possible. But from what they present to the public (which is all any of us outside of Uber have to go on), their problems aren't really that unique: a heavy write workload which is easily sharded. Only their chosen method of resolving them is proving to be unique, and they appear to disregard any and all normal methods of resolving their issues.

Sometimes the boring way of doing things - accepting that critical stateful processes are indeed different from stateless processes and managing them accordingly - is still the best way. Getting away from this means getting away from MySQL and its ilk entirely. I hope we can get there one day; managing DB instances separately from your stateless instances is a pain in the ass.


I think you're talking about different things.

MySQL only needs the my.cnf to start a server.

The permissions are stored alongside the database files (if I remember correctly), which is a separate problem. 1) It's only configured once, on the master 2) Uber says that they don't reuse volumes [which implies many other consequences and special management around that].

By the way, Chef/Puppet/Ansible have built-in commands to manage that as well.


Both of you have good points.


I miss being on a team with a proper DBA. Startups these days expect the full-stack dev to know everything under the sun. Your post was really great to read. Do you keep a blog?


I don't. If you really would like to learn more about the ins and outs of MySQL, though, I couldn't recommend the Percona blog more highly.

The unfortunate thing, in my opinion, is that so much of being a software/system administrator of any flavor is having been burned by that software. That experience makes you invaluable, but it also makes you expensive to hire. Expensive hires are really hard for startups to justify.

That said, if you reach the size where you're having to shard your DB to hit performance metrics, you're at the point where it will typically cost you less to absorb that expensive hire (or consultant) than to attempt to naively engineer around the problem.

https://www.percona.com/blog/


> Expensive hires are really hard for startups to justify.

Or... startups are hard for experienced workers to justify.

Who would want to work for less money, shit equity, bad hours, on-call rotations, and constant waves of variable-level developers that you don't/can't get enough time to train?


Maybe this will help dispel some of the myth that Docker is useless for any scenario where you need persistent state. There are still many advantages to wrapping your database in a container, and this post by Uber explains really well how and when to use this technique.

We have been running our databases using this same methodology for almost two years now, including automatic initialization for new clusters, auto-configuration of master-slave and federation setup, and automatic failover/recovery. It doesn't solve all the problems, and we still need to manually solve issues (mostly because we use MongoDB and ran into several bugs and 'features' over the years), but for standard operational work it saves a ton of work. Expanding our cluster by adding a new replicated shard is no more than 2 minutes of work in changing the master config, everything else is fully automated.


> Maybe this will help dispel some of the myth that Docker is useless for any scenario where you need persistent state.

Docker is useless for anything that is not stateless.

You can run a database in Docker... until the assumptions of Docker come back to bite you.

> There are still many advantages to wrapping your database in a container, and this post by Uber explains really well how and when to use this technique.

Well. Run databases in Docker if you've got 3000 clusters and your only concern [that you created yourself] is to run multiple of them on the same hosts.


If you used something like Ansible it would be just as fast, because it is fully automated, and there would be no work at all to change a config, because Ansible will also customize config files from templates. Chef and Puppet and SaltStack work similarly; just learn how to use your tool of choice.


And for a lot of people the tool of choice is docker...


Picking the wrong tool because it's easier never really scales.

Docker is not configuration management. Configuration management is not containerization.

Disclaimer: DevOps by day. I care because it makes it easier for me (and cheaper for you) to come in and scale things if it was done right in the beginning (instead of me having to revamp your infrastructure to move components to the right tools), but either way I get paid, so it doesn't matter much to me. I'm just trying to help make your life easier.


Most configuration management tools are used for managing docker on machines. I don't see them used much for setting up what's inside the containers.

Plus, they don't match the convenience of just pulling down an image identical to production for developers (very important).

I also don't want to be using puppet on dev machines. Configuration management tools also take longer to run all their scripts than just pulling down an image.


> I also don't want to be using puppet on dev machines. Configuration management tools also take longer to run all scripts than just pulling down an image.

Which is why you should be provisioning with Vagrant locally to properly replicate production.


Database servers, whether MySQL or PostgreSQL or NoSQL ones like CouchDB, should never be dockerized. This is a use case where it is inappropriate to use Docker at all. This is a case where the db server should use the entire resources of a single server, and for managing that server (and its replicas or other cluster members) you use a tool like Ansible or Chef or Puppet. And you need to learn that management tool well, because you will be pushing the limits of server automation with a database server.

If you think that a database server is where you put your "application's database" then I question why you have a db server at all. That is the use case SQLite is for. Database servers are for the use case of storing the business's mission-critical data so that all the applications of the business can access needed data.

I love Docker and use it a lot, but not for everything.


I wouldn't go so far as to say they should "never be containerized," but doing so doesn't solve any of your problems other than a packaging problem, and it arguably creates problems you might not have had before.

It's my experience that most of the impetus behind Docker is driven by developers who need it on their Mac/Windows workstations to run Linux processes, and who (for good reason) want to ensure that their development environment is as similar to the production environment as possible. In other words, the developers are driving the production environment, instead of the production engineers driving the development environment. This leads to friction when the production environment has its own unique set of constraints and needs that Docker and containers aren't quite yet a match for.


I also develop on OSX or Windows systems, targeting Linux as the deployment system. And I also want to test the code in an environment as close to production as possible. That is why I install Virtualbox and set up virtual Linux servers using the exact same server management scripts (in my case Ansible) as are used to configure production servers. It works great with no Docker. But I do use Docker as well for servers that are more cookie cutter than a db. For instance a JVM app server that has two Docker images, one with 3 layers culminating in the JVM app, and one with 2 layers culminating in an NGINX SSL proxy.

And I don't run Docker on OSX at all, because it is just a wrapper for running Virtualbox with a Linux server inside. It is simpler to set up Virtualbox directly and use Linux virtual servers directly.


> This is a case where the db server should use the entire resources of a single server

They have thousands of clusters. They didn't design/architect anything.

They're likely just trying to regroup databases because they are heavily underutilized and no one knows WTF they are running. And the organization will keep growing like that, adding new databases every day.


There are other benefits to docker than just being able to run multiple processes on the same machine.

Putting everything you need to build an image into a Dockerfile (and then version controlling that) is useful. Developers can also grab docker images for databases for their local machines without any setup. Also, consistency with the rest of the stack.


If you don't need containerization and still want those features you mention why not just use Vagrant?


I often use vagrant + docker. It isn't a replacement for docker. Vagrant isn't designed for servers. I want to be using the exact same dockerfiles + images on my dev machine as I'm using in production.

I usually use vagrant + docker images running inside the VM. Sometimes, however, since I often have all my microservices on my machine, they can consume a lot of RAM, and I don't want to be splitting my RAM with a VM.


Eh. For small scale, running a DB in a container and linking the DB storage to a directory on the host's filesystem works well enough.

I can't talk about large scale since I don't have first hand experience managing a setup like that.


Why? There are tons of advantages to using docker beyond just running multiple processes on a single machine. You can still gain advantages from installing just a single database container on a big server + docker volumes.

Being able to version control image creation scripts and being able to pull an identical image for both dev/production is useful and quicker than trying to use puppet on dev machines, plus you can now use your docker deployment infrastructure for all services and not have exceptions for things.

What's the disadvantage? No one has explained this to me.


I don't understand your puppet comment, but then I have never used it. I had some experience with Chef, lots of experience with shell scripts to config servers from scratch. But now I use Ansible instead. Technically I still run a shell script to call AWS CLI and launch an instance, but then I do everything else with an Ansible playbook that configures the server, installs a specific version of PostgreSQL, adds some extras like server extensions and a performance reporting tool. This is what set up our current production and dev servers, and from time to time I use it to set up yet another server to test something like adding a language extension, before running it against the dev server.

And as I mentioned, I also use Docker for other servers. My experience is that using Docker to set up a server is usually the same amount of work as Ansible. But Ansible is more amenable to refactoring into reusable components, i.e. components of the build/config process.


Could you expand on why "database servers... should never be dockerized"? You mention that database servers should have access to all of the resources of the server, but Docker doesn't prevent that.

What other reasons would you give someone when they are considering Docker vs Ansible/Chef/Puppet to deploy a database server or cluster?


I love these posts for several reasons.

1. People think of companies like Uber primarily as providers of a regular service to customers (similarly to how Airbnb or Netflix is perceived), but it's interesting to see the engineering chops needed to maintain this operation.

2. Given the relative youth of the company, the stack employed is quite modern and often uses cutting-edge technologies in production and with real impact on customers. Everyday users of these technologies usually only get 'textbook' explanations, but blog posts like these allow them to actually find out how these can be used in production, what the caveats are, how they interplay with other parts of the stack, etc.

3. Some people complain that when a startup gets bought or implodes, their technology virtually disappears instead of benefitting the wider developer community for learning purposes (I'm not here to argue what the best practice should be, just stating an observation of a common complaint). By continually describing their engineering practices (similarly to Airbnb in their engineering blog) and open sourcing their non-business technology (e.g. go-torch), their experience lives on beyond the life of the company.

Last but not least, it builds a positive ethos about this company in the developer community.


but it's interesting to see the engineering chops needed to maintain this operation

No. Uber has clearly hired too many engineers and has too little for them to do, leaving them free to come up with Rube Goldberg contraptions like this one. This is not anything any company that isn't swimming in money should even contemplate (and neither is slapping a layer on MySQL and claiming to have invented a "new datastore"). The kids are in charge of the nursery over at Uber and they are in desperate need of a good, adult CTO...


> The kids are in charge of the nursery over at Uber and they are in desperate need of a good, adult CTO...

That's a proper description of most startups I've had contact with recently, namely (well) funded startups.

It's a consequence of having inexperienced (and, sometimes, untalented) people running functions at a company.

It also applies to other areas like sales ("my inside sales script is killer, even though I haven't done professional sales anywhere before and just found out what inside sales is") or hiring ("I'll just hire my equally inexperienced buddies from college, they were great there").

But when it comes to kids building technology with "college playground" quality, it's too evident not to notice. The product ran fine when their buddies were testing, but it's unstable, unmaintainable and in need of a complete redesign when real world traffic comes along. All things that could have been prevented with a little bit of competence and experience.

It's a kind of mantra for people that take VC blogs as gospel and engineering blogs as the 10 commandments. Don't go to college, don't get any real world experience: start a company and, if it actually survives more than a couple of years, someone will inherit your technical debt.

This investor speak will quickly change when you get funded and have to attend board meetings to tell investors that your product is still not working, or you need twice the team that would be required for your goals, or you need months to produce basic business data...


It doesn't surprise me that I had to scroll so far down in the thread to find this comment. Kudos for saying what needed to be said.

EDIT: Make your system as simple as possible, but no more simple than that.


It seems there is more backlash against Uber's engineering blog posts, though, and the points raised make me doubt Uber's engineering chops.


> Running containerized processes makes it easier to run multiple MySQL processes on the same host in different versions and configurations.

You're doing it wrong. One doesn't simply run multiple DB servers on the same iron.


I strongly disagree. At large scale, there are huge advantages to running multiple mysqld on one physical host. Facebook does this across their entire DB fleet! Most of their DB hosts run 2 mysqld but some run 8 or more -- it depends on which workload the host is part of.

The original motivation was to support multiple hardware generations. PCIe flash cards have become much larger over the years very quickly -- much faster than older cards become end-of-life. The result is that if you're running a large fleet of database hosts for many years, their storage capacity will differ greatly, both between datacenters (e.g. older DCs will have older hardware on average) and eventually within a datacenter that gets a partial refresh.

By defining automation configuration that is smart enough to know that some hosts get 1 mysqld and others get 2 mysqld, based on storage / hw generation, you get much better flash utilization.

This setup also enables faster replacement of failed hosts. Say each host has N mysqld, all part of different pools. If a host fails then you need to hot-copy the data set of each of these, from other replicas in each affected pool, to a new location. The trick is the replacements can copy from N different source hosts, and even go to N different destination hosts as well. This permits massively faster hot copying behavior vs having a single giant mysqld per host.

tl;dr it requires a lot of automation but there are very valid reasons for doing this.

That said, I would not advocate using Docker to achieve this. It provides little benefit for this scenario. If you're good at calculating mysqld memory usage, you can already just set the buffer pool and per-session buffers to a size that prevents multiple mysqlds from ever swapping. Meanwhile cpu and network rarely are points of saturation for a db host so that tends to work out fine without a quota system.
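A sketch of the kind of sizing arithmetic involved; the OS reserve and per-connection numbers below are assumptions for illustration, not Facebook's or Uber's actual figures.

```python
def buffer_pool_per_instance_gb(host_ram_gb: float, n_instances: int,
                                os_reserve_gb: float = 4,
                                per_conn_mb: float = 12,
                                conns_per_instance: int = 500) -> float:
    """Split a host's RAM among n mysqld so their combined footprint
    (buffer pools + per-session buffers) can never exceed physical
    memory. Illustrative heuristic, not a formula from the comment."""
    session_gb = n_instances * conns_per_instance * per_conn_mb / 1024
    usable = host_ram_gb - os_reserve_gb - session_gb
    return usable / n_instances

# A 256 GB host running 2 mysqld:
print(f"{buffer_pool_per_instance_gb(256, 2):.1f} GB buffer pool each")
```

If you size this way, the processes can't swap no matter what each workload does, which is the "good at calculating mysqld memory usage" part.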

So that leaves i/o as the main resource that the processes will compete for. But Docker cannot provide i/o isolation.

From what I understand, Google/YouTube containerizes their data stores, but their systems around containerization are far more advanced than anyone else's. So I assume they've already solved this problem internally, but that doesn't mean the current state of the art in the open source world is up to the task yet.


YouTube already open sourced their containerized MySQL solution and it works pretty great. http://vitess.io


Sure you do. I have installations where the IO bandwidth available from the PCIe based SSDs is 10x what the Postgres databases for a typical customer actually need. There are plenty of circumstances where there's no reason not to run multiple databases on the same physical hardware.


Biggest reason not to run multiple database instances on a single bit of hardware: RAM.

Say you're running on a VM with 16gb of ram, and one of your clients is particularly active, with 10gb of data and constant usage of that dataset. If you colocate that customer with five other customers using individual processes, at most your star customer will be able to house about 3gb of their dataset in memory.

On the flip side, if you have all six customers in one DB instance, then that one customer can house all 10gb of their data in working memory, making it much more performant.

You're hamstringing your DB processes by running one instance per set of databases. Run them all in one process, and you will get much more bang for your infrastructure buck.
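Running the comment's own numbers (with an assumed ~75% shared buffer pool fraction, which is my illustration, not part of the comment):

```python
# 16 GB box, one "star" customer with a 10 GB hot data set,
# colocated with five other customers (six total).
RAM_GB = 16
customers = 6

# Separate mysqld per customer: each gets a fixed slice of memory.
per_instance_gb = RAM_GB / customers   # ~2.7 GB, roughly the "about 3gb" above
# Single shared mysqld: one buffer pool, allocated by actual demand.
shared_pool_gb = RAM_GB * 0.75         # assume ~75% of RAM for the pool

star_hot_set_gb = 10
print(per_instance_gb >= star_hot_set_gb)  # the hot set won't fit its fixed slice
print(shared_pool_gb >= star_hot_set_gb)   # the shared pool can absorb it
```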


Doesn't work that way with containers - they can all still have access to all the memory if you are confident you can safely give them access to it (and if you're not, then you certainly can't co-locate them in the same process).

But I have yet to deal with a server where adding more RAM was more than a rounding error compared to getting a fast IO subsystem.

And running them all on a single server means you need to take down all of them to upgrade any one of them, and are contingent on all of them being able to run on the same database version, and with the same extensions.


> Doesn't work that way with containers - they can all still have access to all the memory

It's not a container problem I'm summing up here, it's a DB problem. Well-behaved DB software won't grow to boundless limits to manage the workload; it will grow to its configured limits.

Running multiple instances of a DB server means that each DB is going to be hard-coded to a limit which is some fraction of the available memory.

> I have yet to deal with a server where adding more RAM was more than a rounding error

There are limits even to this, though. The most memory you can get for any single, broadly available instance in AWS is 244 GiB, and it's going to cost you an arm and a leg to run. It's not hard to create a dataset which exceeds 244 GiB in size.

> take down all of them to upgrade any one of them

Only if you're upgrading the actual DB software. And most of the time a master switch is sufficient to keep everything online through a DB version upgrade. Not to say that master switches are easy; but even Uber has that pain with their system.

> all of them being able to run on the same database version, and with the same extensions.

This isn't something I frequently see being an actual problem, within the same corporation.


> Running multiple instances of a DB server means that each DB is going to be hard-coded to a limit which is some fraction of the available memory.

What you are suggesting, then, is setting limits in a way that does not guarantee resources for any given database.

> There are limits even to this, though. The most memory you can get for any single, broadly available instance in AWS is 244 GiB, and its going to cost you an arm and a leg to run. It's not hard to create a dataset which exceeds 244 GiB in size.

Most people don't have databases that large. If you do, perhaps you shouldn't be putting multiple of them in a single instance if you're limited to machines that small.

For comparison, I have servers with dozens of databases that fit into memory on machines with 64GB and less. I also have machines with single databases that need far more. If we need to, we'll provision machines with 1TB-2TB of RAM.

> This isn't something I frequently see being an actual problem, within the same corporation.

It's something I see all the time. Consider that "within the same corporation", people often run databases for a large number of external customers, or need to be able to roll out new versions, or do development and testing on different versions.

I have at least 5 different versions of Postgres sitting on production servers right now due to customers with different requirements and different upgrade cycles. I also have several different versions of MySQL.


> does not guarantee resources for any given database

Correct. Instead it gives the most resources to the pages in all of the databases which are accessed and used the most. MySQL and InnoDB are remarkably well tuned to ensure that the most often used pages are in memory, where they can be accessed and updated with the greatest performance.

> It's something I see all the time.

We apparently work on very different use cases. Fair enough. Just be aware that your use case is far from typical. Continuing to argue points across such diverse use cases isn't going to make for a productive discussion.


> Instead it gives the most resources to the pages in all of the databases which are accessed and used the most.

And that's not acceptable when dealing with setups where each database needs predictable performance, and some of them may see heavy traffic while others don't.

> Just be aware that your use case is far from typical.

I don't think there is such a thing as a "typical" use case in this area based on what I see in my consulting work, but this is common enough that I deal with it regularly across a wide variety of clients.


Isn't one advantage of Docker over VMs that you have access to all of the RAM in a container, while you don't in a VM? (I may well be wrong about that.)


It's not Docker that's the issue; it's that well-behaved DB software limits its memory usage to a fixed (but configurable) value. Unless you hand-tune the size of the (in this case) InnoDB buffer pool for each DB instance, the memory usage will not be as efficient.

Of course, if you're hand-tuning different DB instances, you're already in trouble.


Yes, you can. If anything, one of the downsides of Docker vs. e.g. OpenVZ or full VMs is that the memory accounting is still much less mature and requires additional work to enable on many distros.


The limitation of the approach you are advocating is that all databases would need to replicate from the same master, which isn't always practical.


Given the choice between performance and ease of replication, the businesses I worked for always chose performance at scale. Perhaps that skews my expectations, but given how much DB performance affects overall app performance, it trumps most other considerations.


Multi-DB hosting isn't the issue; it's installing two MySQL server instances that I think people are questioning.

I know with MSSQL there were scenarios where you would install multiple instances; I think they were for when you had more RAM in the box than the server package was licensed for, or for management of a large number of databases.


Great, except when different customers require different versions or different extensions.


Why is that? Each process is using a separate disk on the machine; if that's the case, where would the potential problems come from?


And what does Docker give you that cgroups would not?

I mean they write:

    Initially, all our clusters were managed by Puppet
Of course Docker won't actually replace such systems. Uber has the worst engineering practices I've ever seen.

They replaced Puppet with a handcrafted tool that uses Docker and call that Dockerizing MySQL; just wow. I wonder when they'll be at zero money. (And the best thing is probably that they run on AWS and could just use an AMI, but since that isn't written in the article I would not create such assumptions.)


> I would not create such assumptions

But you assume they have the worst engineering practices based on a few blog posts?


Memory. Why go to disk for reads in the first place, if your most used data is in memory? You can keep more of the hot data in memory if you use one process instead of two (or five, or ten).


A lot of what they are describing in the first half sounds like Kubernetes. I wonder, if they could do it again, would they adopt it?


Agreed. The newer "PetSet" construct in k8s might simplify this quite a bit:

http://kubernetes.io/docs/user-guide/petset/


Or as it's been renamed last month to: "StatefulSet" https://github.com/kubernetes/kubernetes/issues/35534

Mostly what I was commenting on was how they are building some type of declarative infrastructure where they define the topology they want and then system will build it. This was screaming k8s to me.
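
For a sense of what that declarative topology looks like in k8s, here is a hypothetical minimal StatefulSet for MySQL pods with stable identities and per-pod persistent volumes. All names and sizes are made up, and the API group shifted around this time (PetSet was alpha; StatefulSet landed in apps/v1beta1), so treat this as a sketch:

```yaml
apiVersion: apps/v1beta1   # StatefulSet API group as of Kubernetes 1.5
kind: StatefulSet
metadata:
  name: mysql
spec:
  serviceName: mysql       # headless service giving each pod a stable DNS name
  replicas: 3
  template:
    metadata:
      labels:
        app: mysql
    spec:
      containers:
      - name: mysql
        image: mysql:5.6
        volumeMounts:
        - name: data
          mountPath: /var/lib/mysql
  volumeClaimTemplates:    # each pod gets its own persistent volume
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 100Gi
```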


Most of the assumptions in the comments boil down to: it's a bad idea if you're resource constrained, because you'll get much better performance when everything's on one big server with shared RAM.

However, possibly, their engineering goal was detecting and isolating hotspots and minimizing debug/downtime effort. And possibly they have an infinite pile of cash or at least they're not resource constrained. In that case it might make sense to fragment massively.

For example, let's say database #235 is using 75% of a shared server. Then, at least in the very short term, migrate the numerous other databases at the container (or image) level to other servers so the overloaded database doesn't flood out every other business system you support.

Now the argument is that if you ran MySQL on bare metal you'd have more memory and maybe only use 25% of that shared server, but sooner or later you'll get a big enough flood of traffic that you'll want to segment off, and your operations team understands Docker at 2am but not MySQL DBA stuff quite as much. Sometimes it's nice to have everything use the same standard, and "everything lives in Docker" might work.

This also has fascinating forensic and QoS implications where you can do interesting snapshot and cloning tricks affecting precisely and exactly one DB at a time.


Exactly this.

You don't go to containerization because it's cheaper or more efficient; you go to it because it's more flexible and lets you simplify your tech ops into business ops faster. Containers require a LOT of infrastructure/overhead to build -- and don't be seduced by cloud providers that abstract that complexity: if you build a business on containers you WILL have to deal with it. But it does allow you to de-skill a lot of tech ops processes, which helps reliability (so long as your strategies are effective).

And you're right; depending on the way they handle their sharding (which may be part of a greater data availability strategy) this might not be so bad. But business goals are driving this, so you have to find a way to make the tech deliver what the business wants.


It's almost 2017. The scaling choices are not limited to Docker vs. bare metal.

With their budget, you should be able to clone and scale metal servers within minutes if you wanted to.

This is like Tumblr spending $50,000 a month on their developer AWS instances while not making any money. Cool, but not very practical.


Bare metal is operationally very expensive because someone has to monitor it, back it up, patch it, test it, fix the software when it breaks at 2am, fix the hardware when it breaks at 2am, explain to security whether it's full of confidential stuff or not, decommission it years later, and none of that is as standardized as virtualized or containerized stuff.

Also, "in the olden days" we had prod servers but shared dev and test servers. Nowadays you just spin up resources, assuming your ops is flexible enough. I'll spin up a test image to eliminate one bug and then destroy it, no sweat. In theory you can do that with metal, but the accounting must be weird. You could make a pool of bare metal test boxes for people to use as they please, I guess...

I also like spinning up new images for software upgrades. Oh, a new version of the database, here's a new image, test it out.

Some of it is organizational hacking. After a few legendary disasters, procedures will be formulated where standing up new iron takes interdepartmental meetings and sign-off from the network, security, and ops teams, and the data center guy has to sign off on the thermal and electrical loads, and power points all over the place. In comparison, you wouldn't make someone changing a cell in a spreadsheet go through all that, right? So you deploying a virtual image is just clicking a harmless little button, as long as you operate under a blanket agreement with ops, infosec, networking, etc... At least until enough legendary disasters inevitably happen that clicking "create" on a virtual image requires weeks of time and at least 4 signatures and 3 departmental meetings of micromanagement. Hopefully we'll invent something new by then.


Not sure if you were just using a made-up example, but Tumblr primarily isn't on EC2 and never has been. For compute they were bare-metal from the start and moved to their own self-managed datacenters in 2012.


> The MySQL data directory is mounted from the host file system, which means that Docker introduces no write overhead

If I understand correctly, Docker doesn't introduce any write overhead, mounted from the host or not. This is the main difference from VMs - syscalls are just being "forwarded" to the kernel. Or am I missing something?

Still, I am only arguing with the stated reason. Of course you don't want to save state inside an (ephemeral) container.


There's not even a forwarding process. Containers are simply processes running in a set of one or more different "namespaces." A mount point in a container is just an ordinary mount point from the kernel's perspective. In this case, the mount point is relative to the container's root filesystem set via chroot(2).


It does not say so in the article, but I guess they have a database cluster for each city or something like that, which makes sense since a user does not care about Uber cars in a different city. Do they use GPS to put the Uber car in the right cluster? Users move around more, but the cars are more static, so are they centralized somehow?


Disclaimer: I have never used Uber.

What if a user orders an Uber from one city to another? If the driver is then in another city, wouldn't it be nice to give him some rides in that city, or when someone wants to go back to the original city? What about cities that are near a border? What if a user would like to go shopping in another country? (Not uncommon in Europe. A long time ago I used to take a local bus from Poland to Germany, do some shopping, and go back.)


Sounds fairly simple (which is not to say easy to implement); you keep the car tied to the database of origin for the duration of the ride (other users don't need to know about it while it's occupied), then you can migrate the car to the other DB when it's idle.
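
The hand-off above can be sketched in a few lines. This is a toy illustration (plain dicts standing in for per-city databases, and all names are made up), not anything from Uber's system: a car stays pinned to its origin shard for the duration of the ride, and is migrated to the destination shard only once it goes idle.

```python
def finish_ride(car_id, origin_shard, dest_shard):
    """End a ride; migrate the car's record if it ended in another city."""
    car = origin_shard[car_id]
    car["status"] = "idle"
    if car["city"] != car["location"]:
        # Ride ended in a different city: move the record now that it's idle,
        # so riders in that city can see the car.
        del origin_shard[car_id]
        car["city"] = car["location"]
        dest_shard[car_id] = car
    return car

# A car from the Warsaw shard just finished a cross-border trip to Berlin:
warsaw = {"car-7": {"city": "warsaw", "location": "berlin", "status": "on_trip"}}
berlin = {}

finish_ride("car-7", warsaw, berlin)
print(sorted(berlin))   # → ['car-7']  (now served from the Berlin shard)
```

In a real system the "migration" would be a transactional move between databases, but the routing logic is the same shape.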


How the data is sharded is described in http://eng.uber.com/mezzanine-migration/


It's gotta be the case. They probably choose the trip DB by cutting the map into squares or hexagons, and dump all location data from drivers and riders into it in real time. Sounds like they've got batch jobs running against these databases to discover surge areas and such, probably designed to hit only 1-2 databases at a time for the calculations they do a lot.

This really looks like a crap design overall for a company started well into the age of NoSQL and stream processing.


I'm going to guess your design assessment is unpopular. You may find that "old-skool" store-and-forward networks are pretty effective for slow-moving dimensions.


It sounds like Uber has a database cluster for roughly every employee?!! Uber has a single product that largely centers around 1 app. How can this be necessary?

I have a feeling... That they're dumping realtime GPS data into a bunch of these when they should be using something like Cassandra...


As said in a previous blog post by Uber, they have thousands of microservices and over 8000 Git repos.

It's safe to assume that their infra is a giant clusterfuck :D

I would assume that they scaled by adding more and more engineers, who each end up working in their own corners on different problems, without any basic shared tooling/practices/design. Things get out of hand quickly.


"Single product" which consists of more than X thousands of micro-services.


>The entire trip store, which receives millions of trips every day, now runs on Dockerized MySQL databases together with other stores

It really does sound like this is part of the issue. Millions of trips is billions of DB rows per day. A document store is much more amenable to that kind of workload than MySQL.


Actually, it is sort of the opposite... They needed better OLTP to deal with synchronizing billing, and they wanted schema flexibility. Document/column stores are traditionally not ideal for this because A+P (CAP theorem) are preferred over "C"onsistency (very broadly speaking).

They could have done it with a transactional message queue and a decent RDBMS (there are far bigger companies that use RDBMSs for far more transactions than Uber does), but they clearly did not have the in-house expertise for that (and they also wanted rapid schema changes).

Part of the problem is Postgres was a little behind on scaling in previous years, but that has changed. IMO they could have stuck with Postgres by making an add-on for it instead, but they found extending MySQL easier.


Uber essentially built their own custom document store on top of MySQL. They explain their design and reasons (and why they didn't use Cassandra, etc.) in this post: https://eng.uber.com/schemaless-part-one/


Okay, thanks, that explains the why but doesn't make it sound less terrible. They essentially built a NoSQL database on top of MySQL. Forgivable years ago, but this was in 2014...


Which likely explains why they use their MySQL Schemaless engine[1].

1. https://eng.uber.com/schemaless-part-one/


I guess we should train the new generation better so they don't end up like Uber. So, a tip for any company that gets incredibly successful:

The first thing one has to do is to drop SQL databases as the main data source.

The usual choice is to move to Cassandra. It has built-in sharding AND backup AND multi-master replication AND multi-datacenter support, AND performance scales linearly with the number of servers.

The [only] other option is ElasticSearch (which has slightly different properties regarding data format and data consistency).

You don't need to craft complex custom sharded distributed WTF software to abstract hundreds of clusters of hundreds of databases. Use the right tool for the job, one that has that built in.
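
For instance, Cassandra's multi-datacenter replication is declared per keyspace rather than hand-rolled. This is a hypothetical CQL sketch (the keyspace and datacenter names are made up):

```sql
-- Replication topology is part of the keyspace definition:
-- three replicas in each of two datacenters, with sharding
-- (partitioning by token) handled automatically by the cluster.
CREATE KEYSPACE trips
  WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'us_east': 3,
    'eu_west': 3
  };
```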


Dropping the RDBMS and replacing it with something like Cassandra is fashionable, but sounds like bad advice to me.

A quick google gives these stats for Uber[1]:

- 8 million users

- 160 thousand drivers

- 1 million daily rides

- 2 billion rides so far

The biggest data, as stated in [2], is the trip info. They are storing the trip info as a 20kB JSON blob. If you add a custom data type [3] for a trip to PostgreSQL and encode it efficiently (binary, using deltas), you should be able to do much better than the ~3kB they get by using MessagePack and zlib - say, 1kB.

All that data should easily fit into 4TB. At one million rides per day, one reasonably beefy RDBMS server should be able to handle that easily. You can partition the trips table and move trips that are older than, say, 1 month to a slower disk to save money on disks. That helps with schema migration too: new data uses the new schema, and you use a view on the old data to adapt it to the new schema.

I very much doubt you need anything much fancier than that. It would help if people really learned their tools (in this case PostgreSQL) instead of jumping to a different technology at the first problem they run into.

[1] http://expandedramblings.com/index.php/uber-statistics/

[2] https://eng.uber.com/trip-data-squeeze/

[3] https://www.postgresql.org/docs/current/static/xtypes.html
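
The partitioning-plus-tablespaces idea above could look roughly like this in pre-10 PostgreSQL (inheritance-based partitioning, the mechanism available in 2016). Table, column, and tablespace names are all hypothetical:

```sql
-- Parent table; the blob column holds the compact binary trip
-- encoding discussed above.
CREATE TABLE trips (
    trip_id   bigint PRIMARY KEY,
    ended_at  timestamptz NOT NULL,
    blob      bytea
);

-- One child partition per month, constrained by trip end time.
CREATE TABLE trips_2016_10 (
    CHECK (ended_at >= '2016-10-01' AND ended_at < '2016-11-01')
) INHERITS (trips);

-- Once a month goes cold, shift its partition to a tablespace
-- backed by slower, cheaper disks ('archive' must already exist):
ALTER TABLE trips_2016_10 SET TABLESPACE archive;
```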


As a rule of thumb, putting 5TB of data - which is growing exponentially - in a single box is always a terrible idea.

You're gonna hit all kinds of limits with the RDBMS software and the special hardware that will be required.

Cassandra will handle the sharding automatically, it will have multiple instances with always one available for your applications, it will handle replication across datacenters all around the world, it will have dependable performance that scales linearly with the specs you give it, it will allow you to do maintenance while online, and it will let you add, remove, and refresh nodes.

A bigger RDBMS has none of these qualities. It's a one-trick pony that will die when either the hardware or the software reaches a limit, and then your whole site will be down. Even if you know what needs to be done to avoid the disaster, you can't do it, because the RDBMS is a SPOF and every maintenance you perform is called "downtime".

Short term = Postgres/MySQL, because it's easier and it gets the job done. Long term = Cassandra, because it's dependable.


> As a rule of thumbs, putting 5TB of data -that are growing exponentially- in a single box is always a terrible idea.

Storage capacity is also growing exponentially: Samsung is shipping a 15TB SSD (albeit at $10,000), and Seagate has previewed a 60TB SSD.

> You're gonna hit all kind of limits with the RDBMS software and the special hardware that will be required.

The limits of RDBMSs are well understood. You don't need any fancy hardware for this use case. A couple of Xeons, as much RAM as you can afford, and a RAID of SSDs (or possibly spinning disks; it doesn't look like Uber is doing anything too fancy).

SPOF: you have slave replicas running that can take over if something goes wrong with the master.

You don't have to take down an RDBMS to do maintenance. DDL statements are transactional in PostgreSQL.

The automatic sharding of Cassandra is nice, when you need it. Of course, Uber's use case seems like it lends itself to easy geographic sharding when you're using an RDBMS, if needed.

In the end, I'd rather deal with a mature, well-understood technology like an RDBMS than with a 5-year-old technology like Cassandra (release 1.0 in 2011). You obviously prefer the opposite. To each their own.


If you can't get the hardware on AWS, nor Google, nor SoftLayer, I'd consider that exotic enough.

Don't get me wrong. I know vertical scaling and I've done it before. I'd take an old school DBA who understands Oracle over a random junior speaking only NoSQL to everything.

For the majority of use cases (including where I am now), it's easier to pick the right technology (Cassandra), even if we have to learn it and later teach it around, than it is to find someone who can really do 10TB PostgreSQL and spread the knowledge.

Of course, if you have extensive experience with PostgreSQL, that may skew the choice heavily to the other direction ;)


> Of course, if you have extensive experience with PostgreSQL, that may skew the choice heavily to the other direction ;)

I have more experience with PostgreSQL than Cassandra, for sure. And I've had negative experiences with people trying to push Cassandra where it was totally not appropriate (small problem and no need for high availability). They had no experience with Cassandra themselves beyond watching a video and doing some tutorials and couldn't answer basic questions about the underlying technology. That might be skewing my perception too.


Is this somewhat similar to the containerized MySQL used at YouTube that was open sourced by Google? Vitess (http://vitess.io/overview/)


I think the back story here is that they are running MySQL as sharded key-value storage.

And at 2300 instances, it makes sense that there would be a few of them dying at any given time.

With this setup, using MySQL is pretty much arbitrary; any database could fill the role as well. E.g. it got me thinking that Redis+Sentinel could do the job too (if they don't need super durable transactions).


> Screenshot from our management console.

This looks like an amazing tool.



