1) MySQL requires one file to configure it: my.cnf. This is not exactly a huge amount of configuration. Installs via Puppet, Chef, or Ansible tend to consist of two commands - one to install the package, and one to write the my.cnf (templates are good, so you can use the same command on a server of any size). You can add one more command to set up the initial users, should you so desire.
2) Multiple MySQL processes on the same host waste that host's resources. A single MySQL instance is perfectly capable of running multiple databases, and will respond faster because it will properly allocate the box's memory according to each DB's usage. Multiple processes will each chomp up their configured bit of memory, not allowing individual databases to use the resources they need. It's always faster to serve data from memory than from disk (even if that disk is an SSD). Worse, under-utilized DB instances will be swapped out to disk, causing even more load and delay as they are swapped back in.
3) Transferring data from one host to another when you need to bring up a stateful process does not scale. Above a few gigs, the transfer process creates significant load on both the source and destination host, and will easily saturate the link between the two. Neither will respond to requests with any alacrity, meaning you typically want to take both hosts out of the active DB pool.
4) The DB restoration process from copying over the raw files can take 10+ minutes, depending on how many dirty pages existed on the source. The restoration process will go faster with logical dumps, but logical dumps will take longer to generate, transfer, and load.
To reiterate, these are problems for those companies running at scale, with terabytes of data and dozens (or more) DB servers. When you're running a few GB of data, a DB container is probably going to work fine. Just don't believe for a second that you can scale it the same way you do your web frontend services.
It pays to hire experts. How much time and money has Uber sunk into working around their DB, instead of with it?
I can't say much, but there were multiple issues that could have been easily avoided with a basic understanding of GIS. I have no doubt that some of it was due to the IT department at my employer, but I am also incredibly surprised that any GIS staff at the ride-share companies would have agreed to what was implemented.
Waste of existing talent is unfortunately a way of life at large corporations :( .
5a) ignores MySQL's ability to make many runtime config changes without a restart (see the sketch after this list)
6) hard-codes master/slave relationships into the boot process (at least if I'm understanding the config json)
7) adds additional risk (noted as docker crashes) and client issues (noted as "userland proxy" comment)
8) requires additional ops overhead (noted as special care needed for masters)
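For 5a, a quick illustration of what can be changed on a live server with no restart at all (values below are made up, not recommendations; the online buffer pool resize assumes 5.7+):

```sql
-- Hypothetical runtime tweaks on a running instance; no restart required.
SET GLOBAL max_connections = 2000;
SET GLOBAL slow_query_log = ON;
SET GLOBAL long_query_time = 0.5;
-- Resizing the buffer pool online assumes MySQL 5.7.5 or newer:
SET GLOBAL innodb_buffer_pool_size = 64 * 1024 * 1024 * 1024;
```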
But if you already run everything in docker and really want to get rid of puppet, then why not?
this article links to ansible but not puppet and is the only blog post tagged with either one; I think this really boils down to puppet vs $x
Given my limited, failed experience playing with Docker + Postgres (and various volume stores, including Flocker), I have continued doubts about Uber's choices.
I suppose they are having success but I wonder at what cost.
Specifically the quote on:
Finally, our decision ultimately came down to operational trust in the system we’d use, as it contains mission-critical trip data. Alternative solutions may be able to run reliably in theory, but whether we would have the operational knowledge to immediately execute their fullest capabilities factored in greatly to our decision to develop our own solution to our Uber use case. This is not only dependent on the technology we use, but also the experience with it that we had on the team.
And you know, I wouldn't argue too much at the thought of running MySQL out of a container. It does avoid a lot of basic issues around upgrades, versioning, stack split, etc. It's the act of treating it like all other containerized software which gets my panties in a twist.
Stateful systems can't be treated like stateless systems if you want to maintain anything resembling reasonable uptime and performance.
Docker does not replace Puppet, and they clearly wrote that they developed a system just for that.
It depends. @falcolas, could you point out a few other solutions than those? (for resharding live, other than replication / copy / rsync / ...)
The lightest weight solution I've seen is restoring from a daily backup in something like S3, then setting up as a slave from a live master to catch up on the day's binlogs. Still a lot of data to move and load, but at least it's not the entire contents of the DB.
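Roughly (hostnames and binlog coordinates below are made up; with GTIDs you'd use MASTER_AUTO_POSITION instead of explicit coordinates):

```sql
-- After restoring last night's backup onto the new host, attach it as a slave of the
-- live master so it replays the day's binlogs and catches up.
CHANGE MASTER TO
    MASTER_HOST = 'db-master.internal',
    MASTER_USER = 'repl',
    MASTER_PASSWORD = '...',
    MASTER_LOG_FILE = 'mysql-bin.000123',  -- coordinates recorded alongside the backup
    MASTER_LOG_POS  = 4;
START SLAVE;
-- Keep an eye on Seconds_Behind_Master until it hits 0, then return the host to the pool.
SHOW SLAVE STATUS\G
```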
The best you can do is be in control of when data transfers happen, so you're doing it when it makes sense and not in the middle of your highest traffic period (which is what frequently happens when attempting to automatically scale DBs in response to load).
Some databases have a configurable limit in MB/s for replication, or it's possible to assign disk/CPU quota on the slave to slow it down.
Combine that with good planning and monitoring, and you'll be fine =)
You should always run mysql_upgrade to be safe, and I'd assume Uber's automation does this. But you probably won't encounter problems if you don't.
mysql_upgrade typically just makes some quick changes to the system schema ("mysql" database), not to other data. And it typically only needs to change things across major version upgrades (5.5->5.6) or pre-GA releases; it's rare to make a change in a point release of a major version much after the GA release.
That all said, I don't ever advocate using Docker for MySQL in production at this time; see my longer comment in another subthread below for reasoning.
My experience with this mirrors the parent's. The more data you have, and the more obscure engine features you use, the more likely it is to happen. It only takes one major failure to make you want to test the daylights out of any upgrades or downgrades.
The more common classes of problem are performance degradation, config options being added or renamed, new features being buggy in rare edge cases, etc. Not things particular to InnoDB's storage format or things that would prevent a volume from being usable with a different version.
Really? The mysql System Database is filled with tables that hold important configuration information. You can say "add one more command to set up the initial users," but if you have a lot of grants, you've got six levels of database privileges -- global, database, table, host, stored procedure/functions and proxies -- that all need to be configured. You've got the event scheduler. If you have UDFs that you expect on every server, those need to be configured. None of this is in my.cnf. Maybe at the scale you've worked at, you've never had issues with any of those things in a three-command puppet/Chef/Ansible setup, but maybe Uber has problems at a scale that you don't have, rather than them not understanding how MySQL is configured?
Much better to store it as its own version controlled SQL file which is run against a newly started DB (that third command I referenced). You could break out every statement into its own provisioner command, but then you are beginning to muddle the line between machine-level configuration and database level configuration.
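For illustration, a hypothetical users.sql along these lines (every name here is a placeholder), kept in version control and run once against a fresh instance, covers several of those grant levels:

```sql
-- Hypothetical users.sql, applied by that third provisioner command
-- against a freshly started instance.
CREATE USER 'app'@'10.0.%'      IDENTIFIED BY '...';
CREATE USER 'readonly'@'10.0.%' IDENTIFIED BY '...';
CREATE USER 'repl'@'10.0.%'     IDENTIFIED BY '...';

GRANT REPLICATION SLAVE ON *.* TO 'repl'@'10.0.%';                   -- global level
GRANT SELECT, INSERT, UPDATE, DELETE ON app_db.* TO 'app'@'10.0.%';  -- database level
GRANT SELECT ON app_db.reports TO 'readonly'@'10.0.%';               -- table level
GRANT EXECUTE ON PROCEDURE app_db.nightly_rollup TO 'app'@'10.0.%';  -- routine level (assumes it exists)
```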
So yes, I still stand by my original assertion that you can stand up a fully functional DB with three Ansible/puppet/chef commands.
> maybe Uber has problems at a scale that you don't have
I will admit that it is entirely possible. But from what they present to the public (which is all any of us outside of Uber have to go on), their problems aren't really that unique: a heavy write workload which is easily sharded. Only their chosen method of resolving them is proving to be unique, and they appear to disregard any and all normal methods of resolving their issues.
Sometimes the boring way of doing things - accepting that critical stateful processes are indeed different from stateless processes and managing them accordingly - is still the best way. Getting away from this means getting away from MySQL and its ilk entirely. I hope we can get there one day; managing DB instances separately from your stateless instances is a pain in the ass.
MySQL only needs the my.cnf to start a server.
The permissions are stored alongside the database files (if I remember correctly), which is a separate problem. 1) It's only configured once on the master. 2) Uber says that they don't reuse volumes [which implies many other consequences and special management around that].
By the way, Chef/Puppet/Ansible have built-in commands to manage that as well.
The unfortunate thing, in my opinion, is that so much of being a software/system administrator of any flavor is having been burned by that software. That experience makes you invaluable, but it also makes you expensive to hire. Expensive hires are really hard for startups to justify.
That said, if you reach the size where you're having to shard your DB to hit performance metrics, you're at the point where it will typically cost you less to absorb that expensive hire (or consultant) than to attempt to naively engineer around the problem.
Or... startups are hard for experienced workers to justify.
Who would want to work for less money, shit equity, bad hours, and on-call rotations, surrounded by constant waves of developers of variable skill that you don't/can't get enough time to train?
We have been running our databases using this same methodology for almost two years now, including automatic initialization for new clusters, auto-configuration of master-slave and federation setup, and automatic failover/recovery. It doesn't solve all the problems, and we still need to manually solve issues (mostly because we use MongoDB and ran into several bugs and 'features' over the years), but for standard operational work it saves a ton of work. Expanding our cluster by adding a new replicated shard is no more than 2 minutes of work in changing the master config, everything else is fully automated.
Docker is useless for what is not stateless.
You can run a database in Docker... until Docker's assumptions come back to bite you.
> There are still many advantages to wrapping your database in a container, and this post by Uber explains really well how and when to use this technique.
Well. Run databases in Docker if you've got 3000 clusters and your only concern [that you created yourself] is to run multiple of them on the same hosts.
Docker is not configuration management. Configuration management is not containerization.
Disclaimer: DevOps by day. I care because it makes it easier for me (and cheaper for you) to come in and scale things if it was done right in the beginning (instead of me having to revamp your infrastructure to move components to the right tools), but either way I get paid, so it doesn't matter much to me. I'm just trying to help make your life easier.
Plus, they don't match the convenience of just pulling down an image identical to production for developers (very important).
I also don't want to be using Puppet on dev machines. Configuration management tools also take longer to run all their scripts than just pulling down an image.
Which is why you should be provisioning with Vagrant locally to properly replicate production.
If you think that a database server is where you put your "application's database", then I question why you have a DB server at all. That is the use case SQLite is for. Database servers are for the use case of storing the business's mission-critical data so that all the applications of the business can access the data they need.
I love Docker and use it a lot, but not for everything.
It's my experience that most of the impetus behind Docker is driven by developers who need it on their Mac/Windows workstations to run Linux processes, and who (for good reason) want to ensure that their development environment is as similar to the production environment as possible. In other words, the developers are driving the production environment, instead of the production engineers driving the development environment. This leads to friction when the production environment has its own unique set of constraints and needs that Docker and containers aren't quite yet a match for.
And I don't run Docker on OS X at all, because it is just a wrapper for running VirtualBox with a Linux server inside. It's simpler to just set up VirtualBox directly and use Linux virtual servers directly.
They have thousands of clusters. They didn't design/architect anything.
They're likely just trying to regroup databases because they are heavily underutilized and no one knows WTF they are running. And the organization will keep growing like that, adding new databases every day.
Putting everything you need to build an image into a Dockerfile (and then version-controlling that) is useful. Developers can also grab Docker images for databases for their local machines without any setup. There's also consistency with the rest of the stack.
I usually use vagrant + docker images running inside the vm.
Sometimes, however, since I often have all my microservices on my machine, they can consume a lot of RAM, and I don't want to be splitting my RAM with a VM.
I can't talk about large scale since I don't have first hand experience managing a setup like that.
Being able to version-control image creation scripts and to pull an identical image for both dev and production is useful, and quicker than trying to use Puppet on dev machines; plus, you can now use your Docker deployment infrastructure for all services and not have exceptions for things.
What's the disadvantage? No one has explained this to me.
And as I mentioned, I also use Docker for other servers. My experience is that using Docker to set up a server is usually the same amount of work as Ansible. But Ansible is more amenable to refactoring into reusable components, i.e. components of the build/config process.
What other reasons would you give someone when they are considering Docker vs Ansible/Chef/Puppet to deploy a database server or cluster?
1. People think of companies like Uber primarily as providers of a regular service to customers (similarly to how Airbnb or Netflix is perceived), but it's interesting to see the engineering chops needed to maintain this operation.
2) Given the relative youth of the company, the stack employed is quite modern and often uses cutting-edge technologies in production and with real impact on customers. Everyday users of these technologies usually only get 'textbook' explanations, but blog posts like these allow them to actually find out how these can be used in production, what the caveats are, how they interplay with other parts of the stack etc.
3. Some people complain that when a startup gets bought or implodes, their technology virtually disappears instead of benefitting the wider developer community for learning purposes (I'm not here to argue what the best practice should be, just stating an observation of a common complaint). By continually describing their engineering practices (similarly to Airbnb in their engineering blog) and open sourcing their non-business technology (e.g. go-torch), their experience lives on beyond the life of the company.
Last but not least, it builds a positive ethos about the company in the developer community.
No. Uber has clearly hired too many engineers and has too little for them to do, leaving them free to come up with Rube Goldberg contraptions like this one. This is not anything any company that isn't swimming in money should even contemplate (and neither is slapping a layer on MySQL and claiming to have invented a "new datastore"). The kids are in charge of the nursery over at Uber, and they are in desperate need of a good, adult CTO...
That's a proper description of most startups I've had contact with recently, namely (well) funded startups.
It's a consequence of having inexperienced (and, sometimes, untalented) people running functions at a company.
It also applies to other areas like sales ("my inside sales script is killer, even though I haven't done professional sales anywhere before and just found out what inside sales is") or hiring ("I'll just hire my equally inexperienced buddies from college, they were great there").
But when it comes to kids building technology with "college playground" quality, it's too evident not to notice. The product ran fine when their buddies were testing, but it's unstable, unmaintainable and in need of a complete redesign when real world traffic comes along. All things that could have been prevented with a little bit of competence and experience.
It's a kind of mantra for people that take VC blogs as gospel and engineering blogs as the 10 commandments. Don't go to college, don't get any real world experience: start a company and, if it actually survives more than a couple of years, someone will inherit your technical debt.
This investor speak will quickly change when you get funded and have to attend board meetings to tell investors that your product is still not working, or you need twice the team that would be required for your goals, or you need months to produce basic business data...
EDIT: Make your system as simple as possible, but no simpler than that.
You're doing it wrong. One doesn't simply run multiple DB servers on the same iron.
The original motivation was to support multiple hardware generations. PCIe flash cards have become much larger over the years very quickly -- much faster than older cards become end-of-life. The result is that if you're running a large fleet of database hosts for many years, their storage capacity will differ greatly, both between datacenters (e.g. older DCs will have older hardware on average) and eventually within a datacenter that gets a partial refresh.
By defining automation configuration that is smart enough to know that some hosts get 1 mysqld and others get 2 mysqld, based on storage / hw generation, there's a much better flash utilization win.
This setup also enables faster replacement of failed hosts. Say each host has N mysqld, all part of different pools. If a host fails then you need to hot-copy the data set of each of these, from other replicas in each affected pool, to a new location. The trick is the replacements can copy from N different source hosts, and even go to N different destination hosts as well. This permits massively faster hot copying behavior vs having a single giant mysqld per host.
tl;dr it requires a lot of automation but there are very valid reasons for doing this.
That said, I would not advocate using Docker to achieve this. It provides little benefit for this scenario. If you're good at calculating mysqld memory usage, you can already just set the buffer pool and per-session buffers to a size that prevents multiple mysqlds from ever swapping. Meanwhile cpu and network rarely are points of saturation for a db host so that tends to work out fine without a quota system.
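As a sketch, for two mysqlds splitting a hypothetical 256 GB host, the arithmetic is just keeping each instance's buffer pool plus (max_connections x per-session buffers) comfortably under its share of RAM (numbers below are purely illustrative; the online buffer pool resize assumes 5.7+, otherwise this goes in each instance's my.cnf):

```sql
-- Illustrative sizing for one of two mysqlds sharing a 256 GB host, so neither can
-- push the other into swap.
SET GLOBAL innodb_buffer_pool_size = 100 * 1024 * 1024 * 1024;  -- ~100 GB each, leaving headroom
SET GLOBAL max_connections  = 500;
SET GLOBAL sort_buffer_size = 2 * 1024 * 1024;   -- 2 MB, allocated per sort, per connection
SET GLOBAL join_buffer_size = 1 * 1024 * 1024;   -- 1 MB, allocated per join buffer, per connection
SET GLOBAL tmp_table_size   = 64 * 1024 * 1024;  -- cap on in-memory temp tables
```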
So that leaves i/o as the main resource that the processes will compete for. But Docker cannot provide i/o isolation.
From what I understand, Google/YouTube containerizes their data stores, but their systems around containerization are far more advanced than anyone else's. So I assume they've already solved this problem internally, but that doesn't mean the current state of the art in the open source world is up to the task yet.
Say you're running on a VM with 16 GB of RAM, and one of your clients is particularly active, with 10 GB of data and constant usage of that dataset. If you colocate that customer with five other customers using individual processes, at most your star customer will be able to house about 3 GB of their dataset in memory.
On the flip side, if you have all six customers in one DB instance, then that one customer can house all 10 GB of their data in working memory, making it much more performant.
You're hamstringing your DB processes by running one instance per set of databases. Run them all in one process, and you will get much more bang for your infrastructure buck.
But I have yet to deal with a server where adding more RAM was more than a rounding error compared to getting a fast IO subsystem.
And running them all on a single server means you need to take down all of them to upgrade any one of them, and are contingent on all of them being able to run on the same database version, and with the same extensions.
It's not a container problem I'm summing up here, it's a DB problem. Well-behaved DB software won't grow to boundless limits to manage the workload; it will grow to its configured limits.
Running multiple instances of a DB server means that each DB is going to be hard-coded to a limit which is some fraction of the available memory.
> I have yet to deal with a server where adding more RAM was more than a rounding error
There are limits even to this, though. The most memory you can get for any single, broadly available instance in AWS is 244 GiB, and it's going to cost you an arm and a leg to run. It's not hard to create a dataset which exceeds 244 GiB in size.
> take down all of them to upgrade any one of them
Only if you're upgrading the actual DB software. And most of the time a master switch is sufficient to keep everything online through a DB version upgrade. Not to say that master switches are easy; but even Uber has that pain with their system.
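For what it's worth, the bare-bones version of that switch looks something like this (assuming GTID replication; hostnames are made up, and the real pain is all the application/proxy coordination around it):

```sql
-- On the old master: stop accepting writes.
SET GLOBAL super_read_only = ON;   -- plain read_only on pre-5.7 versions

-- On the replica being promoted: wait until it has applied everything
-- (Seconds_Behind_Master = 0), then detach it and open it for writes.
STOP SLAVE;
RESET SLAVE ALL;
SET GLOBAL super_read_only = OFF;

-- On every other replica: repoint at the new master.
STOP SLAVE;
CHANGE MASTER TO MASTER_HOST = 'db-new-master.internal', MASTER_AUTO_POSITION = 1;
START SLAVE;
```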
> all of them being able to run on the same database version, and with the same extensions.
This isn't something I frequently see being an actual problem, within the same corporation.
What you are suggesting then, is setting limits in a way that does not guarantee resources for any given database.
> There are limits even to this, though. The most memory you can get for any single, broadly available instance in AWS is 244 GiB, and it's going to cost you an arm and a leg to run. It's not hard to create a dataset which exceeds 244 GiB in size.
Most people don't have databases that large. If you do, perhaps you shouldn't be putting multiple of them in a single instance if you're limited to machines that small.
For comparison, I have servers with dozens of databases that fit into memory on machines with 64GB and less. I also have machines with single databases that need far more. If we need to, we'll provision machines with 1TB-2TB of RAM.
> This isn't something I frequently see being an actual problem, within the same corporation.
It's something I see all the time. Consider that "within the same corporation", people often run databases for a large number of external customers, or need to be able to roll out new versions, or do development and testing on different versions.
I have at least 5 different versions of Postgres sitting on production servers right now, due to customers with different requirements and different upgrade cycles.
I also have several different versions of MySQL.
Correct. Instead it gives the most resources to the pages in all of the databases which are accessed and used the most. MySQL and InnoDB are remarkably well tuned to ensure that the most often used pages are in memory, where they can be accessed and updated with the greatest performance.
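You can actually watch it doing this. On 5.6+ something like the following shows which tables currently own buffer pool pages (it scans buffer pool metadata, so it's expensive on big pools; treat it as an occasional diagnostic, not a dashboard query):

```sql
-- Which tables' pages are resident in the InnoDB buffer pool right now (MySQL 5.6+).
SELECT table_name,
       COUNT(*)                            AS pages,
       ROUND(SUM(data_size) / 1024 / 1024) AS approx_mb
FROM   information_schema.innodb_buffer_page
WHERE  table_name IS NOT NULL
GROUP  BY table_name
ORDER  BY pages DESC
LIMIT  10;
```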
> It's something I see all the time.
We apparently work on very different use cases. Fair enough. Just be aware that your use case is far from typical. Continuing to argue points across such diverse use cases isn't going to make for a productive discussion.
And that's not acceptable when dealing with setups where each database needs predictable performance, and some of them may see heavy traffic while others don't.
> Just be aware that your use case is far from typical.
I don't think there is such a thing as a "typical" use case in this area, based on what I see as part of my consulting work, but this is common enough that I deal with it regularly across a wide variety of clients.
Of course, if you're hand-tuning different DB instances, you're already in trouble.
I know with MSSQL there were scenarios where you would install multiple instances; I think it was when you had more RAM in the box than the server package was licensed for, or for management of a large number of databases.
I mean they write:
Initially, all our clusters were managed by Puppet
They replaced Puppet with a handcrafted tool that uses Docker and call that "Dockerizing MySQL", just wow. I wonder when they'll hit zero money. (And the best thing would probably be if they ran on AWS and could just use an AMI, but since that isn't written in the article I won't make that assumption.)
But you assume they have the worst engineering practices based on a few blog posts?
Mostly what I was commenting on was how they are building some type of declarative infrastructure where they define the topology they want and then the system will build it. This was screaming k8s to me.
However, possibly, their engineering goal was detecting and isolating hotspots and minimizing debug/downtime effort. And possibly they have an infinite pile of cash or at least they're not resource constrained. In that case it might make sense to fragment massively.
For example, let's say database #235 is using 75% of a shared server. Then, at least in the very short term, migrate the numerous other databases at the container (or image) level to other servers so the overloaded database doesn't flood out every other business system you support.
Now the argument is that if you ran MySQL on bare metal you'd have more memory and maybe only use 25% of that shared server, but sooner or later you'll get a big enough flood of traffic that you'll want to segment it off, and your operations team understands Docker at 2am but not MySQL DBA stuff quite as much. Sometimes it's nice to have everything use the same standard, and "everything lives in Docker" might work.
This also has fascinating forensic and QoS implications where you can do interesting snapshot and cloning tricks affecting precisely and exactly one DB at a time.
You don't go to containerization because it's cheaper / more efficient, you go to it because it's more flexible and lets you simplify your tech ops into business ops faster. Containers require a LOT of infrastructure / overhead to build -- and don't be seduced by cloud providers that abstract that complexity, if you build a business on containers you WILL have to deal with it. But it does allow you to de-skill a lot of tech ops processes, which helps reliability (so long as your strategies are effective).
And you're right; depending on the way they handle their sharding (which may be part of a greater data availability strategy) this might not be so bad. But business goals are driving this, so you have to find a way to make the tech deliver what the business wants.
With their budget, you should be able to clone and scale metal servers within minutes if you wanted to.
This is like Tumblr spending $50,000 a month on their developer AWS instances while not making any money. Cool, but not very practical.
Also "in the olden days" we had prod servers but shared dev and test servers. Nowadays you just spin up resources, assuming your ops is flexible enough. I'll spin up a test image to eliminate one bug and then destroy it, no sweat. In theory you can do that with metal, but the accounting must be weird. You could make a pool of bare-metal test boxes for people to use as they please, I guess...
I also like spinning up new images for software upgrades. Oh, a new version of the database, here's a new image, test it out.
Some of it is organizational hacking. After a few legendary disasters procedures will be formulated where standing up new iron takes interdepartmental meetings and signing off with the network and security and ops teams and the data center guy has to sign off on the thermal and electrical loads and power points all over the place. In comparison, you wouldn't make someone changing a cell in a spreadsheet go thru all that, right? So you deploying a virtual image is just clicking a harmless little button, as long as you operate under a blanket agreement with ops, infosec, networking, etc... At least until enough legendary disasters inevitably happen that clicking "create" on a virtual image requires weeks of time and at least 4 signatures and 3 departmental meetings of micromanagement. Hopefully we'll invent something new by then.
If I understand correctly, Docker doesn't introduce any write overhead, mounted from host or not. This is the main difference from VMs - syscalls are just being "forwarded" to kernel. Or am I missing something?
Still, I am only arguing with the stated reason. Of course you don't want to save state inside an (ephemeral) container.
What if a user orders an Uber from one city to another? If the driver is then in the other city, wouldn't it be nice to give them some rides in that city, or a ride back when someone wants to go to the original city?
What about cities that are near a border? What if a user would like to go shopping in another country? (Not uncommon in Europe. A long time ago I used to take a local bus from Poland to Germany, do some shopping, and go back.)
This really looks like a crap design overall for a company started well in the age of nosql and stream processing.
I have a feeling... That they're dumping realtime GPS data into a bunch of these when they should be using something like Cassandra...
It's safe to assume that their infra is a giant clusterfuck :D
I would assume that they scaled by adding more and more engineers, who each end up working in their own corner on different problems, without any basic shared tooling/practices/design. Things get out of hand quickly.
It really does sound like this is part of the issue. Millions of trips is billions of DB rows per day. A document store is much more amenable to that kind of workload than MySQL.
They could have done it with a transactional message queue and a decent RDBMS (there are far bigger companies that use RDBMS for far more transactions than Uber does) but they clearly did not have the in house expertise for that (and they also wanted rapid schema changes).
Part of the problem is that Postgres was a little behind on scaling in previous years, but that has changed. IMO they could have stuck with Postgres by building an add-on for it instead, but they found extending MySQL easier.
The first thing one has to do is to drop SQL databases as the main data source.
The usual choice is to move to Cassandra. It has built-in sharding AND backup AND multi-master replication AND multi-datacenter support, AND performance scales linearly with the number of servers.
The [only] other option is Elasticsearch (which has slightly different properties regarding data format and data consistency).
You don't need to craft complex custom sharded distributed WTF software to abstract hundreds of clusters of hundreds of databases. Use the right tool for the job, one that has that built in.
A quick google gives these stats for Uber:
- 8 million users
- 160 thousand drivers
- 1 million daily rides
- 2 billion rides so far
The biggest data, as stated in , is the trip info. They are storing the trip info as a 20kB JSON blob. If you add a custom data type  for a trip to PostgreSQL and encode it efficiently (binary, using deltas), you should be able to do much better than the ~3kB they get by using messagepack and zlib, say 1kB.
All that data should easily fit into 4TB. At one million rides per day, one reasonably beefy RDBM server should be able to easily handle that. You can partition the trips table and move the trips that are older than say, 1 month, to a slower disk to save money on disks. That helps with schema migration too: new data uses the new schema, and you use a view on the old data to adapt to the new schema.
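A rough sketch of that layout, using PostgreSQL 9.x-style inheritance partitioning (all names, the tablespace path, and the compact payload encoding are placeholders, not Uber's actual schema):

```sql
-- Older partitions live on cheaper storage via a separate tablespace.
CREATE TABLESPACE slow_disks LOCATION '/mnt/slow/pgdata';

CREATE TABLE trips (
    trip_id    bigint      NOT NULL,
    started_at timestamptz NOT NULL,
    payload    bytea       NOT NULL   -- compact binary encoding instead of a 20 kB JSON blob
);

-- Current month on fast storage; old months sit on the slow tablespace.
CREATE TABLE trips_2016_08 (CHECK (started_at >= '2016-08-01' AND started_at < '2016-09-01'))
    INHERITS (trips);
CREATE TABLE trips_2016_01 (CHECK (started_at >= '2016-01-01' AND started_at < '2016-02-01'))
    INHERITS (trips) TABLESPACE slow_disks;

-- Schema changes apply to new partitions; a view adapts old partitions to the new shape.
ALTER TABLE trips_2016_08 ADD COLUMN surge jsonb;
CREATE VIEW trips_current_schema AS
    SELECT trip_id, started_at, payload, NULL::jsonb AS surge FROM trips_2016_01
    UNION ALL
    SELECT trip_id, started_at, payload, surge       FROM trips_2016_08;
```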
I very much doubt you need anything much fancier than that. It would help if people really learned their tools (in this case PostgreSQL) instead of jumping to a different technology at the first problem they run into.
You're gonna hit all kind of limits with the RDBMS software and the special hardware that will be required.
Cassandra will handle the sharding automatically, it will have multiple instances with one always available for your applications, it will handle replication across datacenters all around the world, it will have dependable performance that scales linearly with the specs you give it, it will allow you to do maintenance while online, and it will let you add, remove, and refresh nodes.
A bigger RDBMS has none of these qualities. It's a one-trick pony that will die when either the hardware or the software reaches a limit, and then your whole site will be down. Even if you know what needs to be done to avoid the disaster, you can't do it, because the RDBMS is a SPOF and every maintenance you perform is called "downtime".
Short term = Postgres/MySQL, because it's easier and it gets the job done. Long term = Cassandra, because it's dependable.
Storage capacity is also growing exponentially: Samsung is shipping a 15TB SSD (albeit at $10000), Seagate has previewed a 60TB SSD.
> You're gonna hit all kind of limits with the RDBMS software and the special hardware that will be required.
The limits of RDBMSs are well understood. You don't need any fancy hardware for this use case: a couple of Xeons, as much RAM as you can afford, and a RAID of SSDs (or possibly spinning disks; it doesn't look like Uber is doing anything too fancy).
SPOF: you have slave replicas running that can take over if something goes wrong with the master.
You don't have to take down an RDBM to do maintenance. DDL statements are transactional in PostgreSQL.
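i.e. a schema migration can run inside a single transaction and either fully commit or fully roll back, for example:

```sql
-- Transactional DDL: if anything in this migration fails, none of it is applied.
BEGIN;
ALTER TABLE trips ADD COLUMN driver_rating smallint;
UPDATE trips SET driver_rating = 5 WHERE driver_rating IS NULL;   -- illustrative backfill
CREATE INDEX trips_driver_rating_idx ON trips (driver_rating);
COMMIT;
```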
The automatic sharding of Cassandra is nice, when you need it. Of course, Uber's use case seems like it lends itself to easy geographic sharding when you're using an RDBM, if needed.
In the end, I'd rather deal with a mature, well-understood technology like an RDBM compared to a 5-year-old technology like Cassandra (release 1.0 in 2011). You obviously prefer the opposite. To each their own.
Don't get me wrong. I know vertical scaling and I've done it before. I'd take an old school DBA who understands Oracle over a random junior speaking only NoSQL to everything.
For the majority of use cases (including where I am now), it's easier to pick the right technology (Cassandra) even if we have to learn and later teach it around, than it is to find someone who can really do 10TB PostgreSQL and spreads the knowledge.
Of course, if you have extensive experience with PostgreSQL, that may skew the choice heavily to the other direction ;)
I have more experience with PostgreSQL than Cassandra, for sure. And I've had negative experiences with people trying to push Cassandra where it was totally not appropriate (small problem and no need for high availability). They had no experience with Cassandra themselves beyond watching a video and doing some tutorials and couldn't answer basic questions about the underlying technology. That might be skewing my perception too.
And at 2300 instances, it makes sense that there would be a few of them dying at any given time.
With this setup, using MySQL is pretty much arbitrary; any database could fill the role as well. E.g., it got me thinking that Redis+Sentinel could do the job too (if they don't need super-durable transactions).
This looks like an amazing tool.