Your data is the single most important thing to your company. A couple of days of testing is insufficient to validate such a change there.
You should be paranoid, and conservative.
Don't make unnecessary changes. When you do have to make changes, test everything thoroughly.
Testing thoroughly means more than just functional testing of your code and benchmarking.
At the very least, it means standing up a second db cluster and mirroring production queries to both for weeks, so you can see precisely how things will behave and, just as importantly, confirm that there has been no data loss.
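Rough sketch of what that dual-cluster check looks like (all the names here are made up; the two lambdas stand in for clients of the old and new clusters):

```python
# Sketch: mirror each production read to both clusters and diff the results.
# A non-empty mismatch list means the clusters disagree -- possible data loss.

def mirror_and_compare(queries, query_old, query_new):
    """Run every query against both clusters; return the mismatches."""
    mismatches = []
    for q in queries:
        old_result = query_old(q)
        new_result = query_new(q)
        if old_result != new_result:
            mismatches.append((q, old_result, new_result))
    return mismatches

# Toy stand-ins for the two clusters:
old_data = {"a": 1, "b": 2, "c": 3}
new_data = {"a": 1, "b": 2}          # simulated data loss: "c" is missing

diffs = mirror_and_compare(
    ["a", "b", "c"],
    lambda k: old_data.get(k),
    lambda k: new_data.get(k),
)
print(diffs)  # [('c', 3, None)] -- the new cluster lost a document
```

The point isn't the code, it's that the comparison runs continuously against real traffic, for weeks, before you trust the new cluster.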
It is always so tempting to use shiny shiny in tech. It's always exciting, especially when it promises to make life better. It requires discipline, though, to temper that enthusiasm and approach production environments with appropriate care. You cannot afford to mess up there.
Bluntly speaking, you got very lucky this time. You, and your customers, can't rely on you being lucky.
"There are only two really hard problems in computing..."
And before the Mongo haters come out, remember that WiredTiger was written by about as stellar a database team as you can have.
Most storage engines are designed without a true I/O scheduler (WiredTiger appears to be such a case), mostly because it requires a huge amount of expert code that doesn't pay off for narrow use cases. The caveat is that it is difficult-to-impossible to design a storage engine that has very high performance and is well behaved under diverse workloads without a proper I/O scheduler.
Which side of the "generalized good behavior" vs. "very high performance" tradeoff a storage engine falls on depends on the original goals of the developer, and many storage engines explicitly optimize for excellent behavior under diverse loads over raw performance (PostgreSQL is one example). WiredTiger was marketed as "very high performance" but is now being used under increasingly diverse workloads that expose this tradeoff. Without making changes that sacrifice performance, behavioral edge cases are largely unavoidable; you can move them around but not completely eliminate them.
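To make "true I/O scheduler" concrete, here's a toy deadline-based scheduler (nothing like the real thing, which is far more involved): it reorders pending requests so a background bulk write can't starve a latency-sensitive read.

```python
import heapq

# Toy illustration of why an I/O scheduler matters under mixed workloads:
# service pending requests in deadline order, not arrival order.

class DeadlineIOScheduler:
    def __init__(self):
        self._heap = []
        self._seq = 0  # tie-breaker so the heap never compares payloads

    def submit(self, deadline, request):
        heapq.heappush(self._heap, (deadline, self._seq, request))
        self._seq += 1

    def next_request(self):
        return heapq.heappop(self._heap)[2]

sched = DeadlineIOScheduler()
sched.submit(deadline=50, request="bulk-write-chunk")
sched.submit(deadline=5,  request="point-read")     # tight latency budget
sched.submit(deadline=20, request="range-scan")

order = [sched.next_request() for _ in range(3)]
print(order)  # ['point-read', 'range-scan', 'bulk-write-chunk']
```

An engine without this just issues I/O in whatever order the workload generates it, which is exactly where the behavioral edge cases come from.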
This remains, in my opinion, the most important difference between open and closed source storage engines in practice. In closed source, most of your high-performance storage engines implement true I/O schedulers, usually by the same few specialists that float around between companies.
One of the things I like about the original MongoDB approach is that very little configuration was required to tune the database; instead you just configure the OS so the FS cache and scheduling best match your workload. At first glance this may seem lazy, but kernel engineers have been solving this problem pretty well for years. Also, misconfiguring one or both systems confuses end users: they have to understand the internals of both the OS and the DB, and make sure the two don't conflict.
Implementing proper scheduling (not just I/O, but CPU) isn't so much difficult as it is a matter of giving your end users the right tools to understand how configuring the application impacts performance versus a kernel configuration. For example, conflicting trade-offs between the OS FS cache size and the DB's buffer cache size could easily result in a system using less than half of the available RAM for useful purposes.
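Back-of-envelope version of that double-caching problem (illustrative numbers only):

```python
# If the DB buffer cache and the OS FS cache both hold the same hot pages,
# the overlap is RAM spent twice on one copy of the data.

def useful_ram(db_cache_gb, fs_cache_gb, overlap_fraction):
    """GB of cache doing distinct useful work, given what fraction of the
    DB cache is duplicated in the FS cache."""
    duplicated = db_cache_gb * overlap_fraction
    return db_cache_gb + fs_cache_gb - duplicated

# 128 GB box, both caches sized at ~60 GB, caching the same pages:
print(useful_ram(60, 60, overlap_fraction=1.0))  # 60.0 GB useful out of 128
```

Worst case, the two caches fully shadow each other and less than half the machine's RAM is doing useful work, exactly as described above.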
Anyway, I think your assessment is mostly spot on, except the part about closed source engines. I'm also not sure I totally agree with Postgres being much better than other database systems by default, but I do think it's a solid piece of software that does a good job with configuration values.
I've worked on 3 database systems in my career. One was closed source, and it was my least favorite. But that had more to do with the design decision to run in the JVM, which adds a 3rd level of configuration complexity...
Anyway, just my $0.02.
The above scenario can easily happen if an end user's application doesn't take all resources into account (e.g. a web application that accepts as many requests as possible, as fast as possible, and opens as many database connections to MongoDB as possible). In that situation, if MongoDB can't keep up, the application logic may keep accepting HTTP requests and generate even more DB requests. If this was indeed the case, mongos would have been the bottleneck, hence the SIGTERMs.
The crux of the actual problem they're having with wired tiger may have been due to a misunderstanding of performance expectations or configuration.
EDIT: it appears to have been a bug in this case; not the expectations or configuration I initially implied. My point about application design still stands; an HTTP 500 message is much better than submitting more requests to an already overloaded DB.
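A sketch of that fail-fast application design (everything here is simulated, no real DB or HTTP stack): cap in-flight DB work, and shed load with a 500 instead of queueing more requests onto an overloaded database.

```python
import threading

MAX_INFLIGHT = 2
db_slots = threading.BoundedSemaphore(MAX_INFLIGHT)

def handle_request(run_db_call):
    """Admission control: refuse work when all DB slots are busy."""
    if not db_slots.acquire(blocking=False):
        return 500  # shed load: better a fast 500 than piling on the DB
    try:
        run_db_call()
        return 200
    finally:
        db_slots.release()

# Simulate saturation: both slots held by in-flight requests.
db_slots.acquire(); db_slots.acquire()
overloaded_response = handle_request(lambda: None)
print(overloaded_response)  # 500: overloaded, don't queue more DB work
db_slots.release()
ok_response = handle_request(lambda: None)
print(ok_response)          # 200 once a slot frees up
```

With an unbounded pool, the same traffic just keeps stacking DB requests until something falls over.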
Comment from the author:
> We looked for an OOM in our logs but couldn't find it. Also, the log lines are from mongos nodes. Data nodes continued to run with severely reduced throughput.
MongoDB WT (and 3.2 especially) really moved Mongo to the next level, and it's really weird to hear that it took so long to upgrade to WT. We moved a system to WT from MMAPv1 in the early 3.0 days (and to 3.2 later) and saw improvements in read/write 99.9th percentile latency, general ROI on instance utilization, and QPS, with snappy compression in particular bringing an X0 TB deployment down to a more reasonable scale as far as storage goes. This is not a crazy system, but it sees peaks of 100k QPS while jobs calculate and a constant load of 50k QPS, and WT handles that beautifully.
As mentioned before, this is a bug that got a number of fixes upstream (though 3.2.10 came out just yesterday), but one of the only reasonable ways to hit that load was to insert/update a lot of documents on a very slow underlying disk. Test your workloads before upgrading -- but there is something severely wrong with your deployment/query planning if 600 QPS per server is causing cache issues, especially on r3.4xlarge underlying hardware.
Workload QPS is a meaningless number unless you know what the workload is. At Parse, we used RocksDB (an LSM-tree, write-optimized storage engine) to get something stupid like 9x faster writes. I don't know where you're seeing 600 QPS ... (we got 100x that when optimizing for writes) ... but it still doesn't mean anything unless you know the size of the documents, whether they're inserts, queries, or updates, whether they're hot-spotting on a row, whether the lock is at the collection, document, database, or global level, etc.
If you're going to switch storage engines, you have to use some sort of capture-replay technology or you cannot have any confidence in the change. Use what we wrote at Parse: https://github.com/ParsePlatform/flashback. We spent ~6 months teasing out bugs in rocksdb, mongodb, the 2.6 query planner rewrite, etc before we even started to make the switch.
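The shape of a capture-replay check, reduced to a toy (flashback itself records and replays real mongod traffic; this just shows the idea): replay a captured op log against the candidate engine and verify its final state matches the old engine's.

```python
# Toy capture-replay: apply a recorded op log to a fresh "engine"
# (a dict here) and compare the result against the old engine's state.

def replay(oplog, store):
    for op, key, value in oplog:
        if op in ("insert", "update"):
            store[key] = value
        elif op == "delete":
            store.pop(key, None)
    return store

captured_ops = [
    ("insert", "u1", {"n": 1}),
    ("update", "u1", {"n": 2}),
    ("insert", "u2", {"n": 5}),
    ("delete", "u2", None),
]

old_engine_state = {"u1": {"n": 2}}       # state the old engine ended in
new_engine_state = replay(captured_ops, {})
print(new_engine_state == old_engine_state)  # True: the replay is faithful
```

The real version replays weeks of production traffic, not four ops, and diffs query results as well as final state.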
If you don't do that, you don't get to complain about the results. Sorry.
>To put things in context, we do approximately ~18.07K operations/minute on one primary server. We have two shards in the cluster so two primary server making it ~36.14K operations/minute.
36,140 / 60 is roughly 600.
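Spelled out, since the units are the whole dispute:

```python
reported_per_minute = 36_140              # figure as originally published
per_second = reported_per_minute / 60
print(round(per_second))                  # 602 -- "roughly 600 QPS"
```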
Confusingly, the opcounter screenshot at the top of the blog shows something close to 3k updates / second with a peak of about 4.5k, though it is not explained. I left a more detailed comment on the clevertap blog regarding the kind of operations they are using.
When you talk about benchmarking and high-level metrics like QPS, there are thresholds beyond which you can say "you're most likely doing something wrong, because you should not be hitting cache limit issues".
The reason you will move to WT is because of snappy, generally higher 99.9th percentile performance, higher throughput, much better document based locks, etc.
Replaying real data is wonderful and great for preparing for a migration like that. MMAPv1 to WT was a much smoother transition for us than what I imagine MMAPv1 to Rocks would be, though the team at FB including Mark Callaghan has made incredible progress on it.
The problems we had moving from mmapv1 to rocksdb were almost entirely based on the rewrite of the query planner in 2.6, not the storage engine itself.
The WT storage engine is far less mature than RocksDB. Only a very tiny amount of code in the Mongo+Rocks integration is actually new code, it's just the thin layer of glue between mongo and the storage engine API.
It was a tricky migration because we made the decision to do it early while so much code was still new in mongo for that version. We wanted to help give feedback on the storage engine API because we had pushed so hard for that feature, and because we were pushing the boundaries of what the mmap engine could ever reasonably be expected to do.
(Side note: OMG, anyone who measures their request rate in anything other than per-second is putting up a giant red flag that they don't know what they're doing.)
RocksDB is great, WT is getting to be great, both are getting better and have solid teams working on them. LSMs have great use cases, btrees too. 600 QPS you do not blog about.
Firstly, the original article mentioned 18K operations/minute per node. This was a publishing mistake. We actually do 18K operations/second (the graph on the original post has the numbers as reported by MMS). Sorry about this :(
We've posted a couple of updates on our blog, in case you're interested in following this through.
And yes, we did file a bug report with Mongo.
In my experience, MongoDB and MySQL performance are usually on par. It's important to understand what the operations are before making any kind of comparison.
That said, I totally understand why this was your first thought. The article doesn't even state if the operations are queries, updates or inserts, let alone how many indexes or scans are performed...
To be fair, this is part of Mongo's pitch: come use a database, you don't need DBAs or people who know things about databases. Which is always gonna be kind of a lie. So they invite this on themselves.
I hate when I get reduced to sounding like a mongo apologist, but it's their own fault for not following any best practices with data.
Hopefully, people will have figured out by now that MongoDB is not ready for production.
Hopefully people will stop trolling like this. And hopefully I'll stop reading and replying to them -- like it will make a difference.
Was mysql "ready for production" when google used it? facebook? Lol.
Storage engines are hard. I'm boggled that they recommend that you keep the old replica around for ten days. When we did the MMAPv1 migration to RocksDB at Parse, we kept at least one MMAP replica around for 6-9 months, as we worked through the integration of Mongo + rocksdb ... even though RocksDB the storage engine is powering fucking Facebook.
I get so fucking annoyed at the Hacker News armchair brigade that wants to talk shit about MongoDB, when they don't know a goddamn thing about databases -- or, from the looks of it, Facts About Stateful Services In General.
Actually, yes, yes it was. It had been used in production environments for quite a while by the time they came to use it. It was stable, reliable and had known performance characteristics and replication behaviour.
We moved to DynamoDB and MySQL, and now we sleep great :)
This article is very insightful, though, and I will postpone switching to WiredTiger until a stable version is released.
I'm with @clevertap
I say this because most posts about issues we experience don't end up on HN, and as a result the powers that be either don't hear us, or by the time we've logged issues and gotten the problems resolved, there's no longer enough time to spend another week migrating again to WT.
If the issue isn't resolved by 3.2.10 as mentioned elsewhere, I hope it gets resolved soon so you can get that 7x throughput improvement.
What does wiredtiger deliver for you that sqlite doesn't?