If you're going to be making such a drastic change, you ought to be doing a lot more testing than was done here.
Your data is the single most important thing to your company. A couple of days of testing is insufficient to validate such a change there.
You should be paranoid, and conservative.
Don't make unnecessary changes. When you do have to make changes, test everything thoroughly.
Testing thoroughly means more than just functional testing of your code and benchmarking.
At the very least, it means standing up a second db cluster and having production queries going through to both for weeks so you can see precisely how things will behave, and also just as importantly so you can confirm that there has been no data loss.
It is always so tempting to use shiny shiny in tech. It's a always exciting, especially when it promises to make life better. It requires discipline, though, to quench that enthusiasm and approach production environments with appropriate care. You can not afford to mess up there.
Bluntly speaking, you got very lucky this time. You, and your customers, can't rely on you being lucky.
So the root cause is that WiredTiger locks up and SIGTERMs when it fills the cache? If this is indeed the cause, I must say this does shake my faith in WiredTiger. That's a pretty basic scenario that a company like 10gen should be testing for regularly, certainly before releases.
And before the Mongo haters come out, remember that WiredTiger was written by about as stellar a database team as you can have.
This definitely should be caught by any reasonable testing regimen for a database but the underlying issue appears (upon casual inspection of the architecture) to be something different.
Most people design storage engines without a true I/O scheduler (WiredTiger appears to be such a case), mostly because it requires a huge amount of expert code work that doesn't payoff for narrow use cases. The caveat is that it is difficult-to-impossible to design a storage engine that has very high performance and is well behaved under diverse workloads without a proper I/O scheduler.
The side of "generalized good behavior" or "very high performance" tradeoff a storage engine falls on depends on the original goals of the developer and there are many such storage engines that explicitly optimize for excellent behavior under diverse loads over performance (PostgreSQL is such an example). In the case of WiredTiger, it was marketed as "very high performance" but is now being used under increasingly diverse workloads that exposes this tradeoff. Without making changes that saddle performance, behavioral edge cases are largely unavoidable; you can move them around but not completely eliminate them.
This remains, in my opinion, the most important difference between open and closed source storage engines in practice. In closed source, most of your high-performance storage engines implement true I/O schedulers, usually by the same few specialists that float around between companies.
I partially agree here, but want to throw in the configuration aspect. In order for any database to be optimized of all workloads, it either has to adapt with minimal impact (something I haven't seen successfully implemented yet, but may only be a matter of time), or it has to be told how to behave (in other words, it has to be configured properly).
One of the things I like about the original MongoDB approach is that very little configuration was required to tune the database; instead you just configure the OS so the FS cache and scheduling best matches your workload. At first glance this may seem lazy, but kernel engineers have been solving this problem pretty well for years. Also, misconfiguring one or both systems results in a lot of confusion of the end-user; they must understand internals of an OS and DB, and make sure they don't conclict.
Implementing proper scheduling (not just IO, but CPU) isn't so much difficult as it is ensuring your end-users have the right tools to understand how configuring an application impacts performance vs. a kernel configuration. For example, conflicting trade-offs between the OS FS cache size and the DB's buffer cache size could easily result in a system using less than half of the available ram for useful purposes.
Anyway, I think your assessment is mostly spot on, except the part about closer source engines. I'm also not sure I totally agree with Postgres being much better than other database systems by default, but I do think it's a solid piece of software that does a good job with configuration values.
I've worked on 3 database systems in my career. One was closed source, and it was my least favorite. But that had more to do with the design decision to run in the JVM, which adds a 3rd level of configuration complexity...
Not quite; the author states the MongoS nodes were SIGGERM'd. This is unrelated to WiredTiger (which runs in MongoD), and more likely caused by the Linux OOM killer or another resource limit being exceeded (like the number of threads or sockets MongoS tries to open). This can happen if the application is configured with larger connection pools than the limits imposed upon MongoS; e.g by calling ulimit).
The above scenario can easily happen if an end user's application doesn't take all resources into account (e.g. a web applicstion that accepts as many requests as possible and as fast as possible, and opens as many database connections to MongoDB as possible). In that situation, if MongoDB can't keep up, the application logic may keep accepting HTTP requests and generate even more DB requests. If this was indeed the case, MongoS would have been the bottleneck, thus SIGTERMs.
The crux of the actual problem they're having with wired tiger may have been due to a misunderstanding of performance expectations or configuration.
EDIT: it appears to have been a bug in this case; not the expectations or configuration I initially implied. My point about application design still stands; an HTTP 500 message is much better than submitting more requests to an already overloaded DB.
I don't know... look at the bug report and the fix [1]. This does look like a bug in WiredTiger. It preallocates all of its cache, so it doesn't look like raw capacity problems. It looks like the internal MongoD issues pushed the problems downstream to the individual MongoS instances.
Comment from the author:
> We looked for an OOM in our logs but couldn't find it. Also, the log lines are from mongos nodes. Data nodes continues to run with severly reduced throughput.
I'm not trying to critique the company's engineering, but the workloads and challenges they are facing are not really high in QPS. The company says they get about 600 QPS. These are database QPS, not even service QPS! Now no matter what kind of operations you're doing, correct query and index planning should serve you ok. Looking at the graph, there are a lot of updates going on, so I would say GridFS is out of the question.
MongoDB WT (and 3.2 especially) really moved the Mongo to the next level, and it's really weird to hear that it took so long to upgrade to WT. We moved a system to WT from MMAPv1 in the early 3.0 days (and to 3.2 later) and saw improvements in read/write 99.9th percentile latency, general ROI on instance utilization, QPS, and especially snappy compression bringing an X0 TB deployment to more reasonable scale as far as storage goes. This is not a crazy system, but peaks of 100k while jobs calculate and constant load of 50k QPS is very straightforward and WT responds beautifully to.
As mentioned before, this is a bug that got a number of fixes upstream (though 3.2.10 came out just only yesterday), but one of the only reasonable way to hit that load was to insert/update a lot of documents on a very slow underlying disk. Test your workloads before upgrading -- but there is something severely wrong going on with your deployment/query planning if 600 QPS per server is bringing cache issues, especially given r3.4xlarge underlying hardware.
I have plenty of rants about the company's engineering, priorities, etc. I just hate people saying dumb shit. :)
Workload QPS is a meaningless number unless you know what the workload is. At Parse, we used RocksDB (a LSM tree write optimized storage engine) to get something stupid like 9x faster writes. I don't know where you're seeing 600 QPS ... (we got 100x that if we were optimizing for writes) ... but it still doesn't mean anything unless you know the size of the document, whether it's inserts, queries, updates, whether it's hot spotting on a row, whether the lock is at the collection, document, database, or global, etc.
If you're going to switch storage engines, you have to use some sort of capture replay technology or you can not have any confidence in the change. Use what we wrote at Parse: https://github.com/ParsePlatform/flashback. We spent ~6 months teasing out bugs in rocksdb, mongodb, the 2.6 query planner rewrite, etc before we even started to make the switch.
If you don't do that, you don't get to complain about the results. Sorry.
Spimmy, I am not sure about the context of what you're saying though I agree with the context and feedback of your comments, but I am basing the 600 QPS based on this comment on the blog post linked in the article:
>To put things in context, we do approximately ~18.07K operations/minute on one primary server. We have two shards in the cluster so two primary server making it ~36.14K operations/minute.
36,140 / 60 is roughly 600.
Confusingly, the opcounter screenshot at the top of the blog shows something close to 3k updates / second with a peak of about 4.5k, though it is not explained. I left a more detailed comment on the clevertap blog regarding the kind of operations they are using.
When you talk about benchmarking and high level metrics like QPS, there is a level of thresholds where you can say "you're most likely doing something wrong as you should not be hitting cache limit issues".
The reason you will move to WT is because of snappy, generally higher 99.9th percentile performance, higher throughput, much better document based locks, etc.
Replaying real data is wonderful and great for preparing for a migration like that. MMAPv1 to WT was a much smoother transition for us than what I imagine MMAPv1 to Rocks would be, though the team at FB including Mark Callaghan has made incredible progress on it.
I'm agreeing with your comment, not arguing with it.
The problems we had moving from mmapv1 to rocksdb were almost entirely based on the rewrite of the query planner in 2.6, not the storage engine itself.
The WT storage engine is far less mature than RocksDB. Only a very tiny amount of code in the Mongo+Rocks integration is actually new code, it's just the thin layer of glue between mongo and the storage engine API.
It was a tricky migration because we made the decision to do it early while so much code was still new in mongo for that version. We wanted to help give feedback on the storage engine API because we had pushed so hard for that feature, and because we were pushing the boundaries of what the mmap engine could ever reasonably be expected to do.
(Side note: OMG, anyone who measures their request rate in anything other than per-second is putting up a giant red flag that they don't know what they're doing.)
Right, I knew there was a misunderstanding initially over what company I was referring to (CleverTap, not 10gen/MongoDB) and we were mostly on the same page :)
RocksDB is great, WT is getting to be great, both are getting better and have solid teams working on them. LSMs have great use cases, btrees too. 600 QPS you do not blog about.
Firstly, the original article mentioned 18K operations/minute per node. This was a publishing mistake. We actually do 18K operations/second (the graph on the original post has the numbers as reported by MMS). Sorry about this :(
We've posted a couple of updates on our blog, in case you're interested in following this through.
18k operations "per minute?"
You can do that on MS Access.
Why use MongoDB at all then?
We do thousands of transactions per second on mysql with innodb and SSDs.
Because an operation may be a query for a single document by an indexed field, or it could be an FTS across billions of records, each with thousands of terms.
In my experience, MongoDB and MySQL performance are usually on par. It's important to understand what the operations are before making any kind of comparison.
That said, I totally understand why this was your first thought. The article doesn't even state if the operations are queries, updates or inserts, let alone how many indexes or scans are performed...
The article was written by people who don't understand very basic things about databases.
To be fair, this is part of Mongo's pitch: come use a database, you don't need DBAs or people who know things about databases. Which is always gonna be kind of a lie. So they invite this on themselves.\
I hate when I get reduced to sounding like a mongo apologist, but it's their own fault for not following any best practices with data.
Another option when upgrading something fundamental is to run a dark service on a fork of real time load for a bit, to make sure it can take it.
Given they were in ECC and only had two storage servers, that seems like something I'd like to read about when they try again!
Was mysql "ready for production" when google used it? facebook? Lol.
Storage engines are hard. I'm boggled that they recommend that you keep the old replica around for ten days. When we did the MMAPv1 migration to RocksDB at Parse, we kept at least one MMAP replica around for 6-9 months, as we worked through the integration of Mongo + rocksdb ... even though RocksDB the storage engine is powering fucking Facebook.
I get so fucking annoyed at the Hacker News armchair brigade that wants to talk shit about MongoDB, when they don't know a goddamn thing about databases -- or, from the looks of it, Facts About Stateful Services In General.
> Was mysql "ready for production" when google used it? facebook? Lol.
Actually, yes, yes it was. It had been used in production environments for quite a while by the time they came to use it. It was stable, reliable and had known performance characteristics and replication behaviour.
In regards to the throughput degenerating it sounds like maybe they were hitting the limit of configured read/write tickets in the WiredTiger engine. This article talks about those settings as well as the cache settings:
Guys i have used clevertap in my 2 apps and was sad to see the lag in my traffic. But now because of this blog i know why was the lag. Anyways you guys are doing some really awesome work and very well i must say.
This article is very insightful though and i will postpone switching to wiredtiger till stable version release.
I thought you would have reported the issue before/at the same time as the blog post. My opinion's that it sometimes works out better if I get in touch with software developers first, in case there is a known issue and/or workaround.
I say this because most posts about issues we experience don't end up being on HN, and add a result the powers that be either don't hear us, or by the time we've logged issues and gotten the problems resolved; there's no longer enough time to spend another week migrating again to WT.
If the issue isn't resolved by 3.2.10 as mentioned elsewhere, I hope it gets resolved soon so you can get that 7x throughout improvement.
Your data is the single most important thing to your company. A couple of days of testing is insufficient to validate such a change there.
You should be paranoid, and conservative.
Don't make unnecessary changes. When you do have to make changes, test everything thoroughly.
Testing thoroughly means more than just functional testing of your code and benchmarking.
At the very least, it means standing up a second db cluster and having production queries going through to both for weeks so you can see precisely how things will behave, and also just as importantly so you can confirm that there has been no data loss.
It is always so tempting to use shiny shiny in tech. It's a always exciting, especially when it promises to make life better. It requires discipline, though, to quench that enthusiasm and approach production environments with appropriate care. You can not afford to mess up there.
Bluntly speaking, you got very lucky this time. You, and your customers, can't rely on you being lucky.