
Sleepless nights with MongoDB WiredTiger and our return to MMAPv1 - kiyanwang
https://blog.clevertap.com/sleepless-nights-with-mongodb-wiredtiger-and-our-return-to-mmapv1/
======
Twirrim
If you're going to be making such a drastic change, you ought to be doing a
_lot_ more testing than was done here.

Your data is the single most important thing to your company. A couple of days
of testing is insufficient to validate a change of this magnitude.

You should be paranoid, and conservative.

Don't make unnecessary changes. When you do have to make changes, test
everything thoroughly.

Testing thoroughly means more than just functional testing of your code and
benchmarking.

At the very least, it means standing up a second db cluster and sending
production queries through to both for weeks, so you can see precisely how
things will behave and, just as importantly, so you can confirm that there has
been no data loss.
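
A minimal sketch of that shadow-traffic idea, assuming pymongo; the hostnames,
database, and collection names below are hypothetical, and real setups usually
mirror traffic at the proxy or driver layer rather than in application code:

    # Serve from production, mirror reads to a candidate cluster, and log
    # divergence instead of failing the request.
    from pymongo import MongoClient

    primary = MongoClient("mongodb://prod-mmapv1:27017")["appdb"]["events"]
    candidate = MongoClient("mongodb://shadow-wt:27017")["appdb"]["events"]

    def shadow_find(query):
        # Sort by _id on both sides so the comparison is order-independent.
        expected = list(primary.find(query).sort("_id", 1))
        try:
            got = list(candidate.find(query).sort("_id", 1))
            if got != expected:
                print("MISMATCH for %r: %d vs %d docs"
                      % (query, len(expected), len(got)))
        except Exception as exc:  # the shadow path must never hurt production
            print("shadow error for %r: %s" % (query, exc))
        return expected

Run the real query mix through both clusters for weeks; the point is that any
divergence shows up in a log, not in front of customers.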

It is always so tempting to use the shiny shiny in tech. It's always exciting,
especially when it promises to make life better. It requires discipline,
though, to temper that enthusiasm and approach production environments with
appropriate care. You cannot afford to mess up there.

Bluntly speaking, you got _very_ lucky this time. You, and your customers,
can't rely on you being lucky.

------
vtomasr5
I think they fixed it in the MongoDB 3.2.10 release:
[https://jira.mongodb.org/browse/SERVER-25974](https://jira.mongodb.org/browse/SERVER-25974)

~~~
juanroy
With the 3.2.10 release, this will not be a problem anymore.

------
yid
So the root cause is that WiredTiger locks up and SIGTERMs when it fills the
cache? If this is indeed the cause, I must say this does shake my faith in
WiredTiger. That's a pretty basic scenario that a company like 10gen should be
testing for regularly, certainly before releases.

And before the Mongo haters come out, remember that WiredTiger was written by
about as stellar a database team as you can have.

~~~
jandrewrogers
This definitely should be caught by any reasonable testing regimen for a
database but the underlying issue appears (upon casual inspection of the
architecture) to be something different.

Most people design storage engines without a true I/O scheduler (WiredTiger
appears to be such a case), mostly because it requires a huge amount of expert
code work that doesn't pay off for narrow use cases. The caveat is that it is
difficult-to-impossible to design a storage engine that has very high
performance _and_ is well behaved under diverse workloads without a proper I/O
scheduler.

The side of "generalized good behavior" or "very high performance" tradeoff a
storage engine falls on depends on the original goals of the developer and
there are many such storage engines that explicitly optimize for excellent
behavior under diverse loads over performance (PostgreSQL is such an example).
In the case of WiredTiger, it was marketed as "very high performance" but is
now being used under increasingly diverse workloads that exposes this
tradeoff. Without making changes that saddle performance, behavioral edge
cases are largely unavoidable; you can move them around but not completely
eliminate them.
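
For illustration only, here is a toy sketch of the kind of mechanism being
described: a scheduler that caps how much in-flight I/O background work
(checkpoints, cache eviction) may consume, so foreground requests stay well
behaved under load. Every name here is invented, and a real I/O scheduler is
vastly more involved:

    # Toy workload-aware I/O scheduler (illustrative; not WiredTiger's design).
    import heapq
    import itertools

    FOREGROUND, BACKGROUND = 0, 1   # lower value = higher priority

    class ToyIOScheduler:
        def __init__(self, max_inflight=32, background_share=0.25):
            self.queue = []                  # heap of (priority, seq, request)
            self.seq = itertools.count()     # tie-breaker for equal priorities
            self.inflight = 0
            self.bg_inflight = 0
            self.max_inflight = max_inflight
            self.bg_budget = int(max_inflight * background_share)

        def submit(self, request, priority):
            heapq.heappush(self.queue, (priority, next(self.seq), request))

        def dispatch(self):
            # Issue requests in priority order, but never let background I/O
            # exceed its budget and starve foreground reads and writes.
            issued, deferred = [], []
            while self.queue and self.inflight < self.max_inflight:
                prio, seq, req = heapq.heappop(self.queue)
                if prio == BACKGROUND and self.bg_inflight >= self.bg_budget:
                    deferred.append((prio, seq, req))
                    continue
                self.inflight += 1
                if prio == BACKGROUND:
                    self.bg_inflight += 1
                issued.append(req)
            for item in deferred:           # re-queue throttled background work
                heapq.heappush(self.queue, item)
            return issued

        def complete(self, priority):
            self.inflight -= 1
            if priority == BACKGROUND:
                self.bg_inflight -= 1

The point is not the details but the existence of a policy layer between the
engine and the disk; without one, a checkpoint burst and a latency-sensitive
read are indistinguishable to the OS.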

This remains, in my opinion, the most important difference between open and
closed source storage engines in practice. In closed source, most of your
high-performance storage engines implement true I/O schedulers, usually by the
same few specialists that float around between companies.

~~~
vrtx0
I partially agree here, but want to throw in the configuration aspect. In
order for any database to be optimized for all workloads, it either has to
adapt with minimal impact (something I haven't seen successfully implemented
yet, but that may only be a matter of time), or it has to be told how to
behave (in other words, it has to be configured properly).

One of the things I like about the original MongoDB approach is that very
little configuration was required to tune the database; instead you just
configure the OS so the FS cache and scheduling best match your workload. At
first glance this may seem lazy, but kernel engineers have been solving this
problem pretty well for years. Also, misconfiguring one or both systems
results in a lot of confusion for the end user; they must understand the
internals of both the OS and the DB, and make sure the two don't conflict.

Implementing proper scheduling (not just I/O, but CPU) isn't so much difficult
as it is a matter of ensuring your end users have the right tools to
understand how configuring the application impacts performance versus a kernel
configuration. For example, conflicting trade-offs between the OS FS cache
size and the DB's buffer cache size could easily result in a system using less
than half of the available RAM for useful purposes.
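
A hypothetical back-of-the-envelope version of that failure mode (the numbers
are invented, and default cache formulas vary by database and version):

    # Double caching: the DB buffer cache and the OS page cache can end up
    # holding the same hot blocks.
    ram_gb = 64
    db_cache_gb = 0.5 * ram_gb                # DB buffer cache: 50% of RAM
    os_cache_gb = ram_gb - db_cache_gb - 4    # page cache gets most of the rest

    # Worst case: the page cache duplicates the DB cache's working set, so the
    # distinct data actually cached is only the larger of the two.
    distinct_gb = max(db_cache_gb, os_cache_gb)
    print("%.0f GB of %d GB RAM caching distinct data" % (distinct_gb, ram_gb))
    # -> 32 GB of 64 GB RAM caching distinct data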

Anyway, I think your assessment is mostly spot on, except the part about
closed source engines. I'm also not sure I totally agree with Postgres being
much better than other database systems by default, but I do think it's a
solid piece of software that does a good job with configuration values.

I've worked on 3 database systems in my career. One was closed source, and it
was my least favorite. But that had more to do with the design decision to run
in the JVM, which adds a 3rd level of configuration complexity...

Anyway, just my $0.02.

------
diziet
I'm not trying to critique the company's engineering, but the workloads and
challenges they are facing are not really high in QPS. The company says they
get about 600 QPS, and these are database QPS, not even service QPS! No matter
what kind of operations you're doing, correct query and index planning should
serve you OK. Looking at the graph, there are a lot of updates going on, so I
would say GridFS is out of the question.

MongoDB WT (and 3.2 especially) really moved Mongo to the next level, and it's
really weird to hear that it took so long to upgrade to WT. We moved a system
to WT from MMAPv1 in the early 3.0 days (and to 3.2 later) and saw
improvements in read/write 99.9th percentile latency, general ROI on instance
utilization, and QPS, with snappy compression especially bringing an X0 TB
deployment down to a more reasonable scale as far as storage goes. This is not
a crazy system, but peaks of 100k QPS while jobs calculate and a constant load
of 50k QPS are very straightforward, and WT responds to them beautifully.

As mentioned before, this is a bug that got a number of fixes upstream (though
3.2.10 came out just yesterday), but one of the only reasonable ways to hit
that behavior was to insert/update a lot of documents on a very slow
underlying disk. Test your workloads before upgrading -- but there is
something severely wrong with your deployment/query planning if 600 QPS per
server is causing cache issues, especially given r3.4xlarge underlying
hardware.

~~~
spimmy
I have plenty of rants about the company's engineering, priorities, etc. I
just hate people saying dumb shit. :)

Workload QPS is a meaningless number unless you know what the workload is. At
Parse, we used RocksDB (an LSM-tree, write-optimized storage engine) to get
something stupid like 9x faster writes. I don't know where you're seeing 600
QPS ... (we got 100x that if we were optimizing for writes) ... but it still
doesn't mean anything unless you know the size of the documents, whether it's
inserts, queries, or updates, whether it's hot spotting on a row, whether the
lock is at the collection, document, database, or global level, etc.

If you're going to switch storage engines, you _have_ to use some sort of
capture-replay technology or you cannot have any confidence in the change.
Use what we wrote at Parse:
[https://github.com/ParsePlatform/flashback](https://github.com/ParsePlatform/flashback).
We spent ~6 months teasing out bugs in rocksdb, mongodb, the 2.6 query planner
rewrite, etc before we even _started_ to make the switch.
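
Flashback records real production ops and replays them. As a minimal sketch of
the capture-replay idea (this is not flashback's actual interface, and the
file format here is invented), assuming captured ops serialized one JSON
object per line:

    # Replay captured operations against a candidate cluster and report
    # latency, so its behavior can be compared against the current engine.
    import json
    import time
    from pymongo import MongoClient

    coll = MongoClient("mongodb://candidate:27017")["appdb"]["events"]

    def replay(path):
        latencies = []
        with open(path) as f:
            for line in f:
                op = json.loads(line)  # e.g. {"type": "query", "filter": {...}}
                start = time.time()
                if op["type"] == "query":
                    list(coll.find(op["filter"]))
                elif op["type"] == "insert":
                    coll.insert_one(op["doc"])
                latencies.append(time.time() - start)
        latencies.sort()
        p99 = latencies[int(len(latencies) * 0.99)]
        print("replayed %d ops, p99 %.1f ms" % (len(latencies), p99 * 1000))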

If you don't do that, you don't get to complain about the results. Sorry.

~~~
diziet
Spimmy, I am not sure about the context of what you're saying, though I agree
with the substance and feedback of your comments. I am basing the 600 QPS on
this comment on the blog post linked in the article:

>To put things in context, we do approximately ~18.07K operations/minute on
one primary server. We have two shards in the cluster so two primary server
making it ~36.14K operations/minute.

36,140 / 60 is roughly 600.

Confusingly, the opcounter screenshot at the top of the blog shows something
close to 3k updates / second with a peak of about 4.5k, though it is not
explained. I left a more detailed comment on the clevertap blog regarding the
kind of operations they are using.

When you talk about benchmarking and high-level metrics like QPS, there are
thresholds below which you can say "you're most likely doing something wrong,
as you should not be hitting cache limit issues".

The reason you will move to WT is because of snappy, generally higher 99.9th
percentile performance, higher throughput, much better document based locks,
etc.

Replaying real data is wonderful and great for preparing for a migration like
that. MMAPv1 to WT was a much smoother transition for us than what I imagine
MMAPv1 to Rocks would be, though the team at FB including Mark Callaghan has
made incredible progress on it.

~~~
spimmy
I'm agreeing with your comment, not arguing with it.

The problems we had moving from mmapv1 to rocksdb were almost entirely based
on the rewrite of the query planner in 2.6, not the storage engine itself.

The WT storage engine is far less mature than RocksDB. Only a very tiny amount
of code in the Mongo+Rocks integration is actually new code; it's just the
thin layer of glue between mongo and the storage engine API.

It was a tricky migration because we made the decision to do it _early_ while
so much code was still new in mongo for that version. We wanted to help give
feedback on the storage engine API because we had pushed so hard for that
feature, and because we were pushing the boundaries of what the mmap engine
could ever reasonably be expected to do.

(Side note: OMG, anyone who measures their request rate in anything other than
per-second is putting up a giant red flag that they don't know what they're
doing.)

~~~
diziet
Right, I knew there was a misunderstanding initially over which company I was
referring to (CleverTap, not 10gen/MongoDB), and we were mostly on the same
page :)

RocksDB is great, WT is getting to be great, both are getting better and have
solid teams working on them. LSMs have great use cases, btrees too. 600 QPS
you do not blog about.

------
francispereira
I am the author of the original blog post -
[https://blog.clevertap.com/sleepless-nights-with-mongodb-wir...](https://blog.clevertap.com/sleepless-nights-with-mongodb-wiredtiger-and-our-return-to-mmapv1/)
- and I handle the DevOps at CleverTap. I wanted to make a few clarifications
about the article:

Firstly, the original article mentioned 18K operations/minute per node. This
was a publishing mistake. We actually do 18K operations/second (the graph in
the original post has the numbers as reported by MMS). Sorry about this :(

We've posted a couple of updates on our blog, in case you're interested in
following this through.

And yes, we did file a bug report with Mongo.

------
jwatte
18k operations "per minute?" You can do that on MS Access. Why use MongoDB at
all then? We do thousands of transactions per second on mysql with innodb and
SSDs.

~~~
vrtx0
Because an operation may be a query for a single document by an indexed field,
or it could be an FTS across billions of records, each with thousands of
terms.

In my experience, MongoDB and MySQL performance are usually on par. It's
important to understand what the operations are before making any kind of
comparison.

That said, I totally understand why this was your first thought. The article
doesn't even state whether the operations are queries, updates, or inserts,
let alone how many indexes or scans are performed...

~~~
spimmy
The article was written by people who don't understand very basic things about
databases.

To be fair, this is part of Mongo's pitch: come use a database, you don't need
DBAs or people who know things about databases. Which is always gonna be kind
of a lie. So they invite this on themselves.

I hate when I get reduced to sounding like a mongo apologist, but it's their
own fault for not following any best practices with data.

------
jwatte
Another option when upgrading something fundamental is to run a dark service
on a fork of real-time load for a bit, to make sure it can take it. Given they
were on EC2 and only had two storage servers, that seems like something I'd
like to read about when they try again!

------
user5994461
One more war story about bad design decisions in MongoDB resulting in
unresponsive nodes and data loss.

Hopefully, people will have figured out by now that MongoDB is not ready for
production.

~~~
spimmy
No, you're being dumb. :)

Was mysql "ready for production" when google used it? facebook? Lol.

Storage engines are hard. I'm boggled that they recommend that you keep the
old replica around for ten days. When we did the MMAPv1 migration to RocksDB
at Parse, we kept at least one MMAP replica around for 6-9 months, as we
worked through the integration of Mongo + rocksdb ... even though RocksDB the
storage engine is powering fucking _Facebook_.

I get so fucking annoyed at the Hacker News armchair brigade that wants to
talk shit about MongoDB, when they don't know a goddamn thing about databases
-- or, from the looks of it, Facts About Stateful Services In General.

Eyeroll.

~~~
Twirrim
> Was mysql "ready for production" when google used it? facebook? Lol.

Actually, yes, yes it was. It had been used in production environments for
quite a while by the time they came to use it. It was stable, reliable and had
known performance characteristics and replication behaviour.

------
bogomipz
Regarding the throughput degradation, it sounds like maybe they were hitting
the limit of the configured read/write tickets in the WiredTiger engine. This
article talks about those settings as well as the cache settings:

[http://www.developer.com/db/tips-for-mongodb-wiredtiger-perf...](http://www.developer.com/db/tips-for-mongodb-wiredtiger-performance-tuning.html)
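
Both signals are visible in serverStatus. A quick way to check, assuming
pymongo and the field names MongoDB 3.x exposes:

    # Inspect WiredTiger ticket usage and cache pressure via serverStatus.
    from pymongo import MongoClient

    status = MongoClient("mongodb://localhost:27017").admin.command("serverStatus")
    wt = status["wiredTiger"]

    for kind in ("read", "write"):
        t = wt["concurrentTransactions"][kind]
        print("%s tickets: %d in use of %d" % (kind, t["out"], t["totalTickets"]))

    cache = wt["cache"]
    used = cache["bytes currently in the cache"]
    limit = cache["maximum bytes configured"]
    dirty = cache["tracked dirty bytes in the cache"]
    print("cache %.0f%% full, %.0f%% dirty"
          % (100.0 * used / limit, 100.0 * dirty / limit))

If available tickets are pinned at zero, or the cache sits near 100% full with
a high dirty ratio, eviction is falling behind the write load.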

------
movielala
Same here, sleepless nights.

We moved to DynamoDB and MySQL, which leads to great sleep :)

------
prashant10
Guys, I have used CleverTap in two of my apps and was sad to see the lag in my
traffic. But now, because of this blog, I know why there was a lag. Anyway,
you guys are doing some really awesome work, and very well done I must say.

This article is very insightful, though, and I will postpone switching to
WiredTiger till a stable version is released.

Thanks

------
helloanand
We got heard by the powers that be at MongoDB and are now working with them to
resolve the issue.

I'm with @clevertap.

~~~
nevi-me
I thought you would have reported the issue before, or at the same time as,
the blog post. My opinion is that it sometimes works out better to get in
touch with the software's developers first, in case there is a known issue
and/or workaround.

I say this because most posts about issues we experience don't end up on HN,
and as a result the powers that be either don't hear us, or by the time we've
logged issues and gotten the problems resolved, there's no longer enough time
to spend another week migrating again to WT.

If the issue isn't resolved by 3.2.10 as mentioned elsewhere, I hope it gets
resolved soon so you can get that 7x throughput improvement.

~~~
helloanand
Yes, we did report the issue to MongoDB. They've acknowledged that the issue
does exist, and are working with us to repro/resolve.

------
amirouche
FWIW I've been using WiredTiger as an SQLite replacement and it works great in
this context.

~~~
beagle3
What about sqlite needs replacement?

What does wiredtiger deliver for you that sqlite doesn't?

~~~
amirouche
Honestly, I prefer the WiredTiger API. Otherwise, I can only say that it's
practical over a 50 GB dataset, that is all.
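
For anyone curious what that looks like, a minimal sketch of WiredTiger as an
embedded key-value store through its Python bindings (usage per the WiredTiger
docs; the table name and config values here are arbitrary):

    # WiredTiger as an embedded store, no MongoDB involved.
    import os
    from wiredtiger import wiredtiger_open

    os.makedirs("wt_home", exist_ok=True)  # the home directory must exist
    conn = wiredtiger_open("wt_home", "create,cache_size=1GB")
    session = conn.open_session()
    session.create("table:kv", "key_format=S,value_format=S")

    cursor = session.open_cursor("table:kv")
    cursor.set_key("hello")
    cursor.set_value("world")
    cursor.insert()

    cursor.set_key("hello")
    if cursor.search() == 0:               # 0 means the key was found
        print(cursor.get_value())          # -> "world"

    cursor.close()
    conn.close()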

