
Broken by Design: MongoDB Fault Tolerance - brohee
http://hackingdistributed.com/2013/01/29/mongo-ft/ 
======
enigma1510
"Broken by bad logic"

The post rests on the 'straw man' informal fallacy, i.e. standing up an absurd
scenario in order to knock it down.

Mongo provides write concerns for several use cases. Those concerns are:

NONE, NORMAL, SAFE, FSYNC_SAFE, REPLICAS_SAFE

The author focuses on the write concern NONE - a write concern that is
provided for the 'fire and forget' use case. To choose NONE and expect
guarantees and fault tolerance is simply user error.
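
For illustration, here's roughly how those modes get selected in the legacy
(2.x-era) Java driver; this is just a sketch, with placeholder host, database,
and collection names:

```java
import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;
import com.mongodb.MongoClient;
import com.mongodb.WriteConcern;

public class WriteConcernModes {
    public static void main(String[] args) throws Exception {
        MongoClient client = new MongoClient("localhost");
        DBCollection events = client.getDB("test").getCollection("events");

        // NONE / NORMAL: fire and forget, no per-write error checking.
        events.insert(new BasicDBObject("k", 0), WriteConcern.NONE);
        events.insert(new BasicDBObject("k", 1), WriteConcern.NORMAL);

        // SAFE: wait for the primary to acknowledge the write.
        // FSYNC_SAFE: additionally force the write to disk before acknowledging.
        events.insert(new BasicDBObject("k", 2), WriteConcern.SAFE);
        events.insert(new BasicDBObject("k", 3), WriteConcern.FSYNC_SAFE);

        // REPLICAS_SAFE: wait until a second replica set member has the write.
        events.insert(new BasicDBObject("k", 4), WriteConcern.REPLICAS_SAFE);

        client.close();
    }
}
```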

~~~
emin_gun_sirer
Something tells me that you did not read the article.

~~~
enigma1510
Something tells me you did not read the manual.

The post focuses on the least safe mode (NONE) and barely touches on
REPLICAS_SAFE at the end of the article, leaving the reader to wait for "a
future blog post".

Choose the tool and configuration for the job. For a Mongo user who needs
fault tolerance and consistency guarantees, I would recommend (based on the
documentation) choosing a configuration that attempts to provide them.

The safest server configuration would be a multi-node replica set (3 or 5)
that spans multiple data centers (preferably 3 or more). The safest client
configuration would be a majority write concern (REPLICAS_SAFE).
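
A rough sketch of what that client side could look like with the legacy Java
driver; the seed hosts are placeholders, and MAJORITY is the driver value that
actually expresses "a majority of the set must acknowledge" (REPLICAS_SAFE
only waits for two members):

```java
import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;
import com.mongodb.MongoClient;
import com.mongodb.ServerAddress;
import com.mongodb.WriteConcern;
import java.util.Arrays;

public class MajorityWrite {
    public static void main(String[] args) throws Exception {
        // Seed list spanning the replica set members (placeholder hostnames).
        MongoClient client = new MongoClient(Arrays.asList(
                new ServerAddress("dc1.example.com", 27017),
                new ServerAddress("dc2.example.com", 27017),
                new ServerAddress("dc3.example.com", 27017)));

        // Require acknowledgement from a majority of the set before returning.
        client.setWriteConcern(WriteConcern.MAJORITY);

        DBCollection orders = client.getDB("shop").getCollection("orders");
        orders.insert(new BasicDBObject("sku", "abc").append("qty", 1));

        client.close();
    }
}
```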

If you have examples of Mongo not providing fault tolerance in this setup, the
world would be a better place if you shared.

~~~
emin_gun_sirer
I not only read but _quoted_ the manual, in parts of the article that you
evidently skipped. I also described why SAFE and FSYNC_SAFE aren't. I also
described why getLastError doesn't work when pipelined or multithreaded.

It's hard writing for the "tl;dr"-generation (as this exchange demonstrates --
there was a time when it used to be embarrassing to have someone say the same
thing twice to you in a discussion).

I'll have a separate blog entry on why REPLICAS_SAFE is broken, and how the
setup you outlined can lose data.

------
fields
Ironically, whatever this is linking to is throwing a 500 error.

~~~
willlll
That isn't irony.

~~~
d23
A post criticizing MongoDB for being "broken" goes down? I believe it is.

~~~
brohee
But the article criticizes MongoDB for doing little, if anything, against data
loss. Was any data lost during this outage that I didn't see? ;-)

------
pbbakkum
This is the 3rd or 4th time I've seen this article in the past few days so
I've decided to post my take. I work with Mongo in a production scenario but
I'm hesitant to post because these things tend to turn into a pointless
argument. So let me stress: this is not an attack on the blog post, I'm hoping
to improve the discussion here. Here goes:

\- "It lies." Mongo used to have a default where a driver would fire off a
write and not check that it succeeded with the server. This was very obviously
a decision made to improve benchmark performance, though I imagine a benchmark
with only default settings would be rather naive. Regardless, yes, the default
was a stupid corporate decision but it is well known and should be apparent if
you're deploying a Mongo cluster. Additionally, as the author notes, this
default has changed and this entire point is no longer a concern.
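
If you don't want to depend on whichever default your driver version happens
to ship with, you can state the write concern explicitly when building the
client. A minimal sketch with the legacy Java driver (the hostname is a
placeholder):

```java
import com.mongodb.MongoClient;
import com.mongodb.MongoClientOptions;
import com.mongodb.ServerAddress;
import com.mongodb.WriteConcern;

public class ExplicitDefault {
    public static void main(String[] args) throws Exception {
        // Don't rely on the driver's built-in default: say what you want.
        MongoClientOptions opts = MongoClientOptions.builder()
                .writeConcern(WriteConcern.ACKNOWLEDGED) // wait for the server's reply
                .build();
        MongoClient client = new MongoClient(new ServerAddress("localhost"), opts);
        // ... every write through this client now checks for an error reply.
        client.close();
    }
}
```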

\- "Its Slow." A real point he raises is that its a little wonky you need to
send a separate message for getLastError. I suspect this is an artifact of
Mongo's historical internal structure. [http://docs.mongodb.org/meta-
driver/latest/legacy/mongodb-wi...](http://docs.mongodb.org/meta-
driver/latest/legacy/mongodb-wire-protocol) . If you look at the wire
protocol, I think it is designed such that only the OP_QUERY and OP_GET_MORE
message types get an OP_REPLY back. getLastError is a command, which are run
through an OP_QUERY message. He notes that using this check affects
performance. It does, but lets dig into this:

"Using this call requires doubling the cost of every write operation." If the
author benchmarked this, I suspect he would find that it vastly more than
doubles the latency of a single write operation from the client's perspective.
My understanding is that when performing this kind of safe write, the driver
sends an OP_INSERT, for which it doesn't have to wait for a reply, then
immediately sends an OP_QUERY message (getLastError), on which it hangs
waiting for the OP_REPLY. In other words, this is now slower because we've
created a synchronous operation immediately after just firing off the insert
command. Again, its a little wonky that we send Mongo two messages, but one is
immediately after the other and that is vastly overshadowed by the fact that
we now have to wait for a reply. I believe 1 synchronous send and receive is
unavoidable to ensure a safe write in ANY system, and the argument about
sending 2 messages really boils down to sending about 200 bytes over a socket
vs 400 bytes, I personally don't worry about it.
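
To make the two-message pattern concrete, here is roughly what a "safe" write
decomposes into if you issue the getLastError step by hand; this is a sketch
against the legacy Java driver (normally you would just set an acknowledged
write concern and the driver does this pairing for you):

```java
import com.mongodb.BasicDBObject;
import com.mongodb.CommandResult;
import com.mongodb.DB;
import com.mongodb.MongoClient;
import com.mongodb.WriteConcern;

public class TwoMessageWrite {
    public static void main(String[] args) throws Exception {
        MongoClient client = new MongoClient("localhost");
        DB db = client.getDB("test");

        db.requestStart();  // pin this thread to one socket (see next point)
        try {
            // Message 1: the insert, fired off without waiting for any reply.
            db.getCollection("events")
              .insert(new BasicDBObject("k", 1), WriteConcern.NONE);

            // Message 2: getLastError, a command carried in an OP_QUERY; this
            // is the only point where the client blocks on an OP_REPLY.
            CommandResult gle = db.getLastError();
            gle.throwOnError();  // surfaces whatever error the insert produced
        } finally {
            db.requestDone();
        }
        client.close();
    }
}
```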

\- "It doesn't work pipelined / it doesn't work multithreaded." He is also
missing the real complexity here, and this is where I have a problem with this
blog post because this is not a theoretical discussion, if its content is true
then it should just be proven rather than FUD launched into the world. As
noted in the docs (
[http://docs.mongodb.org/manual/reference/command/getLastErro...](http://docs.mongodb.org/manual/reference/command/getLastError/)
), getLastError applies for the socket over which the command was run, so its
up to the driver to execute the getLastError on the same socket as the write,
which is an implementation detail and a solved problem. The way the drivers do
this in practice is you set the write concern and the driver takes care of the
rest. If you run getLastError manually then it depends on the driver, but for
Java the correct procedure is addressed at
<http://docs.mongodb.org/ecosystem/drivers/java-concurrency/> . So for the
fastest possible safe performance, you multithread (which is effectively
pipelining from the server's perspective) and run operations with the driver's
thread-safe connection pool. Suffice to say people actually use these drivers
in a multithreaded context in the real world, and they work.
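
A rough sketch of that pattern with the legacy Java driver (the thread count
and names are placeholders): one shared MongoClient, the write concern set
once, and the driver's connection pool keeping each write and its
acknowledgement on the same socket:

```java
import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;
import com.mongodb.MongoClient;
import com.mongodb.WriteConcern;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ThreadedWrites {
    public static void main(String[] args) throws Exception {
        // One MongoClient per process; it is thread-safe and pools connections.
        final MongoClient client = new MongoClient("localhost");
        client.setWriteConcern(WriteConcern.ACKNOWLEDGED);

        final DBCollection events = client.getDB("test").getCollection("events");
        ExecutorService pool = Executors.newFixedThreadPool(8);
        for (int i = 0; i < 10000; i++) {
            final int n = i;
            pool.submit(new Runnable() {
                public void run() {
                    // Acknowledged write: the driver pairs the insert and its
                    // confirmation on one pooled socket and checks the reply.
                    events.insert(new BasicDBObject("n", n));
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        client.close();
    }
}
```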

\- "WriteConcerns are broken." There are several relevant settings here,
including write concern acknowledgement and fsync, the author is confused
about how these map to the Java driver WriteConcern enum values. The
acknowledgement setting (elegantly stored in a variable named "w", thanks
10gen) is the number of replicas that must confirm the write before the driver
believes it has succeeded. I personally set this to 1, but you could
potentially wait for the entire replica set to acknowledge. The fsync setting
is whether or not this acknowledgement means that the server on these machines
has completed the write in memory, or actually synced the data to disk. I set
this to false for performance. There is an excellent StackOverflow answer on
Java Mongo driver configuration at
[http://stackoverflow.com/questions/6520439/how-to-
configure-...](http://stackoverflow.com/questions/6520439/how-to-configure-
mongodb-java-driver-mongooptions-for-production-use) .
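
For concreteness, a sketch of how those settings look as legacy Java driver
WriteConcern values; the numbers just mirror the settings described above, not
a recommendation:

```java
import com.mongodb.WriteConcern;

public class WriteConcernSettings {
    public static void main(String[] args) {
        // w = 1: one member (the primary) must acknowledge the write.
        // wtimeout = 0: no time limit on waiting for that acknowledgement.
        // fsync = false: acknowledged once applied in memory, not forced to disk.
        WriteConcern oneAckInMemory = new WriteConcern(1, 0, false);

        // Waiting for the whole replica set instead would be w = <set size>,
        // e.g. w = 3 on a three-member set, or WriteConcern.MAJORITY for a quorum.
        WriteConcern wholeThreeMemberSet = new WriteConcern(3, 0, false);

        System.out.println(oneAckInMemory + " / " + wholeThreeMemberSet);
    }
}
```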

The author also spends time noting that if you only ensure the write succeeds
on a single machine, and you irreparably lose that machine before replication,
then the data is lost. This is obviously true for every distributed system.

I've used a 4-shard Mongo cluster in a production multithreaded environment that
handled several hundred million writes. For part of this period our cluster
was extremely unstable because of a serious data corrupting Mongo bug (more on
this in a second). I haven't done a full audit but based on our logs, in about
6 months I've seen exactly 1 write go missing, so I'm personally not concerned
about the things mentioned in the blog post. I've also been happy with
performance as long as the data size stays under available memory. If your
data exceeds memory, Mongo essentially falls on its face, though this is hard
to avoid when a query requires disk access.

Mongo is not without its problems, however. As I mentioned, QA is a real
concern: we hit a subtle bug when we upgraded to v2.2 that caused data
corruption when profiling was turned on and the cluster was under high load.
It was very difficult to debug, and it really should have been caught by 10gen
before the release.

Another serious problem is that sharding configuration is still somewhat
immature, and it seems like every new release is described as "well it used to
be bad, we finally fixed it". Here is an example: you pick a shard key that
can't be split into small enough chunks, and now shard balancing silently
fails. OK, so you pick a better shard key, but you can't change an existing
collection's shard key, so you have to drop the collection and start again.
Except dropping a
collection distributed across a cluster is buggy, so you can't recreate the
collection with the same name and a different shard key. So you just pick a
new name for your collection, and you have this weird broken state for the
original one sitting around forever unless you completely blast your cluster
and start from scratch. This sort of thing is not fun!

Mongo has many pros and cons; personally, I think its real advantage is
simplicity for developers, which makes it worth putting up with the other
stuff. Sorry for being long-winded; hopefully this has been useful.

