
Failing with MongoDB - lenn0x
http://blog.schmichael.com/2011/11/05/failing-with-mongodb/
======
nomongo
Why is a database that fails so easily, and most of the time even loses data, so
popular? Is it really all just a huge marketing budget?

~~~
viraptor
If I never expect the dataset to grow past 1GB and a single server, why would
I use anything else? It doesn't really fail: none of the issues described
were actual "failures". [edit: just to be clear, it didn't crash and burn; I
don't think performance issue == failure] The data loss was not confirmed
either: "There _appears_ to be some data loss occurring", and in small
deployments you can just use the transaction log.

There's no other project I know of that provides: schemaless JSON documents,
indexing on any part of them, server-side mapreduce, lots of connectors for
different languages, and atomic updates on part of a document. If there is one
and it's better than Mongo, I'd switch in a moment.
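
Most of that list is a few lines of pymongo (a rough sketch; collection and
field names here are made up):

    from pymongo import MongoClient

    # a schemaless collection of user documents
    users = MongoClient().test.users
    users.insert_one({"_id": 1, "name": "ada", "logins": 0, "tags": ["admin"]})

    # atomic update of part of the document: only the named fields change
    users.update_one({"_id": 1}, {"$inc": {"logins": 1},
                                  "$push": {"tags": "beta"}})

    # secondary index on any part of the documents, including array members
    users.create_index("tags")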

~~~
cscotta
>> "It doesn't really fail - none of the issues described were "failures"
really."

These absolutely were failures.

The author listed several instances in which the database became unavailable,
the vendor-supplied client drivers refused to communicate with it, or both.
Some of these scenarios included the primary database daemon crashing,
secondaries failing to return from a "repairing" state to an "online" state
after a failure (and thus being unable to serve operations in the cluster),
and configuration servers failing to propagate shard config to the rest of the
cluster -- which required taking down the entire database cluster to repair.

Each of the issues described above would result in extended application
downtime (or at best highly degraded availability), the full attention of an
operations team, and potential lost revenue. The data loss concern is also
unnerving. In a rapidly-moving distributed system, it can be difficult to pin
down and identify the root cause of data loss. However, many techniques such
as implementing counters at the application level and periodically sanity-
checking them against the database can at minimum indicate that data is
missing or corrupted. The issues described do not appear to be related to a
journal or lack thereof.
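
A minimal sketch of that counter technique, assuming pymongo and illustrative
names:

    from pymongo import MongoClient

    events = MongoClient().test.events
    expected = 0  # application-side write counter

    def record(doc):
        global expected
        events.insert_one(doc)
        expected += 1

    def sanity_check():
        # periodically compare the app-side counter with what the DB holds;
        # a shortfall means something was silently dropped
        stored = events.count_documents({})
        if stored < expected:
            print("possible data loss: %d stored vs %d written"
                  % (stored, expected))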

Further, the fact that the database's throughput is limited to utilizing a
single core of a 16-way box due to a global write lock demonstrates that even
when ample IO throughput is available, writes will be stuck contending for the
global lock, while all reads are blocked. Being forced to run multiple
instances of the daemon behind a sharding service on the same box to achieve
any reasonable level of concurrency is embarrassing.

On the "1GB / small dataset" point, keep in mind that Mongo does not permit
compactions and read/write operations to occur concurrently. As documents are
inserted, updated, and deleted, what may be 1GB of data will grow without
bound in size, past 10GB, 16GB, 32GB, and so on until it is compacted in a
write-heavy scenario. Unfortunately, compaction also requires that nodes be
taken out of service. Even with small datasets, the fact that they will
continue to grow without bound in write/update/delete-heavy scenarios until
the node is taken out of service to be compacted further compromises the
availability of the system.
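
For the record, the out-of-service routine looks roughly like this (a sketch
built on real MongoDB commands, but the hostname and collection are made up
and the exact procedure varies by version):

    from pymongo import MongoClient

    # connect directly to one secondary
    node = MongoClient("secondary-1.example.com")

    # take the node out of rotation, compact, then let it rejoin;
    # compact blocks operations on that node the whole time it runs
    node.admin.command("replSetMaintenance", True)
    node.test.command("compact", "events")
    node.admin.command("replSetMaintenance", False)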

What's unfortunate is that many of these issues aren't simply "bugs" that can
be fixed with a JIRA ticket, a patch, and a couple rounds of code review --
instead, they reach to the core of the engine itself. Even with small
datasets, there are very good reasons to pause and carefully consider whether
or not your application and operations team can tolerate these tradeoffs.

~~~
rbranson
Just to be 100% clear -- so people don't misunderstand your explanation of
Mongo's compaction: Mongo does have a free-space map that it uses to attempt
to fit new data or resized documents into "holes" left by deleted data.
However, compaction will still eventually have to be run, as the data will
continue to fragment and eventually things get bad.

------
gojomo
Maybe there's a niche for "PostgreNoSQL", a layer atop Postgres that you start
using like a NoSQL solution. (Perhaps, it's string keys and JSON blob values.)
It's not very efficient, except for simple keyed lookups, but it works enough
for a quick start.

Then, as you use it, the system optimizes itself (or makes suggestions) based
on actual access patterns. Could a subset of objects become a formal, indexed
table? Have that happen automatically, or offer the SQL as a suggestion.
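
The naive first layer is tiny. A sketch in Python with psycopg2 (all names
and the DSN are illustrative):

    import json
    import psycopg2

    conn = psycopg2.connect("dbname=test")
    cur = conn.cursor()
    cur.execute("""
        CREATE TABLE IF NOT EXISTS kv (
            key   text PRIMARY KEY,
            value text NOT NULL  -- a JSON blob, opaque to Postgres for now
        )
    """)
    conn.commit()

    def put(key, obj):
        # update-then-insert upsert; fine for a sketch, racy under load
        cur.execute("UPDATE kv SET value = %s WHERE key = %s",
                    (json.dumps(obj), key))
        if cur.rowcount == 0:
            cur.execute("INSERT INTO kv (key, value) VALUES (%s, %s)",
                        (key, json.dumps(obj)))
        conn.commit()

    def get(key):
        cur.execute("SELECT value FROM kv WHERE key = %s", (key,))
        row = cur.fetchone()
        return json.loads(row[0]) if row else None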

~~~
i34159
Conversely, you could have a NoSQL layer below Postgres, where PG stores and
indexes metadata that tells it which of many small NoSQL dbs to find the
actual data in. These data dbs can then be sharded/replicated across physical
systems as you like. You lose some raw speed on reads, but avoid a global
write lock, and the system scales quite well. I've started playing around with
such a system with <https://github.com/cloudflare/SortaSQL>

------
christkv
Seems to me they used the wrong setup: they should have looked at a replica
set with secondaries for reads, and sharding if they needed more write
performance and non-blocking reads. That said, version 2 has fewer locking
problems, and I understand they are working on finer-grained locking.

~~~
schmichael
Sorry, this is a pretty poorly written blog post. We're definitely using
sharding+replica sets.

Replication of any kind won't help you with a high write load as secondaries
have to apply the same number of writes as primaries.

~~~
christkv
They seem to be very aware of the problem and focused on solving it as soon as
possible. I guess it's just a matter of time. Compared to how long it took
MySQL to mature into a stable platform, I've been pretty impressed by their
responsiveness and quick improvements so far :).

~~~
christkv
Seems from the comments on the post that 10gen went out of its way to be
helpful in resolving the issues?

~~~
schmichael
Yes. The only thing that would be better is if these issues didn't exist to
begin with.

------
StavrosK
All I need is a schemaless version of Postgres (with ACID compliance and
everything); does anyone know of one?

~~~
ericflo
<http://www.postgresql.org/docs/9.0/static/hstore.html>
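
For example, with psycopg2 (a rough sketch; it assumes the hstore contrib
module is already installed in the database, and all names are made up):

    import psycopg2
    import psycopg2.extras

    conn = psycopg2.connect("dbname=test")
    psycopg2.extras.register_hstore(conn)  # hstore values <-> Python dicts
    cur = conn.cursor()

    cur.execute("CREATE TABLE profiles (id serial PRIMARY KEY, attrs hstore)")
    cur.execute("INSERT INTO profiles (attrs) VALUES (%s)",
                ({"name": "alice", "plan": "free"},))
    conn.commit()

    # query on arbitrary keys with no schema: '?' tests key existence
    cur.execute("SELECT attrs -> 'name' FROM profiles WHERE attrs ? 'plan'")
    print(cur.fetchone())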

~~~
StavrosK
That's very useful, thank you!

~~~
rkalla
Keep in mind this isn't meant to be redis-on-postgresql:
<http://archives.postgresql.org/pgsql-performance/2011-05/msg00238.php>

------
lucian1900
Sadly, MongoDB blows for actual usage. It locks, it's not crash-only, it has
mutable data.

CouchDB is much better (you're as likely to lose data as with Postgres), but
is potentially less efficient (no BSON).

------
bbulkow
Disclosure: I wrote a product called Citrusleaf, which also plays in the NoSQL
space.

My focus in starting Citrusleaf wasn't features, it was operational
dependability. I had worked at companies that had to take their systems
offline right when they had the greatest exposure - like getting massive load
from the Yahoo front page (back in the day). Citrusleaf focuses on monitoring,
integration with monitoring software, and operations. We call ourselves a
real-time database
because we've focused on predictable performance (and very high performance).

We don't have as many features as Mongo. You can't do a JavaScript/JSON
long-running batch job. We'll get to features.

The global R/W lock does limit mongo. Absolutely. Our testing shows a nearly
10x difference in performance between Mongo and Citrusleaf on writes. Frankly,
if you're still doing 1,000 tps, you should probably stick with a decent MySQL
implementation.

Here's a performance analysis we did: <http://bit.ly/rRlq9V>

This theory that "mongo is designed to run on in-memory data sets" is,
frankly, terrible -- simply because mongo doesn't give you the _control_ to
keep your data in memory. You don't know when you're going to spill out of
memory. There's no way to "timeout" a page cache IO. There's no asynchronous
interface for page IO. For all of these reasons -- and because our internal
testing shows page IO is 5x slower than aio, which is why all professional
databases use aio and raw devices -- we coded Citrusleaf using normal
multithreaded IO strategies.

With Citrusleaf, we do it differently, and that difference is huge. We keep
our indexes in memory, and our indexes are the most efficient anywhere. You
configure Citrusleaf with the amount of memory you want to use, and apply
policies for when you start overflowing that memory. Like not taking writes.
Like expiring the least-recently-used data.

That's an example of our focus on operations. If your application's use
pattern changes, you can't have your database go down, or go so slowly as to
be nearly unusable.

Again, take my comments with a grain of salt, but with Citrusleaf you'll have
better uptime, fewer servers, and a far less complex installation. Sure, it's
not free, but talk to us and we'll find a way to make it work for your
project.

------
t3mp3st
Disclosure: I hack on MongoDB.

I'm a little surprised to see all of the MongoDB hate in this thread.

There seems to be quite a bit of misinformation out there: lots of folks seem
focused on the global R/W lock and how it must lead to lousy performance.

In practice, the global R/W lock isn't optimal -- but it's really not a big
deal.

First, MongoDB is designed to be run on a machine with sufficient primary
memory to hold the working set. In this case, writes finish extremely quickly
and therefore lock contention is quite low. Optimizing for this data pattern
is a fundamental design decision.

Second, long-running operations (e.g., ones about to page out) cause the
MongoDB kernel to yield the lock. This prevents slow operations from screwing
the pooch, so to speak. It's not perfect, but it smooths over many problematic
cases.

Third, the MongoDB developer community is EXTREMELY passionate about the
project. Fine-grained locking and concurrency are areas of active development.
The allegation that features or patches are withheld from the broader
community is total bunk; the team at 10gen is dedicated, community-focused,
and honest. Take a look at the Google Group, JIRA, or disqus if you don't
believe me: "free" tickets and questions get resolved very, very quickly.

Other criticisms of MongoDB concerning in-place updates and durability are
worth looking at a bit more closely. MongoDB is designed to scale very well
for applications where a single master (and/or sharding) makes sense. Thus,
the "idiomatic" way of achieving durability in MongoDB is through replication
-- journaling comes at a cost that can, in a properly replicated environment,
be safely factored out. This is merely a design decision.

Next, in-place updates allow for extremely fast writes, provided a correctly
designed schema and an aversion to document-growing updates (e.g., $push). If
you meet these requirements -- or select an appropriate padding factor --
you'll enjoy high performance without having to garbage-collect old versions
of data or store more data than you need. Again, this is a design decision.
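
To make the distinction concrete, a pymongo sketch (names are made up):

    from pymongo import MongoClient

    events = MongoClient().test.events

    # fixed-size document: $inc/$set rewrite it in place, no growth
    events.insert_one({"_id": 42, "count": 0, "last": None})
    events.update_one({"_id": 42}, {"$inc": {"count": 1},
                                    "$set": {"last": "click"}})

    # growing update: each $push enlarges the document and can force the
    # engine to relocate it -- the case a padding factor tries to absorb
    events.update_one({"_id": 42}, {"$push": {"log": "click"}})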

Finally, it is worth stressing the convenience and flexibility of a schemaless
document-oriented datastore. Migrations are greatly simplified and generic
models (e.g., product or profile) no longer require a zillion joins. In many
regards, working with a schemaless store is a lot like working with an
interpreted language: you don't have to mess with "compilation" and you enjoy
a bit more flexibility (though you'll need to be more careful at runtime).
It's worth noting that MongoDB provides support for dynamic querying of this
schemaless data -- you're free to ask whatever you like, indices be damned.
Many other schemaless stores do not provide this functionality.
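
For example (again a pymongo sketch with made-up fields):

    from pymongo import MongoClient

    users = MongoClient().test.users

    # ad-hoc queries over schemaless documents, indexed or not
    users.find({"profile.city": "Portland", "logins": {"$gt": 10}})
    users.find({"tags": "beta"}).sort("logins", -1).limit(10)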

Regardless of the above, if you're looking to scale writes and can tolerate
data conflicts (due to outages or network partitions), you might be better
served by Cassandra, CouchDB, or another master-master/NoSQL/fill-in-the-blank
datastore. It's really up to the developer to select the right tool for the
job and to use that tool the way it's designed to be used.

I've written a bit more than I intended to but I hope that what I've said has
added to the discussion. MongoDB is a neat piece of software that's really
useful for a particular set of applications. Does it always work perfectly?
No. Is it the best for everything? Not at all. Do the developers care? You
better believe they do.

------
nomoremongo
I'd appreciate it if someone would submit this story for me.

<http://pastebin.com/raw.php?i=FD3xe6Jt>

------
patrickod
Wow, there's a lot of Mongo hate in this thread, all from one article.
Yesterday MongoDB was the darling of HN and today it has to be defended from
ridiculous claims. Why the mob attitude? Have you all had these issues?

------
vegai
All the commercial DBs have similar issues. Just deal with them and go on.

~~~
dextorious
No, they do not. Some joke DBs had some issues back in the day (MySQL comes to
mind), but issues of such importance were solved looong ago.

~~~
vegai
No, all of them had, and most still do. People don't seem to realize how
freaking old and complicated those things are.

------
plasma
RavenDB (<http://www.ravendb.net>) is a solid competitor.

~~~
icey
"Raven is an Open Source (with a commercial option) document database for the
.NET/Windows platform."

I'm not sure it's a competitor at all. RavenDB is a CouchDB clone for .Net
that requires a commercial license for proprietary software.

~~~
latch
Which has a ton of magic baked into the driver, making it unlikely you'll get
your data back out via anything but .NET.

------
amalag
If your data is easily modeled relationally, go for relation; if you are going
to change it constantly and it's not a natural fit for a relational model,
MongoDB is worth a shot.

From this article, it sounds like their data is pretty seriously relational.

MongoDB has been pushing the ops side of their product, but I can agree it has
failings there. To me the advantage is the querying and the JSON-style
documents.

~~~
gmcquillan
I'm not sure you read the article fully, because relationships were never
described in it. Instead, it was high read/update load that caused problems.

Mongo, on paper, should be an ideal candidate for this job; but due to
complications with the locking model and its inability to do online
compaction, it's failing.

~~~
amalag
"Relation" was a bad word choice; I meant easily modeled by a relational
database system. Seems like your data can be modeled with fixed columns.

I had to model data with umpteen crazy relationships, so we went with MongoDB.
We did not have the high-update issue or any locking issues. If one has a few
large tables with fixed columns that can easily define the data, then
relational DBs probably make more sense. But to your point, 10gen will not
tell you that, and the hype doesn't tell you that either.

