

MongoDB Gotchas and How To Avoid Them - ukd1
http://rsmith.co/2012/11/05/mongodb-gotchas-and-how-to-avoid-them/

======
Pewpewarrows
Very good summary of what to look out for. Here are a few others that I ran
into back when I was still entertaining the idea of using Mongo in production:

1\. The keys in Mongo documents get repeated over and over again for every
record in your collection (which makes sense when you remember that
collections don't have a db-enforced schema). When you have millions of
documents this really adds up. Consider adding an abstraction mapping layer of
short 1-2 character keys to real keys in your business logic.

2\. Mongo lies about being ready after an initial install. If you're trying to
automate bringing mongo boxes up and down, you're going to run into the case
where the mongo service says that it's ready, but in reality it's still
preparing its preallocated journal. During this time, which can take up to
5-10 minutes based on your system specs, all of your connections will just
hang and timeout. Either build the preallocated journal yourself and drop it
in place before installing mongo, or touch the file locations if you don't
mind the slight initial performance hit on that machine. (Note: not all
installs will create a preallocated journal. Mongo tries to do a mini
performance test on install to determine at runtime whether preallocating is
better for your hardware or not. There's no way to force it one way or the
other.)
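The abbreviation layer suggested in point 1 could be sketched like this (the short/long field names here are hypothetical, and a real mapper would also need to handle nested documents and query specs):

```python
# Sketch of a key-abbreviation layer; the short/long names are made up.
FIELD_MAP = {"n": "name", "e": "email", "c": "created_at"}
REVERSE_MAP = {long_key: short for short, long_key in FIELD_MAP.items()}

def shrink(doc):
    """Translate readable keys to the short keys actually stored in Mongo."""
    return {REVERSE_MAP.get(k, k): v for k, v in doc.items()}

def expand(doc):
    """Translate stored short keys back to readable keys for business logic."""
    return {FIELD_MAP.get(k, k): v for k, v in doc.items()}
```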
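For point 2, building the preallocated journal yourself might look roughly like this; the `prealloc.N` file names and sizes are assumptions about what mongod expects, so verify against your install before relying on it:

```python
import os

def preallocate_journal(journal_dir, count=3, size=1024 ** 3):
    """Sketch: pre-create journal prealloc files so mongod can skip its
    slow preallocation step on first start. Names and sizes are assumptions."""
    os.makedirs(journal_dir, exist_ok=True)
    for i in range(count):
        path = os.path.join(journal_dir, "prealloc.%d" % i)
        with open(path, "wb") as f:
            f.truncate(size)  # extends the file with zeros
```

Note that `truncate` may create sparse files on some filesystems; writing real blocks (e.g. with `dd`) is closer to what mongod itself does.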

~~~
ukd1
Thanks; #1 can be a problem for large collections due to the extra space. I
think I've seen some pre-rolled abstraction layers for this, but have never
used an open-source one myself.

#2 - usually the journal should be pretty quick to allocate - I've not
experienced this problem directly myself.

I'll add some extra bits to the bottom of the post with your notes.

~~~
rogerbinns
There is a ticket for compression of both keys and values
<https://jira.mongodb.org/browse/SERVER-164>

The Mongo team seem somewhat reluctant to implement it.

~~~
codewright
It's probably not a clear win in most use-cases.

~~~
rogerbinns
Reducing the working set is a big win. Mongo's behaviour when the working set
is larger than RAM is really bad: <https://jira.mongodb.org/browse/SERVER-574>

Compression could have drawbacks on documents that get updated frequently. But
it will be extremely useful on documents that get created and rarely/never
change, coincidentally what I mostly have.

It would also greatly help if keys are compressed or indexed in some way since
it could be done transparently.

You may recall the Mongo team being reluctant to make the database function
well in single server setups, but they did address that with journalling.

------
T-R
An excellent and practical article. I do want to emphasize one thing, though,
since I feel like the article almost seemed to downplay its significance:

 _MongoDB does not support joins; If you need to retrieve data from more than
one collection you must do more than one query ... you can generally redesign
your schema ... you can de-normalize your data easily._

This is a much larger issue than it seems - nested collections aren't first-
class objects in MongoDB - the $ operator for querying into arrays only goes
one level deep, amongst its other issues, meaning that oftentimes you _must_
break things out into separate collections. This doesn't work either, though,
as there are no cross-collection transactions, so if you need to break things
into separate collections, you can't guarantee a write to each collection will
go through properly. (Though I suppose if you're using the latest version, you
_could_ lock your whole database.)

~~~
anuraj
Absolute showstopper for anything more than keeping track of simple stats or
storing comments. Denormalizing has limits. Invariably apps come to the point
where you need to do joins, and that is when you start cursing your decision
to go with NoSQL. My experience is that NoSQL (including Mongo) is not a
replacement for a traditional RDBMS, but if you use NoSQL as a complement to
an RDBMS, primarily for real-time performance, it works quite beautifully.
That said, there may be quite a few simple web app use cases that don't need
an RDBMS at all.

~~~
taf2
It's true you almost always will find yourself needing to join across
collections/tables... I believe the recent support for integrating with
Hadoop should help when your reason for needing joins is reporting (often the
case for, say, financial reporting):
<http://www.mongodb.org/display/DOCS/Hadoop+Quick+Start>

Also, the Postgres integration (linked/discussed here on HN).

------
23david
There are some good things here, but on a systems level there are huge
oversights that are absolute showstoppers on production systems. Maybe there
is a level of Mongo proficiency above MongoDB Master? I hope so.

1) Make sure to permanently increase the hard and soft limits for Linux open
files and user processes for the MongoDB/Mongo user. If not, MongoDB will
segfault under load and when that happens, the automatic recovery process
works incredibly slowly. It's a bit tricky to get this right, depending on
your level of sysadmin knowledge. 10gen doesn't emphasize or explain the issue
very well in their docs: "Set file descriptor limit and user process limit to
4k+ (see etc/limits and ulimit)" That probably makes sense to just about 0.1%
of the people setting up MongoDB:
[http://www.mongodb.org/display/DOCS/Production+Notes#Product...](http://www.mongodb.org/display/DOCS/Production+Notes#ProductionNotes-GeneralUnixNotes)

2) Make sure to disable NUMA. This 10gen documentation note is a great example
of clear documentation: "Linux, NUMA and MongoDB tend not to work well
together ... Problems will manifest in strange ways, such as massive slow
downs for periods of time or high system cpu time." Massive slowdowns and
mysteriously pegged cpu usage on production database systems are definitely
'strange'. I would probably choose stronger and more precise language, but
10gen clearly knows what they're doing:
<http://www.mongodb.org/display/DOCS/NUMA>
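For point 1, a minimal example of what the permanent limits might look like in `/etc/security/limits.conf` (the user name and values here are assumptions; check 10gen's production notes for your setup):

```
# assumes mongod runs as user "mongodb"
mongodb  soft  nofile  64000
mongodb  hard  nofile  64000
mongodb  soft  nproc   32000
mongodb  hard  nproc   32000
```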

tl;dr If you have problems with MongoDB, you aren't using it right. Read the
documentation more carefully, and then when that doesn't work, hire an expert.

~~~
ukd1
Yep, I managed to miss a few things; I'll add these shortly.

~~~
23david
Nice work. :-) This list is getting pretty comprehensive.

~~~
ukd1
Thanks; If you think of anything more, let me know!

------
codewright
I'm one of the people that like to make fun of MongoDB from time to time, but
that's mostly from proximity producing contempt.

Nevertheless, a rundown of the gotchas and how to avoid them _based on
experience beyond simply running apt-get install mongodb_ is one of the most
useful pieces on MongoDB I've seen of late.

The only new-news for me was that SSL support isn't compiled in by default.
That's pretty irritating. I wonder if that applies just to 10gen's packages or
also to distribution provided mongodb packages.

~~~
dpeck
Disturbingly 10gen uses SSL support as a reason to use their subscriber
packages. Sure you can build it in yourself if you compile the package, but
it's disappointing that SSL support is one of the carrots that they use for
their premium offerings.

Edit: Link to applicable docs on how to compile in/use.
<http://docs.mongodb.org/manual/administration/ssl/>

~~~
rit
The reason binaries aren't distributed by default has to do with US export
controls:
[http://en.m.wikipedia.org/wiki/Export_of_cryptography_in_the...](http://en.m.wikipedia.org/wiki/Export_of_cryptography_in_the_United_States#section_2)

~~~
dm_mongodb
SSL support is in the free distribution but you must build it yourself. One
reason is export controls; another is that it creates a dependency on the SSL
library for those who don't use SSL, which we found awkward (especially when
building for all platforms; the subscriber build only covers the popular
ones).

So it's available, although there is an intent to have a subscriber build with
some extra features that are heavily enterprise-biased in their usefulness.

------
dschiptsov
Anyone else noticed a striking similarity to PHP - every feature is broken
somehow?

I think this would make a good slogan: 'We are the PHP of storage engines.'

~~~
mylittlepony
If by 'broken' you mean 'has taken over the internet'...

~~~
zalew
if by 'taken over' you mean 'pooped all over'

------
mason55
One of the more useful Mongo articles I've seen here. You might want to clear
up "You cannot shard a collection over 256G" however. The limitation is that
if you have an unsharded collection that grows over 256GB you cannot make it a
sharded collection. The way it's written now makes it sound like sharded
collections can't grow over 256GB (at least to me) which isn't true.

~~~
ukd1
Thanks - I can see how that reads wrong, updating now.

------
etrain
Good to see some constructive advice on how to configure mongo, instead of
just bashing it.

Even if it's not your favorite technology, sometimes you end up in a position
where the rest of the company is using something, and you need to work within
those constraints. It's important to understand the technologies you're
building on, their configuration options, and to understand the best practices
way of working with them.

This, by the way, is not restricted to mongo.

~~~
fredsters_s
totally agree!

------
sjtgraham
OP knows his stuff. I met him at a hackday and learnt an insane amount from
talking to him at dinner. I'm keeping this post bookmarked for reference.
Great stuff.

------
whitej
I see the "32-bit vs. 64-bit" issue appear in many rants about MongoDB. There
are two types of people that fall off the 2GB cliff: a) people who say "what
just happened... oh, I get it... 32 bits, memory-mapped files... I'll switch
to 64-bit", and b) people who say "WTF... #MongoHate... going to blog about
how @#$#%! a DB this is".

Some people understand the tools they work with. Some people know just barely
enough to throw things together and don't tolerate it when something doesn't
work out of the box. Worst of all, this second group tends to be very vocal on
the interwebs.

I'd almost like to see 10gen not publish the 32-bit package at all. Source is
still there. If you want 32-bit, cool, compile it. But forcing the user to
compile the 32-bit version assures at least a minimum bound of technical
proficiency (an "I understand what I'm doing, why it's not the default and
what the limitations are").

~~~
knightni
It's interesting, because from my observation a lot of the crowd who
popularised systems like Mongo _were_ those people who weren't willing to
think*. Learning the relational data model + tooling was too complicated. Now
Mongo has a big ol' list of caveats you ought to understand before you can
start chucking data into it too.

I'm a total RDBMS nerd, and it's amazing to me how few people truly care about
their data storage. They just want it to work - and, I suppose, it's hard to
blame them for that.

*Not that I mean to say that this is the only reason to use a NoSQL DB - it doesn't seem an uncommon one, though.

~~~
vertis
I have to admit that a lot of the joy I get from using MongoDB is during dev.

While you still have to think about your schema, it does mean that you're not
constantly writing and removing migrations (rails), while an application is
still evolving.

~~~
mylittlepony
In Symfony2 (php) migrations are created by comparing the new schema with the
old one. Is that not the case for Rails? What do you mean 'constantly writing
migrations'?

~~~
grey-area
Rails does not specify the mapping of models to database schema, so it
requires specification of migrations instead to document changes to the
database that go along with any code changes. So migrations are explicit
commands (in a pretty simple dsl) to add columns etc. spread out over many
files as the application evolves. This means each schema change requires
adding a migration file with those changes in it, rather than modifying a
master schema or mapping. There is a schema.rb file but it is created/modified
automatically.

There are trade-offs to each approach but it is probably one of the areas that
Rails could still improve by looking at other ORMs - I'd prefer to see the
schema specified along with constraints etc for each field at the top of each
model to make it explicit and self-documenting, and perhaps doing away with
migrations altogether.

------
ianrose
"However due to the way voting works with MongoDB, you must use an odd number
of replica set members."

So what happens if I have 2 sequential failures? Suppose I have a replica set
of size 5 and the master fails? The remaining 4 would elect a new master from
amongst themselves, right? But then what if this next master also fails? The
remaining 3 nodes are still a quorum (3 > 5/2) and thus (theoretically) should
be able to elect a master. But am I to understand that they won't be able to
do so?

~~~
crcsmnky
As I understand it, those 4 still think there are 5 nodes in the set (just
that one is down) so you can still establish majority voting because the set
size is 5.
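That matches how the math works out; a quick sketch of majority computed against the configured set size rather than the surviving members:

```python
def can_elect_primary(configured_size, visible_members):
    """A primary can be elected when the members that can see each other
    form a majority of the *configured* replica set size."""
    majority = configured_size // 2 + 1
    return visible_members >= majority
```

So after two sequential failures in a 5-member set, the 3 survivors still form a majority (3 >= 3) and can elect a primary.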

------
ukd1
If I've missed anything from the article, feel free to let me know! :-)

~~~
fbuilesv
This is one of the first articles I've seen where someone with a lot of
knowledge posts a lot of realities about Mongo, thank you for that!

I'd love to read something describing the "perfect use cases for Mongo" from
you :)

~~~
ukd1
I've got a post like that coming up soon; 'thought process and reasonings
behind choosing a datastore'

------
bitdiffusion
Here is one to add to the list - if you delete records and/or entire
collections, you won't reclaim the associated disk space automatically. Once
the space is allocated, it remains allocated and will be reused when more data
is added later. If you want to reclaim the "empty space", you need to run a
repairDatabase() which will lock the entire database while it's busy.

~~~
ukd1
or compact, which you can run on a secondary...good call - missed this one!
Will add shortly.

------
fredsters_s
It's interesting to get a rundown of Mongo's limitations from someone who
clearly knows what they're talking about. Thanks.

~~~
ukd1
Thanks dude :-)

------
rgarcia
_The solution is simple; use a tool to keep an eye on MongoDB, make a best
guess of your capacity (flush time, queue lengths, lock percentages and faults
are good gauges) and shard before you get to 80% of your estimated capacity._

Any recommendations for such a tool?

~~~
ukd1
MMS (<https://mms.10gen.com/user/login>) from 10gen is pretty good - it's very
mongo specific.

I've also used Munin (<http://munin-monitoring.org/>, there is a great plugin
- <https://github.com/erh/mongo-munin>), CloudWatch
(<http://aws.amazon.com/cloudwatch/>) and various in-house ones as well.

~~~
alexmic
I've been using MMS for about a year now, and I've found their agent to be
not-so-great. I keep getting random alerts about the agent being down and then
up again.

Other than that, it's decent and free!

------
CliffFarr
I recently wrote a similar blog post (same idea but a different set of
"gotchas") here: [http://blog.trackerbird.com/content/mongodb-performance-
pitf...](http://blog.trackerbird.com/content/mongodb-performance-pitfalls-
behind-the-scenes/)

------
fideloper
I have a suspicion that this seemingly popular sentiment about so many people
hating MongoDB is untrue.

Or people are careless about what systems they put into production?

Also, awesome article!

~~~
ukd1
Thanks! Well, it seemed true enough for me to write the article - there have
been a whole load of them on HN recently -
[http://www.hnsearch.com/search#request/submissions&q=mon...](http://www.hnsearch.com/search#request/submissions&q=mongodb&sortby=points+desc..).

------
cjc1083
By the same token, albeit a bit off the trail: does anyone have any
suggestions for effectively storing fields which can contain Big5 (i.e. non-
UTF-8) characters but usually do not? E.g. email subject lines or senders.

JSON is picky in this regard, and I don't want to Base64-encode/decode the
whole string going in and out, as I would like to retain regex search
capability within Mongo (from my PHP application on the front end) for the
99% of email titles and names which are not Chinese.

~~~
mattparlane
If you need to store non-UTF8 data, MongoDB has a binary data type:

<http://php.net/manual/en/class.mongobindata.php>

You can't do things like regex searches on binary data, but since MongoDB
supports different data types within the same "column", you can just store
some as UTF8 and some as binary, depending on whether the string has non-UTF8
characters in it.
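A sketch of that per-field decision (in Python for brevity; a PHP app would hand the bytes branch to the driver as MongoBinData):

```python
def pick_storage_value(raw: bytes):
    """Return a str for valid UTF-8 input (stored as a regex-searchable
    string) or the raw bytes otherwise (stored as binary data)."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw
```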

~~~
cjc1083
Thank you for pointing me in this direction, I'll see if I can make this work
in the application I'm building. Thanks again for the reply.

------
aledalgrande
SSL support is not so easy to set up if you are on SUSE Linux Enterprise.
There is basically no support for it, and for some reason it doesn't work for
me.

But the thing I don't understand is: if people use replica sets, how come
they're not using encryption? It would be easy to sniff data off the
instances. Yet when I search on Stack Overflow/Server Fault, there are close
to no people using SSL with Mongo.

~~~
ukd1
Laziness, assumptions, and the difficulty of supporting self-compiled apps
over standard packages. Also, most projects I've worked on have either been
firewalled or on a separate internal-only network for non-web layers... so
it's less of an issue. And there is a performance overhead.

~~~
aledalgrande
By non-web you mean a network not accessible from Internet right? I wonder if
I should drop SSL as well... not much of a gain from it, as the DB layer is in
fact configured as you said.

------
vertis
Really great list of gotchas.

I have been using MongoDB for a long time, unfortunately mostly this has been
small applications, so you don't really get to test how MongoDB scales.

On that same note, I would love to see a list of gotchas for Riak (assuming
some exist). I keep hearing recommendations for Riak, it would be nice to know
how it fares in a large production environment.

------
alexmic
Here are two:

(1) There's no need to add a "created" field on your documents. You can
extract it from the _id field by just taking the first 4 bytes.

(2) If you are storing hashes (md5 for example), you might want to consider
storing them as BinData instead of strings. Mongo uses UTF-8 so every
character will be at least 8 bits whereas you can get away with 4 bits per
character.
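Both points can be sketched quickly (the sample ObjectId below is made up):

```python
import datetime
import hashlib

def created_at(oid_hex):
    """(1) The first 4 bytes (8 hex chars) of an ObjectId are a Unix
    timestamp, so a separate 'created' field is redundant."""
    ts = int(oid_hex[:8], 16)
    return datetime.datetime.fromtimestamp(ts, tz=datetime.timezone.utc)

# (2) A hex md5 string costs 32 characters; the raw digest is only 16 bytes.
hex_md5 = hashlib.md5(b"example").hexdigest()
raw_md5 = hashlib.md5(b"example").digest()
```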

~~~
ukd1
Great points but they are not really gotchas; I was trying to help people
avoid big / 'obvious' / documented things when using MongoDB :-)

~~~
alexmic
Sure :)

------
CliffFarr
I wrote an article comparing disk space usage between MySQL and MongoDB with
some notes about RAM requirements and data compression here:
[http://blog.trackerbird.com/content/mysql-vs-mongodb-disk-
sp...](http://blog.trackerbird.com/content/mysql-vs-mongodb-disk-space-usage/)

------
netvarun
Thanks for the great article. A gotcha I have come across: Document keys can't
contain the dot character. If you are storing a complex document (hash-of-
hash-of-hash-etc..), you would need to recursively clean up and ensure that
none of the keys contain any '.' char.
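A recursive cleanup along those lines might look like this (the replacement character is an arbitrary choice):

```python
def sanitize_keys(value, replacement="_"):
    """Recursively replace '.' in dict keys, since MongoDB forbids dots
    in field names. Lists are walked too, in case they hold sub-documents."""
    if isinstance(value, dict):
        return {k.replace(".", replacement): sanitize_keys(v, replacement)
                for k, v in value.items()}
    if isinstance(value, list):
        return [sanitize_keys(v, replacement) for v in value]
    return value
```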

------
stbrody
"For setups that are sharded, you can use 32-bit builds for mongod" - I don't
think this is accurate. Whether or not you are sharded has no effect on the
limitations of a 32-bit mongod. Did you mean to say that you can use 32-bit
builds for the mongos?

~~~
ukd1
I did indeed, typo; they don't store any data so usually don't have the same
issue with 2G limits. I've updated the article.

------
gianpaj
I believe you mean 'sharding' not 'sharing': "Unique indexes and sharing"

And

"Process Limits in Linux": If you experience segfaults under load with
MongoDB, you may find it's _beacuse_ of low or default open files / process
limits

------
ndepoel
Bottom line: MongoDB is not an RDBMS and you shouldn't try to use it as an
RDBMS. Something with trying to fit a square peg into a round hole. MongoDB
requires a different mindset and if you're unable to adapt, then you should
simply stay away.

~~~
ukd1
It's not really this at all; the point is - read the docs, research and
understand the tools you are going to use. Choose the ones which fit best.
Understand the trade-offs.

~~~
camus
No, the point is: while its "API" is great, you can't replace your RDBMS with
MongoDB. While other solutions like Redis or CouchDB are "minimalist", they
are better suited to what NoSQL DBs are for: high availability and scaling.

~~~
ukd1
Unfortunately that's not true: it depends directly on what you are doing with
your RDBMS - there are many work loads which are better suited to MongoDB.
Obviously there are also many which are not.

Unless I'm out of date, Redis and High Availability don't go together in the
same sentence; awesome as it is, it's still a single point of failure.

~~~
bsg75
> Unless I'm out of date, Redis and High Availability don't go together in the
> same sentence; awesome as it is, it's still a single point of failure.

Clustering is a work in progress (<http://redis.io/topics/cluster-spec> ,
<http://redis.io/presentation/Redis_Cluster.pdf>), replication is available
(<http://redis.io/topics/replication>).

------
paulsutter
Also worth mentioning that performance is much more predictable when the data
fits into memory (or the working set, but that may be harder to convey).

------
pjd7
I stopped reading when you said "up to 1TB" of data like that was a large
number.

