MongoDB Gotchas and How To Avoid Them (rsmith.co)
353 points by ukd1 | 90 comments



Very good summary of what to look out for. Here are a few others that I ran into back when I was still entertaining the idea of using Mongo in production:

1. The keys in Mongo documents get repeated over and over again for every record in your collection (which makes sense when you remember that collections don't have a db-enforced schema). When you have millions of documents this really adds up. Consider adding an abstraction layer in your business logic that maps short 1-2 character keys to the real keys (sketched below, after these points).

2. Mongo lies about being ready after an initial install. If you're trying to automate bringing mongo boxes up and down, you're going to run into the case where the mongo service says it's ready but is in reality still preparing its preallocated journal. During this time, which can take 5-10 minutes depending on your system specs, all of your connections will just hang and time out. Either build the preallocated journal yourself and drop it in place before installing mongo, or touch the file locations if you don't mind the slight initial performance hit on that machine. (Note: not all installs will create a preallocated journal. Mongo runs a mini performance test on install to determine at runtime whether preallocating is better for your hardware or not; there's no way to force it one way or the other.)
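Not the commenter's code - just a minimal sketch of the kind of key-mapping layer point 1 describes; the field names and short keys are invented:

    # Hypothetical mapping of short stored keys to the real field names.
    KEY_MAP = {"n": "name", "e": "email", "ts": "signup_timestamp"}
    REVERSE_MAP = {v: k for k, v in KEY_MAP.items()}

    def to_storage(doc):
        """Shrink keys before writing to Mongo."""
        return {REVERSE_MAP.get(k, k): v for k, v in doc.items()}

    def from_storage(doc):
        """Expand keys after reading from Mongo."""
        return {KEY_MAP.get(k, k): v for k, v in doc.items()}

    # usage: collection.insert_one(to_storage({"name": "Ada", "email": "a@b.c"}))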


I can also suggest the sadly (and deliberately) undocumented "--nopreallocj" option (viz. src/mongo/db/db.cpp:719).


Added this + listed you in the footer, thanks!


Thanks; #1 can be a problem for large collections... due to the extra space. I think I've seen some pre-rolled abstraction layers for this, but have never used an open-source one myself.

#2 - usually the journal should be pretty quick to allocate - I've not experienced this problem directly myself.

I'll add some extra bits to the bottom of the post with your notes.


There is a ticket for compression of both keys and values: https://jira.mongodb.org/browse/SERVER-164

The Mongo team seem somewhat reluctant to implement it.


I'm not sure if they are still doing it, but they used to work on the highest voted stuff as a priority; so if you want a feature, vote.

Also, I'm working on Snappy compression with Mongo (it's already used for the journal), however it's not currently stable and work is sporadic due to my startup.


It's probably not a clear win in most use-cases.


Reducing the working set is a big win. Mongo's behaviour when the working set is larger than RAM is really bad: https://jira.mongodb.org/browse/SERVER-574

Compression could have drawbacks for documents that get updated frequently, but it would be extremely useful for documents that get created and then rarely or never change - which, coincidentally, is mostly what I have.

It would also greatly help if keys were compressed or indexed in some way, since that could be done transparently.

You may recall the Mongo team being reluctant to make the database function well in single server setups, but they did address that with journalling.


Spring Data MongoDB's ORM provides a nice way of doing this.


I prefer Morphia as an ORM for MongoDB. It's actively developed and maintained, and feels less enterprise-y: http://code.google.com/p/morphia/


re #2: looking at the source code I don't see this behavior. Perhaps it was present in a past version? Anyway, please advise or point me to a JIRA ticket...

In src/mongo/db/db.cpp I see:

    void _initAndListen(int listenPort ) {
        ...
        dur::startup(); // <- this i believe preallocs journal files
        ...
        listen(listenPort); // after journal prealloc i think

P.S. this blog post is a very interesting article overall, I think, without commenting on all the specifics...


Can you create a JIRA ticket for #2? If that is happening, it should be easy to fix: jira.mongodb.org


An excellent and practical article. I do want to emphasize one thing, though, since I feel like the article almost seemed to downplay its significance:

> MongoDB does not support joins; if you need to retrieve data from more than one collection you must do more than one query ... you can generally redesign your schema ... you can de-normalize your data easily.

This is a much larger issue than it seems - nested collections aren't first-class objects in MongoDB - the $ operator for querying into arrays only goes one level deep, amongst its other issues, meaning that oftentimes you must break things out into separate collections. That doesn't work well either, though, as there are no cross-collection transactions, so if you need to break things into separate collections, you can't guarantee that a write to each collection will go through properly. (Though, I suppose if you're using the latest version, you could lock your whole database.)
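To make the "more than one query" point concrete, here's a minimal pymongo sketch of the application-level join you end up writing (database, collection and field names are invented; method names are current pymongo):

    from pymongo import MongoClient

    db = MongoClient().blog  # hypothetical database

    # No JOIN: fetch the post, then fetch its comments with a second query.
    post = db.posts.find_one({"slug": "mongodb-gotchas"})
    comments = list(db.comments.find({"post_id": post["_id"]}))

    # And with no cross-collection transactions, a crash between these two
    # writes can leave the collections inconsistent:
    db.posts.update_one({"_id": post["_id"]}, {"$inc": {"comment_count": 1}})
    db.comments.insert_one({"post_id": post["_id"], "body": "..."})  # may never happen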


Absolute showstopper for anything more than keeping track of simple stats or storing comments. Denormalizing has its limits. Invariably apps come to the point where you need to do joins, and that is when you start cursing your decision to go with NoSQL. My experience is that NoSQL (including Mongo) is not a replacement for a traditional RDBMS, but if you use NoSQL as a complement to an RDBMS, primarily for real-time performance, it works quite beautifully. That said, there may be quite a few simple web app use cases that don't need an RDBMS at all.


It's true that you will almost always find yourself needing to join across collections/tables... I believe the recent support for integrating with Hadoop should help when your reason for needing joins is reporting (often the case for, say, financial reporting): http://www.mongodb.org/display/DOCS/Hadoop+Quick+Start

Also see the Postgres integration linked/discussed here on HN.


Good point; I didn't mention the lack of multi-document commits. I'll add this shortly.


There are some good things here, but on a systems level there are huge oversights that are absolute showstoppers on production systems. Maybe there is a level of Mongo proficiency above MongoDB Master? I hope so.

1) Make sure to permanently increase the hard and soft limits for Linux open files and user processes for the MongoDB/Mongo user (a quick way to check them is sketched after this comment). If not, MongoDB will segfault under load, and when that happens, the automatic recovery process works incredibly slowly. It's a bit tricky to get this right, depending on your level of sysadmin knowledge. 10gen doesn't emphasize or explain the issue very well in their docs: "Set file descriptor limit and user process limit to 4k+ (see etc/limits and ulimit)" That probably makes sense to just about 0.1% of the people setting up MongoDB: http://www.mongodb.org/display/DOCS/Production+Notes#Product...

2) Make sure to disable NUMA. This 10gen documentation note is a great example of clear documentation: "Linux, NUMA and MongoDB tend not to work well together ... Problems will manifest in strange ways, such as massive slow downs for periods of time or high system cpu time." Massive slowdowns and mysteriously pegged cpu usage on production database systems are definitely 'strange'. I would probably choose stronger and more precise language, but 10gen clearly knows what they're doing: http://www.mongodb.org/display/DOCS/NUMA

tl;dr If you have problems with MongoDB, you aren't using it right. Read the documentation more carefully, and then when that doesn't work, hire an expert.
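Not from the comment above, but a quick Python sketch for sanity-checking point 1 on the box itself (the 64000 threshold is only illustrative; the permanent settings belong in /etc/security/limits.conf):

    import resource

    # Soft/hard limits for open files and user processes, as seen by this user.
    checks = [("open files", resource.getrlimit(resource.RLIMIT_NOFILE)),
              ("processes", resource.getrlimit(resource.RLIMIT_NPROC))]

    for name, (soft, hard) in checks:
        if soft < 64000:
            print(f"WARNING: {name} soft limit is only {soft}; mongod may fall over under load")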


> tl;dr If you have problems with MongoDB, you aren't using it right. Read the documentation more carefully, and then when that doesn't work, hire an expert.

I'm getting the idea that it is rather challenging to use MongoDB right. While there's certainly a place for power tools that can only be used by highly trained experts lest you risk disaster... that kind of goes against the idea that MongoDB has anything to do with 'simplicity', doesn't it?


Yep, I managed to miss a few things; I'll add these shortly.


Nice work. :-) This list is getting pretty comprehensive.


Thanks; If you think of anything more, let me know!


I'm one of the people that like to make fun of MongoDB from time to time, but that's mostly from proximity producing contempt.

Nevertheless, a rundown of the gotchas and how to avoid them based on experience beyond simply running apt-get install mongodb is one of the most useful pieces on MongoDB I've seen of late.

The only new-news for me was that SSL support isn't compiled in by default. That's pretty irritating. I wonder if that applies just to 10gen's packages or also to distribution provided mongodb packages.


Disturbingly 10gen uses SSL support as a reason to use their subscriber packages. Sure you can build it in yourself if you compile the package, but it's disappointing that SSL support is one of the carrots that they use for their premium offerings.

Edit: Link to applicable docs on how to compile in/use. http://docs.mongodb.org/manual/administration/ssl/


The reason such binaries aren't distributed by default has to do with US export controls - http://en.m.wikipedia.org/wiki/Export_of_cryptography_in_the...


SSL support is in the free distribution but you must build it yourself. One reason would be export controls; another is that it creates a dependency on the SSL library for those who don't use SSL, which we found awkward (especially if doing all platforms; the subscriber build just does the popular ones).

So it's available, albeit there is an intent to have a subscriber build with some extra features that are heavily enterprise-biased in their usefulness.


Thanks!

Most of the distribution-provided packages (well, Ubuntu / Debian) were horribly out of date last time I used them. Not sure whether or not they have SSL support - but I doubt it, as most of them were < 1.8, which I think is pre-SSL (not 100% sure and can't find the commit, though).


> Most of the distribution-provided packages (well, Ubuntu / Debian) were horribly out of date last time I used them.

Oh, they still are. I use the 10gen apt source.


Has anyone else noticed a striking similarity to PHP - every feature is broken somehow?

I think this would be a good slogan - 'We are the PHP of storage engines.'


I have no doubt that 10gen would adore it if mongo were the PHP of storage engines, considering PHP's massive success


If by 'broken' you mean 'has taken over the internet'...


if by 'taken over' you mean 'pooped all over'


One of the more useful Mongo articles I've seen here. You might want to clear up "You cannot shard a collection over 256G" however. The limitation is that if you have an unsharded collection that grows over 256GB you cannot make it a sharded collection. The way it's written now makes it sound like sharded collections can't grow over 256GB (at least to me) which isn't true.


Thanks - I can see how that reads wrong, updating now.


Good to see some constructive advice on how to configure mongo, instead of just bashing it.

Even if it's not your favorite technology, sometimes you end up in a position where the rest of the company is using something, and you need to work within those constraints. It's important to understand the technologies you're building on, their configuration options, and to understand the best practices way of working with them.

This, by the way, is not restricted to mongo.


totally agree!


OP knows his stuff. I met him at a hackday and learnt an insane amount from talking to him at dinner. I'm keeping this post bookmarked for reference. Great stuff.


I see the "32-bit vs. 64-bit" issue appear in many rants about MongoDB. There are two types of people that fall off the 2GB cliff. a) People who say "what just happened... oh, I get it... 32-bits, memmapped files... I'll switch to 64-bit" b) People who say "WTF.. #MongoHate.. going to blog about how @#$#%! a DB this is"

Some people understand the tools they work with. Some people know just barely enough to throw things together and don't tolerate it when something doesn't work out of the box. Worst of all, this second group tends to be very vocal on the interwebs.

I'd almost like to see 10gen not publish the 32-bit package at all. Source is still there. If you want 32-bit, cool, compile it. But forcing the user to compile the 32-bit version assures at least a minimum bound of technical proficiency (an "I understand what I'm doing, why it's not the default and what the limitations are").


It's interesting, because from my observation a lot of the crowd who popularised systems like Mongo were those people who weren't willing to think*. Learning the relational data model + tooling was too complicated. Now Mongo has a big ol' list of caveats you ought to understand before you can start chucking data into it too.

I'm a total RDBMS nerd, and it's amazing to me how few people truly care about their data storage. They just want it to work - and, I suppose, it's hard to blame them for that.

*Not that I mean to say that this is the only reason to use a NoSQL DB - it doesn't seem an uncommon one, though.


I have to admit that a lot of the joy I get from using MongoDB is during dev.

While you still have to think about your schema, it does mean that you're not constantly writing and removing migrations (Rails) while an application is still evolving.


In Symfony2 (php) migrations are created by comparing the new schema with the old one. Is that not the case for Rails? What do you mean 'constantly writing migrations'?


Rails does not specify the mapping of models to database schema, so it requires specification of migrations instead to document changes to the database that go along with any code changes. So migrations are explicit commands (in a pretty simple dsl) to add columns etc. spread out over many files as the application evolves. This means each schema change requires adding a migration file with those changes in it, rather than modifying a master schema or mapping. There is a schema.rb file but it is created/modified automatically.

There are trade-offs to each approach, but it is probably one of the areas where Rails could still improve by looking at other ORMs - I'd prefer to see the schema specified along with constraints etc. for each field at the top of each model, to make it explicit and self-documenting, and perhaps to do away with migrations altogether.


It seems to be a pretty common scenario: people thinking new technology X that large site Y used will magically solve all of their problems. Comparing MongoDB to others, it's similar with Redis, and from experience less so with Cassandra (probably the steeper barrier to entry) / Riak (the lack of commonality with a standard DB is way more obvious) / HBase, etc...


"However due to the way voting works with MongoDB, you must use an odd number of replica set members."

So what happens if I have 2 sequential failures? Suppose I have a replica set of size 5 and the master fails? The remaining 4 would elect a new master from amongst themselves, right? But then what if this next master also fails? The remaining 3 nodes are still a quorum (3 > 5/2) and thus (theoretically) should be able to elect a master. But am I to understand that they won't be able to do so?


As I understand it, those 4 still think there are 5 nodes in the set (just that one is down), so majority voting is still calculated against a set size of 5: a majority of 5 is 3, so the 3 surviving nodes can still elect a primary.


If I've missed anything from the article, feel free to let me know! :-)


This is one of the first articles I've seen where someone with a lot of knowledge posts a lot of realities about Mongo, thank you for that!

I'd love to read something describing the "perfect use cases for Mongo" from you :)


I've got a post like that coming up soon: 'thought process and reasoning behind choosing a datastore'.


Growing documents and padding. Selecting a subset of fields still pulls the whole document into RAM, creates a new BSON document and transfers it back; the same goes for updating. Maybe also reading before updating, to keep the lock % down.


All valid - though I'm going to check the read-before-update one, as I thought that was a non-issue in 2.2? I'll write them up, but it might take a little time. Thanks!


I've used mongo off and on for a while (even writing a common lisp tutorial on using it), and this is the first really nice writeup for mongo I've seen. Thanks a lot!


:-) no problem!


Thanks for the article!


Here is one to add to the list - if you delete records and/or entire collections, you won't reclaim the associated disk space automatically. Once the space is allocated it remains allocated, and will be reused when more data is added later. If you want to reclaim the "empty space", you need to run repairDatabase(), which will lock the entire database while it's busy.


Or compact, which you can run on a secondary... good call - missed this one! Will add shortly.
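For reference, a rough pymongo sketch of both approaches (the database and collection names are placeholders); repairDatabase blocks the whole database while it rebuilds it, whereas compact works on one collection and is commonly run against secondaries one at a time:

    from pymongo import MongoClient

    db = MongoClient().mydb  # placeholder database name

    # Defragments a single collection in place.
    db.command({"compact": "mycollection"})

    # Rebuilds the whole database to reclaim disk space; locks it while running.
    db.command("repairDatabase")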


It's interesting to get a rundown of Mongo's limitations from someone who clearly knows what they're talking about. Thanks.


Thanks dude :-)


I recently wrote a similar blog post (same idea, but a different set of "gotchas") here: http://blog.trackerbird.com/content/mongodb-performance-pitf...


> The solution is simple; use a tool to keep an eye on MongoDB, make a best guess of your capacity (flush time, queue lengths, lock percentages and faults are good gauges) and shard before you get to 80% of your estimated capacity.

Any recommendations for such a tool?
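For what it's worth, the gauges quoted above can also be pulled directly from serverStatus. A rough pymongo sketch; the field names are from the 2.x / MMAPv1 era and may differ in other versions:

    from pymongo import MongoClient

    status = MongoClient().admin.command("serverStatus")

    # Queue lengths, lock percentage and background flush time - roughly the gauges above.
    queued_readers = status["globalLock"]["currentQueue"]["readers"]
    queued_writers = status["globalLock"]["currentQueue"]["writers"]
    lock_pct = 100.0 * status["globalLock"]["lockTime"] / status["globalLock"]["totalTime"]
    flush_ms = status["backgroundFlushing"]["average_ms"]

    print(queued_readers, queued_writers, round(lock_pct, 1), flush_ms)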


The Zabbix mongodb plugin also looks like a good option:

http://code.google.com/p/mikoomi/source/browse/#svn%2Fplugin...


MMS (https://mms.10gen.com/user/login) from 10gen is pretty good - it's very mongo specific.

I've also used Munin (http://munin-monitoring.org/, there is a great plugin - https://github.com/erh/mongo-munin), CloudWatch (http://aws.amazon.com/cloudwatch/) and various in-house ones as well.


I've been using MMS for about a year now, and I've found their agent to be not-so-great. I keep getting random alerts about the agent being down and then up again.

Other than that, it's decent and free!


How would you use CloudWatch to monitor MongoDB? Or do you mean that you can get statistics from the Amazon EC2 mongo instance(s) with Munin and a cloudwatch plugin?


On the same token, albeit a bit off the trail: does anyone have any suggestions for effectively storing fields which can contain BIG5 (i.e. non-UTF-8) chars, but usually do not? E.g. email subject lines or senders.

JSON is picky in this regard, and I don't want to Base64-encode/decode the whole string going in and out, as I would like to retain regex search capability within Mongo - from my PHP application on the front end - for the 99% of email titles and names which are not Chinese.


If you need to store non-UTF8 data, MongoDB has a binary data type:

http://php.net/manual/en/class.mongobindata.php

You can't do things like regex searches on binary data, but since MongoDB supports different data types within the same "column", you can just store some as UTF8 and some as binary, depending on whether the string has non-UTF8 characters in it.
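A minimal sketch of that approach in Python (the helper name is made up), assuming the raw value arrives as bytes:

    from bson.binary import Binary

    def encode_field(raw):
        """Store valid UTF-8 as a plain, regex-searchable string;
        fall back to BSON binary data for anything else."""
        try:
            return raw.decode("utf-8")
        except UnicodeDecodeError:
            return Binary(raw)

    # db.emails.insert_one({"subject": encode_field(subject_bytes)})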


Thank you for pointing me in this direction, I'll see if I can make this work in the application I'm building. Thanks again for the reply.


It may be better to encode everything to utf8 before you store it.


I have a suspicion that this seemingly popular sentiment about so many people hating MongoDB is untrue.

Or are people careless about what systems they put into production?

Also, awesome article!


Thanks! Well, it seemed true enough for me to write the article - there have been a whole load of them on HN recently - http://www.hnsearch.com/search#request/submissions&q=mon....


SSL support is not so easy to set up if you are on SUSE Linux Enterprise. There is basically no support for it. And for some reason it doesn't work for me.

But the thing I don't understand is: if people use replica sets, how come they're not using encryption? It would be easy to sniff data off the instances. And yet, when I search Stack Overflow / Server Fault, there seem to be almost no people using SSL with Mongo.


Laziness, assumptions, and the difficulty of supporting self-compiled apps over standard packages. Also, most projects I've worked on have either been firewalled or on a separate internal-only network for the non-web layers... so it's less of an issue. There is also a performance overhead.


By non-web you mean a network not accessible from the Internet, right? I wonder if I should drop SSL as well... not much of a gain from it, as the DB layer is in fact configured as you said.


Really great list of gotchas.

I have been using MongoDB for a long time; unfortunately, mostly for small applications, so you don't really get to test how MongoDB scales.

On that same note, I would love to see a list of gotchas for Riak (assuming some exist). I keep hearing recommendations for Riak, it would be nice to know how it fares in a large production environment.


Here are two:

(1) There's no need to add a "created" field on your documents. You can extract it from the _id field by just taking the first 4 bytes.

(2) If you are storing hashes (md5 for example), you might want to consider storing them as BinData instead of strings. Mongo strings are UTF-8, so every character takes at least 8 bits, whereas a hex digit only carries 4 bits of information.
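Both are straightforward with the standard bson helpers in Python; a quick sketch:

    import hashlib
    from bson.binary import Binary, MD5_SUBTYPE
    from bson.objectid import ObjectId

    # (1) The creation time is already embedded in the ObjectId.
    doc_id = ObjectId()
    created = doc_id.generation_time  # timezone-aware datetime

    # (2) Store an md5 as 16 raw bytes instead of a 32-character hex string.
    digest = Binary(hashlib.md5(b"some payload").digest(), MD5_SUBTYPE)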


Great points but they are not really gotchas; I was trying to help people avoid big / 'obvious' / documented things when using MongoDB :-)


Sure :)


Can you efficiently select and sort on the date part of the _id?


Sorting is easy, you just choose _id as the sort field.

Selecting is also reasonably easy: the first 4 bytes of the id are the timestamp (seconds since the epoch). You just create a hex string in that format - 4 bytes of timestamp and then 8 bytes of zeroes - then create an object ID (using the classes provided by your driver) and do:

    coll.find({_id:{$gte:<id>}})
or whatever is the equivalent in your language of choice.
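In Python, for example, the driver will build that boundary ObjectId for you (a rough sketch; the collection name is invented):

    from datetime import datetime, timezone
    from bson.objectid import ObjectId
    from pymongo import MongoClient

    coll = MongoClient().mydb.events  # placeholder

    # from_datetime() yields an id with the timestamp set and the rest zeroed -
    # exactly the boundary value described above.
    boundary = ObjectId.from_datetime(datetime(2012, 8, 1, tzinfo=timezone.utc))
    recent = coll.find({"_id": {"$gte": boundary}}).sort("_id", 1)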


I wrote an article comparing disk space usage between MySQL and MongoDB with some notes about RAM requirements and data compression here: http://blog.trackerbird.com/content/mysql-vs-mongodb-disk-sp...


Thanks for the great article. A gotcha I have come across: document keys can't contain the dot character. If you are storing a complex document (hash-of-hash-of-hash, etc.), you need to recursively clean it up and ensure that none of the keys contain a '.' character.
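A small sketch of that recursive cleanup in Python (the replacement character is arbitrary):

    def strip_dots(value, replacement="\uff0e"):
        """Recursively replace '.' in keys so the document can be stored in Mongo."""
        if isinstance(value, dict):
            return {k.replace(".", replacement): strip_dots(v, replacement)
                    for k, v in value.items()}
        if isinstance(value, list):
            return [strip_dots(v, replacement) for v in value]
        return value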


"For setups that are sharded, you can use 32-bit builds for mongod" - I don't think this is accurate. Whether or not you are sharded has no effect on the limitations of a 32-bit mongod. Did you mean to say that you can use 32-bit builds for the mongos?


I did indeed - typo; mongos processes don't store any data, so they usually don't have the same issue with the 2GB limit. I've updated the article.


Bottom line: MongoDB is not an RDBMS and you shouldn't try to use it as one. Something about trying to fit a square peg into a round hole. MongoDB requires a different mindset, and if you're unable to adapt then you should simply stay away.


It's not really this at all; the point is - read the docs, research and understand the tools you are going to use. Choose the ones which fit best. Understand the trade-offs.


No, the point is: while its "API" is great, you can't replace your RDBMS with MongoDB. Other solutions like Redis or CouchDB may be "minimalist", but they are better suited to what NoSQL DBs are for - high availability and scaling.


Unfortunately that's not true: it depends directly on what you are doing with your RDBMS - there are many workloads which are better suited to MongoDB. Obviously there are also many which are not.

Unless I'm out of date, Redis and High Availability don't go together in the same sentence; awesome as it is, it's still a single point of failure.


> Unless I'm out of date, Redis and High Availability don't go together in the same sentence; awesome as it is, it's still a single point of failure.

Clustering is a work in progress (http://redis.io/topics/cluster-spec , http://redis.io/presentation/Redis_Cluster.pdf), replication is available (http://redis.io/topics/replication).


> Obviously there are also many which are not.

That's exactly what I am saying.


I believe you mean 'sharding' not 'sharing': "Unique indexes and sharing"

And

"Process Limits in Linux" If you experience segfaults under load with MongoDB, you may find it’s beacuse of low or default open files / process limits


Also worth mentioning that performance is much more predictable when the data fits into memory (or the working set, but that may be harder to convey).


I stopped reading when you said "up to 1TB of data" like that was a large number.



