MongoDB gotchas for the unaware user (senko.net)
83 points by senko 1613 days ago | 31 comments



I think "always use getLastError" is a poor general guideline. The "gotcha" at the heart of it is to know that mongodb syncs inserts to disk at a configurable interval (1 minute by default) - which means data can be lost. Calling getLastError (or setting safe => true in a lot of drivers) will force the write to disk.

However, one of the great things about MongoDB is that, in some cases, you can easily afford to lose 1 minute's worth of inserts in exchange for huge performance gains. Large chunks of data used for analytics are a good example, since lost data [likely] won't impact final aggregates/percentages.

Some of our inserts, like user registration, we run with safe=>true. Others, like audit logs (which, for us, aren't as important as they might be for others), we don't.

-----


I actually didn't have the problem with data not being synced to disk immediately, and I wasn't using getLastError to sync on every operation (AFAIK it doesn't by default, unless you use fsync option).

My more mundane problem was that I didn't know whether the database said the (insertion) operation was ok (or whether, for example, I had tried to reuse a unique key value). Without using getLastError (or, indeed, safe=True in Python), I have no idea whether any errors (e.g. a bug in my code) have occurred.

For data for which you can ignore an occasional error (e.g. some logging, or click tracking, or similar) I agree getLastError may not be needed. But I believe that skipping it is not a very good default for most users with use cases similar to mine: you have a VPS, you build a simple app on it, and use MongoDB in it.

-----


Perhaps your use case is better suited to a traditional RDBMS then, like MySQL or PostgreSQL. Mongo is specifically marketed as a high performance database, which is why this is the default.

-----


getLastError (or safe=true) does not force the write to the disk. Checking getLastError determines whether the server accepted the write, or whether there were failures (uniqueness checks, for instance, or whether any rows were updated).
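
To make the distinction concrete, here is a minimal Python sketch of the command documents involved. The helper name get_last_error_command is hypothetical, and I'm assuming the driver sends the document to the server unchanged; `w` and `fsync` are options of the getLastError command. Only the variant with fsync asks for a flush to disk.

```python
def get_last_error_command(w=None, fsync=False):
    """Build a getLastError command document (hypothetical helper).

    With no options, the command only reports whether the previous write
    on this connection was accepted. With fsync=True it additionally asks
    the server to flush to disk before replying.
    """
    command = {"getlasterror": 1}
    if w is not None:
        command["w"] = w          # wait until w servers have the write
    if fsync:
        command["fsync"] = True   # also wait for a flush to disk
    return command


acknowledged = get_last_error_command()          # "did the server accept it?"
durable = get_last_error_command(fsync=True)     # "and is it on disk?"
print(acknowledged)
print(durable)
```

So "safe" mode and "synced to disk" are two separate knobs, which is why the two get confused upthread.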

-----


The first error of this user is trying to run mongodb on a single node instance. One note about his paragraph regarding replication ... just make sure not to use the autoresync flag if that's your data recovery plan, or you'll simply replicate bad data to the slaves. If you're really serious about the data, cycle a couple of slaves down every X minutes so that they completely write to disk and you can perform proper backups, then have them come back up and resync.

Here's a rewrite of the article: Don't use mongodb unless you know what you're doing and have the hardware to do it right.

-----


(I'm the author of the article).

> Don't use mongodb unless you know what you're doing

That can just as well be applied to anything, not just mongodb. Whenever you start using something, you're going to make mistakes.

> and have the hardware to do it right.

AFAIK, running it on two instances (master + slave, and then stop/cycle the slave for backups) should be just fine. So you don't need to have "web scale" hardware for mongodb.

MongoDB is an interesting database and can fit nicely into some use cases - by which I mean data organisation, not just scale. So I don't think it should be avoided by people running simple things with not-humongous data-sets. We just have to look out for a few things we might not have expected. That's why I didn't call them "bugs" or "problems" - just "gotchas".

-----


Excellent point about data organization. gridfs also helps with that. I'm loving gridfs so far.

-----


Cycling slaves isn't necessary. You can use the flush and lock method described here: http://www.mongodb.org/pages/viewpage.action?pageId=19562846
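
For anyone who wants to script it, here is a rough Python sketch of the flush-and-lock pattern. The run_admin_command argument is a hypothetical callable that sends one command document to the admin database (for example a thin wrapper around your driver's command method). The fsync command with lock is what the linked page describes; the fsyncUnlock command name is an assumption that holds for newer servers, and older releases unlocked differently, so check the docs for your version.

```python
from contextlib import contextmanager


@contextmanager
def flushed_and_locked(run_admin_command):
    """Flush pending writes and block new ones while a backup runs.

    run_admin_command is a hypothetical callable that sends one command
    document to the admin database.
    """
    run_admin_command({"fsync": 1, "lock": True})
    try:
        yield
    finally:
        # Always release the lock, even if the backup step raised.
        run_admin_command({"fsyncUnlock": 1})


# Usage with a stand-in that only records what would be sent.
sent = []
with flushed_and_locked(sent.append):
    sent.append("copy data files here")
print(sent)
```

The try/finally is the important part: a backup script that dies halfway should not leave the server locked against writes.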

-----


I love the blazing fast throughput of MongoDB but the gotchas make me nervous.

I wish there was a MongoDB Guru site where I could contract out some MongoDB related maintenance activities such as validating my MongoDB installation and making sure that I have not made dumb errors, demystifying performance issues etc. So far I am making do with documentation and mailing lists but I would rather contract this out to a specialist.

Anyone know of any provider like this with affordable rates?

-----


You can hire the developers of mongodb, 10gen, themselves. They provide commercial support / consulting for mongodb.

http://www.10gen.com/

-----


Agreed. I like 10gen, but I don't think I could hire them to help me with my side projects, which I am using to get acquainted with MongoDB.

I was looking more for guys like the contract DBAs that are available for Oracle or even PostgreSQL. Maybe 10gen could create certification programs for admins, so that we could have a pool of knowledgeable admins who could support MongoDB newbies such as myself.

-----


10gen offers training for DBAs: http://www.10gen.com/training

No certification right now, but that's good feedback, thanks.

-----


Is there anything that prevents you from reading the mailing list a bit and mailing some helpful and competent people? It's not "official", but you may end up with more competent and cheaper help that way.

-----


I use http://mongohq.com and love them. I don't think they actually do consulting stuff, but they do take care of all the technical stuff I'm too busy/lazy/stupid to learn myself.

-----


Sharding gotchas:

You can't shard an existing collection that's surpassed 50GB.

All collections are currently created on the primary shard of the database. (Although this is slated to change: http://jira.mongodb.org/browse/SERVER-939 )

If a collection already has a unique key, that has to be your shard key.

You cannot update the value of a shard key.
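
For reference, a small Python sketch of the admin commands behind sharding a collection, which is where the shard key gets fixed. The helper name sharding_commands and the example names are hypothetical; the command and field names follow current MongoDB naming (older releases spelled them in lowercase), so treat this as an illustration rather than something to paste in.

```python
def sharding_commands(database, collection, shard_key, unique=False):
    """Return the admin commands that shard one collection (hypothetical helper).

    shard_key is an index-style document such as {"user_id": 1}. Once the
    collection is sharded, the value of that key cannot be updated.
    """
    namespace = f"{database}.{collection}"
    shard = {"shardCollection": namespace, "key": shard_key}
    if unique:
        # A unique constraint on a sharded collection has to be on the shard key.
        shard["unique"] = True
    return [{"enableSharding": database}, shard]


for command in sharding_commands("app", "users", {"user_id": 1}, unique=True):
    print(command)
```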

-----


Another gotcha with sharding is that during a re-balance some documents will be on "both" shards simultaneously. Thus issuing a count command repeatedly on a 100k doc collection will see the count bounce up as it shards. 100k, 106k, 100k, 104k etc. This is not a problem for lookups of single documents but means any row scans will produce inconsistent values if the collection is re-balancing. There is a certain mode you can trigger for scan queries that may address this but I have not tested it yet.
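
A toy Python model of why the count bounces, assuming a migration copies a chunk to the target shard first and deletes it from the source afterwards. This is only a simulation with in-memory sets, not how the balancer is actually implemented; the function names are made up for illustration.

```python
def cluster_count(shards):
    """Count the way a naive scan would: sum every shard's local count."""
    return sum(len(docs) for docs in shards)


def migrate_chunk(source, target, chunk_ids):
    """Simulated two-phase migration: copy first, delete later.

    Yields once after the copy, while the documents exist on both shards,
    and once after the source copies have been removed.
    """
    target |= chunk_ids      # phase 1: documents are copied to the target
    yield "copied"
    source -= chunk_ids      # phase 2: originals are deleted from the source
    yield "cleaned"


shard_a = set(range(100_000))    # all 100k documents start on one shard
shard_b = set()
chunk = set(range(6_000))        # the chunk being moved

print(cluster_count([shard_a, shard_b]))             # 100000
for phase in migrate_chunk(shard_a, shard_b, chunk):
    print(phase, cluster_count([shard_a, shard_b]))  # copied 106000, then cleaned 100000
```

A lookup by _id is unaffected in this model because either copy is the same document; only anything that adds up per-shard results sees the overlap.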

You learn a lot about mongodb just by playing with it. It is super simple to set up and just hammer with different tests. I'm playing around with 4 extra large instances wired to 4 drives each in raid 0 on amazon right now and having a blast.

-----


Another small one I'd point out: You can't sort large sets on fields that aren't indexed. It's not just slow, mongo flat out refuses to do it.

-----


This is actually an awesome feature! AppEngine's Datastore requires this as well.

I really wish that PostgreSQL (and other SQL databases) would allow one to enforce such policy, it's a lifesaver.

-----


Agreed. I wish normal SQL dbs would implement a mode where index misses result in errors unless they contain some sort of opt in text.

-----


Mongo servers can be run in a mode that rejects the query with an error if it would result in a row scan. Very useful for safely avoiding bad queries slamming your server.
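
The mode I mean is notablescan. It can be given at startup (mongod --notablescan) or toggled at runtime with a setParameter admin command. A tiny Python sketch of the runtime command document follows; the helper name is hypothetical, and how you send the document depends on your driver.

```python
def notablescan_command(enabled=True):
    """Build the admin command that toggles notablescan (hypothetical helper)."""
    return {"setParameter": 1, "notablescan": 1 if enabled else 0}


print(notablescan_command())        # turn the check on
print(notablescan_command(False))   # turn it back off
```

It works well on a staging server: unindexed queries fail loudly there instead of slowly in production.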

-----


Some "normal" SQL dbs can, at the very least, be configured to log this. Which is pretty great when you consider some can't.

-----


I am having trouble reconciling the following two quotations:

"[MongoDB is] so simple and natural to use from dynamic languages"

and:

"In my test code, I had an 'async' remove() call (ie. I didn’t wait for it to finish) and was then inserting new entries, and previous remove() happily removed them (all of them, or some, or none, depending on the race). Those were very confusing few hours."

-----


This is a language driver issue. By default, some language drivers for mongo use connection pools. When operating in this mode, five inserts will get processed by five different connections in an arbitrary order. The advantages are speed and no need for managing connections (i.e. try { getConnection() } finally { freeConnection() }). Thus, in default usage, there are no leaked connections and speed is great, but this behavior is very surprising when you learn about it, as it is not at all obvious.
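
A toy Python model of the race, assuming each operation travels on its own connection so the server may apply them in any order. It uses an in-memory set rather than a real driver, and the operation names are made up, but it reproduces the "all of them, or some, or none" outcome from the article.

```python
from itertools import permutations


def apply_ops(ops):
    """Apply operations to an empty in-memory collection, in the given order."""
    collection = set()
    for op, value in ops:
        if op == "remove_all":
            collection.clear()
        elif op == "insert":
            collection.add(value)
    return collection


# The order the test code issued them in: remove first, then two inserts.
issued = [("remove_all", None), ("insert", "a"), ("insert", "b")]

# With one connection per operation, the server may see any interleaving.
outcomes = {frozenset(apply_ops(order)) for order in permutations(issued)}
print(sorted(sorted(o) for o in outcomes))   # [[], ['a'], ['a', 'b'], ['b']]
```

Only the orderings where the remove runs first leave both inserts in place, which is the result the test code expected.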

-----


These are all reasons why I'm leaning towards CouchDB.

-----


Here's an interesting presentation on the nosql dbs you might enjoy. I'm not the author but I'm working with CouchDB right now and this came across the mailing list yesterday. The mailing list is actually quite good.

http://www.slideshare.net/danglbl/schemaless-databases

-----


As a general rule, switching from technology (x), which has actual users sharing lessons learned on the internet, to technology (y), because it doesn't have many actual users and they haven't shared any lessons, is a good way to become a beta tester.

-----


But does Mongo have more users or just more gotchas?

-----


Sorry for the slow reply. I can't speak about mongo's user base, but it does appear larger than CouchDB's. Both of them are several orders of magnitude less mature / well understood than MySQL. I'm simply trying to say that technologies with no problems must have no users.

-----


The good thing about MongoDB is that all the gotchas are well known in the community and documented. That's actually all the more reason to stay with MongoDB.

-----


You need to explicitly specify case insensitivity when searching for strings. Not really a gotcha, but it can be overlooked if you are transitioning from MySQL, which does case-insensitive matching by default.
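
A minimal Python sketch of what "explicitly specify" looks like as a query filter, using the $regex operator with the "i" option. The helper name and field name are hypothetical; the resulting dict is what you would pass to your driver's find call.

```python
import re


def case_insensitive_equals(field, value):
    """Build a query filter matching `value` exactly, ignoring case.

    re.escape keeps characters such as '.' or '+' in user input from being
    interpreted as regex syntax.
    """
    return {field: {"$regex": "^" + re.escape(value) + "$", "$options": "i"}}


query = case_insensitive_equals("username", "Alice")
print(query)   # {'username': {'$regex': '^Alice$', '$options': 'i'}}
```

As far as I know, case-insensitive regex matches generally can't use an index efficiently, so on large collections it may be better to store a lowercased copy of the field and query that.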

-----


"Use 64-bit version" is not much of a gotcha. It warns you when you start up the 32-bit version (or at least it did). Besides that, reinstalling Mongo is an easy and fast operation.

-----



