I've tried to post this as a comment on the blog, but it's not showing up (moderated?):
Full disclosure: I work for 10gen.
You strategically posted this when my air conditioning was broken, so here are a few thoughts before I go find somewhere cooler. Since CouchDB is "not a competitor" to MongoDB, it's nice of you to put all this time into a public service.
> MongoDB, *by default*, doesn’t actually have a response for writes.
Seriously, though, this "unchecked" type of write is just supposed to be for stuff like analytics or sensor data, when you're getting a zillion a second and don't really care if some get lost when the server crashes. *You can do an insert that not only waits for a database response, but waits for N slaves (user configurable) to have replicated that insert.* Note that this is very similar to Cassandra's "write to multiple nodes" promise. You can also fsync after every write.
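For concreteness, that replicated-acknowledgement write is driven by a getLastError command the driver sends after the insert. A minimal sketch of the command document (field names per MongoDB's getLastError command; the helper function and values are mine):

```python
# Sketch of the getLastError command document a MongoDB driver sends
# after a write to request acknowledgement. Values are illustrative.
def get_last_error_cmd(w=None, fsync=False, wtimeout_ms=None):
    """Build the command that asks the server to confirm the last write.

    w           -- number of servers the write must reach before returning
    fsync       -- ask the server to flush to disk before acknowledging
    wtimeout_ms -- give up waiting for replication after this many ms
    """
    cmd = {"getlasterror": 1}
    if w is not None:
        cmd["w"] = w
    if fsync:
        cmd["fsync"] = True
    if wtimeout_ms is not None:
        cmd["wtimeout"] = wtimeout_ms
    return cmd

# e.g. wait until the write has replicated to 2 servers, up to 1 second:
cmd = get_last_error_cmd(w=2, wtimeout_ms=1000)
```

The fire-and-forget write simply skips this round trip, which is where the speed (and the risk) comes from.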
> MongoDB writes to a mem-mapped file and lets the kernel fsync it whenever
> the kernel feels like it.
fsyncs are configurable. You can fsync once a second, never, or after every single insert, remove, and update if you wish.
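As a sketch of what that looks like on the server side (flag name from the mongod documentation; the data path is an example, and per-operation flushing is requested through getLastError's fsync option rather than a startup flag):

```shell
# Flush data files roughly once a second instead of the default interval
mongod --dbpath /data/db --syncdelay 1

# Disable mongod's background flushes entirely (the kernel still may flush)
mongod --dbpath /data/db --syncdelay 0
```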
> When you look at MongoDB more critically I don’t see how you could actually
> justify using it for anything resembling the traditional role of a database.
This is because you assume you'll run it on a single server. MongoDB's documentation clearly, repeatedly, and earnestly tells people to run MongoDB on multiple servers.
Also, as another commenter mentioned, full single-server durability is scheduled for the fall.
> Stories like this (http://www.korokithakis.net/node/119) are dubious not
> because they expose a few bugs in MongoDB but because they show inherent
> architectural problems you cannot overcome long term without something
Stories "like this" show that MongoDB doesn't work for everyone, particularly people who give no specifics about their architecture, setup, what happened, or anything else. Isn't it irritating how people will write, "MongoDB lost my data" or "CouchDB is really slow" and provide no specifics?
That's not to say that things never go wrong; MongoDB is definitely not perfect and has lots of room for improvement. I hope that users with questions and problems will contact us on the list, our wiki, the bug tracker, or IRC (or, heck, write a snarky blog post). Anything that puts them in touch with the community and lets us try to help. I wish every person who tried MongoDB had a great experience with it.
Lots of users, hopefully most, love MongoDB and are using it happily and successfully in production.
We deployed it to production recently, and we were well aware of this. I feel their documentation is quite clear on the point.
We're also using it single-server, but only for analytics where loss of a small amount of data isn't a big deal. I've made it very clear to all and sundry that if we're going to put any data that has more value into it, we first need at least one additional server instance.
We really, really want people to know they should run on multiple servers. Do you have any suggestions on making it clearer? Where did you look for information about running it in production (so I can add stuff about multiple servers to that page)?
> Having said that, I must ask. Do the "unchecked" writes,
> configurable fsyncs and multi-node writes exist in the
> currently stable version of MongoDB or are these features
> still in alpha or beta?
They're all available in stable! This might be a documentation fail. It is on the wiki, but suggestions are welcome if you looked at page X and didn't see it.
But couchdb is slow, though. I, like many, switched to mongodb because couchdb was just too slow, and when I asked in IRC how to make it faster I was told to run a couchdb cluster; so, not too different from mongodb ;)
for some reason wordpress wanted me to moderate your post so sorry for the delay in it showing up.
>> Whoopsy, got your emphasis wrong there ….. Seriously, though, this “unchecked” type
>> of write is just supposed to be for stuff like analytics or sensor data, when you’re getting
>> a zillion a second and don’t really care some get lost if the server crashes.
did the default change? the last time i attempted a concurrent performance test this was one of the barriers i hit. my issue isn’t that you include this feature, it’s that it’s the default. i certainly believe there is a use case for it; i just think it’s harmful as a default.
>> Since CouchDB is “not a competitor” to MongoDB, it’s nice of you to put all this time
>> into a public service.
haha, that’s funny. i regularly use non-CouchDB databases and I get along great with all the people from other databases at conferences. even if i did feel like we were competing, i wouldn’t care. this post really is about reliability issues i don’t think your users are fully aware of and i honestly hope that you fix.
>> fsyncs are configurable. You can fsync once a second, never, or after every single insert,
>> remove, and update if you wish.
that’s really good to hear. have you optimized for a “group commit” yet?
>> This is because you assume you’ll run it on single server. MongoDB’s documentation
>> clearly, repeatedly, and earnestly tells people to run MongoDB on multiple servers.
I responded earlier to the complexity of actually keeping something available that depends on this. so i won’t cover it again.
>> That’s not to say that things never go wrong, MongoDB is definitely not perfect and has
>> lots of room for improvement. I hope that users with questions and problems will
>> contact us on the list, our wiki, the bug tracker, or IRC (or, heck, write a snarky blog
>> post). Anything to contact the community and let us try to help. I wish every person
>> who tried MongoDB had a great experience with it.
You make it sound like this is all just a matter of bugs; it’s not, and i find blaming it on users who don’t use JIRA or get on IRC a little distasteful.
these issues are architectural and until you do something append-only they aren’t going to go away. someone mentioned earlier that you plan to do an append-only transaction log, if that’s accurate then it’s fantastic news.
Response (awaiting moderation, I think it's the links?):
> haha, that’s funny. i regularly use non-CouchDB databases and I get along great with
> all the people from other databases at conferences.
Oh, I tend to bite people when I find out they use another database. Maybe I should stop that?
> even if i did feel like we were competing, i wouldn’t care. this post really is about
> reliability issues i don’t think your users are fully aware of and i honestly hope that
> you fix.
You must be thrilled to learn that single server durability is coming. I look forward to a followup post extolling MongoDB’s virtues this fall.
> I responded earlier to the complexity of actually keeping something available that
> depends on [multiple servers]. so i won’t cover it again.
Yes, it is a difficult, but not unsolvable, problem. Mongo’s made a bunch of tradeoffs in the awesome vs. easy to program area. For instance, remember last year when CouchDB was saying MongoDB sucked because of its lack of concurrency? That it was too complicated to do concurrency in C++ and that Erlang was the way? Well, now Mongo has concurrency, so on to the next “must have” thing.
> You make it sound like this is all just a matter of bugs, it’s not, and i find blaming
> it on users who don’t use JIRA or get on IRC a little distasteful.
People discuss everything from bugs to architecture to lunch on our various forums. I was trying to say, possibly badly, that we have a lot of ways for people with questions, problems, and suggestions to reach out.
Eliminating the methods I outlined, I’m not sure how people with suggestions could reach the developers, other than telepathy.
Ok, so the standard response seems to be "single server durability" isn't supported, but replication makes up for that.
How can this be? If no single server guarantees durability of the writes, how can a cluster of those machines suddenly make those writes durable? Maybe it's a pedantic argument, but it seems to me that, semantically speaking, you are simply relying on luck that your replicated nodes don't become corrupt or die for some systemic reason.
The fact that there is no replayable, append-only transaction log says to me that no matter what you build on top of it, it will, by definition, never be durable, because the whole cannot be greater than the sum of its parts in this case.
Writing to disk and transaction logs are nice, but they aren't magic bullets. What if a data center catches fire? More mundanely, I've heard that ~6% of hard drives fail per year. Only replication can help you there.
I'd argue that durability is a sliding scale. You have to figure out how much risk you're willing to take and you cannot have a perfectly durable system.
traditionally durability isn't considered a sliding scale; it's a goal/priority which requires you to implement multiple features and fallbacks to handle everything from invalid writes and crashes during writes to the data center catching on fire.
thinking about durability this way may work great for MongoDB but it isn't how durability is framed in the rest of the database world.
A) Just because something is 'traditionally' done doesn't mean it's mandatory. Databases 'traditionally' spoke SQL, but I don't see you dinging anyone for breaking that tradition. You've used the Appeal to Tradition fallacy (look it up on wikipedia) many times, and it adds nothing to your argument.
B) Durability is an important goal at a system-wide level, but that doesn't mean it needs to be handled at the database layer. In addition to the already-mentioned replication and transaction-log methods, it can also be handled at the block or fs layer using snapshots, or by admins using backup tools. It can even be handled by having a different Database of Record and using Mongo as an operational store. Mongo as software is agnostic; we provide the tools, but it is up to the user or admin to make the best decisions for their technical and business interests. If another layer of the stack provides sufficient protection against data loss, it is unnecessary to pay the performance costs associated with doing it in the DB layer.
We've always clearly said don't build a bond trading system with it. Our philosophy is one-size-fits-all is over; use the right tool for the right problem.
Based on how I like to define the term, there is nothing in the NoSQL space that does full "ACID", including complex transactional semantics involving many objects, on many-server clusters. That is OK: the perf + scale problem isn't really solvable if you don't give up something.
Clustered durability. You can request, per-write, "don't return until this data has been persisted to N other servers," with N chosen per-write.
So if you're running master/slave and choose N to be 1, you're durable (at the two server level). Run 5-replica sets and choose N to be 3 or 4, and you're basically guaranteed durability. One of your nodes goes down hard and bad things happen? You have four additional clones of it sitting and waiting to be copied.
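The arithmetic behind that guarantee is worth spelling out: a write acknowledged by w of R replicas keeps at least one copy through any f node failures as long as f < w, since in the worst case every failed node held one of the acknowledged copies. A toy sketch (my own framing, not a MongoDB API):

```python
def write_survives(replicas, w, failures):
    """True if a write acknowledged by `w` of `replicas` nodes is
    guaranteed to keep at least one copy after `failures` node
    failures. Worst case: every failed node held an acknowledged copy.
    """
    acknowledged = min(w, replicas)
    return failures < acknowledged

# 5-replica set, writes acknowledged by 3 nodes:
assert write_survives(5, w=3, failures=2)      # any 2 failures are safe
assert not write_survives(5, w=3, failures=3)  # 3 could hit all 3 copies
```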
This is actually more durable in cases of extreme hardware failure like, say, your RAID controller going out. However, it requires that you spend more on hardware. So do several other things Mongo does, like using a lot of RAM and disk to get faster writes, so that's in keeping with a lot of their other design decisions.
It's not perfect for every project, but it's a great choice for most of the same projects where you'd use Mongo in the first place.
Basically: (a) for real single server durability you need to turn off hardware buffering or have a battery-backed RAID controller to ensure your write really hit disk; (b) this won't help you if your disks fail, and this failure mode is as likely as any other; and (c) for some applications, the delay required to replay a transaction log is unacceptable--you need 100% uptime.
I am thinking, though, that the fact that MongoDB writes to files at all is somewhat misleading, and that they may as well say that all your data is loaded into virtual memory. (Since they make no guarantees that the database files will be consistent except in the case of a controlled shutdown. Neither does MySQL, they point out, but I think in practice MySQL database files will be easier to recover from, since their structure is presumably more regular.)
We run MongoDB on development machines, and they're frequently shut down unexpectedly. But some of our development databases take an hour or two to regenerate from scratch, so we prefer to run recovery.
In our experience, MongoDB single-node recovery is very robust. It makes no guarantees of transactional integrity _between_ objects, but in our experience the individual objects have always been recovered intact. According to the MongoDB documentation, recovery will occasionally fail to rebuild objects that span disk pages which were flushed in an inconvenient order, but this will not prevent it from recovering other objects.
So even though MongoDB doesn't have single node durability, and you _really_ ought to run it in replicated mode, it actually manages to have a robust recovery tool.
Just a thought here… there might be a difference between a development box that's mostly idle shutting down unexpectedly and a production server that's under heavy load, maybe starting to fall behind on some writes, freezing and crashing.
Just saying that you might want to test it in a way that simulates an overwhelming workload before you trust it.
It's more interesting to see how they recover in case of failure, rather than whether every write makes it completely to disk. Machine failures will happen. Writes will not be fully flushed to disk. It's how one recovers from failure that defines how robust the system is.
Gotta say, as a database developer, this is what interests me as well. Let's assume you've had a datacenter-wide power failure, and you're running the repair process on your master and slaves -- what does that look like? How long does it take? What are the factors that influence whether or not it will complete successfully?
Actually, I feel that these are some of the reasons that I find MongoDB to be an attractive solution for certain purposes. You can give up some facets of durability and make up for it in ease and performance. MongoDB actually fits quite well into various web apps that have a need for a secondary, non-authoritative datastore that needs to be highly available but not as reliable. Any kind of real-time stream feature would probably benefit from this. Just keep an authoritative copy of your data in your relational datastore of your choice, and asynchronously replicate data as needed to MongoDB. Since it's a non-authoritative datastore, you can denormalize it as much as you need. Works pretty well, IMHO.
I do agree durability is important, but as long as you are aware of this behavior and you consider it in your design, you may still find scenarios where the gained speed is a good trade off.
Another aspect that tends to be forgotten when speaking about MongoDB is its fire-and-forget API. Combined with the behavior of automatic collection creation and nonexistent data validation, this may lead to "interesting" results.
I'm a big fan of MongoDB, and I think its replicated durability characteristics are good enough for many classes of applications. If you could lose a minute of data and have it just be "bad" instead of "customer enraging and business threatening", then it's a very nice database system.
However--I do think the decision to make writes "fire and forget" is just a mistake. If you use abstractions like a connection pool under heavy concurrency, you can get unpredictable behavior in terms of when the data is "actually there."
For example, all in one thread:
with connection pool: do_write operation A
do other things...
with connection pool: read something, assuming A has been applied
Specifically, the fact that you get an arbitrary connection out of the pool means you cannot be sure that the database has completely processed operation A before executing your new query.
MongoDB has a "safe=" flag in its Python bindings that implements the project's official (http://www.mongodb.org/display/DOCS/Last+Error+Commands#Last...) recommendation for "it is written" consistency. It's a bit of a hack, but it calls "getLastError()" on the connection that did the write before returning from the update()/save()/insert()/delete() call. I think it's astonishing this behavior isn't the default.
In fact, I recently updated diesel's bindings to be safe=True, so getLastError() is always executed on write operations to make sure they succeeded before the call returns:
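The pool hazard and the safe= fix can be illustrated with a toy model (no real MongoDB here; all names are mine): an unacknowledged write sits queued behind one connection, so a read through a different pooled connection can miss it, while a safe write blocks until that connection's queue is drained.

```python
class FakeServer:
    """Toy model: each connection has its own queue of unapplied writes."""

    def __init__(self):
        self.data = {}      # what reads can actually see
        self.pending = {}   # connection id -> queued, unacknowledged writes

    def write(self, conn_id, key, value, safe=False):
        self.pending.setdefault(conn_id, []).append((key, value))
        if safe:
            # Analogue of calling getLastError() on the same connection:
            # don't return until this connection's writes are applied.
            self.flush(conn_id)

    def flush(self, conn_id):
        for key, value in self.pending.pop(conn_id, []):
            self.data[key] = value

    def read(self, key):
        return self.data.get(key)

server = FakeServer()
server.write(1, "a", 1, safe=False)   # fire-and-forget on connection 1
missed = server.read("a")             # None: still queued on connection 1
server.write(2, "b", 2, safe=True)    # safe write blocks until applied
seen = server.read("b")               # 2: visible via any connection
```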
Actually, that will be a recommended config for replica sets. Most of our slides already show 3 replicas per set.
Also, most people have backups so its not really "running naked". See http://www.mongodb.org/display/DOCS/Backups for a few ways to backup mongodb. With LVM/EBS/ZFS or any other snapshotable filesystem, backups can be done almost instantly. With EBS you can even get an insta-slave from the snapshot.
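A hedged sketch of those backup options (tool and command names from the MongoDB docs of the era; device paths, sizes, and volume names are made up):

```shell
# Logical dump: mongodump ships with MongoDB
mongodump --host localhost:27017 --out /backups/mongo-$(date +%F)

# Near-instant snapshot (LVM shown; EBS/ZFS are analogous).
# Flush and block writes first so the data files are consistent on disk:
mongo --eval 'db.runCommand({fsync: 1, lock: 1})'
lvcreate --size 1G --snapshot --name mongo-snap /dev/vg0/mongo-data
mongo --eval 'db.$cmd.sys.unlock.findOne()'   # release the write lock
```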
the suggested default step 1 for MongoDB is to acquire 3 servers?
i mean, no other database suggests such a huge default configuration. even knowing that their datacenter can get hit by lightning and all, a lot of large production sites don't even run with this kind of redundancy.
this seems like a pretty taxing workaround for not keeping an append-only transaction log.
No, it is a suggested configuration, not the suggested one. And recommending three nodes is not that uncommon for distributed systems, because you can't have a quorum with only two nodes. Some users are more concerned with handling 10s or 100s of servers than with having single-server durability.
That said, for most users two servers are fine. Other users don't need any replicas at all since they do a nightly dump from a stored source into mongodb, or they just take regular backups. There are many ways to achieve system-wide durability, not all of them require the database to be durable.
"the suggested default step 1 for MongoDB is to acquire 3 servers?"
At my previous job, we suffered a catastrophic RAID failure. The controller died while trying repair a degraded array. At this point, we valued the data enough that we weren't going to screw around with replacing the controller and restarting the recovery. The whole RAID array went out to DriveSavers, at the cost of several thousand dollars. Now, DriveSavers are really awesome folks, but my goal is to never do business with them again. :-)
After this incident, our policy was simple: If our data mattered, it had to be replicated.
A 3 server configuration is, admittedly, pretty high end: Even if one node is down, you still have redundancy. You can keep the remaining 2 nodes live while rebuilding the failed node.
I know lots of people who care _deeply_ about data integrity, but who keep their databases on a single server. I find this a bit mystifying: Even the most expensive hardware can die in ugly and unrecoverable ways. And RAID arrays are some of the biggest culprits: Their striping formats are usually undocumented and proprietary.
Sure. And again, they make that clear. It's quite easy to find out that MongoDB doesn't "have single-server durability". If, like me (and at least one commenter here) you say, "what does that mean?", you can quickly find an explanation on their web site, or in many other places.