
MongoDB, Data Durability and Improvements coming in 1.8 - mattyb
http://www.paperplanes.de/2011/1/10/mongodb_and_data_durability.html
======
agentultra
I agree largely with this post. I'm tired of hearing the same arguments the
author has. This technology we use is hardly infallible and as programmers and
system administrators of these systems the defects should be obvious! Yet
somehow people still think that the "clueless users" are to blame for these
"defects."

Sure, we all know the saying about assumptions. But why are we at a point
where it's cool to use a data persistence layer that assumes it doesn't have
to live up to any of those assumptions? It's why we have things like ACID, the
LSB, etc. Contracts about how these systems should work for the end user.

It's hard to get all that stuff right off the bat, sure. But losing the entire
database when a few bytes land in the wrong place? Yikes. Hardly something I'd
trust real data to.

This stuff can always get better. Of course the solution as pointed out is
tried and true. Glad to see that the developers aren't covering their ears...

------
SoftwareMaven
I read once that every developer ought to use 'kill -9' as their method of
stopping services they are writing, then ensure that doing so never causes a
problem. Even if you can keep people from executing 'kill -9' (hint: you
can't), there are always cases that you can't control that can stop a box in
its tracks. By developing explicitly for this, you save yourself and your customer
a lot of pain.
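
A minimal sketch of what "developing for kill -9" can look like at the file level, assuming a POSIX filesystem. The helper name `atomic_write` is made up for illustration and is not taken from any database mentioned here. The idea is to never modify the live file in place: write a temporary copy, fsync it, then rename it over the original, so a kill at any point leaves either the old contents or the new ones.

```python
import os
import tempfile


def atomic_write(path, data):
    """Replace `path` with `data` so a crash leaves the old or the new file, never a mix."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live in the same directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push file contents to disk before the rename
        os.replace(tmp_path, path)  # atomic replacement on POSIX filesystems
    except BaseException:
        # Remove the temp file if anything failed before the rename completed.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # fsync the directory so the rename itself survives a power loss (POSIX only).
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

The cost is one extra copy of the file per update, which is why this fits small state files better than a large database file.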

Honestly, I would never use a database that doesn't make safeguarding my data
as its number one priority. Even if I'm just dumping logs in it, the time I
really need to know what's going on is likely the time those safeguards are
put to the test.

~~~
po
It was briefly mentioned in the article, but this is exactly how couchdb shuts
down:

 _On-disk, CouchDB never overwrites committed data or associated structures,
ensuring the database file is always in a consistent state. This is a “crash-
only” design where the CouchDB server does not go through a shut down process,
it’s simply terminated._

<http://couchdb.apache.org/docs/overview.html>
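
To illustrate the general idea of never overwriting committed data, here is a rough sketch of an append-only log. It is not CouchDB's actual file format; the record layout and the function names `append_record` and `read_records` are assumptions made for the example. Each record carries its length and a CRC32, so after a crash the reader keeps every intact record and ignores a torn tail.

```python
import os
import struct
import zlib

# Hypothetical record header: payload length, then CRC32 of the payload.
HEADER = struct.Struct("<II")


def append_record(path, payload):
    """Append one record; earlier bytes in the file are never rewritten."""
    with open(path, "ab") as f:
        f.write(HEADER.pack(len(payload), zlib.crc32(payload)) + payload)
        f.flush()
        os.fsync(f.fileno())  # the record counts as committed once this returns


def read_records(path):
    """Return every intact record, stopping at the first torn or corrupt one."""
    records = []
    with open(path, "rb") as f:
        while True:
            header = f.read(HEADER.size)
            if len(header) < HEADER.size:
                break  # clean end of file, or a header cut short by a crash
            length, crc = HEADER.unpack(header)
            payload = f.read(length)
            if len(payload) < length or zlib.crc32(payload) != crc:
                break  # torn or corrupt tail: ignore it
            records.append(payload)
    return records
```

With this layout there is no separate shutdown step to skip: opening the file after a kill is the same code path as opening it after a clean exit.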

(Please, no mongodb vs. couchdb wars here)

~~~
SoftwareMaven
I think that is actually where I read it. It must have resonated internally
and became the topic (in my mind).

------
prosa
This is exciting. MongoDB (and the suite of libraries building up around it)
has made prototyping web applications an order of magnitude easier than using
SQL. Lack of single-server durability, however, is a showstopper in
production, before you've grown enough to justify scaling the database beyond
one machine.

In my case, that meant going back to MySQL once our schema was finalized
(sadly). Looks like that won't be necessary in the not-so-distant future, which
means MongoDB is usable in new projects without worrying about replication
before otherwise necessary.

~~~
thenduks
What's with the aversion to replicas?

I have an app I'm working on launching right now, it's in what you might call
'private beta': fewer than 10 real people are accessing it. It's been in this state
for a little less than 2 months and there's been a replica-set and backups
happening since even before that.

What I'm saying is that you don't need to grow before adding a second box. My
project currently runs on ec2 micro instances and the second box will probably
cost you less than your GitHub code hosting costs! Just seems to me that (in
ec2 terms, as an example) going from $9/month burn to $18/month is hardly a
show stopper.

~~~
prosa
Aren't micro instances $15/month? (Plus EBS backing, unless you want to lose
your DB on reboot.)

Regardless, it's added complexity that you can avoid with MySQL early in your
application's lifecycle. Even if it's not much more expensive, you have more
hardware to administer. Compare that to a Rails+MySQL stack, which you can run
on a single box. (There are added benefits of replicas, sure, but they aren't
_easier_ to set up, per se, than just installing your stack on one machine
when you're getting going.)

~~~
thenduks
Yea sorry $15/month (EBS is practically free, as is Elastic IP), but anyway my
point still stands at $30/month. I think you'd be hard pressed to find a
developer who doesn't spend that much on coffee per month (not to mention
merchant accounts and support and code hosting and bug tracking and blah blah
blah) :)

As to the rest, sure, it's sys-admin-wise easier to set up. But then you have
to consider stuff like database dumps/backups. Where are you storing them?
Have you tested your recovery plan? Then there's hassle when you decide you
need a dedicated db box and move it there (friday fat-finger, anyone?). True
enough that none of these are deal-breaker level issues (evidenced by the
ubiquity of MySQL/etc).

With mongo you spend a tiny bit of extra money and a tiny bit of extra time
setting up a replica set which handles your backup for you. And then when you
need more db horsepower you just start sharding or just adding more replicas
and sending reads there. Need to upgrade the db box? Cool just add a huge
instance, add it to the replica set and promote it when it's done syncing.

I certainly agree that it's a trade-off... it's just one that I think is _not_
driven by data durability. Either way I need to handle my disaster recovery
with some kind of automated system and I'm just suggesting that whether this
is some kind of cron-job, restore recipe, etc backup of a db-dump or a replica
on a cheap extra box is a wash.

------
noelwelsh
I'm pleased Mongo is getting single server durability. I have never understood
why it got so popular without this feature. I'd love to know why people choose
Mongo over say, Riak, or CouchDB, as the majority of projects don't need more
than one server.

~~~
harryh
Because single server durability is a myth when hardware can fail at any time.

~~~
lwat
We're talking about data durability. When my SQL server crashes due to
hardware failure I know that when I eventually get it back up and running the
data will be consistent with at worst the last couple of transactions being
rolled back.

~~~
harryh
This is most certainly not true. Disks fail in ways where the whole volume
becomes unreadable all the time.

~~~
lwat
What I'm saying is that I can go up to my SQL server and disconnect the power
cord and my database will not be corrupt when I start it back up. Sure if your
HDD gets taken out by a meteor then nothing will save you but that's why you
have backups.

~~~
bsg75
Not guaranteed. I have had more than one customer experience hardware failure,
resulting in a corrupt or suspect SQL Server database, that was unrecoverable
via normal means.

In each of the cases where the customer had a true standby system, implemented
via replication, log shipping, or mirroring, they were able to failover with
little (log shipping) or no data loss.

In the cases where they had a single, standalone server, the option was to
restore the last known good backup, or send the database files to
Microsoft for analysis and repair.

ANY system (RDBMS, NoSQL, or otherwise), should have a standby replica to
prevent data loss. If your data is stored on a single machine, you are doing it
wrong.

~~~
lwat
I'm not disagreeing with you, all I'm saying is that SQL Server databases are
built from the ground up to resist data corruption and do exceedingly well
at it. Not so with NoSQL data stores.

~~~
ericflo
Not so with _MongoDB_. Let's please not blame all of NoSQL for MongoDB's bad
design decisions.

------
megaman821
I am never sure what niche MongoDB is supposed to fill.

If I want a large cluster to handle "big data" Riak or Cassandra seem to fit
the bill better.

If I want speed Redis is great.

If I want a schema-less SQL-like (but not SQL) database, MongoDB?

~~~
pashields
In my admittedly limited experience with mongo, it felt like they were
designing the ultimate database for the web. Not in a "web-scale" sort of way,
more in a swiss army database for the web dev way. This came to me when I
wanted to do location queries on a db. Most people using standard DBs do this
with PostGIS for postgres, which is incredibly powerful and accurate. But
mongoDB supports the "find shit near me with good enough accuracy" query that
90% of people want to run right now.

I was able to make a simple test page that told you the IP of the previous
visitor who was closest to you. It took me thirty minutes from "hey mongo has
some sort of geo support" to that (and I'm not a web dev and was using ATT 3G
to read doc). PostGIS took me longer to figure out how to set a lat,long pair
in the DB.
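
For a sense of what a "closest previous visitor" lookup computes, here is a brute-force sketch in plain Python. It is not how MongoDB's geo support works internally and it uses no MongoDB API; the names `haversine_km` and `nearest_visitor` and the sample data are made up for illustration. It scans every visitor and picks the one with the smallest great-circle distance.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; good enough for "near me" queries


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) pairs given in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def nearest_visitor(here, visitors):
    """visitors is a list of (ip, (lat, lon)); return the IP closest to `here`."""
    return min(visitors, key=lambda v: haversine_km(here, v[1]))[0]


# Hypothetical previous visitors, using documentation-range IP addresses.
visitors = [
    ("192.0.2.1", (52.5200, 13.4050)),   # Berlin
    ("192.0.2.2", (48.8566, 2.3522)),    # Paris
    ("192.0.2.3", (40.7128, -74.0060)),  # New York
]
closest = nearest_visitor((51.5074, -0.1278), visitors)  # someone in London
```

A linear scan like this is fine for a test page; a database with a geospatial index avoids touching every row.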

I'm not saying raw speed of development is the best measure of a database, but
if your #1 priority is the ability to rapidly prototype, I'd imagine mongo is
a good fit in your toolbelt.

~~~
lwat
I assure you that anyone who has used any kind of SQL database can do the
same stuff you did on Mongo in the same time or less. Development speed is
much faster on the SQL database once you have some more data and you can start
using joins and searches and all the other nice RDBMS features.

~~~
steveklabnik
I've built a lot of stuff on RDBMSen, and often, those 'nice features' are
actually what I call 'a giant pain in the ass.'

Sometimes, data doesn't fit a relational model well. In those cases, something
like Mongo is a godsend. And yeah, you could de-normalize your data and get
halfway to Mongo, but I'd rather use each tool for what it's good at.

~~~
lwat
"Sometimes, data doesn't fit a relational model well."

I keep hearing this but I've yet to see any good examples.

~~~
steveklabnik
Here's what I posed last time someone asked me this, I got no answer:
<http://news.ycombinator.com/item?id=1637903>

~~~
lwat
We have exactly the same kind of data relationships in our database. Sure it's
complicated and in your example sure you'll end up with a bunch of tables if
you normalize it all the way to 3NF. But there's a lot of GOOD STUFF you get
when you do, for example if you decide to rename a FooType or a FooSize then
you only have to update one record rather than searching every document in
your document DB for instances of those names so you can rename them. Hell if
you have many documents and 24/7 clients this may become IMPOSSIBLE to do
while still guaranteeing consistency in your data.
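
A small sketch of the rename case, using Python's built-in `sqlite3` module. The table and column names (`foo_types`, `foos`, `type_id`) are hypothetical stand-ins for the FooType example, not the schema from the linked comment. Because each Foo stores only a reference to its type, the rename is a one-row update inside a transaction.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE foo_types (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
    CREATE TABLE foos (
        id INTEGER PRIMARY KEY,
        label TEXT NOT NULL,
        type_id INTEGER NOT NULL REFERENCES foo_types(id)
    );
    INSERT INTO foo_types (id, name) VALUES (1, 'widget'), (2, 'gadget');
    INSERT INTO foos (label, type_id) VALUES ('a', 1), ('b', 1), ('c', 2);
""")

# Renaming a type touches exactly one row, and the transaction makes it all-or-nothing.
with conn:
    cur = conn.execute("UPDATE foo_types SET name = 'gizmo' WHERE name = 'widget'")
rows_changed = cur.rowcount

# Every Foo that referenced the old name sees the new one through the join.
names = conn.execute("""
    SELECT foos.label, foo_types.name
    FROM foos JOIN foo_types ON foo_types.id = foos.type_id
    ORDER BY foos.label
""").fetchall()
```

In a document store that embeds the type name in every document, the same rename means finding and rewriting each document that contains it.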

Your example is not that unusual or even that complicated. What is it about it
that doesn't work on a RDBMS? You can't just say the solution sucks because it
uses more tables than you feel it should.

~~~
steveklabnik
It's good to know that I'm not totally spewing bullshit, and that's the way it
should be done.

In this case, I would have less than 200 types, and probably under 1000 Foo
records total. Updates wouldn't have been a problem. The pain was much bigger
than the upside. Especially designing the forms that would end up creating
those rows...

It's not that it doesn't work. It's that it's not a 'this always fits'
solution.

~~~
megaman821
This is what views are for. If it is easier to work with the data
denormalized, create a view with a few joins. Some views can even have updates
directly applied to them.
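
Roughly what that looks like, sketched with Python's built-in `sqlite3` module. The schema and the names `foo_details` and `foo_details_update` are hypothetical. SQLite views are read-only on their own, so the example adds an `INSTEAD OF` trigger to route an update to the base table; some other databases accept updates on simple views without one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE foo_types (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE foo_sizes (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE foos (
        id INTEGER PRIMARY KEY,
        label TEXT NOT NULL,
        type_id INTEGER NOT NULL REFERENCES foo_types(id),
        size_id INTEGER NOT NULL REFERENCES foo_sizes(id)
    );
    INSERT INTO foo_types VALUES (1, 'widget');
    INSERT INTO foo_sizes VALUES (1, 'small'), (2, 'large');
    INSERT INTO foos VALUES (1, 'a', 1, 1), (2, 'b', 1, 2);

    -- The view presents the normalized tables as one flat, denormalized row set.
    CREATE VIEW foo_details AS
        SELECT foos.id, foos.label,
               foo_types.name AS type_name,
               foo_sizes.name AS size_name
        FROM foos
        JOIN foo_types ON foo_types.id = foos.type_id
        JOIN foo_sizes ON foo_sizes.id = foos.size_id;
""")

# In SQLite, an update against a view needs a trigger that rewrites it for the base table.
conn.execute("""
    CREATE TRIGGER foo_details_update INSTEAD OF UPDATE OF label ON foo_details
    BEGIN
        UPDATE foos SET label = NEW.label WHERE id = OLD.id;
    END
""")

conn.execute("UPDATE foo_details SET label = 'renamed' WHERE id = 1")
details = conn.execute(
    "SELECT label, type_name, size_name FROM foo_details ORDER BY id"
).fetchall()
```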

~~~
steveklabnik
You're not exactly helping with the "Mongo is significantly less complicated
to model my data with" argument. ;)

But still, good to know. I'm bad with views. :[

------
cagenut
If I had a nickel for every time a coworker used "-9" for no other reason than
being in a rush and not grok'ing what a bad idea it is.

~~~
samstokes
I don't know if this is what happened to the reluctant hero of this story, but
if the database really had hung, and wasn't responding to "sensible" methods
of shutting down, what are you supposed to do besides kill -9? (By "sensible"
I mean something the DB could trap in order to finish saving a consistent
state to disk - shutdown command, -INT, -QUIT, etc.)

(I'm not excusing trying -9 as a first resort, but if a process is deadlocked
(or pegging the CPU in a tight loop) and won't respond to -QUIT, there's not
much else you can do.)
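
A minimal sketch of what "trapping" a shutdown signal means, in Python and assuming a POSIX system. The class `GracefulShutdown` and the function `run` are invented for the example and do not describe how any particular database handles signals. The handler only sets a flag; the main loop checks it between units of work and flushes once before exiting. SIGKILL (the -9 case) is not in the list because a process cannot install a handler for it.

```python
import signal


class GracefulShutdown:
    """Records a shutdown request so the main loop can stop at a safe point."""

    TRAPPED = (signal.SIGINT, signal.SIGTERM)

    def __init__(self):
        self.requested = False
        self.signum = None

    def install(self):
        # Must be called from the main thread. SIGKILL is deliberately absent:
        # the kernel never delivers it to a handler.
        for sig in self.TRAPPED:
            signal.signal(sig, self._handle)

    def _handle(self, signum, frame):
        # Keep the handler tiny: just note the request and return.
        self.requested = True
        self.signum = signum


def run(shutdown, work_items, flush):
    """Process items until shutdown is requested, then flush what was completed."""
    done = []
    for item in work_items:
        if shutdown.requested:
            break
        done.append(item)
    flush(done)
    return done
```

A process stuck in a deadlock or a tight loop never gets back to the flag check, which is the situation where nothing short of -9 works.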

------
richcollins
You can also use fsync to guarantee that a transaction was written to disk at
a _severe_ hit to performance.

~~~
mathias_10gen
Actually that doesn't add any additional durability guarantees in the general
case. The only way to use MongoDB durably with a single server is to use the
new --dur flag coming in 1.8. I assure you there is no other way.

~~~
oomkiller
And how far out is 1.8?

