

Retrospective from Postmark on outages (MongoDB) - cemerick
http://blog.postmarkapp.com/post/20172172065/lessons-learned-from-our-recent-outages

======
megaman821
I know I shouldn't care, but I am always wary of doing business with any
company using MongoDB.

10gen put out claims that MongoDB was so much faster than SQL solutions, but
it seemed obvious to me that turning off fsync would make the SQL solutions
run at about the same speed. Besides, why would I want to run my database in a
mode where it is easy to lose data? MongoDB may be a useful product, but their
marketing is deceptive, which will lead to companies using it in inappropriate
situations.

~~~
rogerbinns
Have you noticed that the problem isn't MongoDB? Every case I've seen has been
the data traffic exceeding the capacity of the system. There is no database
that works well under those circumstances. Another way of putting it is that
these businesses have been sufficiently successful that they have outgrown
their own planning and deployment.

As for your second paragraph, MongoDB has had journalling for quite a while,
so you can make your writes durable, limited only by the speed of your
storage.
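The durability trade-off being argued about can be sketched in a few lines of
plain Python: a journalled write appends the record and fsyncs before
acknowledging, so an acknowledged write survives a crash. The file name and
record format here are invented for illustration; this is the idea, not
MongoDB's implementation.

```python
import os
import tempfile

def journalled_write(journal_path: str, record: bytes) -> None:
    """Append a record and fsync before returning (i.e. before acking)."""
    with open(journal_path, "ab") as f:
        f.write(record + b"\n")
        f.flush()
        os.fsync(f.fileno())  # data is on stable storage before we ack

path = os.path.join(tempfile.mkdtemp(), "journal.log")
journalled_write(path, b"msg-1")
journalled_write(path, b"msg-2")
```

Skipping the fsync is what makes the "fast" benchmark mode fast, for any
database: the write is acked while it still lives only in OS buffers.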

~~~
thibaut_barrere
> Every case I've seen has been the data traffic exceeding the capacity of the
> system. There is no database that works well under those circumstances.

Is there a benchmark somewhere comparing the memory/disk consumption of
MongoDB vs. other datastores?

If there's a significant overhead (and my early tests suggested there was,
though I didn't run a strict benchmark), then the problem would be very much
MongoDB-specific.

(Honest and real question; I'm a MongoDB user btw, as well as Redis, MySQL,
PostgreSQL etc.)

~~~
rogerbinns
The main overhead in MongoDB's storage is that the "column names" (keys) are
stored in every record, rather than just once as in a traditional SQL database
and some of the other NoSQL solutions. That is why you'll often see developers
using very short key names, and one use for an "ORM" is to translate between
developer-friendly names and short stored names.

Of course this could be solved fairly easily by the MongoDB developers with a
table mapping short tokens/numbers to the long names. This is the
ticket:
ticket:

<https://jira.mongodb.org/browse/SERVER-863>

This is someone's measurements with different key names:

<http://christophermaier.name/blog/2011/05/22/MongoDB-key-names>

~~~
thibaut_barrere
Thanks for the links.

My question goes further, though: as someone who has both worked with and
implemented column-based stores, I'm curious to compare the respective
disk/RAM consumption for the data part, too.

I think I'll write such a benchmark one day :)

------
viraptor
I get the impression that one of the biggest issues was missed. They did not
test the standard load against the secondary server; they assigned a machine
of lower specs to the task, and there's nothing in the future actions that
indicates they'll change it... Even if they go for the new and shiny, they can
end up in the same situation when their master fails.

I hope they just overlooked that in the blog post, rather than actually not
correcting this first.

~~~
steve8918
That's my takeaway as well.

I have no opinions on MongoDB, but it really seems like this particular
problem arose because they skimped on disaster recovery, i.e. their failover
hardware was less powerful than their production hardware. The root cause of
their downtime was inadequate planning.

That's like spending money on car insurance, but realizing only after you get
into an accident that it covers almost nothing; you've wasted your money
paying for it. They paid for the secondary failover hardware, but it was
effectively useless, since they were still down for two days. The only thing
it possibly mitigated was how long they were down, but the hardware's primary
objective, keeping them up in case of a disaster, was a complete failure.

I've worked at a company that was completely down for a day worldwide due to a
"disaster", even though we had spent millions on diesel fuel generators, etc.
I blame the "checkbox" mentality where people only look to satisfy
requirements, but no one actually has ownership over the process and the
details. Unfortunately, in my case, no one got fired over this complete
misstep, which is another problem... zero accountability.

~~~
viraptor
Seems like Netflix's Chaos Monkey is not a bad idea after all. I don't mean
you have to kill your services randomly while there are users on them... but
switching from your master to your secondary (why are you even making a
distinction anyway?) should be a pretty standard operation.

Even normal upgrades (hardware fails; it's a question of when, not if) could
be handled transparently just by making the "secondary" server a first-class
citizen.

------
ninjastar99
Really tried to love Postmark for about 3 months now. Constant, almost daily
issues forced us to unfollow them on Twitter (it literally became an annoyance
seeing issues daily). Then we switched to Mailgun last week, and we are very
happy. +1 to Mailgun

------
brainless
Are they using MongoDB for the wrong purposes? The task they have seems
similar to logging, and if so, isn't there much better software for that? A
logging server, perhaps?

~~~
rogerbinns
MongoDB does have what it calls capped collections; the semantics are the
same as a circular log's. No idea why they don't use that storage format.
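The circular-log semantics can be sketched with a fixed-size buffer: once the
cap is reached, the oldest entries are dropped on insert. This toy caps by
entry count, whereas MongoDB's capped collections cap by bytes (optionally
also by document count); the event names are invented.

```python
from collections import deque

# A capped log: appends beyond maxlen silently evict the oldest entry,
# just like inserts into a full capped collection evict the oldest docs.
log = deque(maxlen=3)
for event in ["sent-1", "sent-2", "bounce-1", "sent-3"]:
    log.append(event)

print(list(log))  # the oldest entry, "sent-1", has been evicted
```

The appeal for logging workloads is that storage use is bounded by
construction, so traffic growth can't fill the disk.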

In any event this problem is one of success. That is the kind of problem I
prefer having.

------
reustle
So the primary was much beefier than the secondary? If so, why let it fail
over? I figured you should always have the same resources for a mongo primary
and secondary.

------
bsg75
Two is one. One is none.

This includes equality in failover systems. The common issue in all of these
cases is not the DB engine, OS, stack, whatever, but the infrastructure.
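The equality requirement above is mechanically checkable. A minimal sketch,
with invented spec fields and numbers resembling the mismatch described in
the thread: a failover pair only counts as "two" if the secondary matches or
exceeds the primary on every resource.

```python
def can_fail_over(primary: dict, secondary: dict) -> bool:
    """True only if the secondary matches/exceeds the primary everywhere."""
    return all(secondary[k] >= primary[k] for k in primary)

# Hypothetical inventory: an underpowered secondary like the one in the post.
primary = {"cpu_cores": 16, "ram_gb": 64, "disk_iops": 20000}
secondary = {"cpu_cores": 8, "ram_gb": 32, "disk_iops": 5000}

print(can_fail_over(primary, secondary))  # this pair will fall over
```

A check like this belongs in monitoring, so the pair can't silently drift out
of parity after a hardware swap.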

