
Newsblur runs into MongoDB bug – can't replicate DB - sp332
https://jira.mongodb.org/browse/SERVER-9059
======
gecko
Here's what I do not understand:

Every last tool I have ever worked with has trade-offs. I don't have any
problem with that; I've sometimes even gone as far as phrasing it as, "If you
don't hate your tech stack, you aren't really using it." I can tell you all
kinds of things that C#, the .NET CLR, IIS, Apache, Mercurial, Git,
elasticsearch, Redis, Gunicorn, Python, Celery, and SQL Server do that make me
livid, because I've used them heavily. But that's _because_ I use them
heavily. I've never had _any_ of these tools bite me in the butt early in the
process, and _definitely_ not had any of them bite me in the butt in
_unexpected ways_ down the road. They bite me when I push them incredibly
hard, right to their limits, and, due to well-understood design constraints
that I'm frequently anticipating hitting ahead of time, they fall down. That's
normal and fine, and handling those situations is just good software
engineering. Your tools will have limits, and that stinks, but handling those
limits is part of what your job entails, and you need to deal with it.

I do not use MongoDB. But here's what I see: about once a month, I come across
an article where something incredibly fundamental to Mongo does not work
properly. Not only does it not work properly: the way it doesn't work properly
is _exceedingly bad_. In this case, Newsblur can't shard, which removes one of
Mongo's best benefits, and the _way_ it fails isn't to tell you early on that
you will not go to space today, but rather to segfault and die after six hours
of replication.

That's not predictable. That's not documented. And that's not something you
can anticipate. As a developer, that concerns me, and it should concern you,
too.

I understand that 10gen is an awesome, responsive company, and they have
always been there to help. I don't want to malign that. When the Trello team
had Mongo-related issues the other week, they were trying to help them out,
too. But I genuinely do not view as paranoia my belief that the frequency and
severity of stories like this mean that MongoDB is _still not a good
technology choice_.

~~~
nasalgoat
I have to agree - if I had the choice to make again, I wouldn't have chosen
MongoDB. It is a fundamentally broken product _by design_.

For example, for their sharding setup they require three separate servers to
host shard config data. The idea is to have high availability of that data
should one of those config servers go away.

However, _by design_, losing a _single_ config server can cause multiple
shard masters to die, requiring a manual restart. How they die and when is
random, determined by where in a block migration a server is.

How is this high availability? All three config servers must be running at all
times without interruption for your cluster to be stable. Their roadmap to
deal with this issue is sometime next year.

One month not too long ago, my company found 90% of the bugs listed in their
bug tracker for a specific release of MongoDB - many of which would have been
found with a basic unit testing suite and some minor load testing. We were
effectively performing QA functions for 10gen in our production environment.

I've gone through _nine releases_ of their PHP driver to deal with broken Data
Center Awareness and none of them have worked - DCA still eludes them two
years later. We have to do an OS-level hack to make this work, which breaks
other HA functions.

Finally, their mmap design means that memory use is extremely inefficient - on
a box with 256GB of memory, with a database that is only 100GB in size, it
_still hits disk on db reads_ because they offload memory management to the
OS. Any other enterprise-level DB would preload the entire dataset in memory
if there's room, but not MongoDB.

It really is terrible.

~~~
Qerub
> it still hits disk on db reads because they offload memory management to the
> OS

Or rather: It hits the OS's disk cache. I'm not saying that this isn't a
problem, but it's far from as bad as you make it sound.

([http://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Serve...](http://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server)
actually recommends limiting `shared_buffers` (PostgreSQL's in-memory cache)
to let the OS disk cache do its magic.)
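
To illustrate what "offloading memory management to the OS" means in general, here is a minimal Python sketch of reading a file through a memory map. It is not MongoDB's code; the file layout and the names `PAGE`, `first_byte_of_page_10`, and `last_page` are made up for the example. The program never decides what stays in RAM: the kernel's page cache does, so a read may be served from memory or may fault in from disk.

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE  # size of one OS memory page

# Build a small throwaway "data file" of 64 pages; page i is filled with byte i.
with tempfile.NamedTemporaryFile(delete=False) as f:
    path = f.name
    for i in range(64):
        f.write(bytes([i]) * PAGE)

with open(path, "rb") as f:
    # Map the whole file read-only. No data is copied into the process here;
    # pages are brought in by the kernel when they are first touched.
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

    # Each access below is an ordinary memory read from the program's point
    # of view. Whether it hits RAM or disk is entirely up to the OS.
    first_byte_of_page_10 = mm[10 * PAGE]
    last_page = mm[63 * PAGE : 64 * PAGE]

    mm.close()

os.remove(path)
```

The trade-off being argued in this subthread is whether leaving that decision to the kernel is good enough, or whether the database should manage its own cache.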

------
conesus
NewsBlur's developer here. Just want to point out that this bug bit me
back in 2.2, a few months ago. I need to be able to replicate my db in order
to shard. But the biggest beef I have with this bug is that it doesn't become
apparent until AFTER you spend 6 hours replicating (which results in poor load
times for the primary db).

I would love it if I could choose which server to sync from. This option used
to exist, but they removed it. But that would only solve the performance
penalty of replication.

To their credit, the MongoDB folks have been stellar. I had a hardware failure
a year ago and the CTO ssh'ed into my machines to figure out what was wrong.
This time I'm having a bit more difficulty getting the problem fixed, but it's
understandable as I have 100GB of highly variable data.

~~~
MichaelGG
100GB takes 6 hours? That's ~4.7MB/sec - is that right? That seems, sorta like
slow?

Not to sound rude, but what is so difficult about managing 100GB of data? You
can fit that in RAM without much work.
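
The ~4.7MB/sec figure above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: it assumes the full 100GB moved in exactly 6 hours, and it shows both the 1024-based and 1000-based reading of "GB".

```python
# Average throughput implied by copying 100 GB in 6 hours.
DATA_GB = 100
HOURS = 6

seconds = HOURS * 3600                    # 21,600 seconds

rate_binary = DATA_GB * 1024 / seconds    # treating 1 GB as 1024 MB
rate_decimal = DATA_GB * 1000 / seconds   # treating 1 GB as 1000 MB

print(f"{rate_binary:.2f} MB/s (1 GB = 1024 MB)")
print(f"{rate_decimal:.2f} MB/s (1 GB = 1000 MB)")
```

Either way the result lands between 4.6 and 4.8 MB/s, so the estimate holds.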

~~~
conesus
The transfer speed is between data centers. West coast to east coast.

As for the 100GB, I have 12 task servers with 6-8 processes each reading 25-50
stories each every 4 seconds. I also have 18 app servers with 6-8 processes
each reading 12 stories every second. Each story is on average 4KB. That's
several MB a sec.
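
A rough check of the "several MB a sec" estimate, using only the numbers quoted above. The helper `stories_per_sec` and the low/high split (taking the bottom and top of each quoted range) are assumptions made for the sketch, not anything NewsBlur measures this way.

```python
STORY_KB = 4  # average story size quoted above


def stories_per_sec(servers, procs, stories, interval_s):
    """Stories read per second by one tier of servers."""
    return servers * procs * stories / interval_s


# Task servers: 12 servers, 6-8 processes, 25-50 stories every 4 seconds.
# App servers:  18 servers, 6-8 processes, 12 stories every second.
low = stories_per_sec(12, 6, 25, 4) + stories_per_sec(18, 6, 12, 1)
high = stories_per_sec(12, 8, 50, 4) + stories_per_sec(18, 8, 12, 1)

low_mb = low * STORY_KB / 1024    # MB/s at the bottom of each range
high_mb = high * STORY_KB / 1024  # MB/s at the top of each range

print(f"{low:.0f} to {high:.0f} stories/s")
print(f"{low_mb:.1f} to {high_mb:.1f} MB/s")
```

With these inputs the read load works out to somewhere between about 7 and 11.5 MB/s, which fits "several MB a sec".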

~~~
ryangripp
I just signed up for newsblur and paid $24. Love it.

------
sp332
Funny that just two days ago he posted: "It's easy to be negative on Mongo.
It's far harder to scale the alternatives. I've been through many ups and
downs w/ Mongo. Sticking w/ it."
<https://twitter.com/NewsBlur/status/314120501356265472>

~~~
coolsunglasses
Greek tragedy tier hubris backfire.

(Speaking as a user of Postgres and MongoDB, I'm way more worried about the
latter screwing me than the former.)

------
tharshan09
I don't understand why you are posting this on here. Is it to point at him and
laugh? It seems pretty childish to point out someone's hardship on here.

~~~
sp332
NewsBlur is one of the top contenders to replace Google Reader, which is
closing in 3 months. The guy behind it posted some of his scaling troubles
before <https://news.ycombinator.com/item?id=5391713> and his twitter
@NewsBlur has been keeping users up-to-date with performance issues and
planned downtime for migration etc. This story is interesting to watch.

~~~
leothekim
"This story is interesting to watch."

This is a JIRA. It became a story once you posted it and commented on it in
Hacker News.

~~~
sp332
I meant the whole story of NewsBlur being at the right place at the right time
and trying to scale really, really fast.

------
jerdavis
Just going to get on the hater train here. (not any poster in particular but
Mongo in general). The day Postgres releases an upgrade, I expect, and get,
100% functionality. So the fact that this is "Just Released" is FAR from an
excuse, in fact it shows EXACTLY why you shouldn't use Mongo. Also a 100GB DB,
with 300 connections and several MB per second.. Seriously. WTF is wrong with
you people? Multiply by 10, and you can still do that on 1 machine.

~~~
nasalgoat
MongoDB performance is abysmal - I'm running 50 machines with massive memory,
CPU and disk resources to get, maybe, 20,000 queries/sec per master.

There appears to be _zero_ CPU or memory optimization in place. Mostly I blame
the mmapped files.

------
nemothekid
Hope this issue can be resolved in a manner that isn't toxic. I actually ran
into an issue like this on a fairly large collection 2 months ago (almost
300GB, happened on a single replica set). It looked like an index got
irreversibly corrupted. Luckily for us this collection got rebuilt every day,
so we could just delete the whole thing, but the pain was we weren't aware of
the issue until a user reported he couldn't access data. Whenever the
corrupted data was accessed, it would lead to an assertion failure. Seemed
like one of those dreadful moments when mongo decides to corrupt your data.

------
tweaqslug
First, they are not attempting to shard, they are just adding redundancy.
Second, it is not surprising that someone encountered a bug in a piece of
server software released literally two days ago.

~~~
MichaelGG
I guess people tend to expect server _database_ software to not have critical
bugs like this. I understand, mistakes happen. But this doesn't help Mongo's
less-than-stellar reputation.

~~~
ajbetteridge
Or he could have, I don't know, tested it first? Nah, no one tests first
before making such a large change on a production server. And don't give me
crap about it's a one man operation either and we should feel sorry for him,
if he's hoping to take over a good portion of Google Reader's subscribers
(which he seems to be hoping to do) then he should be planning better. If he
can't plan to cope with a big influx of new users, that he's asked for by
advertising his service, then perhaps migrating to his service isn't the right
thing to do. I'd rather have Google Reader notify me that they're shutting
down in 3 months than a one man operation shutdown without notice because they
can't handle the load and the hassle and costs are way over what they were
expecting.

------
diminoten
If you actually can't shard a 2.4 db, I feel like we'd have heard about it
earlier. There's (almost guaranteed) something else going on here, but nice
sensationalist title.

~~~
sp332
I didn't sensationalize the title. That's just what the bug report says. _I
can't make any new copies of the data! And I need to start sharding because my
service is exploding in growth. What am I supposed to do if I can't
replicate?_

------
leothekim
Looks remarkably similar to this bug, which they claim is fixed:
<https://jira.mongodb.org/browse/SERVER-6228>

Though the error in that bug was due to a regular replication event ("Fatal
Assertion 16360") and this one was happening during an initial sync ("Fatal
Assertion 16361"). Maybe the bug was fixed in one and not the other.

------
ryangripp
I love how Google kills reader and newsblur's Alexa graph goes off the charts

<http://www.alexa.com/siteinfo/newsblur.com#>

