
Scaling with MongoDB (or how Urban Airship abandoned it for PostgreSQL) - rbranson
http://schmichael.com/files/schmongodb/Scaling%20with%20MongoDB%20(with%20notes).pdf
======
AlexC04
Although I'm _very_ interested in the subject matter (I have a production site
that uses MongoDB and AWS), I often find presentations like this incredibly
annoying to read.

The problem is that presentations are short-form jot notes that are meant to
be supported by a delivery as well. The vast majority of the value comes from
what's not written down.

What's worse is that the "signal to noise" ratio detracts from my ability to
understand what you're really going on about.

Some examples:

You say "MongoStat is _Amazing_ " why? how? what does it do? (I've personally
played with and used it a little, but I'd like to know why you're amazed by
it. How can the things you've learned improve my practice?

You mention starting with 1.2 - does that mean you're not using 1.8? I've used
1.2 and was suffering 80% - 99.9% CPU load, as soon as I updated to 1.81 the
load dropped to closer to 0.03% for most of the same operations. Does that
mean your talk is invalid? Obviously there's a lot more to your story that I
can get from your slides.

All things considered though I'm no trying to come off as a jackass. Normally
when I see a HackerNews->slidedeck I just get frustrated and annoyed by things
like the "inside joke" and "cute kitten slides" that rely on the speaker to
deliver some insightful punchline. In this case I _did_ read one or two
interesting points.

Though I find I'm still suffering from the signal-to-noise problem that
accompanies all these decks, this was better than most. I'd love a bit more
narrative around the points that matter and a bit less around things that
don't (how to query in mongo isn't really necessary in a story of why you
abandoned it, is it?)

Anyways - was that a rant, review or stream of consciousness?

 __TD;DR __Thanks for the slide deck; I hate slide decks; Yours was less
terrible than most.

~~~
schmichael
(I'm the speaker)

Audio should be coming, but I'm not sure it will help. I definitely tried to
pack way too much into a 45 minute talk.

Not sure how to explain mongostat further... have you run it on a loaded
server? Just a great little near-real-time stats tool. Like vmstat for a db.
It could probably have a talk all its own just to explain how to diagnose
performance issues with it.

Most of my comments apply to MongoDB 1.4 although our little bit of data that
remains in MongoDB has been upgraded to 1.8 (via 1.6 first). I've never had
MongoDB use a significant amount of CPU load (from 1.2 - 1.8).

------
FooBarWidget
The presentation claims that MongoDB performance drops off a cliff once the
data or the indices no longer fit in RAM. This is not entirely
complete/correct.

First off, the same problem occurs no matter which database. Oracle,
PostgreSQL, MySQL, whatever, they all grind to a halt when stuff no longer
fits in RAM.

Second, it's not really the whole data set or the whole index that needs to
fit in RAM, it's the _working set_. This is a pretty big difference. I'm
currently running a production MongoDB database which is about 20 times larger
than RAM and at first I made the mistake of using random string primary keys
(_ids). This ensured good distribution between shards but also ensured good
distribution within the primary key index. This essentially made the working
set of the primary key index equal to the entire index. Once the primary key
index no longer fits in RAM, the database grinded to a halt. I now use a
primary key consisting of a timestamp followed by some random text. The
timestamp ensures that inserts only hit the end of the index, thereby keeping
the working set small. The database is now almost as fast as when it was
fresh.

~~~
rbranson
This harsh cliff does NOT occur on every database. The old guard RDBMSs have
extremely sophisticated and well-tuned I/O code that has been working around
slow disks for decades. While you can't escape the physics of rotational
drives, MongoDB does probably the worst job possible of dealing with their
poor performance. Yeah, great, it's fast as long as data is in memory, welcome
to the world of b-trees. With a high disk:RAM ratio for a data set (like
>30:1), it's slower than PostgreSQL. Without compaction, it's performance
degrades to become many of orders of magnitude slower than PostgreSQL.

The other major problem is that there is no tooling that can be used measure
how large your working set is in MongoDB, so by the time you figure out you're
near the cliff, it's WAY too late.

MongoDB encourages poor decisions in query design that lead to these problems.
The "rich query model" is enough rope to hang yourself with in a big data /
high scale situation. What makes MongoDB popular is that it's got much of the
convenience of SQL queries, but the problem is that it also falls into the
same set of traps. This is exactly what happened to Foursquare -- MongoDB's
marketing materials won't help escape the realities of data locality, the
scale-up hit, poor concurrency design, and rotational disk speed coupled with
mediocre caching infrastructure and an unpolished I/O subsystem.

As far as I'm concerned, MongoDB isn't for big data or scale -- yet. It's best
use is for small to medium sized data sets which benefit from it's schema-free
design.

~~~
FooBarWidget
While it's true that some other DBs have better code for working around slow
disks, does it really help all that much? In my case the random primary key
was causing every single operation to result in a seek to a random location on
disk. I really doubt other DBs could have helped me no matter how good their
code is. At best they postpone the core problem.

What tools do you use to measure the working set in PostgreSQL and other
databases?

~~~
rbranson
A random operation results in random I/O, another unavoidable problem. What I
speak of is the way in which modern, MVCC-based databases deal with
concurrency and optimally pattern writes in a way that makes reads lock-free.
In addition, they don't sacrifice any level of durability or ability to
quickly recover a corrupted database to get this. It's just a more advanced
design.

Usually each table is kept in it's own file (or set of files, if needed),
which helps sequential I/O for a single table stay sequential, and also
provides a point of logical partitioning for database operations.

In contrast, MongoDB stores ALL of the data for a database in a single set of
sequentially numbered files. This works great if you have a single collection
that can be ordered monotonically and queried in a roughly time-ordered way,
and you never delete anything, but it's pretty bad otherwise. MongoDB was
originally built for analytics, and it shows.

In this modern RDBMS, everything gets written first to a write-ahead log
(WAL), in which the database will group several transactions together to make
this a mostly sequential I/O operation. Many DBAs insist on a dedicated disk
for their WALs so that the drive head is dedicated to this sequential I/O.
There are also often multiple WALs, allowing multiple threads to write to disk
concurrently. Most MVCC databases default to a READ-COMMITTED isolation level
(this is actually relaxing some ACID semantics somewhat), and in this level,
performing reads and writes do not block each other.

When these WAL files grow to a certain point, the data in them is ordered to
be written to the table files as sequentially as possible in lower priority
threads. Multiple updates are resolved and collapsed. Parts of the database
files that contain deleted data are marked as such in some kind of heap
structure that keeps track of this free space. Whatever index updates need to
be made are done as well, and it's much more optimal to do these in bulk at
checkpoint time than it is to try to do them in-place. Doing bulk index
updates is FAR more preferable than doing them in-place at write-time, an
indisputable fact.

PostgreSQL in specific compresses large data values (ones that would push the
row size beyond the 8K page size) and transparently keeps them in a separate
TOAST table. The values are broken into 2,000 byte chunks. This prevents the
main data files from getting too sparse when row sizes have a high level of
variance.

Cassandra borrows WAL-writes-first-then-lazily-checkpoint technique, but takes
it a step further by using sorted, immutable table files. Cassandra doesn't
have MVCC, so making the table files immutable means that there's no need to
worry about locking as the contents never change. The sorted order means that
merging table files together at compaction time can be done in roughly O(N*2)
time.

All that work in making I/O as sequential as possible significantly softens
the blow of running out of RAM cache for your most active set. Only random
reads actually become slow in this case, instead of practically every kind of
I/O like it does in MongoDB when used in real-world scenarios.

To answer your second question, as far as I know, what I've done to to measure
how far away PostgreSQL is from the RAM cliff is to set the shared_buffers to
2/3 RAM (or 3/4 RAM if you're >8GB, 4/5 RAM if you're >32GB, etc), and measure
hit rates between the buffers and OS cache. Hit rates can be measured at the
database level and also down to the table level. The idea here is that if your
shared buffer hit rates begin to drop precipitously, you need more RAM, but
also gives you the another 1/3rd of the RAM on the box before you're truly
doomed. PG works well with the OS cache, so performance should remain steady.

MongoDB can't give you this indicator because it's dependent on the OS cache
entirely. I've spent a day looking for tools that would allow tracking how
large the hot pool of Linux disk cache was, with no success.

PostgreSQL also gives you per-table statistics on how many queries are ran,
and then how much and what types of I/O are being performed, so catching
problems and pinpointing them can be done before they become problems.

~~~
FooBarWidget
Thank you for this informative rely, this kind of rely is what I'm looking
for. There are a few things I should reply on:

> MongoDB stores ALL of the data for a database in a single set of
> sequentially numbered files.

Correct. But are you under the assumption that this means documents are spread
randomly inside the files? MongoDB database files work with extents. Each
collection consists of multiple extent. The extents may be scattered across
the database files but each extent is contiguous.

> write-ahead log (WAL), in which the database will group several transactions
> together to make this a mostly sequential I/O operation

Very informative, thanks. I thought the WAL was only a durability feature.
Indeed, all the resources that I've read on WAL so far only mention it in the
context of durability. I had to read your post a few times to understand that
it can be used for optimizing I/O.

That said, MongoDB relies on the operating system's I/O scheduler to perform
the reordering. It just performs the write without fsyncing, allowing the
writes to end up in the OS page cache. The OS is then free to reorder the
writes however it likes.

How useful is the WAL for reordering I/O compared to what the OS already does?
Do you think one is superior to the other? Other than bulk index updates of
course.

> Only random reads actually become slow in this case, instead of practically
> every kind of I/O like it does in MongoDB when used in real-world scenarios.

Actually after fixing the _id problem my MongoDB database became insanely fast
again. I/O reordering _does_ happen in with MongoDB, just not in MongoDB
itself.

> MongoDB can't give you this indicator because it's dependent on the OS cache
> entirely.

Your technique for measuring the working set sounds not too far off from what
I do with MongoDB although I use a few more tools. The 'mongostat' utility
shows how many page faults MongoDB gets; this is similar to your buffer miss
rates. You can use Linux cgroups to limit MongoDB's page cache size. I also
use mincore() and fincore() to see which places in the database files get
paged in.

All in all, I have to change my opinion and admit that PostgreSQL does do many
things better than MongoDB. However I think MongoDB doesn't do as bad as you
think it does, and the automatic sharding feature in MongoDB is still
something I really need.

~~~
rbranson
You are correct that it's broken into extents, but as the extents become
sparse over time without compaction, I/O gets scattered further and further
around the disk.

The WAL is as much a durability features as it is about fast recovery, fast
writes, and concurrency. MongoDB uses a journal collection for durability,
which is quite a different thing, and actually requires doubled writes. The
journal collection is just another MongoDB collection, instead of being an
optimized log file structure.

The OS I/O scheduler works fairly well, but it's a general purpose tool and
must work well for all types of I/O. The WAL writes themselves are ALWAYS
sequential, something in-place updates will never get. The OS still tries to
flush writes within something like up to 5 seconds (the Linux deadline
scheduler). The RDBMS WAL checkpoint process can gather hundreds of megabytes
of data over many minutes or hours, sort it, and write to the table files in a
single pass.

In PostgreSQL, you can tune the checkpoint process down to how much data to
write before a checkpoint, how long to wait at maximum, and how much I/O
capacity to consume while checkpointing.

The cgroups, mincore, and fincore techniques are very interesting, going to
look into them.

The key to the PostgreSQL strategy is that segmenting the cache into in-
process buffers and the OS disk cache. It's an early warning system once the
hits start to cross that threshold.

I'd be interested in learning about any strategies that could be used to
segment the OS disk cache into multiple pools, because I'd prefer to actually
reduce the shared buffer pool as it's somewhat less efficient than just using
the OS disk cache (duplicated caches). Perhaps some kind of automated process
could monitor the hit rate and actively expand the cgroups as needed to try to
meet a hit rate target, and give you some warning when a RAM % threshold was
exceeded.

There is nothing evil or wrong about MongoDB itself, it's just a new product
that needs more polish. Many years ago when I first started using PostgreSQL,
it needed much polish. It takes years and years to fine tune these types of
things and find out how the initial assumptions made work out in real life.

------
cloudhead
Well... no shit, people. For whatever reason, many developers think MongoDB is
an SQL replacement.

To be fair, MongoDB did a very good job at hiding information which made it
look bad, while posting misleading benchmarks. I think the majority of the
user-base doesn't know MongoDB isn't durable out of the box.. they think it's
just a "faster", better MySQL.

Doing solid research goes a long way to tone down the hype.

~~~
methodin
I actually tend to believe many developers realize that data in a relation DB
is not the only way to go about things. Having spent my entire adult life in a
SQL bubble it's refreshing to have all these alternative tools now that can do
more for me as a collective whole. But of course blindly doing anything
without a reason is a bad idea.

For me, Mongo blows any RDMS out of the water with regards to prototyping and
getting an idea into a working solution. That alone is why I use it
exclusively for new projects - under the guise that if any of the projects
ever become successful I'll have to split things out into their respective
buckets. So far, I haven't been fortunate enough to have these problems.

~~~
rbranson
Can you give a concrete example where MongoDB blew an RDBMS out of the water
in regards to prototyping? I find that database schemas don't really get in
the way until a project matures, and then it becomes a road block both in a
legacy manner and that making changes puts strain on the database
infrastructure.

~~~
methodin
I can't remember the last time a schema I made in week 1 was similar to the
schema in week 4. That alone is why I made that statement. Modifying schemes
requires modifying code all over the place that relies on it. The majority of
the time, with Mongo, I can ignore those changes and only code for what I
added.

That is my experience. Perhaps you have the patience and foresight to plan out
your schema well before coding.

------
gurgeous
Money quote based on trying to restore their MongoDB during the EBS downtime:

"Waiting for indexes to build takes approximately as long as it takes AWS
engineers to fix EBS issues."

~~~
dolinsky
As far as I'm aware this would be no different when reimporting a dump from
MySQL/Postgres. If you're not restoring from a snapshot then the indexes have
to rebuild themselves.

~~~
cheald
My experience has been that MySQL backup/restore ops (mysqldump; mysql <
dump.sql) are orders of magnitude slower than MongoDB backup/restore ops. We
use LVM snapshotting for MySQL backups because the "normal" method is just too
slow.

I just did a test dump of a 1.7GB MongoDB database:

    
    
        real    0m5.781s
        user    0m0.570s
        sys     0m4.817s
    

I'm not in a position to do a restore just now, but my experience has been
that restores are similarly quick. 120GB is obviously a far larger amount of
data, but unless you've gone insanely nuts with the indexing, I can't imagine
that it would take hours upon hours to do.

~~~
cdavid
5 seconds for a 1.7 GB working set sounds really low: it would mean you could
write > 200 Mb / sec assuming you could get data from mongodb infinitely fast.
Also, what happens to your mongo database while you export data (can you still
write to it ?).

In any case, mysqldump is rarely the bottleneck - restoring is. For the
latter, you _really_ want to use mysql from percona, because their version of
mysqldump has an option to void the data without index first, and create the
index later (--innodb-optimize-keys=True). The effect will depend on the
database schema of course, but I often see one order of magnitude difference
for simple, large tables with a few simple indexes.

~~~
cheald
5 seconds does seem very fast, but this is on a high-end Rackspace server with
what I'm sure are very fast hard drives and some 64GB of RAM. It's a bit
slower on "normal" boxes, but still quite quick.

You can still write to your database, yes. There's a flag, --oplog, which when
passed to mongodump, will record the oplog position at the time the backup
starts, then dump any ops since the backup started as a part of the backup, so
they get replayed on restore. This leaves your data consistent with the
database state at the end of the backup without the need to issue a write
lock.

------
imack
This reminds me of a video I saw about scaling with ruby on a niche social
network (<http://zenbach.com/scaling-fetlife-com-with-ruby>). It was an
interesting talk, but they also tried MongoDb and moved away from it, though,
they were also using an older version.

------
schmichael
A few clarifications:

* Slides w/o notes available here: <http://opensourcebridge.org/2011/wiki/Scaling_with_MongoDB>

* Most slides referred to MongoDB 1.4

* This was _our_ experience, YMMV. I just wanted to share our story in hopes it would help others avoid some pitfalls.

I'll try to give honest responses to questions asked. I'm not trying to be
unnecessarily cynical, overly negative, or anti-10gen in my slides, talk, or
comments here. I'm sure many people are using MongoDB happily, and I've heard
10gen's support is outstanding.

~~~
Joakal
A good part of your presentation should have been what your requirements are.

For example: MongoDB's schemaless feature has a downside that means more data
as a key must be defined individually. Great if you constantly need to
fix/change the schema. A structured schema solution obviously has a space
saving.

~~~
schmichael
Schemaless is _not_ great when you need to actually fix or change your schema.
It's very difficult to migrate every document's schema without downtime.

Schemaless _is_ great if you have a _dynamic_ schema. We did not for the main
dataset in question, and I probably should have mentioned that.

We do have 1 smaller dataset still in Mongo that currently has a dynamic
schema and works perfectly (very little data & traffic though).

~~~
jamwt
Kind of a side topic, but lazy, JIT migrations work well for this. You keep a
"migration chain" in your fetch layer that goes v1 -> v2, v2 -> v3, and every
persisted doc is tagged with the schema version at the time of persistence.

Then, your binding code just runs it through whatever subset of the chain
brings the document to v-latest. If the document is re-written, it will
(naturally) be re-written in updated form, and you'll only pay the upgrade
penalty once (or, not at all for cold data).

It sounds kinda creepy to old-school schema theorists, but in practice it
actually works pretty well (provided you always use this library to retrieve
your objects.)

Using this method, I've never done a stop-the-world schema migration on a
schema-free system, and I don't think I ever would... otherwise, you're
tossing out one of the major benefits of schema-free: no migrations! If you're
going to be rigorous about full-collection schema consistency, you maybe
should have just used a proper RDBMS to begin with...

~~~
jorangreef
I am trying to come up with a simple way of automatically handling data and
schema migrations across a database that spans client and server treating both
as masters. It just seems incredibly difficult to get right. Your idea of
handling migration at read time sounds promising.

~~~
jamwt
Yep--and just to reassure you, it's more than an idea on my end... I designed
and maintained a critical system a few years ago on a large K/V store that
used this method, and it worked without a hitch. We never even needed to think
about the fact that the documents within the store were comprised of various
schemas from several generation of the application (except when we wrote and
tested the next incremental link the upgrade chain).

------
i34159
Had some of the same issues with HBase. Ended up going back to Postgres too,
but with a twist: <http://www.slideshare.net/cloudflare/sortasql>

~~~
someone13
What filesystem did you end up using? Ever looked at GlusterFS
(<http://www.gluster.org/>)? From an (admittedly unscientific) comparison of
the various free/open-source distributed filesystems, it seems the best to me.

Ceph is still in beta, and the fact that Lustre only supports a single
Metadata Server is really scary to me.

------
dolinsky
One particular part of that slide deck stuck out, and b/c there's no
accompanying audio to go with it I thought I'd mention it. Towards the
beginning of the deck in one slide he says 'MongoDB is fast.... until your
data / indexes no longer fit in memory'. Later on there's the '120GB of data
in MongoDB became 70GB of data in Postgres'. The next slide talks about using
short key names + a key-mapping layer to preserve space.

While using longer key names in your objects will lead to larger objects /
larger storage requirements, indexes do not contain the names of the keys of
the documents they are created on, so if I have 10M documents in a collection
and an index on a field 'ts', that index has the same size as if the field
were named 'timestamp'. It's only the storage of the object that grows in
size, and by the time you are handling 120GB of data it's very reasonable to
imagine you would be using Memcached in front of MongoDB to obtain the actual
objects.

~~~
schmichael
We used verbose key names and paid for it. We even failed to override _id with
a useful primary key. So... not great.

That being said, our data is not completely normalized in PostgreSQL, so it's
not a totally unfair comparison.

It may be possible to simulate an apples to apples comparison (small keys in
mongo + perfectly normalized pgsql tables), but I'm not sure that would be a
useful indicator for real world use cases.

Suffice it to say: repeating your schema in every row/document is going to
take more space than not repeating it.

~~~
dolinsky
Oh no doubt, and I think naming conventions is something that takes practice
over time in this situation - too verbose leads to 'wasted' space, too little
leads to making it near impossible for someone to view the database. Since
indexes don't store the key names that does alleviate part of the problem as
well.

I tend to leave the default _id in place as it also acts as a 'created_on'
timestamp, saving the need to add an additional field.

Thanks for the deck. It's always good to hear about situations that didn't
work out well and what can be learned going forward (though I do think you'd
have a better experience using the 1.8 branch than you did with 1.4).

~~~
moe
The real shame is that this is still something the programmer has to worry
about in first place.

When you're a database in 2011 then you should have a damn good reason for
asking programmers to space-optimize their identifiers. I can't see such a
reason in Mongo really.

~~~
schmichael
Everyone should be upvoting this.

I vaguely remember 10gen brainstorming some ideas for fixing this, but it's a
pretty poor design decision for a database you're advertising for use in "Big
Data" scenarios. (Not that our measly 120 GB db constitutes big data... but
even at Medium Data something as silly as key sizes can have a significant
performance impact).

~~~
bdarfler
Some/M=most Mongodb client libraries can do this for you. I can see arguments
both ways to keep it in the server or keep it in the client library. 10gen
does a hell of a job supporting a wide variety of very good client libraries.

------
someone13
Thanks for this. I've heard lots of good things about MongoDB (and auto-
sharding makes me drool), but I've heard rumours of problems with it too.
Always cool to read something from the other side of the fence.

Too bad that there's no other product out there that makes automatically
sharding large amounts of data quite so easy...

~~~
megaman821
Doesn't Riak and Cassandra make it easy? There is just no fast ad-hoc querying
in those two.

~~~
rbranson
Arguably, with a data set of any serious magnitude, there's no such thing as
ad-hoc. As soon as an RDBMS is sharded, it loses most of it's ad-hoc powers.

That being said, Riak Search allows Riak to perform ad-hoc queries and
Cassandra now supports secondary indices along with a SQL-like query language
called CQL. Of course, you'll have to pre-plan your indexes with either of
those options.

~~~
schmichael
"As soon as an RDBMS is sharded, it loses most of it's ad-hoc powers."

That's an overly broad statement I would avoid. We only need to query across
our various PostgreSQL partitions for offline batch operations. Our
application code has yet to feel restricted by the manual sharding.

Secondary indexes - or rather: not restricting a query to a single partition
(using your primary key) - is going to be expensive in any distributed system.

~~~
rbranson
You lose aggregates and joins, arguably the most powerful parts of an ad-hoc
query system (business intelligence kind of stuff). Without those features,
you must move a great deal of these concerns into the application layer (or
whatever manage your shards), even if it's just rolling up aggregates across
shards. This works, but again, at the loss of the real advantages of the
database here as you've got to do it yourself.

I agree with your second assertion. Secondary indexes should be avoided, but
are a useful tool if absolutely needed. I've found that Cassandra's richer
data model really allows them to be avoided in most scenarios.

------
invisible
I think a huge issue that he somewhat glazed over is that he is refusing to
use auto-sharding. That is a pretty large benefit that MongoDB provides. Oh,
and he refuses to upgrade his 1 year old build (1.5, now at 1.8) - no
explanation of why but he was willing to switch the entire stack to
PostgreSQL.

~~~
robotadam
We made the switch before 1.8 was released, and made the decision during the
1.6 cycle, give or take. Auto-sharding never worked reliably in our tests. I
think schmichael does a good job explaining the new features that would have
helped -- spares indexes, in particular -- but the core issues still remain
(lock contention, no non-blocking online compaction).

~~~
kristina
For 2.0, there is some major work on locking being done (concurrency will be
an ongoing theme in 2011) and online compaction has been implemented (although
you have to kick it off, it's not yet automatic).

~~~
robotadam
According to the docs, this is a blocking operation -- so it's inplace, but I
wouldn't consider it to be online, as I would PostgreSQL's vacuum or
Cassandra's major compactions:
<http://www.mongodb.org/display/DOCS/compact+Command>

~~~
rbranson
Let's also consider that MongoDB's entire I/O architecture and data structure
layout is built around a global lock and in-place updates, which makes
implementing non-blocking compaction nearly impossible without either a MAJOR
overhaul of the codebase or if somehow they were able to rig up the
replication system to internalize the advocated compact-on-replica-set-slave-
and-then-rotate method.

PostgreSQL has an automatic cost-based vacuum and sophisticated adaptive free
space and visibility maps. With log-then-checkpoint writes and MVCC, there's
no interruption of the database at all to reclaim page space used by deleted
rows.

Cassandra also has a write-ahead log, and flushes it's in-memory
representation of data to immutable files on disk, which actually was a pretty
brilliant decision. Dead rows are removed during a minor compaction, which can
also merge data files together. The compaction is very fast because the files
are immutable and in key-sorted order, making it all lock-free, sequential
I/O.

In theoryland(tm), in-place updates seem better because you're only writing
data once. In realityworld(R), the need for compaction, cost of random I/O,
data corruption problems, and locking make it suck.

------
geophile
This is a really fascinating presentation, and an easy read. The major
conclusion is to think really carefully about abandoning a SQL database for
noSQL. A little more info on the postgres solution and their experience with
replication would have been good.

~~~
schmichael
Horizontal scaling is key. If we had more faith in auto-sharding in MongoDB
1.6, that may have worked (however horizontal scaling wasn't our only issue
with MongoDB).

With PostgreSQL we manually partition the data (luckily we have a really nice
key to partition on).

PostgreSQL 9's streaming replication has been adequate so far. We've only been
using it in production for a few months and recently moved off of AWS* which
drastically reduced our error rates overall.

* [http://urbanairship.com/blog/2011/06/01/hybrid-cloud-were-up...](http://urbanairship.com/blog/2011/06/01/hybrid-cloud-were-up/)

~~~
jacquesgt
Totally off-topic here, but have you noticed that the body text size of your
company's blog stays really small even when trying to zoom in? I've tried
zooming in using Safari and Chrome, and with both of them the text stays so
small that it's difficult to read.

------
thadeus_venture
It seems that the main criticisms are:

a) Lack of online compaction (causing data set to be much bigger than
necessary)

b) Broken auto sharding (complicating horizontal scaling)

c) "Scary" durability and scaling stories with auto sharding

Looks like auto-sharding has had a lot of work done, while online compaction
still isn't there.

But in particular I'm curious about (c). What exactly is so scary durability
wise with auto-sharding? Is it just the fact that it didn't work in 1.5/1.6?

~~~
schmichael
"But in particular I'm curious about (c). What exactly is so scary durability
wise with auto-sharding? Is it just the fact that it didn't work in 1.5/1.6?"

Yes. During the 1.5 dev cycle we ran out of time. Our main database server
couldn't vertically scale any further, so we had to do something. Despite
1.6.0 being released before our migration finishing, auto-sharding still
didn't seem to be anywhere near the stability we needed.

I've heard some horror stories regarding auto-sharding & replica sets (in 1.8)
since then that sadly I'm not at liberty to share.

I think you missed a number of finer points as well:

* We were not guided to make the best schema decisions initially, and migrating schemas is tricky and manual.

* Some schema decisions exposed us to short-comings in MongoDB (the double updating issue).

* The global write lock is severely limiting.

* It's unsafe by default (drivers default to not checking for errors when writing, among other things)

* The ops load is greater for auto-sharding+replica-sets than for manually sharded master/slave setups (granted the former has features the latter does not).

Perhaps the most frustrating part was the constant unfortunate surprises.
There are stories that didn't even make it into the slides. Many you could
blame on us. Some we could blame on EC2. However, strong benefits to using
MongoDB over PgSQL never materialized, so we opted to switch to a system with
far fewer unknowns for our fairly modest needs.

If you think fully utilizing MongoDB's architecture will give your business a
competitive advantage: go nuts. That was far from our experience.

~~~
thadeus_venture
Thank you, that's very interesting. "Going nuts" with MongoDB is exactly what
I'm contemplating right now at my start up, and this is something I haven't
quite heard before.

Is there more info on the double update issue anywhere? It sounds like a
really bad race condition.

------
rosser
If you want to use MongoDB in conjunction with PostgreSQL, the latest release
of Bucardo (5.0) includes support for replicating between them.

~~~
bsg75
In the PostgreSQL -> MongoDB direction only. While I understand the reasons
for implementing this first (nested document replication to an RDBMS is a
bigger task), I am hoping that it will see the reverse direction someday.

Using MongoDB as data collectors on the edge, replication to a central RDBMS
for use with established BI tools would be a win.

------
code_duck
Etsy.com unveiled a new gallery list system, the Treasury last year and
published a couple of articles about their use of Mongo. Apparently they had
some problems and rewrote the backend recently... haven't heard what yet.
Hopefully they'll publish their experiences in their dev blog after the dust
settles.

------
bdarfler
It seems a bit disingenuous to post slides now reviewing a year old version of
MongoDB that is almost 3 major revs behind.

These days auto-sharding works and works well. There is a learning curve to
optimize it (i.e. picking a good shard key, analogous to picking a good _id
field) but that is to be expected.

Replica sets work exactly as advertised in my experience. Automatic fail over
and no need to wake up at 3am is a huge win.

Finally single server durability directly addresses durability concerns that
many have.

The first two points are the reason why my company and so many others are
switching even if that means needing to understand how to pick good ids/shard
keys and having to shorten our field names by hand. I can see why, if they
were not available at the time, MongoDB might not be a good choice but the
fact of the matter is that they are available today.

------
ConceitedCode
Great read. Gives a great example of when NoSQL is not a good solution instead
of a relational database.

~~~
Nizumzen
I think the real lesson here is to choose technologies based on actual
experience rather than hype.

So that means extensive testing before nailing your colours to the mast.

~~~
mtkd
Or choose technology based on requirements.

If the port to SQL was so easy and used half the storage and didn't appear to
impact performance - it was possibly the better solution in the first place.

Mongodb is awesome - one of best technologies I've played with in a long while
- but don't be too quick to write SQL off - especially if you have a lot of
relationships.

Off topic - prototyping in Rails with Mongoid and the schemaless Mongodb
underneath is frictionless - it's a step change in speed from prototyping with
SQL.

~~~
tedkalaw
Would you happen to have any anecdotes of a project or dataset that MongoDB
worked particularly well for you?

I'm working on a project where documents seem to work particularly well,
though the core of the project is still in MySQL. I decided to give Mongo a
shot for what I'm working on, but posts like this are a reminder to constantly
be evaluating the technology you choose to use.

~~~
jjm
Take a look at this: <http://www.percona.com/software/percona-server/for-
nosql/>

There is one for mysql: <https://github.com/ahiguti/HandlerSocket-Plugin-for-
MySQL>

But i would urge to go with XtraDB.

I use mongo for everything but the money (where cash is not involved) and for
things not needing transactions. Have been sticking to MySQL for everything
else (now moving to XtraDB).

For big big big projects and pure nosql I go with Cassandra (as long as you
can feed it enough RAM). Otherwise, XtraDB can go the distance.

------
andrew311
Michael, interesting presentation.

I'm the guy that created the "Ops per second" slide included in the
presentation. It was originally from my presentation at MongoNYC:

[http://www.10gen.com/presentation/mongonyc-2011/optimizing-m...](http://www.10gen.com/presentation/mongonyc-2011/optimizing-
mongodb-lessons-learned-at-localytics)

I almost regret including that chart because a lot of people are citing it out
of context without explanation. That chart illustrates a very extreme case,
and it makes an exaggerated point so that people know to be conscious of
working set size vs RAM.

FooBarWidget made some very valid points about working set elsewhere in these
comments. It's important to note that any database will suffer when data
spills out of RAM. The degree to which it suffers varies in different
workloads, DBs, and hardware.

Databases like PostgreSQL do better in some cases because they have more
granular/robust locking and yielding, so disk access is handled better and
generally leads to less degradation in performance.

For example, in MongoDB, when you do a write, the whole DB locks.
Additionally, it does not yield the lock while reading a page from disk in the
case of an update (this is changing). So if you spill out of RAM, and you
update something, now you need to traverse a b-tree, pull in data, and write
to a data extent, possibly all against disk, while blocking other operations.

In other databases, they might yield while the data is being fetched from disk
for update (or have more granular locks), so a read on an entirely different
piece of data can go through. That other piece of data might reside on another
disk (imagine RAID setups), so it is able to finish while the write is also
going. Beyond that, MVCC DBs can do a read with no read lock at all.

Yes, MongoDB lacks this robustness, and it's why it suffers worse under disk
bound workloads than other databases. However, 10gen is very aware of this,
and they are making a strong effort to introduce better yielding and locking
over the next year.

Regarding indexing and working sets, FooBarWidget points out that there are
very naive ways to approach indexing that lead to a much larger working set
than is necessary. With very simple tweaks you can have a working set that
only includes a sub-tree of the index and smaller portions of data extents.
These tweaks apply to many DBs out there.

The classic example is using random hashes or GUIDs for primary ids. If you do
this, any new record will end up somewhere entirely arbitrary in the index.
This means the whole index is your working set, because placing the new record
could traverse any arbitrary path. Imagine instead that you prefix your GUID
with time (or use a GUID that has time as the first component). Now your tree
will arrange itself according to time, and you will only need the sub-trees
for the data that has time ranges you frequently look at. Most workloads only
reference recent data, so this helps a ton. You’ll still find yourself doing
stuff like this elsewhere in other databases to improve locality.

What’s really interesting is when you look at MVCC, WAL log systems and how
they compare. They get a huge win by not needing locks to read things in a
committed state. They also convert a lot of ops that involve random IO into
sequential IO, but you pay a price somewhere else, usually needing constant
compaction on live, query handling nodes. Rbranson elaborated really nicely on
this stuff:

<http://news.ycombinator.com/item?id=2688848>

Other DBs also win in some ways by managing their own row cache. Letting the
OS do it against disk extents means holes in RAM data. If you handle it
yourself, you focus on caching actual rows and not deleted data. Cassandra
does this.

Once MongoDB addresses compaction, locks, and yielding, it will start to be
very competitive.

Keep in mind that I think there's something to be said for having native
replica sets and sharding. I can say that it definitely does work in 1.8, we
use it all over the place. With the recent journaling improvements, durability
is also better, but even without it our experience has been more than
adequate. Secondary nodes keep up with our primaries just fine. In the rare
case where a primary falls over, we lose maybe only a handful of operations,
which for our workload is acceptable.

Overall, we're happy with our choice of MongoDB, it's already doing what we
need. With the improvements coming down the road, I think it'll be a major
force in a year from now, so I'm also excited for what the future holds.

