I'm the guy that created the "Ops per second" slide included in the presentation. It was originally from my presentation at MongoNYC:
http://www.10gen.com/presentation/mongonyc-2011/optimizing-m...
I almost regret including that chart because a lot of people are citing it out of context without explanation. That chart illustrates a very extreme case, and it makes an exaggerated point so that people know to be conscious of working set size vs RAM.
FooBarWidget made some very valid points about working set elsewhere in these comments. It's important to note that any database will suffer when data spills out of RAM. The degree to which it suffers varies in different workloads, DBs, and hardware.
Databases like PostgreSQL do better in some cases because they have more granular/robust locking and yielding, so disk access is handled more gracefully and generally causes less performance degradation.
For example, in MongoDB, when you do a write, the whole DB locks. Additionally, it does not yield the lock while reading a page from disk in the case of an update (this is changing). So if you spill out of RAM, and you update something, now you need to traverse a b-tree, pull in data, and write to a data extent, possibly all against disk, while blocking other operations.
Other databases might yield while the data is being fetched from disk for an update (or use more granular locks), so a read on an entirely different piece of data can go through. That other piece of data might reside on another disk (imagine RAID setups), so it can finish while the write is still in progress. Beyond that, MVCC databases can do a read with no read lock at all.
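To make the contrast concrete, here's a toy Python sketch (plain threads and sleeps, not MongoDB internals, and the lock names are made up): with one coarse lock, a write that stalls on a simulated disk fetch blocks a read of completely unrelated data, while per-collection locks let the read go straight through.

    import threading
    import time

    global_lock = threading.Lock()                  # coarse: one lock for everything
    collection_locks = {"users": threading.Lock(),  # finer-grained alternative
                        "events": threading.Lock()}

    def slow_write(lock):
        with lock:            # the write holds its lock...
            time.sleep(0.5)   # ...while "fetching the page from disk"

    def timed_read(lock):
        start = time.time()
        with lock:            # the unrelated read only needs its own lock
            pass
        return time.time() - start

    for name, write_lock, read_lock in [
        ("coarse lock:", global_lock, global_lock),
        ("per-collection locks:", collection_locks["users"], collection_locks["events"]),
    ]:
        w = threading.Thread(target=slow_write, args=(write_lock,))
        w.start()
        time.sleep(0.05)      # let the write grab its lock first
        print(name, "read waited %.2fs" % timed_read(read_lock))
        w.join()

The coarse-lock read waits out most of the fake disk fetch; the per-collection read returns immediately.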
Yes, MongoDB lacks this robustness, and it's why it suffers more under disk-bound workloads than other databases do. However, 10gen is very aware of this, and they are making a strong effort to introduce better yielding and locking over the next year.
Regarding indexing and working sets, FooBarWidget points out that there are very naive ways to approach indexing that lead to a much larger working set than is necessary. With very simple tweaks you can have a working set that only includes a sub-tree of the index and smaller portions of data extents. These tweaks apply to many DBs out there.
The classic example is using random hashes or GUIDs for primary IDs. If you do this, any new record will end up somewhere entirely arbitrary in the index, which means the whole index is your working set, because placing the new record could traverse any arbitrary path. Imagine instead that you prefix your GUID with time (or use a GUID that has time as its first component). Now your tree arranges itself by time, and you only need the sub-trees covering the time ranges you frequently look at. Most workloads mostly reference recent data, so this helps a ton. You’ll find yourself doing things like this in other databases too, to improve locality.
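For illustration, a minimal Python sketch of that idea (the helper name is mine; note that MongoDB's default ObjectId already leads with a timestamp for exactly this reason):

    import time
    import uuid

    # Hypothetical helper: lead with a zero-padded hex timestamp so string
    # ordering matches time ordering and new ids land in the "recent"
    # sub-tree of the index instead of an arbitrary spot.
    def time_prefixed_id():
        return "%012x-%s" % (int(time.time()), uuid.uuid4().hex)

    print(time_prefixed_id())   # timestamp prefix, then randomness
    print(uuid.uuid4().hex)     # purely random: lands anywhere in the index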
What’s really interesting is comparing this with MVCC and WAL-based systems. They get a huge win by not needing locks to read data in a committed state. They also convert a lot of operations that involve random IO into sequential IO, but you pay the price somewhere else, usually in constant compaction on live, query-handling nodes. Rbranson elaborated on this really nicely:
http://news.ycombinator.com/item?id=2688848
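To give a feel for that trade-off, here's a toy Python sketch (names made up, nothing like a real storage engine): overwrites become sequential appends, reads just scan back for the latest version, and stale versions pile up until a compaction pass rewrites them.

    def append(log, key, value):
        log.append((key, value))       # sequential write: just append

    def read(log, key):
        for k, v in reversed(log):     # newest version of the key wins
            if k == key:
                return v
        return None

    def compact(log):
        latest = {}                    # keep only the newest version per key
        for k, v in log:
            latest[k] = v
        return list(latest.items())    # the price paid later, in bulk rewrites

    log = []
    append(log, "a", 1)
    append(log, "a", 2)                # an overwrite is just another append
    append(log, "b", 3)
    print(read(log, "a"))              # -> 2
    print(len(log), "entries before compaction,", len(compact(log)), "after")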
Other DBs also win in some ways by managing their own row cache. Letting the OS page cache do it against raw disk extents means RAM ends up holding holes of dead data. If you handle caching yourself, you cache actual live rows rather than deleted data. Cassandra does this.
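To be concrete about what a row cache buys you, here's a tiny Python sketch (a plain LRU keyed by row id, nothing Cassandra-specific): a delete drops the row from the cache right away, so only live rows compete for RAM rather than whole disk extents that still contain dead data.

    from collections import OrderedDict

    class RowCache:
        def __init__(self, capacity):
            self.capacity = capacity
            self.rows = OrderedDict()

        def get(self, key):
            if key in self.rows:
                self.rows.move_to_end(key)     # mark as recently used
                return self.rows[key]
            return None                        # caller falls back to disk

        def put(self, key, row):
            self.rows[key] = row
            self.rows.move_to_end(key)
            if len(self.rows) > self.capacity:
                self.rows.popitem(last=False)  # evict the least recently used row

        def delete(self, key):
            self.rows.pop(key, None)           # deleted rows never linger in RAM

    cache = RowCache(capacity=2)
    cache.put("user:1", {"name": "a"})
    cache.put("user:2", {"name": "b"})
    cache.delete("user:1")
    print(cache.get("user:1"))                 # -> None, no dead data cached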
Once MongoDB addresses compaction, locks, and yielding, it will start to be very competitive.
Keep in mind that I think there's something to be said for having native replica sets and sharding. I can say that it definitely works in 1.8; we use it all over the place. With the recent journaling improvements, durability is also better, but even without it our experience has been more than adequate. Secondary nodes keep up with our primaries just fine. In the rare case where a primary falls over, we lose maybe a handful of operations, which for our workload is acceptable.
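For what it's worth, here's roughly how you ask the driver for journaled, majority-acknowledged writes when durability matters; this is a sketch with a modern pymongo client against an assumed replica set named "rs0" on made-up hosts, not the 1.8-era API we were using.

    from pymongo import MongoClient, WriteConcern

    # Assumes a replica set "rs0" is already running on these hosts.
    client = MongoClient("mongodb://db1:27017,db2:27017,db3:27017/?replicaSet=rs0")

    # Journaled, majority-acknowledged writes; looser write concerns trade a
    # few lost ops on failover for lower latency.
    events = client.mydb.get_collection(
        "events", write_concern=WriteConcern(w="majority", j=True)
    )
    events.insert_one({"type": "signup", "ok": True})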
Overall, we're happy with our choice of MongoDB; it's already doing what we need. With the improvements coming down the road, I think it'll be a major force a year from now, so I'm excited for what the future holds.