
A random operation results in random I/O; that's an unavoidable problem. What I'm talking about is the way modern, MVCC-based databases deal with concurrency and pattern their writes in a way that makes reads lock-free. On top of that, they don't sacrifice any durability or the ability to quickly recover a corrupted database to get this. It's just a more advanced design.

Usually each table is kept in its own file (or set of files, if needed), which helps sequential I/O for a single table stay sequential, and also provides a point of logical partitioning for database operations.
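In PostgreSQL you can see this mapping directly from SQL; a quick illustrative query (the table name is a stand-in):

    -- Path of the file backing a table, relative to the data directory.
    -- Tables over 1GB are split into numbered segment files (.1, .2, ...).
    SELECT pg_relation_filepath('my_table');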

In contrast, MongoDB stores ALL of the data for a database in a single set of sequentially numbered files. This works great if you have a single collection that can be ordered monotonically and queried in a roughly time-ordered way, and you never delete anything, but it's pretty bad otherwise. MongoDB was originally built for analytics, and it shows.

In such a modern RDBMS, everything gets written first to a write-ahead log (WAL), and the database groups several transactions together to make this a mostly sequential I/O operation. Many DBAs insist on a dedicated disk for their WALs so that the drive head is dedicated to this sequential I/O. There are also often multiple WALs, allowing multiple threads to write to disk concurrently. Most MVCC databases default to the READ COMMITTED isolation level (which actually relaxes ACID semantics somewhat), and at this level, reads and writes do not block each other.
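A minimal sketch of what that buys you in practice, using a hypothetical accounts table: under READ COMMITTED, the reader in session B doesn't wait on session A's uncommitted write, it just sees the last committed version of the row.

    -- Session A: a long-running write, not yet committed
    BEGIN;
    UPDATE accounts SET balance = balance - 100 WHERE id = 1;

    -- Session B: does not block on A's row lock; MVCC serves
    -- the last committed version of the row instead.
    SELECT balance FROM accounts WHERE id = 1;

    -- Session A: make the new version visible to subsequent reads
    COMMIT;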

When these WAL files grow to a certain point, the data in them is sorted and written to the table files as sequentially as possible, in lower-priority threads. Multiple updates to the same data are resolved and collapsed. Parts of the database files that contain deleted data are marked as such in some kind of heap structure that keeps track of free space. Whatever index updates need to be made are done as well, and doing them in bulk at checkpoint time is far more efficient than doing them in-place at write time.

PostgreSQL specifically compresses large data values (ones that would push the row size past its 8K page size) and transparently keeps them in a separate TOAST table, broken into roughly 2,000-byte chunks. This prevents the main data files from getting too sparse when row sizes have a high level of variance.
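You can watch this happen from psql; a rough sketch, with a made-up table and data:

    -- Hypothetical table with a wide text column
    CREATE TABLE docs (id serial PRIMARY KEY, body text);

    -- ~128KB of hard-to-compress text, far too big to stay inline
    INSERT INTO docs (body)
    SELECT string_agg(md5(i::text), '') FROM generate_series(1, 4000) AS i;

    -- The hidden TOAST relation PostgreSQL created for the table
    SELECT reltoastrelid::regclass FROM pg_class WHERE relname = 'docs';

    -- Main-table size vs. total size including the TOAST data
    SELECT pg_relation_size('docs') AS main_bytes,
           pg_total_relation_size('docs') AS total_bytes;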

Cassandra borrows the write-to-the-WAL-first, checkpoint-lazily technique, but takes it a step further by using sorted, immutable table files. Cassandra doesn't have MVCC, so making the table files immutable means there's no need to worry about locking, as the contents never change. The sorted order means that merging table files together at compaction time can be done in roughly O(N) time, like the merge step of a merge sort.

All that work in making I/O as sequential as possible significantly softens the blow of running out of RAM cache for your most active set. Only random reads actually become slow in that case, instead of practically every kind of I/O, as happens with MongoDB in real-world scenarios.

To answer your second question: what I've done to measure how far away PostgreSQL is from the RAM cliff is to set shared_buffers to 2/3 of RAM (or 3/4 if you have >8GB, 4/5 if you have >32GB, etc.) and measure hit rates between the buffers and the OS cache. Hit rates can be measured at the database level and also down to the table level. The idea is that if your shared buffer hit rates begin to drop precipitously, you need more RAM, but you still have the other 1/3rd of the RAM on the box before you're truly doomed. PG works well with the OS cache, so performance should remain steady.
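For reference, that hit rate can be pulled straight out of the statistics views; this is one common form of the database-level query:

    -- Shared-buffers hit rate per database; watch for sustained drops
    SELECT datname,
           blks_hit::float / NULLIF(blks_hit + blks_read, 0) AS hit_rate
    FROM pg_stat_database;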

MongoDB can't give you this indicator because it's dependent on the OS cache entirely. I've spent a day looking for tools that would allow tracking how large the hot pool of Linux disk cache was, with no success.

PostgreSQL also gives you per-table statistics on how many queries are run, and how much and what types of I/O are being performed, so you can catch and pinpoint issues before they turn into real problems.
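Concretely, these live in the pg_stat_* and pg_statio_* views; something along these lines shows the activity and buffer hits per table:

    -- Query activity per table: scan counts and tuple churn
    SELECT relname, seq_scan, idx_scan, n_tup_ins, n_tup_upd, n_tup_del
    FROM pg_stat_user_tables;

    -- I/O per table: block reads served from shared buffers vs. below
    SELECT relname, heap_blks_read, heap_blks_hit, idx_blks_read, idx_blks_hit
    FROM pg_statio_user_tables;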



Thank you for this informative reply; this is the kind of reply I was looking for. There are a few things I should respond to:

> MongoDB stores ALL of the data for a database in a single set of sequentially numbered files.

Correct. But are you under the assumption that this means documents are spread randomly inside the files? MongoDB database files work with extents. Each collection consists of multiple extents. The extents may be scattered across the database files, but each extent is contiguous.

> write-ahead log (WAL), in which the database will group several transactions together to make this a mostly sequential I/O operation

Very informative, thanks. I thought the WAL was only a durability feature. Indeed, all the resources that I've read on WAL so far only mention it in the context of durability. I had to read your post a few times to understand that it can be used for optimizing I/O.

That said, MongoDB relies on the operating system's I/O scheduler to perform the reordering. It just performs the write without fsyncing, allowing the writes to end up in the OS page cache. The OS is then free to reorder the writes however it likes.

How useful is the WAL for reordering I/O compared to what the OS already does? Do you think one is superior to the other? Other than bulk index updates of course.

> Only random reads actually become slow in this case, instead of practically every kind of I/O like it does in MongoDB when used in real-world scenarios.

Actually, after fixing the _id problem, my MongoDB database became insanely fast again. I/O reordering does happen with MongoDB, just not in MongoDB itself.

> MongoDB can't give you this indicator because it's dependent on the OS cache entirely.

Your technique for measuring the working set sounds not too far off from what I do with MongoDB, although I use a few more tools. The 'mongostat' utility shows how many page faults MongoDB gets; this is similar to your buffer miss rates. You can use Linux cgroups to limit MongoDB's page cache size. I also use mincore() and fincore() to see which parts of the database files get paged in.

All in all, I have to change my opinion and admit that PostgreSQL does do many things better than MongoDB. However, I think MongoDB doesn't do as badly as you think it does, and the automatic sharding feature in MongoDB is still something I really need.


You are correct that it's broken into extents, but as the extents become sparse over time without compaction, I/O gets scattered further and further around the disk.

The WAL is as much a durability feature as it is about fast recovery, fast writes, and concurrency. MongoDB uses a journal collection for durability, which is quite a different thing, and actually requires doubled writes. The journal collection is just another MongoDB collection, rather than an optimized log file structure.

The OS I/O scheduler works fairly well, but it's a general-purpose tool and must work well for all types of I/O. The WAL writes themselves are ALWAYS sequential, something in-place updates will never get. The OS also still tries to flush writes within a bounded window, something like 5 seconds with the Linux deadline scheduler. The RDBMS WAL checkpoint process, by contrast, can gather hundreds of megabytes of data over many minutes or hours, sort it, and write it to the table files in a single pass.

In PostgreSQL, you can tune the checkpoint process down to how much data to write before a checkpoint, how long to wait at maximum, and how much I/O capacity to consume while checkpointing.
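For the curious, these are the knobs I mean (names as of the 9.x era; SHOW just reads the current setting):

    SHOW checkpoint_segments;          -- WAL volume that forces a checkpoint
    SHOW checkpoint_timeout;           -- maximum time between checkpoints
    SHOW checkpoint_completion_target; -- fraction of the interval to spread the I/O over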

The cgroups, mincore, and fincore techniques are very interesting; I'm going to look into them.

The key to the PostgreSQL strategy is segmenting the cache into in-process buffers and the OS disk cache. It acts as an early warning system once hit rates start to cross that threshold.

I'd be interested in learning about any strategies that could be used to segment the OS disk cache into multiple pools, because I'd prefer to actually reduce the shared buffer pool: it's somewhat less efficient than just using the OS disk cache (the caches are duplicated). Perhaps some kind of automated process could monitor the hit rate, actively expand the cgroups as needed to meet a hit-rate target, and warn you when a RAM-percentage threshold was exceeded.

There is nothing evil or wrong about MongoDB itself; it's just a new product that needs more polish. Many years ago, when I first started using PostgreSQL, it needed much polish too. It takes years and years to fine-tune these things and find out how the initial assumptions work out in real life.



