
A random operation results in random I/O; that's an unavoidable problem. What I'm talking about is the way modern, MVCC-based databases deal with concurrency and pattern their writes in a way that makes reads lock-free. On top of that, they don't sacrifice any durability or the ability to quickly recover a corrupted database to get this. It's just a more advanced design.

Usually each table is kept in its own file (or set of files, if needed), which helps sequential I/O for a single table stay sequential, and also provides a point of logical partitioning for database operations.
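In PostgreSQL you can see this mapping directly from SQL; a quick illustrative query (the table name is a stand-in):

    -- Path of the file backing a table, relative to the data directory.
    -- Tables over 1GB are split into numbered segment files (.1, .2, ...).
    SELECT pg_relation_filepath('my_table');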

In contrast, MongoDB stores ALL of the data for a database in a single set of sequentially numbered files. This works great if you have a single collection that can be ordered monotonically and queried in a roughly time-ordered way, and you never delete anything, but it's pretty bad otherwise. MongoDB was originally built for analytics, and it shows.

In such a modern RDBMS, everything gets written first to a write-ahead log (WAL), and the database groups several transactions together to make this a mostly sequential I/O operation. Many DBAs insist on a dedicated disk for their WALs so that the drive head is dedicated to this sequential I/O. There are also often multiple WALs, allowing multiple threads to write to disk concurrently. Most MVCC databases default to the READ COMMITTED isolation level (which actually relaxes ACID semantics somewhat), and at this level, reads and writes do not block each other.
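A minimal sketch of what that buys you in practice, using a hypothetical accounts table: under READ COMMITTED, the reader in session B doesn't wait on session A's uncommitted write, it just sees the last committed version of the row.

    -- Session A: a long-running write, not yet committed
    BEGIN;
    UPDATE accounts SET balance = balance - 100 WHERE id = 1;

    -- Session B: does not block on A's row lock; MVCC serves
    -- the last committed version of the row instead.
    SELECT balance FROM accounts WHERE id = 1;

    -- Session A: make the new version visible to subsequent reads
    COMMIT;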

When these WAL files grow to a certain point, the data in them is sorted and written to the table files as sequentially as possible, in lower-priority threads. Multiple updates to the same data are resolved and collapsed. Parts of the database files that contain deleted data are marked as such in some kind of heap structure that keeps track of free space. Whatever index updates need to be made are done as well, and doing them in bulk at checkpoint time is far more efficient than doing them in-place at write time.

PostgreSQL specifically compresses large data values (ones that would push the row size past its 8K page size) and transparently keeps them in a separate TOAST table, broken into roughly 2,000-byte chunks. This prevents the main data files from getting too sparse when row sizes have a high level of variance.
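You can watch this happen from psql; a rough sketch, with a made-up table and data:

    -- Hypothetical table with a wide text column
    CREATE TABLE docs (id serial PRIMARY KEY, body text);

    -- ~128KB of hard-to-compress text, far too big to stay inline
    INSERT INTO docs (body)
    SELECT string_agg(md5(i::text), '') FROM generate_series(1, 4000) AS i;

    -- The hidden TOAST relation PostgreSQL created for the table
    SELECT reltoastrelid::regclass FROM pg_class WHERE relname = 'docs';

    -- Main-table size vs. total size including the TOAST data
    SELECT pg_relation_size('docs') AS main_bytes,
           pg_total_relation_size('docs') AS total_bytes;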

Cassandra borrows the write-to-the-WAL-first, checkpoint-lazily technique, but takes it a step further by using sorted, immutable table files. Cassandra doesn't have MVCC, so making the table files immutable means there's no need to worry about locking, as the contents never change. The sorted order means that merging table files together at compaction time can be done in roughly O(N) time, like the merge step of a merge sort.

All that work in making I/O as sequential as possible significantly softens the blow of running out of RAM cache for your most active set. Only random reads actually become slow in that case, instead of practically every kind of I/O, as happens with MongoDB in real-world scenarios.

To answer your second question: what I've done to measure how far away PostgreSQL is from the RAM cliff is to set shared_buffers to 2/3 of RAM (or 3/4 if you have >8GB, 4/5 if you have >32GB, etc.) and measure hit rates between the buffers and the OS cache. Hit rates can be measured at the database level and also down to the table level. The idea is that if your shared buffer hit rates begin to drop precipitously, you need more RAM, but you still have the other 1/3rd of the RAM on the box before you're truly doomed. PG works well with the OS cache, so performance should remain steady.
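For reference, that hit rate can be pulled straight out of the statistics views; this is one common form of the database-level query:

    -- Shared-buffers hit rate per database; watch for sustained drops
    SELECT datname,
           blks_hit::float / NULLIF(blks_hit + blks_read, 0) AS hit_rate
    FROM pg_stat_database;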

MongoDB can't give you this indicator because it's dependent on the OS cache entirely. I've spent a day looking for tools that would allow tracking how large the hot pool of Linux disk cache was, with no success.

PostgreSQL also gives you per-table statistics on how many queries are run, and how much and what types of I/O are being performed, so you can catch and pinpoint issues before they turn into real problems.
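Concretely, these live in the pg_stat_* and pg_statio_* views; something along these lines shows the activity and buffer hits per table:

    -- Query activity per table: scan counts and tuple churn
    SELECT relname, seq_scan, idx_scan, n_tup_ins, n_tup_upd, n_tup_del
    FROM pg_stat_user_tables;

    -- I/O per table: block reads served from shared buffers vs. below
    SELECT relname, heap_blks_read, heap_blks_hit, idx_blks_read, idx_blks_hit
    FROM pg_statio_user_tables;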



Thank you for this informative reply; this is the kind of reply I was looking for. There are a few things I should respond to:

> MongoDB stores ALL of the data for a database in a single set of sequentially numbered files.

Correct. But are you under the assumption that this means documents are spread randomly inside the files? MongoDB database files work with extents. Each collection consists of multiple extents. The extents may be scattered across the database files, but each extent is contiguous.

> write-ahead log (WAL), in which the database will group several transactions together to make this a mostly sequential I/O operation

Very informative, thanks. I thought the WAL was only a durability feature. Indeed, all the resources that I've read on WAL so far only mention it in the context of durability. I had to read your post a few times to understand that it can be used for optimizing I/O.

That said, MongoDB relies on the operating system's I/O scheduler to perform the reordering. It just performs the write without fsyncing, allowing the writes to end up in the OS page cache. The OS is then free to reorder the writes however it likes.

How useful is the WAL for reordering I/O compared to what the OS already does? Do you think one is superior to the other? Other than bulk index updates of course.

> Only random reads actually become slow in this case, instead of practically every kind of I/O like it does in MongoDB when used in real-world scenarios.

Actually, after fixing the _id problem, my MongoDB database became insanely fast again. I/O reordering does happen with MongoDB, just not in MongoDB itself.

> MongoDB can't give you this indicator because it's dependent on the OS cache entirely.

Your technique for measuring the working set sounds not too far off from what I do with MongoDB, although I use a few more tools. The 'mongostat' utility shows how many page faults MongoDB gets; this is similar to your buffer miss rates. You can use Linux cgroups to limit MongoDB's page cache size. I also use mincore() and fincore() to see which parts of the database files get paged in.

All in all, I have to change my opinion and admit that PostgreSQL does do many things better than MongoDB. However, I think MongoDB doesn't do as badly as you think it does, and the automatic sharding feature in MongoDB is still something I really need.


You are correct that it's broken into extents, but as the extents become sparse over time without compaction, I/O gets scattered further and further around the disk.

The WAL is as much a durability feature as it is about fast recovery, fast writes, and concurrency. MongoDB uses a journal collection for durability, which is quite a different thing, and actually requires doubled writes. The journal collection is just another MongoDB collection, rather than an optimized log file structure.

The OS I/O scheduler works fairly well, but it's a general-purpose tool and must work well for all types of I/O. The WAL writes themselves are ALWAYS sequential, something in-place updates will never get. The OS also still tries to flush writes within a bounded window, something like 5 seconds with the Linux deadline scheduler. The RDBMS WAL checkpoint process, by contrast, can gather hundreds of megabytes of data over many minutes or hours, sort it, and write it to the table files in a single pass.

In PostgreSQL, you can tune the checkpoint process down to how much data to write before a checkpoint, how long to wait at maximum, and how much I/O capacity to consume while checkpointing.
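For the curious, these are the knobs I mean (names as of the 9.x era; SHOW just reads the current setting):

    SHOW checkpoint_segments;          -- WAL volume that forces a checkpoint
    SHOW checkpoint_timeout;           -- maximum time between checkpoints
    SHOW checkpoint_completion_target; -- fraction of the interval to spread the I/O over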

The cgroups, mincore, and fincore techniques are very interesting; I'm going to look into them.

The key to the PostgreSQL strategy is segmenting the cache into in-process buffers and the OS disk cache. It acts as an early warning system once hit rates start to cross that threshold.

I'd be interested in learning about any strategies that could be used to segment the OS disk cache into multiple pools, because I'd prefer to actually reduce the shared buffer pool: it's somewhat less efficient than just using the OS disk cache (the caches are duplicated). Perhaps some kind of automated process could monitor the hit rate, actively expand the cgroups as needed to meet a hit-rate target, and warn you when a RAM-percentage threshold was exceeded.

There is nothing evil or wrong about MongoDB itself; it's just a new product that needs more polish. Many years ago, when I first started using PostgreSQL, it needed much polish too. It takes years and years to fine-tune these things and find out how the initial assumptions work out in real life.



