First off, the same problem occurs with every database. Oracle, PostgreSQL, MySQL, whatever: they all grind to a halt once the data no longer fits in RAM.
Second, it's not really the whole data set or the whole index that needs to fit in RAM; it's the working set. This is a pretty big difference. I'm currently running a production MongoDB database which is about 20 times larger than RAM, and at first I made the mistake of using random string primary keys (_ids). This ensured good distribution between shards, but it also ensured good distribution within the primary key index, which essentially made the working set of the primary key index equal to the entire index. Once the primary key index no longer fit in RAM, the database ground to a halt. I now use a primary key consisting of a timestamp followed by some random text. The timestamp ensures that inserts only hit the end of the index, thereby keeping the working set small. The database is now almost as fast as when it was fresh.
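A sketch of that key scheme (the field widths and separator here are my own assumptions, not the exact production format): keys generated later always sort after earlier ones, so inserts land at the right-hand edge of the index.

```python
import os
import time

def make_key():
    """Build a primary key: a zero-padded hex timestamp followed by random
    text. Later keys always sort after earlier ones, so inserts only touch
    the end of the index and the index working set stays small."""
    ts = format(int(time.time()), "012x")  # sortable seconds-since-epoch
    suffix = os.urandom(8).hex()           # random tail to avoid collisions
    return ts + "-" + suffix

k1 = make_key()
time.sleep(1.1)
k2 = make_key()
assert k1 < k2  # lexicographic order matches insertion order
```

The zero-padding matters: without a fixed-width timestamp field, string comparison would no longer match chronological order.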
The other major problem is that there is no tooling that can be used to measure how large your working set is in MongoDB, so by the time you figure out you're near the cliff, it's WAY too late.
MongoDB encourages poor decisions in query design that lead to these problems. The "rich query model" is enough rope to hang yourself with in a big-data / high-scale situation. What makes MongoDB popular is that it has much of the convenience of SQL queries, but it also falls into the same set of traps. This is exactly what happened to Foursquare -- MongoDB's marketing materials won't help you escape the realities of data locality, the scale-up hit, poor concurrency design, and rotational disk speed coupled with mediocre caching infrastructure and an unpolished I/O subsystem.
As far as I'm concerned, MongoDB isn't for big data or scale -- yet. Its best use is for small to medium sized data sets which benefit from its schema-free design.
What tools do you use to measure the working set in PostgreSQL and other databases?
Usually each table is kept in its own file (or set of files, if needed), which helps sequential I/O for a single table stay sequential, and also provides a point of logical partitioning for database operations.
In contrast, MongoDB stores ALL of the data for a database in a single set of sequentially numbered files. This works great if you have a single collection that can be ordered monotonically and queried in a roughly time-ordered way, and you never delete anything, but it's pretty bad otherwise. MongoDB was originally built for analytics, and it shows.
In a modern RDBMS, everything gets written first to a write-ahead log (WAL); the database groups several transactions together to make this a mostly sequential I/O operation. Many DBAs insist on a dedicated disk for their WALs so that the drive head is dedicated to this sequential I/O. There are also often multiple WALs, allowing multiple threads to write to disk concurrently. Most MVCC databases default to READ COMMITTED isolation (which relaxes ACID semantics somewhat), and at this level, reads and writes do not block each other.
When the WAL files grow to a certain point, the data in them is ordered and written to the table files as sequentially as possible in lower-priority threads. Multiple updates to the same data are resolved and collapsed. Parts of the database files that contain deleted data are marked as such in some kind of heap structure that keeps track of this free space. Whatever index updates need to be made are done as well; doing them in bulk at checkpoint time is far preferable to doing them in-place at write time.
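A toy illustration of that write path (purely schematic, not any database's actual on-disk format): transactions append sequentially to a log, and at checkpoint time the accumulated updates are de-duplicated, sorted, and applied to the "table file" in one ordered pass.

```python
class ToyWAL:
    """Sketch of WAL-then-checkpoint: appends are cheap sequential I/O, and
    the checkpoint collapses repeated updates to the same page into one write."""
    def __init__(self):
        self.log = []    # sequential append-only log: (page_no, value)
        self.table = {}  # simulated table file: page_no -> value

    def write(self, page_no, value):
        self.log.append((page_no, value))   # sequential append, no seek

    def checkpoint(self):
        latest = {}
        for page_no, value in self.log:     # later updates win
            latest[page_no] = value
        for page_no in sorted(latest):      # one ordered pass over the file
            self.table[page_no] = latest[page_no]
        written = len(latest)
        self.log.clear()
        return written

wal = ToyWAL()
wal.write(7, "a"); wal.write(3, "b"); wal.write(7, "c")
assert wal.checkpoint() == 2          # three log records, only two page writes
assert wal.table == {3: "b", 7: "c"}  # the second update to page 7 won
```

The real win is in the `sorted(latest)` pass: the random writes the application issued become one sweep across the file.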
PostgreSQL specifically compresses large data values (ones that would push the row size beyond the 8K page size) and transparently keeps them in a separate TOAST table, broken into roughly 2,000-byte chunks. This prevents the main data files from getting too sparse when row sizes have a high level of variance.
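The chunking step is simple to picture; here's a rough sketch (chunk size taken from the description above, and the compression step that real TOAST performs first is omitted):

```python
TOAST_CHUNK_SIZE = 2000  # approximate bytes per chunk, per the description above

def toast_chunks(value: bytes):
    """Split an oversized value into fixed-size chunks, the way a TOAST-style
    side table stores them, each keyed by a sequence number."""
    return [(seq, value[i:i + TOAST_CHUNK_SIZE])
            for seq, i in enumerate(range(0, len(value), TOAST_CHUNK_SIZE))]

chunks = toast_chunks(b"x" * 5000)
assert [len(c) for _, c in chunks] == [2000, 2000, 1000]
assert b"".join(c for _, c in chunks) == b"x" * 5000  # lossless reassembly
```

The main row then stores only a small pointer to the chunk sequence, which is what keeps the heap pages dense.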
Cassandra borrows the WAL-writes-first-then-lazily-checkpoint technique, but takes it a step further by using sorted, immutable table files. Cassandra doesn't have MVCC, so making the table files immutable means there's no need to worry about locking, as the contents never change. The sorted order means that merging table files together at compaction time can be done in roughly O(N) time, like the merge step of a merge sort.
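That compaction is essentially a linear merge of sorted runs. A minimal sketch (here the newer file simply wins on key conflicts, standing in for Cassandra's timestamp-based reconciliation):

```python
import heapq

def compact(older, newer):
    """Merge two sorted, immutable runs of (key, value) pairs in one linear
    pass. On duplicate keys, the entry from the newer run wins."""
    merged = {}
    # heapq.merge walks both sorted inputs in a single pass; the middle tag
    # orders older-run entries before newer-run entries for equal keys,
    # so the newer value overwrites the older one below.
    for key, gen, value in heapq.merge(
            ((k, 0, v) for k, v in older),
            ((k, 1, v) for k, v in newer)):
        if key not in merged or gen == 1:
            merged[key] = value
    return sorted(merged.items())

old_run = [("a", 1), ("c", 3)]
new_run = [("b", 2), ("c", 30)]
assert compact(old_run, new_run) == [("a", 1), ("b", 2), ("c", 30)]
```

Because both inputs are already sorted, the output is produced sorted too, ready to be compacted again later.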
All that work in making I/O as sequential as possible significantly softens the blow of running out of RAM cache for your most active set. Only random reads actually become slow in this case, instead of practically every kind of I/O, as happens with MongoDB in real-world scenarios.
To answer your second question, as far as I know, what I've done to measure how far away PostgreSQL is from the RAM cliff is to set shared_buffers to 2/3 of RAM (or 3/4 if you're >8GB, 4/5 if you're >32GB, etc.) and measure hit rates between the buffers and the OS cache. Hit rates can be measured at the database level and also down to the table level. The idea is that if your shared-buffer hit rates begin to drop precipitously, you need more RAM -- but you also still have the other 1/3 of the RAM on the box before you're truly doomed. PG works well with the OS cache, so performance should remain steady.
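For reference, the database-level counters come from `pg_stat_database` (`blks_hit` = reads served from shared_buffers, `blks_read` = reads that went to the OS), and the per-table equivalents live in `pg_statio_user_tables`. The arithmetic itself is just:

```python
def buffer_hit_rate(blks_hit: int, blks_read: int) -> float:
    """Shared-buffer hit rate from PostgreSQL's pg_stat_database counters.
    A precipitous drop in this ratio is the early warning described above."""
    total = blks_hit + blks_read
    return blks_hit / total if total else 1.0  # no traffic: treat as 100%

# e.g. feed it the result of:
#   SELECT blks_hit, blks_read FROM pg_stat_database
#   WHERE datname = current_database();
assert buffer_hit_rate(9_900, 100) == 0.99
assert buffer_hit_rate(0, 0) == 1.0
```

Note the counters are cumulative since the last stats reset, so for trend-watching you'd sample them periodically and diff.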
MongoDB can't give you this indicator because it's dependent on the OS cache entirely. I've spent a day looking for tools that would allow tracking how large the hot pool of Linux disk cache was, with no success.
PostgreSQL also gives you per-table statistics on how many queries are run, and how much and what types of I/O are being performed, so problems can be caught and pinpointed before they become problems.
> MongoDB stores ALL of the data for a database in a single set of sequentially numbered files.
Correct. But are you under the assumption that this means documents are spread randomly inside the files? MongoDB database files work with extents. Each collection consists of multiple extents. The extents may be scattered across the database files, but each extent is contiguous.
> write-ahead log (WAL), in which the database will group several transactions together to make this a mostly sequential I/O operation
Very informative, thanks. I thought the WAL was only a durability feature. Indeed, all the resources that I've read on WAL so far only mention it in the context of durability. I had to read your post a few times to understand that it can be used for optimizing I/O.
That said, MongoDB relies on the operating system's I/O scheduler to perform the reordering. It just performs the write without fsyncing, allowing the writes to end up in the OS page cache. The OS is then free to reorder the writes however it likes.
How useful is the WAL for reordering I/O compared to what the OS already does? Do you think one is superior to the other? Other than bulk index updates of course.
> Only random reads actually become slow in this case, instead of practically every kind of I/O like it does in MongoDB when used in real-world scenarios.
Actually, after fixing the _id problem my MongoDB database became insanely fast again. I/O reordering does happen with MongoDB, just not in MongoDB itself.
> MongoDB can't give you this indicator because it's dependent on the OS cache entirely.
Your technique for measuring the working set sounds not too far off from what I do with MongoDB, although I use a few more tools. The 'mongostat' utility shows how many page faults MongoDB gets; this is similar to your buffer miss rates. You can use Linux cgroups to limit MongoDB's page cache size. I also use mincore() and fincore() to see which parts of the database files get paged in.
All in all, I have to change my opinion and admit that PostgreSQL does do many things better than MongoDB. However I think MongoDB doesn't do as bad as you think it does, and the automatic sharding feature in MongoDB is still something I really need.
The WAL is as much a durability feature as it is a means to fast recovery, fast writes, and concurrency. MongoDB uses a journal collection for durability, which is quite a different thing, and actually requires doubled writes. The journal collection is just another MongoDB collection, instead of an optimized log-file structure.
The OS I/O scheduler works fairly well, but it's a general-purpose tool and must work reasonably for all types of I/O. The WAL writes themselves are ALWAYS sequential, something in-place updates will never get. The OS still tries to flush writes within something like 5 seconds (the Linux deadline scheduler). The RDBMS WAL checkpoint process, by contrast, can gather hundreds of megabytes of data over many minutes or hours, sort it, and write it to the table files in a single pass.
In PostgreSQL, you can tune the checkpoint process down to how much data to write before a checkpoint, how long to wait at maximum, and how much I/O capacity to consume while checkpointing.
The cgroups, mincore, and fincore techniques are very interesting, going to look into them.
The key to the PostgreSQL strategy is segmenting the cache into in-process buffers and the OS disk cache. It acts as an early warning system once the hit rates start to cross that threshold.
I'd be interested in learning about any strategies that could be used to segment the OS disk cache into multiple pools, because I'd prefer to actually reduce the shared buffer pool as it's somewhat less efficient than just using the OS disk cache (duplicated caches). Perhaps some kind of automated process could monitor the hit rate and actively expand the cgroups as needed to try to meet a hit rate target, and give you some warning when a RAM % threshold was exceeded.
There is nothing evil or wrong about MongoDB itself, it's just a new product that needs more polish. Many years ago when I first started using PostgreSQL, it needed much polish. It takes years and years to fine tune these types of things and find out how the initial assumptions made work out in real life.
Ever seen an Oracle database encounter too much contention on a lock? When the lock is operating at 98% capacity there is no trace of a problem. At 101% of capacity you fall over. And there are no tools to catch this.
You're absolutely right about the working set being the important bit to keep in memory (and all the indexes you use). I clarified that in person thanks to a great question from the audience.
My entire point wrt the slowing-down-outside-of-RAM issue is that it happens with any database: any database can be fast if it's in RAM and non-durable, yet Mongo trumpets performance under these unsafe scenarios while other databases use safe defaults.
To make matters worse, MongoDB data can suffer from poor data locality over time. Since caching is left entirely up to the kernel, one hot document, even if it's only 200 bytes, will keep an entire page in memory. The other bytes in that page may be cold documents or null data.
This is why flip/flopping is useful in current versions and in-place compaction so important for future versions. For prior research see defrag.exe.
I wish you luck in your attempts to tune performance by tweaking _id. I could only take doing hacks like that for so long before I looked for greener pastures.
"My stuff wouldn't work on PostgreSQL at all, I have so much data that I must use sharding."
We had to implement sharding as well and chose to do it manually in PostgreSQL. Luckily our schema made that relatively easy and natural. YMMV
It's set up for multi-box (each schema is mapped to a hostname in code), but I simply haven't had a reason to move to more boxes yet. The schema feature is a nice, convenient way to pre-shard so that growing to more boxes doesn't require rehashing for a very long time, if ever (depending on how much sharding you do up front). You just move schemas/shards as needed using the standard dump and restore tools and update the schema->hostname mapping in the code.
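The routing layer for that kind of schema-based pre-sharding can be tiny. A sketch with made-up names (16 pre-created schemas, a stable hash, and a hypothetical schema-to-hostname map, none of which are from the actual setup described):

```python
import zlib

N_SHARDS = 16  # number of pre-created schemas; pick generously up front

# Hypothetical mapping maintained in application code; a shard moves to a
# new box via dump/restore plus a one-line change here.
SHARD_HOSTS = {"shard_{:02d}".format(i): ("db1" if i < 8 else "db2")
               for i in range(N_SHARDS)}

def shard_for(user_id: int) -> str:
    """Stable hash of the sharding key -> schema name. zlib.crc32 is used so
    the mapping survives process restarts (unlike Python's builtin hash())."""
    return "shard_{:02d}".format(zlib.crc32(str(user_id).encode()) % N_SHARDS)

schema = shard_for(12345)
assert schema in SHARD_HOSTS        # every shard has a home host
assert shard_for(12345) == schema   # deterministic routing
```

Since the modulus is over the fixed schema count rather than the host count, adding boxes means moving whole schemas, never rehashing rows.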
Thankfully most of the joins happen within a shard (hashing and sharding on something like a user_id) with the exception being various analysis and aggregation queries.
Using PostgreSQL's schemas is admittedly not too different from just using many DBs in MySQL or something else but in practice I've found that extra layer of organization helps keep things neater. I can backup, move, or delete a specific schema/shard or I can backup, move, etc all shards on a machine by operating on the containing database.
"A BSON ObjectID is a 12-byte value consisting of a 4-byte timestamp (seconds since epoch), a 3-byte machine id, a 2-byte process id, and a 3-byte counter. Note that the timestamp and counter fields must be stored big endian unlike the rest of BSON. This is because they are compared byte-by-byte and we want to ensure a mostly increasing order."
By sticking with the default _id, with my workload my _id index doesn't have to fit into memory. I can't actually use the default _id for various reasons but that's a whole different discussion.
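The layout quoted above can be packed in a few lines; a sketch of the idea (the machine id and counter handling here are simplified stand-ins, not MongoDB's actual driver code):

```python
import os
import struct
import time

def objectid_like(counter, machine_id=None, pid=None, ts=None):
    """Pack a 12-byte BSON-ObjectID-style value: 4-byte big-endian timestamp,
    3-byte machine id, 2-byte pid, 3-byte big-endian counter. Big-endian
    timestamp and counter mean plain byte-wise comparison yields a mostly
    increasing order, which is what keeps the _id index's working set small."""
    ts = int(time.time()) if ts is None else ts
    machine_id = os.urandom(3) if machine_id is None else machine_id
    pid = os.getpid() & 0xFFFF if pid is None else pid
    return (struct.pack(">I", ts) + machine_id +
            struct.pack(">H", pid) +
            (counter & 0xFFFFFF).to_bytes(3, "big"))

m = b"\x01\x02\x03"
a = objectid_like(1, machine_id=m, pid=7, ts=100)
b = objectid_like(2, machine_id=m, pid=7, ts=101)
assert len(a) == 12
assert a < b  # later timestamp sorts after, byte for byte
```

This is essentially the same trick as the timestamp-prefixed custom _id: put the time-ordered field first, in big-endian, so inserts cluster at one edge of the index.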
I think a reread of the OP suggests that only a subset of the data and indexes are in RAM. But my question still stands, slightly rephrased: are you saying the best practice is to shard your database and place the shards in RAM on as many servers as needed?
The best practice totally depends on your workload and what kinds of trade-offs you're willing to make. I believe this is exactly why there don't seem to be any good "best practice" documents for databases: it's different for everybody. In my case, there are a lot of writes and only occasional reads. My database is insert-mostly; once inserted, the data rarely changes. All data is timestamped, and the reads usually pertain to the most recently inserted data. So it makes sense to optimize the database in such a way that the most recently inserted data is in my working set.
This does mean that querying old data is slower than querying recent data. It could be made faster by ensuring that all data fits in RAM, e.g. by adding more shards, but given the huge amount of data I have, placing all shards in RAM is way too expensive, and I'm not confident my users are willing to pay 20 times the fees just to have faster queries for old data. The alternative is to store things on SSDs, which should also make things faster, but that would still mean a 3-4x fee increase for users.