Hacker News new | comments | show | ask | jobs | submit login

The presentation claims that MongoDB performance drops off a cliff once the data or the indices no longer fit in RAM. This is not entirely complete/correct.

First off, the same problem occurs no matter which database. Oracle, PostgreSQL, MySQL, whatever, they all grind to a halt when stuff no longer fits in RAM.

Second, it's not really the whole data set or the whole index that needs to fit in RAM, it's the working set. This is a pretty big difference. I'm currently running a production MongoDB database which is about 20 times larger than RAM and at first I made the mistake of using random string primary keys (_ids). This ensured good distribution between shards but also ensured good distribution within the primary key index. This essentially made the working set of the primary key index equal to the entire index. Once the primary key index no longer fits in RAM, the database grinded to a halt. I now use a primary key consisting of a timestamp followed by some random text. The timestamp ensures that inserts only hit the end of the index, thereby keeping the working set small. The database is now almost as fast as when it was fresh.

This harsh cliff does NOT occur on every database. The old guard RDBMSs have extremely sophisticated and well-tuned I/O code that has been working around slow disks for decades. While you can't escape the physics of rotational drives, MongoDB does probably the worst job possible of dealing with their poor performance. Yeah, great, it's fast as long as data is in memory, welcome to the world of b-trees. With a high disk:RAM ratio for a data set (like >30:1), it's slower than PostgreSQL. Without compaction, it's performance degrades to become many of orders of magnitude slower than PostgreSQL.

The other major problem is that there is no tooling that can be used measure how large your working set is in MongoDB, so by the time you figure out you're near the cliff, it's WAY too late.

MongoDB encourages poor decisions in query design that lead to these problems. The "rich query model" is enough rope to hang yourself with in a big data / high scale situation. What makes MongoDB popular is that it's got much of the convenience of SQL queries, but the problem is that it also falls into the same set of traps. This is exactly what happened to Foursquare -- MongoDB's marketing materials won't help escape the realities of data locality, the scale-up hit, poor concurrency design, and rotational disk speed coupled with mediocre caching infrastructure and an unpolished I/O subsystem.

As far as I'm concerned, MongoDB isn't for big data or scale -- yet. It's best use is for small to medium sized data sets which benefit from it's schema-free design.

While it's true that some other DBs have better code for working around slow disks, does it really help all that much? In my case the random primary key was causing every single operation to result in a seek to a random location on disk. I really doubt other DBs could have helped me no matter how good their code is. At best they postpone the core problem.

What tools do you use to measure the working set in PostgreSQL and other databases?

A random operation results in random I/O, another unavoidable problem. What I speak of is the way in which modern, MVCC-based databases deal with concurrency and optimally pattern writes in a way that makes reads lock-free. In addition, they don't sacrifice any level of durability or ability to quickly recover a corrupted database to get this. It's just a more advanced design.

Usually each table is kept in it's own file (or set of files, if needed), which helps sequential I/O for a single table stay sequential, and also provides a point of logical partitioning for database operations.

In contrast, MongoDB stores ALL of the data for a database in a single set of sequentially numbered files. This works great if you have a single collection that can be ordered monotonically and queried in a roughly time-ordered way, and you never delete anything, but it's pretty bad otherwise. MongoDB was originally built for analytics, and it shows.

In this modern RDBMS, everything gets written first to a write-ahead log (WAL), in which the database will group several transactions together to make this a mostly sequential I/O operation. Many DBAs insist on a dedicated disk for their WALs so that the drive head is dedicated to this sequential I/O. There are also often multiple WALs, allowing multiple threads to write to disk concurrently. Most MVCC databases default to a READ-COMMITTED isolation level (this is actually relaxing some ACID semantics somewhat), and in this level, performing reads and writes do not block each other.

When these WAL files grow to a certain point, the data in them is ordered to be written to the table files as sequentially as possible in lower priority threads. Multiple updates are resolved and collapsed. Parts of the database files that contain deleted data are marked as such in some kind of heap structure that keeps track of this free space. Whatever index updates need to be made are done as well, and it's much more optimal to do these in bulk at checkpoint time than it is to try to do them in-place. Doing bulk index updates is FAR more preferable than doing them in-place at write-time, an indisputable fact.

PostgreSQL in specific compresses large data values (ones that would push the row size beyond the 8K page size) and transparently keeps them in a separate TOAST table. The values are broken into 2,000 byte chunks. This prevents the main data files from getting too sparse when row sizes have a high level of variance.

Cassandra borrows WAL-writes-first-then-lazily-checkpoint technique, but takes it a step further by using sorted, immutable table files. Cassandra doesn't have MVCC, so making the table files immutable means that there's no need to worry about locking as the contents never change. The sorted order means that merging table files together at compaction time can be done in roughly O(N*2) time.

All that work in making I/O as sequential as possible significantly softens the blow of running out of RAM cache for your most active set. Only random reads actually become slow in this case, instead of practically every kind of I/O like it does in MongoDB when used in real-world scenarios.

To answer your second question, as far as I know, what I've done to to measure how far away PostgreSQL is from the RAM cliff is to set the shared_buffers to 2/3 RAM (or 3/4 RAM if you're >8GB, 4/5 RAM if you're >32GB, etc), and measure hit rates between the buffers and OS cache. Hit rates can be measured at the database level and also down to the table level. The idea here is that if your shared buffer hit rates begin to drop precipitously, you need more RAM, but also gives you the another 1/3rd of the RAM on the box before you're truly doomed. PG works well with the OS cache, so performance should remain steady.

MongoDB can't give you this indicator because it's dependent on the OS cache entirely. I've spent a day looking for tools that would allow tracking how large the hot pool of Linux disk cache was, with no success.

PostgreSQL also gives you per-table statistics on how many queries are ran, and then how much and what types of I/O are being performed, so catching problems and pinpointing them can be done before they become problems.

Thank you for this informative rely, this kind of rely is what I'm looking for. There are a few things I should reply on:

> MongoDB stores ALL of the data for a database in a single set of sequentially numbered files.

Correct. But are you under the assumption that this means documents are spread randomly inside the files? MongoDB database files work with extents. Each collection consists of multiple extent. The extents may be scattered across the database files but each extent is contiguous.

> write-ahead log (WAL), in which the database will group several transactions together to make this a mostly sequential I/O operation

Very informative, thanks. I thought the WAL was only a durability feature. Indeed, all the resources that I've read on WAL so far only mention it in the context of durability. I had to read your post a few times to understand that it can be used for optimizing I/O.

That said, MongoDB relies on the operating system's I/O scheduler to perform the reordering. It just performs the write without fsyncing, allowing the writes to end up in the OS page cache. The OS is then free to reorder the writes however it likes.

How useful is the WAL for reordering I/O compared to what the OS already does? Do you think one is superior to the other? Other than bulk index updates of course.

> Only random reads actually become slow in this case, instead of practically every kind of I/O like it does in MongoDB when used in real-world scenarios.

Actually after fixing the _id problem my MongoDB database became insanely fast again. I/O reordering does happen in with MongoDB, just not in MongoDB itself.

> MongoDB can't give you this indicator because it's dependent on the OS cache entirely.

Your technique for measuring the working set sounds not too far off from what I do with MongoDB although I use a few more tools. The 'mongostat' utility shows how many page faults MongoDB gets; this is similar to your buffer miss rates. You can use Linux cgroups to limit MongoDB's page cache size. I also use mincore() and fincore() to see which places in the database files get paged in.

All in all, I have to change my opinion and admit that PostgreSQL does do many things better than MongoDB. However I think MongoDB doesn't do as bad as you think it does, and the automatic sharding feature in MongoDB is still something I really need.

You are correct that it's broken into extents, but as the extents become sparse over time without compaction, I/O gets scattered further and further around the disk.

The WAL is as much a durability features as it is about fast recovery, fast writes, and concurrency. MongoDB uses a journal collection for durability, which is quite a different thing, and actually requires doubled writes. The journal collection is just another MongoDB collection, instead of being an optimized log file structure.

The OS I/O scheduler works fairly well, but it's a general purpose tool and must work well for all types of I/O. The WAL writes themselves are ALWAYS sequential, something in-place updates will never get. The OS still tries to flush writes within something like up to 5 seconds (the Linux deadline scheduler). The RDBMS WAL checkpoint process can gather hundreds of megabytes of data over many minutes or hours, sort it, and write to the table files in a single pass.

In PostgreSQL, you can tune the checkpoint process down to how much data to write before a checkpoint, how long to wait at maximum, and how much I/O capacity to consume while checkpointing.

The cgroups, mincore, and fincore techniques are very interesting, going to look into them.

The key to the PostgreSQL strategy is that segmenting the cache into in-process buffers and the OS disk cache. It's an early warning system once the hits start to cross that threshold.

I'd be interested in learning about any strategies that could be used to segment the OS disk cache into multiple pools, because I'd prefer to actually reduce the shared buffer pool as it's somewhat less efficient than just using the OS disk cache (duplicated caches). Perhaps some kind of automated process could monitor the hit rate and actively expand the cgroups as needed to try to meet a hit rate target, and give you some warning when a RAM % threshold was exceeded.

There is nothing evil or wrong about MongoDB itself, it's just a new product that needs more polish. Many years ago when I first started using PostgreSQL, it needed much polish. It takes years and years to fine tune these types of things and find out how the initial assumptions made work out in real life.

What's with all the downvotes? If you disagree then post a reply. Hitting that downvote button sure is convenient isn't it?

In my experience the old guard RDBMSs have a lot of performance cliffs that they love to fall off of with no warning when you try to scale them.

Ever seen an Oracle database encounter too much contention on a lock? When the lock is operating at 98% capacity there is no trace of a problem. At 101% of capacity you fall over. And there are no tools to catch this.

(speaker here)

You're absolutely right about the working set being the important bit to keep in memory (and all the indexes you use). I clarified that in person thanks to a great question from the audience.

My entire point wrt it slowing-down-outside-of-RAM is that it happens with any database. My point was that any database can be fast if it's in RAM and non-durable; yet Mongo trumpets performance under these unsafe scenarios while other databases use safe defaults.

To make matters worse, MongoDB data can suffer from poor data locality over time. Since caching is completely left up to the kernel, 1 hot document, even if it's only 200 bytes, will cause an entire page to remain in memory. The other bytes in the page may be cold documents or null data.

This is why flip/flopping is useful in current versions and in-place compaction so important for future versions. For prior research see defrag.exe.

I wish you luck in your attempts to tune performance by tweaking _id. I could only take doing hacks like that for so long before I looked for greener pastures.

Why do you consider tweaking _id a hack, and what is the greener pasture for you? If you use random string primary keys in any other database you'd run into the same problem. My app is much more write-intensive than read-intensive. My stuff wouldn't work on PostgreSQL at all, I have so much data that I must use sharding.

You're basically implementing an optimized version of ObjectId for your use case and having to take a couple stabs at it as well. No big deal, but there are just lots of little tricks like that to learn with MongoDB.

"My stuff wouldn't work on PostgreSQL at all, I have so much data that I must use sharding."

We had to implement sharding as well and chose to do it manually in PostgreSQL. Luckily our schema made that relatively easy and natural. YMMV

As mathias_10gen points out below this is the default behavior of ObjectId. You just have to be carful if you decide to override the _id field.

It isn't just the _id field, it's any indexed field.

I have a 1.5 TB Postgres database, sharded by schema, that runs wonderfully on a single box (12 core, 36GB RAM, raid 10 of 15k SAS drives). Why couldn't you shard with Postgres?

Am I reading that correctly that you're sharding a database on the same box?

Good point, sounds like I intend to keep it that way for this particular database in my comment.

It's setup for multi box (each schema is mapped to a hostname in code) but I simply haven't had a reason to move to more boxes yet. The schema feature is a nice, convenient way to pre-shard like this so that growing to more boxes doesn't require rehashing for a very long time if ever (depending on how much sharding you do up front). You just move schemas/shards as needed using the standard dump and restore tools and update the schema->hostname mapping in the code.

When sharding, do you do all joins in code, or just the ones that span several shards?

If I need to go cross-shard then I am doing it in code. If you knew both shards were on the same box you could do cross-shard joins if you used schemas like this but you would need some potentially tricky logic that determines if it is working with all shards on the same machine.

Thankfully most of the joins happen within a shard (hashing and sharding on something like a user_id) with the exception being various analysis and aggregation queries.

Using PostgreSQL's schemas is admittedly not too different from just using many DBs in MySQL or something else but in practice I've found that extra layer of organization helps keep things neater. I can backup, move, or delete a specific schema/shard or I can backup, move, etc all shards on a machine by operating on the containing database.

I would have to do that manually. Unless you know of an automatic solution that doesn't involve paying tons of money on commercial licenses?

FYI - Your _id trick is similar to the ObjectID type mongodb uses by default.

"A BSON ObjectID is a 12-byte value consisting of a 4-byte timestamp (seconds since epoch), a 3-byte machine id, a 2-byte process id, and a 3-byte counter. Note that the timestamp and counter fields must be stored big endian unlike the rest of BSON. This is because they are compared byte-by-byte and we want to ensure a mostly increasing order."


My _id is different from ObjectID. It does begin with a timestamp but one that has 2010 as epoch. It's also followed by a kind of user ID and a piece of random identifier that appears elsewhere in the document.

Interesting, so the _id trick mentioned by FooBarWidget is not the real reason for the speedup?

It is. I was using a totally random string key, not the default ObjectID.

So if you stick with the default _id, the claim of "Your indexes must fit in memory" is no longer valid?

That totally depends on your workload. In my case my working set happens to be mostly equal to the most recently inserted data. If you have to regularly access lots of random documents with no locality whatsoever then your working set is very large and should fit in memory.

By sticking with the default _id, with my workload my _id index doesn't have to fit into memory. I can't actually use the default _id for various reasons but that's a whole different discussion.

Useful information, thanks!

Forgive the dumb question but what do you mean by a working set? Is this the part of index that covers the shard that server is on? Or is your domain such that you only care about most recent records - for my part it's rare that a table has records I will not want to see and so can be in the non ram part of index?

I think a reread of the op suggests that only a subset of data and indexes are in ram - but my question still stands but slightly differently phrased - are you saying best practises is shard your dbase and place shards in ram in as many servers as needed?

http://en.wikipedia.org/wiki/Working_set Basically it means the portion of the total data set that's frequently accessed given a short time period.

The best practice totally depends on your work load and what kind of trade offs you're willing to make. I believe this is exactly the reason why there doesn't seem to be any good "best practice" documents for databases: it's different for everybody. In my case, there are a lot of writes and only occasional reads. My database is insert-mostly; once inserted, the data rarely changes. All data is timestamped. The reads usually pertain the most recently inserted data. In my case it makes sense to optimize the database in such a way that the most recently inserted data is in my working set.

This does means that querying old data is slower than querying recent data. This can be made faster by ensuring that all data fits in RAM, e.g. by adding more shards. But given the huge amount of data I have, placing all shards in RAM is way too expensive, and I'm not confident that my users are willing to pay 20 times the fees just to have faster queries for old data. The alternative is to store things on SSDs which should also make things faster, but this would still result in a 3-4 times fee increase for users.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact