
I wish the 10gen folks would do some work on various issues that have been in MongoDB for a long time. For me the most serious one is data size. Storage is pretty terrible, especially compared to CPU these days. I would far prefer the use of more CPU in order to reduce storage.

Records in a collection mostly use the same set of keys. Every record includes the full key, which makes the storage that much larger. Strings can be interned and replaced with a far smaller numerical token to reduce storage. (I'd be happy for it to be attempted for string values as well.)
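To make the key interning idea concrete, here is a minimal sketch in Python. It is not how MongoDB stores anything; `KeyInterner` is a hypothetical helper, the sample records are made up, and JSON length is used only as a rough stand-in for on-disk size. It handles top-level keys only.

```python
import json


class KeyInterner:
    """Hypothetical helper: maps field names to small integer tokens and back."""

    def __init__(self):
        self._token_by_key = {}   # field name -> token
        self._key_by_token = []   # token -> field name

    def encode(self, record):
        """Return a copy of record with each top-level key replaced by its token."""
        out = {}
        for key, value in record.items():
            token = self._token_by_key.get(key)
            if token is None:
                # First time this key is seen: assign the next free token.
                token = len(self._key_by_token)
                self._token_by_key[key] = token
                self._key_by_token.append(key)
            out[token] = value
        return out

    def decode(self, encoded):
        """Restore the original field names."""
        return {self._key_by_token[token]: value for token, value in encoded.items()}


# Made-up event records that all share the same keys.
records = [
    {"timestamp": 1360000000 + i, "event_type": "click", "user_id": i}
    for i in range(1000)
]

interner = KeyInterner()
encoded = [interner.encode(r) for r in records]

# Rough size comparison using JSON text length (an assumption, not BSON size).
raw_size = sum(len(json.dumps(r)) for r in records)
encoded_size = sum(len(json.dumps(e)) for e in encoded)
print(raw_size, encoded_size)
```

The token table has to be stored somewhere alongside the collection, which is cheap when the set of keys is small and stable.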

The other is compression, which is a straightforward CPU versus storage tradeoff.



Agreed. Of course, I also chimed in early on #164.

Compression (well, lack of) is probably the thing that'll cause me to migrate off MongoDB one of these days.

The lack of compression has a huge effect on working set size, especially as the BSON representation doesn't really save space, just CPU.

Mongo's behaviour when the working set exceeds memory size used to be pretty terrible. (I moved on from the relevant project so I don't know how much of an issue it still is.) https://jira.mongodb.org/browse/SERVER-574 - essentially what happens is that Mongo keeps trying to execute queries rather than throttling to let existing ones finish, which causes even more thrashing and longer query times.

Reducing the memory consumption would curtail the onset of that.

This is a daily issue for me - having to resync replicas that run out of disk space because Mongo has no compacting strategy.

Resyncing will drop disk usage from 200GB to 50GB(!). So much wasted space.

If you're resyncing that frequently, you should probably be using TTL indexes, the usePowerOf2Sizes collection flag directly, or potentially capped collections.


The data in question is date referenced events - they can't be deleted because I need the data for analytics. It grows forever.

Why on earth is a resync necessary then?

Because the machines run out of disk space. A resync effectively compacts the data.

I'll just leave this here... http://www.tokutek.com/2013/02/mongodb-fractal-tree-indexes-...

Come talk to us if you're interested. We are stabilizing things and hope to open up evaluations more in a week or two.

I've seen your posting before, but don't see the relevance. Our indices are a trivial portion of our data size, and even if they were zero bytes in size it wouldn't make an appreciable difference to the data size.

(On phone, can't edit the other reply)

Though you are right, the architecture we tried earlier, when we were doing a proof of concept, would not have helped with compression. Back then we were trying to play nice with mongodb's data format and keep its legacy indexes around. It turned out that for many reasons (concurrency, recovery, and compression, and a few others), it was a lot better to replace all of the storage code, so that's what some of us have been doing the past couple months.

What we are doing now is replacing all of mongodb's storage. So all of the data and all the indexes are in fractal trees. This means we can compress everything (including the field names!) and keep everything nicely defragmented.

I would also love to see that happen... MongoDB is implementing really nice new features but I get the feeling that they are leaving behind some more important ones like what you just mentioned.

Until they do, you could always write a thin wrapper over the client driver to gzip/gunzip non-indexed fields. Something like Google's Snappy would be well suited to that.
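A rough sketch of what such a wrapper could look like in Python. It uses zlib from the standard library rather than Snappy, and it only shows the document transformation, not the driver calls. `INDEXED_FIELDS`, `BLOB_FIELD`, `pack` and `unpack` are hypothetical names, and the sketch assumes the non-indexed values are JSON-serializable.

```python
import json
import zlib

# Hypothetical configuration: fields that must stay queryable and indexable.
INDEXED_FIELDS = {"_id", "timestamp"}
# Hypothetical name of the field that holds the compressed remainder.
BLOB_FIELD = "_z"


def pack(doc, indexed_fields=INDEXED_FIELDS):
    """Keep indexed fields as-is; compress everything else into one blob."""
    stored = {k: v for k, v in doc.items() if k in indexed_fields}
    rest = {k: v for k, v in doc.items() if k not in indexed_fields}
    if rest:
        payload = json.dumps(rest, separators=(",", ":")).encode("utf-8")
        stored[BLOB_FIELD] = zlib.compress(payload)
    return stored


def unpack(stored):
    """Reverse pack(): decompress the blob and merge its fields back in."""
    doc = {k: v for k, v in stored.items() if k != BLOB_FIELD}
    blob = stored.get(BLOB_FIELD)
    if blob is not None:
        doc.update(json.loads(zlib.decompress(blob).decode("utf-8")))
    return doc


# Made-up document with one large, redundant field.
doc = {
    "_id": 1,
    "timestamp": 1360000000,
    "body": "lorem ipsum " * 200,
    "tags": ["a", "b"],
}
stored = pack(doc)
print(sorted(stored), len(stored[BLOB_FIELD]), len(doc["body"]))
```

You would call `pack` before every insert and `unpack` after every read, and the blob would go into the database as binary data.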

That sounds great if you've got large fields with lots of redundancy. In fact, we do this.

But small fields won't compress so well on their own. Often there's a lot of redundancy across records for the same fields, which is great for compression. This might be a great way to achieve the benefits of field name tokenization too (which is similar to part of how most compressors work). I'd like to see block compression rather than field compression.
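The difference is easy to see with a quick experiment. This sketch uses made-up event records, JSON as a stand-in for BSON, and zlib as the compressor (all assumptions). It compares compressing each record separately with compressing the whole batch as one block.

```python
import json
import zlib

# Made-up small records that share field names and some values.
records = [
    {"timestamp": 1360000000 + i, "event_type": "click", "user_id": i % 50}
    for i in range(1000)
]
serialized = [json.dumps(r).encode("utf-8") for r in records]

# Uncompressed baseline.
raw_size = sum(len(s) for s in serialized)

# Per-record compression: each record is compressed independently, so the
# compressor never sees the redundancy between records.
per_record_size = sum(len(zlib.compress(s)) for s in serialized)

# Block compression: all records go through the compressor together, so the
# repeated field names and values are encoded once and referenced afterwards.
block_size = len(zlib.compress(b"\n".join(serialized)))

print("raw:", raw_size)
print("per-record:", per_record_size)
print("block:", block_size)
```

The catch with block compression is that reading one record means decompressing its whole block, so block size becomes a tuning knob between compression ratio and random-read cost.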

Hmm...interesting. I don't know if this will work, but you could try storing your MongoDB database on a compressed ZFS partition. Since MongoDB uses mmap, this would have the nice side-effect of your working set remaining uncompressed, and only being compressed when written to or read from disk.

You're not the first person to suggest that to me :) although I haven't thought about using ZFS for this. You're not the only one to suggest ZFS. Why that and not compression in btrfs (or something else entirely)?

_hopefully_ the mmap interface would provide the best of all worlds: mongodb continues to be simple with respect to how it handles getting data from disk, and the kernel/fs can do its magic behind the scenes of mmap. Of course, it could be that mmap + compressed filesystem leads to some unexpected (and bad) perf results. But then again, I've never tried :) have you?

No reason for ZFS in particular -- I was just unsure about how stable btrfs currently is. I haven't tried this out, but I think I might. Email's on the site in my profile if you beat me to it. :)

That then means all access has to be done via said wrapper. For example I couldn't continue using the Mongo shell as I occasionally do, or if someone comes up with a handy tool written using a different language than my usual (Python).

Another example, regarding the C# driver: don't block thread on network requests, i.e. use IOCP threading. Created 2011-04-25. https://jira.mongodb.org/browse/CSHARP-138

I don't disagree with your points but do keep in mind they are a large company and work towards new features (such as this query matcher) does not imply they don't also have people working on the features that are important to you.

Check the timestamps on those tickets, comment history, votes etc. They aren't some obscure corner!

You might try a compressed filesystem using ZFS.

I have btrfs on my systems, but gave up and got new SSDs with ext4. The copy-on-write approach is terrible for Mongo's database performance. btrfs doesn't have tools that tell you what the compression ratios etc. are, so you can't know unless you fill up storage. It is possible to turn off data copy-on-write, but that also turns off compression! Also, MongoDB likes to use lots of 2GB files instead of a single file for the database, which makes this sort of stuff even harder to apply.

Have you (or anyone else) done this and have some experience to share?
