MongoDB's New Query Matcher (mongodb.org)
77 points by francesca 1330 days ago | 37 comments

I wish the 10gen folks would do some work on various issues that have been in MongoDB for a long time. For me the most serious one is data size. Storage is pretty terrible, especially compared to CPU these days. I would far prefer the use of more CPU in order to reduce storage.

Records in a collection mostly use the same set of keys, yet every record stores the full key strings, which makes the storage that much larger. Key strings could be interned and replaced with far smaller numerical tokens to reduce storage. (I'd be happy to see it attempted for string values as well.)
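A minimal sketch of the key interning being proposed (this is illustrative Python, not anything MongoDB actually does): each distinct key string is assigned a small integer token once, and records store only the tokens.

```python
# Sketch of field-name interning: store each key string once in a
# per-collection table, and replace it in every record with a small
# integer token. Names and structure here are hypothetical.

class KeyInterner:
    def __init__(self):
        self.token_of = {}   # key string -> token
        self.key_of = []     # token -> key string

    def encode(self, record):
        """Replace string keys with integer tokens."""
        out = {}
        for key, value in record.items():
            tok = self.token_of.get(key)
            if tok is None:
                tok = len(self.key_of)
                self.token_of[key] = tok
                self.key_of.append(key)
            out[tok] = value
        return out

    def decode(self, encoded):
        """Restore the original string keys from tokens."""
        return {self.key_of[tok]: value for tok, value in encoded.items()}
```

With a collection of records sharing keys like "user_name" and "timestamp", each record then carries a couple of bytes per key instead of the full strings, at the cost of one lookup table and a little CPU.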

The other is compression, which is a straightforward CPU-versus-storage tradeoff.



Agreed. Of course, I also chimed in early on #164.

Compression (well, lack of) is probably the thing that'll cause me to migrate off MongoDB one of these days.

The lack of compression has a huge effect on working set size, especially since the BSON representation doesn't really save space, just CPU.

Mongo's behaviour when the working set exceeds memory size used to be pretty terrible. (I moved on from the relevant project so I don't know how much of an issue it still is.) https://jira.mongodb.org/browse/SERVER-574 - essentially what happens is that Mongo keeps trying to execute new queries rather than throttling to let existing ones finish, which causes even more thrashing and longer query times.

Reducing the memory consumption would curtail the onset of that.

This is a daily issue for me - having to resync replicas that run out of disk space because Mongo has no compacting strategy.

Resyncing will drop disk usage from 200GB to 50GB(!). So much wasted space.

If you're resyncing that frequently, you should probably be using TTL indexes, the usePowerOf2Sizes collection flag directly, or potentially capped collections.


The data in question is date referenced events - they can't be deleted because I need the data for analytics. It grows forever.

Why on earth is a resync necessary then?

Because the machines run out of disk space. A resync does a compress.

I'll just leave this here... http://www.tokutek.com/2013/02/mongodb-fractal-tree-indexes-...

Come talk to us if you're interested. We are stabilizing things and hope to open up evaluations more in a week or two.

I've seen your posting before, but don't see the relevance. Our indices are a trivial portion of our data size, and even if they were zero bytes in size it wouldn't make an appreciable difference to the data size.

(On phone, can't edit the other reply)

Though you are right, the architecture we tried earlier, when we were doing a proof of concept, would not have helped with compression. Back then we were trying to play nice with mongodb's data format and keep its legacy indexes around. It turned out that for many reasons (concurrency, recovery, compression, and a few others), it was a lot better to replace all of the storage code, so that's what some of us have been doing the past couple months.

What we are doing now is replacing all of mongodb's storage. So all of the data and all the indexes are in fractal trees. This means we can compress everything (including the field names!) and keep everything nicely defragmented.

I would also love to see that happen... MongoDB is implementing really nice new features but I get the feeling that they are leaving behind some more important ones like what you just mentioned.

Until they do, you could always write a thin wrapper over the client driver to gzip/gunzip non-indexed fields. Something like Google's Snappy would be well suited to that.
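A hypothetical wrapper of that kind might look like the following sketch, using zlib from Python's standard library in place of Snappy (python-snappy exposes the same compress/decompress shape). The field names and the `pack`/`unpack` helpers are made up for illustration:

```python
import zlib

# Illustrative driver-side wrapper: compress chosen non-indexed fields
# before inserting a document, and decompress them after retrieval.
# COMPRESSED_FIELDS and the helper names are hypothetical.

COMPRESSED_FIELDS = {"body", "payload"}

def pack(doc):
    """Return a copy of doc with the chosen fields compressed to bytes."""
    out = dict(doc)
    for field in COMPRESSED_FIELDS & doc.keys():
        out[field] = zlib.compress(doc[field].encode("utf-8"))
    return out

def unpack(doc):
    """Reverse pack(): decompress the chosen fields back to strings."""
    out = dict(doc)
    for field in COMPRESSED_FIELDS & doc.keys():
        out[field] = zlib.decompress(doc[field]).decode("utf-8")
    return out
```

You'd call `pack` before every insert and `unpack` after every find, which is exactly the wrapper-only access pattern the reply below objects to.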

That sounds great if you've got large fields with lots of redundancy. In fact, we do this.

But small fields won't compress so well on their own. Often there's a lot of redundancy across records for the same fields, which is great for compression. This might be a great way to achieve the benefits of field name tokenization too (which is similar to part of how most compressors work). I'd like to see block compression rather than field compression.
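The point is easy to demonstrate. In the sketch below (plain Python with zlib; the record shape is invented), compressing many small, similar records one at a time barely helps because each tiny input pays the compressor's fixed overhead, while compressing the same records as one block exploits the redundancy across them:

```python
import json
import zlib

# 200 small records that share field names and most values,
# as records in a collection typically do.
records = [{"event": "click", "page": "/home", "n": i} for i in range(200)]

# Per-record compression: each record compressed independently.
per_record = sum(
    len(zlib.compress(json.dumps(r).encode("utf-8"))) for r in records
)

# Block compression: all records compressed together in one stream,
# so repeated field names and values are encoded once.
block = len(zlib.compress(json.dumps(records).encode("utf-8")))

# block comes out far smaller than per_record.
```

This is why block compression subsumes field-name tokenization: the repeated key strings are exactly the cross-record redundancy a block compressor removes.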

Hmm...interesting. I don't know if this will work, but you could try storing your MongoDB database on a compressed ZFS partition. Since MongoDB uses mmap, this would have the nice side-effect of your working set remaining uncompressed, and only being compressed when written to or read from disk.

You're not the first person to suggest that to me :) although I hadn't thought about using ZFS for this. Why ZFS and not compression in btrfs (or something else entirely)?

_hopefully_ the mmap interface would provide the best of all worlds: mongodb continues to be simple with respect to how it handles getting data from disk, and the kernel/fs can do its magic behind the scenes of mmap. Of course, it could be that mmap + a compressed filesystem leads to some unexpected (and bad) perf results. But then again, I've never tried :) have you?

No reason for ZFS in particular -- I was just unsure about how stable btrfs currently is. I haven't tried this out, but I think I might. Email's on the site in my profile if you beat me to it. :)

That then means all access has to be done via said wrapper. For example I couldn't continue using the Mongo shell as I occasionally do, or if someone comes up with a handy tool written using a different language than my usual (Python).

Another example, regarding the C# driver: don't block threads on network requests, i.e. use IOCP threading. Created 2011-04-25. https://jira.mongodb.org/browse/CSHARP-138

I don't disagree with your points, but do keep in mind they are a large company, and work on new features (such as this query matcher) does not imply they don't also have people working on the features that are important to you.

Check the timestamps on those tickets, comment history, votes, etc. They aren't some obscure corner case!

You might try a compressed filesystem using ZFS.

I have btrfs on my systems, but gave up and got new SSDs with ext4. The copy-on-write approach is terrible for Mongo's database performance. btrfs doesn't have tools that tell you what the compression ratios etc. are, so you can't know unless you fill up storage. It is possible to turn off data copy-on-write, but that also turns off compression! Also, MongoDB likes to use lots of 2GB files instead of a single file for the database, which makes this sort of thing even harder to apply.

Have you (or anyone else) done this and have some experience to share?

As a large commercial user of MongoDB (almost 300 instances of it running in production), I've seen some big shifts in 10gen's focus lately. They've really ramped up the customer service, and they are making more regular releases that seem like improvements.

Unfortunately, the better customer service doesn't provide better answers or solutions to problems, and the improvements aren't targeting long standing, basic issues with the platform.

This feature announcement fits the pattern: an improvement, but not in a critical area like indexing, document-level locking, sharding stability, etc. There are basic fundamentals that need addressing, like overcoming the 20K maximum connections per server, or mongos CPU usage. These are the things I deal with in production that are business critical, but those feature requests sit untouched in JIRA for months or years.

This feature seems interesting, but it solves a problem I don't have. I'd prefer them to solve real world problems.

Why do you need > 20k connections to your database?

Because I want to run more than 100 instances of a webserver per mongos process, but I can't because it causes the connections per server to go over 20K.

Right now my application stack is woefully underutilized due to this completely arbitrary decision on their part not to allow the end user to set the connection limit. They've even admitted that they just picked that number four years ago and haven't looked at it since.

Just like they limit replica sets to 12 members maximum. What if I have higher read requirements than that limitation allows? Well, too bad.

I think this points at a fundamental issue with MongoDB at the design level - they don't allow end users to make any decisions about how to configure the product, even if those decisions might turn out bad.

Every Enterprise-level DB allows end user tuning of parameters, so clearly MongoDB isn't Enterprise-level.

The 20,000 connection limit was removed in 2.5: https://jira.mongodb.org/browse/SERVER-8943

Most likely because the user runs a high number of workers or mongoses.

>This feature seems interesting, but it solves a problem I don't have. I'd prefer them to solve real world problems.

I think it is a little much to equate the problems you have with being the only real world problems. Perhaps this solves a real world problem that you don't have?

The problems I'm having with MongoDB at scale affect everyone at scale - server connection limits, data size, index inefficiencies, driver bugs and so on.

These aren't edge cases specific to my environment. They are very real, very visible issues that are discussed on the mailing lists all the time.

My name might be "Mud" at a previous client because I recommended MongoDB and implemented a first-rev persistence interface based on my presumption that this issue would be fixed in the near-term:


To be fair, I educated the client on the alternatives should 10Gen not improve the aggregation framework so that it could do arbitrary transforms without failure should the output size exceed 16MB. [Use other MongoDB mechanisms, custom compilation with higher per-document size cap, etc.] What annoys me is that 10Gen did not mention this incredibly important limitation when they were touting the planned features of 2.2. My client would not have minded had it simply been the case that queries with large result set sizes were somewhat slow. What this client could not tolerate were failures to deliver any results at all. In retrospect, I wish that I had pushed harder for a solution based on the Hadoop stack. While it seems to have its own demons, at least there is an ecosystem dedicated to fixing the most blatant of limitations.

Do I regret choosing MongoDB a couple years ago to augment Postgres for several data analysis applications? Not really. But I do wish Mongo had been more mature. I'm happy to see each step forward taken, but it does bother me that serious issues often get overlooked for press-friendly feature additions. Until Mongo adopts some form of compression, I won't be using it for new projects, aside from personal ones where development ease trumps everything else.

How about compression on the replication links?

When I inquired about that, they suggested using an ssh tunnel. I can see why they like mmapped files for storing data.

I use MongoDB. I hope they will be able to manage the technical debt and stay competitive with newer alternatives. Seems like this is a step forward.

Just out of curiosity: a real live application or a personal project?
