Hacker News
Why MongoDB is a bad choice for storing our scraped data (scrapinghub.com)
86 points by reinhardt on May 14, 2013 | 115 comments

I really don't understand why people use MongoDB.

It seems like it's an elegant technological metaphor (let's use mmap, the OS is our cache and we can overwrite in place in RAM) that in practise turns out to be a terrible idea. Overwrite/mmap cannot be made reliable, requires blocking write-locks, wastes disk, and causes problems shuffling data around as it grows. Add other bad decisions (keys aren't interned, seriously?) and it's just a terrible limping monster.

Abandon it, walk away.

Because the koolaid is so sweet.

MongoDB is every developer's wet dream. With its expressive query syntax and extreme ease of use, everyone wants to drink the koolaid. This is a huge problem, because MongoDB as a database is dangerous.


I have developers begging me to let them use it. This time to collect logs from our servers for analysis later. I cave in and give my go-ahead, with a warning that no critical data can enter that section. Mongo processes were crashing. Several times per day. About 20% of the crashes yielded a completely corrupted database. This programmer's wet dream quickly shows itself to be a serious operations nightmare.


Because when you're not operating at significant scale, or have certain specific use cases, it's a fantastically elegant solution and one that's very quick and easy to set up.

I have used mongodb for a number of smaller projects, and I have had an excellent experience. It's not "a terrible idea in practice". It might be a terrible fit for what you want, but that doesn't mean it's bad technology.

"when you're not operating at significant scale, or have certain specific use cases, it's a fantastically elegant solution and one that's very quick and easy to set up."

When you're not operating at significant scale, you can use a relational database. They're easy and fast to set up, have nice write-safety guarantees, are more flexible than a key-value store, and will scale well beyond anything that mongo has ever achieved. You can even use them as a key-value store! The downside, of course, is that you have to have a tiny bit of knowledge about set theory, and that's a deal breaker for most "developers" today.

The whole point of the GP was that Mongo isn't elegant or easy...it's just naive and short-sighted, and the architectural mistakes within it are fundamental and probably unfixable (at least, not without killing the speed advantages they claim). The real reason that people use mongo is that most webapp devs don't have a very good understanding of how computers work, and want everything to look like Javascript, because that's all they really know.

We use MongoDB in production for a couple of use cases: (1) an e-Commerce Product Catalog and (2) a home-grown CMS for our news/editorial site.

Both these applications have been in production for about a year without a single problem. Both are using the same MongoDB instance - data size is about 200 GB (RAM on the Mongo machine is 16 GB)

Both these applications were previously on Oracle and were a pain to maintain. The Mongo schema is simpler and far more maintainable than the RDBMS schema. Backups/Monitoring/Replication on Mongo has never given us any problems.

Now, since you claim that an RDBMS can do anything better than MongoDB, can you point me to a simple/elegant/maintainable RDBMS schema for an e-Commerce Product Catalog? I would love to see one.

The original post, like most of the 'Why we moved away from MongoDB' posts, displays a shocking lack of due diligence on the part of the development team / tech lead at these firms. All the points under the 'Data that should be good, ends up bad!' section are known facts about MongoDB. All of them are covered in the manual. If you are not fine with any of these points, please don't use MongoDB at all. Don't put it in production. It baffles me how these firms can put MongoDB into production and 'discover' these things later. Instead of ranting at MongoDB, the CTOs of all these firms deserve the sack for lack of due diligence and putting data at risk.

One last point:

> ..more flexible than a key-value store

MongoDB is not a key-value store.

You lost me when you quoted the word "developers".

Edit: And this is downvoted for calling out the fact that people on HN can't discuss a freakin' database without hurling insults.

I quoted "developers", because we need a term to distinguish people who know basic computer science from people who know just enough to install software and piece together APIs. The latter group tends not to realize that things like overwriting your working set in memory and global write-locking lead inevitably to consistency and throughput issues.

The primary problem in software today is that we've confused the ability to build something with actually knowing anything of value.

we've confused the ability to build something with actually knowing anything of value

Am I reading this correctly? It seems to imply that the ability to build something is somehow orthogonal to knowledge of value.

I don't know about throwing mud into a heap and then calling it sculpture, but if we are talking about the subset of "things" that have value in and of themselves, the ability to build them does imply some knowledge of value.

Now, the relative value of knowing how to put together simple web sites using jQuery vs. the knowledge to discover Chaitin's Constant is very much worthy of discussion. But likewise, the knowledge of how to construct a true but unprovable statement in a toy system vs. the knowledge of how to build VisiCalc and revolutionize programming is worthy of discussion as well.

"Am I reading this correctly? It seems to imply that the ability to build something is somehow orthogonal to knowledge of value."

Not only are you reading it correctly, that is in fact (part of) what I'm saying. Building something doesn't automatically create value. We've confused the two.

Orthogonality generally implies mutual exclusiveness (which I don't think is what you're trying to say).

No, it doesn't. Orthogonal means "at right angles", or "independent". The latter meaning applies.

In the real world though, both groups still need to use what works in practice. It's entirely possible for MongoDB to work sufficiently well for a certain group of people in a reasonably cost-effective way. Exaggerating its problems (as bad as they are) doesn't add weight to your argument. For example, global write-locking will not _inevitably_ lead to consistency or throughput issues unless the write frequencies are sufficient to cause and require that. Lots of data problems aren't "big data", they're barely a medium.

I'll agree MongoDB is not a terribly well engineered database; however, I don't agree that SQL is always the best alternative in these simpler scenarios. There are lots of things wrong with using SQL to solve every data problem. I also don't agree that knowing SQL is equivalent to understanding "set theory". I've known plenty of DBAs who don't know the first thing about set theory and really know just enough to install the software and piece together the APIs. The fact that one has chosen SQL doesn't make them good at working with data any more than choosing MongoDB makes someone bad, or implies they don't understand "set theory".

What concerns me mostly though isn't the MongoDB issue but that we can't discuss the issue in a professional way, without elitism and disdain dripping through. You seem to have confused "computer science" with "anything of value". I happen to believe there are more things worth knowing that are of value to practical software development than just the computer science (not to minimize that of course).

SQL isn't the alternative, it's the standard and noSQL databases are supposed to offer extra value to cause you to migrate. According to these articles, MongoDB doesn't offer any real additional value, thus you shouldn't use it.

It's not about elitism, it's about making good decisions. That said, given all the hype with companies that hire, protesting loudly might not be the best short term personal decision. Meh.

These articles are just one side of the picture which gets heavily upvoted on HN.

And of course it is about elitism. Listen to yourself. "It's about making good decisions".

I mean, who are you to judge from the outside what technology a company should use for a specific use case?

I wasn't specifically talking about the articles, just making a general point that I'm happy to stand behind. I'll happily repeat it: unless you have a specific use case, RDBMS is the default and it should be so. Now if you have a specific use case, fair enough, but the aforementioned examples don't honestly seem to be valid reasons to throw out the advantages of RDBMS.

"In the real world though, both groups still need use what works in practice."

In the real world, people have been using relational databases to solve problems for years. They work, they're understood, they scale.

"global write-locking will not _inevitably_ lead to consistency or throughput unless the write frequencies are sufficient to cause and require that"

In which case, you can just as easily use a relational database and avoid the chance of problems altogether.

> In the real world, people have been using relational databases to solve problems for years. They work, they're understood, they scale.

And they're a pain in the ass and don't mix well with the kinds of programs many want to write. Mongo clearly fills a niche that relational databases don't serve well; if it didn't, no one would use it.

I think it's not a fight. RDBMSs have been around for many years and have proved to work in many areas. NoSQL databases were born from new needs people had in their new projects. Both would probably work perfectly well for many cases, but NoSQL databases are very suitable for scenarios where you do not require a strict schema, and they are also simple to set up.

I still think MongoDB is great for many applications, as many companies are using it for their data needs (like Foursquare), and the same goes for RDBMSs like MySQL, which a lot of big fish use for different parts of their architecture (Facebook, Twitter, etc). In the end, each option has pros and cons, but one will be better for your use case.

Could you actually go into some detail about these mythical problems with actual databases? Faux database apologists seem to really love claiming databases are so unusable, but I've never gotten an actual explanation as to what problems they are having. As both a developer and a sysadmin, postgresql is much less of a pain in the ass than mongodb. And I have no idea what "don't mix well with the kinds of programs.." is supposed to mean.

The idea that "it must be good for something or people wouldn't use it" is absurd. People do the wrong thing all the time. People make technical decisions based on fads constantly. Mongodb is one of the prime examples of fad driven development choices, where people choose it because "it is web scale" while having no idea what they are even supposed to be comparing it to.

> Faux database apologists seem to really love claiming databases are so unusable

You really think tossing out insults like that is a way to have a reasoned conversation? I think not, come back when you can converse like an adult.

> And I have no idea what "don't mix well with the kinds of programs.." is supposed to mean.

Then you need more experience as a programmer perhaps. I'm a programmer and a SQL guy, and I'm fully aware of what a pain SQL can be in an application, and if you don't see the ease of programming that things like NoSQL databases or Mongo bring, you aren't paying attention or you're lying to yourself about how well SQL fits with code.

It depends upon the application. He is right and you are also right for the fields of programming and applications you each work in; it comes down to the pros and cons of different types of interface to the database for the application at hand. We all know the variations of those, though perhaps it could have been toned down. You are just going to argue over differing types of application without saying which type.

In short, you are both right; the application, how it is used, and its requirements make the difference as to what interface is best. No need to argue over that without even stating it. You both win, this is the internet - now laugh :).

You made baseless assertions, and provided nothing to back them up. Your comment got precisely the response it deserves. Acting indignant does not support your assertions.

The fact that you have some unspecified problem does not mean anyone else who does not have that problem is inexperienced. Given the complete lack of information available, it is just as reasonable to conclude that you are in fact lacking in experience which allows others to solve the problem you continue to refuse to define.

+1, though you're both right depending on the application, and with that you will each go from your own experience and both be right without specifying an example in detail. Neither of you wants to get down to doing specs on a forum to win an argument that you both will win, only to end up arguing about the application and specs.

I say payroll databases at dawn, 50 paces each, turn and shoot. Go :-)

And which assertion was that: the one that Mongo fits a niche, which it obviously does, or the one that many people find relational databases a pain, which they obviously do? Neither of those requires me to provide evidence; they are self-evident facts to anyone with even moderate experience in the field. I don't have an unspecified problem; not once did I even mention a problem, so take your childish argumentative b.s. somewhere else.

>And which assertion was that

Both of them. You only wrote two sentences, it shouldn't be hard to find them.

>the one that many people find

You didn't say anything about "many people find". You said they are a pain in the ass, and don't mix well with the applications many people are writing. Those are both assertions, and you supported neither of them. Even after replying twice, you still haven't even given a hint as to what you might be referring to. That really makes it seem like you are just saying things out of ignorance.

> You didn't say anything about "many people find". You said they are a pain in the ass

I said they were a pain in the ass for the kinds of programs many people want to write. I'm sorry you're too ignorant to grok my meaning without it being explained to you like a five year old child.

> Those are both assertions, and you supported neither of them.

They are self evident facts and don't require supporting evidence; the very fact that a community exists around these products should make that clear to you.

In any case, it's absolutely clear there's no value in conversing with you, good day.

>I said they were a pain in the ass for the kinds of programs many people want to write

And refuse to specify what those kinds of programs might be. You are inventing a "many" and giving them a problem to create a false impression of consensus, when it is actually just you making a singular, baseless assertion.

>the very fact that a community exists around these products should make that clear to you.

I address that in the first post. You keep responding purely to act like a petulant child, but provide absolutely nothing to support your claims. Do you really think that makes you appear to be the rational, logical party?

people choose it because "it is web scale" while having no idea what they are even supposed to be comparing it to

Some compare it to relational databases... http://www.mongodb-is-web-scale.com/

So the guys at Foursquare are driven by "fads" and don't have a clue about databases or scaling?

You are reading something I didn't write.

OTOH, I've experienced a lot of people who know basic computer science but don't grasp any software engineering. Nor do they even realize that it's a thing.

Those sorts of people tend to be very focussed on clever algorithms and data structures, and frequently miss the larger picture and coding best practices. I've seen far too much code that had extensive CS cleverness at the root but was spaghettified, untested, undocumented, poorly performant, and not even tracked in a version control system. Such people often don't value writing code that other developers can read and maintain. On a team, clarity matters more than cleverness.

Good developers need both sets of skills.

This issue is well covered in Joel Spolsky's article The Perils of JavaSchools http://www.joelonsoftware.com/articles/ThePerilsofJavaSchool...

That said, I think it is a bit elitist to think you need a new term. Who gets to say who can use that term?

Sorry but I don't think you know what you're talking about here.

Write locking definitely leads to throughput issues, but it results in better consistency, not worse.

> Edit: And this is downvoted for calling out the fact that people on HN can't discuss a freakin' database without hurling insults.

The original comment you're replying to aside, it was likely because stating that you dismissed the entire comment without giving an actual objection to it added nothing at all to the discussion.

Tone. Which is an actual objection.

When you're not operating at significant scale, you can use a relational database.

Yes, if you know how to do it. But posts like https://news.ycombinator.com/item?id=5675902 and questions 'Should I learn SQL?' here and there make me think that's not required knowledge these days.

Are you dead certain those projects aren't ever eating data, and that they aren't about to crash tomorrow with an unrecoverably hosed DB?

I still say it's bad technology. Use plain old SQL instead.

"plain old SQL" can require a lot of mangling one's data to fit its constraints. I refuse to believe that there isn't a better key-value store for the case where the values are json documents, even if mongodb isn't it.

Except it isn't "SQL", it's first-order predicate calculus: a provably sound way to store and query your data.

But if one is going to use a key-value store anyway, where is the "mangling"? Building a key-value store in an SQL database is trivial (this below is for PostgreSQL):

    create table keyvalue (key text, value text);
    create index keyvalue_idx on keyvalue(key);

And use is trivial as well:

    insert into keyvalue (key, value) values ('key', 'value');

    select value from keyvalue where key = 'key';

[edit] formatting fix.
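For anyone who wants to try the idea end to end, here is a minimal sketch using Python's standard-library sqlite3 module. It uses SQLite rather than PostgreSQL purely so it runs without a server, and it declares the key as a primary key instead of adding a separate index. The helper names (make_store, put, get) are made up for illustration and are not from the comment above.

```python
import sqlite3


def make_store():
    # In-memory SQLite database; PRIMARY KEY gives an index on key
    # and guarantees one row per key.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE keyvalue (key TEXT PRIMARY KEY, value TEXT)")
    return conn


def put(conn, key, value):
    # INSERT OR REPLACE is SQLite-specific syntax: it overwrites an existing key.
    # "with conn" commits on success and rolls back on error.
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO keyvalue (key, value) VALUES (?, ?)",
            (key, value),
        )


def get(conn, key, default=None):
    row = conn.execute(
        "SELECT value FROM keyvalue WHERE key = ?", (key,)
    ).fetchone()
    return row[0] if row else default


store = make_store()
put(store, "key", "value")
put(store, "key", "newer value")  # second write replaces the first
```

Calling get(store, "key") afterwards returns the most recently written value, and a missing key falls back to the default.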

It is not nice to have to manipulate JSON as a plain string (you miss out on validation, have to manually construct like expressions for queries, and I don't even want to think about what you'd have to do to update part of a document), at least using "plain old SQL". (PostgreSQL's native JSON support would make it quite easy, but that supports my point)

If you need to query JSON content, why store it as a plain string to begin with? Parse it at the application level (or even in a stored procedure) and store it as ordinary fields. Or use hstore.
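A rough sketch of the "parse it at the application level" approach, again with Python's standard-library json and sqlite3 modules so it is self-contained. The pages table, its columns, and the store_document helper are hypothetical, chosen only to illustrate the idea.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one ordinary column per JSON field we want to query.
conn.execute(
    "CREATE TABLE pages (url TEXT PRIMARY KEY, title TEXT, status INTEGER)"
)


def store_document(conn, raw_json):
    # json.loads validates the input; invalid JSON raises ValueError.
    doc = json.loads(raw_json)
    with conn:
        conn.execute(
            "INSERT INTO pages (url, title, status) VALUES (?, ?, ?)",
            (doc["url"], doc.get("title"), doc.get("status")),
        )


store_document(
    conn, '{"url": "http://example.com/", "title": "Example", "status": 200}'
)

# Updating part of the "document" is now an ordinary UPDATE on one column.
with conn:
    conn.execute(
        "UPDATE pages SET status = ? WHERE url = ?",
        (404, "http://example.com/"),
    )
```

The trade-off is the one raised earlier in the thread: the fields have to be known up front, which is exactly the "mangling" a document store lets you skip.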

Postgres additionally has a json field type.

For now, until PostgreSQL 9.3 is released, that type has only one useful feature compared to a plain string: validation. That's not much help, and that is what lmm's original point was about.

Redis. Riak. Rethinkdb. Cassandra (with a mapping layer). SQL (with a mapping layer). SQL, in a JSON-typed column (PostgreSQL supports it). SQL, in a blob field, plus indexes. I could go on.

I was required to use MongoDB in a recent large-scale analytics project. It is a disaster. I am in the process of replacing the most important metrics with Redis; I can't ditch Mongo fast enough.


They are also on my blacklist, forever.

They used to ship with unacknowledged writes as the default option. Think about it for a little: a database that just throws your _data_ over the fence and prays for the best, proceeding without a write acknowledgement.

There was no flashing warning on their front page about it, no bold disclaimers, but there were sure plenty of "Oh look super fast benchmarks beating SQL and other NoSQL databases, albeit created by fanboys".

That decision told me the story of who they are and what kind of principles they use to build their product. It wasn't a mistake; it was a deliberately shady tactic.

(Yes, I know I have written about it at least 3 times before and will mention it every time I see MongoDB mentioned )

So can you tell us what's your easiest way to implement something like:

    books: [
        {id: 1, tags: ['a', 'b'], author: ['c', 'd'], count_read: 123, count_bought: 456},
    ]

where count_read and count_bought are atomically incremented, and tags is an arbitrary string array that can be indexed for searching.

Yeah, I knew Postgres could do that. But MongoDB is the simplest and most direct way on the market. Stuff like tags makes MySQL m2m joins very inefficient.
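For comparison, a rough sketch of what the relational version of that document might look like, using Python's standard-library sqlite3 module. It assumes a many-to-many style tag table (the approach the comment above objects to) and leaves the author array out for brevity. All table, index, and function names are illustrative, and nothing here measures the join cost being argued about.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE books (
        id INTEGER PRIMARY KEY,
        count_read INTEGER NOT NULL DEFAULT 0,
        count_bought INTEGER NOT NULL DEFAULT 0
    );
    CREATE TABLE book_tags (
        book_id INTEGER NOT NULL REFERENCES books(id),
        tag TEXT NOT NULL,
        PRIMARY KEY (book_id, tag)
    );
    -- Index that turns "find books by tag" into an index lookup.
    CREATE INDEX book_tags_by_tag ON book_tags (tag, book_id);
""")


def add_book(conn, book_id, tags):
    with conn:
        conn.execute("INSERT INTO books (id) VALUES (?)", (book_id,))
        conn.executemany(
            "INSERT INTO book_tags (book_id, tag) VALUES (?, ?)",
            [(book_id, tag) for tag in tags],
        )


def record_read(conn, book_id):
    # The increment happens inside one UPDATE statement, so the application
    # never does a separate read followed by a write.
    with conn:
        conn.execute(
            "UPDATE books SET count_read = count_read + 1 WHERE id = ?",
            (book_id,),
        )


def books_with_tag(conn, tag):
    rows = conn.execute(
        "SELECT book_id FROM book_tags WHERE tag = ? ORDER BY book_id", (tag,)
    )
    return [row[0] for row in rows]


add_book(conn, 1, ["a", "b"])
add_book(conn, 2, ["b"])
record_read(conn, 1)
record_read(conn, 1)
```

It is clearly more ceremony than one document with an embedded array, which is the commenter's point; whether the tag lookup is fast enough depends on the database and data size.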

> I really don't understand why people use MongoDB.

Really? When I don't want to think in databases and only on persisting my native types in my favorite programming language MongoDB is my choice. I use it only for experimenting, so I don't care about all those scalability issues.

Try Redis instead? Or rethinkdb (if you don't mind it being a bit new)?

No. The Redis interface doesn't abstract data types. For example, Python types are quasi-transparently interfaced with MongoDB.

Was your comment a little bit ironic?

I think the author summed it up well: "There is a niche where MongoDB can work well."

s/MongoDB/Any technology/g

The size of the niche varies. The lesson is to be sure the choices you make are appropriate for your situation, and be aware that things may change if new requirements emerge or scale needs to go beyond what you projected. These concerns are not specific to MongoDB. It is a rare project that goes from prototype to small scale to large scale on its original implementation technology choices.

MongoDB is the Pinto of databases. It is not safe at any scale.

Because it is easy:

* Easy to learn

* Easy to use

* Easy to deploy

All easy things get us into trouble when we want high performance. Cheers ;)

I'm interested in hearing what the author's new storage system is. What would be compelling is to hear if the same hardware and storage with the new storage system performed better than mongo with some semblance of concrete metrics. There are a lot of complaints here about mongo -- all of them not new -- but no hard numbers.

Whenever I see "You don't need MongoDB, use an SQL database" and then the flaming back and forth, I never see my key problem mentioned:

MongoDB makes it easy to scale out (replica sets and sharding), where is the "easy to setup replicated and sharded open source SQL database?"

I mean, I know that Postgres has replication (via Slony? honestly, it's been awhile since I looked at their solutions) but I don't recall it being as dead simple to set up.

For me, setting up replication needs to be easy because we redistribute the store as part of our product and we need scalability (both replication for redundancy and sharding for scaling).

So I'm honestly asking here, where is the easy to use sharded and replicated open source SQL store that I've been missing?

That "easy to scale out" is a misnomer. Replica sets and sharding work in the technical sense, but the implementation isn't anywhere near what I would qualify as production ready.

For example, today my entire production MongoDB database was running 3x slower because a single replica in one shard was down, and their buggy PHP driver kept trying to talk to it despite it being marked down. I really enjoyed waking up at 2am to deal with that.

It relates back to the "easy to use" nature of their marketing. It really is super easy to use and develop on, but the minute you need to do anything important or serious, it breaks down.

You aren't doing yourself any favors going with it except as a proof-of-concept.

But Mongo DB being buggy isn't a reason to need to use an SQL database vs. a NoSQL store. An SQL database could be buggy as well (I still use Postgres and comparing anything to that quality-wise is just going to bring sorrow for the thing you compare it to ;) ).

FWIW, it's been spotless for us so far. Our needs aren't web scale, but they're big enough to need scaling features.

The main issue with MongoDB is that it's so easy to use and seems like it scales, but soon you're invested in it to the point of refactoring being a serious engineering effort, and you're stuck with something that doesn't actually offer real scaling features.

So, it's less "nosql vs. sql" and more just "don't use mongodb".

Postgres has hot standby built in nowadays, and it works well.

Sharding certainly isn't as easy - the technical compromises that mongo makes make it pretty trivial to implement, whereas it's relatively hard to make it work in an RDBMS while maintaining all the expected capabilities. It generally requires some application-level work on open source dbs.
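To make "application-level work" concrete, here is a deliberately naive sketch of hash-based shard routing in Python. The shard names are hypothetical, and this is just one possible scheme, not how any particular database or driver does it.

```python
import hashlib

# Hypothetical shard identifiers; in practice these would map to connections.
SHARDS = ["db-shard-0", "db-shard-1", "db-shard-2"]


def shard_for(key, shards=SHARDS):
    # Hash the key so rows spread roughly evenly, then pick a shard by modulo.
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return shards[int.from_bytes(digest[:8], "big") % len(shards)]
```

The routing itself is the easy part. With a plain modulo scheme, adding a shard remaps most keys, and anything that spans shards (joins, transactions, unique constraints) becomes the application's problem, which is the hard part referred to above.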

With that said, I really think many people grossly underestimate the effectiveness of scale-up. It's worth remembering that Stack Overflow (for example) is still running on a single pair of master/hot standby database machines.

Standby is a pretty poor solution compared to replica set let alone what Cassandra has to offer. Sharding is trivial on MongoDB/Cassandra and it is open source. So let's be accurate here. It is a problem inherent with the SQL databases.

And I think you underestimate the benefits of scaling out. If I want to ensure close to 100% uptime or have a server closer to my users, then Cassandra or even MongoDB would be infinitely easier to set up and manage than Postgres. These are "very nice to haves" for even the tiniest startup.

Sharding is, of course, trivial on those systems - after all, they're extremely feature-poor, and have given up those features specifically to support trivial sharding.

You can replicate much of this behaviour using open source RDBMSs, but yeah, it's not what they're designed for and it's harder. If you want quality replication/clustering you're currently looking at paid-for DBs.

Being able to scale out is absolutely a nice-to-have. I'm not sure it's a nice-to-have on the scale of giving up all of the features an RDBMS provides for most people's use-cases. Further, you might find that the relative lack of data headaches you get with an RDBMS more than makes up for a little extra time setting up hot standby.

Finally, if 100% uptime is that important you're probably not relying on a relatively niche NoSQL database. If uptime on the level of Stack Overflow is good enough (which for most people it probably is, let's face it), then you'll probably find replicated postgres good enough.

Thanks, that's very much what I'm talking about. Cassandra would be my ideal store, I absolutely love it except for the ability to index across nodes. My understanding, when last I looked at it, was that indexes were only local and didn't span Cassandra nodes. Does Cassandra now have properly distributed indexes?

When I was looking at implementing Cassandra instead of MongoDB, it seemed like we had to create a reverse column family (IIRC, been away from Cassandra for a bit now). Is that still the case?

Just curious: did you get any real problem with local indexes? For me it works just fine.

How do you use a local index when your data is distributed across numerous nodes? Maybe I'm missing something fundamental, so I'd definitely like to understand.

It's hidden from me behind the client library API (Astyanax in my case), so I don't need to know anything about index locality. I just send a request (give me records for this index value), get a response, and don't care whether it's a local index or a distributed index. Astyanax takes care of everything; it queries all nodes.
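As a toy model of what "it queries all nodes" means with purely local indexes, here is a scatter-gather sketch in Python. It is not Cassandra's or Astyanax's actual API; the Node class and query_by_value function are invented for illustration, and overwriting a key is not handled.

```python
class Node:
    """Toy node: holds some rows plus an index over only its own rows."""

    def __init__(self):
        self.rows = {}
        self.local_index = {}  # indexed value -> set of row keys on this node

    def put(self, key, value):
        self.rows[key] = value
        self.local_index.setdefault(value, set()).add(key)

    def lookup(self, value):
        return self.local_index.get(value, set())


def query_by_value(nodes, value):
    # Scatter the lookup to every node, then gather and merge the partial results.
    found = set()
    for node in nodes:
        found |= node.lookup(value)
    return sorted(found)


nodes = [Node(), Node(), Node()]
nodes[0].put("row1", "red")
nodes[1].put("row2", "blue")
nodes[2].put("row3", "red")
```

The caller sees one answer, but every node does work for every indexed query, which is roughly the concern raised in the question above.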

Just refreshing my memory here, but this blog post [1] is what kept me away from secondary indexes and explains why I need something more from Cassandra. Especially the section on "The Good: Secondary Indexes", I actually have some data that is stored by timestamp, that was, as I recall, the biggest turn-off on them.

Has the state of affairs advanced since a year ago? Would love to hear it has!

[1] http://brianoneill.blogspot.com/2012/03/cassandra-indexing-g...

I will check that out, thanks! I was looking at Hector and Pelops at the time and I don't believe they provided anything like that. But now that you mention it, I bet Hive could do what you describe as well. Need to dig a bit deeper now!

ask yourself how facebook does it

It's like people started complaining about MongoDB just for the sake of it. I guess it's the new trend?

- Ordered data and skip / limit: These would run just fine on any database system, given that you have appropriate indexes. It does not matter if there are a trillion items total, as long as you are seeking over an index and the result set is of reasonable size.

- Restrictions: A lot of software has restrictions. Filesystems have file name limitations. RDBMSs have table / column name limitations. It's a fact of life. Why is this a con for MongoDB?

- Impossible to keep working set in memory: It is a fair argument that MongoDB has shitty memory management because it just delegates the responsibility to the OS. However, this is a concern with any DBMS. Also, given that there are appropriate indexes, you don't need to keep the entire database in memory. This comes back to the indexing problem.

- No transactions / lack of schema / no joins...: I don't remember mongoDB claiming to have such features. My car can't fly. I'm not complaining. (Well, sometimes)

- Locking: Fair point. Better I/O performance (like an SSD) might come in handy, or eventually sharding.

- Poor space efficiency: Fair point about fragmentation and field names. Compression can be achieved on the filesystem level. There was an article about that a couple of days ago. I'm not sure about performance though.

- Too many databases: This should not be a big issue. Mongo does not go ahead and allocate a couple of gigabytes for each db; it uses incremental file sizes.

- Silent failures: Yep.. There it fails miserably. Recent versions are better though.

Why are you taking this personally? They're just listing reasons why it's not a good fit for them. Useful information to others who are trying to pick a database for similar applications.

I'm sorry if it looks like I'm attacking the criticism. Nope, I would not use MongoDB ever again, after a year and a half with it. I have my reasons for this decision.

I just don't like people bashing something without valid reasons. It might just be a perfect solution for similar applications, this is not a good way to evaluate.

>I just don't like people bashing something without valid reasons

That really doesn't seem to be the case here. Like you, the article's author (and several others here) have had issues with it for their particular use case, and the reasons are clearly listed in a well organized, paragraph-by-paragraph summary in the article. Others here who've had a similar experience at least stated they had issues with it as well, even if they didn't go into much detail about it.

And speaking of the lack of valid reasons, to be fair, many relatively new technologies like these often get significant praise/hype without many valid reasons as well, other than [X]startup/company is using it, so it should be able to work for me, or it must be an awesome technology to use.

Exactly. It just mentions why MongoDB hasn't worked for us.

We are not plainly complaining about MongoDB, nor saying it's useless. We are just explaining why it's a poor choice for a specific use case: storing scraped data.

FWIW, we still use Mongo in other internal applications, it's just not the right choice for our crawl data storage backend.

One issue is that many of these points are design characteristics of MongoDB and should have been known beforehand. I am not criticising, but it's almost like you did zero research beforehand.

Transactions, for example, have never existed in MongoDB, and joins don't really make much sense.

Perhaps they did their research on MongoDB and knew all the limitations, but thought to themselves "meh, I can solve all that in the application code", and eventually found out it wasn't so easy to handle transactions and joins in the code?

After all, developers are rather susceptible to the "don't tell me I can't do that" behavior.

How was the evaluation process that led to using MongoDB in the first place?

At some point you must have compared it to, say, Postgres – which is what the section before the summary hints at.

It isn't new, I remember people complaining about MongoDB since its first releases.

> It's like people started complaining about MongoDB just for the sake of it. I guess it's the new trend?

All the HN crowd went insane at once or, who knows, maybe they found out one product that is relatively heavily marketed is mostly blowing smoke up everyone's asses.

The complaining is vis-a-vis the marketing and perceived fan-boyism. It might also not be completely bad news as it means people are still using it.

you lost me here

""" Ordered data

Some data (e.g. crawl logs) needs to be returned in the order it was written. Retrieving data in order requires sorting which is impractical when the number of records gets large. """

It requires _indexing_ and is quite feasible, as I do it every day with stock ticker logs (also required to be retrieved incrementally).

There are a few other flags that make me wonder about the exact limitations you found, but I will be anticipating your follow up post to see what your fix was since some of those issues are very common.

No kidding. Without details, it really sounds like the author is a bit clueless.

He mentions the lack of joins, but doesn't say a word about Mapreduce.

"MongoDB needs to walk the index from the beginning to the offset..." You don't "walk an index". It's an index.

"Too many databases" sounds a little suspicious. Why not add an indexed field to partition records?

Complaining about a lack of schema, transactions and triggers? Really? Did you read the docs at all before starting?

MongoDB is not without its problems, but friend, I think you wanted either Postgres or Hadoop.

You don't "walk an index". It's an index

If you have an address book, you don't have to walk through the city to find an address, but you do have to look through your address book in some way or other. Of course, you can have an index of the index ("C starts at page 7"), but then you have to look through the index of the index.

The problems with pagination are explained better on this SO post: http://stackoverflow.com/questions/7228169/slow-pagination-o... Mongo docs used the "walk" terminology

Not sure what you're getting at. "Walk" usually means a sequential scan. An index is sorted, so you can binary search.

See other reply. If the docs say MongoDB "walks", it's hilarious to complain that someone obviously didn't read the docs for saying "walk", too.
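For anyone following this subthread, here is a minimal sketch of the two access patterns being argued about. It is not MongoDB code: `index` is a hypothetical in-memory sorted list standing in for a sorted index, and both function names are made up for illustration. The assumption is that a plain sorted index can seek to a key quickly but has no shortcut to "the Nth entry", so an offset has to be stepped over.

```python
import bisect
from itertools import islice

# Hypothetical stand-in for a sorted index: (timestamp, record_id) keys.
index = [(1368536000 + i, i) for i in range(100_000)]


def page_by_offset(index, offset, limit):
    """Offset pagination: iterate from the start and discard `offset` entries.

    This is the "walk": the cost grows with the offset, even though the
    index itself is sorted.
    """
    return list(islice(iter(index), offset, offset + limit))


def page_after_key(index, last_key, limit):
    """Range pagination: binary search to the last key seen, then read on.

    This is the "binary search" case: the cost does not depend on how deep
    into the data the page is.
    """
    start = bisect.bisect_right(index, last_key)
    return index[start:start + limit]


# Both calls return the same page; only the work done to reach it differs.
deep_page_walked = page_by_offset(index, 90_000, 3)
deep_page_seeked = page_after_key(index, index[89_999], 3)
```

So both commenters can be right: lookups by key are a search, while skipping to an arbitrary offset behaves like a walk.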

It looks like they went with HBase for the replacement which means they can scan an index range, keys are in lexicographical order, so it makes it pretty easy to scan over a series of data with a single RPC call.

With a database like HBase, it's already ordered lexicographically, which makes it easy to grab a range of data in the order it was written in. You could have a key design like <reverse_domain>-<epoch> which would allow quick scans over large amounts of data, i.e. scan from <object_id>-1368536860 to <object_id>-1368540450.

HBase is multidimensional though, which allows you to keep N numbers of versions of a cell. By default you will get the latest version of the cell back, but you could also opt to receive N versions back, which is useful for time series use cases.
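A rough illustration of the <reverse_domain>-<epoch> key design mentioned above, as a sketch only. This does not use any HBase client API: `SortedTable`, `row_key`, and the example domains are hypothetical, and a sorted Python list stands in for the lexicographically ordered keyspace. The stop key is treated as exclusive here, which is an assumption of the sketch.

```python
import bisect


def row_key(reverse_domain, epoch):
    # Zero-pad the epoch so that string order matches numeric order.
    return f"{reverse_domain}-{epoch:010d}"


class SortedTable:
    """Toy key-value table that keeps its keys in lexicographic order."""

    def __init__(self):
        self._keys = []
        self._rows = {}

    def put(self, key, value):
        if key not in self._rows:
            bisect.insort(self._keys, key)
        self._rows[key] = value

    def scan(self, start_key, stop_key):
        # One contiguous slice of the keyspace: start inclusive, stop exclusive.
        lo = bisect.bisect_left(self._keys, start_key)
        hi = bisect.bisect_left(self._keys, stop_key)
        return [(k, self._rows[k]) for k in self._keys[lo:hi]]


table = SortedTable()
for epoch in (1368540450, 1368536860, 1368530000, 1368538000):
    table.put(row_key("com.example", epoch), f"crawled at {epoch}")
table.put(row_key("org.example", 1368537000), "other site")

# Rows come back in key order, which for one domain is time order.
hits = table.scan(
    row_key("com.example", 1368536860),
    row_key("com.example", 1368540450),
)
```

Because the domain is the key prefix, rows for other sites never fall inside the scanned range, regardless of their timestamps.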

The last time I looked there weren't many resources explaining how to design NoSQL databases (how to compose your keys, when to avoid normalization, etc). Has this improved?

Well, "NoSQL databases" is a pretty broad term. Not all NoSQL databases are created alike. For example, MongoDB is a "document-oriented database" whereas HBase is a "column-oriented store" based on the Google BigTable whitepaper.

As far as I know, key design is not an important aspect with MongoDB but I could be mistaken. HBase has a pretty awesome book (http://www.hbasebook.com/), which has an entire chapter dedicated to key design. Lars (the author) also has a pretty in depth 1 hour video on key design (http://www.youtube.com/watch?v=_HLoH_PgrLk).

HBase is pretty widely used, I've seen 1200+ node clusters running production tables.

Take the example of crawl logs. Each log entry has a log level, timestamp and message. Typical use would be to view all ERROR (or higher) log levels, show all entries with a specific text in the message, or download the entire log. All of these should be in timestamp order. It's a shame that natural order is not insert order for non-capped collections.

It's a good point that some of this can be achieved with indexing, I should have given more details in the blog post.

Right. For that example, not knowing the specifics, I would index on timestamp and then I could sort by timestamp, which once indexed should be a relatively fast operation. I could even go one step better and make the _id a construct of { <timestamp>,<loglevel>,<fuzz> }. Then inserting would be done in order and I would get a magic index for free (depending on how often I query by loglevel, I might leave it out). This gives locality of timestamps and helps keep "hot" sections in memory.
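As a sketch of what a { <timestamp>,<loglevel>,<fuzz> } identifier could look like in practice: the byte layout, the `LEVELS` mapping, and the helper names below are all assumptions for illustration, not anything MongoDB prescribes. The idea is only that a big-endian timestamp prefix makes byte-wise order agree with time order.

```python
import os
import struct

# Hypothetical numeric log levels; higher means more severe.
LEVELS = {"DEBUG": 10, "INFO": 20, "WARNING": 30, "ERROR": 40}


def make_log_id(timestamp_ms, level, fuzz=None):
    """Pack timestamp, level, and random fuzz into a sortable byte string."""
    if fuzz is None:
        fuzz = os.urandom(4)  # avoids collisions within the same millisecond
    # ">QB": big-endian 8-byte timestamp followed by a 1-byte level.
    return struct.pack(">QB", timestamp_ms, LEVELS[level]) + fuzz


def parse_log_id(log_id):
    timestamp_ms, level = struct.unpack(">QB", log_id[:9])
    return timestamp_ms, level


ids = [
    make_log_id(1368536861000, "ERROR"),
    make_log_id(1368536860000, "INFO"),
    make_log_id(1368536860000, "ERROR"),
    make_log_id(1368536862000, "DEBUG"),
]

# Sorting the raw ids yields timestamp order without a separate sort key.
ordered = sorted(ids)
timestamps = [parse_log_id(i)[0] for i in ordered]

# "ERROR or higher" still needs a filter (or its own index) over that order.
errors = [
    parse_log_id(i)[0]
    for i in ordered
    if parse_log_id(i)[1] >= LEVELS["ERROR"]
]
```

Note that entries sharing a timestamp are ordered by level and then by fuzz, so strict insert order within a single millisecond is not preserved by this layout.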

Also, mongo has natural ordering which would do what the author wants without sorting.

No, mongo natural order is just order on the disk. It is not always in reverse insertion order.

Now the interesting thing about this post is this, I can see why they wanted to use mongodb, and I can see why it bit them in the arse.

What interests me is why they would want to keep everything in the database? I'd assume that they need to aggregate and curate the scraped data. After the initial scrape the majority of actions surely are going to be on the metadata of the scraped content? (where is said data, when was it scraped, how big, relationship to other data, etc.) This data is much smaller and can be stored in a relational database, as it's proper structured data with relationships.

This allows the nasty unstructured data to be kept on a plain boring filesystem. After all filesystems are exceptionally mature, universal, multilevel key-value stores.

Now people will say that filesystems don't scale; well, that's not really true. ext4/ntfs on a single system won't scale, but something like lustre/gluster (although not as neat)/gpfs scales linearly with the number of nodes you apply to it.

This approach (scraped data on FS + metadata in DB) works well for storing scraped data. It was the first thing I prototyped when we started the project to move away from MongoDB. We've worked on similar designs in the past where the data is in S3, it's a common pattern.

We'd need to code the searching, filtering, paginating, (distributed?) job management ourselves while being careful to keep the DB & metadata consistent. It works best if each file is a reasonable 'chunk' of data (not too big, not tiny). None of this is a problem, and it scales very well as you said.
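A minimal sketch of the "scraped data on FS + metadata in DB" pattern described above, using SQLite and a local directory as stand-ins. The table layout, function names, and the one-file-per-page simplification are all assumptions for illustration; a real setup would batch pages into larger chunks and handle failures between the file write and the metadata insert.

```python
import hashlib
import sqlite3
import tempfile
from pathlib import Path


def open_store(root):
    """Create the blob directory root and the metadata database inside it."""
    root = Path(root)
    root.mkdir(parents=True, exist_ok=True)
    db = sqlite3.connect(str(root / "metadata.db"))
    db.execute(
        "CREATE TABLE IF NOT EXISTS pages ("
        "url TEXT, scraped_at INTEGER, size INTEGER, path TEXT)"
    )
    db.execute("CREATE INDEX IF NOT EXISTS pages_by_time ON pages (scraped_at)")
    return root, db


def save_page(root, db, url, scraped_at, body):
    """Write the raw body to a file, then record its metadata in the DB."""
    digest = hashlib.sha256(body).hexdigest()
    path = root / "blobs" / digest[:2] / digest
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(body)
    # Insert metadata only after the file exists, so rows never point at
    # missing files (an orphaned file is the cheaper failure mode).
    with db:
        db.execute(
            "INSERT INTO pages VALUES (?, ?, ?, ?)",
            (url, scraped_at, len(body), str(path)),
        )
    return path


def pages_since(db, since):
    """Filtering and ordering happen on the small metadata table only."""
    return db.execute(
        "SELECT url, path FROM pages WHERE scraped_at >= ? ORDER BY scraped_at",
        (since,),
    ).fetchall()


root, db = open_store(tempfile.mkdtemp())
save_page(root, db, "http://example.com/a", 1368536860, b"<html>a</html>")
save_page(root, db, "http://example.com/b", 1368540450, b"<html>b</html>")
recent = pages_since(db, 1368540000)
```

Even in this toy version, the searching, pagination, and consistency between files and metadata are all application code, which is the cost mentioned above.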

In the end, we went with HBase for crawl data in the new system. Of course, you can look at this as files on a filesystem (HDFS or others) :) It does a lot of what we would otherwise have to code ourselves and it's a good fit for applications we want to build on that data in future (e.g. storing other crawl datastructures, processing with hadoop). I'll provide more details on that in the next post.

Quote from the original author in the comments. TLDR: it was human error that got us into this situation:

"The lack of joins & transactions of course did factor into the original decision. My point (which perhaps could be clearer) was that MongoDB ended up being used outside of the area in which we originally intended to use it. There was some reluctance to add another technology when we could get by with what we had for what was (initially) only a small use. Additionally, some limitations were not always well understood by web developers (who were new to mongo and enthusiastic to try it). I see this as our mistake. With hindsight, it’s clear we should have introduced an RDBMS immediately and kept MongoDB for managing the crawl data."

Databases almost always grow outside of their original scope, much more so than applications. And there is a good reason: data is more valuable when it's combined with other data.

So I think it's reasonable to be cautious about using a system that can't effectively grow outside of its initial special purpose.

Proof that the OP was just looking for more Hacker News cred so they wrote about why MongoDB sucks but in reality wanted to discuss their "human error" :p Nice marketing guys


I use Mongo for storing a fairly large amount of scraped data and it works great. Some of the data I store is results from bike races that I want to display on my website in a better way than it is displayed elsewhere. The columns change which makes Mongo a great fit, but the data is pretty static.

The real issue here is that it feels like the author has just 'discovered' these problems as if Mongo was hiding them all along and after a long time using the system he just found them. The reality is that all of the things he brings up are well documented. It is fascinating to me how people pick a buzzword database and don't bother to think about how their application might run poorly on it over time.

Every time there is a mongodb retrospective or experience report posted to HN, the top comment is one along the lines of "Well, these issues are all well documented."

Firstly, the fact that some drawback is well documented does not excuse the fact that it is a drawback.

Second, while some drawbacks are documented some implications of these drawbacks are nuanced and only become obvious with experience. A good example of this is the implications of "schemaless" databases (more accurately: databases that do not check data against a schema). Not having to migrate tables is a boon for lots of development. It's also a giant pain if it turns out that bugs cause data integrity issues.

Third, this experience report is really useful since poorly structured scrape data is one of the areas that I would have considered to be ideal for mongodb.

Most people don't have perfect foresight. I don't fault the author on his lack of omniscience with respect to how mongodb would turn out for them. His original reasoning (given in paragraph 1, sentence 1) does not seem stupid.

Got a link? I can't get enough bike.

People aren't complaining just to complain. When you can't even ctrl-c out of the shell there is a huge issue. It's a known bug, around since v1.8, 'minor' priority. Yeah, thanks for trapping me in the shell.

Apparently this is the ticket being referred to:


Ahh yes, the fun "Control-Z-kill-pid" trick. I do enjoy it so.

I read the whole post waiting to see what they ended up using as we are having similar issues, only to find that it's another post I have to wait for..

We went with HBase. Cassandra would have been suitable too, but we already use Hadoop for data processing so it was a natural choice within the infrastructure ecosystem. We will write a followup about that.

Clouderan here! Glad to hear you guys went with HBase, I'm looking forward to your follow up post. Will you detail your key design / architectural setup?

Did you guys roll your own HBase environment or did you go with the CDH? If you're using the CDH version and have any questions, feel free to shoot an email to cdh-user.

We are using CDH4.2 and have had a very positive experience so far.

Cloudera has in fact been an inspiration for us to follow, you guys have really struck the right balance between open source and commercial support. We follow the same philosophy with Scrapy (an open source web crawling framework), as you do with Hadoop and its ecosystem.

That's really awesome to hear, thanks for your kind words. I'm looking forward to the follow up blog, depending on your key design you may be able to take advantage of Impala for ad-hoc queries using SQL.

Based on their use case, I'd expect they went with either HBase or Cassandra. I'm quite partial to HBase, it's insanely scalable and has a lot of pretty amazing features, but at the cost of knowing exactly what you want to do with your application beforehand.

I'm not too familiar with Cassandra, but the scalability of an HBase table is almost entirely dependent on your key design. Judging from their use case and requirements, they would likely use an incremental key design, which would allow for super fast range scans. Of course, this leads to region server hotspotting, which may or may not be a big deal to them.
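To illustrate the hotspotting trade-off, here is a toy simulation. Nothing in it is HBase API: the split points, `NUM_BUCKETS`, and the modulo-based salt are assumptions chosen to keep the example deterministic. Regions are modelled as key ranges separated by split points, and each write goes to the region whose range contains its key.

```python
import bisect
from collections import Counter

NUM_BUCKETS = 4  # hypothetical number of salt buckets


def sequential_key(epoch):
    return f"{epoch:010d}"


def salted_key(epoch):
    # Prefix with a bucket so consecutive epochs land in different key ranges.
    return f"{epoch % NUM_BUCKETS}-{epoch:010d}"


def region_for(key, split_points):
    """Index of the key range (region) that would receive this key."""
    return bisect.bisect_right(split_points, key)


epochs = range(1368536860, 1368536860 + 1000)

# Split points chosen from older data: every new sequential key sorts after
# the last one, so all writes hit the final region.
sequential_splits = ["1368000000", "1368300000"]
sequential_load = Counter(
    region_for(sequential_key(e), sequential_splits) for e in epochs
)

# With salted keys and one region per bucket, writes are spread evenly.
salted_splits = ["1-", "2-", "3-"]
salted_load = Counter(region_for(salted_key(e), salted_splits) for e in epochs)
```

The price of salting is that a time-range read becomes one scan per bucket, merged by the client, instead of a single contiguous scan.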

Look into Cassandra. Very fast writes and scales linearly (although there is some elbow grease involved in keyspace distribution). Our analytics platform guys are very happy with it, after flirting with several other options (including big, beefy RDBMS). Great for large, flat, denormalized tables.

Just a plain old RDBMS?

Me too, except I skipped straight to the end :)

Why do you people hate MongoDB? It is great for getting started with NoSQL, and for new little projects.
