Failing with MongoDB (schmichael.com)
154 points by lenn0x 1268 days ago | 122 comments



Disclosure: I hack on MongoDB.

I'm a little surprised to see all of the MongoDB hate in this thread.

There seems to be quite a bit of misinformation out there: lots of folks seem focused on the global R/W lock and how it must lead to lousy performance.

In practice, the global R/W lock isn't optimal -- but it's really not a big deal.

First, MongoDB is designed to be run on a machine with sufficient primary memory to hold the working set. In this case, writes finish extremely quickly and therefore lock contention is quite low. Optimizing for this data pattern is a fundamental design decision.

Second, long running operations (i.e., just before a pageout) cause the MongoDB kernel to yield. This prevents slow operations from screwing the pooch, so to speak. Not perfect, but smooths over many problematic cases.

Third, the MongoDB developer community is EXTREMELY passionate about the project. Fine-grained locking and concurrency are areas of active development. The allegation that features or patches are withheld from the broader community is total bunk; the team at 10gen is dedicated, community-focused, and honest. Take a look at the Google Group, JIRA, or disqus if you don't believe me: "free" tickets and questions get resolved very, very quickly.

Other criticisms of MongoDB concerning in-place updates and durability are worth looking at a bit more closely. MongoDB is designed to scale very well for applications where a single master (and/or sharding) makes sense. Thus, the "idiomatic" way of achieving durability in MongoDB is through replication -- journaling comes at a cost that can, in a properly replicated environment, be safely factored out. This is merely a design decision.

Next, in-place updates allow for extremely fast writes provided a correctly designed schema and an aversion to document-growing updates (e.g., $push). If you meet these requirements -- or select an appropriate padding factor -- you'll enjoy high performance without having to garbage collect old versions of data or store more data than you need. Again, this is a design decision.
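To make the padding-factor trade-off concrete, here is a toy model in Python. It is only a sketch and not MongoDB's actual storage code: the names `Slot`, `allocate`, and `update` are hypothetical, and sizes are plain byte counts.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Slot:
    capacity: int  # bytes reserved for this document (hypothetical model)
    used: int      # bytes the document currently occupies


def allocate(doc_size: int, padding_factor: float = 1.0) -> Slot:
    """Reserve doc_size * padding_factor bytes for a new document."""
    return Slot(capacity=int(doc_size * padding_factor), used=doc_size)


def update(slot: Slot, new_size: int, padding_factor: float = 1.0) -> Tuple[Slot, bool]:
    """Return (slot, moved). The document is rewritten in place while it still fits."""
    if new_size <= slot.capacity:
        slot.used = new_size
        return slot, False
    # The document outgrew its slot: it has to be relocated to a fresh, larger one.
    return allocate(new_size, padding_factor), True
```

In this model, a 100-byte document allocated with a padding factor of 1.5 can grow to 150 bytes before it has to move; with no padding, any growth at all forces a relocation.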

Finally, it is worth stressing the convenience and flexibility of a schemaless document-oriented datastore. Migrations are greatly simplified and generic models (e.g., product or profile) no longer require a zillion joins. In many regards, working with a schemaless store is a lot like working with an interpreted language: you don't have to mess with "compilation" and you enjoy a bit more flexibility (though you'll need to be more careful at runtime). It's worth noting that MongoDB provides support for dynamic querying of this schemaless data -- you're free to ask whatever you like, indices be damned. Many other schemaless stores do not provide this functionality.

Regardless of the above, if you're looking to scale writes and can tolerate data conflicts (due to outages or network partitions), you might be better served by Cassandra, CouchDB, or another master-master/NoSQL/fill-in-the-blank datastore. It's really up to the developer to select the right tool for the job and to use that tool the way it's designed to be used.

I've written a bit more than I intended to but I hope that what I've said has added to the discussion. MongoDB is a neat piece of software that's really useful for a particular set of applications. Does it always work perfectly? No. Is it the best for everything? Not at all. Do the developers care? You better believe they do.

-----


Why is a database that fails so easily and most of the time even loses data so popular? Is it really all just a huge marketing budget?

-----


There are tons of reasons for that. Let me pull some of them from my butt:

Reason #1: Devs aren't Ops.

Reason #2: Devs need something new on their resume.

Reason #3: Certain types of Devs read blogs, get excited, skip the scientific mumbo-jumbo, and take the blogs as _the_ source of truth.

Reason #4: It's easy to bootstrap (schemaless, etc) your weekend project. Dealing with DB apparently is tedious for devs.

I'm sure others can add more...

Let me feel your love HN-ers ;)

-----


Why do I get the feeling that you're an op and look down on development people? If that's really true, try to start developing some project and see how you like frequent schema changes, trying to synchronise schemas with peers, resolving relation issues when merging features, etc. On the other hand if you abstract your interaction with data enough, you can change the whole backend later once it's stable and not care about it up-front.

What I hear you saying is unfortunately - it's worse for ops, so no one should use it.

-----


> On the other hand if you abstract your interaction with data enough, you can change the whole backend later once it's stable and not care about it up-front.

Have to disagree with this. By forcing yourself to work with a data layer so abstracted that you can't even reference whether you're dealing with a JSON document or a set of twelve joinable tables, you're going to write the most tortured and inefficient application. Non-trivial applications require leaky abstractions.

-----


And you lose the ability to do data constraints well. Put clearly: Declarative data constraints (including referential integrity, but also check constraints and the like) are the single most important features of RDBMS's for most applications.......
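As a rough illustration of what "declarative" buys you, here is a minimal sketch using Python's built-in `sqlite3` module as a stand-in for any RDBMS. The `customers` and `orders` tables and the `try_insert` helper are made up for the example.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory database whose schema carries the rules."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript("""
        CREATE TABLE customers (
            id    INTEGER PRIMARY KEY,
            email TEXT NOT NULL UNIQUE
        );
        CREATE TABLE orders (
            id          INTEGER PRIMARY KEY,
            customer_id INTEGER NOT NULL REFERENCES customers(id),
            quantity    INTEGER NOT NULL CHECK (quantity > 0)
        );
    """)
    return conn


def try_insert(conn: sqlite3.Connection, sql: str, params: tuple) -> bool:
    """Run one insert; return False if the database rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False
```

An order that points at a nonexistent customer, or one with a quantity of zero, is refused by the database itself, with no validation code in the application.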

-----


On the other hand, if you abstract your interaction with data enough, you can change the whole backend later once it's stable and not care about it up-front.

This is a conception of data that is more true in theory than it is in practice. In practice, if you want to query your data efficiently, you'll need to worry about how it's stored. You'll have to worry about the failure cases.

Of course, it is definitely application dependent. If you're just writing a Wordpress-replacement, you can probably choose whatever data store you want and just write an abstraction layer on top of it (especially if you don't care about performance). On the other hand, if you're looking at querying and indexing terabytes (or more) of data, you'll have to work very closely with your data store to extract maximum performance.

-----


"If that's really true, try to start developing some project and see how you like frequent schema changes, trying to synchronise schemas with peers, resolving relation issues when merging features, etc. On the other hand if you abstract your interaction with data enough, you can change the whole backend later once it's stable and not care about it up-front."

I can sort of sympathize with this a little. I used to use MySQL for schema prototyping and then move stable stuff to PostgreSQL back when PostgreSQL lacked an alter table drop column capability.

However today, this is less of a factor. Good database engineering is engineering. It's a math intensive discipline. Today I work often with intelligent database design approaches, while trying to allow for agility in higher levels of the app.

Don't get me wrong, NoSQL is great for some things. However it is NEVER a replacement for a good RDBMS where this is needed.

-----


> If that's really true, try to start developing some project and see how you like frequent schema changes, trying to synchronise schemas with peers, resolving relation issues when merging features,

How does something like MongoDB actually help with this, though? Certainly a lack of a schema lets you be more nimble in changing it, but you still have to write code to handle whatever schema you decide rather than letting a battle tested RDBMS handle it. I think NoSQLs have their uses but not forcing correctness on your data as a feature is not one of them. But I also believe in static typing.

-----


I've been a dev since I graduated. During my internships I was everything else (dev, QA, test-automation developer, tools dev, build engineer, integration engineer, etc).

Are you saying that Rails schema migration can only solve 10% of your migration needs? That kinda sucks, bro.

-----


Are you saying that Rails schema migration can only solve 10% of your migration needs?

Rails is a bad example here as their primitive migration-system still forces the developer to write those stupid migrations by hand.

Django/South just auto-generates them which removes a huge chore from the daily development workflow.

-----


That's excellent! I think it's time for me to learn some Django.

-----


Migration tools are great once you stop changing the schema very often like it happens when a project starts. This of course depends on the project... if you can have a full design from the start, it's probably going to work too. If you don't know the exact requirements or way to get there - not so much.

-----


If you change your schema once every 2-3 days, something is wrong with whoever is leading the software project. That's like writing software with zero planning or no knowledge of the problem domain.

I don't care if it is a startup or not. Come up with a very simple idea, draw the models in ER diagram, implement that stuff.

It's very hard to imagine that tomorrow suddenly all relationships need to be changed. Even if that is the case, scrap your Repository/Entity model and start from the beginning.

Nothing can help you much if the fundamentals are wrong.

-----


Everything else in a software project changes frequently, especially during the early days. "Fundamentals" don't help you know more ahead of time and it's really nice to be able to quickly adjust when you come across something you hadn't anticipated.

I tried building a small side project with Postgres about 8 months ago (after not really doing rdbms stuff for 18 months before) and was amazed at how inflexible it felt, and how much frustration used to seem normal.

-----


That's what prototyping and peer review are for,

I write up a schema, send to everyone else on the team, get feedback. If users are involved get it from them too.... take in all the feedback, write up a new draft, wash, rinse, repeat until running out of shampoo (i.e. feedback).....

Once things are pretty stable, do a prototype, address any oversights, do the real thing.

It's not rocket science.

-----


Could you elaborate please? What were you doing and what made it feel frustrating and inflexible exactly?

-----


At a startup, changes of mind are so normal that it's far better to have the tools to implement them quickly than to plan/document them well. The only documentation is the general gist behind the database.

If it prototypes well, then further refine it with ER diagrams for future maintenance.

Why is planning everything without validation better than above?

-----


Disagree here. Figure an hour of planning saves 10 hrs dev time and 100 hrs bugfixing.

That doesn't mean spending months planning. It does mean doing your best to plan over a few days, then prototype, review, and start implementing. If things change, you now have a clearer idea of the issues and can better address them.

The worst thing you can do is go into development both blind and without important tools you need to make sure that requirements are met--- tools like check constraints, referential integrity, and the like.

-----


Small to medium changes? Find better migration tools.

Large changes? Throw away and start over; tools like Rails can help you get up and running really quickly.

I rarely see people go back and fix the mess.

-----


To be frank, you're talking out of your ass.

Maintenance takes up about 50% of all IT budget [1]. Most individual pieces of software will spend 2-6 times (considering the average life cycle of an in production software product to be 2-4 years) more money on being maintained than being developed.

Data migration is a massive problem for any organization with data sets at any scale. RDBMS, in general, has gotten in the way of those migrations. People aren't looking at NoSQL just because they cannot sit still but instead are looking to find a better experience with handling data.

I'm not sure if NoSQL is the right answer to that but let's give it a chance and see what happens when people are migrating MongoDB data in 3-4 years.

[1] http://www.zdnet.com/blog/btl/technology-budgets-2010-mainte...

-----


Integration issues are best handled with good APIs. Migration issues are a bigger problem, but one thing that good use of an RDBMS gives you is the ability to ensure your migrated data is meaningful. Not sure you can do that with key-value-type stores.

-----


If you don't know the requirements to the point that you can't even design a data structure for your project... A schemaless database is NOT going to save you.

-----


Ever hear the term "exploratory programming"?

-----


Right there with you. The excuses I hear for not wanting to use a good 'ole RDBMS just do not make sense to me sometimes. CREATE TABLE too hard? Time consuming? Difficult?

Those who do not study the history of databases are doomed to repeat it. Soon we'll add back row-level write locks, transaction logging, schemas, multiple indexes and one day they wake up with MongoSQL.

-----


You say that like it's a failure on the devs' part, but that's kind of like blaming regular users for not switching to Linux because they don't like editing network configuration files.* It's masking the problem that there are real developer-friendliness issues with the existing databases. And taunting users will not get them to switch back.

* but then Linux distros get network autoconfiguration and suddenly it's obvious that it was the right solution all along.

-----


This is exactly why I do not believe that what we do is engineering. Engineers do not refuse to use proven techniques and technologies because they are "too hard" and instead use easier methods that have serious shortcomings.

-----


How about the freedom/need to experiment?

If MongoDB is popular, it's because it has some compelling features/arguments (I'm not sure of which ones exactly). But classical RDBMS seem to be trying to close the gap.

-----


If you're taking your time to design some important large scale system, using proven technologies is easier overall with higher up front costs.

Prototypes are another story. You'd optimize against mental overhead.

-----


Every major RDBMS has both command-line and GUI interfaces to create and change schemas. What exactly is missing in the dev-friendlyness department?

-----


The effort required to get something like Postgres running in the first place. With Redis and CouchDB, it's just `sudo pacman -S redis|couchdb`, maybe edit the config file, `sudo rc.d start redis|couchdb`. With Postgres you have to create databases, users, etc. etc. While all this stuff is critical in production, it's really not something that you should have to do to just code some stuff.

-----


Well, this is completely bogus. Setting up a new user on PostgreSQL takes me roughly 30 seconds. Now if you use the stock system-db user mapping for dev environment, it's something you do only when you install your system. Mine has been installed for 3 years now.

-----


The parent is referring to a new user trying to set up postgres.

And in the above Linux networking example, this response is analogous to the person who says "But I wrote my own scripts to fix my ethernet configuration, I don't see what the big deal is!"

Edit: That came off a bit harsh. I just don't think the fact that it's not a hassle for us means it's not a hassle for other people. And I don't think that mongo's easy configuration means that it should be used over postgres in all cases, just that postgres should take notice that people like mongo's easy configuration and step up their game in that department.

-----


yum install postgresql-server

service postgresql initdb

service postgresql start

sudo -u postgres psql .....

postgres=#.....

If all you are doing is writing code, that's sufficient on an rpm based platform. On Debian, use apt-get instead. On Windows use the 1 click installer.

PostgreSQL comes with a default user and a default database, so the criticism on this thread is a bit..... incorrect.

Now it is true you have to set up a system user if you are compiling PostgreSQL from source. However, that's really optional in most cases unless the code you are writing is, well, a patch to PostgreSQL.......

-----


Exactly what I meant. In debian, there is no "initdb" step and the daemon is started when it's installed.

Anyway, people who call themselves ops or devs ought to be able to do that sort of three-line operation.

As I said before, it happens once in your product's dev/op lifetime.

-----


It didn't come off as harsh.

I use no script whatsoever. For a dev environment, I go like this:

  $ sudo su - postgres
  $ createuser -s nakkiel

And again, no further operation is required until you install a new OS on your machine.

I really think we're missing the point anyway. The complicated things, and Mongo's advantage, only really come later when it's time to create a database and tables and columns and indexes..

-----


Do you have any specific suggestions?

Postgres installs with a default superuser ("postgres" on ubuntu) and a default database (also "postgres"), so that's not the real problem.

Installing software via a package system is trivial and required for any system, so that can't be the issue.

The package distribution invariably chooses a default location for your data and initializes it, so that requires no additional effort at all.

You have to start and stop the service, but the package distribution should make that trivial, as well ("service postgresql start|stop" on ubuntu). And again, I don't see any difference here.

So the only possible area I see for problems is connecting your first time. This is somewhat of an issue for any network service, because you need to prevent anyone with your IP from connecting as superuser. The way ubuntu solves this is by allowing local connections to postgres if the system username matches the database username. So, you have to "su" to the user "postgres", and then do "psql postgres". Now you're in as superuser.

The default "postgres" superuser doesn't have a password (default passwords are bad) and only users with a password can connect over the network. But, you can add a password (which then allows that user to connect over a network), or create new users. If the new username matches a system username, that user can connect locally. If you gave the new user a password, they can connect over a network.

Do you see any fat in the above process that can be streamlined without some horrible side-effect (like allowing anyone with your IP to connect as superuser)? I'm serious here -- if you do see room for improvement, I really, really, really, want to hear what the sticking point is so that it can be fixed.

-----


Well, mostly the migration part that Devs don't want to deal with.

Most modern languages have migration utilities (Flyway for Java, Rails migration for Ruby, Python should have their de-facto migration for Django by now or else they fail hard, and JS... well.. let's wait until Node.js users decide to use an RDBMS).

-----


    Python should have their de-facto migration for Django by now or else they fail hard

Yep: http://south.aeracode.org/

-----


First of all, RDBMS's are not perfect for all jobs. They have serious shortcomings in some areas (semi-structured and unstructured data for example)

A lot of other stuff can be stored in an RDBMS but isn't really optimal for it. Ideally, hierarchical directory servers (LDAP) don't run directly off a relational db.

So there are places for other forms of stores, from BDB to XML, but these cannot and should not replace RDBMS's for most critical tasks.

(Also there is room for real improvement in certain areas of relational constraints in RDBMS's today, but NoSQL moves the wrong direction there.)

-----


Well here's a fuck you back from a dev: my time is finite and everyone wants a piece of it; If I can save an hour a day by never having to think about my database? If I can shave a week or two of labor off a project?

It's really easy to work with. This is why people keep using it.

-----


Back at you, boy. I'm a dev. I hate dealing with other devs who waste my time just because they aren't in love with RDBMS and decide to write more code and add more infrastructure components (that includes message queues, unless you absolutely have no choice).

My time is finite. Ops time is finite. Obviously you decided to dick around with mine and Ops. How bout I send you to the QA department to write automation and software tools so you don't dick around with production code?

You can write with any language and any storage systems you'd like there.

-----


Let me guess, you're the guy who makes sure there's a "Senior" in his title and you use django because the docs are so great.

I'm sorry you work with incompetent people. Sounds like you're in a cubicle farm somewhere. While you're in a meeting swinging your seniority around, I'll be over here shipping products faster than your team.

-----


First, I'm no senior. My title is simply "developer". I work with people that share the same opinions. Most of us are on the same page and that's how we build our culture. The ones that aren't don't last long.

Second, I'm not using Django.... and what's wrong if I do?

Third. I respect people around me. In return, they respect each other, so we don't throw around the word "incompetent" or think that we're better than anybody else.

No pirates. No ninjas. No rockstars. No racers (dhh?) as well. Just grown-ups doing their job with a bit of love, passion, and respect. All balanced.

Fourth. I have no cubicle. I work in an open space and I love it. I don't need my special office (I had one a few years ago and it sucks).

Fifth. My project manager attends meetings and delivers mostly good news to us. He's the best PM I've been with (so far). If we have a meeting, that's usually when shit hits the fan and we need to have an honest conversation. Other than that, e-mails are sufficient.

Sixth. Keep throwing the love...we need more

-----


What about all the time you'll waste debugging your app because you can't make good assumptions about the structure of your data?

-----


Isn't that the same argument some use against dynamic typing in programming languages?

-----


One reason Moose rocks for Perl is it gives you the ability to get some stronger type checking in the dynamically typed language.

So yes, I think dynamic typing can lead to debugging nightmares. It just so happens that other factors often make up for this.....

-----


10gen has focused strongly on ease of adoption, which seems like the highest priority of MongoDB at this point. From what I can tell, the idea is to get everyone using it, and then "scale" it once you've got people willing to pay out $ for fixes, but sometimes bad decisions made early on (like the global locks and in-place updates) are harder to change than originally thought.

-----


Yes, we have been using mongo since 1.6.3. Reliability, locking, and data security (not losing data) are never first priorities on their to-do list; they just push new features and are busy doing marketing propaganda about how web-scale mongodb is (which is fake). I submitted a jira issue about losing data when syncing a slave. It's already been 3+ months, and all they did was let me try the new releases to see if they fix the problem. I tried the latest 2.0.1 release, and it still causes data loss. Every time I sync a new slave, I pray to god, hoping not to lose data.

How come a DB loses data so frequently and still calls itself web-scale? It just breaks when you need scale!

Auto-sharding is also super unreliable: we tried it once and it failed, and now we are using a lib that does application-level sharding. We are also considering moving to another database that at least knows that not losing data is the first and most important thing for a DB.
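For readers unfamiliar with the term, application-level sharding means the client code, not the database, decides which server holds each record. The sketch below shows one common approach (hash the key, take it modulo the shard count); it is only an illustration, not the library mentioned above, and the `shard_for` helper and shard names are hypothetical.

```python
import hashlib
from typing import List


def shard_for(key: str, shards: List[str]) -> str:
    """Pick a shard for `key` by hashing it; the same key always maps to the same shard."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    bucket = int.from_bytes(digest[:8], "big") % len(shards)
    return shards[bucket]
```

The application then opens its connection to whichever server `shard_for` returns. A known weakness of plain modulo hashing is that changing the number of shards remaps most keys, so data has to be moved when the cluster grows.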

Someone summarized the issues of mongodb, http://pastebin.com/raw.php?i=FD3xe6Jt , and we experienced most of the problems in the article. So just a reminder for anyone who wants to create a serious product using mongodb: read the article. It's not FUD; it's so true that I wish I had read it 1 year ago, so we wouldn't have to move so much legacy data to a new database solution.

-----


Not a bad business strategy. Kinda like MySQL back then right?

-----


Sadly, it seems that "give them free crap and then charge for fixes" is a very common business model in the open source world.

-----


I am really happy that this does not exist in the PostgreSQL community despite many of the core developers being consultants who live off solving their clients' problems with PostgreSQL. Maybe this is because it is a community project with no single company in control of it.

-----


Maybe this is because it is a community project with no single company in control of it.

Yes. In an organization like Postgresql, someone's contribution is measured by how much they contribute to the code. In an organization like Mysql, someone's contribution is measured by how much they contribute to the bottom line.

-----


It also started out as a research project, so for the first 10 years or so of its life, the main incentives for the developers were whether it was producing interesting research papers, rather than number of users.

-----


Unfortunately solid software projects do not get much love these days.

SQLite. FreeBSD. OpenBSD. PostgreSQL. Python (there are some, but Ruby took the thunder).

-----


Better than asking for a big pile of money, giving back crap, and charging way more for fixes, isn't it? :)

-----


Yes. But I think asking happy users to donate ultimately produces better code than asking unhappy users to pay for support.

-----


So true. So true. I wish happy users would donate more as well.

Unfortunately, when happy users get used to the culture of free and good-quality software, they start to have a sense of entitlement (instead of donating). If the software doesn't provide exactly what they want, they start to swear and whine instead of being... calm and helpful.

-----


I think a big part of the entitlement problem is not being clear about the business model, or positioning oneself to take advantage of it.

One key thing is, I think, to advertise services and ways of making money. IOW, giving people the option to get new features, etc. is an important thing.

There is a lot of solid FOSS out there: PostgreSQL, BSD, Linux, CUPS, and more all come to mind. These often are less sexy than heavily marketed, inferior counterparts. But these all also have solid business models attached.

I highly recommend that folks who start open source software look around at business models surrounding the better open source products and see what they can do to capitalize on that.

-----


Long run, one wants happy users to donate sales/marketing effort with recommendations, and pay for development of new features.

I'd MUCH rather be paid to produce new features than fix bugs.

-----


I have been asking that too, and I concluded that it is due to dishonest marketing. Up until a couple of months ago they basically shipped a database product with single-server durability disabled. That fact should have been written as a bright, flashing, red-letter warning on their front page; it wasn't. So it made for very fast benchmarks, because everyone benchmarks for speed and not many benchmark for failure.

-----


The data model contributes to its popularity. A document store with indexes on document fields is very convenient for several types of applications.

-----


It's interesting that couchdb gets little love (as evidenced by google trends), but it has document storage by index, easy enough to install, copy on write so has no global lock, sharding with bigcouch, and all client access is entirely REST... it may be couch is a little hard to grok, I dunno.

-----


It isn't packaged and marketed as cleanly. It takes hunting and learning on your own to get good with couch, and map reduce is nowhere near as easy to get started with as mongo queries (although the addition of unql may improve things next year).

Also there is no single, central steward and authority on couch. All of this stymies traction and confidence even though the tech is great.

-----


This is true. I think couchdb tries to be all things to all people rather than just focusing on being a great data store. It's a database like mongo, it's a mobile database like sqlite, you can use it to host apps with couchapp, you can use it as a map-reduce cluster like hadoop, etc. I would rather have just a solid no-sql database that solves that problem. I like couch better than mongo for the reasons I mentioned, but I know couch also has problems and still haven't found a good third option.

-----


Which all sounds rather like Postgres. It's harder to use because you have to know what you're doing, and it's not as popular even though it's better on some axes that are significant if you're building something solid.

-----


Couch doesn't have great documentation, and doesn't have official native client drivers. Oh, and it's slow (though you can tune it, and it doesn't crumble under load).

-----


You can do exactly the same document store with indexes on any RDBMS.

-----


Yup. See http://bret.appspot.com/entry/how-friendfeed-uses-mysql for an example. MongoDB is more convenient to program because it's all built-in.
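A minimal sketch of the general idea, loosely modeled on the approach in the linked post: store each document as an opaque blob in one table, and maintain a separate table per field you want to query on. SQLite, the JSON encoding, and the table and function names here are all assumptions made for the example.

```python
import json
import sqlite3
import uuid
from typing import List


def open_store() -> sqlite3.Connection:
    """Create the blob table plus one index table for the `user_id` field."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE entities (
            id   TEXT PRIMARY KEY,
            body TEXT NOT NULL            -- the whole document, serialized as JSON
        );
        CREATE TABLE index_user_id (
            user_id   TEXT NOT NULL,
            entity_id TEXT NOT NULL REFERENCES entities(id),
            PRIMARY KEY (user_id, entity_id)
        );
    """)
    return conn


def put(conn: sqlite3.Connection, doc: dict) -> str:
    """Store a schemaless document and index it by user_id if it has one."""
    entity_id = uuid.uuid4().hex
    with conn:  # one transaction covers the blob and its index row
        conn.execute("INSERT INTO entities (id, body) VALUES (?, ?)",
                     (entity_id, json.dumps(doc)))
        if "user_id" in doc:
            conn.execute("INSERT INTO index_user_id (user_id, entity_id) VALUES (?, ?)",
                         (doc["user_id"], entity_id))
    return entity_id


def find_by_user(conn: sqlite3.Connection, user_id: str) -> List[dict]:
    """Look up documents through the index table, then decode the blobs."""
    rows = conn.execute(
        "SELECT e.body FROM index_user_id AS i "
        "JOIN entities AS e ON e.id = i.entity_id "
        "WHERE i.user_id = ?", (user_id,)).fetchall()
    return [json.loads(body) for (body,) in rows]
```

The extra work compared with MongoDB is visible here: every indexed field needs its own table and its own maintenance code in `put`, which is the part Mongo gives you built in.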

-----


If I never expect the dataset to grow past 1GB and a single server, why would I use anything else? It doesn't really fail - none of the issues described were "failures" really. [edit: just to be clear, it didn't crash and burn, I don't think performance issue == failure] The data loss was not confirmed either: "There appears to be some data loss occurring" and in small deployments you can just use transaction log.

There's no other project I know of, which provides: schemaless json documents, indexing on any part of them, server-side mapreduce, lots of connectors for different languages, atomic updates on part of the document. If there is one and it's better than mongo, I'd switch any moment.

-----


>> "It doesn't really fail - none of the issues described were "failures" really."

These absolutely were failures.

The author listed several instances in which the database became unavailable, the vendor-supplied client drivers refused to communicate with it, or both. Some of these scenarios included the primary database daemon crashing, secondaries failing to return from a "repairing" to an "online" state after a failure (and unable to serve operations in the cluster), and configuration servers failing to propagate shard config to the rest of the cluster -- which required taking down the entire database cluster to repair.

Each of the issues described above would result in extended application downtime (or at best highly degraded availability), the full attention of an operations team, and potential lost revenue. The data loss concern is also unnerving. In a rapidly-moving distributed system, it can be difficult to pin down and identify the root cause of data loss. However, many techniques such as implementing counters at the application level and periodically sanity-checking them against the database can at minimum indicate that data is missing or corrupted. The issues described do not appear to be related to a journal or lack thereof.
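The counter technique mentioned above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `expected` is whatever tally the application keeps of its own successful writes, and `count_in_db` is a hypothetical callable that wraps the real driver's count query for a collection.

```python
from typing import Callable, Dict, Tuple


def find_mismatches(
    expected: Dict[str, int],
    count_in_db: Callable[[str], int],
) -> Dict[str, Tuple[int, int]]:
    """Compare application-side counters with what the database reports.

    Returns {collection: (expected, actual)} for every collection whose
    counts disagree; an empty dict means nothing looks missing.
    """
    mismatches = {}
    for collection, want in expected.items():
        got = count_in_db(collection)
        if got != want:
            mismatches[collection] = (want, got)
    return mismatches
```

Run periodically, a check like this cannot say why documents went missing, but it turns "there appears to be some data loss" into a concrete number per collection.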

Further, the fact that the database's throughput is limited to utilizing a single core of a 16-way box due to a global write lock demonstrates that even when ample IO throughput is available, writes will be stuck contending for the global lock, while all reads are blocked. Being forced to run multiple instances of the daemon behind a sharding service on the same box to achieve any reasonable level of concurrency is embarrassing.

On the "1GB / small dataset" point, keep in mind that Mongo does not permit compactions and read/write operations to occur concurrently. As documents are inserted, updated, and deleted, what may be 1GB of data will grow without bound in size, past 10GB, 16GB, 32GB, and so on until it is compacted in a write-heavy scenario. Unfortunately, compaction also requires that nodes be taken out of service. Even with small datasets, the fact that they will continue to grow without bound in write/update/delete-heavy scenarios until the node is taken out of service to be compacted further compromises the availability of the system.

What's unfortunate is that many of these issues aren't simply "bugs" that can be fixed with a JIRA ticket, a patch, and a couple rounds of code review -- instead, they reach to the core of the engine itself. Even with small datasets, there are very good reasons to pause and carefully consider whether or not your application and operations team can tolerate these tradeoffs.

-----


Just to be 100% clear -- so people don't misunderstand your explanation of Mongo's compaction: Mongo does have a free space map that it uses to attempt to fit new data or resized documents into "holes" left by deleted data. However, compaction will still eventually have to be run as the data will continue to fragment and eventually things get bad.
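A toy model of why hole reuse still leaves fragmentation behind. This is a first-fit sketch under my own assumptions, not Mongo's actual allocator; `DataFile` and its fields are made up for illustration.

```python
class DataFile:
    """Toy data file: a free space map of holes plus an append pointer."""

    def __init__(self):
        self.holes = []   # free space map: list of (offset, size)
        self.end = 0      # current file size in bytes
        self.docs = {}    # doc_id -> (offset, size)

    def insert(self, doc_id, size):
        for i, (offset, hole_size) in enumerate(self.holes):
            if hole_size >= size:                 # first hole that fits
                leftover = hole_size - size
                if leftover:
                    self.holes[i] = (offset + size, leftover)
                else:
                    del self.holes[i]
                self.docs[doc_id] = (offset, size)
                return
        self.docs[doc_id] = (self.end, size)      # no hole fits: grow the file
        self.end += size

    def delete(self, doc_id):
        self.holes.append(self.docs.pop(doc_id))

    def live_bytes(self):
        return sum(size for _, size in self.docs.values())

    def free_bytes(self):
        return sum(size for _, size in self.holes)


f = DataFile()
for i in range(10):
    f.insert(f"a{i}", 100)        # file is now 1000 bytes, fully used
for i in range(0, 10, 2):
    f.delete(f"a{i}")             # five 100-byte holes
for i in range(5):
    f.insert(f"b{i}", 60)         # each lands in a hole, leaving 40 bytes over
f.insert("big", 150)              # 200 bytes are free, but no single hole fits

print(f.end, f.live_bytes(), f.free_bytes())
```

The file ends up larger than the live data even though free space exists, because the free space is split into holes too small to reuse. Only a compaction pass gets those bytes back.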

-----


The data loss was not confirmed either: "There appears to be some data loss occurring"

Oh, this mystery is a failure all right, and even the most charitable interpretation would call it a misfeature.

-----


what do you think of redis? I feel the same way about Mongo for the most part, but have been considering switching.

-----


If you can model your data in redis data structures it is excellent. Keep in mind that there is no preferred mechanism for operating redis when data is larger than ram. There is vm and diskstore, both deprecated by antirez, and a focus on data sets that fit in ram.

If you can do both of those things, it is awesome.

-----


It has a pretty good user experience, except for all the details. But the model isn't bad; it should be learned from. On the other hand, there is no trade-off made by Mongo that I'm aware of that is not fundamentally unavailable to more mature projects in a tractable amount of engineering time, so the question comes down to "does Mongo shed its reputation for lulz soon enough" vs "do other projects witness and adapt".

Yet we've also seen in the past that shedding such a reputation is not strictly required to be popular. And marketing budgets do matter.

-----


Why is a database that fails so easily and most of the time even loses data so popular?

Perhaps because both of your premises are wrong? I've used Mongo for over a year now with ~1000 writes/sec and haven't seen any of these problems. I'm not saying they don't exist (some are confirmed bugs that have been fixed), but they're not nearly as prevalent as your 'Do you still beat your wife?'-style question implies.

-----


Not all data is important enough that small losses are unacceptable. Analytics data that can be inferred from other sources, for example. Furthermore MongoDB supports autosharding while most (all?) SQL databases do not.

-----


So what's the preferred alternative noSQL wise?

MongoDB is flaky. CouchDB is a maintainability nightmare, so I hear.

Riak? Cassandra? Or does everything else have some other equally huge down-side?

-----


They all have their warts. For every story like this, there are petabyte deployments of your favorite datastore that work fine.

For every "X sucks" article, there is a "Y is awesome".

In the NoSQL world the only way to choose is around the problems they solve... They are each specializing and optimizing for certain niches. Mongo is the most MySQL-esque, but doesn't do things that Redis, Couch or Cassandra do that you may need.

There is no clear winner (fortunately or unfortunately, depending on what you were hoping for).

-----


Have there been any new entrants in the last few years? Seems like innovation has stalled a bit and stabilization / improvement hasn't caught up yet.

-----


Good question -- as far as new, mature NoSQL solutions on the scene, I just became aware of OrientDB, which sort of baffles me with its functionality. It looks like this amazingly functional blend of MySQL and NoSQL: http://www.orientechnologies.com/orient-db.htm

Other than that, I actually think these solutions have been stabilizing exactly because of what you say: innovation is slowing down/stalling.

1-3 years ago the cool thing to do was store data different ways, now that we have all these solutions that people are ready to use in production, they are demanding more and more secure/safe functionality from them.

In the last year Redis added the append log and flushing to disk. CouchDB rewrote the replication code in the last release and has always had a wonderfully redundant and safe file mutation model (you can copy the DB file while in use and still get a safe snapshot), and MongoDB has been responding aggressively to crashes and corruption since 1.7 after all the single-server durability fiasco around 1.5/1.6 that had everyone up in arms.

These data stores are really brilliant pieces of code with some wonderful deployments to prove their worth.

There is still work to be done, sure, but I am not aware of glaring deficiencies in these systems like I used to be a year or more ago where you could point at "Oh, the XYZ bug might get you" -- that just doesn't seem to be happening anymore.

I don't know a whole hell of a lot about Cassandra (I am one of the few humans that still doesn't grok the data model easily) but I remember data recovery bugs from a year ago in the issue tracker that all got knocked out to the point that 1.x is looking like a really awesome release for them.

At this point, I think it just depends on what you need.

-----


DBMS can't lose data. This is supposed to be the first commandment of any storage system leaving performance, CAP, and all other considerations out of this.

-----


Yuuuup. If data gets into your DBMS then it can't get lost. Period. That level of fuckup is simply unacceptable.

-----


Orient looks cool. Thanks for the tip. It doesn't look like it will support any other languages though (like Python), which stinks.

-----


Bulbs (http://bulbflow.com) is a Python persistence framework that supports OrientDB, Neo4j, Dex, and any Blueprints-enabled graph DB.

-----


Cool, thanks.

-----


The problem with Cassandra is you never believe the responses to your queries..... until it is too late.....

-----


Uhh, I didn't realize that was a "thing" with Cassandra, is there a bug # to track for any issues specifically or are you saying that it has a history of reporting "Okey Dokey!" to a PUT when in reality it just exploded and died?

I've not messed with Cassandra in production so I haven't seen such a thing yet.

-----


I should have said "until it is too late.... and you have gone down in flames due to a trojan horse...."

Alternatively "I understand Cassandra is a Trojan. Can anyone confirm this?"

-----


Nobody reads Homer anymore.... Or even Virgil.....

-----


Many are moving to Neo4j, including some of the major social networks.

-----


Ahh good catch, I forgot about Neo4j. I'd like to see some non-graph deployments on it and see how it performs. From what I've seen with social/networked data models it looks very compelling.

Excited to see that DB get more and more traction.

-----


Probably because the quality of CS graduates has been so low in recent years. MongoDB = oh, shiny, fast.

-----


Disclosure: I wrote a product called Citrusleaf, which also plays in the NoSQL space.

My focus in starting Citrusleaf wasn't features, it was operational dependability. I had worked at companies that had to take their system offline when they had the greatest exposure - like getting massive load from the Yahoo front page (back in the day). Citrusleaf focuses on monitoring, integration with monitoring software, operations. We call ourselves a real-time database because we've focused on predictable performance (and very high performance).

We don't have as many features as mongo. You can't do a javascript/json long running batch job. We'll get to features.

The global R/W lock does limit mongo. Absolutely. Our testing shows a nearly 10x difference in performance between Mongo and Citrusleaf on writes. Frankly, if you're still doing 1,000 tps, you should probably stick with a decent MySQL implementation.

Here's a performance analysis we did: http://bit.ly/rRlq9V

This theory that "mongo is designed to run on in-memory data sets" is, frankly, terrible --- simply because mongo doesn't give you the control to keep you in memory. You don't know when you're going to spill out of memory. There's no way to "timeout" a page cache IO. There's no asynchronous interface for page IO. For all of these reasons - and our internal testing showing page IO is 5x slower than aio; the reason all professional databases use aio and raw devices - we coded Citrusleaf using normal multithreaded io strategies.

With Citrusleaf, we do it differently, and that difference is huge. We keep our indexes in memory. Our indexes are the most efficient anywhere - more objects, fea. You configure Citrusleaf with the amount of memory you want to use, and apply policies when you start flowing out of memory. Like not taking writes. Like expiring the least-recently-used data.

That's an example of our focus on operations. If your application use pattern changes, you can't have your database go down, or go so slowly as to be nearly unusable.
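The two out-of-memory policies mentioned above (refusing writes, or expiring the least-recently-used data) can be sketched roughly as follows. This is an illustration only, not Citrusleaf's implementation: `BoundedStore`, `max_items` and the policy names are hypothetical, and the budget is counted in items rather than bytes to keep it short.

```python
from collections import OrderedDict


class BoundedStore:
    """In-memory store with a fixed budget and a policy for when it fills up."""

    def __init__(self, max_items, policy="evict_lru"):
        self.max_items = max_items
        self.policy = policy              # "evict_lru" or "reject_writes"
        self.data = OrderedDict()         # ordered oldest-used to newest-used

    def get(self, key):
        value = self.data[key]
        self.data.move_to_end(key)        # mark as most recently used
        return value

    def put(self, key, value):
        if key in self.data:
            self.data[key] = value
            self.data.move_to_end(key)
            return True
        if len(self.data) >= self.max_items:
            if self.policy == "reject_writes":
                return False              # caller sees the refusal immediately
            self.data.popitem(last=False) # drop the least recently used entry
        self.data[key] = value
        return True


lru = BoundedStore(max_items=2)
lru.put("a", 1)
lru.put("b", 2)
lru.get("a")                  # "a" is now more recent than "b"
lru.put("c", 3)               # evicts "b"

strict = BoundedStore(max_items=2, policy="reject_writes")
strict.put("a", 1)
strict.put("b", 2)
accepted = strict.put("c", 3) # refused: the budget is full

print(list(lru.data), accepted)
```

Either way the behaviour at the memory limit is explicit and chosen by the operator, rather than whatever the page cache happens to do.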

Again, take my comments with a grain of salt, but with Citrusleaf you'll have better uptime, fewer servers, a far less complex installation. Sure, it's not free, but talk to us and we'll find a way to make it work for your project.

-----


Seems to me they used the wrong setup. They should have looked at a replica set setup with secondaries for reads, and sharding if they needed more write performance and nonblocking reads. That said, version 2 has fewer locking problems and I understand they are working on finer-grained locking.

-----


Sorry, this is a pretty poorly written blog post. We're definitely using sharding+replica sets.

Replication of any kind won't help you with a high write load as secondaries have to apply the same number of writes as primaries.
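A back-of-the-envelope sketch of that point, assuming writes are spread evenly across shards (the function name and the numbers are made up for illustration):

```python
def writes_per_node(total_writes_per_sec, shards, replicas_per_shard):
    """Writes each node must apply per second.

    A shard receives its slice of the total, and every member of that
    shard's replica set applies the whole slice, so the replica count
    does not appear in the result.
    """
    return total_writes_per_sec / shards


single = writes_per_node(9000, shards=1, replicas_per_shard=1)
replicated = writes_per_node(9000, shards=1, replicas_per_shard=3)
sharded = writes_per_node(9000, shards=3, replicas_per_shard=3)

print(single, replicated, sharded)
```

Adding replicas leaves the per-node write load where it was; only adding shards divides it.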

-----


They seem to be very aware of the problem and focused on solving it as soon as possible. I guess it's just a matter of time. Compared to how long it took MySQL to mature into a stable platform I've been pretty impressed at their responsiveness and quick improvements so far :).

-----


It seems from the comments in the post that 10gen went out of its way to be helpful in resolving the issues?

-----


Yes. The only thing that would be better is if these issues didn't exist to begin with.

-----


Sadly, MongoDB blows for actual usage. It locks, it's not crash-only, it has mutable data.

CouchDB is much better (you're as likely to lose data as with Postgres), but is potentially less efficient (no BSON).

-----


All I need is a schemaless version of postgres (with ACID-compliance and everything), does anyone know of one?

-----


http://www.postgresql.org/docs/9.0/static/hstore.html

-----


That's very useful, thank you!

-----


Keep in mind this isn't meant to be redis-on-postgresql: http://archives.postgresql.org/pgsql-performance/2011-05/msg...

-----


I'd appreciate if someone would submit this story for me.

http://pastebin.com/raw.php?i=FD3xe6Jt

-----


Wow there's a lot of Mongo hate in this thread all from one article. Yesterday MongoDB was the darling of HN and today it has to be defended from ridiculous claims. Why the mob attitude? Have you all had these issues?

-----


Maybe there's a niche for "PostgreNoSQL", a layer atop Postgres that you start using like a NoSQL solution. (Perhaps, it's string keys and JSON blob values.) It's not very efficient, except for simple keyed lookups, but it works enough for a quick start.

Then, as you use it, the system optimizes itself (or makes suggestions) based on actual access patterns. A subset of objects could be a formal, indexed table? Have it happen automatically or offer the SQL as a suggestion.
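A rough sketch of the starting point (string keys, JSON blob values). I'm using Python's built-in sqlite3 as a stand-in for Postgres so it runs without a server; the `docs` table and the `put`/`get`/`find` helpers are made-up names.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (key TEXT PRIMARY KEY, body TEXT NOT NULL)")


def put(key, doc):
    """Store a document as a JSON blob under a string key."""
    conn.execute(
        "INSERT OR REPLACE INTO docs (key, body) VALUES (?, ?)",
        (key, json.dumps(doc)),
    )


def get(key):
    """Keyed lookup: served by the primary key index."""
    row = conn.execute("SELECT body FROM docs WHERE key = ?", (key,)).fetchone()
    return json.loads(row[0]) if row else None


def find(field, value):
    """Field lookup: scans and parses every document, so it is slow by design."""
    return [
        key
        for key, body in conn.execute("SELECT key, body FROM docs ORDER BY key")
        if json.loads(body).get(field) == value
    ]


put("user:1", {"name": "ann", "plan": "free"})
put("user:2", {"name": "bob", "plan": "paid"})
put("user:3", {"name": "cat", "plan": "paid"})

print(get("user:2"), find("plan", "paid"))
```

The `find` scan is the part the self-optimizing layer would watch: if lookups by `plan` show up often, that field becomes a candidate for a real indexed column.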

-----


Conversely, you could have a NoSQL layer below Postgres, where PG stores and indexes metadata which tells it which, of many, small NoSQL dbs to find the actual data in. These data dbs then can be sharded/replicated across physical systems as you like. You lose some raw speed on reads, but avoid a global write lock and the system scales quite well. I've started playing around with such a system with https://github.com/cloudflare/SortaSQL
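The general shape of the idea, as a minimal sketch. This is not SortaSQL's code: plain dicts stand in for both the Postgres metadata table and the small data dbs, and `Router` is a hypothetical name.

```python
class Router:
    """A metadata index that records which small store holds each key."""

    def __init__(self, store_names):
        self.stores = {name: {} for name in store_names}  # stand-ins for the data dbs
        self.index = {}                                    # stand-in for the PG metadata table

    def put(self, key, value):
        name = self.index.get(key)
        if name is None:
            # New key: place it round-robin (an arbitrary choice for this sketch).
            names = sorted(self.stores)
            name = names[len(self.index) % len(names)]
            self.index[key] = name
        self.stores[name][key] = value    # the write touches only one small store

    def get(self, key):
        # Two hops: metadata lookup, then the store that holds the data.
        return self.stores[self.index[key]][key]


r = Router(["db0", "db1"])
for i in range(4):
    r.put(f"k{i}", i * 10)

print(r.index["k1"], r.get("k3"), len(r.stores["db0"]))
```

The extra hop in `get` is the read cost mentioned above. In exchange, each write only contends with other writes to the same small store.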

-----


IIRC, there are people talking directly to InnoDB (MySQL backend) using it as a NoSQL style DB. You don't however get SQL analysis, you're bypassing the SQL side of things.

-----


hstore?

-----


All the commercial DBs have similar issues. Just deal with them and go on.

-----


No, they do not. Some joke DBs had some issues back in the day (MySQL comes to mind) but issues of such importance were solved looong ago.

-----


No, all of them had, and most still do. People don't seem to realize how freaking old and complicated those things are.

-----


Ravendb (www.ravendb.net) is a solid competitor.

-----


"Raven is an Open Source (with a commercial option) document database for the .NET/Windows platform."

I'm not sure it's a competitor at all. RavenDB is a CouchDB clone for .Net that requires a commercial license for proprietary software.

-----


Which has a ton of magic baked into the driver making it unlikely you'll get your data back out via anything but .NET.

-----


Well put.

-----


Why the downvotes? Why would I bother mentioning alternatives next time, sheesh.

-----


If your data is easily modeled relationally, go relational. If you are going to change it constantly and it is not a natural fit for a relational model, Mongodb is worth a shot.

From this article, sounds like their data is pretty seriously relational.

Mongodb has been pushing the ops side of their product, but I can agree it has failings there. To me the advantage is the querying and the json style documents.

-----


I'm not sure you read the article fully, because relationships were never described in the article. Instead, it was high read/update load which caused problems.

Mongo, on paper, should be an ideal candidate for this job; but, due to complications with the locking model and with its inability to do online compactions, it's failing.

-----


Relation was a bad word choice, I meant easily modeled by a relational database system. Seems like your data can be modeled with fixed columns.

I had to model data with umpteen crazy relationships so we went with Mongodb. We did not have the high update issue or any locking issues. If one has a few large tables with fixed columns that can easily define the data, then relational DBs probably make more sense. But to your point, 10gen will not tell you that and the hype doesn't tell you that either.

-----



