In my opinion, there is but one feature that a database really must have: whatever data I write into it, if I don't delete it, I want to read it back - unaltered (preferably without needing at least three machines, but I'm willing to compromise).
Software that cannot provide this single feature just isn't something I would call a database.
Whether it's unsafe default configurations or just bugs, I don't care.
Between these two articles over the weekend and some from earlier, I personally don't trust MongoDB to still have that feature, and as such, it will take much more than one article with a strongly worded title to convince me otherwise.
I see where you are headed, but that is bad logic. It is like saying your car may catch fire at any time, or a meteorite may strike, so why bother wearing a seat belt?
Sorry, you cannot possibly defend unacknowledged writes as a default setting on a product that calls itself a database by saying you need three machines anyway.
By coming up with more arguments and excuses in defense of this design, you are actually making the product you are defending look worse.
I didn't mean to imply that. I do agree that your DB client should check for write errors if you care about your data, and that MongoDB should give their official drivers safer defaults.
But the poster to whom I was replying seemed to be saying that machine redundancy was somehow optional. It is not. Hardware fails, and the architecture of whatever DB server you are using is irrelevant when it does.
The author would have us believe that it's unfair to pick on any piece of software because it "all sucks." They'd also have us believe that complaining about your data disappearing in MongoDB is an unfair criticism, and then take the logical leap that judging software for destroying data and being buggy somehow has something to do with your own ability to create backups. Generally speaking, the people who have been burned by MongoDB have survived because they had backups. That has nothing to do with the fact that their database nuked their data, which is unacceptable when it happens due to careless engineering or poor defaults.
Edit: To be fair, if MongoDB were advertised as a "fault-intolerant, ephemeral database that is fast as heck but subject to failure and data loss, so do not put mission-critical information in it," then all bets would be off. But we know that's never going to happen.
> Generally speaking the people who have been burned by MongoDB have survived by the fact that they had backups.
I'm not one of those people. MongoDB silently corrupted half my data a week back without my noticing, so the backups were naturally missing half the data as well.
Yes, but I've been using Oracle databases for almost a decade and have never known them to drop data on the floor through bugs (only through user error). Not saying it doesn't happen, just that it's not a common event. It seems that with Mongo you should expect data loss.
This is the difference between a database product that has been developed for more than 40 years and a developing software product that has been public for 2.5 years. If Mongo has the longevity that Oracle has had with their DB product, my guess is that in 40 years we will not be talking about Mongo data loss. (However, my guess is that in 40 years we will not be talking about Mongo.)
Mongo people may say that's a good thing - if you aren't planning for data loss, you are just begging for a disaster. And Mongo will force you to deal with recovery early on.
That's no excuse for the DB being buggy, but some of Mongo's problems are due to hard design constraints - it's not easy to make a DB that is fast, reliable, and easy to configure. Others are due to it being immature. Some of it is concerning - it seems it can crumble under heavy write load, which is not so great for a DB whose selling point is "fast at scale".
Part of Mongo's charm is how it works on a stock system. Traditional DBs cache data in RAM, then the OS caches the stuff they cached and swaps their cache to disk. Then you modify something, and the OS swaps the DB cache from disk back to RAM, then the DB tells the OS to write the change to disk, invalidating your OS disk cache, which then... you get the picture. Mongo (and Couch) use the OS's cache, which is suboptimal on a tuned machine, but optimal on something you just threw together.
No, just that there's an upside to their risky design philosophy.
I like Mongo because of its documentation. It's really, really great. And good documentation = widespread adoption, and a team who actually cares about users' needs. What they really need is a lengthy tutorial on backups (which they have already written), linked from every page in their documentation, because their reliability limitations are not something they should be hiding.
Sure, there is an upside, nothing against that, but the trade-off they made should have been advertised on their front page (before they fixed the defaults) in large bold flashing letters -- "you might lose your data if you use this product with default options". That is all.
Why? Because they are making a database, not an RRD logger or an in-memory caching server.
> What they really need is a lengthy tutorial on backups.
As I put it in the grandparent post: as a general rule, avoid products whose mission, by design, is to teach you backup discipline. That is all.
> a team who actually cares about user's needs.
You know what is a better way to care about users' needs? Not losing their data because of a bad design. We are not talking about generating the wrong color for a webpage, or even about exceptions being thrown and the server needing a restart; we are talking about data being corrupted silently without users noticing. Guess what: even backups become useless. You have no idea your data is corrupted, so you keep backing up corrupted data.
If you're suggesting that "traditional" databases operate without thinking about the OS cache, unbuffered IO when called for, memory-mapped files, etc., I strongly believe you're way off.
The difference between MongoDB and many of the other popular persistent data stores (relational or not) is one of degree, not of kind.
MongoDB isn't a fundamentally flawed system. It's just that the distance between what 10gen (and many of its defenders) claim and what it delivers is much greater than most other data storage systems. This is a subtle thing.
Many people have attempted to use MongoDB for serious, production applications. The first few times they encounter problems, they assume it's their fault and go RTFM, ask for help, and exercise their support contract if they're lucky enough to have one. Eventually it dawns on them that they shouldn't have to be jumping through these hoops, and that somewhere along the way they have been misled.
So it's not like anyone is misinterpreting the purpose and/or problem domain of MongoDB. It's more that they are exploring the available options, reading what's out there about MongoDB, and thinking, "Gosh, that sounds awfully cool. It fits what I'm trying to build, and it doesn't seem to have many obvious drawbacks. I think I'll give that a try." And then they get burned miles further down the road.
If MongoDB were presented as more of an experimental direction in rearranging the priorities for a persistent data store, then that would be fine. That's what it is, and that's great! We should have more of those. But when it's marketed by 10gen (and others) as a one-size-fits-all, this-should-be-the-new-default-for-everything drop-in replacement for relational databases, then it's going to fall short. Far short.
I hate to break it to the poster (and I would, if they hadn't chickened out and had actually put their name on their post), but software has bugs.
This is not a valid excuse. This is like running a red light, smashing into someone, and then telling them "hey, you should have looked before entering the intersection... you should know that people sometimes run red lights".
Yes, you should have backups. No, that doesn't make data-loss bugs any more excusable.
The anon poster claimed to have deployed an early version of Mongo at a "high profile" company with tens of millions of users, and yet seemed surprised by basic RTFM facts like "you must use getLastError after writes if you need to ensure they were accepted" (a rough sketch of what that looks like in a driver follows below), even well into a production deploy. That should raise huge alarm bells for anyone who is considering taking the guy seriously.
It's just not clear that there were bona-fide 'data-loss bugs' in play here. Seems at least as likely that misuse and misunderstanding of Mongo led to data-loss that could have been avoided.
So, I'd revise your simile. This is more like ignoring a lot of perfectly safe roads which lead to where you're trying to go, and instead choosing to chance a more exciting-looking shortcut filled with lava pits and dinosaurs. And putting on a blindfold before driving onto it.
Look, NoSQL is wild and wooly and full of tradeoffs, that's a truism by now. If you use such tech without thoroughly understanding it, and consequently run your company's data off a cliff, absolutely it's on you. Mongo does not have a responsibility to put training wheels on and save naive users from themselves, because there should not be naive users. These are data stores, the center of gravity for an application or a business. People involved in choosing and deploying them should not be whinging about default settings being dangerous, about not getting write confirmations when they didn't ask for write confirmations, etc. There's just no excuse for relying blindly upon default settings. Reading the manual on such tech is not optional. Those who don't and run into problems, well, they'd be well-advised to chalk it up as a learning experience and do better next time. Posting "ZOMG X SUCKS BECAUSE I BURNED MYSELF WITH IT" is just silly, reactionary stuff, and it depresses me that HN falls for it and upvotes it like it's worth a damn, every freaking time.
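For anyone following along who hasn't touched a driver: below is a minimal sketch of what "checking your writes" means in practice, using pymongo. This is an assumption-laden illustration, not the anon poster's setup - the drivers of that era spelled it safe=True plus an explicit getLastError command, newer ones spell it write concerns, and the collection and field names here are made up.

    # Sketch only: acknowledged vs. unacknowledged writes in pymongo.
    # (Collection/field names invented; older drivers used safe=True / getLastError.)
    from pymongo import MongoClient
    from pymongo.write_concern import WriteConcern
    from pymongo.errors import PyMongoError

    client = MongoClient("mongodb://localhost:27017/")
    db = client.get_database("app")

    # w=0 is the fire-and-forget mode being argued about: errors never surface.
    unacked = db.get_collection("events", write_concern=WriteConcern(w=0))

    # w=1 (plus j=True to wait for the journal) makes the server acknowledge the
    # write, so failures show up as exceptions instead of silently vanishing.
    acked = db.get_collection("events", write_concern=WriteConcern(w=1, j=True))

    try:
        acked.insert_one({"user_id": 42, "action": "signup"})
    except PyMongoError as exc:
        print("write failed:", exc)  # the moment unacknowledged writes never give you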
Mongo is fine until it's not. It's been fine for us for many months, but once you hit its limitations, it's pretty horrible. We're in this situation right now and we're seriously considering moving back to MySQL or Postgres.
Basically, "it doesn't scale" unless you throw tons of machines/shards at it.
Once they fix a few of their main issues, such as the global write lock, and iron out many of the bugs, it could become an outstanding piece of software. Until then, I consider it not ready for production use in a write-intensive application. Knowing what I know now, I certainly would not have switched our data to MongoDB.
Have you considered Riak? (I ask mostly because I've been looking at both, having a little MongoDB experience but having heard great things about Riak.)
Riak is definitely an option I'll consider, but experience has taught me one thing: most of the time it's better to stay with the most mainstream tools, such as MySQL, even if that means not using the absolute best tool. As long as it gets the job done and is "good enough".
There are better options than MySQL out there (such as Postgres and probably Riak), but when the shit hits the fan in production and you need to bring your servers back up quickly, you'll be happy you chose a tool that has a lot of outstanding consultants and immense amounts of documentation. Finding help for MySQL is very easy. Help for Riak or even Postgres is much more scarce. Also, MySQL is very likely to still be around in a few years. We can't say the same for most of the new NoSQL stuff.
I tried to stay true to this as much as possible in the past and it served me well. With MongoDB, however, I made the mistake of straying from it.
That's fair. Riak seems interesting to me more from an almost academic perspective; as I said, I haven't used it. My go-tos are MySQL and Postgres, too (though I use Redis a decent bit as a communication pipeline).
My experience is that this is true of every database system (relational or non-). The thing is that they all break in different ways at different points, and so the smart thing to do is make choices based on that information.
The stupid thing to do is write blog posts about how Software Package X sucks and nobody should use it for anything.
I have worked with databases in extremely high OLTP workload environments for 20 or so years.
We're talking enterprise products - mostly Sybase, some PostgreSQL, and very little Oracle.
Have I encountered bugs?
Sure, tons of them. Some of them grave enough to render the specific version of the database software unusable in the context of the project I worked on.
However, in all this time I have probably dealt with no more than 3-5 corrupt databases, and none of them went corrupt due to a database bug. Usually it was related to hardware failure.
Arguing that database corruption is inherent in the design of the product is, from a database perspective, beyond the pale.
A database "breaking" is absolutely not the same as a database blasting your data into corrupt confetti.
If you actually go through the various stuff posted, you find a recurring theme: people who lose data fall into a pattern of "well, they told me not to do this, but I did it anyway, so now it must be their fault".
Which, I think you'll find, is a far cry from "database corruption is inherent in the design".
But hey, learning that sort of thing would require reading; much easier to jump on a bandwagon, badmouth a product and downvote anyone who disagrees, amirite?
And everything is wonderful! Complaints and criticism should be removed from the world.
Nice job putting words in my mouth.
Look, I know it's fashionable right now to hate on Mongo for whatever reason, but the simple fact is that everything has a breaking point. Saying "I ran into this product's breaking point, therefore nobody should ever use it" (the gist of many of the recent posts) is frankly stupid; instead, we should be asking when and how and why something breaks when evaluating it, since that'll give us an idea of what fits specific use cases.
I'm not sure how hosting a video for Gawker builds a lot of credibility for having field tested a database; perhaps there are more details he can provide about how that is a particularly interesting trial for a database. Among other things, that seems like "lots and lots of reads, very few writes" and a "very consistent access pattern regardless" kind of situation.
Ha! Fair point. I thought it was an interesting trial in that all updates to our user data wound up being published into MongoDB. None of the other tools we'd tried for this purpose (CouchDB, MySQL with both MyISAM and InnoDB, and even "thousands of .js files in a hashed directory structure") performed as well. It allowed us to shift the load from our MySQL database to "something else", as we were getting killed during our spikes. It was a read-heavy workload in that case.
The thing that struck me about the original post was how it seemed some of the complaints were just normal things that people learn when dealing with clusters under load. "Adding a shard under heavy load is a nightmare." Well, I mean, duh. If you add a shard and the cluster has to rebalance, you're adding load. It's like how you're more likely to get a disk failure during a RAID rebuild. The correct time to add a shard is during the off hours.
Unless you only need single key indexes, your custom made indexer/querier is going to require a lot of configuration. If you only need single key indexes, you wouldn't choose MongoDB anyway.
Can somebody recommend a database with an API like Mongo's, but performance and durability more like PostgreSQL's or Oracle's?
What I want to do is throw semi-structured JSON data into a database and define indexes on a few columns that I'd like to do equality and ranged queries on. Mongo seems ideal for this, but I don't need its performance, and I want durability and the ability to run the odd query that covers more data than fits into RAM without completely falling over.
Right now, the alternative is to do something like the following in Postgres, and have the application code extract a few things from the JSON and duplicate them into database columns when I insert data.
CREATE TABLE collected_data(
    source_node_id TEXT NOT NULL,
    timestamp      INTEGER NOT NULL,
    json_data      TEXT);  -- raw JSON blob; the queried fields are duplicated into the columns above
CREATE INDEX collected_data_idx ON collected_data(source_node_id, timestamp);
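The application-side half of that workaround is short enough; here's a rough sketch of what I mean (psycopg2 assumed, and the JSON field names are invented to match the table above):

    # Sketch only: extract the indexable fields from the JSON and duplicate
    # them into real columns at insert time (psycopg2 assumed, names invented).
    import json
    import psycopg2

    conn = psycopg2.connect("dbname=metrics")

    def store(raw_json):
        doc = json.loads(raw_json)
        with conn, conn.cursor() as cur:
            cur.execute(
                "INSERT INTO collected_data (source_node_id, timestamp, json_data) "
                "VALUES (%s, %s, %s)",
                (doc["source_node_id"], doc["timestamp"], raw_json),
            )

    # Ranged queries then hit the duplicated, indexed columns, not the blob:
    #   SELECT json_data FROM collected_data
    #    WHERE source_node_id = 'node-7' AND timestamp BETWEEN 1320000000 AND 1320086400;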
CouchDB fits the bill. It's all about documents, persistence and defining indexes for range-queries.
Keep in mind two things:
1. It's all on the disk. So, while its throughput is excellent (thousands or more requests per second), each individual request has a latency of ~10ms
2. You define your indexes beforehand (called 'views' in couch terminology), and then you can only make simple queries on them - like by a specific key or by a range of keys. It takes some learning.
If you and your app are ok with both, go for Couch.
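To make point 2 a bit more concrete, here is a rough sketch of defining and range-querying a view over CouchDB's plain HTTP API. This is an illustration under assumptions (Python's requests library, invented database/field/view names), not anything from your description:

    # Sketch only: CouchDB's "declare the index up front, then range-query it"
    # model, via the plain HTTP API (requests assumed; all names invented).
    import json
    import requests

    BASE = "http://localhost:5984/scraped"

    requests.put(BASE)  # create the database (returns 412 if it already exists)

    # The "index" is a design document holding a JavaScript map function.
    requests.put(BASE + "/_design/readings", json={
        "views": {
            "by_node_and_time": {
                "map": "function(doc) { emit([doc.source_node_id, doc.timestamp], null); }"
            }
        }
    })

    # Store a document...
    requests.post(BASE, json={"source_node_id": "node-7", "timestamp": 1320000000, "payload": {}})

    # ...and range-query the view by key (keys are JSON-encoded in the query string).
    resp = requests.get(BASE + "/_design/readings/_view/by_node_and_time", params={
        "startkey": json.dumps(["node-7", 1320000000]),
        "endkey": json.dumps(["node-7", 1320086400]),
        "include_docs": "true",
    })
    print(resp.json()["rows"])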
First, I would take some of the discussion about MongoDB losing data with a grain of salt - especially since the really harsh critique is coming from an unknown source; for all we know, it could be a sinister competitor spreading BS. MongoDB makes the up-front choice of performance and scalability over consistency, and they have never pretended that was not the case. That does not mean that MongoDB loses data left and right; it means that in the choice between bringing a production app to a halt and losing data, MongoDB will opt to keep your app running.
Second, and this is a shameless plug, I really believe that the DB I work on (Neo4j) is a good answer to your question. Neo4j makes the same consistency vs. uptime decision that classic RDBMSes do. In the choice between bringing a production app to a halt and losing data, Neo4j will opt for saving the data.
So, to answer your question: Neo4j lets you store semi-structured documents in a manner somewhat similar to MongoDB, with the added comfort of full ACID compliance. It also lets you specify indexes to do exactly what you describe.
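A rough sketch of what that looks like in practice - and to be clear, this is an illustration using the current official Neo4j Python driver and Cypher index syntax, with labels and property names invented, not something taken from your question:

    # Sketch only: store a "document" as a node, index two properties,
    # and run an equality + range query (official Neo4j Python driver assumed).
    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))

    with driver.session() as session:
        # A "document" is just a node carrying whatever you pulled out of the JSON.
        session.run(
            "CREATE (r:Reading {source_node_id: $src, timestamp: $ts, json_data: $raw})",
            src="node-7", ts=1320000000, raw='{"source_node_id": "node-7", "value": 3.14}',
        )
        # Declare the index you want to query on, once.
        session.run(
            "CREATE INDEX reading_by_node_and_time IF NOT EXISTS "
            "FOR (r:Reading) ON (r.source_node_id, r.timestamp)"
        )
        # Equality on the node id, range on the timestamp.
        rows = session.run(
            "MATCH (r:Reading) WHERE r.source_node_id = $src "
            "AND r.timestamp >= $lo AND r.timestamp < $hi RETURN r.json_data",
            src="node-7", lo=1320000000, hi=1320086400,
        )
        for record in rows:
            print(record["r.json_data"])

    driver.close()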
A PostgreSQL-specific alternative might be to write triggers in one of the provided procedural languages to turn your JSON into something indexed or materialized elsewhere.
Do either of those work for you?
Also, purely out of curiosity, do you have a design reason for only wanting to store schema-less JSON, or have you just been burned by slow database migrations in the past?
There seems to be a big community of people who really want to reject schema and use JSON for everything, and I'm really curious if they (a) don't understand relational databases, (b) are getting some surprising productivity gains somehow, (c) have been burned by slow database migrations in the past, or (d) some other reason.
All of the above would work, but feels less than ideal. I'm pretty comfortable using a well-schema'd relational database to manage data, but I don't think it fits something I'm working on atm.
I'm collecting and parsing data from a few different types of sources (think: some web page scrapers, Twitter, RSS feeds) for later analysis. I want an intermediate data store where I can throw all of the data together for querying in the short term (within days).
Some of the features I extract from it will probably be stored for longer-term use in a regular database. The JSON itself I expect won't ever be referred to in the long term.
If I can think of a new piece of data I might want to look at, it's very appealing to be able to just print it out in one of the data-gathering programs, without having to touch the entire stack top-to-bottom, deploy a new schema, etc.
Anecdotes like the one from the article that ended with "The one thing that didn't flinch was MongoDB" don't convince me one bit. When something else between the database and the end user is the bottleneck, it would be silly to assume that it is the only problem in the entire system. Who is to say that if the load balancers had been configured differently, or spec'd higher, their MongoDB wouldn't have become a smoldering crater?
While anecdotal evidence is always suspect, remember that the case in the article is MongoDB's optimal use case: extremely read-heavy (there is no indication that they did more than one write that day).
"First of all, with any piece of technology, you should, y’know, RTFM. EVERY company out there selling software solutions is going to exaggerate how awesome it is."
Ah, but it isn't the company that's exaggerating the wonders of MongoDB...
This reminds me of the arguments about which programming language is best. There is no best technology. Different technologies are designed for different engineering problems. You can't really blame the technology when you chose the wrong tool for your problem.
If you're going to choose one of these high performance NoSQL DBs you are trading ACID for that performance. How hard is this to understand guys? If that doesn't suit your purposes, don't use it.
1) No one said we are trading ALL of ACID. D should never be traded, period, except for transient data or cache.
2) We don't even get the performance guarantee. See the pastebin post about how the write lock affects performance and how synchronization with a slave can go awry.