I run engineering for foursquare. About a year and a half ago my colleagues and I and made the decision to migrate to MongoDB for our primary data store. Currently we have dozens of MongoDB instances across several different data clusters storing over a TB of data and handling 10s of thousands of requests per second (mostly reads but the write load is reasonably high as well).
Have we run into problems with MongoDB along the way? Yes, of course we have. It is a new technology and problems happen.
Have they been problematic enough to seriously threaten our data? No they have not.
Has Eliot and the rest of his staff @ 10Gen been extremely responsive and helpful whenever we run into problems? Yes, absolutely. Their level of support is amazing.
MongoDB is a complicated beast (as are most datastores). It makes tradeoffs that you need to understand when thinking about using it. It's not necessarily for everyone. But it most certainly can be used by serious companies building serious products. Foursquare is proof of that.
I'm happy to answer any questions about our experience that the HN community might have.
Would you be able to sum up the things you consider Mongo to be extremely good at? Particularly in comparison to things like Riak (which I believe supports a similar data model), or indeed compared to an RDBMS.
All databases perform poorly if you try to use them for use cases they don't fit, but I find with NoSQL databases it can be hard to find concise, objective statements of which use cases each is ideal for.
have users of foursquare run into problems? were they serious? did someone lose money? let's ask. it would answer whether to use an eventually consistent db.
Of course we've run into problems from time to time. No one goes from nothing to foursquare's level of success without running into some bumps along the way.
> were they serious? did someone lose money?
No.
> it would answer whether to use an eventually consistent db
MongoDB actually isn't really an eventually consistent datastore. It doesn't (for example) allow writes to multiple nodes across a network partition and then have a mechanism for resolving conflicts.
Kudos for not blaming the tool when that would have been the easiest route. It's worth mentioning that 10gen has MongoDB Monitoring Service out now. It makes monitoring MongoDB instances a lot more accessible and convenient.
That's not how everyone remembered it at the time, nor the picture the blog painted. And the mongo monitoring thing is much newer, right? Its like saying that a fire wasn't a big deal because next time there'll be a fire station.
If you don't have paying customers (Foursquare), you're not going to lose much in hard dollars when your service falls over. Reputation points? Sure. Dollars, not so much.
Drawing a line between the difference in losing money for a paying service and a free service, while technically correct, is not the best business practice. Any online business looses money by being down, whether they can easily quantify it or not.
I appreciate the "public service" intend of this blog post, however:
1) It is wrong to evaluate a system for bugs now fixed (but you can evaluate a software development process this way, however it is not the same as MongoDB itself, since the latter got fixed).
2) A few of the problems claimed are hard to verify, like subsystems crashing, but users can verify or deny this just looking at the mailing list if MongoDB has a mailing list like the Redis one that is ran by an external company (google) and people outside 10 gen have the ability to moderate messages. (For instance in Redis two guys from Citrusbytes can look/moderate messages, so even if I and Pieter would like to remove a message that is bad advertising we can't in a deterministic way).
3) New systems fails, especially if they are developed in the current NoSQL arena that is of course also full of interests about winning users ASAP (in other words to push new features fast is so important that perhaps sometimes stability will suffer). I can see this myself as even if my group at VMware is very focused on telling me to ship Redis as stable as possible as first rule, sometimes I get pressures about releasing new stuff ASAP from the user base itself.
IMHO it is a good idea if programmers learn to test very well the systems they are going to use with simulations for the intended use case. Never listen to the Hype, nor to detractors.
On the other side all this stories keep me motivated in being conservative in the development of Redis and try avoiding bloats and things I think will ultimately suck in the context of Redis (like VM and diskstore, two projects I abandoned).
1) It is wrong to evaluate a system for bugs now fixed
I disagree. A project's errata is a very good indicator for the overall quality of the code and the team. If a database-systems history is littered with deadlock, data-corruption and data-loss bugs up to the present day then that's telling a story.
2) A few of the problems claimed are hard to verify
The particular bugs mentioned in an anonymous pastie may be hard to verify. However, the number of elaborate horror-stories from independent sources adds up.
3) New systems fails, especially if they are developed in the current NoSQL arena
Bullshit. You, personally, are demonstrating the opposite with redis which is about the same age as MongoDB (~2 years).
I agree with your responses to 1 and 2. I take issue with the example for 3 though because Redis is nowhere near the complexity or feature set of MongoDB.
When you strip MongoDB down to the parts that actually have a chance of working under load then you end up pretty close to a slow and unreliable version of redis.
Namely, Mongo demonstrably slows to a crawl when your working-set exceeds your available RAM. Thus both redis and mongo are to be considered in-memory databases whereas one of them is honest about it and the other not so much.
Likewise Mongo's advanced data structures demonstrably break down under load unless you craft your access pattern very carefully; i.e. growing records is a nono, atomic updates (transactions) are a huge headache, writes starve reads by design, the map-reduce impl halts the world, indexing halts the world, etc. etc.
My argument is that the feature disparity between mongo and redis stems mostly from the fact that Antirez has better judgement over what can be made work reliably and what can not. This is why redis clearly states its scope and limits on the tin and performs like a swiss watch within those bounds.
Mongo on the other hand promises the world and then degrades into a pile of rubble once you cross one of the various undocumented and poorly understood thresholds.
You know, I didn't think about how similar Redis and Mongo are at the core when I first read your comment. The first thing that jumped out at me was the large set of disparities.
Thanks for that explanation. I agree that Mongo seems to have over-promised and under-delivered and that you do have to really craft your access pattern. I'm not a heavy MongoDB user, but from reading the docs and playing around, I was already under the impression that the performance of MongoDB is entirely up to me and that I would need a lot of understanding to get the beast working well at scale.
So, it's a tough call for me to say whether they over-promised or not, but like I said...I'm not a heavy user. I just read a lot. I do think it is easy to be deceived by Mongo's apparent simplicity (ie - usage of JSON, Javascript, schema-lessness, etc).
EDIT: zzzeek made a good point below about spending time in a low-key mode before really selling the huge feature-set, which convinced me, so I think you're right. I do like the idea of Mongo though, so hopefully they can get through it.
there's something to be said for promoting an application proportionally to the maturity of its implementation. An application with a larger and more sprawling featureset would need to spend several years in "low key" mode, proving itself in production usage by a relatively low number of shops who treat it with caution. I think the issue here is one of premature overselling.
At the end of the post the author notes his concern isn't with the technical bugs per se, but with the deep rooted cultural problems and misplaced priorities the existence of those problems reveal.
That's a fair problem, but I think It is true for other products as well and was true for things that we feel very solid today like MySQL. In other words there is a tention between stability and speed of development, a very "hard" tention indeed. It is up to the developers culture and sensibility to balance the two ingredients in the best way.
One of the reasons I don't want to create a company around Redis, but want to stay with VMware forever as an employee developing Redis, is that I don't want development pressures that are not drive by: users, technical arguments. So that I can balance speed of development and stability as I (and the other developers) feel right.
Without direct reference to 10gen I guess this is harder when there is a product-focused company around the product (but I don't know how true this is for 10gen as I don't follow very closely the development and behavior of other NoSQL products).
MySQL is a poor analogy because the history of MySQL is very similar to 10gen: a 'hacker' solution originally patched together by people who didn't take their responsibility as database engineers very seriously. It's only after years (decades) of work that MySQL has managed to catch up with database technology of the 80s in terms of reliability and stability (and it still has plenty of issues, as the most recent debacles with 5.5 show.)
On the other hand, commercial vendors like Oracle and open source projects like PostgreSQL recognize their role as database engineers is to first and foremost "do no harm." Ie, the database should never destroy data, period. Bugs that get released that do cause such things can be traced back to issues that are not related to a reckless pursuit of other priorities like performance. Watching the PostgreSQL engineers agonize over data integrity and correctness with any and all features that go out that are meant to improve performance is a re-assuring sight to behold.
This priority list goes without saying for professional database engineers. That there is such a 'tension' between stability and speed says less about a real phenomenon being debated by database engineers and more about the fact that many people who call themselves database engineers have about as much business doing so as so-called doctors who have not gone to medical school or taken the Hippocratic oath.
I agree with you but my comments are more about telling what is going on in my opinion, instead of telling what I think should be the right priority list. Even if I agree I still recognize that MySQL had a much bigger effect to the database world compared to PostgreSQL, so the success of a database can sometimes take strange paths.
But I think a major difference between MySQL and Redis, MongoDB, Cassandra, and all the other NoSQL solutions out there is that MySQL had an impressive test bed: all the GPL LAMP applications, from forums to blogs, shipped and users by a shitload of users. We miss this "database gym" so these new databases are evolving in small companies or other more serious production environments, and this creates all the sort of problems if they are not stable enough in the first place.
So what you say can be more important for the new databases than it was for MySQL indeed.
> MySQL had a much bigger effect to the database world compared to PostgreSQL
And if MySQL never existed, what would have happened ? Would we have all used PostgreSQL in the first place and avoided years of painful instability ?
I read here all the time that fashion and ease of use are more attractive than reliability. And we introduce plenty of new software in complex architecture just because they are easy to use. We even introduce things like "eventual consistency", as if being eventually consistent was even an option for any business.
The problem is to not use random datastores. Use a database that has a proven record of stability. And if someone builds a database, he/she must prove that ACID rules are taken seriously, and not work around the CAP theorem with timestamps...
10 years ago, MySQL was not stable. PostgreSQL was. Today, most key-value databases are not stable, PostgreSQL is.
Interesting to note is that early versions of Postgres, we're talking the pre-6 versions around 1995 here, were awful. Not like I was a very sophisticated user at that time myself but it definitely ate my data back then - we switched to MSQL at that time which at least didn't do that.
Wasn't it still basically a university project for researching MVCC at that point? I love universities of course but we must admit they produce interestingly-architected abandonware sometimes.
My sense was that it got a pretty thorough review and revision/rewrite in the transition from Postgres to PostgreSQL.
PostgreSQL has evolved a LOT in the last decade even. I thought the university project was looking at OO paradigms in relational databases (inheritance between relations and the like).
The change from Postgres to PostgreSQL was largely a UI/API change and the move from QUEL to SQL. However, over time virtually all of the software has been reviewed and rewritten. It's an excellent project, and I have been using it since 6.5.......
That was 16 years ago. Since then, PostgreSQL engineers spent a LOT of time proving the reliability of their engine. And today, 16 years later, we can consider it reliable.
Most key-value databases didn't prove (as in: show me actual resistance tests, not supercompany123 uses it) that they are reliable. The day they do, I'll be the first one to use them. Until then, it's just a toy for devs who don't want to deal with ER models.
you misunderstand me. I LOVE postgresql. It is the best database ever and I try to use it as much as possible. My only point was, they started out as unstable and untrustworthy just like anything else would.
I agree. There was no WAL logging, for instance. Most people consider 7.4 the first actually-possibly-not-a-terrible-idea release.
Then again, Postgres -- the project -- did not try to position itself (was there even such a thing as "positioning" for Postgres 16 years ago?) as a mature, stable project that one would credibly bet one's business on.
Lots of early database releases are going to be like Mongo, the question is how much the parties at play own up to the fact that their implementation is still immature and present that starkly real truth to their customers. So far, it seems commercial vendors are less likely to do that.
Well, 8.0 is really the first really good release.
However, actually-not-a-terrible-idea is pretty relative, when you look at how the industry has evolved in the mean time. I mean, compared to MySQL at the time, PostgreSQL 6.5 was really not a terrible idea. 7.3 was the first release I didn't have to use MySQL as a prototyping system though.
> And if MySQL never existed, what would have happened ? Would we have all used PostgreSQL in the first place and avoided years of painful instability ?
I think you're missing the point a little. Yes, MySQL is a heap, and having to work with it in a Postgres world sucks. But, the point antirez is making in that comment (at least how I read into it) is that an active user community in ANY project is hugely important in that project's formation and "maturity" (sarcastically, of course, because Postgres is clearly more mature than MySQL). There's no extrapolation here to the top-level Mongo discussion going on in this thread -- I was just clarifying antirez's point.
I still think that solid engineering on any project begins with the engineering and leadership of a few, and the feedback of many. So yes, community is important, but less important than the core of that community which is necessarily small.
> IMHO it is a good idea if programmers learn to test very well the systems they are going to use ...
Great point. It would also help if the company that makes a DB would put flashing banner on their page to explain the trade-offs in their product. Such as "we don't have single server durability built in as a default".
I understand if they are selling dietary supplements and are touting how users will acquire magic properties for trying the product for 5 easy payments of $29.99. In other words I expect shady bogus claims there. But these people are marketing software, not to end users, but to other developers. A little honesty, won't hurt. It is not bad that they had durability turned off. It is just a choice, and it is fine. What is not fine is not making that clear on the front page.
It's good to see a voice of reason. I think we all win if NoSQL is allowed to survive. Having multiple paths to modeling and designing our applications is an enrichment of our ability to create interesting and valuable applications in our industry. The last 10 years have been about living under the modeling constraints of RDBMS's and the industry is slowly waking up to the realization that it does not need to be like this. Now we got choices. Graph db's, column db's, document db's etc.
I would like to thank you for the great job you have and are doing on Redis. It's an awesome piece of technology and warms my heart as an European :). Are you based in Palermo ?
"Allowed to survive" is the wrong approach. "Finds a niche" is better.
The fact that software engineers need to understand is that NoSQL is in no way a replacement for SQL in areas of data with inherent structure. In such areas, the relational model wins hands-down, and NoSQL is a big, heavy foot-gun. The caliber of the foot gun goes up significantly when multiple applications need to access the same data.
On the other hand, the relational model breaks down in some ways in many areas. Some things that you'd think are inherently structured (like world-wide street addresses) turn out to only be semi-structured. Document management, highly performing hierarchical directory stores, and a few other areas also are bad matches for the relational model. Other stores work well in many of these areas, from the filesystem to things like NoSQL databases.
The big problem occurs when semi-structured data (say files which contain printed invoice data in PDF format) have to be linked to inherently structured data (say, vendor invoices). In these cases, tradeoffs have to be made......
I have no doubt that NoSQL is able to find a niche. I doubt it will be one which at best involves inherently structured data.
> I think we all win if NoSQL is allowed to survive.
What does that even mean? Is it some sort of cultural practice or religion we are afraid of losing. So we should look over lost data and bad designs just because something falls under the "NoSQL" category?
I think anyone married to a technology like it is a religion is poised for failure. Technology should be evaluated as a tool. "Is this tool useful to me for this job?" Yes/No? Not "it has NoSQL in its title, it must be good, I'll use that".
Anyone with half a brain can go look at the MongoDB codebase and deduce that it's amateur hour.
It's start up quality code but it's supposed to keep your data safe. That's pretty much the issue here -- "cultural problems" is just another way of saying the same thing.
Compare the code base of something like PostgreSQL to Mongo, and you'll see how a real database should be coded. Even MySQL looks like it's written by the world's best programmers compared to Mongo.
I'm not trying to hate on Mongo or their programmers here, but you've basically paid the price for falling for HN hype.
Most RDBMSes have been around for 10+ years, so it's going to take a long, long time for Mongo to catch up in quality. But it won't, because once you start removing the write lock and all the other easy wins, you're going to hit the same problems that people solved 30 years ago, and your request rates are going to fall to memory/spindle speed.
I think the discussion here also misses an important aspect of the conversation which is about application data modeling. Mongo will sooner or later reach a "stable" level as it matures just as mysql, postgres and all other datastores have done. I picked mongo due to the good fit it had to the problems I needed solved not only from the server perspective but from the modeling perspective. The ease of ad-hoc queries and the schemaless nature of the db lent itself well to the kind of problems I wanted to solve.
So even if in 30 years it's got the same characteristics as our current dominant data storage models I consider it a net win that I will be able to use a document oriented database for development over a more traditional RDBMS for some off my applications.
The richer our toolset is the better we are off as not every problems is a nail to be hammered in with an RDBMS.
So a high five to all the people who dare go against convention and take a chance on a new approach to data modeling being it Mongo, Riak, CouchDb, Redis, Neo4j, Cassandra, HBase or any other awesome opensource project out there.
Document databases, network databases and hierarchical databases (IMS, CODASYL etc) predate relational databases by decades.
Relational is the universal default for a simple reason. When first introduced it proved to be far better, in every conceivable way, than the technologies it replaced.
It's as simple as that. Relational is a slam-dunk, no-brainer for 99.99% of use cases.
Still, if you really want a fast, proven system for one of the older models, you can get IBM to host stuff for you on a z/OS or z/TPF instance, running IMS. It'll have more predictable performance than AWS to boot.
I agree entirely - I think when people rebel against "relational databases" they're actually just realizing that the normalization fetish can be harmful in many application cases.
You're better off with MySQL or PostgreSQL managing a key-value table where the value is a blob of JSON (or XML, which I've done in the past), then defining a custom index, which is pretty damn easy in PostgreSQL. Then you have hundreds of genius-years of effort keeping everything stable, and you still get NoSQL's benefits. Everybody wins.
Normalization is a tricky thing. On one hand, highly normalized databases have better flexibility in reporting, IMHO. On the other, you lose some expressiveness regarding data constraints. High degrees of normalization would be ideal if cross-relation constraints were possible. As they are not, typically one has to normalize in part based on constraint dependencies just as much as data dependencies.
First, the more I have looked, the more I have found that non-relational database systems are remarkably common and have been for a long time.
The relational model is ideal in many circumstances. However, it breaks down in semi-structured content, content where---parentheses for grouping---(hierarchical structure is important, data is seldom written and frequently read, and where read performance navigating the hierarchy is most important) and so forth.
So I'd generally agree, but not every problem is in fact a nail.
> However, it breaks down in semi-structured content, content where---parentheses for grouping---(hierarchical structure is important, data is seldom written and frequently read, and where read performance navigating the hierarchy is most important) and so forth.
Again, this problem is not new. Database greybeards call this OLAP and it's been around since the 80s.
No. I am talking about something like LDAP, not OLAP. LDAP may suck badly in many many ways but it is almost exactly not like OLAP.
OLAP is typically used to refer to environments which provide complex reports quickly across huge datasets, so a lot of materialized views, summary tables, and the like may be used (as well as CUBEs and the like). Hierarchical directories are different. In a relational model you have to transfers the hierarchy to get the single record you want and you are not aggregating like you typically do in an OnLine Analytical Processing environment.
This is why OpenLDAP with a PostgreSQL backend sucks, while OpenLDAP with a non-relational backend (say BDB) does ok.
I am not saying anything new is under the sun, just that some of the old structures haven't gone away.
I was referring to the read/write preponderance. Normalisation optimises write performance, storage space and also provides strong confidence of integrity. But it means lots of joins, which can slow things down on the read side.
That's why OLAP came along. Structured denormalisation, usually into star schemata, that provide fast ad-hoc querying. I think part of the enthusiasm for NoSQL arises because most university courses and introductory database books will go into normalisation in great detail, but OLAP might only get name checked. So folk can get an incomplete impression of what relational systems can do.
If I had a purely K/V data problem -- a cache, for example -- I would turn to a pure K/V toolset. Memcache, for example.
Hierarchical datasets have long been the blindside for relational systems. Representable, but usually requiring fiddly schemes. But in the last decade SQL has gotten recursive queries, so it's not as big a problem as it used to be.
Normalization is formally defined based on data value dependencies. However, because there is no way to set constraints across joins, in practice, the dependencies of data constraints are as important as the dependencies of data values.
As far as recursive queries, I am not 100% sure this is ideal either from a read performance perspective. There are times when recursive queries are helpful from a performance perspective, but I don't see a good way to index, for example, path to a node. Certainly most databases don't do this well enough to be ideal for hierarchical directories. For example indexing the path to a node might be problematic, and I am not even sure you could do this reliably in PostgreSQL because the function involved is not immutable.
Your replies so far are excellent. You're pointing out things I've overlooked, thanks.
> However, because there is no way to set constraints across joins, in practice, the dependencies of data constraints are as important as the dependencies of data values.
I don't follow your argument here. Could you restate it?
> As far as recursive queries, I am not 100% sure this is ideal either from a read performance perspective. There are times when recursive queries are helpful from a performance perspective, but I don't see a good way to index, for example, path to a node.
Poking around the Oracle documentation and Ask Tom articles, it seems to be more art than science; mostly based on creating compound indices over the relevant fields. Oracle is smart enough to use an index if it's there for a recursive field, but will struggle unless there's a compound index for other fields. I don't see an obvious way to create what you might call 'recursive indices', short of having an MV.
> Certainly most databases don't do this well enough to be ideal for hierarchical directories.
It'll never perform as well as a specialised system. But relational never will. An RDBMS won't outperform a K/V store on K/V problems, won't outperform a file system for blob handling and so on. This is just another example of the No Free Lunch theorem in action.
My contention is that we, as a profession of people who Like Cool Things, tend to discount the value of ACID early and then painfully rediscover its value later on. The business value of ACID is not revealable in a benchmark, so nobody writes breathless blog posts where DrongoDB is 10,000x more atomic than MetaspasmCache.
> I don't follow your argument here. Could you restate it?
Sure.
Quick note, will use PostgreSQL SQL for this post.
Ok, take a simple example regarding US street addresses.
A street address contains the following important portions:
1) Street address designation (may or may not start with a digit). We will call this 'address' for relational purposes.
2) City
3) State
4) Zipcode
As for data value dependencies:
zipcode is functionally dependent on (city, state), and so for normalization purposes we might create two relations, assuming this is all the data we ever intend to store (which of course is always a bad assumption):
create table zipcode(zipcode varchar(10) not null primary key, city text not null, state text not null, id serial not null unique);
create table street( id serial not null, address text, zipcode_id int references zipcode(id), primary key(address, zipcode_id));
So far this works fine. However, suppose I need to place an additional constraint on (address) for some subset of (zipcodes), let's say all those in New York City. I can't do it declaratively, because all data constraints must be internal to a relation.
So at that point I have two options:
1) You can write a function which determines whether a zipcode_id matches the constraint and check on that, or
2) You can denormalize your schema and add the constraint declaratively.
I did some searching and determined strangely that although subqueries in check constraints are part of SQL92, the only "database" that seems to support them is MS Access. But while there are obvious issues regarding performance, I don't see why these couldn't be solved using indexes the same way foreign keys are typically addressed.
> Poking around the Oracle documentation and Ask Tom articles, it seems to be more art than science; mostly based on creating compound indices over the relevant fields. Oracle is smart enough to use an index if it's there for a recursive field, but will struggle unless there's a compound index for other fields. I don't see an obvious way to create what you might call 'recursive indices', short of having an MV.
No, there is an inherent problem here. Your index depends on other data in the database to be accurate. You can create an index over parent, etc. but you still end up having to check the hierarchy all the way down to find the path. You can't just index the path.
Consider this:
CREATE TABLE treetest (id int, parent int references treetest(id));
The path to 7 is: 1,2,6,7. To find this, I have to hit 4 records in a recursive query. That means 4 scans.
So suppose we index this value, reducing this to one scan.
Then suppose we:
update treetest set parent = 3 where id = 6;
and now our index doesn't match the actual path anymore.
With specialized hierachical databases, you could keep such paths indexed and make sure they are updated when any node in the path changes. There isn't a good way to do this in relational systems though because it is outside the concept of a relational index.
> My contention is that we, as a profession of people who Like Cool Things, tend to discount the value of ACID early and then painfully rediscover its value later on. The business value of ACID is not revealable in a benchmark, so nobody writes breathless blog posts where DrongoDB is 10,000x more atomic than MetaspasmCache.
No doubt about that. I think we are 100% in agreement there!
I'd also add that while RDBMS's aren't really optimal as backings for something like LDAP for a big directory, and while RDBMS's are horribly abused by dev's who don't understand them (ORM's and the like), they really are amazing, valuable tools, which are rarely valued enough or used to their fullest.
Later this week, I expect to write a bit of a blog post on http://ledgersmbdev.blogspot.com on why the intelligent database model (for RDBMS's) is usually the right model for the development of many business applications.
In response to PostgreSQL's custom index types, taking a quick look at the API, I don't see a way of telling GiST indexes which entries need to be updated when a row's parent id is changed.
Consequently I don't believe there is a reasonable way to index this because there is no way to ensure the indexes are current and so you don't have a good way of testing that a row is in a path on the tree other than building the tree with recursive subqueries.
The thing is, unless you have a system which is aware of hierarchical relationships between the rows (which by definition is outside the relational model), you have no way of handling this gracefully. So here you have lots of reads, I really think dedicated hierarchical systems will win for hierarchical data.
Of course this wouldn't necessarily mean you couldn't store everything in the RDBMS and periodically export it to the hierarchical store.....
> When first introduced it proved to be far better, in every conceivable way, than the technologies it replaced.
That's not exactly true; what they did was offer a generic query and constraint model that worked well in all cases while offering reasonable performance. They were not generally faster in optimal cases, but they were much easier to query especially given new requirements after the fact because the queries weren't baked into the data model itself. That generic query ability and general data model always come at the cost of speed; always. Document databases have always been faster in the optimal use case.
You're absolutely right -- RDBMSes were designed to solve problems with the nosql-type approaches that preceded them. The nosql bandwagon is blindly rolling into the past, where it will crash into the old problems of concurrency and consistency under load.
BTW if you want nosql-style schema flexibility within an RDBMS, then a simple solution is to store XML or JSON in in a character blob. Keep the fields you need to search over in separate indexed fields. If you make incompatible version changes, then add a new json/xml field.
very true but it's a resurgence of modeling alternatives which can only help to enrich our ability to write interesting applications. yes you can model a social network in a RDBMS but it's not as efficient or as flexible as using neo4j. or yes you can model a key value document in a RDBMS but again it's not a good fit. The right tool for the right problem. You don't build a house with only a hammer so why should we build applications only on one storage concept ?
I looked at using BSON in a project a while back, and ended up scrapping it mainly due to perceived poor code quality. Plenty of potential errors ignored, unclear error messages, unsafe practices.
I was also turned off by the sloppy use of memory. Heap allocated objects returned from functions with poor checks to see if anyone manages that memory on the other side. Lots of instances of strcmp, strcpy and similar unsafe string/buffer manipulation functions.
It's been a while since I looked at it so I don't have any particular examples at hand, but that was my impression.
Take a look at for example: bool BtreeBucket<V>::find
Without even thinking about what it is doing, it's quite clear that it is not readable code, and it's not immediately obvious what the high level structure of the logic is. The function does not even fit into two screens so it's hard to reason about; your short-time memory is overused.
Clearly you didn't actually read the source file. I graduated in CS. I know B+ trees.
I also know that an 85-line, 7-argument method in a 1988-line file shouldn't depend on a global variable ("guessIncreasing") modified from several other, unrelated functions. I know that in bt_insert, which (apparently) assigns to "guessIncreasing" and then resets it to false just prior to exit, should be using an RAII class to do so instead of trying to catch every exit path, especially in a codebase that uses exceptions.
Thanks for attacking me personally. But I have no interest to pursue it more. I made claims that clearly hold true, and they have nothing to do with what you said (I did not say anything about bugs, for example)
That is characteristic of mathematical code, like btree. (ranty aside: being able to recognize this and find out information regarding btree for maintenance is(should be) one of the key reasons to get a CS degree)
I found the btree file relatively readable. Some macro stuff is not familiar to me, but I am sure I could figure it out in a few hours if I felt like. And I haven't yet rolled around to implementing a full-on btree, ever.
The bug was only triggered when the delayed_commits option was on (holds off on fsyncing when lots of write operations are coming in) and there was both a write conflict and a period of inactivity - when the database was shut down, any writes that happened afterwards would not be saved.
They immediately worked to develop a process that would prevent any data from being lost if you didn't shut down the server, then a week later had released an emergency bugfix version without the bug. Then later they released a tool that could recover any data lost from the bug if the database hadn't been compacted.
That's the kind of attitude database developers need to have towards data integrity.
One of the things that I love about Couch is that the standard way to shutdown the process is simply doing a kill -9 on the server process. No data loss. No Worries. Want to back up your data? rsync it and be done with it.
Couch may have its warts, but it is damn reliable.
I feel that Couch has too much server side programming. It can be off puting sometimes. If anyone wants to make some money, I'd suggest them putting a server on top of a couch cluster that receives mongo queries.
I mean, how hard can it be to
1) Manage some indexes,
2) Keep some metadata around and
3) Build some half-assed single index query planner?
Couch is already a solid piece of technology. It just needs a better API to "sit" on top of it, kinda like what Membase is doing now.
edit: or on top of Riak, Cassandra, PostgreSQL or etc ... on the API side, Mongo has clearly won.
Look, I'm not the best person to do this..but...good points?
1 - Default writes are unsafe by default:
MongoDB supports a number of "write concerns":
* fire-and-forget or "unsafe"
* safe mode (only written to memory, but the data is checked for "correctness", like unique constraint violations)
* journal commit
* data-file commit
* replicate to N nodes
The last 4 can be mixed and matched. Most (all?) drivers allow this to be specified on a per-write basis. It's an incredible amount of flexibility. I don't know of any other store that lets you do that.
When a user registers, we do a journal commit ({j:true}), 'cuz you don't want to mess that up. When a user submits a score, we do a fire-and-forget, because, if we lose a few scores during the 100ms period between journal commit, it isn't the end of the world (for us, if it is for you, always use j:true)
The complaint is the default-behavior (which I think you can globally configure in most drivers) of the driver? Issue a pull request. Is the default table created in MySQL still MyISAM ?
2 and 6 - Lost Data
This is the most damning point. But what can I say? "No?" My word versus his? I haven't seen those issues in production, I hang out in their google groups and I don't recall seeing anyone bring that up - though I do tend to avoid anything complicated/serious and let the 10gens guys handle that. Maybe they did something wrong? Maybe they were running a development release? Maybe they did hit a really nasty MongoDB bug.
3 - Global Lock
MongoDB works best if your working set fits in memory. That should simply be an operation goal. Beyond that, three points. First, the global lock will yield, i believe (someone more informed can verify this). Second, the story gets better with every version and it's clearly high on 10gen's list.
Most importantly though, it's a constraint of the system. All systems have constraints. You need to test it out for your use-case. For a lot of people, the global lock isn't an issue, and MongoDB's performance tends to be higher than a lot of other systems. Yes it's a fact, but with respect to "don't use MongoDB", its FUD. It's an implementation detail, that you should be aware of, but it's the impact of that implementation details, if any, that we should be talking about.
3 and 4 - Sharding
Sharding is easy, rebalancing shards is hard. Sharding is something else which got better in 1.8 and 2.0, which the author thinks we ought to simply dismiss. I don't have enough experience with MongoDB shard management to comment more. I think the foursquare outage is somewhat relevant though (again, keeping in mind that things have improved a lot since then).
7 - "Things were shipped that should have never been shipped"
This is a good verifiable point? I remember using MySQL cluster when it first shipped. That was a disaster. I also remember using MySQL from a .NET project and opened up a good 3-4 separate bugs about concurrency issues where you could easily deadlock a thread trying to pull a connection from the connection pool.
I once had to use use clearcase. Talk about something that shouldn't have shipped.
This is essentially an attack on 10gen, that ISN'T verifiable. Again, it's his anonymous word versus no ones. Just talking about it is giving it unjust attention.
8 - Replication
It's unclear if this is replica sets or the older master-slave replication. Either way, again, I don't think this is verifiable. In fact, I can say that, relatively speaking, I see very few replica set questions in the groups. It works for me, but I have a very small data set, my data pieces themselves are small. Obviously some people are managing just fine (I'm not going to go through their who's who, I think we all know some of the big MongoDB installations).
9 - The "real" problem
We've all seen some pretty horrible things. I was using MySQL in 5.0 and there was some amazing bugs. There's a bug, which I think still exists, where SQL Server can return you the incorrect inserted id (no, not using @@identify, using scope_identity) when you use a multi-core system. MS spent years trying to fix it.
I guess I can say what 10gen never could...If you were using MongoDB prior to 1.8 on a single server, it's your own fault if you lost data. To me, replication as a means to provide durability never seemed crazy. It just means that you have to understand what's going on.
Look, I don't doubt that this guy really ran into problems. I just think they have a large data set with a heavy workload, they thought MongoDB was a silver bullet, and rather than being accountable for not doing proper testing, they want to try and burn 10gen.
They didn't act responsibly, and now they aren't being accountable.
If you were using MongoDB prior to 1.8 on a single server, it's your own fault if you lost data. To me, replication as a means to provide durability never seemed crazy. It just means that you have to understand what's going on.
Well, except for that thing where the replication decided that the empty set was the most recent and blew everything else away. And those cases where keys went away.
Losing data, particularly when the server goes down, is fine. Even not writing data isn't terrible, though his points about not knowing whether it has been written in case of failure are really good ones. But corrupting data and then replicating that corrupted data is really, really bad. Often unfixably bad.
They didn't act responsibly, and now they aren't being accountable.
For the complaints about the default write stuff, sure. For everything else... Dunno. He brought up a lot of real, actual issues which were not documented MongoDB behavior. Yes, there's also a fair bit of complaining about the documented bits, and sure, boo-hoo, whatever. But the idea that 10gen is shipping stuff with serious data integrity bugs, and doing so knowing, doesn't seem out of line here.
And while MySQL also has some bad stuff, sure, it has nothing like as many data integrity bugs as MongoDB.
And I say all of this as a serious fan of MongoDB.
"This is a good verifiable point? I remember using MySQL cluster when it first shipped. That was a disaster. I also remember using MySQL from a .NET project and opened up a good 3-4 separate bugs about concurrency issues where you could easily deadlock a thread trying to pull a connection from the connection pool."
You can STILL deadlock a transaction against itself in MySQL w/Innodb. How do they let this happen? I do not know. I just know I have been bitten by deadlocks in multi-row inserts quite often there enough to get really really frustrated when I use that db. This is in fact documented in the MySQL manual.
For better or worse, projects which start out without a goal to offer highly reliable software from the start never seem to be able to offer it later.
I've also seen a lot of SQL Server developers write large stored procedures that manage to easily deadlock. It's been years since I dealt with it...had something to do with lock escalation, from a read lock to an update lock to an insert lock.
You could say "don't use SQL Server"..or you could say "it's important that you understand SQL Server's locking behavior"
It's one thing for two transactions to deadlock against eachother. It takes special talent to allow a transaction to deadlock against itself, which InnoDB apparently allows.
I have NEVER had issues with PostgreSQL transactions deadlocking against themselves, even with monstrous stored procedures.
I spent the time to write all that, and all you got from it is "MySQL is just as bad"...I obviously did a bad job.
edit:
I brought up MySQL because I think we all know that companies, you, me knowingly ship products with bug. In fact, you can look at public bug tracking for a bunch of major software and see bug fixes scheduled for future releases.
However, if you are going to accuse a database vendor of knowingly shipping data-corruption bugs, I think you absolutely have to back that up. It's slanderous. Obviously, if you think that, you also shouldn't use their product. But you either know something the rest of us don't, or you're a complete ass, if you make those kinds of statements without evidence.
No, of course that's not all I got from it. I was making a point specifically about the comparison you seemed to be making: that because MySQL did something (shipping with stupid defaults, dataloss bugs, whatever), it doesn't count as a black mark against MongoDB if they do the same.
I didn't comment on the rest because I don't care, not because I don't get it.
Refutations are going to fall into two categories, it seems:
1. Questioning my honesty
2. Questioning my competence
Re #1, I'm not sure what you imagine my incentive to lie might be. I honestly just intended this to benefit the community, nothing more. I'm genuinely troubled that it might cause some problems for 10gen, b/c, again, Eliot & co are nice people.
Re #2, all I can do is attempt to reassure you we're generally smart and capable fellows. For example, these same systems exhibit none of these problems, and we're sleeping quite well through the night, on the new database system they've moved to. I'll omit the name of the database system just so there is no conflict that might undermine my integrity and motives (see #1).
edit:
(also, there are a few comments about "someone unknown/new around here"... trust me, I'm not new or unknown. I'm a regular.)
So, you've got direct engagement from the CTO above, and plenty of other commentary to consider here, but you dropped back in only to announce that "hey, all dissent falls into 2 convenient buckets, and here are my quick rebuttals to those strawmen"? Really?
If intellectual laziness like that is any indication, I doubt anyone is going to be reassured on your point #2. You've dropped a bomb on 10gen here, and done it anonymously to boot. You've got their people sifting through past issues on a Sunday, and for what? Because you fucked up a project by making poor choices, and probably took the well-deserved heat for it? Nevermind your categories. Man up and respond to these people directly, or don't respond at all.
Your HN acct was created 14 hours ago, and it's name is extremely specific to this particular post -- "nomoremongo". Nothing wrong with it, just a little peculiar considering the subject mater.
- Where did you experience these problems?
I guess I could see some issue around revealing where you work, but honestly, that just sucks. It really doesn't have anything to do with questioning your honesty and integrity; it's more about just being open about things. If you're going to be open about your experiences, why not be open about all of it?
Not to be rude, but the anonymous-nature of this post comes off as a bit over-dramatic.
I don't understand why anyone is surprised or bothered that this would be an anonymous post.
Has anyone commenting ever used a piece of software that caused them major problems, while watching others with less experience talk about how great it is? For me, it is beyond my capabilities to refrain from speaking up about it.
His identity does not matter, and it would start a war between people or companies. He is not interested in doing that, and he is not speaking on behalf of a company. There is not really any other way to do it.
Sometimes people need to put information out there but don't want to be personally associated with the information. This is fairly logical, because they are not associated with the information. They just discovered what was already true.
Some are also (fairly) questioning "why the anonymity?", and "where is the evidence?"
Those two things are connected: I can't provide the evidence without revealing identity. And the reason for the anonymity is we still have some small databases with 10gen and a current support contract. I had intended to go public with all this after we had transitioned off the system entirely, but more and more reports have continued to pop up of people having trouble with MongoDB, and it seemed as though delaying would be imprudent. An anonymous warning would be more valuable than saying nothing.
So--if you choose to ignore or dismiss our claims, you're entitled. :-) I still feel satisfied that I did what I needed to do.
Reading this overall thread, I think you didn't have as much of an impact on people's thinking as you'd probably like to. I hope after your company finishes transitioning you do go public on this, with all the specific evidence.
Yep. I do regret not GPG signing it or something so we could later claim it without more conspiracy theories. But I'll blog about it on an official blog as soon as we're clear of any interest in MongoDB.
For me, the post is very helpful---when MongoDB markets itself
MongoDB is a scalable, high-performance, open source, document-oriented database. Written in C++... http://www.mongodb.org/
I presume that anyone would post such content anonymously.
I've used MongoDB in production since the 1.4 days. It should be noted that my apps are NOT write heavy. But, many of the author's points can be refuted by using version 2.0.
Regarding the point of using getLastError(), the author is completely correct. But the problem is not so much that MongoDB isn't good, it's that developers start using it and expect it to behave like a relational DB. Start thinking in an asynchronous programming paradigm, and you'll have less problems.
I got bit my MongoDB early on. When my server crashed, I learned real quickly what fsync, journaling, and friends can do. The best thing a dev can do before using MongoDB is to RTFM and understand its implications.
The #1 reason that I used MongoDB, was because of the schema-less models. That's it. Early on in an applications life-cycle, the data model changes so frequently that I find migrations painful and unnecessary.
Schema-less is imho a overrated feature. ORMs like DataMapper (Ruby) and NHibernate (.NET) can generate the schema on the fly for RMDBS, so no need for migrations pre-production. But when your application is in production you need migrations even with a "schema-less" db! See, rename a field and "all your data" is lost, unless you migrate the data from the old field to the new one..
"Schema-less" has the potential (if you use it properly) advantage of allowing gradual migration.
As long as your code can handle all versions of objects in current use, you can deploy new code, then either migrate objects as they're updated/rewritten, and/or slowly migrate objects in the background.
For certain types of schema changes in large enough data stores, this can be a killer feature. I remember one RDBMS setup I had to deal with where we were "stuck" having to do a lot of suboptimal schema changes because the changes we actually wanted to do resulted (based on tests in our dev environment) the system to slow to a crawl where it was unusable for 8+ hours and we just couldn't afford that kind of downtime. We spent a lot of engineering time working our way around something that'd simply be a non-issue in a schema-less system.
"Schemaless" most of the time means "code based schema". Dealing with multiple schema versions at the same time is always possible, relational or not, but it causes significant bloat and complexity. When I hear gradual migration I think code decay, but I can see why it could be useful sometimes.
In my view, schemaless models are only desirable if the schema is not known until runtime, e.g. user specified fields or message structures, external file formats that you don't control but might need to query, etc.
Well, also there is the issue of highly unstructured data. In LedgerSMB, we put it in PostgreSQL along with highly structured data, and just use key-value modelling. These include things like configuration settings for the database in question and the specifics about what a menu item does. I might migrate some of this to hstore in the future (particular the menus).
There are many shortcomings of this approach but when dealing with highly unstructured data (or basically where the inherent structure is that of key/value pairs) it strikes me as the correct approach, and not different really from using NoSQL, XML, or any other non-relational store.
right but was that MySQL ? schema migrations are not a problem on quality systems like Oracle and Postgresql. Altering tables and such doesn't stop the database from running at all.
Not indexably. But you can do a hideous many-tables-per-real-table thing where each field gets a tall thin table in PostGRES or MySQL, do a lot of joins to get your data, and index the fields in that.
It's not as awful as it sounds, performance-wise. It is as awful as it sounds in terms of maintainability, of course.
Even more common is when you have a mature application with a lot of users and you need to add new fields to f.ex the user table and you can't because alter table across a sharded db setup will take days or weeks so you end up creating a table that's a hashtable
key, value
and then proceed to pay the cost of joins against it. Most of my excitement around NoSql comes from hard earned pain not from "oh new shiny thing, I got to use it".
I'll take well-understood pain that I can patiently work around, one time, over the course of days or weeks, if the alternative is random bugs that bite you in the night for years at a time.
Joins are no fun, yes, but as you gritted your teeth and implemented those cute little table-based key-value stores, did you find yourself mentally calculating the time required to restore the whole system from backup while muttering tiny prayers? Probably not. Did your code wake up the ops team an average of once per month for several years? Did you lose data? Did you have to put up an apologetic blog post? Did anyone have to get on the phone and rescue customer accounts, one at a time, with profuse apologies and gifts? (Now that is a non-scalable process...)
But at least this argument about maintenance is a real argument. The one about wanting to save time during initial development by skipping the declaration of schemas reads like the punchline of a Dilbert cartoon that you'd find taped to the wall in the devops lunchroom.
@mechanical_fish yes and it was a mysql installation. Weird things happen with all systems once you push them up to the edge of performance both of the hardware and interconnections between servers.
Slow interconnect between servers caused me headaches in the past with mysql for replication. Shared switched did the same. Problems with locks under high contention did the same. Problems with the client libraries the same. In fact all storage systems have similar problems and pain. Some are just more battle tested than others.
There's a difference between flexibility of schema definition and flexibility of schema change[1].
Flexibility of schema change, which NoSQL does not solve, is increasingly more important. Not just for large data stores but also for the data development process and release process. To avoid playing the suboptimal schema-change game both the code and the data need to be updated together. Or at least be given the illusion that they have[2].
A probably obvious question most developers must have asked by now is: if we've built great tools to version source changes, how come we haven't built great tools to version data changes?
See, rename a field and "all your data" is lost, unless you migrate the data from the old field to the new one
This is not true.
I wrote Objectify, a popular third-party java API to App Engine's datastore. The data migration primitives worked out building Objectify are what ScottH built into Morphia, the Java "ORM" system for MongoDB. With a small number of primitives (mostly @AlsoLoad and lifecycle callbacks) it's possible to make significant structure changes on-the-fly with zero downtime.
This is, IMHO, the best thing about schemaless datastores. There's no longer any compelling reason (at least, in the datastore) to take down a system for "scheduled maintenance".
For more information, here is the relevant section of Objectify's documentation:
ORMs are a pain to use. In addition to know the domain you need to map from and the domain you map to, you now also have to understand the mapping process.
...in exchange for dramatically pared-down and simplified code, consistent data access practices, and hundreds of hours of developer time saved. Driving a car is tough too - how to steer, drivers license, gas, insurance, what a PITA. Yet somehow it remains preferable to walking in many cases, despite the latter being mastered by most two year olds.
The same should be said for ODMs as well. A document might be a little more straightforward to map to an object but there is still plenty of miss-match.
I'll agree with this. Document stores don't solve the object-relational impedance mismatch, but they do help (and personally, I find they help more than "a little").
-> Function Scan on generate_series (cost=0.00..12.50 rows=1000 width=4) (a
ctual time=87400.737..512954.539 rows=200000000 loops=1)
Total runtime: 1086336.466 ms
(3 rows)
postgres=# alter table alter_benchmark add test text;
ALTER TABLE
takes insignificant time (less than a second).
I feel so spoiled using PostgreSQL :-D
As I understand it PostgreSQL doesn't rewrite the table to change the column. It might to change the data type of a column. EXPLAIN ANALYZE doesn't work with ALTER TABLE because there is no query plan generated, so I have no idea how quickly the statement actually executed. All I know is it completed in under a second.
You could try `time psql < alter-statement.sql`. I know, it'd not really be useful as it measures lots of overhead. But if it's fast on that, it's fast during an active session.
Schemaless is awesome. Are you dba or a developer? If you're a developer like me schemaless is awesome because of it's flexibility. I focus less time on the how to do stuff and more time on the what stuff should we do.
I've been using Hibernate for 9 years and I finally came to the conclusion that it's just not worth the pain. When working on RDBMS I'm using straight SQL from now on.
Schemaless also dispenses with the ability to declare what correct data is in the schema. For critical apps that's a high caliber footgun. For critical apps that have to integrate with eachother, it's a nice piece of artillery aimed squarely at your foot.
> But when your application is in production you need migrations even with a "schema-less" db!
I disagree. The most frequent use-case I come across is adding columns / fields to a table / collection, and not needing to ALTER TABLE and run a database migration as part of the deployment process to add said fields is extremely awesome.
We extensively tested this inside Viralheat with a write heavy load of over 30,000 writes per second and basically it failed our test. It is not robust for the analytics world is the conclusion we came to. Though, I hope it gets better one day...it has potential.
Our company is a big data company. So our amazing engineers are responsible for storing hundreds of millions of pieces of data per week AND also crunching and analyzing that data. So basically we need a system where we can have incredible write and read performance but also a system that is elastic in nature. Most importantly, it has to be available.
Before I go into more details, MongoDB is great for most people who don't have a high transaction volume. It is easy to setup and easy to use. So if you are in this camp, MongoDB is probably a good fit for you.
We did about two months worth of extensive tests in our lab. Basically two things didn't bode well for us. One, the locking killed reading...we just had a hard time keeping the flow of writes and the flow of data to our statistics cluster alive. Yea, you could use replication but that too didn't work too well performance wise. Two, the sharding didn't seem that robust. As the cluster got bigger and bigger, we started noticing the overhead of keeping it up was getting to be too great. Rather than write in detail, I think this article covers some of the scaling issues we experienced:
We finally used a hybrid system. We went with Membase, now CouchBase, to handle immediate storage and we are now implementing Hadoop for our long term storage needs.
Just reading about your transactional volume, it seems like at it's face MongoDB wouldn't be a good fit for this project. 30k per second is not anywhere MongoDB pretends to live, I think by their own admission. And Sharding in MongoDB, while being called a core feature, was bolted on after core development, probably intended to give Mongo some credibility with those who want it to be more scalable. IMHO if you need that kind of scalability, you're already straying from the Mongo Niche, 2.0.0 notwithstanding.
So agreeing with a point earlier, if you don't like a write lock implementation, and have concerns about scaling, and have a huge transactional volume, just really not something that fits well with MongoDB.
I've been using Mongo now (currently using 1.8) for three (is it almost three now?) years, 2 million hits/day, with a replicated set, and while I've needed maintenance, reindexing, and (gasp) restarts on occasion, never had any of the problems identified by the author of this post.
Bottom line, sounds to me like someone was in over someone's head from an architectural standpoint, made a bad choice of MongoDB, and then blamed 10gen for his own lack of foresight. So while I empathize with the struggle, I fault him for not knowing his options in advance, TESTING first, then betting the farm on a fairly new opensource codebase.
LOTS of other database solutions that would scale better. Analyzing lots and lots of transactional stateless data with MongoDB map-reduce? Well, just kinda like killing yourself by trying to sprint up from the bottom of the Grand Canyon. "You really tried to do that?"
We easily support 10s of millions of writes and reads against Mongo per hour on a very small (single digit) number of shards in the cloud (i.e. crappy disk I/O). While that is around an order of magnitude less than 30k a second I would be surprised if we couldn't scale mostly linearly by adding shards.
P.S. If your stack is KV then you should use a KV store.
It should be noted that this was not really a problem with MongoDB. Foursquare used a poorly-chosen shard key that caused a disproportionate load on one of its shards, and on top of that did not have proper system monitoring in place to alert them that a server was running out of RAM. It should also be noted that no data was lost in the process of resolving the problem.
I thought both Foursquare and 10gen handled the situation then very well, especially considering how much traction the story got (it had all the elements - a popular service, a popular new database, etc.)
I was sort of suggesting that this anonymous post may have come from somebody at Foursquare, since what is described kinda matches what happen there. The 'politics' element could also match because of the common investor - but I see that both 10gen and 4sq have responded here saying that they do not know who wrote this - which I believe.
First, I tried to find any client of ours with a track record like this and have been unsuccessful. I personally have looked at every single customer case that’s every come in (there are about 1600 of them) and cannot match this story to any of them. I am confused as to the origin here, so answers cannot be complete in some cases.
Some comments below, but the most important thing I wanted to say is if you have an issue with MongoDB please reach out so that we can help. https://groups.google.com/group/mongodb-user is the support forum, or try the IRC channel.
> 1. MongoDB issues writes in unsafe ways by default in order to win benchmarks
The reason for this has absolutely nothing to do with benchmarks, and everything to do with the original API design and what we were trying to do with it. To be fair, the uses of MongoDB have shifted a great deal since then, so perhaps the defaults could change.
The philosophy is to give the driver and the user fine grained control over acknowledgement of write completions. Not all writes are created equal, and it makes sense to be able to check on writes in different ways. For example with replica sets, you can do things like “don’t acknowledge this write until its on nodes in at least 2 data centers.”
> 2. MongoDB can lose data in many startling ways
> 1. They just disappeared sometimes. Cause unknown.
There has never been a case of a record disappearing that we either have not been able to trace to a bug that was fixed immediately, or other environmental issues. If you can link to a case number, we can at least try to understand or explain what happened. Clearly a case like this would be incredibly serious, and if this did happen to you I hope you told us and if you did, we were able to understand and fix immediately.
> 2. Recovery on corrupt database was not successful, pre transaction log.
This is expected, repairing was generally meant for single servers, which itself is not recommended without journaling. If a secondary crashes without journaling, you should resync it from the primary. As an FYI, journaling is the default and almost always used in v2.0.
> 3. Replication between master and slave had gaps in the oplogs, causing slaves to be missing records the master had. Yes, there is no checksum, and yes, the replication status had the slaves current
Do you have the case number? I do not see a case where this happened, but if true would obviously be a critical bug.
> 4. Replication just stops sometimes, without error. Monitor
> your replication status!
If you mean that an error condition can occur without issuing errors to a client, then yes, this is possible. If you want verification that replication is working at write time, you can do it with w=2 getLastError parameter.
> 3. MongoDB requires a global write lock to issue any write
> Under a write-heavy load, this will kill you. If you run a blog, you maybe don't care b/c your R:W ratio is so high.
The read/write lock is definitely an issue, but a lot of progress made and more to come. 2.0 introduced better yielding, reducing the scenarios where locks are held through slow IO operations. 2.2 will continue the yielding improvements and introduce finer grained concurrency.
> 4. MongoDB's sharding doesn't work that well under load
> Adding a shard under heavy load is a nightmare. Mongo either moves chunks between shards so quickly it DOSes the production traffic, or refuses to more chunks altogether.
Once a system is at or exceeding its capacity, moving data off is of course going to be hard. I talk about this in every single presentation I’ve ever given about sharding[0]: do no wait too long to add capacity. If you try to add capacity to a system at 100% utilization, it is not going to work.
> 5. mongos is unreliable
> The mongod/config server/mongos architecture is actually pretty reasonable and clever. Unfortunately, mongos is complete garbage. Under load, it crashed anywhere from every few hours to every few days. Restart supervision didn't always help b/c sometimes it would throw some assertion that would bail out a critical thread, but the process would stay running. Double fail.
I know of no such critical thread, can you send more details?
> 6. MongoDB actually once deleted the entire dataset
> MongoDB, 1.6, in replica set configuration, would sometimes determine the wrong node (often an empty node) was the freshest copy of the data available. It would then DELETE ALL THE DATA ON THE REPLICA (which may have been the 700GB of good data)
> They fixed this in 1.8, thank god.
Cannot find any relevant client issue, case nor commit. Can you please send something that we can look at?
> 7. Things were shipped that should have never been shipped
> Things with known, embarrassing bugs that could cause data problems were in "stable" releases--and often we weren't told about these issues until after they bit us, and then only b/c we had a super duper crazy platinum support contract with 10gen.
There is no crazy platinum contract and every issue we every find is put into the public jira. Every fix we make is public. Fixes have cases which are public. Without specifics, this is incredibly hard to discuss. When we do fix bugs we will try to get to users as fast as possible.
> 8. Replication was lackluster on busy servers
This simply sounds like a case of an overloaded server. I mentioned before, but if you want guaranteed replication, use w=2 form of getLastError.
> But, the real problem:
> 1. Don't lose data, be very deterministic with data
> 2. Employ practices to stay available
> 3. Multi-node scalability
> 4. Minimize latency at 99% and 95%
> 5. Raw req/s per resource
> 10gen's order seems to be, #5, then everything else in some order. #1 ain't in the top 3.
This is simply not true. Look at commits, look at what fixes we have made when. We have never shipped a release with a secret bug or anything remotely close to that and then secretly told certain clients. To be honest, if we were focused on raw req/s we would fix some of the code paths that waste a ton of cpu cycles. If we really cared about benchmark performance over anything else we would have dealt with the locking issues earlier so multi-threaded benchmarks would be better. (Even the most naive user benchmarks are usually multi-threaded.)
MongoDB is still a new product, there are definitely rough edges, and a seemingly infinite list of things to do.[1]
If you want to come talk to the MongoDB team, both our offices hold open office hours[2] where you can come and talk to the actual development teams. We try to be incredibly open, so please come and get to know us.
One addendum to Eliot's "both our offices hold open office hours"; we (10gen) also recently opened an office in London.
Although we don't yet have a fixed office hours schedule, we typically hold them every 2 weeks. The exact dates are announced via the local MongoDB Meetup Group°; we always hold the hours at "Look Mum No Hands" on Old Street.
At least one (and often several) of our Engineers make themselves available during this time to answer any questions and assist with MongoDB problems.
We've been using Mongo for almost a year now, and we've not seen any of the major issues such as data loss referred to. We've seen some of the growing pains of a quickly moving, dynamic platform, but nothing outside of the realm of what is reasonable for such a powerful solution. It's true that implementing sharding is no simple task, but with enough planning up front, you'll find yourself able to scale horizontally very quickly. After a couple of weeks of planning, we wound up making a few small changes in our codebase to migrate from master/slave to a sharded environment. Not a huge undertaking by any stretch, provided the current flexibility of our platform. Also, due to the fact that 10gen does make all bug information publicly available, we've managed to get it done with zero surprises.
> If you want to come talk to the MongoDB team, both our offices hold open office hours[2] where you can come and talk to the actual development teams. We try to be incredibly open, so please come and get to know us.
I envy how all your (potential) customers are from California.
Besides office hours in California NY and London we also have user groups in many cities http://www.10gen.com/user-groups and have (one day, very inexpensive) developer conferences frequently (next two in Dallas and Seattle).
Most of the best practices/gotchas can be found by reading the online documentation. Of all the replies Eliot gives they were either plainly obvious (oh, you have a system under heavy load and you're surprised that it gets worse when you give it another task to do?) or mentioned in the documentation. If you're planning on using something - especially for a production system - I sure hope you at least read all the available documentation.
I don't think a short doc is of any help for evaluators. You shouldn't be basing your decision on 400 words and some bullet points. If you're serious about your datastore then you should treat it seriously.
When I was doing my research and came across a bunch of "Why not to use MongoDB" articles, I looked at alternatives solution to see if there was anything "better." Granted NoSQL is the new kid on the block but I wanted to see what my options were. Guess what I'm using, MongoDB. Why? Their documentation is fan-f'n-tastic. Their newsgroup support is just as good, lots of folks who help troubleshoot issue, including the developers themselves.
The original story was submitted by nomoremongo, not nmongo. The original story was very detailed and identified known problems with MongoDB. This post is a one-liner.
So prove you were the one who wrote the account in pastebin.
More evidence: nomoremongo and nmongo have some differences in their writing style. nomoremongo uses semicolons properly, nmongo does not. An even bigger difference: nmongo doesn't capitalize his Is.
Yes, i am a troll, and things have gotten a little out of hand.
Just because a story was very successful at fishing for up-votes, it doesn't have to be true, people around here need to be a lot more sceptical.
And i think everyone who truly pays attention will know by now that MongoDB is the next MySQL.
Whether you are the original poster or not, you're not a troll, you're a sociopath emboldened by anonymity.
Cloak yourself in some idealistic mission if it makes you feel good- but your mission isn't to make the point that "people around here need to be a lot more skeptical"- You're a sociopath that enjoys kicking a hornet's nest just to watch the reaction.
My intention was to troll as many hipsters as possible and make them a little more aware of how easy to manipulate they are, without even providing the slightest bit of evidence.
It cracks me up that there are startups out there right now, making foolish architecture decisions based on the FUD i'm spreading.
Start thinking for yourself!
And in the process of discrediting, you might have turned many people away from MongoDB. You actions seem irresponsible to me. Unbeknownst to you at the time of posting, I'm sure, but your blog has gone somewhat viral, and it could take 10gen a while to recover from the negative press. Did you consider this when posting?
Kudos to Eliot for coming on and answering your phony accusations. I feel sorry for him though as he has obviously spent a great deal of time in responding, when he could have been doing other important things, like fixing urgent bugs. As others have pointed out, this is the mark of a company who take very good care of their customers. Customer service is what differentiates chiefs from cowboys.
HN is an important community resource, especially for people with little startup / dev experience. I would urge you to think next time before being so irresponsible.
There are only a few comments from credible sources in this thread, and none of those had anything negative to say about MongoDB, don't believe blindly.
Interesting you characterise mongodb users as hipsters - why is that? (at the risk of engaging a troll)
We use mongodb extensively, but I get the hipster feeling also, mostly because they hold office hours at Look Mum No Hands in Old Street, which is ultra proto-hipster.
If true, you do realize that you falsely tarnished a real company and product. If this was supposed to be some lesson in verifying sources and information, I think you went about it in the wrong way. What if someone started spreading misinformation about nmongo to prove a point (even an insignificant and unrelated one), would you like that?
What exactly was a hoax? The document pasted was rather detailed and, while somewhat overblown, was obviously written by someone who knew what they were talking about. It contains a lot of criticism of design decisions by MongoDB; these are pretty common and being opinion, can't really be called a hoax.
There's also a couple of anecdotes of MongoDB supposedly failing in various ways in the author's experience. Are you saying those were fake?
Just because you submitted the document here does not mean you wrote it. Pastebin logs the document as being submitted on 5th Nov. http://pastebin.com/FD3xe6Jt
I don't buy it. I don't think nmongo wrote the doc on pastebin. Maybe I'm overrating my character-detection abilities, but it didn't smell like it was written by some immature time-wasting kid.
edit: I use mongo in prod; very much a student of the "right tool for the job" school. Not trying to add or subtract weight from the original text; ambivalence reigns supreme regarding internet nosql battles. Just saying that my possibly unreliable circuits detect quite a gulf between the original document and the OP's hysterical, caps-lock-engaged cry for attention here.
This admission has my "spider-sense" tingling also. The communication style between this guy and the author of the pastebin log seems so different.
It is plausible that someone guessed the password of nmongo's throwaway account, quickly changed that password, and then started posting the whole thing was a hoax.
This rant is completely outdated and it shows: "pre transaction log" "fixed this in 1.8". You realize MongoDB is at 2.0 now and the transaction log was introduced in 1.8, right? Yes, MongoDB had problems but since the transaction log it's pretty good. I have used MongoDB since early 1.3 and I knew what I was doing and we never lost a bit of data. There is a tradeoff -- while MongoDB handled write load easily that a MySQL box with 2-3 times the RAM , I/O capability couldn't at all we understood the bleeding edge of using MongoDB back then. We have, for example, kept a snapshot slave which shot itself down often, took an LVM snapshot then continued replicating. Never needed those.
We have meticulously kept a QA server pair around and the only time when I have ran into a data loss problem was when I have hosed one of those -- but only one and even the QA department could continue (and hosing that server was me not knowing that Redhat 5 had separate e4fsprogs and e2fsprogs, only partially MongoDB fault but now it works without O_DIRECT so even this would not be a problem any more) . Never understood for example how could foursquare get where they got to -- didnt they have a QA copy similarly?
""This rant is completely outdated and it shows: "pre transaction log" "fixed this in 1.8". You realize MongoDB is at 2.0 now and the transaction log was introduced in 1.8, right?""
You do realize that 1.8 vs 2.0 is not eons ago, but just a few months, right? And you do realize that the cavalier-throw-all-caution-to-the-wind development attitude that cause all this problems can and does continue to exist? You don't eliminate that just because you added a transaction log (as late as in 1.6, IIRC).
Well, I worked in Vodafone (and Nokia) in very large (laaarge) projects, serving ~50 milions users. Years ago, no hope for NoSQL, we used MySQL. We hit at least 10/20 bugs, solved by 'hotpatch' from Sun. So? I think as developers we should get used to bugs and patches. Should I write a post "don't use MySQL?". We also hit several bugs in the generational garbage collector. Stop using Java?
I don't feel the drama here.
Tongue-in-cheek aside, the author's point is that regardless of its current status, MongoDB has been pushed on a lot of people hungry for performance/simplicity; in that singular pursuit they may be setting themselves up for disaster later on. Most developers have a (perhaps unspoken) assumption that a successful write to a database means that data Will Not Disappear. If Mongo violates this assumption, then either developers' attitudes have to change or they should look at other software to avoid being bitten.
Take something like sockets: by using TCP, I am telling my development environment that I would like an unbroken, sequential stream of traffic to another endpoint. Just as importantly, I would like to be notified if this ever is not the case. If I discovered errors in my TCP stack, I want those fixed pronto because any kind of workaround would be reimplementing the very task TCP is meant to cover -- I might as well write my own sequencing and retransmission logic on top of UDP!
Then I think it is way easier to write a post "Do not use technology, go back to the cave". Any technology has chances to fail, can be SQL, Cloud, yadda yadda. And if you want to work on the 'edge' (innovating to disrupt your competitors), that's a risk you should accept. Blaming the tools you use to achieve that point is childish.
I assure you that, back when MySQL was the same age as Mongo is today, "don't use MySQL" was conventional wisdom... among those who could find and afford Oracle DBAs. ;)
(Though there weren't a lot of blog posts about it, because the word blog had not been invented yet; blogs developed along with... MySQL.)
It will be interesting to watch Mongo as it matures over the next ten years. Unlike MySQL, it is competing against ubiquitously-deployed, well-known, well-worn open-source RDBMS packages, so its history is unlikely to unfold in the same way that MySQL's did.
"Don't Use MySQL" still should be conventional wisdom.
Indeed it's the only database system I have ever used where a system with a single transaction running only multi-row inserts into a table can (and frequently does) deadlock against itself. Don't get me wrong, time was when it was easier to use than PostgreSQL but that time is long since passed.
One area I have continued to recommend MySQL has been in areas of content management but to be honest in many of these areas, NoSQL is actually a better fit.
Given the size and success of MySQL deployments, it's getting awfully hard to evangelize that particular religion. I prefer Postgres, but life is too short to argue about it.
MySQL has a niche too. It's somewhere between that of a NoSQL database and that of a real RDBMS. MySQL does well for single app databases (as NoSQL does), but where the relational data then needs to be fed through other database systems for multi-app access.
If Facebook are so happy with MySQL, why did they develop Cassandra?
Derek Harris puts the larger point about Facebook's trouble with MySQL: "By and large, [MySQL] does [for Facebook] what it’s designed to do, which is to keep up with the myriad status updates and other data that populate users’ profiles. Rather, [the problem is] that Facebook had to expend so much money and so many man-hours to get there."
It's not about the size of the database or deployment. It's about the number of applications interacting across the same relational interface. The fact that applications can turn off strict mode is a big blow in this area. You can't be sure your data is "obviously correct" to paraphrase a different HN post.
One of my customers logs certain web data into a MySQL database and loads/processes it in a PostgreSQL database every day. The data is then accessed in Pg by at least three different applications.
I couldn't agree more with this analysis, with the added addition that the single threaded nature of the JS interpreter can also cause really bad & unexpected performance things to happen.
Most of the people who are excited about mongo, have never used it in a high volume environment, or with a large dataset. We used it for a medium sized app at my last employer, with paid support from 10gen, and everyone on the project walked away wishing we had stayed with a more mature data store.
Of course things work well when traffic is low, everything fits in memory, and there are no shards.
I would love to see a thorough approach in which such claims are actually shown and can be reproduced. This helps everyone immensely...from 10gen to people looking to adopt.
Disclosure: I wrote a product called Citrusleaf, which also plays in the NoSQL space.
My focus in starting Citruseaf wasn't features, it was operational dependability. I had worked at companies who had to take their system offline when they had the greatest exposure - like getting massive load from the Yahoo front page (back in the day). Citrusleaf focuses on monitoring, integration with monitoring software, operations. We call ourselves a real-time database because we've focused on predictable performance (and very high performance).
We don't have as many features as mongo. You can't do a javascript/json long running batch job. We'll get around to features - right now we're focused on uptime and operational efficiency. Our customers are in digital advertising, where they have 50,000 transactions per second on terabyte datasets (see us at ad:tech in NYC this coming week).
This theory that "mongo is designed to run on in-memory data sets" is, frankly, terrible --- simply because mongo doesn't give you the control to keep you in memory. You don't know when you're going to spill out of memory. There's no way to "timeout" a page cache IO. There's no asynchronous interface for page IO. For all of these reasons - and our internal testing showing page IO is 5x slower than aio; the reason all professional databases use aio and raw devices - we coded Citrusleaf using normal multithreaded io strategies.
With Citrusleaf, we do it differently, and that difference is huge. We keep our indexes in memory. Our indexes are the most efficient anywhere - more objects, fea. You configure Citrusleaf with the amount of memory you want to use, and apply policies when you start flowing out of memory. Like not taking writes. Like expiring the least-recently-used data.
That's an example of our focus on operations. If your application use pattern changes, you can't have your database go down, or go so slowly as to be nearly unusable.
Again, take my comments with a grain of salt, but with Citrusleaf you'll have great uptime, fewer servers, a far less complex installation. Sure, it's not free, but talk to us and we'll find a way to make it work for your project.
Looks interesting. May I suggest you provide a hosted service? With mongo, I tried it online and got a feel for it before we signed up, and there are multiple hosted services so I didn't have to worry about setting it up in the cloud. Looking at citrusleaf.com, though the blurb sounds like I might like it, nothing else really helps me. It's NoSQL, but that doesn't say anything. I know that memcache has a use case, and I know mongo's use case, and redis', but I don't see yours.
(PS I know you're enterprise software, but still).
We use Citrusleaf at my job, definitely one of the fastest nosql stores I've seen. However it doesn't nearly have the kind of flexiblity that mongodb has, we tend to use it more as a persistent cache like redis then a real database, its not quite as easily to write queries in it.
Except that those are not the words on a libelous, frustrated competitor. I've seen these claims validated over and over again both by posts on HN but also people I trust that have worked with MongoDB under load.
Performance benchmarks stop being meaningful when you realize that you can't fix the problem you're having without committing to a system-wide shutdown of unknown duration.
The main point that the author makes is that the creators of MongoDB do not follow rigourous practices. If this doesn't bother you, please go right ahead and use anything you wish.
In most case, I think 10gen will be able to dispute false claims.
With regard to nomongo's post, 10gen can check their record and say whether they did or didn't have a customer with premium support account with similar use case and issues. 10gen can also counter such complaints with testimonials from customers with similar use cases.
But note that nomongo's post is not about individual issues but about his concern that 10gen's priorities are misplaced which he should have wrote first instead of last. Rest was just about how his concern came about. Current status of technical issues he experienced are irrelevant to his concern.
People seem to be jumping on a lot of the NoSQL stuff for no good reason. You can get a lot of mileage out of something like Postgres or Mysql, and they work pretty well for a lot of things. Ok, if you get huge, you might have to figure out something else, but that's a good problem to have. On the other hand, if you've lost all your data, you're not going to get huge.
I had to use MongoDB recently, and I wasn't very pleased with it. It wasn't really appropriate for the project, which had data that would have fit better in a relational DB.
A story from a newly created account by a person nobody can verify is real and asking other people to submit his rant (to gain what? credibility to his story?)
Attacking the messenger is shallow. How about you look at the points - whether valid or not - he or she raises instead and try to refute them? It matters little if that person is well known or someone entirely new. I don't see how the relative anonymity of a person is in any way related to his or her credibility.
Besides, calling a position you don't agree with "trolling" with no further argumentation is 4chan level of discourse, and I know what I'm talking about when I say this. I will not take a side in this discussion because I'm not qualified to voice an opinion over things I do not understand well enough (databases), but I had to point this out.
it's still a valid point as there are no references to back up any off the claims in the post. he should at least have included links to issues in their jira or some way of replicating the problem he is experiencing.
as it stands now it's not fact based and could as much be opinion as there is no way to weight the merit of the claims against anything substantial :(
I just dislike calling anyone who prefers to stay in relative anonymity (for whatever reason) or is simply new to a community "not credible", at least if it's only because of those attributes. It's a thinly veiled ad hominem.
(Sorry, possibly excessive snark. That said, I think that blog post is a good example of one of this pastebin author's points: at least historically, benchmark numbers have been a big focus for Mongo developers.)
Anyone using Mongo currently has to be aware there are likely to be some teething issues as it is very new technology.
I haven't used it in production (yet), but I would have no fear of using it today. I would run regular consistency monitoring and validation around critical data just like I do with our SQL databases.
I'm willing to take my part of the pain and inconvenience in making technology like this stable.
You could have written this about any adolescent SQL server BITD. All the tools you use today had to go through this process.
For me Mongo is awesome and getting more awesome. Mongo and technology like it is the reason I still get excited about writing new apps.
This is textbook projecting. The team deployed an immature database and tried to push its limits, and now they're saying: "it sucks!". Sure, a 2 year-old database is the problem, not your ability to make architectural decisions. Sounds like someone is looking for a scapegoat. They took a risk and failed and this is just a poor way of coping with it.
It's OK to publish your experiences on your blog (which they did a few days ago). It's NOT OK to go around the Internets publishing "anonymous" articles about how MongoDB sucked for you, as if no one will see what you did there. That's just defamation, folks.
On a side note, we also looked at MongoDB and, after running a few tests, we concluded that it is a glorified key-value pair storage. That said, we did use it in a few small-scale projects and it works great.
The bottom line: choose the right tool for the job and don't bitch about the tools when you fail.
I would say however that a significant subset of NoSQL deployments (perhaps even a large majority) are by definition lacking in sound architectural decisions. I'd argue the same goes for ORM-based database access too.....
The failure exists because many developers don't ask a few key questions up front:
1) What exactly can the database do for us?
2) Which of these do we need? For example, is the database going to be a point of integration?
3) What failsafe or security measures do we want to count on in the database?>
These don't always have objectively right/wrong answers but failure to ask the questions leads to poor use of databases regardless of what technologies are chosen.
Is there someone here in HN who has used MongoDB with large data sets, high concurrency application? Can someone else share some light? And maybe a more recent version of MongoDB...
There is a team in the company I work for who has deployed Mongo to production, with, I suppose, a heavy load. I can check with them. I heard no complains, but the company is large enough for me not to hear everything.
Very interesting. I recently worked on a little side project using MongoDB and I noticed during testing that some records would disappear at random. Glad to see this has happened to others. I suppose it's time to check out Redis.
I feel like a dick, but I have got to ask. Is it Disney? Disney is on both the couchbase and 10gen sites. Both sites mention that they are using their NoSQL solutions to power their social and online games. Couchbase powers Zynga and can arguably be considered the leader on this specific market. Am I close?
Losing data is one of the most serious bugs. When I am using a DBMS in production, I have to rely on it 100%.
I believe the complains made could be real because MongoDB is highly optimized for speed. But, as long as there is no documented and maybe reproducible case, this post can't be taken for real.
I'm very skeptical of the lost data claims. People using MongoDB are writing new code. New code has bugs. When data is lost, it's certainly more convenient to claim 'the datastore ate it' than to admit you have a critical bug in your own code.
I agree. And this is why I like CouchDB's versioning. In similar cases we could track down unwanted deletes using previous versions of the document in question. Without those, it could easily be interpreted as "data loss".
10gen might become a victim of it's own popularity. I have heard:
* Yes, playing with Mongo is playing with fire. Know what you are doing. We don't claim that you should use us as your only database.
* We're going to fix these issues soon. The beginning days of MySQL etc were also frightening, with Oracle and MS SQL Server admins warning of all the dangerous things that can happen.
If they confront their issues, I think it's just a matter of time before Mongo wins the NoSQL race. They have what matters most - good people, a brand, and great expectations from customers.
Go to 10gens site. Watch some of the videos and see the huge volume of data (in TPS or in TBs) that people are working with. Go on the google group and see what problems people have. Don't take an anonymous post on pastebin as the gospel.
A lot of trolling here, I've never had any issues of missing data. When claiming a db (as big and popular as mongodb) doesn't work, you should include references, company, examples on how to reproduce, etc. Enough said: http://www.mongodb.org/display/DOCS/Production+Deployments
Just as you claim that people who have had problems are trolling, so could be said about your own claim. These people are basing their opinion on their own (negative) experience, while you are doing the same based on your own (positive) experience. How is that any different?
This thread includes numerous examples of people who did indeed have grave issues with Mongo. They're not any less valid that your own example (or the ones you link to). In these topics there's always going to be positive and negative takes, but calling people trolls because - again, like you -they voice their opinion is harsh.
Are the two really equal? Given two independent sources that you don't really know: a) I've never lost data with MongoDB, b) It wiped my database
Doesn't (b) have a certain burden of proof? Maybe he had a bug in his code? Maybe he did something weird with his server? Maybe he didn't follow upgrade directions properly? Maybe he got hacked? Is it really too much to ask for something verifiable? Steps to reproduce? Log files? Assuming that the person isn't just malicious, even a before and after of db.xx.count()?
These posts are exceptionally well-timed for me. I'm currently wrangling with one of those problems that is just not solved well with relational databases, or even the flat document store that my company already uses. I've been looking hard at Redis and Mongo, and of late I'm leaning towards Mongo. You know what? Having read these posts and the threads - and having extracted what little in the way of factual datapoints I could from them - I'm pretty sure I'll still be riding into production with Mongo.
Some of you guys who were all aboard the NOSQL UBER ALLES hype train a year or two ago now seem to be swinging back - with scrapes and bruises from some truly harebrained misdeployments, no doubt - to a reactionary 'All NoSQL are doomed to reimplement everything relational' nihilism. Back to shitty OR tools and ugly-ass joins for everyone, damnit! Harumph. I could write a novel just quoting and responding to some of the stupid pronouncements and prescriptions for correctness on these Mongo threads' comments.
Anyways. With regards to this specific post:
Let's rewind a couple of years. I work for a significantly smaller company than our anon raconteur, from the sound of it. At roughly the same time as he adopted Mongo, I was also looking hard at it, to solve some problems where the relational options available to us weren't going to cut the mustard. Damn, did Mongo look cool, fun even. The flexibility of having arbitrary object graphs in it and querying down into subdocument properties with real indexing on them, well, it sets nearly any developer's heart a-flutter, particularly those of us who work on dynamic web stuff a fair bit.
Sadly, I have to be an engineer and pragmatist first, I have to think about much more than what is sexy and comfortable for devs. I've been through my share of 3AM wake-up world-enders, I've learned the hard lessons. I considered variables like basic maintainability by ops people, credibility of the vendor, track record, robust redundancy and availability solutions, how far up shit creek we'd be in a disaster recovery scenario, etc. And after thorough research I decided that, for my much smaller company which can afford to be judiciously bleeding-edge where it makes sense to, Mongo was just not clearing the bar. I sucked it up and used unsexy properly normalized relational database tables, then utilized memory caching and async updates to try and paper over the performance issues inherent in that scheme.
What was anon doing? Charging full steam ahead into the wild unknown with Mongo, on an effort that was apparently important to a userbase of millions at a "high profile" company. That's some mighty responsible stewardship of the company, or even just the IT department's, broader concerns right there. Now, I understand that it totally makes sense to have used Mongo 1.x as a scrappy startup on a greenfield project, no problem. But this guy was in a different situation. At that scale in a BFC, conservatism rules, and it rules for a reason.
I think I am starting to understand why anon is anon.
In any case, we're likely going to roll with Mongo soon. It is indeed maturing, and I'm a lot more comfortable with it on all of my criteria these days. I have possibly read more of the JIRA issues than some of the devs, and they are prioritizing the Right Things - at least for my tastes. By my estimation it is on the right track.
Even having not used it in production yet, I can identify some things people are complaining about here as complete and utter RTFM-fail, misunderstanding of what it is they're deploying and whether what they expect out of it is realistic before they begin. I understand the tradeoffs of Mongo, and in my particular situation they make good sense.
There is/was a LOT of hype in NoSQL. Hype and very little understanding what NoSQL is about and specifically why/when choosing a NoSQL database makes sense and when it does not.
It is not about SQL vs. not. It is about consistency, availability, and partition tolerance, and which of these you are willing to give up. Surprisingly few people know about the CAP theorem and what it implies.
Generally there two main reasons why you switch to NoSQL (Not Only SQL) databases.
1. You need to scale out (add more storage and query capacity by adding more machines).
2. You do not want to be locked into a relational schema.
There is no magic in NoSQL! To scale out these stores give up exactly those features that would impede scaling out (for example global transactions).
What one has to realize that you give up a lot by letting go of relational databases: Fast ad hoc queries, transactions, consistency, and the entire theory and research behind it.
I don't see why relational databases are "unsexy". A good query planner is almost a work of art and it is amazing what they can do. In fact we use them alongside HBase.
Instead of ad hoc queries you either get slow map/reduce type "queries" or you need to plan your queries ahead of time and denormalize the data accordingly at insert time.
You better have very good reasons for the switch.
When we evaluated NoSQL stores a while back (for #1 type problems) I was quite the skeptic. We looked at Riak, Redis, MongoDB, CouchDB, Cassandra, and HBase).
Eventually we settled on HBase because needed consistency over availability and we needed more than just a key value store, and we already some Hadoop projects... and I started to drink the cool-aid :)
Personally, I am not a big fan of eventually consistent (but highly available) stores, because it is extremely difficult to reason about the state of the store; and the application layers bears a lot of extra complexity. But your mileage may vary.
HBase of course is new as well, and I needed to start fixing bugs and adding new features that we needed.
As with "Java is better than C++" type discussions, here too, what store to use depends on the use case. As parent points out any hype about anything is a bad thing, because it typically replaces reasons as an instrument of decision making.
(not sure what I was getting at, so I'll just stop here).
I think one of the reasons that NoSQL databases have been oversold is that a lot of projects don't have people on them who are good at engineering databases. The result is that folks use ORM's badly.
If you are going to use the database just to store data structures form your program, you might as well use NoSQL db's. However, in most cases, you get integration and migration wins by:
1) Placing your engineering effort on the database. Looking at the sort of real world data you are collecting, modelling it well in the database and then presenting an API to the application. The API will either be a relational one (i.e. views) or a procedural one (stored procedures). After a couple of iterations, the schema shouldn't be being fundamentally changed too much though there could be some minor tweaking.
2) Now, with a good API you can build an application on the database using a methodology of your choice. This could be done in an agile way.
Now if integration is not a goal, then sure you can do all the data validation in your application and you can use NoSQL databases. But relational databases are also powerful integration tools in their own right. I can't imagine LedgerSMB, for example, doing well on anything else for this reason alone.
This post is unparalleled FUD. We use MongoDB in production and all the issues we've encountered have been either environment or configuration related.
There are plenty of things about MongoDB I don't like but this OP is a total coward. If you've got something to say, put your name on it and come out in the open.
This type of post is the worst of its "hiding behind Internet anonymity" kind.
And for the record, I don't think Oracle is behind this. They're confident in their Exadata offering and have little to gain by posting this kind of crap around MongoDB. Besides, Larry Ellison has never been afraid to openly taunt his competitors.
I wouldn't use my True Name for serious criticism of a tool that may become popular, because I expect that to find that a career-limiting move. E.g., I think MySQL has reckless contempt for data integrity, but that doesn't mean I'd rather starve than ever be considered by a hiring manager at any MySQL shop.
This is simply anonymous vile FUD. It does not even read like an honest story. There is also not one reference to a real case, or anything that can be sourced at all. There seems to be a FUD offensive going against MongoDB for some reason.
The thing that worries me most is that this article got so many upvotes.
Somewhat off topic, but mongo recently came up in a design discussion. Some of the points here are intereting to consider/evaluate against the most recent version.
My question is, given Mongo and the other NoSQL solutions, has anyone come up with a comparison of strong and weak points across different application types? Feature lists really aren't always useful - as noted about things like code maturity, etc.
I've had similar performance in my use case (big joins and very large tables) using PostgreSQL (in my case) and disabling sync() to disk, and tuning the buffers, as with the various NoSQL I tried.
It seems to me that NoSQL does not really bring speed. Just scalability and a different model. Hopefully most of them don't lose data at random. PG certainly doesn't, even with sync() off.
I think MongoDB's biggest problem is people expect mongo to take care of all their scalability issues for them. In reality once you start hitting a certain scale you need to start rearchitecting your system, no datastore can automatically handle this for you, but perhaps mongoDB let you get a little bit bigger before this became a big issue.
I presume he's referring to the silent truncation stunt Mongo has been accused of by others before.
"Oh THAT", say Mongo boosters. "You should have read the IRC logs of June 22nd", they continue, "there was a 3 line patch posted in the channel. It totes fixes that problem".
I wonder why nobody mentions that MongoDB supports x86 CPU architecture ONLY.
It keeps unaligned data in its memory structures, and all operations are explicitly little-endian.
So, no chance to get it running on any ARM, MIPS, PowerPC or SPARC
Yes, if people are sing MongoDB for applications requiring ACID, it is probably not the best fit. However, there are many great use cases that MongoDB is a great fit for, sometimes characterized by needing a lot of read slaves for complex analytics, where data loss is not a lose-the-company proposition, rapid prototyping, etc. I just ported a Java GWT + Objectify appengine application to run on an EC2 with MongoDB and it was shockingly easy to do. Also, you an give up some write performance for increased data safety.
MongoDB (along with PostgreSQL, RDF data sotres, and sometimes Neo4J) is solidly in my preferred tool set.
I agree with the author that MongoDB is green, maybe not quite ready for prime time yet. All things considered, I realize we took a risk by switching and while I am quite happy with MongoDB, I do worry that at some point we will experience a failure condition we might not be able to recover from.
I have never used MongoDB in production but have thought about it. To me though it is just another architectural decision that you need to base around risk and reward.
MongoDB is awesome at certain things. But it is still not at a tried and true level as say PostgreSQL or MySQL.
I am skeptical of the article but only because it is all too easy to fault new projects. However I would be curious to know 10gen's development practices as compared to say Postgres or SQLite (I have heard awesome things about SQLite's development testing).
Am I the only one who's ever forgotten the where clause in an update or delete statement? For all the proof presented, for all we know that's the cause of their lost data.
Some guy posted some unproven random claims on pastebin, and people take it that serious? "It must be true, I've seen it on the Internet!" 500+ upvotes? c'mon.
We are all engineers and MongoDB is open source. Maybe the easiest way to evaluate the project is to review the source code. This will give at least some idea about quality of MongoDB - of course MongoDB can still be a great product even if code is not written well but it is important indicator.
What is your comment about code written? Is it maintainable? Is it modular? Doe s it seem well written?
As the author made no specific claims or didn't show any failing test cases which can be discussed and reasoned about in a sensible way, it's going to be very hard to confirm or refute anything.
Responding to anonymous flames in the internet is a waste of time.
After reading most of the comments now , I believe it is. I cant understand why would people just try to bring down something which is popular by all means by doing such rash publicity . HN should be a little careful while posting such links without verifying the source.
No, of course not. But the point is that it would be easy to generate a post like this just by going back over the critical bug list for previous versions, and throwing in an unsubstantiated claim of 'mysterious data loss.' And Oracle does have the incentive and the means to engage in an old fashioned Microsoft-style FUD campaign. They've already launched their 'embrace-and-extend' strategy: http://www.oracle.com/us/corporate/press/519708 .
DISCLAIMER: I submitted this story and it is in fact a hoax that has gone too far, you got trolled, truly frightening how gullible most of you are. DO NOT BELIEVE EVERYTHING YOU READ ON THE INTERNET!
That's an unsubstantiated, and probably false, assumption. There are a lot of people in this thread debating the merits of the complaints and most doubt those merits. Even more people are not participating in this thread at all and you don't have a clue what they think. If they're anything like me, they thought "another day, another complaint about product X." and didn't draw any conclusions from this post.
You seem too convinced of your own greatness to seriously consider the possibility that maybe most people here are actually sensible and don't rush to judgments. To me, this post portrays you as an 18-year old that was the smartest in his highschool, but has yet to realize and accept just how awfully many people out there are smarter and wiser. God help the people around you if you're older than 28.
I run engineering for foursquare. About a year and a half ago my colleagues and I and made the decision to migrate to MongoDB for our primary data store. Currently we have dozens of MongoDB instances across several different data clusters storing over a TB of data and handling 10s of thousands of requests per second (mostly reads but the write load is reasonably high as well).
Have we run into problems with MongoDB along the way? Yes, of course we have. It is a new technology and problems happen.
Have they been problematic enough to seriously threaten our data? No they have not.
Has Eliot and the rest of his staff @ 10Gen been extremely responsive and helpful whenever we run into problems? Yes, absolutely. Their level of support is amazing.
MongoDB is a complicated beast (as are most datastores). It makes tradeoffs that you need to understand when thinking about using it. It's not necessarily for everyone. But it most certainly can be used by serious companies building serious products. Foursquare is proof of that.
I'm happy to answer any questions about our experience that the HN community might have.
-harryh