Why I Migrated Away From MongoDB (svs.io)
203 points by svs on Sept 17, 2012 | 200 comments



You were fortunate to recognize that MongoDB was the wrong tool for your job, and lucky to be able to move to Postgres instead of continuing to throw your time and effort away. I see the ad hominem "you're an ignorant idiot" attacks have already started, along with advice like using regexes to do case-insensitive searches. Watching the NoSQL "movement" encounter the problems RDBMSs fixed 20 years ago and then hand-wave and kludge them away is frustrating. I wrote about some of this in http://typicalprogrammer.com/?p=14.

Look at the bright side: programmers who are writing NoSQL-backed apps are creating the fossil fuel that will keep programmers who know RDBMSs working into our retirement years. I already have more work than I can do fixing web apps that were built around crap data management tools that failed to scale beyond a few thousand users. Your Postgres expertise will still be a money-making skill long after MongoDB is forgotten.


The thing with the NoSQL guys is that many of them seem not to be in a position to make an educated comparison. For example, an, uhh, enthusiastic MongoDB advocate recently informed me that MongoDB was superior to Oracle because in Oracle you had to poll a table to see if it changed. Except, no, that isn't actually true: http://docs.oracle.com/cd/B19306_01/appdev.102/b14251/adfns_... - and that document is from 2005. And you could do a trigger and an AQ message/callback 5 years before that (at least). You haven't needed to poll an Oracle database for changes in a loooong time.

Basically, you have to double-check and cross-reference every evangelism point, because as you say, the NoSQL guys are encountering issues the RDBMS community addressed years ago (7 in my example; for the sharding stuff, 20+) - except they think they are discovering them for the first time!


Programming is more like fashion than science in this regard. Every decade or so something truly new happens in the software world. All the rest is mostly sound and fury, signifying nothing. If you're young or new to programming it's easy to mistake the buzz around things like NoSQL for innovation when they are usually re-discoveries of old (and often discarded or obsolete) ideas dressed up in new clothes.

There's also the tendency to favor new shiny things and reject old crufty (but proven) things, to want to be part of what seems like the leading edge, to be that guy in the cube farm who is playing with the cool new stuff.

I have been programming longer than RDBMSs have been available, so I know from experience what it's like to manage large databases in application code, and how hard it can be to maintain consistency or do accurate queries and aggregation with half-baked tools. It's frustrating to see a new generation of programmers go through this, but it's human nature to ignore the past.

My fourteen year old son wears his pants pulled down below his waist, Vans shoes, hoodies, lots of hair. He looks pretty much like I did when I was fourteen back in the 1970s. The underlying technologies are the same: pants, shirt, shoes, sweater, hair. The only differences are superficial. To him that style is edgy and contemporary and something his parents don't get. NoSQL is the gangster fashion of programming right now.


I cannot remember a more aggravating discussion on Hacker News.

A lot of people who are working with / building and using 'NoSQL' databases are the very same people building the RDBMS tools that you are so eager to defend.

You have stated elsewhere that you do not know anything about 'NoSQL' stores, so instead of insulting a huge number of people far more experienced than yourself, why don't you attempt to learn? (or at the very least, avoid disregarding the thoughts of a huge sector of the industry)


I cannot remember a more aggravating discussion on Hacker News.

I suggest taking exchanges of opinion less personally.

You have stated elsewhere that you do not know anything about 'NoSQL' stores

No, I didn't. I have read about them, gone to conferences, and gone through tutorials for a few NoSQL products, but haven't had a use for them in my own work. Since I have 35+ years of programming and database experience I am not viewing these things through the eyes of a newb. And I have lots of experience with database management pre-RDBMS. I did say I haven't found any use for the current batch of NoSQL tools in my own projects, but obviously lots of other people have. And I have clearly indicated that everything I write is my opinion based on my experience, not word from on high. Again, get a grip.

instead of insulting a huge number of people far more experienced than yourself

No insult intended. My opinions, your mileage may vary. Ad hominem attacks are, on the other hand, deliberate insults.

why don't you attempt to learn? (or at the very least, avoid disregarding the thoughts of a huge sector of the industry)

Thanks for the career advice. My advice to you and some of the other commenters frothing at the mouth is to disconnect your ego from your preferred tools.


In fact I was watching a presentation somewhere (I forget where), where folks were using NoSQL databases as a sort of pre-processing layer for the RDBMS. In this regard it's very useful on the high-end.

Where I am less convinced is outside the high-end, and outside the idea of a transition tier between input and RDBMS (or between RDBMS and output even). As a persistence layer, NoSQL is applicable largely to the subset of cases where Object Oriented Databases were. Here the GP is right that the industry is keen on relearning the same lessons every couple of decades.... However, as pre-processing and post-processing, there seems to me to be a much larger set of use cases out there but again only where the RDBMS is no longer really able to handle everything you want it to do.


MongoDB is disproportionately liked by the inexperienced. There are things to like about it, and being able to have sparse secondary b-tree indexes on arbitrary data is, overall, pretty rad. For prototyping, being able to just toss some shit in there works great for simple use cases and CRUD apps (which are the lion's share of new projects in the Internet industry), especially when your data comes from an external service and you're not in control of your incoming data (very common these days). Being able to predict all the incoming data ahead of time isn't always doable, because a lot of services have underdocumented APIs, and every now and then you'll get data that's just ever-so-slightly different from what you expected. With Mongo, you can at least guarantee you're holding onto that data so you can use it in the future. That's way less contrived than it sounds: say a third party changes one of its data formats, adds a field, or changes a datatype. If you have a strict schema, you might fail to write some of that data until you update everything. With Mongo, you can at least capture it and get it working later.
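To make that concrete, here's a minimal pymongo sketch (database, collection, and field names are all hypothetical): nothing about the first document constrains the second, so the surprise data still gets captured.

  from pymongo import MongoClient

  events = MongoClient().example_db.events

  # What the provider sent last week:
  events.insert_one({"user": 1, "action": "login"})

  # What it sends this week - a new field and a changed type - is still
  # captured and can be reprocessed once the application catches up:
  events.insert_one({"user": "u-2", "action": "login", "geo": [52.5, 13.4]})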

I'm using it on one of my projects, and as I said, there are things to like about it, but overall I'd say I'm pretty unhappy with MongoDB and wish I hadn't chosen it; it doesn't work well for my project. A lot of what I'm doing boils down to performing some kind of graph traversal, which is painful to do at runtime in Mongo (the potential solution space is too large to be precomputed). From what I've seen, MongoDB just doesn't work very well with a highly connected data set.


(Not a response, but related as an FYI: you can easily get sparse secondary b-tree indexes over arbitrary data in PostgreSQL using predicated (partial) functional indexes.)
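A rough sketch of what that looks like (table, column, and key names are made up), using hstore as one way to hold the arbitrary data; the WHERE clause is what keeps the index sparse:

  import psycopg2

  conn = psycopg2.connect("dbname=example")
  cur = conn.cursor()
  cur.execute("CREATE EXTENSION IF NOT EXISTS hstore")
  # Index only rows that actually carry the key - a sparse secondary
  # b-tree index over arbitrary key/value data:
  cur.execute("""
      CREATE INDEX users_twitter_idx
          ON users ((attrs -> 'twitter'))
       WHERE attrs ? 'twitter'
  """)
  conn.commit()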


What I like about Mongo is that it's very free-form; I can splat some data in there and not be too concerned about it until later on. I don't really like dealing with file systems for holding data.

Of course Mongo ain't the best at relational data... :o


But this is the fundamental trade-off isn't it?

If you aren't concerned about your data on input then you can't reliably transform it on output, right?

In this regard fixed schemas slow down one side of development (handling input) but speed up another side (handling output, and related tasks like reporting).


This is ABSOLUTELY the tradeoff. I am not using Mongo for long-term data storage. I am using it for hacking together systems which need a place to store data. Possibly this data will be manually queried and examined later on; possibly it lives only a day or two. The entire conversation changes when you want structured data output that is stable and scalable for 10 years.


You deployed a database without doing a spike, and clearly without reading the documentation, because it is well known that MongoDB is by design ill-suited for doing lots of joins. It is much better thought of as a document store.

So who is the inexperienced one here?


This kind of rubbish really needs to stop. Just because you don't agree with or understand their choices does not mean that the majority of "NoSQL guys" are ignorant or uneducated.

Some of the biggest companies, e.g. Twitter, Foursquare, Google, and Amazon, all rely on NoSQL.

The real issue I see is that by dismissing NoSQL as only for fools, RDBMS developers are failing to see why these systems are popular to begin with. Take PostgreSQL, for example, and how difficult it is to shard/replicate compared to CouchDB or MongoDB. This is an area PostgreSQL should see as an opportunity for improvement.

But dismissing huge groups of people as uneducated just makes you seem uneducated.


The majority of NoSQL guys are not Twitter, Foursquare, Google, or Amazon, and Twitter, Foursquare, Google, and Amazon all have plenty of people experienced enough in traditional RDBMS to tell them what things would probably be better in them.

Most NoSQL guys I meet have little to no experience in SQL. Plenty have lots.


The plural of anecdote is not data. You know that, right? In my case, every NoSQL developer I've met had a lot of experience in SQL as well.

But you do have to ask why so-called inexperienced users are choosing NoSQL in the first place. Hint: it's because most RDBMSs are ridiculously complex and inflexible.


The plural of insulting everyone's intelligence who disagrees with you is not authority.

>But you do have to ask why so-called inexperienced users are choosing NoSQL in the first place. Hint: it's because most RDBMSs are ridiculously complex and inflexible.

If we're going to have a cliche fight, this one is called having your cake and eating it too. Either inexperienced users are gravitating to NoSQL or they aren't.

Operating under the assumption that they are, I'd say it's partly because they can interact with these stores without an impedance mismatch, and partly because the documents are flat and therefore easy to visualize. Another reason might be that they don't have to put a lot of thought into their schema, which would involve new concepts that take a little time to learn. The biggest reason in my eyes would be that they don't know how big a performance hit they're taking in a write-heavy environment.

In very read-heavy environments with wide heterogeneous queries that you would end up denormalizing in 18 different ways anyway? They could be doing it because they're smart and have done their research. I love NoSQL.


I never meant to imply that inexperienced users were definitely moving to NoSQL. I would imagine that orders of magnitude more of them are still using MySQL due to its pervasiveness.

Only that if they were, the current complexity of RDBMSs would be a big factor. Accessing a database as a REST service (like, say, CouchDB) or having fluid schemas (like MongoDB) is much easier to handle than ER models.


The last line sounds pretty familiar after replacing "NoSQL" with "PHP" and "RDBMS" with "decent languages".


I completely agree. This whole NoSQL/RDBMS situation is the programming language debate being played out all over again.

And just as not every programming language works for every situation, neither does every database.


I agree. BTW, the most lucid description of when to use relational vs non-relational models is not all that new. You can read it here:

http://db.cs.berkeley.edu/papers/Informix/www.informix.com/i...

In this model, there is virtually no difference, in terms of use case, between the complex-data/no-query DBs of the paper then and NoSQL today. Interesting that a one-time executive of Informix was able to present such a long-lasting discussion of db technologies...


The plural of anecdote is not data.

OK...

every NoSQL developer I've met had a lot of experience in SQL

THAT is what data looks like, FTW.


I think we did "sharding" with relational databases... Back in the 80s. Then we got fast hash joins and partitioning and it turns out that the disadvantages of sharding just aren't worth it. The NoSQL crew will figure this out too around 2030 :-p


Be a little careful with this level of disdain.

Sharding may have been available in the 1980s, but what it led to in some products is quite amazing. Consider Teradata's clustering ability, which is sort of like sharding your database but without the disadvantages typically associated with it. Postgres-XC now offers something similar as a semi-fork of PostgreSQL.

Basically what we are talking about here is a two-tier database layer where storage and coordination are separated, and two phase commit is used between these two. Thus the coordination tier can enforce referential integrity between storage nodes if necessary and thus allow write-extensibility.
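As a hedged sketch of what that looks like from the coordinator's point of view, here is psycopg2's two-phase-commit API driving two storage nodes (the DSNs and the table are hypothetical, and each node needs max_prepared_transactions > 0):

  import psycopg2

  nodes = [psycopg2.connect("dbname=shard1"),
           psycopg2.connect("dbname=shard2")]
  xids = [conn.xid(0, "booking-42", "node-%d" % i)
          for i, conn in enumerate(nodes)]

  try:
      for conn, xid in zip(nodes, xids):
          conn.tpc_begin(xid)
          conn.cursor().execute(
              "INSERT INTO bookings (flight, seat) VALUES (%s, %s)",
              ("AF123", "12A"))
      for conn in nodes:
          conn.tpc_prepare()   # phase one: each node votes by preparing
      for conn in nodes:
          conn.tpc_commit()    # phase two: commit the prepared transactions
  except Exception:
      for conn in nodes:
          try:
              conn.tpc_rollback()
          except psycopg2.Error:
              pass
      raise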

This isn't something without uses. For high-end, high-write-load databases, serving very large amounts of traffic (think airline reservations), this has been a typical approach for quite some time.

The fundamental problem though is that once you give up on local consistency over a given domain, you cannot have any guarantee of global consistency. The current relational approaches (Postgres-XC and Teradata) both enforce ACID compliance. BASE doesn't offer any consistency guarantee and therefore it is only good for throw-away data.


Oh absolutely, but what you're talking about there, people do with CICS today, and that's even older than the 80s. CICS is a technology I have a lot of respect for.

But my point is - when I need to use something like that, I know that's what I'm using. I don't imagine that it's some new invention. Hell, a lot of what the NoSQL guys think they're inventing, IBM did back then too - IMS.


"Those who don't understand Unix are condemned to reinvent it, poorly." – Henry Spencer


I guess so will the likes of Facebook, Google, Twitter, eBay etc.

Must be amazing to know more about databases than all of the world's biggest companies put together.


Why? The guys at FB or Google are just guys like you and me. Don't believe the hype: they are not superheroes or gurus despite their much talked about interview process (the Google interviews I did were a walk in the park compared to GS, btw). Some will know more than me, sure, but some'll know less.


Just guys like you or me except they build systems with billions of page views a month.


At my last job, we averaged 60k pages/sec, on what the NoSQL guys would call a "legacy" database, and there was plenty of headroom. It's not rocket science, just engineering. Companies like Google love to weave a mystique around what they do, it's in their interests after all, to convince their investors that what they do is magic. But I'm from back in the day when we were the same way about millions of pages per month, then per week, then per day...


On good hardware there were PostgreSQL instances running a billion queries a day back in the 9.0 days. Often these are actual accounting apps where reporting matters and so NoSQL would be a very poor fit. For example, the French government uses it to distribute welfare program disbursements. The Wisconsin court system also runs PostgreSQL at loads in the billion-per-day range. I know there are larger instances out there.

Nowadays we are talking about thousands of concurrent users and up to 350k reads/sec on high-end machines.


Yes, there is a lurking iceberg of systems that aren't in the public eye the way Facebook or Google are, and that's where the really intense and interesting stuff is happening. I wonder what the guys at Visa make of all the hype around these websites, when they were doing these volumes all along...

At said 60k pages/sec job, you know who we looked up to for databases? Starbucks. Walmart. McDonalds. 'Cos we'd seen what they do, and anyone who thinks Twitter is impressive, well it'd blow their minds.


Have you looked at Postgres-XC yet? It's Teradata-style clustering for PostgreSQL.....


Those were only a few examples. I can keep listing companies that use sharding.

The idea that it is some 1980's approach to scaling databases is laughable.


You're right, it isn't a 1980s approach. I recall hearing about it in the mid 1970s, and even then it was probably originally devised in the 1960s.


This is a valid point. I dismissed NoSQL for years, and still (largely) do (I prefer to have a schema, documented data structures, dealing with relational data, etc.)... but I finally gave Mongo a go the other day, and am actually quite impressed. As a simple endpoint I can throw arbitrary persistent data at (rather than the filesystem), it's let me get a couple of proof-of-concept projects out the door much quicker.

Note that I'm not intending to comment on its scalability or how appropriate it is for other projects, just that I agree with you: dismissing it out of hand is hubris.


So do you believe there is a use case for a document-oriented database? I feel like your comment writes off a huge swath of useful technology.

I said this below, but I'll say it again. Data is malleable, and writing apps to fit around any datastore seems wrong. I write applications to fulfill their use case. When the needs of the application change, so can its database (or other dependencies).

http://gigaom.com/cloud/mongodb-or-mysql-why-not-both/


I don't know if there's a use case. I have never had one come up in my own projects. I do sometimes need to manage a large number of documents (unstructured data) but I have found the Unix file system sufficient for those, maybe augmented by metadata in the database.

I've worked on big applications with big databases where DB operations and aggregations started to get too slow but so far I or someone smarter on the team has always been able to figure out a solution without throwing the RDBMS out. One big advantage of working with RDBMSs is the roads are paved and well-mapped. By contrast you still need a machete and good luck in the NoSQL world, and even then you can get sick or lost in the jungle.

I haven't personally worked on a web app with the kinds of database scaling issues I've dealt with in logistics applications involving tens of millions of movements every day.

RDBMSs were invented to allow data malleability. Go back to the old textbooks and look at what preceded the relational model. Remember that data is more valuable and permanent than application code. When you have multiple applications accessing the same database, each with their own implementation of consistency rules and their own internal idea of the data schema, come back and explain how that is more fun than using an RDBMS and being able to sleep nights.


Respectfully, I believe your comment is not very open-minded. I believe the obvious answer is "Of course there's a use case!" Just as there's obviously a use case for RDBMSs, NoSQL (Mongo, Cassandra, BigTable, Redis, etc.) serves a purpose.

I'll give you a couple of examples. Google stores a huge amount of data in BigTable [1], which is fundamental for the operation of Gmail at scale. Another: Facebook initially used Cassandra [2] for its Inbox search, then migrated to HBase when their needs changed for their new messaging feature.

Do we know if the underlying data is stored in an RDBMS? I'm guessing not at Google, but yes for Facebook/Inbox search. But my point is I'm not ruling it out; I'm open to different options.

[1] http://en.wikipedia.org/wiki/BigTable [2] http://en.wikipedia.org/wiki/Apache_Cassandra#History


I wrote "I don't know if there's a use case. I have never had one come up in my own projects." I don't work for Facebook or Google, perhaps they have a use case. I guess I could assume there's a use case because the tools exist and people use them, but in my experience I have not had a need for Mongo or any NoSQL database.

I'm not dismissing the tools out of hand -- I haven't used them except to play around with tutorials. I am pretty sure a lot of people start out with Mongo or whatever without having a use case defined and without having good reasons to eliminate the tried-and-true RDBMS solutions. "Use something cool and new" is not a use case. "I've read on slashdot that SQL databases don't scale" is not a use case.


> I haven't used them except to play around with tutorials.

And yet you feel the need to lecture others on their choices?

Wow you are something.


I'm not sure if Cassandra and Bigtable are "document" databases in the same vein as MongoDB. These guys are more like turbo-charged key-value stores, which is quite different. They trade away the ability to manipulate data for the ability to scale out linearly with no single point of failure (and of course various other trade-offs). Cassandra nicely gives you composite columns and column-key sorting, but that's about all you're going to get.


"NoSQL is simply a buzzword. It basically means 'anything but the old and trusty MySQL server in the attic', which includes a lot of alternative storage mechanisms. Even a simple text file can be considered a NoSQL solution"

http://stackoverflow.com/questions/4255895/why-googles-bigta...


I'll add that email and Facebook postings and Instagram photos don't require the same kind of data consistency and durability that medical records, bank transactions, or rail car movements require. If Google loses some old emails or Facebook loses the last thousand "likes" on a Jersey Shore fan page that's not much of a catastrophe, and they don't guarantee their data anyway. For those use cases Mongo or Cassandra et al. may be the right fit.

I work on an app that stores documents (PDFs) in Amazon S3 containers, and sometimes those documents get lost or become inaccessible for some reason. We have backups for those. The only consistency requirement there is that the PDF opens and the name matches the name in the RDBMS. That's a much lower bar for consistency and durability than our credit card transaction data, for example.


This is complete and utter nonsense.

Facebook and Instagram themselves, I imagine, would care VERY much if data went missing, as it would mean not just (a) user data going missing but (b) potentially huge inconsistencies in the social graph.

Unless you have some evidence that those companies do not care about data durability?


What? It's the absolute truth. If someone's wall post or tweet goes missing, it's totally inconsequential.

"Social graph" - LOL.


You think FB doesn't consider durability and consistency essential? With almost one billion users?!

Wow.

Whether it is consequential or not is completely beside the point. As is whether you understand the "social graph".


Eventual consistency, yes. What I'm saying is if a few pics disappear for a half hour while things resolve themselves, nobody is going to care... or likely notice.

It's not a bank.

By the way, I believe social networking is mostly bullshit, which is my "LOL" at the "social graph".


Eventual consistency however provides no guarantee of consistency unless it is backed by absolute local consistency on every node. Don't you run into CAP theorem problems otherwise?


Sure. But what if a person or a photo data object goes missing?

Facebook has different data stores for each of their features so I imagine the loss of a person would be pretty nasty as would a photo data object that multiple people have tagged or commented on.

My point is that when you are the size of Facebook, Twitter, etc., a loss of data in 0.001% of cases equates to a LOT of data and relationships.


If it is that important, they'd be using an ACID-compliant solution, right?

You don't use Best-guess Attempt at Sorta-consistent Eventualities for important data ;-)


I suspect that a lot of NoSQL folks come to NoSQL from MySQL.

As for this part:

>"When you have multiple applications accessing the same database, each with their own implementation of consistency rules and their own internal idea of the data schema, come back and explain how that is more fun than using an RDBMS and being able to sleep nights."

The obvious answer from the MySQL/NoSQL folks is "What? That's crazy! Multiple apps hitting the same db? Are you nuts???"

See my blog post at http://ledgersmbdev.blogspot.com/2012/09/or-modelling-interl... for more on this mentality.


I wasn't able to figure out how to post a comment/reply on Blogger, sorry.

Thanks for pointing me to your article from the HN flame war about MongoDB. If only there were more useful articles like this and less flaming.

You wrote "...in MySQL, the application is king and the db a humble servant." That is exactly my experience, and it leads not only to terrible databases but to piles of awful code written around terrible databases. In 1974 Kernighan and Plauger wrote "Choose a data representation that makes your program simple" (Elements of Programming Style). Other esteemed programmers have said the same thing. A well-designed and normalized database schema with validations and access rules implemented in the database will reduce the amount of application code needed.

I've written about keeping data management code in the database where it belongs (http://typicalprogrammer.com/?p=101), but you make a much clearer argument with real examples. A few years ago the Ruby on Rails people championed the idea of "application databases" and writing HTTP+XML APIs, which probably seemed like a great idea given that few, if any, Rails applications ever had databases that other applications needed to access. Now the NoSQL folks have taken this even further, eliminating structure from the data and all ACID compliance, separating the data from integrity enforcement, all in the name of scalability problems their app might have if they get as big as Google.

I am stuck with MySQL most of the time because I work with broken web applications, and so far all but a couple have used MySQL. As a long-time Oracle and SQL Server programmer I almost cry trying to figure out some of the MySQL databases I come across. I am working on one now that uses GUIDs as primary keys in every table, generated in PHP code, for reasons I can only attribute to ignorance.


It's too late to post on your blog, unfortunately (great article, btw), but I think it hits the nail on the head. A couple of points I would make, though: I don't really see anything wrong with using an ORM plus views to encapsulate data. INSERT INTO invoice isn't all that bad if invoice is a view. Data operations are still encapsulated.
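A minimal sketch of that idea, assuming PostgreSQL 9.1+ and invented names: an INSTEAD OF trigger routes writes on the view into the underlying table, so callers never touch the raw schema.

  import psycopg2

  conn = psycopg2.connect("dbname=example")
  conn.cursor().execute("""
      CREATE VIEW invoice AS
          SELECT id, customer_id, amount FROM invoice_raw;

      CREATE FUNCTION invoice_insert() RETURNS trigger AS $$
      BEGIN
          -- encapsulation point: rewrite/validate before touching storage
          INSERT INTO invoice_raw (customer_id, amount)
               VALUES (NEW.customer_id, NEW.amount);
          RETURN NEW;
      END
      $$ LANGUAGE plpgsql;

      CREATE TRIGGER invoice_insert_tg
          INSTEAD OF INSERT ON invoice
          FOR EACH ROW EXECUTE PROCEDURE invoice_insert();
  """)
  conn.commit()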

The second point is that I am a relative fan of loosely coupled systems. Loose coupling gets away from CAP theorem problems because each system is autonomous and can always defer interop with other applications until later. On the other hand, if your centralized RDBMS goes down, every app that needs it dies with it. How many apps we want entangled in that way is a business decision, but there are tradeoffs on both sides.

That doesn't mean endorsing the NoSQL approach though. I maintain that you can't have "eventual consistency" in any meaningful, guaranteed way unless you always have "absolute local consistency" within some scope. Postgres-XC defines that scope as cluster-wide. In a loosely coupled system it could be each individual node. For this reason BASE doesn't work for anything that is important. It reduces to:

Best-guess Attempt at Semi-consistent Eventualities.

Lesson is: not only do you want to encapsulate your data, but anywhere that you want eventual consistency, you need to have local ACID compliance for some defined scope of local. This is a big issue that NoSQL db's have and that they cannot solve (due to CAP theorem issues). If only it were absent from the RDBMS world. I groan every time someone asks for "multi-master asynchronous replication."


>"That is exactly my experience, and it leads not only to terrible databases but to piles of awful code written around terrible databases. "

Not planning for the future tends to do that.

>"I am stuck with MySQL most of the time because I work with broken web applications, and so far all but a couple have used MySQL. As a long-time Oracle and SQL Server programmer I almost cry trying to figure out some of the MySQL databases I come actoss. I am working on one now that uses GUUIDs as primary keys in every table, generated in PHP code, for reasons I can only attribute to ignorance."

Ouch, that sounds painful. I have horror stories I could tell too. Ambiguous foreign keys? ouch.....


Yes, as a pre-processing or post-processing layer for an RDBMS ;-). It might also be a useful backplane for some kinds of applications, but most of those are going to have some sort of fixed schema, so you run into a bunch of issues with this.

Another area is high-write throw-away data. If you are Facebook, do you really care if every message gets through or every feed element gets propagated?

Also perhaps content management (which is a write seldom, read often field) might be good, particularly where data can be reconstructed from other sources. Maybe doing the writes first in an RDBMS and pushing out to MongoDB might be a good thing there? You could then do things like referrer tracking and log analysis back in your RDBMS.

Hybrid databases are way cooler, though. One idea we are throwing around in LedgerSMB development right now is using the JSON type in PostgreSQL 9.2 to store extended attributes: you could have a customer account, decide you want to store some extra data about it that we don't support right now, and have that stored as JSON.
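Something like this, as a rough sketch (the table, columns, and dbname are invented here, not the actual LedgerSMB schema): fixed, validated columns for the core record, plus a json column for the extras we don't model yet.

  import json
  import psycopg2

  conn = psycopg2.connect("dbname=example")
  cur = conn.cursor()
  cur.execute("""
      CREATE TABLE customer (
          id     serial PRIMARY KEY,
          name   text NOT NULL,
          extras json              -- free-form attributes, app-defined
      )
  """)
  cur.execute("INSERT INTO customer (name, extras) VALUES (%s, %s)",
              ("Acme Ltd", json.dumps({"preferred_carrier": "DHL"})))
  conn.commit()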


> Watching the NoSQL "movement" encounter the problems RDBMSs fixed 20 years ago and then hand-wave and kludge them away is frustrating.

But but but.... NoSQL is cool ;)


I'm no fan of MongoDB, but this same advice goes for any NoSQL data store. I am an Apache Cassandra contributor and community MVP, but my advice stays the same: it's best just to start with a SQL database and go from there. Read some books and learn it well: the "SQL Cookbook" from O'Reilly is great, and so is "The Art of SQL." Premature optimization continues to be the root of all evil.


it's best just to start with a SQL database and go from there.

This is bad advice. It's best to understand your problem domain and use the tools that are most appropriate. Two types of posts show up a lot on HN:

* "I picked a NoSQL database for a problem domain with a better relational fit." Those posts look like this one.

* "I picked an RDBMS for a problem domain with a better NoSQL fit." Those posts are usually titled something like "How I scaled Postgres to XYZ qps" and describe an insane amount of re-engineering and operational hell. Oddly, these posts are usually proud of the accomplishment rather than embarrassed that they picked the wrong tools in the first place.

There are upsides and downsides to RDBMSes. From my experience, you should be leaning towards NoSQL systems (of which there are many, each suited to different use cases) when you have very large scaling needs (in terms of dataset and qps), heavily polymorphic data, or data that has ambiguous structure.

It's been stated many times already: Use the right tool for the right job.


"It's been stated many times already: Use the right tool for the right job."

The problem with that argument is that things change, and it's not always obvious what problems you will need to solve when you start a project.

RDBMSs are backed by extensive theory as well as decades of product hardening, development of expectations and best practices, and battle-tested in production environments. They are explicitly designed as general-purpose systems. That means that when your needs change and you need to adapt, an RDBMS likely provides features that make that transition work smoothly and avoid breaking applications.

At this time, those things are not true of MongoDB, so it is somewhat more specialized. That's OK, but it means that you need to be more careful about using it than, say, postgres. Trying to put MongoDB on equal footing with postgres and say "choose the right tool" is an oversimplification.

I think it's perfectly reasonable advice to start with SQL unless you have specific reasons that you think something else will be a better fit.


This advice -- the most frequent reply in this thread -- is the same as saying you should be able to accurately predict the future. Database management is a big and complicated topic. Especially with emerging tools like Mongo there is no way reading a book is going to give you the expertise and experience you need to choose the right tool. You do your best and maybe make the wrong decision. It's not stupid to go down the wrong path. What's stupid is continuing down the path after you've encountered one problem after another.


It doesn't require the ability to predict the future, just some familiarity with the tools. The OP's use case (small amounts of data, low traffic, lots of aggregation) is perfect for an RDBMS, and anyone with even casual familiarity with both SQL and NoSQL systems could easily prognosticate that.


  > This advice [Use the right tool for the right job]...
  > is the same as saying you should be able to accurately 
  > predict the future.
The future that's easy to predict here is the one where a person who doesn't know anything about the tools will need extra time to learn about them, and possibly extra time to switch tools once they've become more knowledgeable.


The problem with your argument is that every problem will look like it has polymorphic and ambiguous data until you have mapped it out properly. The post you replied to was basically advice for new projects and developers to fully map out the problem and see where things are going.

It is a lot easier to start strict and organized and relax to looser rules where you need them than to do the opposite.


Perhaps for "new developers". If the question is "as a neophyte developer, which should I learn first, Postgres or MongoDB?" then I would tend to lean towards SQL as well. But that isn't the proposal.

The problem is the old cliche that if all you have is a hammer, everything looks like a nail. "Just start with a hammer" is bad advice because with enough effort and willingness to cause collateral damage, you can almost always get a screw in with a hammer. If you've never seen a screwdriver, you're doomed to posting HN stories about "how I spent the last three weeks rebuilding my application architecture to deal with the fact that hammered-in screws tend to fall out a lot."


You failed to address the parent's point entirely: it's a lot easier to start strict and loosen restrictions later than it is to start loose and add restrictions.


I'm not sure what this has to do with SQL vs NoSQL. There are myriad differences between storage systems which have little to do with referential integrity constraints or data validation.

Polymorphic structures are a PITA to model in relational tables, but they fit very well in schemaless, document-oriented NoSQL systems. This has nothing to do with data validation.

Rapidly changing schemas are a PITA in RDBMSes, especially with a large cluster of servers. It's usually very easy to alter schemas on-the-fly with NoSQL systems. This has nothing to do with data validation.

On a related point, the database is a crappy place to put data validation anyways. It can only do the most primitive forms of validation (is this an integer? is the FK reference valid?). Application-level validation gives you the full power of a modern programming language and all the data you could possibly want to fetch. Yeah, "senior discount" is a float... but it should only be given to people over age 65.
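A toy version of that rule (the function name and the exact cutoffs are invented for illustration):

  def validate_senior_discount(age, discount):
      # The column type can only say "age is an integer"; the application
      # can encode the actual business rule.
      if not isinstance(age, int) or not 0 <= age <= 130:
          raise ValueError("age must be a plausible human age")
      if discount > 0 and age < 65:
          raise ValueError("senior discount only applies from age 65")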

The fact that the OP managed to insert "111" into a column instead of 111 means he is using crappy tools. My datastore api guarantees that integer fields in my (statically typed) classes are written as integers. I never have this problem, nor have I ever taken down my GAE (NoSQL) applications for maintenance.


The basic tradeoff is between flexible input and flexible output. I do think it is sound advice to push the relational database model (this overlaps but is not identical to the relational algebra model) to the breaking point before going elsewhere. If you don't do data validation you can't do ad hoc data transformation later reliably. Strict schemas are always an investment in the future for that reason.

> Polymorphic structures are a PITA to model in relational tables, but they fit very well in schemaless, document-oriented NoSQL systems. This has nothing to do with data validation.

JSON and XML types in PostgreSQL rock. If you can limit the areas where the data is polymorphic, you can push the rest into those types. Many other RDBMS's support XML types as well.

> Rapidly changing schemas are a PITA in RDBMSes, especially with a large cluster of servers. It's usually very easy to alter schemas on-the-fly with NoSQL systems. This has nothing to do with data validation.

This again trades one set of problems for another. You are allowing old schema and new schema to co-exist. How many old schemas does your application have to support? It seems to me this would likely ossify things down the road a bit, because if your schema is that rapidly changing, you may never get caught up, and so backward compatibility is going to be a bitch. In other words, the fewer guarantees you can make regarding information stored, the fewer guarantees you can make about data output. That's a big deal as your product matures.

And don't get me started on BASE.....


JSON and XML types give you all the pain of a NoSQL system with none of the benefits of an RDBMS. No thanks. My tools elegantly map database fields to my objects; doing this with serialization formats is incredibly crude and brittle. I do it for some edge cases already; I am looking forward to removing that code.

A schemaless datastore does not mean a schemaless app. Furthermore, the application layer is a far better place to do validation simply because it has more knowledge of the real world. An RDBMS can constrain an age field to be an integer; the application can constrain it to be positive and within the probable lifespan of a human.

How many old schemas does your application have to support? Only as many as I want to; reprocessing removes the old, which I tend to do fairly quickly. I have done a lot of schema migration over the last three years on GAE and the process is vastly easier than it is in RDBMS-land. There's no need to stop the world.

Hey, I love Postgres. I spent most of my (almost two-decade-long) career building apps on many different RDBMSes. But there are many applications that are better suited to alternatives, especially with large-scale consumer-facing web apps. These threads annoy me; we should all be looking at new tools saying "wow, I wonder what this is good for" rather than smugly burning all the heretics that turn away from the One True Idol.


Typically if you have some polymorphic data attached to some other data, a hybrid approach works. After all, you can't report on what you can't easily quantify.

My tools elegantly map database fields to my objects; doing this with serialization formats is incredibly crude and brittle.

Build tools to do it.

A schemaless datastore does not mean a schemaless app.

That's actually a problem in most cases.

But there are many applications that are better suited to alternatives, especially with large-scale consumer-facing web apps. These threads annoy me; we should all be looking at new tools saying "wow, I wonder what this is good for" rather than smugly burning all the heretics that turn away from the One True Idol.

Certainly that last part is true, but the best uses I have seen, including uses put forth by NoSQL proponents, have been as adjuncts to a relational approach, for example handling data transformation on the way in or out. Otherwise you end up with two fundamental problems:

1) How many other apps can your application support? (see https://news.ycombinator.com/item?id=4495377) Where is your public API? Basically you can do this in your db layer and/or your app layer, but you get a lot more flexibility by doing it in the db. With a schemaless app, you can't do it there without intimate knowledge of innards...

2) Strict schemas with validated input and a declarative query language buys you flexible output. Rigid schemas are thus an investment in the future. If your data is sufficiently polymorphic that output can't be flexible no matter how you do it, then I suppose that's not a huge issue. However for many cases where NoSQL is chosen wrongly it is a huge issue.


I'm guessing you're a big fan of stored procedures too.

Good luck.


I'm guessing you're a big fan of stored procedures too.

In certain times and places. However, I am not a fan of the idea that all db access goes through stored procedures. LedgerSMB will eventually do everything through stored procs but the reason is to make sure that all the logic is available to other apps, not to require that everyone else do all access through the stored procs.

Also our stored procs are designed to be discoverable by the application and therefore allow for looser coupling between the app and sproc than is typical.


Too late to edit, but it is worth noting that stored procedures are also very much all-or-nothing in terms of quality. Well-written SPs are a delight to work with. Horribly written ones are infuriating. It's not like other environments where the quality of the code can degrade gracefully.


Given the context of the OP, a brand new project or startup, my advice still stands true.


There's maths behind RDBMSs - the relational algebra and calculus. These give you enormous power when you don't know how you want to query your data in the future (something you'd think the "agile" crowd would be all over). But instead, the prevalent thinking is "just stuff it all in and worry about it later". Experienced DB hands call this "painting yourself into a corner".


Part of the reason, though, is that it is easier to get an idea of what you need from a NoSQL database, and what you'd be giving up, if you are already on an RDBMS, right?

I think the advice is right on and I see NoSQL databases as very much useful adjuncts to the relational systems, not as replacements.


It's best to understand your problem domain and use the tools that are most appropriate.

Sure. The big problem though is that the problem domain of RDBMSs is ad hoc reporting, which is pretty universal.

From my experience, you should be leaning towards NoSQL systems (of which there are many, each suited to different use cases) when you have very large scaling needs (in terms of dataset and qps), heavily polymorphic data, or data that has ambiguous structure.

On the other hand, you should choose an RDBMS if you are pretty sure reporting needs will change or you will need to accommodate new metrics on a frequent basis.

The problem is there is a huge overlap between those two. NoSQL means telling your customers "no, we can't do that new report you'd like right now." That's a huge tradeoff for any customer-facing app.

After all the basic tradeoff is between flexible input and flexible output.


Even for ad-hoc reporting, the relational model has limits. For some requirements specialized structures are a better choice; thus we have MOLAP.

RDBMSes were designed for an era of spinning platters, data volumes in the megabytes-to-gigabytes, and small volumes of long-running ad-hoc queries. It's remarkable how far we've been able to push a general-purpose design but it's starting to burst the seams.

Do you want to design for easy reporting or easy scaling? There is no universal right answer because it's different for every app. There are times when I really wish I could run a SQL query. On the other hand, I spend zero time maintaining my (appengine) datastore and it automatically scales to any data or traffic volume. And I have a heavily polymorphic system which would be a nightmare in SQL.

I don't think that overlap is as big as you do.


Even for ad-hoc reporting, the relational model has limits. For some requirements specialized structures are a better choice; thus we have MOLAP.

But MOLAP is far easier to work with if your feed is from a relational model. Otherwise you are left with a lot of effort getting the data transformed into an appropriate model.

Relational math + fixed schemas means flexible output.

I don't think that overlap is as big as you do.

Any line of business app had better start with a relational model then, as would any case where you are selling business intelligence. The other data stores work best as adjuncts to, rather than replacements for, a relational store.

Edit: Also it occurs to me that MOLAP is only partly ad hoc. You have to set up your cubes ahead of time, and that means deciding on what you are going to report. I imagine that adding a new reporting dimension to a large data set would be painful.


Don't forget to read a book that's specific to your particular RDBMS. Because SQL databases are only trivially interchangeable for trivial cases.

There have been more than a couple times when I was ready to blame the relational model, but further investigation revealed that the real root of the problem was that the existing schema or query used an approach that was optimized for one DBMS but performed terribly on the one we were actually using.


Wrong. SQL databases (except for SQLite) are almost completely interchangeable because they are all based on the same relational model and they all implement the ANSI SQL standard with only minor deviations. If you have a lot of stored procedures and triggers -- executable code embedded in the database -- you will have to rewrite that.

Oracle is in a world of its own in a lot of ways, but if you are moving to or from Oracle you should have the required resources and expertise in your budget.

If you are referring to the well-known cost of COUNT(*) in PostgreSQL (or MySQL with InnoDB, for that matter), or different ways to handle full-text searching, I agree that there are differences you have to deal with, but they are not usually a big deal.

Contrast the fairly easy and routine process of migrating between MySQL, PostgreSQL, or SQL Server with the huge amount of work involved changing your application code from MongoDB to anything else.


SQL databases (except for SQLite) are almost completely interchangeable because they are all based on the same relational model and they all implement the ANSI SQL standard with only minor deviations.

They only all implement the ANSI SQL standard for certain values of "ANSI SQL standard." Those values being SQL-89 or perhaps SQL-99.

Beyond that it's a mess. SQL Server didn't get around to doing a really good job with window functions until this year. Last I checked, MySQL still doesn't do a number of useful things, such as WITH or FULL JOIN. Upserts and bulk inserting are different for every DBMS on which I've learned how to do them. And so on and so on. There are any number of ways you can get yourself coupled to a particular DBMS if you go beyond the core set of basic SQL features.

In other words, SQL databases are only trivially interchangeable for trivial cases.
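To make the upsert point concrete, here are sketches of the same logical "insert or bump a counter" (table and columns invented); each DBMS wants a different shape:

  # MySQL:
  MYSQL_UPSERT = """
      INSERT INTO counters (name, hits) VALUES ('home', 1)
      ON DUPLICATE KEY UPDATE hits = hits + 1
  """

  # SQL Server:
  MSSQL_UPSERT = """
      MERGE counters AS t
      USING (SELECT 'home' AS name) AS s ON t.name = s.name
      WHEN MATCHED THEN UPDATE SET hits = t.hits + 1
      WHEN NOT MATCHED THEN INSERT (name, hits) VALUES (s.name, 1);
  """

  # PostgreSQL (as of this writing) has neither; the usual answer is a
  # plpgsql function looping over INSERT/UPDATE and catching
  # unique_violation - a third shape entirely.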


I agree that there are numerous inconsistencies. I just have a different definition of "trivial cases" than you do. The kinds of things you mention -- real issues that come up in real applications -- fall outside of what most databases are doing in the kinds of web apps that I work on. I haven't had to deal with Oracle vs. SQL Server bulk load problems or Oracle's unique implementation of NULLs in years. Not waving them away, just saying that these kinds of issues are not all that common because most real-world databases are trivial cases.


>"Not waving them away, just saying that these kinds of issues are not all that common because most real-world databases are trivial cases."

If this is the case, then, it's only because database application developers go out of their way to write compatible code. Things like upserts are not trivially compatible between vendors, and so are avoided by people trying to write portable applications.

Similarly, let's look at SQL:2003 structured data type inheritance and how it is implemented in Informix, Oracle, DB2, and PostgreSQL. OK, so it isn't in PostgreSQL, and Informix allows tables to be inherited as well as types, and will even happily return result sets with different numbers of columns per row (google "Informix jagged rows")... And Oracle and DB2 look pretty close in this way until you get under the hood.


Not wrong. As a DBA with 15+ years of experience, I can tell at a glance what an application was first developed against (Oracle, SQL Server, etc.) or even where a dev started their career. You might as well say any OS is interchangeable as they're all just a kernel running processes.


I didn't claim RDBMSs are interchangeable or that there are no differences. I disputed the statement that "databases are only trivially interchangeable for trivial cases." That has not been my experience: changing out one RDBMS for another is not trivial, but it's easier and more controllable and predictable than changing out application code.

I also disputed the notion that reading a book first would enable someone to choose the right tool the first time. There's no substitute for experience and real expertise, and you only get that by trying things and making mistakes, or at least learning from other people's mistakes.


. . ."databases are only trivially interchangeable for trivial cases." That has not been my experience: changing out one RDBMS for another is not trivial, but it's easier and more controllable and predictable than changing out application code.

Keeping the punctuation but replacing the words with symbols:

   "X." Not X: X, but tangential comparison. 
Sounds like you actually agree with me, but don't see that as a reason not to make much hay about disagreeing with me. :)


> Wrong. SQL databases (except for SQLite) are almost completely interchangeable because they are all based on the same relational model and they all implement the ANSI SQL standard with only minor deviations.

Wrong again. Here's a very trivial example. To index or not to index? On MySQL InnoDB you usually get a major performance benefit out of indexing just about everything you want to search on later. On PostgreSQL you usually get a major performance benefit out of indexing columns only as needed later. This is because of internal design differences: sequential scans through a table are much cheaper on PostgreSQL (InnoDB has to scan a table in key order, not physical order, which means no OS prefetch). Also, non-PK index scans are faster on PostgreSQL, though PK index scans are slightly faster on InnoDB (the table is essentially contained in the primary key index there).

Contrast the fairly easy and routine process of migrating between MySQL, PostgreSQL, or SQL Server with the huge amount of work involved changing your application code from MongoDB to anything else.

So suppose I want to move my database which uses PostGIS and pgcrypto to MySQL or SQL Server. How would I do that?


Just in case you weren't already aware, the count(*) issue should be significantly improved for some queries in postgres 9.2, thanks to the addition of covering indexes.


There was discussion on pgsql-general a bit before the release. It appears that physically sequential scan of a table is about as fast as key-order sequential scan of a covering index.

To speed up a lot, I think you'd need to be able to scan an index in physical order which is not currently supported.

Now, if you are using SSDs, you could perhaps tweak the planner settings enough to cause it to choose an index scan, and maybe you'd see a modest performance increase, but even there it appears the operating system's prefetch logic really does come to the rescue of a sequential scan of a table in physical order.


ah, that's unfortunate :-(


This is why the "SQL Cookbook" is good, it has dozens and dozens of examples with permutations in Oracle, MS SQL, MySQL, and PostgreSQL.


Thanks rbranson, this comment was very helpful. I also think that the logical fallacy of some of these arguments is that they're absolutist - it's better to follow your advice because it's a very logical progression to start with a SQL database and then move into different things as you grow. Life is a progression and I think your quote about premature optimization is spot on - it IS the root of all evil because it's absolutist and too rigid to fit the curve of everything that life (and tech, and work) throws at you!


I agree that relational databases are a safer bet after you study the business domains, consider the pros and cons of relational databases vs NoSQL, and there are no clear winners.

Disagree with the books recommended. SQL is only a query language, not the database itself. It definitely should be part of the consideration, but understanding how database engines work under the hood is more important for performance in high-concurrency, high-load scenarios.


This is true. Robert Haas wrote a couple of posts a while ago about differences between InnoDB and PostgreSQL. He basically pointed out that in InnoDB, the table is contained in the primary key index, while in PostgreSQL, it is an indexed heap of pages. Moreover, InnoDB does not support traversing the table in physical order, so sequential scans of a large table are ghastly performance-wise.

What this means is that optimizations appropriate for InnoDB (index everything you search on) are very poor for PostgreSQL and vice versa. In PostgreSQL you index only a few columns, typically join conditions and a few select search columns, and the combination of the operating system and database will do the rest. In short, InnoDB only does index scans (and the table is a special index). This makes some things perform better (primary key lookups), but many other things perform worse. Moreover optimizing for one does not mean optimizing for the other....


Funny.

The same problem exists with OLAP databases.

People try to do MOLAP without building ROLAP. They think that MS SSAS, Essbase, Cognos, or QlikView will do some "magic" that eliminates the need to think carefully about data.


Fourthly, and this one completely blew my mind - somewhere along the stack of mongodb, mongoid and mongoid-map-reduce, somewhere there, type information was being lost. I thought we were scaling hard when one of our customers suddenly had 1111 documents overnight. Imagine my disappointment when I realised it was actually four 1s, added together. They’d become strings along the way.

I've been having a similar problem with an SQLite data store, only the other way around. Strings were getting converted to numbers, and significant leading zeros that needed to be preserved were lost along the way.

It sucked all the fun out of dynamic typing for me. At least in combination with automatic type conversions. Having to think about type and when to make transitions across type boundaries when you need to is just a little light busywork. Having to worry about type and transitions across type boundaries being made contrary to your intentions is a downright PITA and, it turns out, a serious quality control issue.


Mongo accepts the data you give it. If you have a type-conversion error, it's in your application layer. I use Mongo daily and have never seen this problem, because I'm using a statically typed language. This seems like more of a complaint about Ruby than Mongo.

I use Mongo daily on a Go project, and I actually think it's pretty annoying; I'm not trying to be a Mongo apologist, but ... this type conversion argument doesn't seem to be very fair to Mongo.


Bad code is bad code, in any language.

I'm a little disappointed that any post about moving from X to Y (especially if X is Mongo) makes the top of the front page on HN. This is not really a very good or insightful post. It's one person's experience and anecdotes about the pain points of learning a new technology. Mildly interesting, but not really expository in any way.


Yeah - this sounds more like a problem where someone was using a loosely-typed language that uses "+" for both addition and concatenation.
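
For what it's worth, a minimal JavaScript sketch of that failure mode and the usual guard (the variable names are made up):

    // Four 1s that became strings somewhere along the stack
    var ones = ["1", "1", "1", "1"];

    // "+" concatenates when either operand is a string
    var bad = ones.reduce(function (a, b) { return a + b; });             // "1111"

    // Coerce explicitly before adding
    var good = ones.reduce(function (a, b) { return a + Number(b); }, 0); // 4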


"Any column in an SQLite version 3 database, except an INTEGER PRIMARY KEY column, may be used to store a value of any storage class." http://www.sqlite.org/datatype3.html

SQLite, for better or worse, is designed to do what you had an issue with. Pick a different DB if you need strict data types. Check out section 2.0 Type Affinity.
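
A quick sketch of those affinity rules in action (the table names are hypothetical):

    -- NUMERIC affinity coerces numeric-looking text, so leading zeros vanish
    CREATE TABLE codes_numeric (code NUMERIC);
    INSERT INTO codes_numeric VALUES ('00042');
    SELECT code FROM codes_numeric;   -- 42

    -- TEXT affinity stores the string exactly as given
    CREATE TABLE codes_text (code TEXT);
    INSERT INTO codes_text VALUES ('00042');
    SELECT code FROM codes_text;      -- '00042'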


"To be honest, the decision to use MongoDb was an ill-thought out one. Lesson learned - thoroughly research any new technology you introduce into your stack, know well the strengths and weaknesses thereof and evaluate honestly whether it fits your needs or not - no matter how much hype there is surrounding said technology."

I think you are not alone in learning this lesson with this particular technology. Fortunately it's one I learned by proxy from working adjacent to a team that decided to introduce Mongo into their stack... but I still wake up at night hearing the screams of "You have to put the whole dataset in RAM?"... you weren't there, man... we lost a lot of good guys...

You have to draw a clean line between "stuff it is really fun and enlightening to play with" and "stuff you introduce into your stack".


One of the issues is that if you actually do this - evaluate, look at all sides, and decide the shiny new tech IS NOT right for your situation/project - you get branded as something negative: "not a team player", "stick in the mud", "not able to keep up with the times", etc.

I'm not suggesting everyone should stick with 1966 COBOL - times change, and new tech comes along that makes sense to adopt. But not jumping on the shiny-new-tech bandwagon can have social consequences you need to be aware of.


>> "You have to put the whole dataset on RAM?"

I'm pretty new to the whole database thing, but how is MongoDB different from Postgres or MySQL in this respect? In a traditional database, the data is pulled directly from the hard drive. Why does Mongo suffer a performance hit and MySQL doesn't?


Because MongoDB mmap(2)s its backing stores into its process memory space. It's a naive approach to persistence - it's very fast and simple, but if you overcommit (i.e. you store more in the database than you have memory available), page-thrashing results.

MySQL's InnoDB table engine, on the other hand, uses direct I/O (in the recommended configuration) and manages the buffer pool independently of the kernel. Its buffer pool manager is specifically designed for the typical workloads MySQL is used for (http://dev.mysql.com/doc/refman/5.5/en/innodb-buffer-pool.ht...) - as opposed to the naive LRU that most OSes employ for their filesystem buffers.


It's hardly naive; it is optimized for an application that needs to keep its entire working set in RAM - that is why sharding is so fundamental to the design. Not all apps need that, which is, once again...


For what it's worth, VoltDB uses a main-memory format as its "naive" approach to persistence too.

I would consider VoltDB naive in its approach to persistence because it gives you dramatic gains (2 orders of magnitude) in read/write performance for similar workloads only by redefining concurrency and durability out of the equation. Durability is no longer a property of each machine but is instead a property of the network. Concurrency is handled by executing queries fast and doing them in series. So it's like ACID without the C or D.


Ah, ok. That makes a lot more sense. Thank you.


It's not naive. It's a well understood and communicated design choice.

Remember that completely in-memory databases are going to be how we all store our data in the decades to come, and they are already the standard for those who care about speed.


You have to draw a clean line between "stuff it is really fun and enlightening to play with" and "stuff you introduce into your stack".

Even if you have a clean line, when do you promote something across it? It is easy for little bits of weirdness to escape detection in a fun-to-play-with environment, especially if they only come out when you are running a cluster (mostly thinking about Cassandra write failures here).


"You have to put the whole dataset on RAM?"

This is hardly a hidden feature...


I am both a database guy and a software engineer. As a software engineer, I kind of understand the hype behind NoSQL. As a database guy who has spent years studying how database engines work under the hood, many NoSQL implementations make me wonder how powerful marketing can be.

In general, I love the ideas behind NoSQL. I can still feel the excitement of reading the BigTable and MapReduce papers. HBase, Hadoop, Redis, etc. are awesome products, and I use some of them in my work. But some other NoSQL products? As engineers, we must understand the implementation and be fully aware of its limitations, instead of believing the marketing materials. Sure, if all you want is to test a toy product or build a prototype, or your product has low concurrency, a small data size and no real operational concerns, they certainly make your development look easier. But in those scenarios, any good relational database won't add significant burden either.


> As engineers, we must understand the implementation and be fully aware of its limitations, instead of believing the marketing materials.

And as engineers we must understand that most other engineers do take their role seriously and evaluate products on their merits.

Implying that they are falling for "marketing" just because you don't agree with their choice and then lecturing them for their choice doesn't make you come across well.


I do think that the popularity of MySQL, however, owes a lot to it being used by non-engineers for simple web apps, though ;-)


Mmm, not sure about some of these complaints...

- You can make case-insensitive searches in the DB using regexes (http://www.mongodb.org/display/DOCS/Advanced+Queries#Advance...). A simple case-insensitive regex is not terrible performance-wise, but in general case-insensitive regexes should be avoided for search; you can normalize everything to lower case in a separate field, or use an equivalent trick (see the sketch after this list).

- The proper way of doing an audit (and searching it later) is to make an independent collection with a reference to the other document(s) in a different collection. Then you can index by user, date, or any other field and leave the main collection alone. The described embedded access collection doesn't look very scalable.

- Making map-reduce queries is tricky (at least for me). I think the guys at 10gen realize that, and the new aggregation framework is a way of dealing with it. Anyway, rich query capabilities are the main advantage of SQL. Even if MongoDB allows some of this compared with other NoSQL DBs, if there is a lot of work in defining new queries, a SQL DB is probably the best fit, as that is where SQL excels.
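
A minimal Mongo shell sketch of the normalization trick from the first point (the users collection and its fields here are hypothetical):

    // Store a lowercased copy alongside the original, and index it
    db.users.insert({ name: "Svs", name_lower: "svs" });
    db.users.ensureIndex({ name_lower: 1 });

    db.users.find({ name_lower: "svs" });  // indexed equality match
    db.users.find({ name: /^svs$/i });     // /i regex: scans the whole index instead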

I don't truly believe in this "you should research everything before starting" (I mean, I believe in research, but the "you should do your homework" argument is overused; sometimes you make a decision based on data that later changes, or turns out to be incomplete), as there are a lot of situations where you find problems as you go, not in the first steps. But, from the description, it looks like PostgreSQL is a better match and the transition hasn't been too painful, so we can file this under "bump in the road"/"success story". Also, the DB schema is probably much better known now than it was in the beginning, which simplifies the use of a relational DB.


- The proper way of doing an audit (and searching it later) is to make an independent collection with a reference to the other document(s) in a different collection. Then you can index by user, date, or any other field and leave the main collection alone. The described embedded access collection doesn't look very scalable.

I think this point is very important on the RDBMS side too. There are cases, even with relational datastores, that would perform better if the dataset were built to the query.

The difficulty comes into play when you are trying to keep the denormalized data up to date based on changes within the base dataset.


> "I think this point is very important even in the RDBMS side."

This is true. After all you can't extract an audit trail from a deleted record. Simply in terms of information management it often makes sense to represent this as separate info.

> "There are cases, even with relational datastores that would preform better if the dataset was built to the query."

The big problem with doing it that way is that you are screwed as soon as requirements change. I recently blogged about not-1NF designs in PostgreSQL (nested data structures for subset constraints), and the lesson I took away was that you really don't want to have your select queries hitting the same tables you maintain for inserts. You want your queries hitting a normalized data structure even if the data comes in as something different.
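
A sketch of one common way to keep a read-side table in sync with the normalized write side, using a trigger (all names here are hypothetical, and since pre-9.5 PostgreSQL has no upsert, this does the UPDATE-then-INSERT dance):

    -- Normalized write side
    CREATE TABLE orders (
      id          serial PRIMARY KEY,
      customer_id integer NOT NULL,
      total       numeric NOT NULL
    );

    -- Denormalized read side for reporting
    CREATE TABLE customer_totals (
      customer_id integer PRIMARY KEY,
      lifetime    numeric NOT NULL
    );

    CREATE FUNCTION bump_customer_total() RETURNS trigger AS $$
    BEGIN
      UPDATE customer_totals
         SET lifetime = lifetime + NEW.total
       WHERE customer_id = NEW.customer_id;
      IF NOT FOUND THEN
        INSERT INTO customer_totals (customer_id, lifetime)
        VALUES (NEW.customer_id, NEW.total);
      END IF;
      RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER orders_bump_total
      AFTER INSERT ON orders
      FOR EACH ROW EXECUTE PROCEDURE bump_customer_total();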


As an analytics professional who was pressured into a MongoDB environment, I feel the OP's pain. If you want to do gymnastics with your data, (aggregations of aggregations, joining result sets back onto data), SQL expressions are a 1000 times easier than Mongo constructs (e.g. map reduces). We usually ended up scraping out data from Mongo and dumping records into a SQL database before doing our transformations.

All that said, our developers loved the ease of simple retrieval and insertion, and of course the scalability. So I guess you ultimately need to base your decisions on your priorities.

I don't fault the OP though, since it's hard to know just how limiting NoSQL will be until you try to do all the things you used to assume were database table stakes (no pun intended).


Completely aside from the article: the level of vitriolic discourse in this topic is astounding. I am amazed that, as a community, discussions of database engines can draw out such mean-spirited anger. I have never downvoted as many comments on HN in a single thread as I have on this topic. I don't care which side of the debate you come down on; there is no excuse for belittling and insulting others in a technical forum. That's right, I am looking at you

    gregjor, gaius, and zemo
In this case it appears to mostly be those arguing for Postgre, but I wouldn't care if you were arguing for sunshine and unicorns - there is a way to behave civilly and you're not doing it.


You can find this level of discourse in plenty of topics every day. Programmers draw blood over indenting with tabs or spaces. It's geek entertainment.

I have never been upvoted so much on HN. I admit to strong opinions and light sarcasm but you'll have to show me where I've been uncivil, belittling or insulting, except perhaps in response to people who insulted me.


In this case it appears to mostly be those arguing for Postgre

Curious. I have never found a database named "Postgre" to be used by anybody. Perhaps you can direct me to the download site.

I think the bigger issue is that there isn't a lot of discussion from the NoSQL crowd about what you give up when you go to a NoSQL solution. I think that sort of disclosure would help people weigh the options a lot better.

From the comments of some people here you'd almost think they would build an ERP app in Mongo....


I've run into similar issues to those you described. Something that can be done simply and quickly in SQL was bewilderingly difficult to do in Mongo.

The schema-less database approach also seems attractive at first but updating your data whenever your "app schema" changes starts to become a pain real quick.

Now I can't really live without having a schema first; it actually saves you a lot more time in the long run (even the short run). Being schema-less means you can't really do anything too fancy with your data (generate reports, advanced search, etc...)


You still need to update your data when doing schema changes under SQL, and you have a lot less control over the process.

And you can do anything to your data without a schema, you just need to build your app as a service that provides access to it.

IME SQL schemas do more harm than good; usually you end up with a schema that's subtly weaker than what's actually valid for your application, and the difference between the two models will trip you up at the worst possible time. Have a small, distinct set of classes that you store, enforce that you store only those (and don't access the storage layer any other way), enforce that they remain backwards compatible, and enforce that you can't create invalid instances. But application code is the best place to do all these things.


IME SQL schemas do more harm than good

Completely disagree here. The basic tradeoff is between flexible input and flexible output. Without a rigid schema, ad hoc reporting is impossible because you don't have an ability to articulate reporting criteria. I.e. no declarative schema means no declarative reporting query.

I suppose that's ok as long as you never need to report on anything..... Might work....


I'm not saying don't have a rigid schema, just don't enforce it at the storage layer. If you're doing an ad-hoc report then you wouldn't have indexes in place for it in the SQL case, so I don't see how it's any worse or harder in mongodb.


>If you're doing an ad-hoc report then you wouldn't have indexes in place for it in the SQL case, so I don't see how it's any worse or harder in mongodb.

That's only true where you have to index everything you might want to search on, as with InnoDB. In PostgreSQL all you really need are your foreign key indexes and a couple (if that) of criteria indexes and you are good. That's more of an InnoDB limitation than a relational limitation. Basically, InnoDB tables are primary key indexes and can only be traversed in key order, not physical order, so sequential scans are painful....


If that level of indexing is sufficient I don't see why you can't just do the same thing in mongodb.


>"If that level of indexing is sufficient I don't see why you can't just do the same thing in mongodb."

What level of indexing is sufficient depends a great deal on the specifics of the database layout on disk. In InnoDB for example, sequential scans are very costly, and primary key lookups are very cheap. This is because the table is more or less contained in the primary key index and this must be traversed in key order since physical order is not supported. This means a sequential scan of a table means lots of random disk I/O and OS prefetching is useless.

So to address this you end up indexing everything you want to search on later. Note that non-pk indexes are a little slower in InnoDB because you have to traverse the index to find the primary key value, then you have to traverse the primary key index to retrieve the table info.

In PostgreSQL things work differently. The table is a series of pages on disk and rows are allocated from these as a heap. You can scan a table, but not an index, in physical order in PostgreSQL. Therefore PostgreSQL sequential scans on tables are typically a lot faster than on MySQL, because they are sequential, rather than random, page reads. Indexes point at the tuple ID, which stores the page number and row number within a page. An index scan is a tree traversal followed by processing the pages indicated by the tuple IDs.

This leads to a bunch of interesting things: Adding indexes is usually a performance win with InnoDB. However for PostgreSQL, it will typically look up what indexes it has and balance index scans against sequential scans of tables. Unlike InnoDB, sequential scans sometimes win out planner-wise, esp. on small tables.

So what indexes you need depends quite highly on how things are organized.


That's all pretty interesting, but I still don't see what you get with postgres that you don't get with mongodb. Your database won't enforce your schema for you, but I don't see how that means "ad hoc reporting is impossible".


I have read through the Mongo db query docs and it does look like you can do some ad hoc retrieval queries, and some aggregation. But in the SQL world that's not really the same thing as ad hoc reporting.

I suppose "can't do" is too strong assuming your reporting matches your query. However these things look a lot simpler to do in SQL than in Mongo's approach, and I don't see how you can reliably transform data on output if you don't have a guaranteed reliable schema on input. Also I don't really understand how you would transform the data in this way with Mongo's API. I suppose you always could but it looks painful to my (admittedly uninitiated) eyes.

How many lines of code are required to express a 50 line SQL query doing 5 joins, complex filters and group-by transformations, etc?


Obviously if your report makes assumptions about your data which aren't true then you might get invalid data out. I absolutely agree with having a single point through which writes to the data store must pass which enforces business-level constraints on the data. I just don't find SQL a convenient form to express those constraints (and my experience has been that any given business domain will have some constraints that are too complex to express in the SQL model, forcing you to resort to e.g. triggers - at which point the constraint is not integrated with the rest of your data model, it's just a piece of code that runs to check your inserts, which you could do equally well in any language); I'd rather do it in a "normal" programming language.

I see what you're getting at with reporting now, you're talking about doing actual calculations on the data? For mongodb I'd probably use its map-reduce API, at which point you're writing javascript and you can do anything, and performance should be fine. Though honestly other than performance I've never had a problem with just fetching the rows I need and doing whatever transformation/aggregation logic in e.g. python. SQL has never struck me as particularly typesafe or gaining much advantage from being close to the data; its sets-oriented way of thinking can be helpful in some places, but it's not the only language that can do that.

If you like SQL as a language for writing logic in I can see why a traditional database would appeal. But even then I feel like input validation, data storage and query processing should be separate layers (and I see some movement towards that with e.g. MySQL offering an interface to do key/value lookups directly). If SQL is a good way of doing transform/aggregation logic then it should stand alone, and I'd expect to see SQL-like query layers on top of key-value stores.


Looks like he jumped ship just a bit too early.

http://docs.mongodb.org/manual/applications/aggregation/


You have to load every document in the database and extract the audit trail from it, then filter it in your app for the user you’re looking for. Just the thought of what that would do to my hardware was enough to turn me off the whole idea.

Naive question from somebody who has done a little reading on and dabbling with key/document-with-MapReduce style datastores, but who hasn't tackled a real production problem: I thought running queries over the entire dataset was one of the assumptions of horizontally scalable document stores? In terms of avoiding computation, you can only limit queries by document key, which even if you're clever/lucky doesn't always encode the parameters you're querying on, or doesn't encode them in the right order, so you should be prepared to run queries over your entire dataset. Hopefully the queries you run often are optimized (e.g., using indexes or clever use of key ranges), but in the general case, you have to be prepared to scan the whole shebang, and that's supposed to be okay because of horizontal scalability, right?


Yep, or you need to build some other construct to support it (i.e. keeping running tallies and the like).

It's a tradeoff between the benefits of the document store vs the loss of relational data. The blog author here clearly didn't understand the trade-offs he was making.

As always with these discussions: it's important to use the right tool for the job. I'm a big fan of what Mongo is doing. I've used it on two higher-scale projects and have no complaints. Of course, I'm using it in the context for which Mongo excels.


As a relative idiot when it comes to this sort of thing, I'd like to insert the following supplementary question: what is the sort of application/dataset for which Mongo is particularly suited?

I've used it on small projects, and have enjoyed it. Perhaps my data has just been simple/loosely-coupled enough to never run into these problems?

I read a lot of posts like this on HN before ever trying Mongo, so I've at least been convinced to always implement the schema at the application layer. Others seem to keep learning that lesson in harder ways.


The biggest lure of Mongo is that it gives you a nice SQL-like query API, so it's fairly easy to get started with compared to other NoSQL alternatives. I primarily use it for small-to-medium-size apps - when I know upfront that I will never need to scale beyond a certain number of users in the short-to-medium term.

It's not as bad as it's made out to be. It's only if you really are looking to scale out that you'd probably be better off picking something else.


Can you explain your reasoning? Isn't high scalability one of MongoDB's key features?


When you go to the supermarket, you may see products calling themselves "quality" or "luxury" or similar. Words like "powerful" and "scalable" are like that.


Thanks, but you didn't really explain anything. In what specific ways does it fail to scale well?


A few of the major issues I've seen around: the global write lock, which is starting to get taken care of; the reliance on the OS to cache, instead of caching intelligently based on the data; poor failover support; and sharding not being the "turnkey" solution it was supposedly advertised as - misconfiguring it can lead to poor performance, and bringing extra nodes online can take a long time while data migrates over to the new nodes.


Scaling has a number of components to it - it's not just about the absolute number of requests. There are always trade-offs (read up on the CAP theorem). My experience with Mongo has been that it has very inconsistent performance, and such inconsistency makes capacity planning very difficult. Also, stuff like failover, sharding etc. is not elegant.


"what is the sort of application/dataset for which Mongo is particularly suited"

The majority of the NoSQL databases are based on Amazon's Dynamo: loosely coupled replication. MongoDB is one of the few (next to HBase and a few others) that adopts Google BigTable's architecture: data is divided into ranges, and each mongod node serves multiple ranges.

This means MongoDB is able to provide atomicity where it's harder for other NoSQL databases. In particular, we need to be able to do some sort of "compare and swap" operation that is guaranteed to be atomic/consistent, while still being able to have our mongod nodes distributed over multiple datacenters.

In Dynamo-based architectures, in order to provide the same amount of atomicity, you always end up writing to at least half + 1 of the replicated nodes available in your cluster. This is more awkward and reduces the flexibility of the whole system (the atomicity guarantee Mongo provides also works for stored JavaScript procedures, for example).
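
For illustration, a compare-and-swap in the Mongo shell might look something like this (the jobs collection and its fields are made up):

    // findAndModify matches and updates atomically on a single document:
    // only one client can move a given job from "pending" to "running"
    db.jobs.findAndModify({
      query:  { _id: someJobId, state: "pending" },  // the "compare"
      update: { $set: { state: "running" } },        // the "swap"
      new:    true                                   // return the updated doc
    });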

Having said that, we've been using MongoDB in production for about 3 years at this point, and we're far from happy about the availability it provides (issues like MongoDB not detecting that a node has gone down, failing to fail over, etc). We run an HA service, and to date all of our uptime failures have been either the fault of our hosting provider or MongoDB not failing over when it should. As such, we're always looking for a better alternative to move to, but at the moment MongoDB is about as good as it gets.


MongoDB is not even remotely close to the BigTable architecture. It has a different data model, a different sharding model, and just about a different everything.


I know that MongoDB's architecture is very different from BigTable's (as opposed to HBase and BigTable, for example), but I always understood that the fundamental way it assigns and looks up regions on regionservers (or, in MongoDB terminology, shards on shard servers) was based on the BigTable architecture.

Could you elaborate on the differences in the sharding model between the two?


Mongo's sharding model is the only thing I would call remotely close to BigTable. Everything else is leagues different.


You said that you don't like the availability issues then state "but at the moment MongoDB is about as good as it gets."?

Just. Wow.

Get your head out of the sand mate. MongoDB is nowhere NEAR as good as it gets.


I think you misunderstood what I meant; I meant that MongoDB is as good as it gets for our requirements, not availability-wise. I agree if availability is your only concern, there are far better solutions.


Documents with lots of optional fields, or lots of fields that can hold multiple values.

Workloads where writes are rare or you have a single writer separate from your readers, most reads are simple fetch-by-ID, and more complex queries are unpredictable and/or suitable for overnight batch runs.

Workloads where performance doesn't matter but you want schemaless for convenience.


digiDoc is all about converting paper documents like receipts and business cards into searchable database, and so a document database seemed like a logical fit(!).

It looks like this single initial assumption is where things started going wrong: conflating the pieces of paper that happen to be called "documents" in the real world with the concept of a "document" in the context of a system like MongoDB.


It's an easy mistake to make - to assume two things with the same name might be similar. Especially when paper documents have been called thus for a very long time.

It seems like another instance where all the good names were taken. http://jcooney.net/post/2012/04/03/Transaction-Argument-Clas...


The guy is getting such hate in the comments on his site, yet his opening line is that his choice was ill thought out. Let him express his issues, right?

I chose MongoDB for my last side project, and while working schema-less was awesome and developing the client-facing part of the project was certainly quicker, I feel pretty lost on the analytics/BI side of it and couldn't say it better than him: "Not having JOINs makes your data an intractable lump of mud"

So coming from a relational/SQL background I found MongoDb awesome upfront, but frustrating later on... and yes I'm off to learn http://docs.mongodb.org/manual/applications/aggregation/


The downside, or challenge, with NoSQL (generally speaking) is that you need to handle your aggregations ahead of time - you need to know what queries you'll want to run in the future when you store your data. If you have some new aggregation you want to keep, you'll need to re-process the data (with Hadoop or something else).

It's the trade-off of being able to scale reads and writes horizontally. And unless you need it, an RDBMS makes sense given the flexibility.

Maybe, instead of looking at NoSQL as a full-on replacement for RDBMS, we can look at it as a better solution to sharding.


This is the opposite of "agile." It is difficult to know where your product will be in 2 months let alone 12, so it seems the advice to use SQL first is sound - unless you enjoy long distractions to solve simple JOINs.


If you are refactoring your app enough to need a different class of joins, you'll probably need schema and data migrations too.

I think no-schema fits agile quite well. For rapid prototyping, I prefer Mongo to even sqlite.


You should at least know if you will likely have any analytic use cases.


This is also true of SQL databases; it just depends on _what_ you're aggregating. If you need to aggregate the count of something, then in a DB like MySQL you can lean on the index to get a quick count, but if you need to SUM or AVG you will be doing just as much CPU-level work as a NoSQL solution. The difference is that in MySQL it's a simple query with the AVG operator, while in MongoDB it's a map-reduce query, which is much harder to write.
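
To make that concrete, here is roughly what the same average looks like both ways (the products collection/table and its fields are hypothetical):

    // SQL: SELECT category, AVG(price) FROM products GROUP BY category;
    // The MongoDB map-reduce equivalent, sketched:
    db.products.mapReduce(
      function () { emit(this.category, { sum: this.price, n: 1 }); },
      function (key, vals) {
        var acc = { sum: 0, n: 0 };
        vals.forEach(function (v) { acc.sum += v.sum; acc.n += v.n; });
        return acc;
      },
      { out: { inline: 1 },
        finalize: function (key, v) { return v.sum / v.n; } }
    );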

In general you need to know what you're doing under the hood and how either solution affects your problem domain. Where I work we need to aggregate billions of data points on demand. This can't be done in real time without pre-aggregating the results (and even then it takes tons of I/O just to process the aggregated data set).


The guy just jumped on the bandwagon without having a clue.

Just reading this blog - it's clear that MongoDB was not a good fit for him, if he had bothered to do some research, he would have found this out on day one.

That's the real lesson he should be taking away from this and blogging about, yet somehow MongoDB are trolls and it's all their fault because of a lack of features and they have bypassed 40 years of computer science and blah blah blah blah, excuses, excuses, excuses.

edit: removed a few pointless sentences :-)


"I can only come to the conclusion that mongodb is a well-funded and elaborate troll."

It's possible the reasoning he used to use mongodb is the same as the one he used to abandon it.


For me, what killed my enthusiasm for mongodb is the write locks. Yes they have been greatly improved in the 2.x release but it's still not good enough (for me).


Look, there are some places where document DB's solve problems easier/better than SQL, other places kind of suck. For example, plain old object mapping is easier with a document DB. Relational DB's tend to make your code look/feel/act more relational and less object oriented. Your object model tends to look just like your table structure. This can be good or bad depending on your viewpoint.

There are some approaches to solve some of the author's problems that end up making the Mongo system look and feel a lot more like a SQL system because sometimes data is actually related.

The author could have also taken a different approach to his data schema that would have fit more of a non-relational worldview.

Software development and architecture is about making choices and working with and around the limitations of your tools. It doesn't matter if PostgreSQL or MongoDB are "better". It's about solving a problem using a set of tools you are comfortable with.


I find this article to be more a reflection of a NoSQL newbie's failed foray into a document database; he later realized that the grass is not as green as it originally looked. The developer realized that he does not like map-reduce and that he missed having joins. I don't see how this person's failed experience with MongoDB is a reflection on MongoDB.

I think the recent popularity of MongoDB bashing is maybe a testament to MongoDB's popularity. I'd guess that because MongoDB is probably the closest NoSQL database to an RDBMS, with its ad hoc queries, it is attracting many newcomers.


Yes, MongoDB is more of a general-purpose database with lots of features that remind us of relational databases. It is a purpose-built application database for applications that would otherwise almost certainly be built on a relational database. As teams get deeper into their projects and find out how some of the trade-offs play out, their self-doubt is always "should we just go relational" - no one is staying up all night wondering if they should migrate to Riak. If you start out with Riak, you almost certainly know why and are using it in a very specific context.


Can someone confirm that there is no such thing as a case-insensitive index/search in Mongo? If true, it seems likely that the author's comments have some degree of truth, at least when it comes to its usefulness for web and mobile applications. Storing data only in lowercase isn't a good idea for obvious reasons, and storing two copies of the same data just for searching, while not the end of the world, seems a little silly.


Case-insensitive regex searches are supported.


Do you know why case-insensitive searches are not recommended? What's the realistic workaround?


http://www.mongodb.org/display/DOCS/Advanced+Queries#Advance...

Regex queries ending with /i (case-insensitive flag) cannot efficiently use indexes, but must do full index scans.
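
In shell terms (the users collection here is hypothetical), the difference is roughly:

    db.users.ensureIndex({ name: 1 });

    db.users.find({ name: /^smith/ });   // anchored + case-sensitive: index range scan
    db.users.find({ name: /^smith/i });  // /i flag: must scan every key in the index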


I've found that a simple {"lastname":/cholis/i} works great. However, trying to do the same thing for a multi-key search isn't ideal. Specifically, searching for 3 words in a title using $and with multiple regex queries on a collection with 100,000+ documents took about 520 ms.

The mongodb documentation suggests that you could have an array with your keywords, generated from the field you wish to search. Using indexes on multikeys would make this faster, but your index size would be much larger.[1]

For my project, I'm likely going to adopt a solution like Elasticsearch or Solr.

[1] http://www.mongodb.org/display/DOCS/Full+Text+Search+in+Mong...
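
The keyword-array approach from [1] would look something like this (the collection and field names are made up):

    // Generate a keyword array from the field you want to search
    db.articles.insert({
      title: "Why I Migrated Away From MongoDB",
      title_keywords: ["why", "i", "migrated", "away", "from", "mongodb"]
    });
    db.articles.ensureIndex({ title_keywords: 1 });   // multikey index

    // Match all three words via the index instead of three /i regexes
    db.articles.find({ title_keywords: { $all: ["migrated", "away", "mongodb"] } });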


Building on a full-text search engine that indexes your DB is going to scale much better in the long run. Plus you get stemming and other niceties to boot, as long as consistency is not of the highest priority.


With tools like SOLR or Elasticsearch, I keep wondering why people try to re-invent the full-text indexing wheel - these tools make it easy to focus your app/datastore on your functionality and leave search as an additional service, to be used whenever you need it.


The documentation mentions that prefix searches will be much faster than other searches. Most likely reason is that the index is mapped with some sort of prefix tree, perhaps a case-sensitive prefix tree.


There are 3 reasons I have gone running and screaming from an RDBMS:

1. Software gets large/complex to get meaningful work done. I am all about data consistency but at some point it is time to break things up into services and not have a single database.

2. If the software is popular enough everyone is running to use NoSQL (cache is NoSQL).

3. Clearly it is not always a good storage solution either because, for example, in an address book a nested list greatly simplifies everything (right tool for the job).

I spent many years hammering away with RDBMSs and by and large it was great, until it wasn't. I try to look at data storage more holistically now, based on a best guess at the problem. I once tried to convert an application from PostgreSQL to MongoDB and it failed, but that wasn't MongoDB's fault; it was because I didn't change the data model to fit a document storage system. I have also tried to use PostgreSQL for a realtime reporting system and failed horrifically, and that was not PostgreSQL's fault; it was mine. Amazing what happens when you stop pushing a chain and start pulling it!


>1. Software gets large/complex to get meaningful work done. I am all about data consistency but at some point it is time to break things up into services and not have a single database.

This is true. Managing complexity is always an important task. I am not sure that NoSQL solves this however. Also the best way to break things up is to loosely couple things, and this requires to some extent that you have ACID compliance. A good RDBMS, like PostgreSQL or Oracle, will provide tools for managing that loose coupling.

>"2. If the software is popular enough everyone is running to use NoSQL (cache is NoSQL)."

Like proverbial lemmings over a cliff....

>"3. Clearly it is not a good storage solution either because for example in an address book nested list greatly simplifies everything. (right tool for the job)"

Funny, I thought nesting was what WITH RECURSIVE was for....
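
For the address book example, a minimal sketch (assuming a hypothetical groups table that stores the nesting via a parent_id column):

    -- Walk a nested contact-group tree in one query
    WITH RECURSIVE tree AS (
      SELECT id, parent_id, name, 1 AS depth
        FROM groups
       WHERE parent_id IS NULL
      UNION ALL
      SELECT g.id, g.parent_id, g.name, t.depth + 1
        FROM groups g
        JOIN tree t ON g.parent_id = t.id
    )
    SELECT * FROM tree ORDER BY depth;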

I am not saying there aren't use cases for MongoDB or reasons to switch some applications. For example, I can think of a few really cool apps, like maybe a network back-plane for a huge LDAP directory. Content management might also be a good fit. But despite your years of experience, it doesn't sound like you have really looked at how to solve these problems with good RDBMSs.....


I work for a company that operates in the energy industry. We utilize both RDBMS and "NoSQL", both have their purposes that they fit in well. We store customer account and configuration data in Postgres, and use Cassandra to store time-series statistics and high write volume data.

I have a background in data warehousing in both Oracle and SQL Server, and was part of the decision to use a polyglot persistence model. I've got at least a decade's worth of experience in the DW world, and more as a general developer before that, so I like to think I've got a relatively credible background in a variety of data stores.

I haven't looked at Mongo much - its durability concerns and the write-lock stuff pushed me away from it early on (I don't mean to disparage it, but that was where it was at when I evaluated it). Cassandra's configurable consistency levels and cluster-level operational story are what sold us for our time-series data (that, and the ability to construct a sparse timeline and multiplex reads/writes). For anything we need flexible querying on, we push the data into specialized Postgres DBs.

The level of willful ignorance and vitriol in this thread is kind of amazing. Most of the really experienced DW guys I know are all looking at HBase, Cassandra and others, because they fit a niche we've all been looking for with certain data sets at really large scale. It doesn't mean we're ditching our relational data stores; it just means we're augmenting them with other tools because they fit the job at hand. To suggest that one tool is absolutely perfect for every scenario seems a little short-sighted to me, possibly driven by inexperience. I don't mean that as an insult - I know a lot of guys who've been working on the same data sets for 30 years who really do just need the one tool - however, you've got to realize there are other data sets and problems for which your hammer just won't fit.


On your home page you imply that you can automatically OCR arbitrary handwritten receipts into an analyzable format.

No one can do that. That is your problem, not MongoDB.

As far as aggregation, use the new Aggregation Framework http://docs.mongodb.org/manual/tutorial/aggregation-examples...:

    db.zipcodes.aggregate(
      { $group: { _id: "$state",
                  totalPop: { $sum: "$pop" } } },
      { $match: { totalPop: { $gte: 10 * 1000 * 1000 } } }
    )
As far as "losing the independence of your data access paths", no you don't. You are free to use linking instead of embedding wherever you want. http://www.mongodb.org/display/DOCS/Schema+Design#SchemaDesi...

MongoDB doesn't have a built-in full text search? So what. Most systems with large amounts of text to search do not rely on the text search capabilities built into relational databases anyway. People use actual full-text search engines like Lucene/Solr, Sphinx, Redis, etc. Having said that, if you just wanted to support lowercase keyword queries with MongoDB, would it really be so hard to extract and store lowercase keywords from your text, as suggested here? http://www.mongodb.org/display/DOCS/Full+Text+Search+in+Mong...

If you are trying to add four 1s and get '1111' instead of 4, that is an error in your application code which has nothing to do with MongoDB. Very common problem with JavaScript. If it is JavaScript, try finding the code where you are attempting numeric addition and change it so that instead of saying for example 'total += newNumber' it says 'total += (newNumber * 1)' .


"On your home page you imply that you can automatically OCR arbitrary handwritten receipts into an analyzable format. No one can do that. That is your problem"

Jeez, lay off the confrontational tone. He doesn't say anything about OCR. Maybe he's using humans to do data entry? In any event, it's completely irrelevant to the topic of databases.


I notice you ignored all of my several very specific points directly related to his issues with the database system and your only comment was a criticism about the tone you perceived.

OK, maybe he is using humans to do data entry. The home page to me implies that the process is automatic, but I guess it doesn't rule out the possibility of humans doing data entry when he says 'tag and categorize'. But if he is using humans to do data entry instead of some automatic OCR, that is still his main business problem, rather than MongoDB. The application is relevant to the database discussion, and Hacker News is about all aspects of startups.


It's because I have nothing to say about the database stuff. Why are you so adversarial? I'm not here to cross swords with you; I don't have an opinion on the matter.

But I did notice your rudeness, and you're now being rude to me. Totally uncalled for.


I was not rude to the poster. I corrected him as far as his misguided complaints about MongoDB and the main problem for his business. That is the only way to help him.

I was definitely not rude to you either. You suggested that my comment was irrelevant, and I pointed out that my comment had a lot of relevant content in it whereas by your own definition of relevance your comment had none.


We don't do OCR, we have people who scan the documents if you snailmail them to us (or you can upload the scans directly to your account, email them in etc.) and then convert these images into queryable data with special emphasis on receipts (also invoices, bills, bank withdrawals, etc) and business cards. It's a service that our customers are happy to pay for, so I think I'm not quite getting your point about why this model won't work.

Had it been just one of the issues I've pointed out, I would be quite happy to work around it. I do that routinely with other pieces of technology that do 95% of what I want. However, the hits kept coming. At the same time, I was using Postgres on another project and the question was begging to be asked: if I am going to run extra services like Lucene and do extra work to achieve full-text search, what am I getting in return? And I am sorry to say, case-insensitive search is quite a basic feature.

As for the 1111 issue - I admit it can be fixed. But the fact still remains - there's no way I would face this problem with Postgres. So why not just switch?

Postgres is a really nicely thought out and well executed database. And don't get me wrong - I am not building a system designed to operate at stratospheric scale. That would be premature in my opinion.

Thanks for taking the trouble to comment.


Relational databases are awesome if you are not dealing with amounts of data so huge that your current hardware can't handle them the relational way. There are some cases where you have a ridiculous amount of data (rows), simply can't store it in a relational database, and are happy to live without the benefits of relational databases.

If you have millions of rows, you are probably better off with something like MongoDB, and if you need to search it, you should probably use something like Sphinx or Lucene anyway. But if you know that you won't have too much data for the foreseeable future, you should use a relational database. Or you could simply use both.


A relational database on very modest hardware can handle millions of rows, and with a solid database, good hardware and a DBA who knows his shit you can handle billions.

OpenStreetMap has over a billion nodes stored in a PostgreSQL database.

http://www.openstreetmap.org/stats/data_stats.html

My point is that you can get very far with a classic relational database before you have to scale horizontally.


Billions of rows, btw, is not a problem even on modest hardware with a half-decent db. It's analytics over those rows where it possibly becomes an issue. Retrieving one row out of 1 billion is not much more complex than retrieving one row out of 1 million, nor is it much more expensive computationally, assuming the right index is in place.

The problem comes with high concurrency, in particular very high write concurrency, or with very complex queries which require a lot of RAM to do properly. But that's where you need a solid db, good hardware, and a solid DBA.


Very true. If you do not do much with your data it is trivial to handle.


I always get a kick out of Stonebraker selling NewSQL/VoltDB as the solution to this sort of problem. He provides a triangle where at the vertices are VoltDB, column stores, and NoSQL.

However, what he always fails to mention is that some of the most important OLTP that happens is in enterprise ERP systems, and these involve every kind of workload of all of these. Yes, you could probably get some additional performance gains by splitting up your ERP into VoltDB, Vertica, and CouchDB with connectors between them but why?

And relational data doesn't even start to get moderately big until you are in the TB range. This is true even of ERP apps.

Millions of rows are a problem? What are you using? Microsoft Access?


There may be cases where non-relational databases are the right solution, but unless you are using a toy RDBMS millions of rows is not considered a lot of data. Millions of rows added every day... now you're talking. Oracle handles that kind of thing just fine.


Taking all these comments into account: if startups have to prototype quickly while trying to find market fit, does it make sense to start off using something like MongoDB with a plan to migrate to another database when the business starts growing? The database space is so confusing right now. It seems like Postgres is the safest choice, and I also like this post from Adam D'Angelo: http://www.quora.com/Quora-Infrastructure/Why-does-Quora-use...


It's not necessarily all or nothing - I have worked on several projects now each using multiple database-type options: mongodb for read-intensive, loose-schema type stuff where the growth is generally predictable (e.g. products, suppliers, logs), postgres for relational-type stuff (orders) and solr for searching (I know solr isn't a database but people seem hung up on whether mongodb supports case-insensitive searching - hint: don't use any database for search).

I doubt that, unless it's extremely simple, any set of requirements is an exact match to only one of these technologies... mix and match is the future :P


Very briefly put, I use MongoDB to start off most new projects.

My primary objective is that my application fulfills its use case. Data is malleable, so you should use the right tools for your needs.

That being said, it sounds like the OP was trying to use a chisel as a replacement for a toolbox. Basically fighting the software (mongodb) to fit his requirements, instead of using additional tools.

http://blog.8thlight.com/uncle-bob/2012/05/15/NODB.html

http://blog.heroku.com/archives/2010/7/20/nosql/


Wow, the first comment on the blog is so vile. He angrily blames the "victim" (the OP) as a talentless developer.

Tools are enablers, supposed to make ordinary people rock stars. If it takes a rock star to use a tool, the tool has failed.


If he had just done 10 minutes of research, he would have realized MongoDB isn't for him. He also would have realized that MongoDB has no data constraints; that's all done in the model in your application.


You used a tool without researching it first, you jumped a bandwagon without finding out its destination, you most likely used it wrong because you didn't RTFM. Now you ditch and diss it. Grow up.


Very harsh given that he points out he made these mistakes right at the beginning of his post.


Fair enough, perhaps a bit harsh. But to go from that to calling an entire system a troll... It works for some, it doesn't for others. Usually it's down to the use case and knowledge of how things work. For me MongoDB was a perfect fit as a datastore alongside a traditional Postgres instance.


I've never used Ruby on the application layer, but I'd be wary of using an ODM with MongoDB. The single most shocking issue seems to be at the application / ODM level. Using the official 10gen-supported drivers gives you more control and a better understanding of what's going on every step of the way.

Also, a thorough understanding of MongoDB indexing, advanced queries and schema design would have squashed all of these issues. Has anybody had a more pleasant experience with a MongoDB ODM?


To be fair, some NoSQL solutions were being marketed as the be-all and end-all of data solutions. Just google "mongodb mysql migration" and look at how eager everyone is/was to jump on the non-relational bandwagon. Some backlash was to be expected; after all, we might have reached the Trough of Disillusionment.


Made the same mistake on a banking project 10 years ago (with Domino). The project in question was a project-tracking database.

For accounting type problems use a relational database. For document driven items - e.g. a resume database - nosql works great. For a hybrid pick your battles... or use both.


I think the real lesson here is use the right tool for the right job.


Calm down, redditors - he just needed a basic, non-massive-scale solution, and SQL fit him very well. Large-scale data aggregation needs to address disk and network latency; that's where NoSQL shines. And if you operate at large scale, no single simple tool will do it justice. Remember the quote from Google, "at scale everything breaks"?


Calm down, redditors

Please knock this crap off.


Upvoted for?


Reasons why I dropped MongoDB:

1. Try combining $or and $and with $near.

2. No b-tree index for count(): count() on 10,000 rows = 100% CPU usage ;)

Google site:jira.mongodb.org/browse/ and you'll find plenty of "planned but not scheduled".



