MongoGate — or let's have a serious NoSQL discussion (erlang.de)
94 points by zeit_geist on Nov 6, 2011 | 32 comments



My big concern with so-called NoSQL solutions is the "culture" that seems to be brewing there.

If you go to the "Don't use MongoDB" post ( http://news.ycombinator.com/item?id=3202081 ) you will read some, IMO, extremely worrying comments from a few pro-NoSQL users including antirez (Redis).

For some reason NoSQL now apparently means "unreliable datastore for unimportant, throwaway data" and defaults are chosen accordingly. Why the hell is that?

NoSQL for me doesn't imply anything other than "no SQL", and at a stretch "no schema" - this makes a lot of sense for many of us who routinely need to create databases that are logically trivial. In many cases they are a bunch of glorified persistent hash tables that usually don't fit in memory. But this doesn't mean they aren't critical. Why would it have to? This isn't anything new either, we've had Berkeley DB for a long while. It's just a bit on the dry side and it may fall short in many cases.

What I was looking forward to and I hoped I could find in the "NoSQL scene" is an alternative to traditional DBs but without the overhead that many times is not necessary (but sometimes is, and I intend to continue using PostgreSQL when appropriate). Ideally, something as simple as mongoDB appears to be (tried the interactive tutorial).

So when exactly did NoSQL stop meaning "no SQL" and start meaning "unreliable cache"? Other than the simplicity, I fail to see where it would fit in the market then (other than the amateur market). There are better, established DB caching solutions. There are persistence libraries in any moderately popular language. There are reliable databases that are fast enough when you have the budget to scale to several dedicated servers.

How about Riak?


NoSQL has never meant "unreliable datastore for unimportant, throwaway data". If it did, there would be no need for the MongoDB rant because that poor level of at-scale reliability would have been understood from the beginning. MongoDB wasn't marketed as an unreliable data store, so expectations weren't met by the rant author.

I'm worried about the culture that's brewing as well, but I see it more as an attempt from some NoSQL supporters to keep MongoDB looking good, even in the face of serious data integrity issues. The battle lines are forming between SQL and NoSQL (relational vs. non-relational data stores, really) and there's a lot of money and reputation at stake. What we don't want is for the facts to die in a war of rhetoric about the merits of SQL vs. NoSQL. That would be dumb.

With that said, the first paragraph of the rant is worrying:

"I've kept quiet for awhile for various political reasons, but I now feel a kind of social responsibility to deter people from banking their business on MongoDB."

What the hell does "various political reasons" mean? I'm more concerned about that than any deficiencies in MongoDB's codebase. Is there a well-funded campaign to silence MongoDB/NoSQL criticism, or is this just one customer's attempt to save face for choosing the wrong data store?



CouchDB is designed to be as durable / reliable as the underlying I/O abstractions (POSIX and friends) will allow. This in-memory stuff is really just a minority. I believe Cassandra is reasonably durable as well.


[ First off: I'm a committer on Project Voldemort, a Dynamo-style distributed data store ]

First, Riak is excellent. I can only say positive things about it as well as the folks that work on it.

Re: "store for unimportant data". I'll go beyond that. Not only should new databases be suitable for reliable storage, new databases should do things that existing databases can't. I am a bit sad that NoSQL has come to mean "replacement for an improperly tuned, ad-hoc sharded MySQL setup". To be clear, having a simple setup that provides partitioning, replication and defaults more tuned to modern hardware is a fine goal -- but why not do better? If I wanted something better than MySQL, I'd use Postgres (or properly tune my MySQL installation).

For example, Dynamo-style stores allow for any replica to initiate a write (something not possible with primary copy replication), allowing high availability applications. Some systems (Voldemort, riak-core, HBase with co-processors) also allow custom code to run on the server, significantly extending the capability of a system in a way in which a stored procedure can't.
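
To make the "any replica can initiate a write" point concrete, here is a minimal in-memory sketch in Python. It is not how Voldemort, Riak, or Dynamo are actually implemented: the `Replica` and `Cluster` classes are hypothetical, a plain integer version stands in for vector clocks, and N=3, W=2, R=2 are assumed values chosen so that R + W > N.

```python
class Replica:
    """One storage node; keeps (version, value) pairs in memory."""

    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}  # key -> (version, value)

    def put(self, key, version, value):
        if not self.up:
            return False  # unreachable node: no acknowledgement
        if version > self.data.get(key, (0, None))[0]:
            self.data[key] = (version, value)
        return True

    def get(self, key):
        return self.data.get(key, (0, None)) if self.up else None


class Cluster:
    """Hypothetical N-replica cluster with write quorum W and read quorum R."""

    def __init__(self, n=3, w=2, r=2):
        if r + w <= n:
            raise ValueError("need R + W > N so read and write quorums overlap")
        self.replicas = [Replica(f"node{i}") for i in range(n)]
        self.w, self.r = w, r

    def write(self, coordinator, key, value):
        # Any live replica may coordinate; there is no designated primary.
        if not coordinator.up:
            raise RuntimeError("coordinator is down; pick another replica")
        seen = [rep.get(key) for rep in self.replicas]
        # Simplification: an integer version instead of a vector clock.
        version = 1 + max(v[0] for v in seen if v is not None)
        acks = sum(rep.put(key, version, value) for rep in self.replicas)
        # Note: a failed write is not rolled back on the nodes that accepted it.
        return acks >= self.w

    def read(self, key):
        replies = [v for v in (rep.get(key) for rep in self.replicas) if v is not None]
        if len(replies) < self.r:
            raise RuntimeError("read quorum not reached")
        return max(replies, key=lambda pair: pair[0])[1]


cluster = Cluster()
a, b, c = cluster.replicas
ok1 = cluster.write(a, "cart", ["book"])         # coordinated by node0
a.up = False                                      # node0 fails
ok2 = cluster.write(b, "cart", ["book", "pen"])  # node1 takes over as coordinator
latest = cluster.read("cart")
```

With primary copy replication the second write would have to wait for a failover; here it succeeds as long as W replicas acknowledge it.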

It's also sad to see NoSQL style systems repeat many mistakes that MySQL has made. MySQL in the late 90s with MyISAM is a completely different beast from MySQL today with InnoDB: far better concurrency, durability, referential integrity, better replication. BerkeleyDB JE is also a powerful beast: log structured storage (this is why we're using it as the default storage engine in Voldemort), Paxos-based leader elections with tunable replication.

Schema-less data or (as in Voldemort) evolvable schemas is also a huge feature, but it's not impossible to replicate it on top of MySQL (e.g., Friendfeed's data model).
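
As a rough illustration of schema-less data on top of a relational store, the sketch below keeps each entity as an opaque serialized body and maintains a separate index table for the one field that needs lookups. It is only loosely in the spirit of that kind of data model: SQLite stands in for MySQL, JSON stands in for whatever serialization a real system would use, and the table and function names (`entities`, `index_user_id`, `save_entity`, `find_by_user`) are assumptions made for this example.

```python
import json
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
# Entities are opaque blobs; the database knows nothing about their fields.
conn.execute("CREATE TABLE entities (id TEXT PRIMARY KEY, body TEXT NOT NULL)")
# One narrow table per indexed field, maintained by the application.
conn.execute(
    "CREATE TABLE index_user_id ("
    "user_id TEXT NOT NULL, entity_id TEXT NOT NULL UNIQUE)"
)


def save_entity(entity):
    """Store a dict as a JSON body and refresh its index row."""
    entity_id = entity.setdefault("id", uuid.uuid4().hex)
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO entities (id, body) VALUES (?, ?)",
            (entity_id, json.dumps(entity)),
        )
        conn.execute("DELETE FROM index_user_id WHERE entity_id = ?", (entity_id,))
        if "user_id" in entity:
            conn.execute(
                "INSERT INTO index_user_id (user_id, entity_id) VALUES (?, ?)",
                (entity["user_id"], entity_id),
            )
    return entity_id


def find_by_user(user_id):
    """Look up entities through the index table, then decode the bodies."""
    rows = conn.execute(
        "SELECT e.body FROM index_user_id i "
        "JOIN entities e ON e.id = i.entity_id "
        "WHERE i.user_id = ? ORDER BY e.id",
        (user_id,),
    )
    return [json.loads(body) for (body,) in rows]


save_entity({"id": "e1", "user_id": "alice", "title": "hello"})
# A new field ("tags") needs no ALTER TABLE; it just appears in the body.
save_entity({"id": "e2", "user_id": "alice", "title": "again", "tags": ["nosql"]})
alice_posts = find_by_user("alice")
```

The trade-off is that index maintenance moves into application code, which is exactly the kind of work a purpose-built store could take over.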

Here are some things that I'd like to really see evolve in NoSQL space:

* Support for new and interesting distribution models. Allowing users to choose between eventual consistency, quorum protocols, primary copy replication and even transactional replication.

* Support for large, unstructured blob data: Riak is going the right way with Luwak, I believe Facebook has been using HBase as a front-end for Haystack -- it would also make a great choice for Haystack's metadata store.

* Most NoSQL systems support transactions within the scope of a single value (or document) via the use of quorums, serializing through a single master, etc... However, it'd be nice if something like MegaStore's Entity Groups (or Tablet Groups in Microsoft Azure Cloud SQL server) were supported.

* Secondary indices, whether internal or external (by shipping a changelog) to the system.

* True multi-datacenter support (local quorums if desired, async replication to the remote site) including across unreliable, high latency WAN links (disclosure: Voldemort supports this -- https://github.com/voldemort/voldemort/wiki/Multi-datacenter... )


Looking at HN today - it's full of hate and negativity.

Kudos to the developers that rise above this, often working for nothing, to build the awesome tools that future generations will use to build awesome apps.


> it's full of hate and negativity

You can't spell NoSQL without the word no.

This is why I try never to use the word NoSQL. It's a flamebait word, deliberately engineered to add heat rather than light. There's no such thing as "a NoSQL database"; there are only databases. Even the relational databases that parse SQL have significant differences, and the databases that don't speak SQL are all over the map.


> and the databases that don't speak SQL are all over the map

And have been since the punchcard days. For many people, it's hard to imagine there have been databases of all types, feature sets and performance characteristics before the dawn of SQL.


> You can't spell NoSQL without the word no.

Nor can you spell it without the word os. But I don't think we're talking about a mouth or other external opening.

A lot of people now read "NoSQL" as "Not Only SQL", which seems more positive than negative.


Maybe I've just been living under a rock, but that's the first time I've ever seen "Not Only SQL".

I can't tell if you're being serious or not.


Not Only SQL has been around for at least two years: http://twitter.com/#!/simonw/status/5339626595

I'm not sure who came up with it or when.


That's creative; if I ever find myself backed into a corner and forced to pronounce NoSQL out loud I will borrow that reading.


In the enterprise community, they call it a Data Warehouse or Analytics solution. Which probably works out for the better.


But also kudos for developers who are honest and are calling out a broken design for what it is, without worrying too much about being called a 'hater'.


And your vacuous optimism is any better?


HN critics are best interpreted for what they truly are: not critics, but people who truly care about software. In doing so, they point out weak elements, not to make any coder look bad, but to improve the software.


Yep. This is the most like reddit it's been yet.


I'm calling troll.

Having a serious discussion about NoSQL databases begs the exact same question as having a serious discussion about cancer: what kind would you like to have a serious discussion about?

I think the most important lesson we can learn from NoSQL in general is that the idea of a one-size-fits-all database is becoming dated. NoSQL databases certainly don't solve the problems the author points out, and they probably never will. In fact that's the point. By not solving one set of problems, you allow yourself to solve another set of problems.

How about we use databases to solve the problems they were meant to solve, rather than basing our choices on whatever the popular opinion is at the moment.


"I think the most important lesson we can learn from NoSQL in general is that the idea of a one-size-fits-all database is becoming dated."

For programming languages, using the "right tool for the job" has little downside. Perhaps the developers need to learn an extra language, or perhaps there is some communication overhead between them. But unless the components are tightly-coupled, there's not much of a loss.

In contrast, the value of the whole data is greater than the sum of the parts. If you have a website selling products and an inventory management system and an automatic price-setting tool, it's hard to use a different DBMS for each one.

Even for data sets that seem unrelated at first, there may be a lot of value in the small connections between them. This is becoming increasingly apparent and companies are trying very hard to see these connections. Being in separate systems just makes that more difficult.

So, there are good reasons to use multiple database systems, but there is also a much higher cost. Saying "use the right tool for the job" doesn't give any guidance about when it's worth the cost and when it's not.


I think you're mixing concerns a bit. For data warehousing purposes, I agree that it's absolutely preferable to have all the data in one place (like Hadoop/HDFS).

For production OLTP stuff, I'd argue that it's a bad idea to do the kind of processing you're talking about in the database unless you can avoid it. Beyond the performance implications, you'll likely have to alter your schema in unnatural ways that you wouldn't otherwise.

Now, I absolutely agree that you need to do a cost/benefit analysis and that there are costs associated with having multiple databases. But I don't think those costs are as high as they would appear on first intuition.


I think you can run into problems in OLTP, as well. To stick with the example, you have three systems: sales from the website, price-setting tool, and inventory system.

Should the sale happen at all? Not if the inventory is depleted. Sure, you can put it on back-order, but then you have an unhappy customer.

At what price should the sale happen? It would be nice if you could automatically raise prices when the inventory drops below 10 units (which may indicate a demand spike or a supply interruption), for example. If you don't raise prices soon enough, you're more likely to run into a depleted inventory, again making the customer unhappy.

And what if you encounter an error moving data between systems? The customer thinks the sale happened, but it wasn't (or couldn't be) loaded into the inventory system for some reason. The customer will call a week later asking why it still isn't shipped, the service rep will be clueless trying to trace between the systems, and ultimately the customer will be unhappy.

(Just to be clear: properly integrated data management may still be done with multiple systems. But it's harder.)
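
A small sketch of what the single-system version of this example can look like: the stock check, the sale, and the price adjustment all happen in one transaction, so there is no window in which the customer has a sale that the inventory does not know about. SQLite via Python's standard `sqlite3` module is used purely for illustration; the table layout, the `sell_one` function, and the 20% markup are assumptions, and only the 10-unit threshold comes from the example above.

```python
import sqlite3

# isolation_level=None puts the connection in autocommit mode so that
# BEGIN / COMMIT / ROLLBACK are issued explicitly below.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.executescript("""
CREATE TABLE products (
    id INTEGER PRIMARY KEY,
    stock INTEGER NOT NULL CHECK (stock >= 0),
    price_cents INTEGER NOT NULL
);
CREATE TABLE sales (
    id INTEGER PRIMARY KEY,
    product_id INTEGER NOT NULL REFERENCES products(id),
    price_cents INTEGER NOT NULL
);
INSERT INTO products (id, stock, price_cents) VALUES (1, 11, 1000);
""")

LOW_STOCK = 10       # threshold from the example above
MARKUP_PERCENT = 20  # hypothetical pricing rule


def sell_one(product_id):
    """Sell one unit atomically; return the price charged, or None if unavailable."""
    conn.execute("BEGIN IMMEDIATE")
    try:
        row = conn.execute(
            "SELECT stock, price_cents FROM products WHERE id = ?", (product_id,)
        ).fetchone()
        if row is None or row[0] == 0:
            conn.execute("ROLLBACK")
            return None  # unknown product or depleted inventory: no sale
        stock, price = row
        conn.execute(
            "INSERT INTO sales (product_id, price_cents) VALUES (?, ?)",
            (product_id, price),
        )
        remaining = stock - 1
        # Raise the price once, at the moment stock crosses below the threshold.
        crossed = stock >= LOW_STOCK and remaining < LOW_STOCK
        new_price = price * (100 + MARKUP_PERCENT) // 100 if crossed else price
        conn.execute(
            "UPDATE products SET stock = ?, price_cents = ? WHERE id = ?",
            (remaining, new_price, product_id),
        )
        conn.execute("COMMIT")
        return price
    except Exception:
        conn.execute("ROLLBACK")
        raise


first = sell_one(1)   # stock 11 -> 10
second = sell_one(1)  # stock 10 -> 9, price raised for later sales
third = sell_one(1)   # stock 9 -> 8, charged the raised price
```

Splitting the three concerns across separate systems means replacing that single BEGIN/COMMIT with some form of cross-system coordination, which is where the extra cost comes from.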


Hi zeit_geist,

to me the problems you've described in your blog post are specifically application model problems. I think we shouldn't abstract the application model into the database, but the database into the application model.

I know a very innovative French developer who wrote an application server that comes integrated with the database. This way you just call the exported functions provided by your database directly. How you model your (re)caching/(re)indexing and other application needs is totally up to you. This is a) a freedom you barely find anywhere else, b) bare-metal development of an application, c) the most efficient way to develop an application (because you only implement what you need and don't use a generalized construct that serves a general purpose very well but doesn't scale with your application very well).

I would recommend implementing an application using the pattern that you know works best for the application; if you don't know it yet, then it's time to read books that broaden our horizon of available solutions until we can start developing again.

I will show you an example of what I mean.

http://gwan.ch/api#kv

This is how I think is the most elegant way to interact with a(n integrated) database.

I am curious about what you think about this. I know I've not referred to the points in your post, but I've read it carefully. Thanks for hearing me out. I'm sorry I didn't post to your blog, but I prefer to post without subscribing to an external party. You limit the users who can answer this way imho. I'm not sure if it helps you to keep out trolls/spammers, but it sure helps to keep response rate low.


I have taken a short look at your approach only. I still think KV-stores are Assembler-like constructs and as such I would apply my criticism to your approach equally -- please correct me if I (mis-)judge your project! But in general, I think your approach is a good one.

Regarding comments at my blog: I don't understand what you mean by "subscribing". According to the settings page, you do not have to register. You are free to comment there anonymously. That being said, your comment at HN is highly appreciated. Thank you for taking your time!


"The problem is: NoSQL is not a solution at all. It's a trade-off."

Bingo, that is precisely what NotOnlySQL is all about. For example you trade some consistency guarantees for the ability to scale out.

Uninformed (has the author heard about the CAP theorem?), either-or diatribes like this article don't really serve any purpose other than sowing discord.

This is just like "C++ is better than Java is better than..." type flame wars. :)

We use Oracle and we use HBase. We would never replace Oracle with HBase for all of our data needs. At the same time we have need for a store that scales beyond what even Oracle can provide (and yes, we use RAC with multi TB caches across a database instance).

For the same reason we use Java, C++, Scala, Perl, Clojure, Bash, JavaScript, etc... The right tool for the right job.

Personally what I would like to see is:

* secondary indexes

* snapshot isolation (in lieu of global transactions, which will never scale).

Disclaimer: HBase committer here.


I find one of his counterexamples amusing:

> Managing Highly-dimensional data and access to it: ...I'm thinking of e.g. geo/spatial data here. Where are the solutions out there?

http://www.mongodb.org/display/DOCS/Geospatial+Indexing


True. The author forgot to add "scalable" there -- but that is actually what is meant there.

I have been studying multi-dimensional indexing for more than three years now and have implemented many of the state-of-the-art indexes. None of them is sufficient, and MongoDB's implementation in particular falls short on scalability.


The original document to which you refer (that I refuse to refer to as the "MongoGate" document :-) is about system reliability. Those types of problems can exist in any database system and are not specific to every NoSQL database system. The document claims that MongoDB doesn't perform well under very high loads in a replicated environment.

Yes, NoSQL doesn't fit the problem you're trying to solve. Perhaps there is a set of problems that are difficult to solve with NoSQL, but there exist sets of problems for which NoSQL databases are perfectly suited. So, I would modify your post to state that NoSQL isn't the solution to every problem, but don't think you're uncovering some big secret, because most people already know that.


I took "MongoGate" (sorry, I just fell in love with that term) as an example of what happens when overly positive expectations hit the hard ground. Removing the hype from NoSQL ("it's innovative", "it scales", "the cool guys use it", yada yada) is what I like to do.

I surely do not uncover any secret there. But I haven't stumbled upon a "what NoSQL lacks" blog post recently either. You can read my post in many ways, but the latter one is actually one possible way imho.


NoSQL has nothing to do with the replication or persistence choices made by the mongo developers. NoFlatFile is about as descriptive for these discussions.


I just recently published an article describing my experience migrating from SQL Server to MongoDB, you can read it here: http://www.wireclub.com/development/TqnkQwQ8CxUYTVT90/read

I tried my best to describe both what we gained and what we lost after the transition. At the end of the day, MongoDB (and other NoSQL solutions) are different tools for different jobs. Obviously it takes investment to master a new tool and we almost aborted the migration on two different occasions simply because we didn't know enough about maximizing MongoDB performance. Now that the dust has settled and with all things considered, I am glad we didn't.


Disclosure: I wrote a product called Citrusleaf, which also plays in the NoSQL space.

I also want a better discussion of NoSQL. It isn't fair to hate on databases without understanding the pressures of operations. I saw a friend's company where a big, fancy Oracle system lost all of its data on their main test/dev system at a crucial moment - lost over 100,000 user accounts, including those of executives of key customers. They were forced to merge with a competitor about 4 months later.

You need to take database backups, you need to stage your systems. You need to have extra hardware on hand.

Some of our customers at Citrusleaf continue to "run with scissors". I like the attitude, but we've had to talk sternly with them about the benefits of staging, bucket testing new releases (app and db), and penciling out the realistic hardware requirements.

The new crop of distributed databases provide an immense opportunity for all of us. We can write more agile applications than ever before, and as a community we all need to understand the benefits of flexibility. This includes your entire organization.

That being said, there are technology differences between the NoSQL solutions, and at Citrusleaf we've focused on operations and deployability. My co-founder ran Yahoo Mobile's engineering and ops group, so understands the tradeoffs. We have a group in India (hi guys!) of great developers (not support guys) simply to make sure that when you've got a problem at 3am there's someone to take care of you.

Performance is important in this agile world, and Citrusleaf has it. http://bit.ly/rRlq9V

A slide I showed at HPTS (the high performance transaction systems conference) showed a Zynga game on one side and an EA Facebook game on the other. Zynga is an amazing machine in terms of getting huge, rich applications to market. Every pixel is covered with things to do, artwork, everything. And they're rolling out new games every week, and I haven't ever seen downtime (unlike Netflix Streaming, which has maintenance on a regular basis).

Zynga has been a huge proponent of NoSQL (but not Mongo) since its inception, and although I don't know what EA does internally (maybe they use the same tech but have other agility issues), NoSQL is clearly part of a high scale, rich application need.

Join or be flattened.


Is there some reason why your benchmarks look like you're cheating?

"The Citrusleaf server node received input from 4 client nodes, the MongoDB server node received input from 1 client node running 2 client processes, and the Redis server node received input from 2 client nodes."

I mean -- if you're cheating, that's bad. If you're not cheating, why the hell do you set things up to look like you're cheating?



