How You Should Go About Learning NoSQL (openmymind.net)
81 points by latch on Aug 15, 2011 | 48 comments

Step 1 in learning NoSQL: learn SQL very well. There is a good chance it already solves your problems AND ensures your data makes sense.

Not just that, but getting a deep understanding of how the RDBMS (or NoSQL solutions, for that matter) implements the database is just as important. Making truly informed, non-cargo-cult decisions about database technologies requires a low-level understanding of how the implementations write things to disk, perform indexing, structure data in memory, and behave when distributed. Otherwise you're just shooting in the dark.

Step 2: convince yourself that you should only use NOSQL if you absolutely have to. (Corollary: if you ask for NOSQL by name, you most probably don’t need it.)

As a bonus, SQL cares about data reliability.

Not that all of "NoSQL" doesn't, but two major "NoSQL" databases, Redis and MongoDB, both seem to care more about performance than durability.

MySQL cares about data reliability? Maybe, if you spend a lot of time agonizing over the configuration and set it to strict mode.

I generally don't use MySQL, but IIRC it confirms that nothing went wrong saving the data (whereas with Mongo there is no confirmation step; you just have to hope it was successful), and the data is written to disk. Additionally, I've heard of zero "Oh no, MySQL just lost 50% of my production database!" incidents that weren't the user's fault, and I've heard enough of those about Mongo to stay away.

Respectfully, this isn't true. Yes, fire-and-forget is the default behavior, but there is a confirmation step that you can check. It may be implemented a bit differently from driver to driver, but it is generally called a "safe insert". Numerous people use this to confirm their writes, on single servers as well as across multi-node master-slave setups and replica sets.

Respectfully, this isn't true. Safe inserts are safer, but not safe. There could still be a problem writing the data to disk, and (My|Postgre)SQL just doesn't have this problem.

Assuming the changes that were introduced in MongoDB 1.8 for single-server durability, writes are being put to disk with journaling. So, the data is written to disk.

That doesn't actually guarantee the write was written to disk unless an fsync was issued. Of course that comes with a significant effect on MongoDB's famously marketed write performance.
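For context, the gap being discussed is the one between a write that lands in the OS page cache and an fsync that forces it onto the device. A minimal Python sketch of the two behaviors (file names are made up for illustration):

```python
import os
import tempfile

def durable_write(path, data):
    """Write data and only return once the OS has forced it to disk."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()             # drain Python's userspace buffer
        os.fsync(f.fileno())  # force the OS page cache onto the device

def fast_write(path, data):
    """Write without fsync: faster, but a crash inside the OS's flush
    window can lose data the caller believes was saved."""
    with open(path, "wb") as f:
        f.write(data)

path = os.path.join(tempfile.mkdtemp(), "record.bin")
durable_write(path, b"important record")
```

The performance cost the comment alludes to is exactly that `durable_write` pays a device flush on every call, while `fast_write` lets the OS batch them.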

Yes, but the redundancy you have to use with MongoDB gives you a very low probability of that type of failure.

Then in MongoDB's case durability is a function of scale, which leads back to the parent's suggestion that it is a technology optimized for performance.

Personally I think this is a bad foundation for data that is important. There are probably a lot of use cases where data not being on disk for n seconds (or one minute in the case of MongoDB) is ok.

Even when that is the case, I still think that is the most important question to be addressed when choosing MongoDB as a data store.

The flexible query API, schema-less document format, secondary indices... those are siren songs of rapid development.

I really don't understand the "versus" mentality, the idea that everything must support the same feature set or else it's "bad".

You shouldn't choose SQL over NoSQL; you choose both (or neither, or other stuff, as your problem requires) and use them in the appropriate places in your infrastructure. Sometimes you want performance over durability.

Would you mind elaborating on why MongoDB cares about speed over reliability? Sure, you can go that route (and sometimes that is what developers want to do), but MongoDB, as a database, reliably stores and persists your data.

Just curious what your experiences were that made you think MongoDB isn't a reliable data store.

It's pretty easy to make Redis basically ACIDic (use MULTI/EXEC and WATCH with AOF on and fsync on every command) if you really need to.

In practice you probably don't. Financial transactions should take another path, but for almost everything else, you can probably afford the few seconds of data loss you may encounter.
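For reference, the ingredients listed above map onto a couple of redis.conf settings plus a WATCH/MULTI/EXEC sequence. A sketch of both follows (key names are made up, and whether this is "ACID enough" still depends on your failure model):

```
# redis.conf: log every write to the append-only file and fsync
# after every command (the default, everysec, can lose up to a
# second of writes on a crash)
appendonly yes
appendfsync always

# redis-cli session: optimistic check-and-set transaction
WATCH balance        # abort the EXEC if balance changes underneath us
MULTI                # start queuing commands
DECRBY balance 10
INCRBY savings 10
EXEC                 # runs all-or-nothing; returns nil if WATCH tripped
```

MULTI/EXEC gives you atomicity and isolation, WATCH gives you check-and-set, and the AOF settings give you durability, which is the combination the comment is describing.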

My approach was to read the Dynamo paper. It presents the goals of Dynamo and the technology from which it is constructed in a very compact form. From there you have a basis for understanding the engineering tradeoffs made by the other NoSQL databases. Then you can actually make decisions based on what fits your use case, rather than just choosing Mongo because it's popular. (Don't get me wrong, Mongo is great in its niche.)

Reading the Google BigTable paper would also be a good idea, as it represents another major strand of work.

My blog post on Dynamo: http://untyped.com/untyping/2011/01/21/all-about-amazons-dyn...
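As a taste of what the paper covers: Dynamo partitions data with consistent hashing, where keys and nodes hash onto the same ring and each key belongs to the first node clockwise from it. A toy sketch of the idea (not Dynamo's actual code; virtual nodes and N-way replication are omitted for brevity):

```python
import bisect
import hashlib

class HashRing:
    """Toy consistent-hash ring in the style of the Dynamo paper."""

    def __init__(self, nodes):
        # Place each node on the ring at its hash position.
        self._ring = sorted((self._hash(n), n) for n in nodes)

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, key):
        # A key is owned by the first node clockwise from its position,
        # wrapping around at the top of the ring.
        hashes = [hv for hv, _ in self._ring]
        i = bisect.bisect_right(hashes, self._hash(key)) % len(self._ring)
        return self._ring[i][1]

ring = HashRing(["node-a", "node-b", "node-c"])
owner = ring.node_for("user:42")
```

The payoff, which the paper spells out, is that adding or removing a node only remaps the keys adjacent to it on the ring rather than reshuffling everything.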

From the article:

A lot of NoSQL solutions are about solving specific problems. MongoDB is a general solution which can (and probably should) be used in 90% of the cases that you currently use an RDBMS.

As I don't use any NoSQL solutions today, can someone list out a few of these cases where I am using an RDBMS and I should be using MongoDB?

I'm wondering about this too. For almost everything I work on, the data being in the correct format, written to disk, and stored reliably is more important than the speed or other advantages of Mongo.

I have a hard time understanding responses like this and, in the end, they make me think that people a) really don't understand some of these new data technologies and b) are just as much on the anti-bandwagon as the people they criticize are on the pro-bandwagon.

For people using MongoDB daily, their data is in the correct format, it is persisted to disk, it is stored reliably and it is fast. On top of that, it gives them flexibility as their web app changes and new features and data structures are added, it is easily viewed and manipulated in JSON-format, and if and when the day comes that they need serious scaling, it helps there as well. Also, in the world of EC2, it is straight-forward to set up replica sets to offer redundancy.

All these conversations sound eerily familiar to the Java guys bashing Ruby and Ruby on Rails about five years ago. If you want to stick with what you have, then no worries. I just think people should be excited that over the last couple of years, the "golden hammer" approach to storing data has finally been overtaken and developers have a choice about what technology best solves their data problem.

Actually, I've used a few "NoSQL" databases in tons of projects and come to the conclusion that traditional relational database systems are more appropriate for many, many tasks. I really like MongoDB and will continue to use it, but not for serious projects, as I just don't see the tradeoff being worth it.

Thanks for the ad hominem attack though.

I have first-hand experience with MongoDB silently eating my data (yes, I was using the 64-bit version), which I only realized a week later, when I noticed half the dataset missing.

The reputation (specifically of Mongo) is probably not undeserved.

I am not knowledgeable of your situation, but I am curious about your setup, what you intended when you were storing the data, and the other factors in MongoDB not storing it. Also, I am curious how long ago this was...what version of MongoDB you were using, and whether you were checking getLastError after the write (assuming you wanted that behavior).

I am not denying your experience, just curious about the whole picture.

I really don't see how you couldn't replace MongoDB in that post with any other database and still be presented with the same issues:

- adding indexes slows writes down and increases memory requirements

- running queries with poor or no indexes will cause you to have I/O constraints and an unhappy CPU

Also, it seems like your data set was better suited for a graph DB like neo4j than a document DB or RDBMS.

As for the data corruption, I didn't see you mention whether you were checking for a response on saves or not.

I think the problems you experienced were due much more to the fact that you apparently had more data than the machine could handle, not necessarily the database engine used.

MongoDB, the "blame the user" database

See the subsequent writeup on SQLite and how it handled the same dataset magnificently. About the checking of responses: if you read the article you will see that previously stored data just went away; it wasn't a failure to store outright.

Given that you didn't know to avoid 1.3.* in the first place, I find it hard to take your continued propagation of your posts on MongoDB seriously.

If it's common knowledge not to use 1.3.*, why isn't that anywhere on the first page of google when searching for mongodb or mongodb 1.3?

Just curious what the magic incantation is for this information.

Yep, that pretty much invalidates the fact that the stable version of Mongo silently corrupted data.

If only we could make database software that didn't hold grudges!

When you want to do many small real-time operations against your data. The model fits well with most web sites (the unit of work for a given request tends to be quite small). Logging also comes to mind, as does some basic geospatial work. I think for most cases it does what an RDBMS can do but with less friction (you'll still use an ORM/ActiveRecord, but it'll be much more lightweight).

Why don't you check out mongly.com/tutorial/index and get a quick feel for it?

That makes sense, but it is certainly not 90% of what I use an RDBMS for as the author stated. Especially if most of the things I do are not real time web. I will take a look at the link above, as I am curious about it. But I am still struggling for use cases.

By the way, thank you for your comment. I up-voted it.

Because it's "web scale".

There are several different NoSQLs.

For example, MongoDB is currently a good choice if you're looking for indices and ad-hoc queries and want to get up and running quickly on, e.g., a web project. Interestingly, with indices and ad-hoc queries MongoDB becomes a lot like MySQL, only you're writing weird And() and Or() functions instead of "SELECT ... WHERE ... AND ... OR". Also, indices are tied to the documents themselves, so it's not clear how that scales.

Shameless plug:

Or you could go more bleeding-edge and check out ScalienDB, which is a straight key-value store built on Paxos and sharding. Nevertheless, getting started is easy:

I wrote ScalienDB. The plan going forward is to add a data model and distributed transactions and in general become a low level data layer substrate similar to how Google uses its databases.

SQL is (an expression of) relational algebra: http://en.wikipedia.org/wiki/Relational_algebra. NoSQL is your (poorly named) ability to cache stuff (flat datasets) in assorted places.

It’s obvious the author means well — and thanks to him for plugging the NOSQL Tapes! — but there are some seriously leaky assumptions at work here about a) how Mongo, Redis (& RDBMSes) work and b) the extent to which this poorly-digested (or regurgitated) understanding applies to other NOSQL solutions.

Too many shortcuts, inaccuracies, half-truths and implied errors to list here. The core idea of learning about NOSQL through first-hand experience (& dissection) of each project is sound, though.

In some RDBMSs, primary and secondary indexes are very different. SQL Server clustered tables store the tuples themselves in the leaves of the primary index's B-tree, so traversing the data in primary-index order incurs no extra disk seeks. Also, you can add extra columns to secondary indexes to answer queries without touching the primary index at all.
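That "extra columns in a secondary index" trick (INCLUDE columns, in SQL Server terms) can be seen in miniature even in SQLite, whose planner reports when an index alone satisfies a query. Table and column names below are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (player TEXT, game TEXT, points INT)")
# Composite index: 'points' rides along in the index entries, so a
# lookup by (player, game) never has to touch the table itself.
conn.execute("CREATE INDEX ix_scores ON scores (player, game, points)")

plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT points FROM scores WHERE player = ? AND game = ?",
    ("alice", "chess"),
).fetchall()
# SQLite's plan output flags this as a covering-index search.
covering = any("COVERING INDEX" in row[-1] for row in plan)
```

The same shape is what the comment describes for SQL Server: the secondary index answers the query by itself, skipping the clustered primary index entirely.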

The emulation of secondary indexes is inefficient. In an RDBMS without built-in index maintenance you would create a table (LeaderboardId, ScoreId), not (LeaderboardId, ScoreIds). There's no need for comma-separated fields.
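The normalized shape being described, one row per pair instead of a comma-separated column, sketched in SQLite (schema names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One row per (leaderboard, score) pair; the composite primary key
# doubles as the index that keeps "all scores for a leaderboard"
# lookups cheap, with no string-splitting required.
conn.execute("""
    CREATE TABLE leaderboard_scores (
        LeaderboardId INTEGER NOT NULL,
        ScoreId       INTEGER NOT NULL,
        PRIMARY KEY (LeaderboardId, ScoreId)
    )
""")
conn.executemany(
    "INSERT INTO leaderboard_scores VALUES (?, ?)",
    [(1, 101), (1, 102), (2, 101)],
)
scores = [row[0] for row in conn.execute(
    "SELECT ScoreId FROM leaderboard_scores "
    "WHERE LeaderboardId = 1 ORDER BY ScoreId"
)]
```

Adding or removing a score is a single-row INSERT or DELETE, whereas a comma-separated ScoreIds field would force a read-modify-write of the whole list.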

You don't hear as much about it, but RavenDB is an excellent No-SQL solution in MS land. It was written from the ground up with .NET instead of being ported. It's become my first choice ahead of MS-SQL Server for .NET projects.


Very light reading, but all the same an excellent overview of MongoDB, Redis and Cassandra. Well done.

How can one learn something that is defined by what it is not?

This thought is not mine, but I cannot recall the source: NoSQL is a bad name; it should be NotSQL. As a result it is a very large umbrella. When one sees NoSQL it is a safe assumption to think MongoDB. But you could also think DB4O (which I like much more, in an abstract way). So you can go about learning any of these technologies, since each is an instance of NotSQL. To "learn NoSQL", then, is really to learn a philosophy rather than a technology.

I remember reading something by Erik Meijer where he states he also thinks it is an unfortunate name. I think he suggested coSQL, and gave an interesting perspective on the relationship between SQL and noSQL.

This was the article: http://cacm.acm.org/magazines/2011/4/106584-a-co-relational-...

The plural of index is indices, not indexes...

"Yesterday I tweeted three simple rules to learning NoSQL. Today I'd like to expand on that. The rules are:

    1: Use Mon" CTRL-W

I originally skipped the article, and thought you were kidding... Alas...

What path would you suggest people take in learning something that challenges decades of best practices?

1st, be damned sure that decades old best practices actually aren't the best tool for your current application, developer hype be damned.

Hard to know whether they aren't the best tool for the job without at least familiarizing yourself with some alternatives.

I've had a great experience with CouchDB as a NoSQL solution, simply because it is NOT like your traditional MySQL/Postgres. I'm not trying to map my existing knowledge to something that is fundamentally different. I think MongoDB might be a mistake in this regard, because users will attempt to deploy it the same way they would a SQL solution.
