Hacker News new | past | comments | ask | show | jobs | submit login
SQL vs. NoSQL (linuxjournal.com)
78 points by dkthehuman on Sept 3, 2012 | hide | past | web | favorite | 44 comments

I am presently researching these so-called NoSQL databases and one thing that I keep wondering is why can't all of these database still support a limited form of SQL-like language?

Almost any json document could be represented as a db table as far as I can see. Why can't I query using a common language instead of learning each of these NoSQL database's own way of doing simple stuff like returning documents of User object that have a gender field value as 'male'. In SQL, it would be something like "select * from users where gender='male". Why can't NoSQL databases support a query similar to that? Why do they require me to describe a similar request in their own unique syntax?

I sometimes get the feeling that coining a term like NoSQL is a marketing gimmick which hurts people actually trying to learn the nuanced differences but I am only getting started. Why can't we extend SQL to support "NoSQL"-specific cases instead of replacing it with nothing.

I get that a big part of SQL is joins and the philosophy of joins goes against the idea of NoSQL. The solution to that is to still accept SQL but just throw an exception when join is used with a link to educate the person on alternative implementations.

I believe some do offer SQL-like query languages.

But if all you support is lookup by key, or even just lookup by field, then SQL really isn't that useful in my opinion. And if your lookups are based on map-reduces that have to be pre-specified then I can't see any place for an SQL-like language at all.

SQL is ok for specify queries to relational databases, but I don't think it generalises to any type of store apart from maybe the very basics. And at that point what benefit are you getting appart from a slight, and misleading, sense of familiarity?

Yeah Cassandra has CQL and it works pretty well, it's comfortable to use where the cassandra-cli is quite painful for anything other than the very beginning of a project.

I believe in earlier releases CQL was slower but now it seems fine. Also the most popular Cassandra client, hector, supports CQL, adopting CQL should be possible for a lot of Cassandra use cases.

Disclaimer: I'm not a DB person, so take everything I say with a grain of salt.

I only used a few SQL and NoSQL products and have done it from statically compiled languages. The immediate benefit I perceived when using language constructs instead of SQL was type safety and compile time validation of the query. I hate the chained function calls, I find SQL easier to read, but I would not trade the type safety for a more pleasant syntax.

Joins are in SQL because SQL was developed to query relational databases, and the join is the fundamental assumption of the relational database.

I always find it sad that the new data storage engines define themselves under a banner that says "We don't care that everybody who learned how to query a database since 1988 needs to learn a new way because our way is better." Which wouldn't be so bad if it was uniform across engines, but it isn't. It's a new API for each engine.

That's a lot of yaks to shave when you already know SQL.

> In SQL, it would be something like "select * from users where gender='male". Why can't NoSQL databases support a query similar to that?

They can and they do, e.g. Cassandra and CQL, which syntactically is the SQL you requested.

I don't really get why HBase isn't bigger than it is. I mean, it's BigTable, right? And Facebook dropped Cassandra for it, IIRC.

Not trying to start a war here, I am not an interested party, I am just curious.

Only a fraction of companies can ever really outgrow the capabilities of a fully tuned RDMS. And only a fraction of those will choose HBase over its competitors. And so only people who have worked for such companies have any real experience with the product... and only a fraction of those are vocal about it on the internet.

I agree with what you say, but I think we also need to consider cost as well as capability of the underlying product.

Introducing cost massively changes how that equation stacks up.

"How many tps can I run given a budget in the form of these 6 available servers". Well the boxes are under spec for what you want to achieve with an RDBMS, but yeah no problem with XXX nosql product.

It's like the Drake Equation for word-of-mouth.

Mostly because it doesn't seem to have a very vocal community behind it.

It is a really solid product, and some folks have built their entire business on it.

It's also used a lot in enterprise setups, but again, those folks don't typically go around talking about it.

Since it isn't a hot new trend, it stays off the radar of a lot of "startup" folks.

Also, the overhead of Hbase for a small install is pretty high, whereas a single-node Cassandra cluster is extremely easy to set up.

There's no such thing as a single-node Hbase cluster, since it involves, at the very least, setting up Zookeeper.

This keeps Hbase out of the hands of tire kickers.

>> Facebook dropped Cassandra for it

I guess there's a bit more to it than that. I understand the fb employees that took over were Hadoop guys so it made complete sense for fb to go in that direction.

The bit I don't get is why is hbase still a bit of a turtle. For example a recent paper linked to from here had random read and write workload running at 180k ops / sec on a 6 node Cassandra cluster but only 20k ops / sec for the same workload on hbase.

I think part of the issue might be that HBase is designed for lots of data. The architecture overview says that if you only have a few thousand or million rows, then a relational database may be a better choice:


Even if you plan on scaling to billions of rows of data, it may be easier to start with something else that works well for smaller data sets.

cassandra is a really solid product. it's actively developed and very stable.

from what i understand, facebook move to hbase for their messaging platform- which makes sense. cassandra's consistency model is weaker than hbase, and an eventually consistent model doesn't make sense for a real-time messaging platform.

cassandra's consistency model is weaker than hbase

This is simply not true. What is true is that Cassandra supports weaker consistency models than hbase in addition to the strong consistency model that hbase supports (and in fact, requires -- you can't turn it off).

Hbase has it's place, but "consistency" isn't a good reason to pick Hbase over Cassandra. Rather, particular workloads can (currently) be done faster in Hbase than Cassandra, due to locality assumptions with how writes are done in Hbase, and the nature of those writes.

Actually, eventual consistency for their messaging product makes a ton of sense, because any changes to states are a "push", so any "eventual consistency" is just messaging latency/propagation delay.

That aside, you don't appear to be familiar with how messaging got built.

It was built by a large team of Hadoop people, HBase was simply what they knew, there was no conscious decision to snub Cassandra beyond the fact that it wouldn't have leveraged their extensive HDFS experience.

Furthermore, it's trivial to tune the consistency levels in Cassandra to your needs, so there's no real reason to not use Cassandra just because of a "lack of consistency". Just use ALL or QUORUM, jesus.

Can I ask what people here think of Cassandra and HBase in terms of stability/quality? I seem to recall hearing bad things about Cassandra, maybe from reddit's use of it, but I'm not sure, or it might have changed.

I got bitten by MongoDB and still don't trust it, but I've heard many great things about riak. What does everyone think about the other two?

You should know basic cap theorem before reading this. If you don't, go read a 1 paragraph summary.

Fyi, this has nothing to do with you. I'm using you to soapbox about some data/scaling misconceptions. This subject happens to be one of my foremost interests.

Riak (AP, HTTP interface) is just plain a pain to use. Don't use it unless you have very specific use-cases in mind for the BitCask backend or you know what you're doing (aka, don't have to ask open-ended questions like this, no offense). Riak is amazing at a constrained set of use-cases, and pretty awful at most other things. The vector clocks, conflict resolution, and awful AWFUL API and documentation are a goddamn atrocity. Might be cool if you really need the magical replication/clustering, but realistically Cassandra and ElasticSearch offer the same wicked-cool scaling. No multi-master replication in the community/free edition. THIS IS A MASSIVE PAIN FOR LARGE DEPLOYMENTS. Also, ripping data out of the fucker is a pain.

Hypothetically Riak allows intelligent conflict resolution. In practice, this is like getting your wounds reopened and salted with ritualistic regularity.

MongoDB is just sorta..."okay" at a variety of things and especially things that can be done with sharding. Replication in MongoDB is a joke, as is the underlying infrastructure of i t. I use it as a stand-in for what most other people use an RDBMS for. I'm generally relatively disciplined, so I haven't paid the dire DIRE costs some people pay for being unhygienic with document stores. I've seen people totally trash their data in the absence of schematic enforcement. I wouldn't recommend MongoDB except to startups that I trust to know what they're doing.

MongoDB is especially handy for discrete/isolated user data and environ as its designed to shard. I'm not really comfortable describing MongoDB as being designed for denormalization because that's not really true. My real metric for denormalized data is Hadoop/HBase/Cassandra, and MongoDB totally shits the bed after the documents get past 16mb IIRC. The limit used to be 4mb.

Oh and by the way, don't take MongoDB or Riak's "map-reduce" support seriously at all. Just don't even bother. Pretend they don't exist.

HBase and Cassandra are both more solid than MongoDB and easier to use than Riak, they're more specialized than MongoDB though.

A few things to keep in mind:

Cassandra, when it first got open sourced, was frankly awful. It's actually improved a lot, to the point where it's no longer the intense pain point for Reddit and Formspring that it once was. If you need SRSFACE replication, truth-propagation, and tuneable consistency, Cassandra is your girl.

Cassandra is nominally AP, but allows tuneability to full-blown CP by all rights with ALL (it can otherwise use QUORUM, ANY, etc.). Cassie is conceptually simpler than Riak due to using timestamps rather than vector clocks to track state transitions. Hardcore database theorists will complain about this loudly. I remain undecided.

HBase is a bit simpler, but it's built on HDFS. This is, depending on your point of view, either a great thing or an awful thing. HBase is strictly a CP wide-column store. You can pretend it's Google BigTable, but that would be a dire mistake. HBase is equivalent to BigTable like Bangladeshi slums are equivalent to the Taj Mahal. Google's stuff is way...way better and, IMHO, contributes to the design being a lot more practical. It's my opinion that modeling wide-column stores and map-reduce frameworks on top of a distributed filesystem only works if that DFS is extremely top-notch.

HDFS is extremely not top notch. I'm still waiting for someone to leak the source to GFS or Colossus. When that day comes, I will probably cry tears of joy until I die of dehydration.

If you're using HBase or Cassandra, you're using a wide-column store. Cassandra is the more flexible of the two, HBase is more well-understood. Use Cassandra if you need AP/CP tunability, otherwise use HBase. Hadoop/HBase people are easier to find anyway. I personally prefer Cassandra.

Cassandra replication is more auto-magic, HBase is less auto-magic being built on HDFS. Cassandra is thrift-only, HBase is everything you normally get with the Hadoop ecosystem. REST, Java, Thrift, etc.

Cassandra is P2P, HBase is master/slave. HBase means finagling with SPOF Zookeeper nodes and all that other contemptible HDFS bullshit. Cassandra scales better. When Facebook built messaging on HBase, they smacked right into the usual HDFS "feature" that hits everybody with a large deployment.

Basically, they had to sub-cluster and shard the fuck out of it. That's a lot of work. Cassandra hasn't yet necessitated this. This is typical for non-trivial Hadoop/HDFS deployments. It's also a massive pain.

Have to wonder how Google is faring with Colossus in comparison. Hadoop is just so goddamn awful.

Example problem that works well in a wide-column store: storing and updating the 1,515,106 followers a single twitter user has.

You can use HBase and Cassandra as general-purpose data stores, but that's not really a good idea.

Realistically by the time you need the kind of scale either can offer, you've broken down your data/ops needs into discrete problems to be solved.

It looks like this:

"We need a work queue, job dispatch, and distributed filesystem for the OLAP...a wide-column store for tracking followers...a SQL database for payment information...a high-throughput cache for denormalized projections of backend data for the frontend...a sharded index for searchable data"

Not like this:

"Well. We used (MongoDB|Riak|Cassandra|HBase|Neo4j|PostgresQL) for our data and it sprinkled scaling fairy dust on our foreheads like good little catholics on ash wednesday and now our scaling problems are solved."

I'm not taking questions unless you're in the bay area and offering beer. Read a white paper if you can't send beer wenches to my door.

Down voted by mistake. Happens all too often when voting on the phone. I usually don't care, but your comment was exceptionally good and didn't deserve down vote.I wish someone finally makes wider gap between arrows.

A NoSQL discussion isn't complete without a link to the inner platform effect


Basically if your application domain is inherently relational or inherently "SQL" no "NoSQL", then building your own RDBMS in your application absolutely dooms you, unless you're not a "real" application writer but actually a RDBMS author.

I have having a conversation with a guy "I wish there was a library for (whatever nosql DB he was complaining about) I could link in to do transactions and indexing for me" my reply "yeah, its called postgresql". That's not trendy and buzzword compliant, he ended up annoyed with me. There's a "pragmatic programming" book "seven databases in seven days" which is a pretty good book and it describes polyglot database design a little toward the end... so you "need" a key-value store and indexed transactions and there's nothing that does both perfectly... Well, there's plenty of good free open source DBs, so install and use two DBs... its really not that hard.

Found it: ( I hope its as good as the 7 languages one ):


Interesting that one of the original "noSQL" databases that is still going strong today, Pick and its variants, was not mentioned in the article at all.

Universe and JBase are both well supported on Linux and there are several multi-billion dollar (revenue) companies running their core business on it - I work for one of them.

Both UniVerse and UniData are equally well supported on Windows, Unix and Linux. They have some interesting features that are still reasonably unique in the database world. E.g. Built-in transparent record encryption, including both key & index. http://2012.nosql-matters.org/cgn/wp-content/uploads/2012/06... Full disclosure: I'm the product manager for the databases.

You're probably familiar with the Eclipse distribution management package then (not to be confused with the IDE), that's the app I was referring to. I've been using Universe since the Vmark days, ADDS Mentor and Sequoia before that. Pick type databases are sadly under-rated. Possibly because there's never really been any kind of marketing aimed at indy developers and/or students?

stopped reading the articled referred to amazon s3 a database. instant credibility killer. no thanks.

(p.s the amazon product you're looking for is dynamo)

I think you're the one who is mistaken. Amazon has multiple database products. S3 is one of them. In fact, some of the folks who developed S3 also developed BigTable for Google.

Dynamo is a new database product they have that is also a key value store.

Dynamo is not a new database product. Sure, their SaaS offering is new, but Dynamo has been used internally at Amazon since atleast 2007, when they published a whitepaper on it. Dynamo was inspired by BigTable.

Sure, when you break it down S3 is just a big key-value store, but Dynamo is a much closer comparison to BigTable.


Yes, I've read the Dynamo paper too and I know all about it's history at Amazon. I give talks on this subject all the time.

My point was that you're being short sighted by taking away the credibility of the author for calling S3 a database, which many people do, and especially did in 2010.

Hey, Jeremy, you used to use Cassandra at reddit, no? I seem to remember it caused you many problems, is that the case? Have you used it/would you recommend it nowadays?

Maybe I should start a StackOverflow thread, or make a mini-site where people can opine about various technology products. I find myself looking for opinions every time I'm thinking of using a new piece of software for my business.

S3 has been around for a much longer time than Dynamo has been. Dynamo is a very recent introduction by Amazon. In fact, Dynamo did not even exist when this article was published. (Dynamo came out early '12, the article is from '10).

And when you think about it, S3 is just a massive distributed key-value store with simple key querying and an HTTP api.

Amazon's Dyanmo Whitepaper is from 2007, they have used it internally for a long time- it was inspired by BigTable.. not S3.

Source: http://www.allthingsdistributed.com/files/amazon-dynamo-sosp...

Dynamo cites BigTable as solving similar problems but it's not inspired by BigTable at all, architecturally.

What prevents S3 from being used as a highly scalable NoSQL key-value store?

I have read a few posts over the years of people using s3 for exactly that. I don't have specific links anymore, but a Google search turned up an example: http://petewarden.typepad.com/searchbrowser/2010/10/how-i-en...

latency (and costs)

Neither of those things prevents something from being a database.

S3 is an excellent key/value store for large values. It's also publicly available, which is nice.

For example, all the thumbnail images on reddit are stored in S3. Essentially the client is given the key and then they can go look up the value themselves, and since it is publicly available http, it works right there in the browser.

Your example (reddit using S3) is not a database use, it is a content delivery network use. For that, it is good enough, but I'd choose a different key-value store for database, and mostly because of the latency (been there, tried that).

Alright, I can give you a better example. Netflix uses S3 to store movies that haven't been rendered yet. It's definitely a database. The server says "I need to render this movie to the iPhone format, what are the movie bits" and the "database" (S3) returns the entire movie.

Also, I would argue that static content delivery is just another form of database. It's just a massive key/value store, there the keys are the files and the values are the contents of the keys.

Let me ask you this: What is your definition of a database?

Yeah, right, the definition of the database is fuzzy. As every technology, S3 has its pros and cons, and the latency is a con this time. Take a latency-intensive application (1), and it will feel miserably slow with S3. Take a storage-intensive application with not-so-frequent access requirement (2), and it will be just OK with it.

(1) any page-load that requires 3-5 sequential DB access (based on the results that are returned in the previous steps).

(2) Netflix's "let's store somewhere until we really want to use it" qualifies.

Interesting. Would you say a file system is a database?

I don't think I could come up with a coherent definition of a database which includes low-durability NoSQL key/value stores yet excludes file systems. This doesn't necessarily mean that file systems are a database, however; it could be that in order for the word "database" to be meaningful it must be defined in a way that excludes some things commonly thought of as databases, similar to why Pluto was deplaneted.

Generally, yes.

S3 is also used as intermediary data storage when running Elastic MapReduce queries over data in your DynamoDB tables.


Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact