Almost any JSON document could be represented as a DB table as far as I can see. Why can't I query using a common language instead of learning each NoSQL database's own way of doing simple stuff like returning User documents whose gender field is 'male'? In SQL, it would be something like "select * from users where gender='male'". Why can't NoSQL databases support a query similar to that? Why do they require me to describe a similar request in their own unique syntax?
I sometimes get the feeling that coining a term like NoSQL is a marketing gimmick that hurts people actually trying to learn the nuanced differences, but I am only getting started. Why can't we extend SQL to support "NoSQL"-specific cases instead of replacing it with nothing?
I get that a big part of SQL is joins, and the philosophy of joins goes against the idea of NoSQL. The solution to that is to still accept SQL but throw an exception when a join is used, with a link to educate the person on alternative implementations.
But if all you support is lookup by key, or even just lookup by field, then SQL really isn't that useful in my opinion. And if your lookups are based on map-reduces that have to be pre-specified then I can't see any place for an SQL-like language at all.
SQL is OK for specifying queries to relational databases, but I don't think it generalises to any other type of store beyond maybe the very basics. And at that point, what benefit are you getting apart from a slight, and misleading, sense of familiarity?
I believe in earlier releases CQL was slower, but now it seems fine. Also, the most popular Cassandra client, Hector, supports CQL, so adopting CQL should be possible for a lot of Cassandra use cases.
I have only used a few SQL and NoSQL products, and have done it from statically compiled languages. The immediate benefit I perceived when using language constructs instead of SQL was type safety and compile-time validation of the query. I hate the chained function calls and find SQL easier to read, but I would not trade the type safety for a more pleasant syntax.
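To illustrate the tradeoff (Python has no compile step, so this is only a rough stand-in for what a statically compiled language checks for free): a chained builder can validate field names the moment the query is constructed, while a raw SQL string fails only when the database finally parses it. This is a hypothetical toy builder, not any real library's API.

```python
class Query:
    """Toy chained query builder that checks field names up front."""
    FIELDS = {"id", "name", "gender"}  # hypothetical schema for the example

    def __init__(self, table):
        self.table = table
        self.clauses = []

    def where(self, field, value):
        if field not in self.FIELDS:
            # Caught when the query is built, not when it reaches the server.
            raise ValueError(f"no such field: {field}")
        self.clauses.append((field, value))
        return self

q = Query("users").where("gender", "male")   # fine
try:
    Query("users").where("gendre", "male")   # typo caught immediately
except ValueError as e:
    print(e)
```

In a language like Java or Scala the same check happens at compile time, which is the whole appeal.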
I always find it sad that the new data storage engines define themselves under a banner that says "We don't care that everybody who learned how to query a database since 1988 needs to learn a new way because our way is better."
Which wouldn't be so bad if it was uniform across engines, but it isn't. It's a new API for each engine.
That's a lot of yaks to shave when you already know SQL.
They can and they do, e.g. Cassandra and CQL, which syntactically is the SQL you requested.
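For instance, the query from the parent comment looks nearly identical in CQL. A hedged sketch (assumes a hypothetical users table; note Cassandra requires a secondary index before it will filter on a non-key column):

```sql
-- hypothetical table definition
CREATE TABLE users (user_id uuid PRIMARY KEY, name text, gender text);
CREATE INDEX ON users (gender);

-- syntactically the SQL you asked for
SELECT * FROM users WHERE gender = 'male';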
Not trying to start a war here, I am not an interested party, I am just curious.
Introducing cost massively changes how that equation stacks up.
"How many TPS can I run given a budget in the form of these 6 available servers?" Well, the boxes are under-spec for what you want to achieve with an RDBMS, but yeah, no problem with XXX NoSQL product.
It is a really solid product, and some folks have built their entire business on it.
It's also used a lot in enterprise setups, but again, those folks don't typically go around talking about it.
Since it isn't a hot new trend, it stays off the radar of a lot of "startup" folks.
There's no such thing as a single-node HBase cluster, since it involves, at the very least, setting up ZooKeeper.
This keeps HBase out of the hands of tire-kickers.
I guess there's a bit more to it than that. I understand the FB employees that took over were Hadoop guys, so it made complete sense for FB to go in that direction.
The bit I don't get is why HBase is still a bit of a turtle. For example, a recent paper linked from here had a random read and write workload running at 180k ops/sec on a 6-node Cassandra cluster but only 20k ops/sec for the same workload on HBase.
Even if you plan on scaling to billions of rows of data, it may be easier to start with something else that works well for smaller data sets.
From what I understand, Facebook moved to HBase for their messaging platform, which makes sense. Cassandra's consistency model is weaker than HBase's, and an eventually consistent model doesn't make sense for a real-time messaging platform.
This is simply not true. What is true is that Cassandra supports weaker consistency models than HBase in addition to the strong consistency model that HBase supports (and in fact requires -- you can't turn it off).
HBase has its place, but "consistency" isn't a good reason to pick HBase over Cassandra. Rather, particular workloads can (currently) be done faster in HBase than Cassandra, due to locality assumptions with how writes are done in HBase, and the nature of those writes.
That aside, you don't appear to be familiar with how messaging got built.
It was built by a large team of Hadoop people, HBase was simply what they knew, there was no conscious decision to snub Cassandra beyond the fact that it wouldn't have leveraged their extensive HDFS experience.
Furthermore, it's trivial to tune the consistency levels in Cassandra to your needs, so there's no real reason to not use Cassandra just because of a "lack of consistency". Just use ALL or QUORUM, jesus.
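The arithmetic behind QUORUM, for anyone unfamiliar: if the number of replicas a read touches (R) plus the number a write touches (W) exceeds the replication factor (N), every read is guaranteed to overlap at least one replica holding the latest write. A toy check of that rule (just the math, not the Cassandra driver):

```python
def is_strongly_consistent(n: int, r: int, w: int) -> bool:
    """Reads see the latest write when read and write quorums overlap."""
    return r + w > n

# Replication factor 3, QUORUM (2 replicas) on both reads and writes:
print(is_strongly_consistent(3, 2, 2))  # True -- 2 + 2 > 3, strong consistency
# ONE on both sides:
print(is_strongly_consistent(3, 1, 1))  # False -- eventually consistent only
```

ALL is just the degenerate case where W = N, so any R >= 1 overlaps.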
I got bitten by MongoDB and still don't trust it, but I've heard many great things about riak. What does everyone think about the other two?
Fyi, this has nothing to do with you. I'm using you to soapbox about some data/scaling misconceptions. This subject happens to be one of my foremost interests.
Riak (AP, HTTP interface) is just plain a pain to use. Don't use it unless you have very specific use-cases in mind for the BitCask backend or you know what you're doing (aka, don't have to ask open-ended questions like this, no offense). Riak is amazing at a constrained set of use-cases, and pretty awful at most other things. The vector clocks, conflict resolution, and awful AWFUL API and documentation are a goddamn atrocity. Might be cool if you really need the magical replication/clustering, but realistically Cassandra and ElasticSearch offer the same wicked-cool scaling. No multi-master replication in the community/free edition. THIS IS A MASSIVE PAIN FOR LARGE DEPLOYMENTS. Also, ripping data out of the fucker is a pain.
Hypothetically Riak allows intelligent conflict resolution. In practice, this is like getting your wounds reopened and salted with ritualistic regularity.
MongoDB is just sorta..."okay" at a variety of things, especially things that can be done with sharding. Replication in MongoDB is a joke, as is the underlying infrastructure of it. I use it as a stand-in for what most other people use an RDBMS for. I'm generally relatively disciplined, so I haven't paid the dire DIRE costs some people pay for being unhygienic with document stores. I've seen people totally trash their data in the absence of schematic enforcement. I wouldn't recommend MongoDB except to startups that I trust to know what they're doing.
MongoDB is especially handy for discrete/isolated user data and environments, as it's designed to shard. I'm not really comfortable describing MongoDB as being designed for denormalization, because that's not really true. My real metric for denormalized data is Hadoop/HBase/Cassandra, and MongoDB totally shits the bed after documents get past 16MB IIRC. The limit used to be 4MB.
Oh and by the way, don't take MongoDB or Riak's "map-reduce" support seriously at all. Just don't even bother. Pretend they don't exist.
HBase and Cassandra are both more solid than MongoDB and easier to use than Riak, though they're more specialized than MongoDB.
A few things to keep in mind:
Cassandra, when it first got open sourced, was frankly awful. It's actually improved a lot, to the point where it's no longer the intense pain point for Reddit and Formspring that it once was. If you need SRSFACE replication, truth-propagation, and tuneable consistency, Cassandra is your girl.
Cassandra is nominally AP, but allows tuneability to full-blown CP by all rights with ALL (it can otherwise use QUORUM, ANY, etc.). Cassie is conceptually simpler than Riak due to using timestamps rather than vector clocks to track state transitions. Hardcore database theorists will complain about this loudly. I remain undecided.
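The timestamps-vs-vector-clocks complaint boils down to this: last-write-wins silently discards one of two concurrent updates, while vector clocks detect the concurrency and hand the conflict back to the application. A toy illustration of both (hypothetical structures, not either database's actual wire format):

```python
def lww_merge(a, b):
    """Cassandra-style last-write-wins: the highest timestamp silently wins."""
    return a if a["ts"] >= b["ts"] else b

def vclock_compare(va, vb):
    """Riak-style vector clocks: detect concurrency instead of guessing."""
    keys = set(va) | set(vb)
    a_le_b = all(va.get(k, 0) <= vb.get(k, 0) for k in keys)
    b_le_a = all(vb.get(k, 0) <= va.get(k, 0) for k in keys)
    if a_le_b and not b_le_a:
        return "b newer"
    if b_le_a and not a_le_b:
        return "a newer"
    return "equal" if a_le_b else "concurrent -- app must resolve"

# Two clients update the same key independently:
print(lww_merge({"val": "x", "ts": 100}, {"val": "y", "ts": 101}))  # "x" is lost
print(vclock_compare({"client1": 2}, {"client2": 1}))  # conflict surfaced
```

Whether surfacing the conflict is worth the API pain is exactly the Riak-vs-Cassandra argument above.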
HBase is a bit simpler, but it's built on HDFS. This is, depending on your point of view, either a great thing or an awful thing. HBase is strictly a CP wide-column store. You can pretend it's Google BigTable, but that would be a dire mistake. HBase is equivalent to BigTable like Bangladeshi slums are equivalent to the Taj Mahal. Google's stuff is way...way better and, IMHO, contributes to the design being a lot more practical. It's my opinion that modeling wide-column stores and map-reduce frameworks on top of a distributed filesystem only works if that DFS is extremely top-notch.
HDFS is extremely not top notch. I'm still waiting for someone to leak the source to GFS or Colossus. When that day comes, I will probably cry tears of joy until I die of dehydration.
If you're using HBase or Cassandra, you're using a wide-column store. Cassandra is the more flexible of the two; HBase is better understood. Use Cassandra if you need AP/CP tunability, otherwise use HBase. Hadoop/HBase people are easier to find anyway. I personally prefer Cassandra.
Cassandra replication is more auto-magic; HBase, being built on HDFS, is less so. Cassandra is Thrift-only; HBase gives you everything you normally get with the Hadoop ecosystem: REST, Java, Thrift, etc.
Cassandra is P2P, HBase is master/slave. HBase means finagling with SPOF Zookeeper nodes and all that other contemptible HDFS bullshit. Cassandra scales better. When Facebook built messaging on HBase, they smacked right into the usual HDFS "feature" that hits everybody with a large deployment.
Basically, they had to sub-cluster and shard the fuck out of it. That's a lot of work. Cassandra hasn't yet necessitated this. This is typical for non-trivial Hadoop/HDFS deployments. It's also a massive pain.
Have to wonder how Google is faring with Colossus in comparison. Hadoop is just so goddamn awful.
Example problem that works well in a wide-column store: storing and updating the 1,515,106 followers a single twitter user has.
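Why that problem fits: in a wide-column store the user is the row key and each follower is its own column, so adding or removing one follower is a single column write, not a rewrite of a million-entry list. A minimal pure-Python stand-in for the data model (not the actual HBase/Cassandra API):

```python
from collections import defaultdict

# Toy wide-column table: row key -> {column name -> value}.
# A real wide row can hold millions of columns under one key.
followers = defaultdict(dict)

def add_follower(table, user, follower_id):
    # One column insert per follower; the row just keeps growing.
    table[user][f"follower:{follower_id}"] = 1

def unfollow(table, user, follower_id):
    # Removing one follower touches one column, not the whole row.
    table[user].pop(f"follower:{follower_id}", None)

for fid in range(5):
    add_follower(followers, "user:biguser", fid)
unfollow(followers, "user:biguser", 3)
print(len(followers["user:biguser"]))  # 4
```

Column-range scans over a row key are what make "page through this user's followers" cheap.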
You can use HBase and Cassandra as general-purpose data stores, but that's not really a good idea.
Realistically by the time you need the kind of scale either can offer, you've broken down your data/ops needs into discrete problems to be solved.
It looks like this:
"We need a work queue, job dispatch, and distributed filesystem for the OLAP...a wide-column store for tracking followers...a SQL database for payment information...a high-throughput cache for denormalized projections of backend data for the frontend...a sharded index for searchable data"
Not like this:
"Well. We used (MongoDB|Riak|Cassandra|HBase|Neo4j|PostgresQL) for our data and it sprinkled scaling fairy dust on our foreheads like good little catholics on ash wednesday and now our scaling problems are solved."
I'm not taking questions unless you're in the bay area and offering beer. Read a white paper if you can't send beer wenches to my door.
Basically, if your application domain is inherently relational -- inherently "SQL", not "NoSQL" -- then building your own RDBMS inside your application absolutely dooms you, unless you're not really an application writer but actually an RDBMS author.
I was having a conversation with a guy: "I wish there was a library for (whatever NoSQL DB he was complaining about) I could link in to do transactions and indexing for me." My reply: "Yeah, it's called PostgreSQL." That's not trendy and buzzword-compliant, so he ended up annoyed with me. There's a Pragmatic Programmers book, "Seven Databases in Seven Weeks", which is pretty good and describes polyglot database design a little toward the end... So you "need" a key-value store and indexed transactions, and there's nothing that does both perfectly? Well, there are plenty of good free open-source DBs, so install and use two of them... It's really not that hard.
UniVerse and jBASE are both well supported on Linux, and there are several multi-billion-dollar (revenue) companies running their core business on them -- I work for one of them.
(p.s the amazon product you're looking for is dynamo)
Dynamo is a new database product they have that is also a key value store.
Sure, when you break it down S3 is just a big key-value store, but Dynamo is a much closer comparison to BigTable.
My point was that you're being short-sighted by dismissing the author's credibility for calling S3 a database, which many people do, and especially did in 2010.
Maybe I should start a StackOverflow thread, or make a mini-site where people can opine about various technology products. I find myself looking for opinions every time I'm thinking of using a new piece of software for my business.
And when you think about it, S3 is just a massive distributed key-value store with simple key querying and an HTTP API.
S3 is an excellent key/value store for large values. It's also publicly available, which is nice.
For example, all the thumbnail images on reddit are stored in S3. Essentially the client is given the key and then they can go look up the value themselves, and since it is publicly available http, it works right there in the browser.
Also, I would argue that static content delivery is just another form of database. It's just a massive key/value store, where the keys are the filenames and the values are the contents of the files.
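The whole "query language" of such a store is the key itself, the way an S3 GET is just a URL. A minimal in-memory stand-in for that interface (toy sketch, nothing like S3's real implementation):

```python
class ToyKeyValueStore:
    """Minimal stand-in for an S3-style store: put/get by key, nothing else."""

    def __init__(self):
        self._blobs = {}

    def put(self, key: str, value: bytes):
        self._blobs[key] = value

    def get(self, key: str) -> bytes:
        # Equivalent to GET /<key> over HTTP -- the key is the entire query.
        return self._blobs[key]

store = ToyKeyValueStore()
# e.g. a thumbnail, addressed directly by its key, like reddit's thumbs in S3
store.put("thumbs/abc123.png", b"\x89PNG...")
print(store.get("thumbs/abc123.png")[:4])
```

No secondary indexes, no scans, no joins -- which is exactly why it scales so well, and exactly why people argue about whether it counts as a database.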
Let me ask you this: What is your definition of a database?
(1) Any page load that requires 3-5 sequential DB accesses (each based on the results returned in the previous step).
(2) Netflix's "let's store it somewhere until we really want to use it" qualifies.