

Big Data Debate: HBase - jeromatron
http://www.informationweek.com/software/enterprise-applications/big-data-debate-will-hbase-dominate-nosq/240159475

======
linuxhansl
Oh please.

On one side we have a commercially competing entity arguing against HBase, and
on the other an individual "defending" HBase who has never contributed a
single line of code to open source HBase and is promoting their own
closed-source solution.

HBase is a "Sparse, Consistent, Distributed, Multidimensional, Sorted map" and
at that it is pretty good.

Other stores are either not consistent or not sorted by default (which means
you cannot do range scans). Some can be configured to do that, but then
suddenly all those nice claims made by the commercial entities backing them
just vanish.

If you do not need consistency and range scans, then do not use HBase. If you
do need those, HBase will be an excellent fit.
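Because the keys are kept in sorted order, a range scan is just a contiguous
slice of the keyspace. A toy sketch of the sorted-map idea (illustrative only,
not HBase's actual implementation):

```python
from bisect import bisect_left

class SortedMap:
    """Toy model of a sorted key-value store supporting range scans."""

    def __init__(self):
        self._keys = []   # kept in sorted order
        self._vals = {}

    def put(self, key, value):
        if key not in self._vals:
            # Insert the key at its sorted position.
            self._keys.insert(bisect_left(self._keys, key), key)
        self._vals[key] = value

    def scan(self, start, stop):
        """Yield (key, value) pairs with start <= key < stop."""
        lo = bisect_left(self._keys, start)
        hi = bisect_left(self._keys, stop)
        for k in self._keys[lo:hi]:
            yield k, self._vals[k]

m = SortedMap()
for k in ["row3", "row1", "row2", "row9"]:
    m.put(k, k.upper())
print(list(m.scan("row1", "row3")))  # [('row1', 'ROW1'), ('row2', 'ROW2')]
```

A store that hashes keys across nodes instead can't answer `scan` without
touching every partition, which is the trade-off being described here.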

Some of the most heavyweight entities on this planet have committers on HBase
(Facebook, Intel, Salesforce, to some extent Twitter, etc.), as well as
commercial backers such as Cloudera and Hortonworks.

Disclaimer: HBase committer here.

~~~
dialtone
Agreed. Out of the box, none of the other storage systems actually lets you
stop worrying about range scans. While HBase will repartition your data to
balance the used disk space automatically, in most other systems you'll end up
having to manually partition your data in different ways and then aggregate
the queries across partitions.

This is pretty literally the main reason why many companies use HBase. Then in
the same companies you can also find a good usage for Cassandra but it'd be a
different use case.

------
trun
Linkbait aside, there are some reasonable points being made here, specifically
"Failover means downtime" about HBase. We run into this pretty much every day
in one of our customer-facing applications or APIs, and it's quite frustrating
to have to explain that there's very little you can do to prevent it.

I haven't looked too closely at MapR yet, but "instant recovery, seamless
sharding and high availability" are impressive claims. It's still a decidedly
different proposition than HBase in my mind, given the cost.

~~~
monstrado
Which version of HBase are you running? If we have a node or two go down in
our cluster, HBase is completely unaffected. Recoveries can take up to a few
minutes; in the old days it took hours.

I know that MTTR (mean time to recovery) is being worked on together by
several large companies to get down to seconds.

~~~
jeremiahjordan
"Minutes" is not "unaffected" if your site is now down.

~~~
monstrado
Sorry, I should have clarified. When a node goes down, there isn't any sort of
"minute" interruption...When there's a full scale outage and you need to
perform an actual recovery, that can take minutes.

------
karterk
We use HBase pretty extensively and I have mixed feelings about it. On one
hand, it's very clunky. Though the documentation is pretty good these days,
setting up and managing an HBase cluster in production and at scale involves
tackling many moving parts. Setting up Cassandra is almost a joke in
comparison.

HBase-Hadoop integration is great, but Cassandra has caught up significantly
on that front. If you want strong consistency, apart from HBase there is no
other (non-sql) solution that's really battle tested. If you're okay with
eventual consistency, you should take a hard look at Cassandra.

~~~
threeseed
I don't understand why people keep thinking that Cassandra is only eventually
consistent.

It can be set to any consistency level you like for both reads and writes:
[http://www.datastax.com/docs/1.1/dml/data_consistency](http://www.datastax.com/docs/1.1/dml/data_consistency)
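The tunable part boils down to an overlap rule: with replication factor N, a
write acknowledged by W replicas and a read from R replicas behaves strongly
consistent when R + W > N, since the read quorum must intersect the write
quorum. A sketch (the function name is my own):

```python
def is_strongly_consistent(write_replicas, read_replicas, replication_factor):
    """Reads see the latest write when read and write quorums overlap."""
    return write_replicas + read_replicas > replication_factor

N = 3
quorum = N // 2 + 1  # 2 of 3 replicas

# QUORUM writes + QUORUM reads: quorums overlap, so reads are consistent.
assert is_strongly_consistent(quorum, quorum, N)

# ONE write + ONE read: a read may hit a replica the write never reached.
assert not is_strongly_consistent(1, 1, N)
```

That is why "eventually consistent" depends entirely on the consistency
levels you choose per operation.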

~~~
duaneb
It doesn't even have transactions, I'm not sure how 'consistency' is even in
the conversation.

EDIT: Just saying, failures and partitions are extremely difficult to deal
with without atomic operations.

~~~
jbellis
Here we're discussing the Consistency in CAP, not the one in ACID. Here's a
good introduction to the concepts involved:
[http://www.allthingsdistributed.com/2008/12/eventually_consi...](http://www.allthingsdistributed.com/2008/12/eventually_consistent.html)

~~~
duaneb
It's no less applicable. During a partition (you know, the 'P' in CAP) it's a
nightmare to identify data that's been dropped AND recover from it without
strong transaction guarantees. Transactions are the heart of consistency and I
wouldn't consider a piece of software to be CAP consistent as opposed to
available without transactions.

------
samspenc
Heavy HBase user here. My two cents, FWIW.

I totally identify with both sides of this article, but if we had to do this
again, we would probably go for HBase again. It's quite a pain to manage unless
you have someone on your team with a PhD in HBase, but:

1\. The main HBase committers are also the ones who contribute to the Hadoop
project, so it's on a forward trajectory with good velocity, since it's fairly
coupled to the Hadoop ecosystem. Cloudera is a big contributor, as are
Facebook and Salesforce.

2\. Facebook, Adobe and other big-name companies use it in production and at
scale. (Granted, they have armies of smart people to maintain HBase.)

3\. For all the pain it is, it's reasonably documented and has a growing
community that fills in the gaps. I could be totally wrong about this, but
Cassandra doesn't seem to have the same level of community that HBase does.

~~~
jbellis
/author of the "con" position here

I can see why you might come to those conclusions -- a lot of people with
their heads down in Hadoop just don't realize that there's a world outside
HDFS. Not saying that in a mean way; that's just the way it is. If you're
involved in that ecosystem, there's enough to keep up with without researching
what others are doing.

1\. True enough, but HBase tends to be an afterthought for the datawarehouse-
focused Hadoop players. You see this manifest in a bunch of ways, but as just
one example: when I evaluated HBase at Rackspace 4 years ago, they were
exploring options for secondary indexes. They're still exploring.

2\. Yes, you really do have to have expert-level knowledge of the internals to
deploy it successfully. And there's a LOT of those internals, by design
(hmaster, ZK, regionserver, and close ties to the HDFS infrastructure).
Cassandra is much simpler, much easier to deploy and troubleshoot. When
Cassandra gets evaluated vs HBase, it tends to win [1], but often HBase is the
"default" choice because of "Hey, we already have HDFS" thinking. This is
changing as awareness grows of alternatives.

3\. I think that's the tunnel vision I mentioned. By any metric I can think of
-- conference attendance (over 1100 at the Cassandra Summit in June), IRC
activity, StackOverflow participation, jobs advertised -- the Cassandra
community is larger and more active than HBase's. Here's a list of some of the
users: [2].

Give it a look, I'll be happy to answer questions!

[1]
[http://vldb.org/pvldb/vol5/p1724_tilmannrabl_vldb2012.pdf](http://vldb.org/pvldb/vol5/p1724_tilmannrabl_vldb2012.pdf)
[2]
[http://planetcassandra.com/Company/ViewCompany?IndustryId=-1](http://planetcassandra.com/Company/ViewCompany?IndustryId=-1)

~~~
samspenc
Thanks for the reply!

1\. True, the fact that we were deploying Hadoop was a big reason to go with
HBase. Despite the challenges, it was a case of "better the devil we know than
the one we don't."

2\. Agreed, I played with Cassandra, but again, point #1 carried the day.

3\. That's great! I was at HBaseCon 2013 and that only had 750-850 people. I
concede this point. ;) Sorry I didn't explore this further, that was my bad.

[EDIT] One thing we like is that HBase re-partitions data really fast since
data is in HDFS. Not sure how well Cassandra holds up there.

------
m0nastic
At the risk of hijacking what seems like a linkbait article, I'll ask folks
here a question:

Does anyone have good experiences/recommendations for storing a "reasonable"
amount of unstructured logs/pcap files? By "reasonable", I mean not petabytes
(maybe a couple of terabytes, over time).

I ask, because I keep thinking something like HBase is overkill (although one
of my alternate solutions is to run an internal Openstack Swift cluster, which
seems like pretty much the same amount of hardware/engineering).

If I could, I'd just send it to the cloud (to the cloud!), but I need it to be
local and internally controlled (however, the developer niceness of S3 or
Azure blob storage is what got me thinking about just making my own Swift
object store).

~~~
monstrado
Disclaimer: I work at Cloudera as a Tools Developer

What do you mean by unstructured? Do you mean the data has yet to be parsed
into a format which could be logically grouped into columns? Or do you mean
that it's deeply nested?

Since log data doesn't really change, it might be overkill to use something
like HBase (or any database for that matter). On the tools team at Cloudera,
we've found that writing the data into HDFS and using Impala to analyze it
works pretty well.

We typically analyze chunks of log data and then ingest it into HDFS (due to
the use case), but if you're looking to ingest data in "real-time", you'll
want to use something like Apache Flume.

With the data separated into partitions, we're able to run queries that
analyze GBs of data in under a second (15 nodes). This is log data (log4j)
that has been parsed into columns and then loaded into a columnar
storage format (RCFile, soon to be Parquet).
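The win from partitioning is that a query only touches the files for the
partitions it needs (partition pruning). A toy sketch of the idea; the
directory names are hypothetical Hive/Impala-style date partitions:

```python
from datetime import date

# Hypothetical layout: one directory per day of log data in HDFS.
partitions = {
    "dt=2013-07-01": ["app.log.0", "app.log.1"],
    "dt=2013-07-02": ["app.log.0"],
    "dt=2013-07-03": ["app.log.0", "app.log.1"],
}

def files_to_scan(start, end):
    """Return only the files in partitions inside the queried date range."""
    selected = []
    for part, files in sorted(partitions.items()):
        d = date.fromisoformat(part.split("=")[1])
        if start <= d <= end:
            selected.extend(f"{part}/{f}" for f in files)
    return selected

# A query over July 2-3 never reads the July 1 files.
print(files_to_scan(date(2013, 7, 2), date(2013, 7, 3)))
```

Columnar formats like RCFile/Parquet then prune further within each file by
reading only the columns a query references.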

Let me know if you have any questions, glad to help.

~~~
m0nastic
Thanks, I've been testing out a bunch of hare-brained schemes and it didn't
even occur to me to just use HDFS directly (and seeing you and karterk both
suggest it helps).

Basically, at a high level, the system I'm working on aggregates and processes
security information (It's a SIEM, if that product category means anything to
you). At the point the logs get ingested, the server determines if they're
"actionable" (which is determined by rules I load into Redis), in which case
it parses them and stores them in a Postgres event table; or "not individually
actionable, but may cause an action in conjunction with some other log" that I
want to just store somewhere for batch processing.

I don't really need to tokenize those logs, as at the point I care about them
I'm just going to be searching through them. So, they're "unstructured" in the
sense that there are about 15 different collection points, each with its own
format (many just an ugly facsimile of syslog with some JSON in the middle).

So, I think your suggestion will work out very well.

Thanks again.

~~~
monstrado
No problem, glad I could help. Your use case sounds pretty interesting, HDFS
should fit the bill for sure.

You should take a look at Parquet ([http://parquet.io/](http://parquet.io/))
for storing your data. It's an open source columnar format that was designed
for Hadoop, it even supports nesting ([https://github.com/Parquet/parquet-
mr/wiki/The-striping-and-...](https://github.com/Parquet/parquet-mr/wiki/The-
striping-and-assembly-algorithms-from-the-Dremel-paper) <\-- really
interesting). Also, it already works with a lot of the Hadoop ecosystem
components (MR, Hive, Pig, Cascading, Impala, ..), so your data doesn't have
to move once it's in HDFS.

Good luck!

------
duaneb
NoSQL is my least favorite buzzword of the decade. It means absolutely nothing
beyond being countercultural to... SQL culture. Many NoSQL "databases" (and I
use that term in the loosest sense possible) even have SQL engines.

~~~
otterley
How to make a simple NoSQL database:

(1) Install MySQL.

(2) Create a database called "nosql".

(3) In the "nosql" database, create a single InnoDB table called "data" with 2
columns: a BLOB primary key "mykey" and a BLOB column called "myval".

(4) Write a thin wrapper over your favorite mysql client library implementing
"get" and "set" operations (which are translated to SELECT and UPDATE
statements).

(5) Enjoy your ludicrous speed.
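The wrapper in step (4) really is about a dozen lines. A sketch using sqlite3
in place of MySQL so it runs standalone (the original recipe assumed InnoDB):

```python
import sqlite3

# Step (3): one table, two BLOB columns, key is the primary key.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (mykey BLOB PRIMARY KEY, myval BLOB)")

# Step (4): the thin "get"/"set" wrapper.
def set_(key, val):
    # Upsert so "set" works for both new and existing keys.
    conn.execute(
        "INSERT OR REPLACE INTO data (mykey, myval) VALUES (?, ?)", (key, val)
    )

def get(key):
    row = conn.execute(
        "SELECT myval FROM data WHERE mykey = ?", (key,)
    ).fetchone()
    return row[0] if row else None

set_(b"answer", b"42")
print(get(b"answer"))  # b'42'
```

Of course, this gets you a key-value store, not the sorted range scans the
thread below points out are HBase's bread and butter.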

~~~
monstrado
If you're comparing this to your riak, redis, tokyo cabinet, ... database,
then most likely.

With HBase, one of its bread-and-butter operations is the start/stop row
scan (otherwise known as a range scan). The only equivalent thing I know of in
MySQL is using window functions, and even then I don't think that's an
appropriate comparison.

I hear you though, most people equate NoSQL to some sort of KeyValue store and
that's it.

~~~
duaneb
How are multi-row transactions these days? That was the largest problem when I
looked last... A data store does not a database make.

~~~
threeseed
I don't know about HBase but for Cassandra it should be available in the next
minor update.

Single row transactions have just recently been added:
[http://www.datastax.com/dev/blog/lightweight-transactions-
in...](http://www.datastax.com/dev/blog/lightweight-transactions-in-
cassandra-2-0)

~~~
mh-
FWIW, Cassandra 1.2 is the latest production/stable version.

~~~
jbellis
You are correct, for about another week. :)

~~~
mh-
so, 1.2 will become unstable? ;)

------
PaulHoule
I laugh at MongoDB, but every time I've seen a shootout between HBase and
Cassandra, Cassandra has won.

~~~
monstrado
Could you please elaborate? It totally depends on use case...

~~~
PaulHoule
(1) I've seen more than one project failure involving MongoDB.

(2) Every time I've done a performance/features shootout of MongoDB vs. other
projects, MongoDB doesn't just lose, it comes back with two black eyes.

------
rjurney
Interesting that neither person has anything to do with Apache HBase. Both are
in a position to belittle it.

------
capkutay
Some people even describe HBase as NoSQL, because it's not SQL, right? Unless
you use Cloudera Impala, which is SQL, but Impala is supposed to kill SQL too.
In the end no one wins.

~~~
monstrado
> impala is supposed to kill sql too

I'm not sure what you mean by this; Impala is in no way trying to replace
"SQL". Impala is a general-purpose distributed query engine that currently
translates SQL into a query plan.

~~~
capkutay
I was commenting in the theme of tc-style enterprise tech journalism but thank
you for clarifying.

~~~
monstrado
:) sometimes I don't know what to believe on the internet

------
hayksaakian
Blatant link bait

[http://en.wikipedia.org/wiki/Betteridge's_law_of_headlines](http://en.wikipedia.org/wiki/Betteridge's_law_of_headlines)

------
zzzeek
NoSQL, what's that, is that like Big Data ?

~~~
dallasmarlow
NoSQL is only a stepping stone to NoQuery, it's going to be awesome

------
knodi
Does it matter?

