

Apache HBase 1.0 released - fs111
https://blogs.apache.org/foundation/entry/the_apache_software_foundation_announces72

======
andrewstuart2
I'll admit total naivete here. Are millions of columns _really_ something you'd
see in the real world? Sure, billions of rows, I understand, but on the order
of 10^6 columns?

For data warehousing, does it really affect performance _that_ much to
denormalize to such an insane degree? Like I said, I may be completely naive
here, but wouldn't a little normalization there increase both maintainability
and speed?

It seems a lot more like a data junkyard or landfill than a data warehouse to
me.

~~~
teraflop
The problem is that "denormalization" is an overloaded term.

Often, it's used to mean storing multiple copies of data; for instance, to
work around a key-value store's lack of secondary indexing. In that case,
you're paying a penalty in terms of space (and code complexity) to make
certain operations faster.

In other cases, denormalization just means structuring your data
non-relationally. For example, you might want to add a set-valued field to one of
your tables. The relational way would be to split that field into a separate
table and access it with a join operation, but it's not obvious that that's
more maintainable or efficient than storing the set inline.

HBase's support for wide rows is just a mechanism that gives you that kind of
flexibility in organizing your data. As norkakn alluded to, the distinction
between "row" and "column" in HBase isn't nearly as fundamental as in an
RDBMS. Data is indexed by a tuple of (row, column family, column), where the
row determines atomicity and the column family controls storage locality.
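
To make the (row, column family, column) indexing concrete, here is a minimal in-memory sketch. It is a toy under stated assumptions, not how HBase is implemented: the `WideTable` class and its method names are hypothetical, and it ignores timestamps/versions, per-family storage files, and distribution. It only shows that a "wide row" is a contiguous range of composite keys in a sorted map, and that a set-valued field can be stored inline as column names.

```python
from bisect import bisect_left, insort


class WideTable:
    """Toy model: cells keyed by the tuple (row, family, column), kept sorted."""

    def __init__(self):
        self._keys = []   # sorted list of (row, family, column) tuples
        self._cells = {}  # composite key -> value

    def put(self, row, family, column, value):
        key = (row, family, column)
        if key not in self._cells:
            insort(self._keys, key)  # keep composite keys in sorted order
        self._cells[key] = value

    def get_row(self, row, family=None):
        """Range-scan all cells of one row, optionally limited to one family."""
        start = bisect_left(self._keys, (row,))  # first key at or after this row
        result = {}
        for key in self._keys[start:]:
            if key[0] != row:
                break  # sorted order: we have left the row's key range
            if family is None or key[1] == family:
                result[key[1:]] = self._cells[key]
        return result


# A set-valued field stored inline: each set member is a column name.
table = WideTable()
table.put("user1", "tags", "python", "1")
table.put("user1", "tags", "hbase", "1")
table.put("user2", "tags", "java", "1")
```

Here `table.get_row("user1")` returns both of user1's tag cells without a join, and adding a third tag is one more `put` rather than a schema change.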

There are solid technical reasons for wanting to go with an RDBMS or a
distributed key-value store in different situations. Metaphors like "data
junkyard" don't add anything productive to the discussion.

~~~
andrewstuart2
> Metaphors like "data junkyard" don't add anything productive to the
> discussion.

Yeah, I probably went a bit overboard. It's just very much different from the
way I'm used to thinking about data, so it seems very disorganized to me,
though I'm sure it's not when done well.

I was just imagining an RDBMS with a million columns. Even in DW scenarios,
I'm pretty sure that's a bad plan, though I may be wrong there. I'd definitely
cringe, though.

~~~
teraflop
Yeah, it's just a totally different model, and maybe the problem is that we're
using the word "column" to mean totally different things in different
contexts. What HBase calls a "column" is really more like part of a composite
key, and nobody gets upset by a composite key that has millions of distinct
values.

In an RDBMS, a table with millions of columns would be unmanageable for a
bunch of reasons:

- If only a few columns were set in any given row, you'd waste a ton of space
storing all the NULL values.

- Modifying one column in a row would probably require reading and re-writing
the entire row.

- There's no good way to retrieve a large-ish subset of columns that you're
interested in; you'd have to either specify them all by name in your query, or
fetch the entire row.

None of those downsides apply to the Bigtable data model (which includes
HBase, Cassandra and a few other similar projects). Null columns are free
(since they don't exist on disk in the first place), writes are cheap no
matter the row size, and you can filter by columns in interesting ways.
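
A rough sketch of why those three properties fall out of the model, assuming a plain Python dict stands in for the store. The `put` and `scan_columns` helpers and the sensor data are hypothetical, and the linear scan in `scan_columns` is for illustration only (a real store would use its sorted key order instead).

```python
# Sparse storage: only cells that were actually written exist.
cells = {}  # (row, column) -> value


def put(row, column, value):
    """Write one cell; no other cell in the row is read or rewritten."""
    cells[(row, column)] = value


def scan_columns(row, prefix):
    """Return (column, value) pairs of `row` whose column name starts with `prefix`."""
    return sorted(
        (column, value)
        for (r, column), value in cells.items()
        if r == row and column.startswith(prefix)
    )


put("sensor42", "temp:2015-02-24", 21.5)
put("sensor42", "temp:2015-02-25", 22.1)
put("sensor42", "humidity:2015-02-24", 0.4)

temps = scan_columns("sensor42", "temp:")
```

Even if the table conceptually has millions of possible column names, `cells` holds only the three entries that were written, and `temps` selects a subset of columns by prefix without naming each one.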

You're not wrong about the potential for messiness, though. The biggest
drawback (IMO) of the Bigtable model is that the database server doesn't know
anything about the structure of your data. If you're used to having a SQL
prompt where you can examine and manipulate data in interesting ways, HBase's
"shell" is a huge step backward. If you want to have any kind of useful
visibility into your data, you have to build those tools yourself.

------
eclark
If you want a more in-depth blog post about what's different in this
release, Enis has written one up here:
[https://blogs.apache.org/hbase/entry/start_of_a_new_era](https://blogs.apache.org/hbase/entry/start_of_a_new_era)

There's been a ton of work on getting to where the community really felt good
about stamping something with a real stable version. This release should offer
not just stability for a running system but also stability of the API.

------
kul_
Finally!! Thanks for this Enis and HBase Team.

