
I'll admit total naivete here. Is millions of columns really something you'd see in the real world? Sure, billions of rows, I understand, but on the order of 10^6 columns?

For data warehousing, does it really affect performance that much to denormalize to such an insane degree? Like I said, I may be completely naive here, but wouldn't a little normalization there increase both maintainability and speed?

It seems a lot more like a data junkyard or landfill than a data warehouse to me.



The problem is that "denormalization" is an overloaded term.

Often, it's used to mean storing multiple copies of data; for instance, to work around a key-value store's lack of secondary indexing. In that case, you're paying a penalty in terms of space (and code complexity) to make certain operations faster.

In other cases, denormalization just means structuring your data non-relationally. For example, you might want to add a set-valued field to one of your tables. The relational way would be to split that field into a separate table and access it with a join operation, but it's not obvious that that's more maintainable or efficient than storing the set inline.

HBase's support for wide rows is just a mechanism that gives you that kind of flexibility in organizing your data. As norkakn alluded to, the distinction between "row" and "column" in HBase isn't nearly as fundamental as in an RDBMS. Data is indexed by a tuple of (row, column family, column), where the row determines atomicity and the column family controls storage locality.
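
To make those coordinates concrete, here's a rough sketch using the HBase Java client (the table, family, and qualifier names are invented for illustration, not from any real schema); every cell is addressed by its row key, column family, and column qualifier:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class CellCoordinates {
        public static void main(String[] args) throws Exception {
            try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
                 Table table = conn.getTable(TableName.valueOf("users"))) {

                // Write one cell at (row "user42", family "attrs", qualifier "email").
                // Mutations within row "user42" are atomic; family "attrs" controls locality.
                Put put = new Put(Bytes.toBytes("user42"));
                put.addColumn(Bytes.toBytes("attrs"), Bytes.toBytes("email"),
                              Bytes.toBytes("foo@example.com"));
                table.put(put);

                // Read it back with the same (row, family, qualifier) coordinates.
                Get get = new Get(Bytes.toBytes("user42"));
                get.addColumn(Bytes.toBytes("attrs"), Bytes.toBytes("email"));
                Result result = table.get(get);
                System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("attrs"), Bytes.toBytes("email"))));
            }
        }
    }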

There are solid technical reasons for wanting to go with an RDBMS or a distributed key-value store in different situations. Metaphors like "data junkyard" don't add anything productive to the discussion.


Can you recommend a resource (book, blog article, website) for learning more about NoSQL database design and denormalization? The issue, of course, isn't that I have trouble finding such resources, but that there are way too many...


> Metaphors like "data junkyard" don't add anything productive to the discussion.

Yeah, I probably went a bit overboard. It's just very much different from the way I'm used to thinking about data, so it seems very disorganized to me, though I'm sure it's not when done well.

I was just imagining an RDBMS with a million columns. Even in DW scenarios, I'm pretty sure that's a bad plan, though I may be wrong there. I'd definitely cringe, though.


Yeah, it's just a totally different model, and maybe the problem is that we're using the word "column" to mean totally different things in different contexts. What HBase calls a "column" is really more like part of a composite key, and nobody gets upset by a composite key that has millions of distinct values.

In an RDBMS, a table with millions of columns would be unmanageable for a bunch of reasons:

- If only a few columns were set in any given row, you'd waste a ton of space storing all the NULL values.

- Modifying one column in a row would probably require reading and re-writing the entire row.

- There's no good way to retrieve a large-ish subset of columns that you're interested in; you'd have to either specify them all by name in your query, or fetch the entire row.

None of those downsides apply to the Bigtable data model (which includes HBase, Cassandra and a few other similar projects). Null columns are free (since they don't exist on disk in the first place), writes are cheap no matter the row size, and you can filter by columns in interesting ways.
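
Roughly, with the HBase Java client (names here are placeholders, not a real schema), that looks something like this: only the cells you write exist, and a scan can filter columns by prefix instead of naming each one.

    import java.io.IOException;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.filter.ColumnPrefixFilter;
    import org.apache.hadoop.hbase.util.Bytes;

    class SparseRows {
        // Only the cells you actually write exist on disk: a row with two columns
        // set (out of a conceptual million) stores exactly two cells, no NULLs.
        static void writeSparseRow(Table table) throws IOException {
            Put put = new Put(Bytes.toBytes("row-1"));
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("tag:blue"), Bytes.toBytes("1"));
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("tag:red"), Bytes.toBytes("1"));
            table.put(put);  // cost depends on the cells written, not the row's potential width
        }

        // Retrieve just the subset of columns you care about, e.g. every qualifier
        // starting with "tag:", without naming each one individually.
        static void scanTagColumns(Table table) throws IOException {
            Scan scan = new Scan();
            scan.setFilter(new ColumnPrefixFilter(Bytes.toBytes("tag:")));
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result r : scanner) {
                    System.out.println(r);
                }
            }
        }
    }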

You're not wrong about the potential for messiness, though. The biggest drawback (IMO) of the Bigtable model is that the database server doesn't know anything about the structure of your data. If you're used to having a SQL prompt where you can examine and manipulate data in interesting ways, HBase's "shell" is a huge step backward. If you want to have any kind of useful visibility into your data, you have to build those tools yourself.


Schema design tends to be a bit different than in a traditional relational database due to the different constraints and capabilities.

For instance, IIRC Facebook has a video talking about how they use HBase for messages, and they have an example of a table indexing the messages for search:

Each row contains the index of one user. For each word in a message they insert an entry in the row with the word as the column name, the message id as the version (a column can have several values with different versions), and maybe some additional info in the value.

You can see how you'd quickly reach a million columns when every word in every message you send or receive becomes a column entry.

It may have been this presentation: http://www.infoq.com/presentations/HBase-at-Facebook/
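
A rough sketch of that indexing scheme with the HBase Java client (the family name and the tokenization are placeholders of mine, not Facebook's actual implementation):

    import java.io.IOException;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    class MessageSearchIndex {
        // One row per user, one column per word; the cell's version (timestamp)
        // carries the message id, so a single word can map to many messages.
        static void indexMessage(Table index, String userId, long messageId,
                                 String messageBody) throws IOException {
            Put put = new Put(Bytes.toBytes(userId));
            for (String word : messageBody.toLowerCase().split("\\W+")) {
                if (word.isEmpty()) continue;
                put.addColumn(Bytes.toBytes("idx"),   // column family
                        Bytes.toBytes(word),          // column qualifier = the word
                        messageId,                    // version = message id
                        Bytes.toBytes("1"));          // value: could hold positions, etc.
            }
            index.put(put);
        }
    }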


Typical usage is an inverted index for full-text search: the search token is the row key, and the doc_ids are the columns.


Think of HBase (and also Google's BigTable) as a sorted hash table containing hashes, so sparse columns (that is, a million possible columns, but a much smaller number actually used in any given row) are efficient and make sense.


I've thought about using HBase for storing genomic variants -- each row is a genomic position (3.1 billion for humans), and each column represents a sample or variant. At the current speed of sequencing, a million human genomes might be reached within five years (a number of projects are attempting this).


Cassandra is another NoSQL database known for supporting lots of columns (I believe around 2 billion columns per row). It is typically used for storing time-series data: the row key is taken to be the "base" epoch, and each column in that row is a particular sensor reading.


There are (approximately) 31 million seconds per year.

Using column-based storage for metrics at one-second granularity makes for a very fast and useful time-series store.
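
For instance, a minimal sketch of such a layout (the bucketing scheme and names are my assumptions, not any particular product's design): one row per metric per day, one column per second.

    import java.io.IOException;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    class MetricWriter {
        // Row key = metric name + day bucket; column qualifier = the offset in
        // seconds within that day, so one row holds up to 86,400 columns.
        static void record(Table metrics, String metric, long epochSeconds, double value)
                throws IOException {
            long dayBucket = epochSeconds / 86400;       // one row per metric per day
            int offsetInDay = (int) (epochSeconds % 86400);
            Put put = new Put(Bytes.toBytes(metric + "|" + dayBucket));
            put.addColumn(Bytes.toBytes("m"), Bytes.toBytes(offsetInDay), Bytes.toBytes(value));
            metrics.put(put);
        }
    }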

However, for storing relational data (such as customers and orders) you are correct: HBase is not the right tool for the job.


I've seen tables with 10^3 and 10^4 columns, if I remember correctly. As others have pointed out, it's an efficient way to do time series, and since HBase tables can be broken into "column families", you can often associate many types of data with the same key, whereas in a relational database you're encouraged to split them into separate tables. Denormalized doesn't just mean duplicated.


Yeah, you'll see it. Operations are basically atomic on a single row, so it often makes sense to have fat rows with tons of columns. You can use the version as an extra dimension, or encode multi-dimensional information into the column qualifier, which can work out pretty nicely.


There are some real use cases once you stop confining yourself to a single-tenant architecture. Apart from that, think of any categorization that's dynamic: timeline-based data, versioning of data, etc.


Hot keys can be a problem, performance-wise, so people sometimes "shard the key".
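
A minimal sketch of what that salting can look like (the bucket count and key format here are arbitrary choices of mine):

    class SaltedKey {
        static final int NUM_BUCKETS = 16;  // arbitrary; pick based on write volume

        // Prefix the natural key with a bucket derived from its hash, so writes
        // to a single hot key spread across NUM_BUCKETS row-key ranges (regions).
        static String salt(String naturalKey) {
            int bucket = Math.floorMod(naturalKey.hashCode(), NUM_BUCKETS);
            return String.format("%02d|%s", bucket, naturalKey);
        }
        // The trade-off: readers have to fan out, issuing one scan per possible
        // prefix and merging the results.
    }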




