
I'll admit total naivete here. Is millions of columns really something you'd see in the real world? Sure, billions of rows, I understand, but on the order of 10^6 columns?

For data warehousing, does it really affect performance that much to denormalize to such an insane degree? Like I said, I may be completely naive here, but wouldn't a little normalization there increase both maintainability and speed?

It seems a lot more like a data junkyard or landfill than a data warehouse to me.



The problem is that "denormalization" is an overloaded term.

Often, it's used to mean storing multiple copies of data; for instance, to work around a key-value store's lack of secondary indexing. In that case, you're paying a penalty in terms of space (and code complexity) to make certain operations faster.

In other cases, denormalization just means structuring your data non-relationally. For example, you might want to add a set-valued field to one of your tables. The relational way would be to split that field into a separate table and access it with a join operation, but it's not obvious that that's more maintainable or efficient than storing the set inline.

HBase's support for wide rows is just a mechanism that gives you that kind of flexibility in organizing your data. As norkakn alluded to, the distinction between "row" and "column" in HBase isn't nearly as fundamental as in an RDBMS. Data is indexed by a tuple of (row, column family, column), where the row determines atomicity and the column family controls storage locality.
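
To make those coordinates concrete, here's a rough sketch using the HBase Java client (the table, family, and qualifier names are invented for illustration, not from any real schema); every cell is addressed by its row key, column family, and column qualifier:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class CellCoordinates {
        public static void main(String[] args) throws Exception {
            try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
                 Table table = conn.getTable(TableName.valueOf("users"))) {

                // Write one cell at (row "user42", family "attrs", qualifier "email").
                // Mutations within row "user42" are atomic; family "attrs" controls locality.
                Put put = new Put(Bytes.toBytes("user42"));
                put.addColumn(Bytes.toBytes("attrs"), Bytes.toBytes("email"),
                              Bytes.toBytes("foo@example.com"));
                table.put(put);

                // Read it back with the same (row, family, qualifier) coordinates.
                Get get = new Get(Bytes.toBytes("user42"));
                get.addColumn(Bytes.toBytes("attrs"), Bytes.toBytes("email"));
                Result result = table.get(get);
                System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("attrs"), Bytes.toBytes("email"))));
            }
        }
    }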

There are solid technical reasons for wanting to go with an RDBMS or a distributed key-value store in different situations. Metaphors like "data junkyard" don't add anything productive to the discussion.


Can you recommend a resource (book, blog article, website) for learning more about NoSQL database design and denormalization? The issue, of course, isn't that I have trouble finding such resources, but that there are way too many...


> Metaphors like "data junkyard" don't add anything productive to the discussion.

Yeah, I probably went a bit overboard. It's just very much different from the way I'm used to thinking about data, so it seems very disorganized to me, though I'm sure it's not when done well.

I was just imagining an RDBMS with a million columns. Even in DW scenarios, I'm pretty sure that's a bad plan, though I may be wrong there. I'd definitely cringe, though.


Yeah, it's just a totally different model, and maybe the problem is that we're using the word "column" to mean totally different things in different contexts. What HBase calls a "column" is really more like part of a composite key, and nobody gets upset by a composite key that has millions of distinct values.

In an RDBMS, a table with millions of columns would be unmanageable for a bunch of reasons:

- If only a few columns were set in any given row, you'd waste a ton of space storing all the NULL values.

- Modifying one column in a row would probably require reading and re-writing the entire row.

- There's no good way to retrieve a large-ish subset of columns that you're interested in; you'd have to either specify them all by name in your query, or fetch the entire row.

None of those downsides apply to the Bigtable data model (which includes HBase, Cassandra and a few other similar projects). Null columns are free (since they don't exist on disk in the first place), writes are cheap no matter the row size, and you can filter by columns in interesting ways.
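
Roughly, with the HBase Java client (names here are placeholders, not a real schema), that looks something like this: only the cells you write exist, and a scan can filter columns by prefix instead of naming each one.

    import java.io.IOException;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.filter.ColumnPrefixFilter;
    import org.apache.hadoop.hbase.util.Bytes;

    class SparseRows {
        // Only the cells you actually write exist on disk: a row with two columns
        // set (out of a conceptual million) stores exactly two cells, no NULLs.
        static void writeSparseRow(Table table) throws IOException {
            Put put = new Put(Bytes.toBytes("row-1"));
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("tag:blue"), Bytes.toBytes("1"));
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("tag:red"), Bytes.toBytes("1"));
            table.put(put);  // cost depends on the cells written, not the row's potential width
        }

        // Retrieve just the subset of columns you care about, e.g. every qualifier
        // starting with "tag:", without naming each one individually.
        static void scanTagColumns(Table table) throws IOException {
            Scan scan = new Scan();
            scan.setFilter(new ColumnPrefixFilter(Bytes.toBytes("tag:")));
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result r : scanner) {
                    System.out.println(r);
                }
            }
        }
    }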

You're not wrong about the potential for messiness, though. The biggest drawback (IMO) of the Bigtable model is that the database server doesn't know anything about the structure of your data. If you're used to having a SQL prompt where you can examine and manipulate data in interesting ways, HBase's "shell" is a huge step backward. If you want to have any kind of useful visibility into your data, you have to build those tools yourself.


Schema design tends to be a bit different than in a traditional relational database due to the different constraints and capabilities.

For instance, IIRC Facebook has a video talking about how they use HBase for messages, and they have an example of a table indexing the messages for search:

Each row contains the index of one user. For each word in a message they insert an entry in the row with the word as the column name, the message id as the version (a column can have several values with different versions), and maybe some additional info in the value.

You can see how you'd quickly reach a million columns when every word in every message you send or receive becomes a column entry.

It may have been this presentation: http://www.infoq.com/presentations/HBase-at-Facebook/
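
A rough sketch of that indexing scheme with the HBase Java client (the family name and the tokenization are placeholders of mine, not Facebook's actual implementation):

    import java.io.IOException;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    class MessageSearchIndex {
        // One row per user, one column per word; the cell's version (timestamp)
        // carries the message id, so a single word can map to many messages.
        static void indexMessage(Table index, String userId, long messageId,
                                 String messageBody) throws IOException {
            Put put = new Put(Bytes.toBytes(userId));
            for (String word : messageBody.toLowerCase().split("\\W+")) {
                if (word.isEmpty()) continue;
                put.addColumn(Bytes.toBytes("idx"),   // column family
                        Bytes.toBytes(word),          // column qualifier = the word
                        messageId,                    // version = message id
                        Bytes.toBytes("1"));          // value: could hold positions, etc.
            }
            index.put(put);
        }
    }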


Typical usage is an inverted index for full-text search: the search token is the row key, and the doc_ids are the columns.


Think of HBase (and also Google's BigTable) as a sorted hash table containing hashes, so sparse columns (that is, a million possible columns, but a much smaller number actually used in any given row) are efficient and make sense.


I've thought about using HBase for storing genomic variants -- each row is a genomic position (3.1 billion for humans), and each column represents a sample or variant. At the current speed of sequencing, a million human genomes might be reached within five years (a number of projects are attempting this).


Cassandra is another NoSQL database known for supporting lots of columns (I believe around 2 billion columns per row). It is typically used for storing time-series data: the row key is taken to be the "base" epoch, and each column in that row is a particular sensor reading.


There are (approximately) 31 million seconds per year.

Using column-based storage for metrics at one-second granularity makes for a very fast and useful time-series store.
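
For instance, a minimal sketch of such a layout (the bucketing scheme and names are my assumptions, not any particular product's design): one row per metric per day, one column per second.

    import java.io.IOException;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    class MetricWriter {
        // Row key = metric name + day bucket; column qualifier = the offset in
        // seconds within that day, so one row holds up to 86,400 columns.
        static void record(Table metrics, String metric, long epochSeconds, double value)
                throws IOException {
            long dayBucket = epochSeconds / 86400;       // one row per metric per day
            int offsetInDay = (int) (epochSeconds % 86400);
            Put put = new Put(Bytes.toBytes(metric + "|" + dayBucket));
            put.addColumn(Bytes.toBytes("m"), Bytes.toBytes(offsetInDay), Bytes.toBytes(value));
            metrics.put(put);
        }
    }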

However, for storing relational data (such as customers and orders) you are correct: HBase is not the right tool for the job.


I've seen tables with 10^3 and 10^4 columns, if I remember correctly. As others have pointed out, it's an efficient way to do time series, and since HBase tables can be broken into "column families", you can often associate many types of data with the same key, whereas in a relational database you're encouraged to split them into separate tables. Denormalized doesn't just mean duplicated.


Yeah, you'll see it. Operations are basically atomic on a single row, so it often makes sense to have fat rows with tons of columns. You can use the version as an extra dimension, or encode multi-dimensional information into the column qualifier, which can work out pretty nicely.


There are some real use cases once you stop confining yourself to a single-tenant architecture. Apart from that, think of any categorization that's dynamic: timeline-based data, versioning of data, etc.


Hot keys can be a problem, performance-wise, so people sometimes "shard the key".
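
A minimal sketch of what that salting can look like (the bucket count and key format here are arbitrary choices of mine):

    class SaltedKey {
        static final int NUM_BUCKETS = 16;  // arbitrary; pick based on write volume

        // Prefix the natural key with a bucket derived from its hash, so writes
        // to a single hot key spread across NUM_BUCKETS row-key ranges (regions).
        static String salt(String naturalKey) {
            int bucket = Math.floorMod(naturalKey.hashCode(), NUM_BUCKETS);
            return String.format("%02d|%s", bucket, naturalKey);
        }
        // The trade-off: readers have to fan out, issuing one scan per possible
        // prefix and merging the results.
    }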




