
This method has been the bane of my existence for going on 5 years now. YMMV, of course, but here's my experience:

Back in 2004, we decided to denormalize a set of records containing year, quarter, and 15 or so data points going back to 1996. Using this cross-tab method, we started out with 8 (years) * 4 (quarters/year) * 15 (data points) = 480 columns (plus a few more for non-denormalized data). 5 years later, we're up to the high 700s. Needless to say, this pretty much blows out your standard MySQL admin tools.
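For the curious, the underlying pattern is just conditional aggregation, one column per combination; a minimal sketch (the table and column names here are made up for illustration, not our actual schema):

    -- Hypothetical normalized source: one row per (company, year, quarter, metric)
    -- observations(company_id, yr, qtr, metric, val)
    SELECT
      company_id,
      SUM(CASE WHEN yr = 1996 AND qtr = 1 AND metric = 'revenue' THEN val END) AS revenue_1996_q1,
      SUM(CASE WHEN yr = 1996 AND qtr = 2 AND metric = 'revenue' THEN val END) AS revenue_1996_q2
      -- ...one SUM(CASE ...) expression per (year, quarter, data point) combination, hundreds in total
    FROM observations
    GROUP BY company_id;

Every new year, quarter, or data point means another generated column, which is how you creep from 480 into the 700s.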

Another pitfall is that you really have to make sure your data is pristine (of course you should be doing that anyway, but reality is usually far from ideal, especially if you're me in 2004); otherwise you end up with doubled values, or values included when they shouldn't be. The generated queries are very, very difficult to debug, again blowing out many admin tools and also text editors (unless you add pretty formatting in your query generator).

So what would I recommend instead? There are multi-dimensional databases and ETL tools that will properly put your transactional data into an analytical-style data warehouse. Read Kimball's Data Warehouse Toolkit [1] (ignore the kinda ratty paper quality; the content is good). The key thing is to denormalize down to a star schema, but keep your facts and dimensions separate (things you'll learn in DWT).
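Roughly, a star schema for the same data keeps one narrow fact table plus small dimension tables, and leaves the pivoting to the reporting layer; a sketch with illustrative names (not from the book):

    -- Dimension tables: one row per distinct date grain and per metric
    CREATE TABLE dim_date   (date_key INT PRIMARY KEY, yr SMALLINT, qtr TINYINT);
    CREATE TABLE dim_metric (metric_key INT PRIMARY KEY, metric_name VARCHAR(32));

    -- Fact table: one row per observation, foreign keys into the dimensions
    CREATE TABLE fact_observation (
      date_key   INT NOT NULL,
      metric_key INT NOT NULL,
      val        DECIMAL(12,2),
      FOREIGN KEY (date_key)   REFERENCES dim_date(date_key),
      FOREIGN KEY (metric_key) REFERENCES dim_metric(metric_key)
    );

    -- Reports aggregate and slice instead of widening the table:
    SELECT d.yr, d.qtr, m.metric_name, SUM(f.val) AS total
    FROM fact_observation f
    JOIN dim_date d   ON d.date_key = f.date_key
    JOIN dim_metric m ON m.metric_key = f.metric_key
    GROUP BY d.yr, d.qtr, m.metric_name;

The schema stays the same width as requirements grow; new years and new data points are just new rows in the dimensions and facts.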

There's a Java-based OLAP server called Mondrian [2] that may also be worth your time. I ended up going with ActiveWarehouse and ActiveWarehouse-ETL [3], but I've felt some things were a little more complicated than they needed to be (but that's another discussion; I also haven't kept up with their development lately, so it may have improved).

On a side note, recently I've started wondering if map-reduce style tools would make sense in this context, but I haven't explored using them that way yet, so I can't actually make a recommendation there.

[1] http://www.amazon.com/Data-Warehouse-Toolkit-Complete-Dimens...

[2] http://mondrian.pentaho.org/

[3] http://activewarehouse.rubyforge.org/




Cross-tabulation is not about denormalization. If anything, it is about producing the reports you want while keeping your data normalized.


It creates repeating groups (gender -> M, F; location -> Boston, Cagliari, etc.; dept -> Development, Personnel, etc.); to me, that looks like denormalization.
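Concretely, something like this is what I mean (illustrative table and column names, not the article's exact query); every distinct dept value becomes its own result column, and adding a department means adding a column:

    SELECT
      location,
      SUM(CASE WHEN dept = 'Development' THEN 1 ELSE 0 END) AS development,
      SUM(CASE WHEN dept = 'Personnel'   THEN 1 ELSE 0 END) AS personnel
    FROM employees
    GROUP BY location;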

I'm not sure this is what you're objecting to, but whether or not the data is _stored_ denormalized (as opposed to just using the result set and throwing it away) is not really of consequence. I think we'll both agree that we can and should keep our transactional data normalized.

Either way, this method goes against the grain of what SQL is supposed to do (otherwise you could meta-program queries without having to resort to a Perl script to string this thing together). My critique still stands: these queries can grow to intractable sizes.

If one is going to be doing a lot of reporting like this, they should really look at separating their transactional and analytical databases.

But hey, like I said, YMMV. In a simple case it works. But trust me when I say that after 5 years of data growth and requirements shifts, I wish I'd known then the burden it would become. :)



