Now, I agree that it's kind of a bummer we can't play with it right now, but the fact that you guys made this are are going to open source it is already awesome in itself.
That said, I'd love to see some code released, even if it isn't ready for primetime.
Our HBase cluster (3 boxes serving 30 human oracles, each submitting data at a rate of 1 record every 5-10 seconds) choked frequently - i.e., it stopped accepting new records. Ultimately what I had to do is have the human data go into postgres and a cron job flushed that into HBase every half hour or so.
I'll emphasize that this is probably my fault. I'm not claiming HBase doesn't scale to 30 concurrent users - clearly Facebook demonstrates it can. But I couldn't figure out how to make that happen. HBase is a complex system and I make no claim of understanding it.
ElephantDB + MaryJane are simple. There is almost nothing that can go wrong - put together they probably amount to 5000 lines of code and have as many as 10 minimally interacting configuration options. The effort required to manage them is minimal - I had EDB working flawlessly in less than a day.
HBase is an enterprise tool. It works well if you are Facebook and can put a couple of people on maintenance duty. It's overkill if you are Styloot (my stealth mode startup, currently smaller than Backtype).
The data I'm loading is stuff like tags - e.g., <itemid>\t<tagid>. In human terms, "Dress A has a ruched collar." Mapreduce can handle data like this, even when it comes unordered.
The data I'm reading is computational results based on the loaded data - e.g., an index: <tagid>\t[<itemid1>, <itemid2>, ...] (where each itemid has been tagged with tagid). E.g., "here are all the dresses with a ruched collar."
(Actually, we do considerably more than this, nor do we need Hadoop for an index. But an index is the simplest example I could give.)
The original data is very boring. It's only after aggregation and calculation that it becomes worth reading.