That said, I'd love to see some code released, even if it isn't ready for primetime.
Our HBase cluster (3 boxes serving 30 human oracles, each submitting data at a rate of 1 record every 5-10 seconds) choked frequently - i.e., it stopped accepting new records. Ultimately I had to send the human data to Postgres and have a cron job flush it into HBase every half hour or so.
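For anyone curious what that workaround looks like, here is a rough sketch of the stage-then-flush pattern. It is not our actual code: sqlite3 stands in for Postgres so the example is self-contained, and `open_staging`, `submit`, `flush`, and the `bulk_write` callable (a placeholder for whatever writes a batch to HBase) are all hypothetical names.

```python
import sqlite3


def open_staging(path=":memory:"):
    """Open the staging store (sqlite3 here as a stand-in for Postgres)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS staged ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "item_id TEXT NOT NULL, "
        "tag_id TEXT NOT NULL)"
    )
    conn.commit()
    return conn


def submit(conn, item_id, tag_id):
    """Called once per human submission; only touches the staging table."""
    conn.execute(
        "INSERT INTO staged (item_id, tag_id) VALUES (?, ?)",
        (item_id, tag_id),
    )
    conn.commit()


def flush(conn, bulk_write):
    """Called periodically (e.g. from cron); hands staged rows to bulk_write.

    bulk_write is a hypothetical callable that takes a list of
    (item_id, tag_id) tuples. If it raises, nothing is deleted, so the
    same rows are retried on the next run.
    """
    rows = conn.execute(
        "SELECT id, item_id, tag_id FROM staged ORDER BY id"
    ).fetchall()
    if not rows:
        return 0
    bulk_write([(item_id, tag_id) for _, item_id, tag_id in rows])
    # Delete only what was just written; newer rows wait for the next flush.
    conn.execute("DELETE FROM staged WHERE id <= ?", (rows[-1][0],))
    conn.commit()
    return len(rows)
```

The point of the design is that the humans only ever wait on the small relational store, and the big store sees one batch every half hour instead of a steady trickle of single-record writes.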
I'll emphasize that this is probably my fault. I'm not claiming HBase doesn't scale to 30 concurrent users - clearly Facebook demonstrates it can. But I couldn't figure out how to make that happen. HBase is a complex system and I make no claim of understanding it.
ElephantDB + MaryJane are simple. There is almost nothing that can go wrong - put together they probably amount to 5000 lines of code and have as many as 10 minimally interacting configuration options. The effort required to manage them is minimal - I had EDB working flawlessly in less than a day.
HBase is an enterprise tool. It works well if you are Facebook and can put a couple of people on maintenance duty. It's overkill if you are Styloot (my stealth mode startup, currently smaller than Backtype).
The data I'm loading is stuff like tags - e.g., <itemid>\t<tagid>. In human terms, "Dress A has a ruched collar." MapReduce can handle data like this, even when it arrives unordered.
The data I'm reading is computational results based on the loaded data - e.g., an index: <tagid>\t[<itemid1>, <itemid2>, ...] (where each itemid has been tagged with tagid). E.g., "here are all the dresses with a ruched collar."
(Actually, we do considerably more than this, and we don't need Hadoop just to build an index. But an index is the simplest example I could give.)
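To make the index example concrete, here is a toy, single-machine sketch of the computation in plain Python. It is not what we actually run; `build_index` and `format_index` are made-up names, and the item and tag ids are invented for illustration.

```python
from collections import defaultdict


def build_index(lines):
    """Turn tab-separated 'itemid, tagid' records into {tagid: [itemid, ...]}."""
    index = defaultdict(set)  # a set drops duplicate (item, tag) records
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        item_id, tag_id = line.split("\t")
        index[tag_id].add(item_id)
    # Sort so the result does not depend on the order records arrived in.
    return {tag_id: sorted(items) for tag_id, items in index.items()}


def format_index(index):
    """Render the index as lines of the form: tagid, tab, [itemid1, itemid2, ...]."""
    return [
        f"{tag_id}\t[{', '.join(items)}]"
        for tag_id, items in sorted(index.items())
    ]


# Unordered input, as it would come from the human oracles.
records = [
    "dress_a\truched_collar",
    "dress_c\tsleeveless",
    "dress_b\truched_collar",
]
index = build_index(records)
```

Looking up `index["ruched_collar"]` then answers "here are all the dresses with a ruched collar." In a MapReduce job the grouping by tag id would be done by the shuffle rather than by an in-memory dict.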
The original data is very boring. It's only after aggregation and calculation that it becomes worth reading.