I have terabytes of new data arriving per day that needs to be verified, ingested and translated. There simply is no way to do this via online or stream processing. If you are ingesting clickstream or Twitter data then sure, it will work. But more often than not you need to work with sets of data, and for that batch processing is the only option.
Are you folks doing sensor analysis or something, or eating logfiles, or what?
What we can get for that data is a major competitive advantage. We can offer much cheaper financial products since we model risk individually rather than as a cohort. It also allows us to have a single customer view despite reselling other companies' products.
Building a single customer view with lots of disparate data sets is a big trend right now.
This is generally one of the things that's frustrated us about the Hadoop ecosystem. Hadoop's implementation of MapReduce, which only supports batch processing, has become conflated with MapReduce as a paradigm, which can be used for way more than just batch processing.
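To make that distinction concrete, here is a minimal sketch in Python. It is a toy, not Hadoop's API or any particular system's implementation, and every name in it (`map_words`, `reduce_counts`, `run_batch`, `run_incremental`) is made up for illustration. The point is only that the same map and reduce functions can run over a whole data set at once or be applied as each record arrives.

```python
from collections import Counter
from functools import reduce


def map_words(line):
    """Map step: emit one (word, 1) pair per word in a line."""
    return [(word.lower(), 1) for word in line.split()]


def reduce_counts(totals, pair):
    """Reduce step: fold one (word, count) pair into the running totals."""
    word, count = pair
    totals[word] += count
    return totals


def run_batch(lines):
    """Batch mode: map the whole data set, then reduce it in one pass."""
    pairs = [pair for line in lines for pair in map_words(line)]
    return reduce(reduce_counts, pairs, Counter())


def run_incremental(line_stream):
    """Incremental mode: the same map and reduce, applied per arriving line."""
    totals = Counter()
    for line in line_stream:
        for pair in map_words(line):
            reduce_counts(totals, pair)
        yield dict(totals)  # snapshot of the totals after this line


# Hypothetical input, purely for illustration.
lines = ["error disk full", "warning disk slow", "error network down"]

batch_result = run_batch(lines)
snapshots = list(run_incremental(iter(lines)))

# The final incremental snapshot matches the batch answer.
assert snapshots[-1] == dict(batch_result)
```

The incremental version has a usable answer after every line, while the batch version only has one at the end, yet nothing about the map or reduce functions themselves had to change.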
Hope this clears a few things up!
Our storage layer is built on top of btrfs. We haven't put together comprehensive benchmarks yet, but our experience with btrfs is that there isn't a meaningful penalty to reading a batch of data from a commit-based system compared to more traditional storage layers. I really wish I had concrete performance numbers to give you, but we haven't had the time to put them together yet. I will say that our observations match btrfs' reported performance numbers and are what I'd expect given the algorithmic complexity of their system.