
So you're trying to build a realtime MapReduce? As for the other part, the diff-vs-batch tradeoff really depends on the performance penalty of moving from a batch to a stream of diffs. If a stream of diffs is uniformly better than batch processing, then you've also just invented a better batch processing transport.



MapReduce is actually an incredibly flexible paradigm. If you have a clean implementation of MapReduce with well-defined interfaces, you can make it behave in a number of different ways by putting it on top of different storage engines. We've built a MapReduce engine as well as the commit-based storage that lets us slide gracefully from streaming to batched.
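To make the "slide from streaming to batched" idea concrete, here's a minimal sketch in Python. It is not our actual engine and has nothing to do with btrfs: `CommitStore`, `map_reduce`, `map_words` and `demo` are all hypothetical names made up for illustration. It assumes each commit is an append-only diff (a list of new records) and that the reducer is associative, which is what lets the same job run either over the whole history at once or one commit at a time.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Optional, Tuple


class CommitStore:
    """Hypothetical append-only store: each commit holds only the new records (a diff)."""

    def __init__(self) -> None:
        self.commits: List[List[str]] = []

    def commit(self, records: List[str]) -> int:
        """Append a diff and return its commit id."""
        self.commits.append(list(records))
        return len(self.commits) - 1

    def read_batch(self) -> Iterator[str]:
        """Batch view: every record from every commit, in commit order."""
        for diff in self.commits:
            yield from diff

    def read_diff(self, commit_id: int) -> Iterator[str]:
        """Streaming view: only the records added by one commit."""
        return iter(self.commits[commit_id])


def map_words(record: str) -> Iterator[Tuple[str, int]]:
    """Example mapper: emit (word, 1) for each whitespace-separated word."""
    for word in record.split():
        yield word, 1


def map_reduce(
    records: Iterable[str],
    mapper: Callable[[str], Iterable[Tuple[str, int]]],
    reducer: Callable[[int, int], int],
    state: Optional[Dict[str, int]] = None,
) -> Dict[str, int]:
    """Fold mapped values into a copy of `state` (empty if none is given)."""
    out = dict(state) if state else {}
    for record in records:
        for key, value in mapper(record):
            out[key] = reducer(out[key], value) if key in out else value
    return out


def add(a: int, b: int) -> int:
    return a + b


def demo() -> Tuple[Dict[str, int], Dict[str, int]]:
    store = CommitStore()
    ids = [store.commit(["a b a"]), store.commit(["b c"])]

    # Batched: one job over the full history.
    batch = map_reduce(store.read_batch(), map_words, add)

    # Streaming: one small job per commit, carrying the previous result forward.
    streamed: Dict[str, int] = {}
    for commit_id in ids:
        streamed = map_reduce(store.read_diff(commit_id), map_words, add, streamed)

    return batch, streamed
```

The engine code is identical in both modes; only the storage call that feeds it changes. If the reducer weren't associative, the per-commit results couldn't simply be folded into the running state, and the streaming path would need to recompute more.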

Our storage layer is built on top of btrfs. We haven't put together comprehensive benchmarks yet, but our experience with btrfs is that there isn't a meaningful penalty to reading a batch of data from a commit-based system compared to more traditional storage layers. I really wish I had concrete performance numbers to give you, but we haven't had the time to put them together yet. I will say that our observations match btrfs' reported performance numbers and are what I'd expect given the algorithmic complexity of their system.



