
Hi, cofounder / lead Pachyderm dev here. Pachyderm actually does a lot more with MapReduce than batch processing. We need to work on making this clearer in our marketing. The way this works under the hood is pretty cool, though. Pfs is a commit-based filesystem, which means that we think of the entire world as diffs. When your entire world is diffs, you can transition gracefully between batch processing and stream processing because you always know exactly what's changed.
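
To make the batch/stream point concrete, here's a rough sketch in Go of what a commit-based view of the world buys you. The Store, Diff, and memStore names below are made up for illustration (this is not pfs' actual API): the same processing code can be handed a full snapshot or only what changed between two commits.

    package main

    import "fmt"

    // Diff is the set of files added or modified between two commits.
    type Diff map[string][]byte

    // Store is a minimal, hypothetical commit-based filesystem interface.
    type Store interface {
        // DiffBetween returns everything that changed going from one commit
        // to another. An empty `from` yields the full contents, i.e. a batch.
        DiffBetween(from, to string) Diff
    }

    // process only ever sees "what changed", so it works the same whether it's
    // handed a full snapshot (batch) or an incremental diff (stream).
    func process(d Diff) {
        for path, data := range d {
            fmt.Printf("processing %s (%d bytes)\n", path, len(data))
        }
    }

    // memStore is a toy in-memory Store for the example.
    type memStore struct {
        order   []string        // commit IDs in order
        commits map[string]Diff // commit ID -> files written in that commit
    }

    func (s *memStore) DiffBetween(from, to string) Diff {
        out := Diff{}
        collecting := from == ""
        for _, id := range s.order {
            if collecting {
                for p, b := range s.commits[id] {
                    out[p] = b
                }
            }
            if id == from {
                collecting = true
            }
            if id == to {
                break
            }
        }
        return out
    }

    func main() {
        s := &memStore{
            order: []string{"c1", "c2"},
            commits: map[string]Diff{
                "c1": {"logs/a": []byte("hello")},
                "c2": {"logs/b": []byte("world")},
            },
        }
        process(s.DiffBetween("", "c2"))   // batch: everything up to c2
        process(s.DiffBetween("c1", "c2")) // stream: only what c2 changed
    }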

This is generally one of the things that's frustrated us about the Hadoop ecosystem. Hadoop's implementation of MapReduce, which only supports batch processing, has become conflated with MapReduce as a paradigm, which can be used for way more than just batch processing.

Hope this clears a few things up!




So you're trying to build a realtime MapReduce? As for the diff-vs-batch tradeoff, it really depends on what the performance penalty of moving to a stream of diffs is going to be compared to a batch. If it's uniformly better than batch processing, then you've also just invented a better batch processing transport.


MapReduce is actually an incredibly flexible paradigm. If you have a clean implementation of MapReduce with well-defined interfaces, you can make it behave in a number of different ways by putting it on top of different storage engines. We've built a MapReduce engine as well as the commit-based storage that lets us slide gracefully from streaming to batch.
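
Here's a hypothetical sketch of that separation in Go (the Input, Mapper, and Reducer names are illustrative, not our actual engine's API): the engine is written against a generic input source, so a batch store and a diff-based store plug in interchangeably.

    package main

    import (
        "fmt"
        "strings"
    )

    // KV is a key/value pair emitted by the map phase.
    type KV struct {
        Key   string
        Value int
    }

    // Input abstracts the storage layer: a batch store and a diff-based store
    // both just hand records to the engine.
    type Input interface {
        Records() []string
    }

    type Mapper func(record string) []KV
    type Reducer func(key string, values []int) int

    // run is storage-agnostic: swap in a batch Input or a diff Input and the
    // MapReduce logic is untouched.
    func run(in Input, m Mapper, r Reducer) map[string]int {
        grouped := map[string][]int{}
        for _, rec := range in.Records() {
            for _, kv := range m(rec) {
                grouped[kv.Key] = append(grouped[kv.Key], kv.Value)
            }
        }
        out := map[string]int{}
        for k, vs := range grouped {
            out[k] = r(k, vs)
        }
        return out
    }

    // sliceInput stands in for either kind of store in this example.
    type sliceInput []string

    func (s sliceInput) Records() []string { return s }

    func main() {
        wordCountMap := func(rec string) []KV {
            var out []KV
            for _, w := range strings.Fields(rec) {
                out = append(out, KV{w, 1})
            }
            return out
        }
        sum := func(_ string, vs []int) int {
            total := 0
            for _, v := range vs {
                total += v
            }
            return total
        }
        fmt.Println(run(sliceInput{"a b a", "b c"}, wordCountMap, sum))
    }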

Our storage layer is built on top of btrfs. We haven't put together comprehensive benchmarks yet, but our experience with btrfs is that there isn't a meaningful penalty to reading a batch of data from a commit-based system compared to more traditional storage layers. I really wish I had concrete performance numbers to give you, but we haven't had the time to put them together yet. I will say that our observations match btrfs' reported performance numbers and are what I'd expect given the algorithmic complexity of their system.
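
For a sense of why btrfs makes this cheap: one way to map commits onto btrfs is to snapshot the working subvolume per commit, since snapshots are copy-on-write. The sketch below is just that idea in Go (the paths are placeholders, and this is not our actual on-disk layout); `btrfs subvolume snapshot -r` is the real CLI command it shells out to.

    package main

    import (
        "fmt"
        "os/exec"
    )

    // commit snapshots the live subvolume into an immutable, named commit.
    // Because the snapshot is copy-on-write, "committing" doesn't rewrite
    // data, and reading a batch is just reading files out of a snapshot.
    func commit(workDir, commitDir string) error {
        out, err := exec.Command(
            "btrfs", "subvolume", "snapshot", "-r", workDir, commitDir,
        ).CombinedOutput()
        if err != nil {
            return fmt.Errorf("snapshot failed: %v: %s", err, out)
        }
        return nil
    }

    func main() {
        if err := commit("/pfs/repo/work", "/pfs/repo/commits/c42"); err != nil {
            fmt.Println(err)
        }
    }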


So pretty much everyone here mentions Hadoop or Spark, but even Spark Streaming isn't realtime, just glorified microbatches. Am I missing something here, or has no one in this thread considered Storm or LinkedIn's Samza for true real-time stream parsing / ETL type jobs? Seems like the slam dunk to me. Bonus points that they both use Kafka.
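
To illustrate the per-event (rather than microbatch) shape being described: Storm and Samza are JVM frameworks, so this isn't their API; it's just a plain Kafka consumer sketch in Go using the github.com/segmentio/kafka-go client, with placeholder broker/topic/group names, handling each record as it arrives.

    package main

    import (
        "context"
        "fmt"
        "log"

        "github.com/segmentio/kafka-go"
    )

    func main() {
        // Consume one message at a time from a Kafka topic -- latency is per
        // message, not per batch interval.
        r := kafka.NewReader(kafka.ReaderConfig{
            Brokers: []string{"localhost:9092"},
            GroupID: "etl-example",
            Topic:   "events",
        })
        defer r.Close()

        for {
            m, err := r.ReadMessage(context.Background())
            if err != nil {
                log.Fatal(err)
            }
            // Each record is processed as soon as it's read.
            fmt.Printf("offset %d: %s = %s\n", m.Offset, m.Key, m.Value)
        }
    }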


Spark Streaming will handle 0.5-second batches. If you deeply care about sub-second latency, the JVM in general is not where you want to be.




