
Writing a new implementation of batch processing misses the main point: batch processing in general is just a trade-off made in order to get things started. This is why new generations of systems are going the way of online and/or stream processing.



Have you actually seen this happening?

I have terabytes of new data arriving per day that need to be verified, ingested, and translated. There is simply no way to do this via online or stream processing. If you're ingesting clickstream or Twitter data, then sure, it will work. But more often than not you need to work with sets of data, and for that batch processing is the only option.


JD @ Pachyderm here. We try to talk to as many people as possible about their data infrastructure, so I have a decent sample size on this. My experience is that there are specific use cases for which streaming solutions are gaining ground, but there are still a lot for which they aren't and probably never will be. Streaming is great, but there are a lot of situations where it just doesn't fit. I think what might be going on is that there isn't a lot of chatter around batch processing, so it seems like it's going out of style, but the use cases are still real.


What is it about stream processing that doesn't fit your needs? I do research and development on a streaming system, so I'm very interested to hear about how you need to use your data.


So, I've got to ask...what do you actually get for that data? Like, what does your business actually use it for?

Are you folks doing sensor analysis or something, or eating logfiles, or what?


Financial Services with millions of customers (and growing rapidly).

What we get from that data is a major competitive advantage. We can offer much cheaper financial products since we model risk individually rather than as a cohort. It also allows us to have a single customer view despite reselling other companies' products.

Building a single customer view with lots of disparate data sets is a big trend right now.


Ah, okay, so that makes sense. I've seen a few cases where data is just indiscriminately hoovered because reasons, and I can never help but wonder what the expected ROI on it is.


Hi, cofounder / lead Pachyderm dev here. Pachyderm actually does a lot more with MapReduce than batch processing; we need to work on making this clearer in our marketing. The way this works under the hood is pretty cool, though. Pfs is a commit-based filesystem, which means we think of the entire world as diffs. When your entire world is diffs, you can transition gracefully between batch processing and stream processing because you always know exactly what's changed.
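
Roughly, the shape of the idea as a toy Go sketch (hypothetical names, not the actual pfs API): the same processing function works on a whole history of commits (batch) or just the newest one (streaming), because either way it only ever sees diffs.

    // Toy sketch of "everything is diffs" (hypothetical, not the real pfs API).
    package main

    import "fmt"

    // Commit is a stand-in for a pfs commit: the set of records that changed.
    type Commit struct {
        ID      int
        Changed []string
    }

    // process handles exactly the records that changed, whether it's fed
    // a whole history (batch) or a single new commit (streaming).
    func process(commits []Commit) {
        for _, c := range commits {
            for _, rec := range c.Changed {
                fmt.Printf("commit %d: processing %s\n", c.ID, rec)
            }
        }
    }

    func main() {
        history := []Commit{
            {ID: 1, Changed: []string{"a", "b"}},
            {ID: 2, Changed: []string{"c"}},
        }
        process(history)                  // batch: replay all diffs
        process(history[len(history)-1:]) // stream: only the latest diff
    }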

This is generally one of the things that's frustrated us about the Hadoop ecosystem. Hadoop's implementation of MapReduce, which only supports batch processing, has become conflated with MapReduce as a paradigm, which can be used for way more than just batch processing.

Hope this clears a few things up!


So you're trying to build a realtime MapReduce? As for the diff-vs-batch tradeoff, it really depends on what the performance penalty of moving to a stream of diffs is compared to a batch. If it's uniformly better than batch processing, then you've also just invented a better batch-processing transport.


MapReduce is actually an incredibly flexible paradigm. If you have a clean implementation of MapReduce with well-defined interfaces, you can make it behave in a number of different ways by putting it on top of different storage engines. We've built a MapReduce engine as well as the commit-based storage that lets us slide gracefully from streaming to batched.
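
A rough sketch of what I mean by well-defined interfaces (hypothetical Go types, not our actual code): the Mapper/Reducer contract stays fixed, while the storage engine underneath decides whether records arrive as a batch scan or a stream of diffs.

    // Minimal MapReduce interfaces, decoupled from any storage engine
    // (hypothetical types for illustration, not Pachyderm's API).
    package mr

    // KV is a key/value pair flowing between the map and reduce phases.
    type KV struct {
        Key   string
        Value string
    }

    // Mapper and Reducer define the paradigm itself.
    type Mapper interface {
        Map(record string, emit func(KV))
    }

    type Reducer interface {
        Reduce(key string, values []string, emit func(KV))
    }

    // Storage abstracts where records come from; swapping implementations
    // is what lets the same job run batched or streaming.
    type Storage interface {
        Records() <-chan string
    }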

Our storage layer is built on top of btrfs. We haven't put together comprehensive benchmarks yet, but our experience with btrfs is that there isn't a meaningful penalty for reading a batch of data from a commit-based system compared to more traditional storage layers. I really wish I had concrete performance numbers to give you, but we haven't had time to put them together. I will say that our observations match btrfs' reported performance numbers and are what I'd expect given the algorithmic complexity of their system.
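
For the curious, the underlying btrfs primitives look roughly like this (a sketch with hypothetical paths, not our actual code): a read-only snapshot freezes a commit, and `btrfs send -p` emits only the diff against the parent snapshot.

    // Sketch of the btrfs primitives behind commit-based storage
    // (hypothetical paths; not Pachyderm's actual implementation).
    package main

    import (
        "log"
        "os"
        "os/exec"
    )

    func main() {
        // Freeze the current state of the repo as read-only commit 2.
        snap := exec.Command("btrfs", "subvolume", "snapshot", "-r",
            "/data/repo", "/data/commit2")
        snap.Stderr = os.Stderr
        if err := snap.Run(); err != nil {
            log.Fatal(err)
        }

        // Stream only what changed between commit 1 and commit 2.
        diff, err := os.Create("/tmp/commit2.diff")
        if err != nil {
            log.Fatal(err)
        }
        defer diff.Close()

        send := exec.Command("btrfs", "send", "-p", "/data/commit1", "/data/commit2")
        send.Stdout = diff // binary diff stream, replayable with `btrfs receive`
        send.Stderr = os.Stderr
        if err := send.Run(); err != nil {
            log.Fatal(err)
        }
    }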


So pretty much everyone here mentions Hadoop or Spark, but even Spark Streaming is not real-time; it's glorified micro-batches. Am I missing something here, or has no one in this thread considered Storm or LinkedIn's Samza for true real-time stream parsing / ETL-type jobs? Seems like a slam dunk to me. Bonus points that they both use Kafka.
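
For example, a bare-bones record-at-a-time consumer with a Go Kafka client (segmentio/kafka-go; broker and topic names are made up) has no batch window anywhere, which is the processing model Storm and Samza build on top of Kafka:

    // Record-at-a-time Kafka consumer sketch (hypothetical broker/topic).
    package main

    import (
        "context"
        "log"

        "github.com/segmentio/kafka-go"
    )

    func main() {
        r := kafka.NewReader(kafka.ReaderConfig{
            Brokers: []string{"localhost:9092"},
            GroupID: "etl-workers",
            Topic:   "clickstream",
        })
        defer r.Close()

        for {
            msg, err := r.ReadMessage(context.Background())
            if err != nil {
                log.Fatal(err)
            }
            // Transform/route each event individually: true per-record ETL.
            log.Printf("offset %d: %s", msg.Offset, msg.Value)
        }
    }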


Spark Streaming will handle 0.5-second batches. If you deeply care about sub-second latency, the JVM in general is not where you want to be.
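
To illustrate the difference (a generic Go sketch, nothing to do with Spark's actual API): a micro-batch loop buffers events and flushes on a timer, so end-to-end latency can never beat the batch interval.

    // Toy micro-batch loop: events are buffered, then processed together
    // each time the batch interval elapses.
    package main

    import (
        "fmt"
        "time"
    )

    func main() {
        events := make(chan string)
        go func() {
            for i := 0; ; i++ {
                events <- fmt.Sprintf("event-%d", i)
                time.Sleep(100 * time.Millisecond)
            }
        }()

        ticker := time.NewTicker(500 * time.Millisecond) // the batch interval
        var batch []string
        for {
            select {
            case e := <-events:
                batch = append(batch, e) // buffered, not yet processed
            case <-ticker.C:
                fmt.Println("processing micro-batch:", batch)
                batch = nil
            }
        }
    }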



