Using a stream processing framework like Storm may be fine when you are running exactly the same code for both real time and batch, but it breaks down in more complex cases where the code is not the same. Say we need to calculate the top-K trending items over the last 30 minutes, one day, and one week. We also know that a simple count will always make socks and underwear trend for an e-commerce shop, and Justin Bieber and Lady Gaga for Twitter (http://goo.gl/1SColQ). So we use a count-min sketch for real time and a slightly more complex ML algorithm for batch using Hadoop, and merge the results at the end. IMO, training and running complex ML is not currently feasible on the streaming frameworks we have today, so we can't use them for both real time and batch.
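For reference, the real-time half of that pipeline can be sketched with a minimal count-min sketch. This is a toy illustration in plain Python, not any framework's API; the width/depth and the md5-based hashing are arbitrary choices for the example:

```python
import hashlib

class CountMinSketch:
    """Minimal count-min sketch: estimates never undercount the true frequency."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        # One hash per row, derived by salting the item with the row index.
        for row in range(self.depth):
            digest = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
            yield row, int(digest, 16) % self.width

    def add(self, item, count=1):
        for row, col in self._cells(item):
            self.table[row][col] += count

    def estimate(self, item):
        # The minimum over rows bounds the damage from hash collisions.
        return min(self.table[row][col] for row, col in self._cells(item))

sketch = CountMinSketch()
for item in ["socks"] * 100 + ["underwear"] * 60 + ["gadget"] * 5:
    sketch.add(item)
```

Because the sketch only ever over-counts, per-window sketches can be summed cell-by-cell to answer the 30-minute/one-day/one-week variants from the same structure.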
I think in cases where you are running totally different computations in different systems the Lambda architecture may make a lot of sense.
However, one assumption you may be making is that the stream processing system must be limited to non-blocking, in-memory computations like sketches. A common pattern for people using Samza is actually to accumulate a large window of data and then rank using a complex brute-force algorithm that may take 5 mins or so to produce results.
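As an illustration of that accumulate-then-rank pattern (plain Python, not Samza's actual API), the job buffers a whole window and only then runs the expensive ranking step:

```python
from collections import Counter

def rank_window(events, k=10):
    """Brute-force ranking over a fully accumulated window.

    In a real job this would fire when the window closes (say, every 30
    minutes) and could afford a slow algorithm; a full count-and-sort
    stands in for it here.
    """
    return [item for item, _ in Counter(events).most_common(k)]

# A toy window of accumulated events.
window = ["socks"] * 3 + ["gadget"] * 5 + ["book"]
```

The point is that nothing in the streaming model forbids blocking for minutes inside the window-close callback; only some frameworks do.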
One of the points I was hoping to make is that many of the limitations people think stream processing systems must have (e.g. can never block, can't process large windows of data, can't manage lots of state) have nothing to do with the stream processing model and are just weaknesses of the frameworks they have used.
I think the argument eventually comes down to what people mean by "batch" and "stream". Some people might describe the aforementioned Samza use case as (micro-)batch as opposed to stream processing.
All in all, I found your post insightful. If a system with fewer moving parts can handle the same data processing requirements, that will be appealing to the majority of users.
Have you had a look at Samoa? It is a streaming machine learning library for Storm and S4.
I understand that MLlib is intended strictly as a batch data library.
- Historically, the data rarely (if ever) fit into memory; it was typically at least 100x larger than RAM.
- If we want to keep it for the long run, we need to store it on disk. Smarter or dumber, compact or verbose .. you have to do it.
- We have to make sure we spend the little CPU time we have on processing the data, not juggling it. Map-Reduce jobs take ages to initialize and burn CPU just reading and writing file partitions.
- If you have a long processing pipeline, there are two major concepts in play: buffers and pumps. Files, caches, and DBs act as buffers. Kafka is essentially a pump with a buffer;
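The buffer/pump idea can be sketched with the standard library alone: queues as buffers, a thread as the pump moving data between them at its own pace. The `None`-as-shutdown-signal convention is just an assumption for this example:

```python
import queue
import threading

def pump(source: "queue.Queue", sink: "queue.Queue") -> None:
    """Pump: drains one buffer into the next; None signals shutdown."""
    while True:
        item = source.get()
        if item is None:
            sink.put(None)  # propagate shutdown downstream
            break
        sink.put(item)

# Two buffers connected by a pump running concurrently.
upstream, downstream = queue.Queue(), queue.Queue()
worker = threading.Thread(target=pump, args=(upstream, downstream))
worker.start()
for x in [1, 2, 3, None]:
    upstream.put(x)
worker.join()
```

Bounded queues (`queue.Queue(maxsize=...)`) add backpressure: a fast producer blocks instead of overrunning a slow consumer, which is exactly what the disk-backed buffers in a real pipeline give you.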
- When you process data, depending on what you compute, you may or may not need multiple passes through the data. ML and AI most of the time need them. Descriptive stats can avoid a second pass with some smart math tricks. This variable number of passes is the party pooper in stream analytics. In cryptography they solved the problem by breaking the stream down into blocks of equal size. That makes sense for raw data because it gets assembled back together, using buffers and pumps, at some upper layer. Mathematically and statistically, though, it doesn't make sense to randomly split data into chunks and apply your algorithm of choice to each.
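One of those "smart math tricks": Welford's online algorithm computes mean and variance in a single pass, where the textbook definition of variance needs one pass for the mean and a second for the deviations:

```python
def one_pass_stats(stream):
    """Welford's algorithm: running mean and population variance in one pass."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in stream:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)  # accumulates sum of squared deviations
    return mean, m2 / n

mean, var = one_pass_stats([2, 4, 4, 4, 5, 5, 7, 9])
```

This is why descriptive stats stream easily while most ML training, which iterates over the data many times, does not.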
- I still don't understand why so many of us rely on out-of-the-box solutions instead of trying to solve the problems we each have, specifically, on our own. Why wouldn't a developer stick his Java code directly into the pipeline to suck data from Kafka and do his bespoke magic? It would be super fast because it is very specific and does exactly one job. Yes, there will be maintenance time, but all solutions require that time. Instead of debugging Apache Hadoop/Spark/Tez code you debug your own.
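In that spirit, the "bespoke magic" often amounts to a few lines of plain code. Here is a stand-in stage (Python rather than Java, with the Kafka consumption itself elided; the input is just an iterable of raw records, and the error-filtering job is invented for the example):

```python
import json

def bespoke_stage(raw_records):
    """Exactly one specific job: parse each record, keep only error events."""
    for raw in raw_records:
        event = json.loads(raw)
        if event.get("level") == "error":
            yield event["msg"]
```

Wired to a real Kafka consumer, `raw_records` would be the message values pulled off a topic; the stage itself needs no framework at all.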
What is mentioned above just scratches the surface of the knobs and tuning points in a processing pipeline. These are decisions we need to make ourselves, and expecting fast-food solutions to make them for us is a completely unrealistic expectation.
Actually, because we think this tradeoff is important, we allow consumers from Kafka to specify how much data they want accumulated on the server before the server should complete their fetch request. The default is 1 byte, meaning "respond as soon as there is anything new for me", which optimizes latency. However, setting this to a higher number will reduce round-trips and optimize throughput by avoiding lots of small fetches. This lets you trade a small amount of latency for throughput. (In either case you can bound the waiting with a timeout, so the latency is never worse than some maximum delay even if sufficient data hasn't arrived.)
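Concretely, the settings involved are the real Kafka consumer properties `fetch.min.bytes` and `fetch.max.wait.ms`; the broker address and group ids below are placeholders:

```python
# Latency-optimized: complete the fetch as soon as any new data exists.
latency_config = {
    "bootstrap.servers": "broker:9092",  # placeholder address
    "group.id": "realtime-consumer",
    "fetch.min.bytes": 1,      # the default: respond with whatever is there
    "fetch.max.wait.ms": 500,  # timeout bound, so latency has a ceiling
}

# Throughput-optimized: let ~1MB accumulate on the server before responding,
# trading a little latency for far fewer round-trips.
throughput_config = {**latency_config,
                     "group.id": "batch-consumer",
                     "fetch.min.bytes": 1_048_576}
```

Either dict could be handed to a Kafka consumer client; only `fetch.min.bytes` differs between the two modes.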
In any case, when doing reprocessing as described in this article, there will always be lots of data accumulated, and you will always fetch chunks of the maximum size you have configured (say 1MB). So in the reprocessing case you always get "batch"-like throughput.