
Questioning the Lambda Architecture - boredandroid
http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html
======
mashraf
I have never used Samza but have built similar pipelines using Kafka,
Storm, Hadoop, etc. In my experience you almost always have to write your
transformation logic twice, once for batch and once for real time, and in
that case Jay's setup looks exactly like the Lambda Architecture, with your
stream processing framework doing both the real-time and the batch
computation.

Using a stream processing framework like Storm may be fine when you are
running exactly the same code for both real time and batch, but it breaks
down in more complex cases where the code is not exactly the same. Say we
need to calculate the top-K trending items over the last 30 minutes, one
day, and one week. We also know that a simple count will always make socks
and underwear trend for an e-commerce shop, and Justin Bieber and Lady Gaga
for Twitter ([http://goo.gl/1SColQ](http://goo.gl/1SColQ)). So we use a
count-min sketch for real time and a slightly more complex ML algorithm for
batch using Hadoop, and merge the results in the end. IMO, training and
running complex ML is not currently feasible on the streaming frameworks we
have today, so we can't use them for both real time and batch.
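
For reference, the real-time half of that split can be as small as a
count-min sketch; here is a minimal Java version (a generic illustration of
the data structure, not our actual pipeline code):

    import java.util.Random;

    // Count-min sketch: fixed memory, approximate counts that can only
    // overestimate. A generic illustration, not production code.
    public class CountMinSketch {
        private final long[][] counts;
        private final int[] seeds;
        private final int width;

        public CountMinSketch(int depth, int width) {
            this.width = width;
            this.counts = new long[depth][width];
            this.seeds = new int[depth];
            Random rng = new Random(42);
            for (int row = 0; row < depth; row++) {
                seeds[row] = rng.nextInt();
            }
        }

        // Simple seeded hash; real implementations use pairwise-independent
        // hash families so the error guarantees hold.
        private int bucket(String item, int row) {
            return Math.floorMod(item.hashCode() ^ seeds[row], width);
        }

        public void add(String item) {
            for (int row = 0; row < counts.length; row++) {
                counts[row][bucket(item, row)]++;
            }
        }

        // The estimate is the minimum over all rows.
        public long estimate(String item) {
            long min = Long.MAX_VALUE;
            for (int row = 0; row < counts.length; row++) {
                min = Math.min(min, counts[row][bucket(item, row)]);
            }
            return min;
        }
    }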

edited for typos.

~~~
boredandroid
I am the author of the post.

I think in cases where you are running totally different computations in
different systems the Lambda architecture may make a lot of sense.

However one assumption you may be making is that the stream processing system
must be limited to non-blocking, in-memory computations like sketches. A
common pattern for people using Samza is actually to accumulate a large window
of data and then rank using a complex brute force algorithm that may take 5
mins or so to produce results.
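
Roughly, that pattern looks like this with Samza's low-level task API (the
window interval comes from task.window.ms in the job config; the class and
topic names here are illustrative):

    import java.util.HashMap;
    import java.util.Map;

    import org.apache.samza.system.IncomingMessageEnvelope;
    import org.apache.samza.system.OutgoingMessageEnvelope;
    import org.apache.samza.system.SystemStream;
    import org.apache.samza.task.MessageCollector;
    import org.apache.samza.task.StreamTask;
    import org.apache.samza.task.TaskCoordinator;
    import org.apache.samza.task.WindowableTask;

    // Accumulate in process(), do the expensive (possibly blocking) ranking
    // in window(), which Samza calls on the interval set by task.window.ms.
    public class AccumulateAndRankTask implements StreamTask, WindowableTask {
        private static final SystemStream OUTPUT =
            new SystemStream("kafka", "top-items"); // illustrative topic
        private final Map<String, Long> counts = new HashMap<>();

        @Override
        public void process(IncomingMessageEnvelope envelope,
                            MessageCollector collector,
                            TaskCoordinator coordinator) {
            String item = (String) envelope.getMessage();
            counts.merge(item, 1L, Long::sum);
        }

        @Override
        public void window(MessageCollector collector,
                           TaskCoordinator coordinator) {
            // Arbitrarily expensive ranking over the accumulated window.
            counts.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .limit(10)
                .forEach(e -> collector.send(new OutgoingMessageEnvelope(
                    OUTPUT, e.getKey() + ":" + e.getValue())));
            counts.clear();
        }
    }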

One of the points I was hoping to make is that many of the limitations people
think stream processing systems must have (e.g. can never block, can't process
large windows of data, can't manage lots of state) have nothing to do with the
stream processing model and are just weaknesses of the frameworks they have
used.

~~~
kiyoto
>A common pattern for people using Samza is actually to accumulate a large
window of data and then rank using a complex brute force algorithm that may
take 5 mins or so to produce results.

I think the argument eventually comes down to what people mean by "batch" and
"stream". Some people might describe the aforementioned Samza use case as
(micro-)batch as opposed to stream processing.

All in all, I found your post insightful. If a system with fewer moving parts
can handle the same data processing requirements, that will be appealing to
the majority of users.

~~~
boredandroid
Yes, totally. In my definition the difference is that a stream processing
system lets you define the frequency with which output is produced rather
than forcing it to be "at the end of the data". This doesn't preclude
blocking operations.

------
haddr
While I really liked the main idea of the Lambda Architecture, this article
answers the doubts I had about it. It really is more reasonable to have
another stream processing pipeline for treating historical data, rather
than a separate Hadoop or other map-reduce framework. In case you need to
reprocess the data, you instantiate another pipeline and stream the
historical data through it. Clever!
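
In Kafka terms the reprocessing job is just a second consumer that starts
from the beginning of the retained log. A rough sketch with the modern Java
consumer (the topic, group, and output names are made up):

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;

    // A second copy of the job replays the retained log from offset zero
    // and writes to a new output table while the live job keeps serving.
    public class ReprocessingJob {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("group.id", "reprocess-v2"); // fresh group, fresh offsets
            props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer =
                     new KafkaConsumer<>(props)) {
                TopicPartition partition = new TopicPartition("events", 0);
                consumer.assign(Collections.singletonList(partition));
                consumer.seekToBeginning(Collections.singletonList(partition));

                while (true) {
                    ConsumerRecords<String, String> records =
                        consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, String> record : records) {
                        // Apply the single copy of the transformation logic
                        // and write to the new output table, e.g. results_v2.
                    }
                }
            }
        }
    }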

------
vicpara
The ideas presented in the article aren't new. Now we just use all these
hipsterish technologies that we hope will magically solve our problems by
just sticking one's output into another's input. If we think about what
happens on a single machine while processing data, we have had exactly the
same problems for decades. How do you process two CDs' worth of data when
you only have a 486 with 4/8/16 MB of RAM?

\- Historically, the data rarely (if ever) fit into memory and was at least
100x larger than it.

\- If we want to keep it for the long run we need to store it on disk.
Smarter or dumber, compact or verbose, you have to do it.

\- We have to make sure we spend the little CPU time we have on processing
data, not juggling it. Map-reduce jobs take ages to initialize and burn CPU
just to read and write file partitions.

\- If you have a long processing pipeline there are two major concepts we
use: buffers and pumps. Files, caches, and DBs act as buffers. Kafka is
essentially a pump with a buffer.

\- When you process data, depending on what you compute, you may or may not
need multiple passes through it. ML and AI most of the time need multiple
passes. Descriptive stats, with some smart math tricks, can avoid a second
pass (see the sketch after this list). This variable number of passes is
the party pooper in stream analytics. In cryptography they solved the
problem by breaking the stream into blocks of equal size. That makes sense
for raw data because it is assembled back using buffers and pumps at some
upper layer. Mathematically and statistically, though, it doesn't make
sense to randomly split data into chunks and apply your choice of algos.

\- I still don't understand why so many of us rely on out-of-the-box
solutions instead of solving the specific problems we have on our own. Why
wouldn't a developer stick his Java code directly into the pipeline to pull
data from Kafka and do his bespoke magic? It would be super fast because it
is very specific and does exactly one job. Yes, there will be maintenance
time, but all solutions require that time. Instead of debugging Apache
Hadoop/Spark/Tez code you debug your own.
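
For example, Welford's algorithm computes mean and variance in a single
pass; it's the kind of smart math trick I mean above (illustrative Java):

    // Welford's one-pass algorithm: running mean and variance without a
    // second pass over the data.
    public class RunningStats {
        private long n = 0;
        private double mean = 0.0;
        private double m2 = 0.0; // sum of squared deviations from the mean

        public void add(double x) {
            n++;
            double delta = x - mean;
            mean += delta / n;
            m2 += delta * (x - mean);
        }

        public double mean() { return mean; }

        public double sampleVariance() {
            return n > 1 ? m2 / (n - 1) : 0.0;
        }
    }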

What is mentioned above just scratches the surface of the knobs and tuning
points of a processing pipeline. These are decisions we need to make, and
expecting fast-food solutions to make them for us is completely
unrealistic.

------
ntoshev
There seems to be a fundamental trade-off between latency and throughput,
with stream processors optimizing for latency and batch processors
optimizing for throughput. The post says stream processors are actually
used for reprocessing in production at LinkedIn, and the simplification
might well be worth the slowdown, but do they know how much slower this is
than using Hadoop?

~~~
boredandroid
This is an excellent point: there IS a fundamental tradeoff between latency
and throughput. Computers are much better at processing large chunks of
data linearly than at processing small bits of data. However, the question
to ask is how much data you have to batch together to get good throughput.
There are diminishing returns, and in my experience you stop getting much
benefit after about 1 MB. So this line of reasoning will not get you to
additional latency of more than a few seconds; it definitely won't get you
to a 24-hour batch data cycle.

Actually, because we think this tradeoff is important, we let Kafka
consumers specify how much data they want accumulated on the server before
the server completes their fetch request. The default is 1 byte, which
means "respond as soon as there is anything new for me" and optimizes
latency. Setting it to a higher number reduces round-trips and optimizes
throughput by avoiding lots of small fetches. This lets you trade a small
amount of latency for throughput. (In either case you can bound the waiting
with a timeout, so the latency never exceeds some maximum delay even if
sufficient data hasn't arrived.)
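
In consumer configuration terms the knob looks roughly like this (property
names are from the current Java consumer; the values are just illustrative):

    import java.util.Properties;

    import org.apache.kafka.clients.consumer.KafkaConsumer;

    // Trading a bounded amount of latency for throughput on the consumer.
    public class ThroughputTunedConsumer {
        public static KafkaConsumer<String, String> create() {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("group.id", "throughput-tuned");
            props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
            // Don't complete a fetch until ~1 MB has accumulated (the
            // default of 1 byte means "respond as soon as anything
            // arrives", which optimizes latency instead)...
            props.put("fetch.min.bytes", 1048576);
            // ...but never wait more than 500 ms, so latency stays bounded.
            props.put("fetch.max.wait.ms", 500);
            return new KafkaConsumer<>(props);
        }
    }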

In any case, when doing reprocessing as described in the article, there
will always be lots of data accumulated, so you will always fetch chunks of
the maximum size you have configured (say 1 MB). In the reprocessing case
you therefore always get batch-like throughput.

