
Big Data Lambda Architecture - mjbellantoni
http://www.databasetube.com/database/big-data-lambda-architecture/
======
zacharyvoase
So, 'Big Data' has been simplified to 'Problems Solvable by MapReduce'?
_sigh_.

Not every problem can be reduced to a completely cacheable batch job,
trivially parallelizable across all of your data. 'Big Data' isn't about
breaking up your batch processing into three layers; it's about being smart
enough and knowledgeable enough in compsci, statistics, calculus, text
processing, regexes, machine learning, business analysis, _et cetera_, to
design an effective system which harvests _useful_ insights from a large bank
of atomic, messy, inconsistent data, with an appropriate level of availability
and consistency.

The real work is not in using or configuring Hadoop; it's in figuring out what
information would bring greater-than-marginal value to a business, and how to
compute that efficiently from an existing corpus of data.

There's no silver bullet. Remember?

EDIT: I think the following is particularly disingenuous: "The lambda
architecture solves the problem of computing arbitrary functions on arbitrary
data in real time by decomposing the problem into three layers"

This is such a ridiculous promise that it put me in a strongly skeptical mood
for the rest of the article.

~~~
gfodor
This is a lot of platitudes but doesn't provide any specific criticisms. Most
problems are "Solvable by MapReduce" since map/reduce != Hadoop but is simply
an abstraction for computing functions on data in general. The architecture
outlined here is a pattern for a generalized system for computing functions on
data in real time that has nice properties.

~~~
lvh
That it's an abstraction that can do anything isn't sufficient (and I'm not
even convinced it's necessary): Turing machines can compute anything
computable too, but that doesn't mean they're practical.

~~~
gfodor
Ok, more platitudes, still waiting for a specific criticism of this
architecture and its claimed limitations.

I'll repeat myself: most big data problems fit the mold of query =
function(data); map/reduce is a practical substrate for building algorithms to
compute these functions; and this paper presents a practical architecture for
implementing these types of systems.
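
A minimal sketch of the query = function(data) framing, assuming a toy
page-view count as the function. The event shape and every name below
(map_phase, batch_view, SpeedLayer, query) are hypothetical illustrations, not
taken from the article. The batch layer recomputes a view over all data with
map/reduce, the speed layer covers events the last batch run has not seen, and
a query merges the two.

```python
from collections import Counter
from functools import reduce

# Hypothetical event shape: (url, timestamp) page views.

def map_phase(event):
    """Map: emit a partial result (a one-element count) per record."""
    url, _timestamp = event
    return Counter({url: 1})

def reduce_phase(left, right):
    """Reduce: combine two partial results; associative, so it parallelizes."""
    return left + right

def batch_view(master_dataset):
    """Batch layer: recompute the view from scratch over all stored data."""
    return reduce(reduce_phase, map(map_phase, master_dataset), Counter())

class SpeedLayer:
    """Speed layer: incrementally covers events the batch run has not seen."""

    def __init__(self):
        self.realtime_view = Counter()

    def ingest(self, event):
        self.realtime_view += map_phase(event)

def query(url, batch, realtime):
    """Serving side: merge both views to answer over all data."""
    return batch[url] + realtime[url]

master_dataset = [("/home", 1), ("/about", 2), ("/home", 3)]
batch = batch_view(master_dataset)

speed = SpeedLayer()
speed.ingest(("/home", 4))  # arrived after the batch run started

print(query("/home", batch, speed.realtime_view))
```

Counting merges trivially because addition is associative; functions without
that property need more thought about how the two views combine.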

~~~
lvh
You made the argument that it was just platitudes and MapReduce is a
sufficient model. I was specifically counter-arguing that point; that doesn't
mean the claimed model is bad or that I have to reply to every single point in
a single comment.

------
spinron
Some of you might have missed the perspective of the article's author (perhaps
it isn't that clear); you might want to re-examine it from an implementer's
point of view. In other words, if you have a real-life big data problem (that
would benefit from parallel processing via Hadoop) and you actually have to
build the thing so that it works and scales, the decomposition presented
by the architecture would make the implementation a lot simpler. Architecting
such systems isn't trivial, and this is a solid blueprint to start with. And
it's really an architecture, not a model: it doesn't tell you how to formulate
algorithms; rather, it suggests how to build a complete system around them.

I have read the recent draft of the "Big Data" book by the author, which
describes the architecture that the article discusses in more detail.
Honestly, if you are a beginning practitioner in this field, you can't really
go wrong by reading it.

------
noelwelsh
I'm pretty bullish on the "speed layer" coming to dominate. I've done a fair
bit of work with streaming algorithms [1] and they have advantages beyond just
latency, reduced memory usage being the primary one. If you believe data is
growing faster than computing power, it seems that streaming algorithms must be
the way forward.

Note that you can do a lot with streaming algorithms (it's not just counting).
Also, the reduced memory usage (orders of magnitude) makes the complexity of
random writes less of a problem, as you have less need to go outside a single
machine.
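
As one illustration of "not just counting", here is a sketch of Welford's
online algorithm, which tracks mean and variance in constant memory while
seeing each value only once. It is chosen purely as an example and is not
taken from the slides; the class name RunningStats is hypothetical.

```python
class RunningStats:
    """Welford's online algorithm: mean and variance in O(1) memory."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        # Each value is folded in once and then discarded.
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def variance(self):
        """Population variance of everything seen so far."""
        return self.m2 / self.count if self.count else 0.0

stats = RunningStats()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(value)

print(stats.mean, stats.variance())
```

The state is three numbers regardless of how long the stream is, which is the
memory advantage in miniature.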

[1] Slides on streaming algorithms:
<http://noelwelsh.com/streaming-algorithms/2012/11/22/streaming-algorithms-scala-exchange-edition/>

~~~
avibryant
Nice slides. You should check out <https://github.com/twitter/algebird> -
we've implemented a number of streaming algorithms (HyperLogLog, Count-Min
Sketch, along with stuff like MinHash and Bloom filters) in Scala as Monoid
typeclasses. We'd love to hear your thoughts or get your contributions.
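
To show why a monoid is a natural fit for these structures, here is a rough
Python sketch of a Count-Min Sketch with an associative plus and an empty
sketch as the identity. This is an illustration only, not Algebird's API; the
class, method names, and default sizes are all assumptions.

```python
import hashlib

class CountMinSketch:
    """Fixed-size table of counters; estimates never undercount."""

    def __init__(self, depth=4, width=256, table=None):
        self.depth = depth
        self.width = width
        self.table = table or [[0] * width for _ in range(depth)]

    def _columns(self, item):
        # One salted hash per row; hashlib keeps results stable across runs.
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row, col in self._columns(item):
            self.table[row][col] += count

    def estimate(self, item):
        # The least-collided row gives the tightest upper bound.
        return min(self.table[row][col] for row, col in self._columns(item))

    def plus(self, other):
        """Monoid operation: element-wise sum of two same-shaped sketches."""
        assert (self.depth, self.width) == (other.depth, other.width)
        merged = [[a + b for a, b in zip(r1, r2)]
                  for r1, r2 in zip(self.table, other.table)]
        return CountMinSketch(self.depth, self.width, merged)

def sketch_of(items):
    sketch = CountMinSketch()  # a fresh sketch is the monoid identity
    for item in items:
        sketch.add(item)
    return sketch

# Sketches built on separate partitions merge into the sketch of the whole.
left = sketch_of(["a", "b", "a"])
right = sketch_of(["a", "c"])
merged = left.plus(right)

print(merged.estimate("a"))
```

Because plus is associative, partial sketches can be combined in any grouping,
which is what lets the same structure work in both batch and streaming jobs.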

~~~
noelwelsh
Algebird looks great! Thanks for mentioning it.

