
Runaway complexity in Big Data... and a plan to stop it - timf
http://www.slideshare.net/nathanmarz/runaway-complexity-in-big-data-and-a-plan-to-stop-it
======
confluence
We invented RDBMSs to store large amounts of important data safely on
extremely constrained storage hardware (kilobytes were expensive in the 1970s),
where you could not economically store redundant copies.

Append-only immutable DBs are inherently better just because they essentially
add a 4th dimension to your data (all history, all data, all the time), and
they play to the fact that Von Neumann machines love splitting work into
constrained sub-streams/problems and crunching through the entire data set in
memory across clusters using discrete memory frames/units of work (think
Google search index/GPU framebuffers/integer arithmetic).

Storage is now unlimited and random seeks are very expensive. The more you
serialize your processing and split it into in-memory units that exploit
cache locality, the faster you'll perform. You turn performance problems
into throughput problems - all you have to do is keep the data flowing.

RDBMSs are dead for the same reason I don't manage my YouTube bandwidth use
or bother managing files or RAM - I have cable now, with terabyte hard disks
and 32-64 GB of RAM.

The future is immutable, with periodic/continuous historical stream compaction
(pre-computation), with queries running across hundreds of clusters and all
searches reduced to essentially linear map-reduce (sometimes with
b-trees/other speed-ups) plus a real-time dumb/inaccurate stream layer.

One machine cannot store the Internet - a million machines can. One machine
cannot process the Internet - but a million machines, each holding a 64 GB
slice of it in RAM and working with cache-friendly, memory-local, discrete
operations, can run through the Internet millions of times a second.
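
A minimal sketch of that "linear scan over memory-local slices" shape, in plain
Python with an invented word-count example (no particular framework implied):

    from collections import Counter
    from functools import reduce

    # Hypothetical in-memory partitions: each slice fits in one machine's RAM.
    partitions = [
        ["the", "cat", "sat"],
        ["the", "dog", "sat"],
        ["the", "cat", "ran"],
    ]

    def map_partition(words):
        # Each worker scans its slice linearly and emits partial counts.
        return Counter(words)

    def merge_counts(a, b):
        # Partial results are combined; only sequential merges, no random seeks.
        return a + b

    total = reduce(merge_counts, (map_partition(p) for p in partitions), Counter())
    print(total["the"])  # 3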

~~~
fauigerzigerk
Just a few facts that I have to deal with:

1) I can't afford a million machines or hundreds of clusters or keeping
petabytes of data around in a way that doesn't make the data completely
useless. Yes, mutable state causes complexity, but I can't afford to rid myself
of that problem by using brute force.

2) Performance is not replaceable by throughput if you have users waiting for
answers to questions they just dreamt up a second ago and you have new data
coming in all the time.

3) Cache locality and immutability don't go well together. Many indexes will
always have to be updated, not just replaced wholesale with an entirely new
version of the index.

~~~
cbsmith
1) Even with hard drive prices inflated today, we're at over 9GB/$1 for
storage. That means for $1 you can store 1.25 _billion_ 64-bit integers. From
that perspective, there is very little data (at least in terms of traditional
RDBMS data) that is actually cost effective to throw out. To the extent that
you do need to throw it out, you can wait until all transactions involving the
data have long since completed.

2) I think you misunderstood what was meant here. His point was that your
limiting factor really becomes throughput if you have the right architecture.

3) Indexes and cache locality also tend not to play well together. ;-)

------
pbailis
"Conflating the storage of data with how it is queried" is a misinformed
criticism of the relational model. Codd's Relational Algebra (Turing Award
material) was in large part a move towards data independence, relaxing what
was previously a tight coupling of storage format and application access
patterns. Take a look at Rules 8 and 9, or just read the original paper:
<http://en.wikipedia.org/wiki/Codds_12_rules>

> "Activities of users at terminals and most application programs should
> remain unaffected when the internal representation of data is changed and
> even when some aspects of the external representation are changed."
> <http://www.seas.upenn.edu/~zives/03f/cis550/codd.pdf>

It's not clear from the slides how RDBMSs manage to conflate these two
concerns in practice.

It's also unclear to me how the "Lambda Architecture" differs from what we've
been calling [soft real-time] materialized view maintenance for decades.

~~~
nathanmarz
It's not a criticism of the relational model. It's a criticism of relational
databases as they exist in the world. The normalization vs. denormalization
slides explain how these concerns are conflated. The fact that you have to
denormalize your schema to optimize queries clearly shows that the two
concerns are deeply complected.
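
As a toy illustration of the point (not taken from the slides): denormalizing
couples the stored shape to one query, and the coupling shows up as soon as
anything changes.

    # Normalized: each fact stored exactly once.
    users = {1: {"name": "alice"}}
    tweets = [{"user_id": 1, "text": "hello"}]

    # Denormalized copy, tuned for a "show timeline with names" query.
    timeline = [{"user_name": users[t["user_id"]]["name"], "text": t["text"]}
                for t in tweets]

    # The cost: a simple rename now has to touch every denormalized row,
    # because the storage layout was shaped around one access pattern.
    users[1]["name"] = "alicia"
    for row in timeline:
        row["user_name"] = "alicia"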

~~~
ucee054
Your ideas will work very well for small data, big storage, e.g. daily stock
price histories.

They will not always work for big data, which often means "max out our storage
with crap, don't worry disk is cheap", e.g. web metrics.

When a decision maker goes "don't worry disk is unlimited", the resulting
application is prone to maxing out storage.

Whenever your application maxes out your storage, you have no space for
previous versions of the data.

~~~
bkirwi
You realize he works at Twitter, right? And that he's responsible for Storm,
and Cascalog?

I'm not going to say anything better than it was said in the slides / book
draft, so I'll just encourage you to take these techniques seriously...
they're born out of necessity, not because they sound like fun, and real
people are using them to solve problems that are hell to solve any other way.

That said, these are not problems that everyone has. If you're not nodding
your head along with the mutability / sharding / whatever complaints at the
beginning of the deck, you can probably still get by with a more traditional
architecture.

(Also, rereading... I should probably note that not everything needs to be
kept forever; only the source data, since the views can be recomputed from
them at any time. That makes things a bit cheaper.)

~~~
ucee054
> problems that are _hell_ to solve any other way.

'Bigness' of data != data size

'Bigness' of data == data size / budget

Twitter isn't a typical company. I assume they have both a budget and
competent management that will let them get away with something like the
Lambda architecture.

I reckon it's a lot harder to scale to even a _terabyte_ under the constraints
of a grubby setting like a data warehouse for some instrument monitoring
company.

Those guys will allow _at best_ MS SQL for storage, and won't mind putting
their developers through _hell_.

------
ebiester
I'm interested enough in the idea to pick up the (partially-completed) book,
but I wonder if this isn't far too complex a solution for 95% of cases. By
the time your system is in the 5% of cases, you're looking at a massive
rearchitecting effort.

Of course, to be fair, you're probably already looking at a rearchitecting
effort at this point!

However, looking at the architecture diagram, much of the complexity could be
hidden behind the scenes with a dedicated toolset built on top of Postgres,
rather than trying to cobble together Kafka, Thrift, Hadoop, and Storm.
Sometimes, one big tool beats a lot of small tools.

For that matter, I wonder if a series of Postgres-XC (or Storm) servers
couldn't do the same thing without learning a series of complex tools.

Step 1: Everything goes through a stored procedure. Deletes, updates, and
creates are code-generated through a DSL. Migrations would be a pain under
this system, but that is mitigated by the fact that the complexity is being
handled by the tool.

This stored procedure then writes to the database of record and then sends a
series of updates to what is essentially a system of real-time materialized
views that serve the same purpose as the standard NoSQL schema.

The purposes of the lambda architecture would still be fulfilled with lower
data-side complexity. After I read through Big Data I'll revisit this with a
more nuanced view, but I wonder if Hadoop and NoSQL really give you anything.
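
A rough sketch of that single write funnel, with in-memory stand-ins for the
database of record and for one "view" table (all names here are invented):

    record_of_truth = []   # stand-in for the database of record
    sales_by_day = {}      # stand-in for one real-time materialized view

    def write_order(order):
        # The "stored procedure": the only path by which data enters the system.
        record_of_truth.append(order)
        # Fan out to each materialized view that serves reads.
        day = order["date"]
        sales_by_day[day] = sales_by_day.get(day, 0) + order["amount"]

    write_order({"date": "2012-09-20", "amount": 40})
    write_order({"date": "2012-09-20", "amount": 25})
    print(sales_by_day["2012-09-20"])  # 65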

------
spenrose
I really like this. Has some overlap with Command-Query Responsibility
Segregation, which I have found very useful:
<http://www.udidahan.com/2009/12/09/clarified-cqrs/>

~~~
dm3
There was an interesting production-grade event store released recently which
fits the described use-case: <http://geteventstore.com>

Basically, you store all of the data as events (event sourcing) and create
separate query views (projections) which are populated from event streams.
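
The pattern in miniature, with a list standing in for the event store and a
dict for one projection (this is only the shape of event sourcing, not the
geteventstore API):

    events = []        # append-only event stream
    balances = {}      # one projection, kept up to date from the stream

    def append(event):
        events.append(event)
        project(event)

    def project(event):
        acct = event["account"]
        balances[acct] = balances.get(acct, 0) + event["delta"]

    append({"account": "a1", "delta": 100})
    append({"account": "a1", "delta": -30})
    print(balances["a1"])  # 70

    # A brand-new query view can always be derived by replaying the stream.
    deposit_count = sum(1 for e in events if e["delta"] > 0)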

------
nasmorn
Interestingly, I am just building a complete analytics setup for my ecommerce
site, since Google Analytics doesn't give me everything I want to track. I
looked at different things like statsd and Redis and whatnot, but realized that
my data is not actually big. I will just put everything in Postgres and will
be able to do that on cheap hardware even when I grow tenfold. Since revenue
per pageview is high (multiple cents), there will never be an unsolvable
performance issue.

tl;dr - the hand-me-down from the 90s can be good enough if you are not big yet

------
markelliot
Interestingly the chart at the end suggests implementations for everything
except the raw data, but generating batch views from an arbitrary store and
keeping your raw data reliably are hard problems. The charts motivate dumping
this stuff in Hadoop (or some distributed file system), but any reliable store
would do. (@nathanmarz: would love recommendations)

~~~
nathanmarz
A distributed filesystem, such as HDFS or MapR, is ideal for the master
dataset.

------
lmm
This is a reasonable architecture, but it introduces complexity of its own -
you have to implement all your view creation logic in two places, which should
go against every programmer's instincts. If you have a system capable of
streaming all the events and writing out the reporting views, why not use it
for everything?

At my previous^2 job, we had a distributed event store that captured all data
and could give you all events (or events matching a very limited set of
possible filters) either in a given time range, or streaming from a given time
onwards. For any given view we'd have four instances of the database
containing it, each populated by its own streaming inserter; if we discovered
a bug in the view creation logic, we'd delete one database and re-run the
(newly updated) inserter until it caught up, then repeat with the others
(queries automatically went to the most up-to-date view, so this was
transparent to the query client - they'd simply see bugged data some of the
time until all the views were rebuilt).

The events system guaranteed consistency at the cost of a bit of latency
(generally <1 second in practice, good enough for all our query workloads); if
an event source hard-crashed then all event streams would stop until it was
manually failed (ops were alerted if events got out of date by more than a
certain threshold). This could also happen if someone forgot to manually close
the event stream after taking a machine out of service (but at least that only
happened in office hours); hard-crashes were thankfully pretty rare.
Rebuilding the full view after discovering a bug was obviously quite slow, but
there's no way to avoid that (and again this was quite rare).

In use it was a very effective architecture; we handled consistency of the
event stream in one place and building the view in another. We only had to
write the build-the-view code once, and we only had one view store and one
event store to maintain. And we built it all on MySQL.
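
A compressed sketch of the rebuild trick described above, with lists standing
in for the event store and the view replicas (query routing and catch-up are
glossed over):

    event_store = [{"id": i, "value": i} for i in range(10)]  # replayable history

    def build_view(events, view_logic):
        view = {}
        for e in events:
            view_logic(view, e)
        return view

    def buggy_logic(view, e):
        view[e["id"]] = e["value"]        # bug: forgot to double the value

    def fixed_logic(view, e):
        view[e["id"]] = e["value"] * 2

    replicas = [build_view(event_store, buggy_logic) for _ in range(4)]

    # Fix the bug by rebuilding one replica at a time from the event stream;
    # queries keep hitting the other replicas until each rebuild catches up.
    for i in range(len(replicas)):
        replicas[i] = build_view(event_store, fixed_logic)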

------
blaines
I like this idea but my first impression is it sounds like a lot of work to
set up and scale. Am I wrong?

Secondarily, if this were a service, wow!

~~~
siganakis
It does seem like a lot of work to me.

One of the problems is servicing two workloads, one for "transactional"
processing and another for analysis.

For transactional systems you need to be able to change things quickly and
consistently. For analysis you need to be able to query lots of data quickly.

For decades people have realized that these are two separate workloads, so
they have built two systems: a transactional system (on an RDBMS) and a data
warehouse (generally also on an RDBMS). Data is then shipped between the two
in batch jobs.

The transactional system is normalized, and so is the data warehouse. Within
the data warehouse you make denormalized copies of the data that fit the
required reporting workload, so that as much of the workload as possible is
pre-computed.

The problem is that as reporting requirements change, you need to modify these
pre-computed stores, as they are very heavily tuned for the particular
reporting requirement. Building pre-computed stores is generally done in SQL
and can be challenging as you are generally shifting a lot of data and you are
trusting the RDBMS to get its optimizations right.

There is a trend now to use Hadoop for the building of these pre-computed
stores (and even to use Hadoop for the entire data warehouse). However,
writing map-reduce jobs for queries is cumbersome compared to SQL so your
productivity suffers. But you don't have to pay Oracle or IBM for licenses.

The key problem is that you need to pre-compute stuff to do fast aggregation,
but you can't pre-compute everything. So what you pre-compute is dependent on
what your users want, and that changes all the time.

So what you want is a system that lets you change what is pre-computed easily
and efficiently.
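
That tension in a toy example: one pass builds the rollup today's report
needs, and when the requirement changes the old rollup is useless and a new
one has to be built from the raw data (the field names are invented):

    from collections import defaultdict

    raw_sales = [
        {"day": "2012-09-20", "product": "a", "region": "us", "amount": 10},
        {"day": "2012-09-20", "product": "b", "region": "eu", "amount": 5},
        {"day": "2012-09-21", "product": "a", "region": "us", "amount": 7},
    ]

    # Pre-computed store tuned for today's report: revenue per (day, product).
    by_day_product = defaultdict(int)
    for row in raw_sales:
        by_day_product[(row["day"], row["product"])] += row["amount"]

    # Requirement changes: now the users want revenue per (region, product).
    # Nothing in the old rollup helps; it has to be rebuilt from the raw data.
    by_region_product = defaultdict(int)
    for row in raw_sales:
        by_region_product[(row["region"], row["product"])] += row["amount"]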

------
wellpast
The only novel part of this seems to be that you issue realtime queries over
the batch view AND realtime view together; otherwise I don't see what else is
interesting here. And I think that novelty is unneeded (and therefore only
additional complexity).

Why don't you just:

1) Master your data in "normal form" (graph/entity-based data models like
Datomic or Neo4j are perfect for this) in a scalable database (e.g., Cassandra)

2) Use Hadoop for offline analytics

3) Index all of the data needed for realtime queries in a search engine (e.g.,
elasticsearch), smartly partitioning the data for performance

As for "schema" changes -- you don't need them in an entity-based model
because you can always be additive with the attributes in your schema. (You
may wish to have a deprecation policy for some attributes but doing this
allows you to as slowly as you'd like update any applications and batch jobs
that depend on deprecated attributes.)

~~~
nathanmarz
It all comes down to Query = Function(All data). The architecture I describe
is a general approach for computing arbitrary functions on an arbitrary
dataset (both in scale and content) in realtime. Any alternative to this
approach must also be able to compute arbitrary functions on arbitrary data.
Storing data in Cassandra while also indexing it in Elasticsearch does not
satisfy this, and it also fails to address the very serious complexity
problems discussed in the first half of the presentation: mutability and the
conflation of data with queries.

The Lambda Architecture is the only design I know of that:

1) Is based upon immutability at the core.

2) Cleanly separates how data is stored and modeled from how queries are
satisfied.

3) Is general enough for computing arbitrary functions on arbitrary data
(which encapsulates every possible data system).
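
To make the Query = Function(All data) split concrete, a few lines showing
only the shape of the idea (none of the actual tooling): the batch view is a
pure function of the master dataset as of the last batch run, the realtime
view covers only what arrived since, and a query merges the two.

    master_dataset = [("alice", 1), ("bob", 1), ("alice", 1)]  # immutable facts
    recent_data = [("alice", 1)]                     # arrived since last batch run

    def batch_view(data):
        # Pure function of all data seen by the last batch run.
        counts = {}
        for name, n in data:
            counts[name] = counts.get(name, 0) + n
        return counts

    def realtime_view(data):
        # Same logic, but only over the small recent slice.
        return batch_view(data)

    def query(name):
        b = batch_view(master_dataset)
        r = realtime_view(recent_data)
        return b.get(name, 0) + r.get(name, 0)

    print(query("alice"))  # 3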

~~~
DanWaterworth
I'm writing a distributed DBMS based on immutable data. I have a different way
of providing fast realtime arbitrary querying of all data and it's much
simpler. You just cache intermediate results.

For example, if you want to know the sum of the elements in a tree, but you've
already calculated the sum of the elements of another tree that differs on
only one path, then you won't need to do much work: you can reuse the
precomputed results for most of the tree and only perform calculations for the
parts that are different.
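
A small sketch of that caching idea: subtree results are memoized by the
subtree's identity, so a new tree that shares most of its structure with an
old one only pays for the changed path.

    cache = {}   # maps a subtree's identity to its previously computed sum

    def tree_sum(node):
        # node is None or a tuple (value, left, right); shared subtrees are
        # literally the same objects, so their sums come straight from the cache.
        if node is None:
            return 0
        key = id(node)
        if key not in cache:
            value, left, right = node
            cache[key] = value + tree_sum(left) + tree_sum(right)
        return cache[key]

    left = (1, None, None)
    right = (2, None, None)
    old_tree = (3, left, right)
    tree_sum(old_tree)

    # A "new" tree differing on one path reuses the untouched right subtree.
    new_tree = (3, (10, None, None), right)
    print(tree_sum(new_tree))  # 15, with the sum of `right` taken from the cache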

~~~
DanWaterworth
In fact, my approach is more general than yours because it isn't restricted to
catamorphisms.

~~~
nathanmarz
Afraid not. Nothing is more general than function(all data).

~~~
DanWaterworth
You rely on mapreduce, so you don't have the ability to calculate any
function. You can only calculate catamorphisms of your data
(<http://en.wikipedia.org/wiki/Catamorphism>). You can't solve a SAT problem
effectively using mapreduce over the tree of the expression, for example. I
can't think of a case where you'd want to calculate a function that isn't a
catamorphism, but saying that you aren't restricted to them is wrong.

~~~
Evbn
Is "median" expressible as a catamorphism? What about a neural network
propagation?

~~~
DanWaterworth
First, let me explain what I mean by catamorphism. I mean a function that
operates on the data within a node of a tree and the values returned by
performing the catamorphism on its child nodes.

To find the median, you need the data sorted. You can express something like
mergesort as a catamorphism: you take the sorted lists of the child nodes and
the singleton list of the value at the node and merge them together. However,
you have to do a linear amount of work at each node and store a linear amount
of data, so it's not really within the spirit of mapreduce because it doesn't
scale well, although it is possible.

Catamorphisms are operations on trees, so the first problem with neural
networks is that you'd have to embed your neural network in a tree. In the
worst possible case, a fully connected network, to find the next value of each
neuron, you'll have to find the current value of every other neuron. The
problem is that you don't have access to "cousin nodes'" data, so it isn't
possible.
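
For concreteness, a tree fold in the sense described above: the result at a
node is a function of the node's own value and the results already computed
for its children. Sum is the cheap case; the mergesort-style fold also fits,
but carries a subtree-sized list up through every node.

    import heapq

    # A node is (value, [children]); leaves just have an empty child list.
    tree = (5, [(3, []), (8, [(1, [])])])

    def cata(f, node):
        value, children = node
        return f(value, [cata(f, c) for c in children])

    # Sum as a catamorphism: constant data passed up per node.
    total = cata(lambda v, child_results: v + sum(child_results), tree)  # 17

    # Sorted contents as a catamorphism (the mergesort-like case): correct,
    # but each node merges and passes up a list as big as its whole subtree.
    sorted_values = cata(
        lambda v, child_lists: list(heapq.merge([v], *child_lists)), tree)
    # the median is then just the middle element of sorted_values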

------
gbvb
The interesting part about timestamping changes to records is essentially
"effective dating", and it is bread and butter in any transactional system
that needs to record employee info. This is done to varying degrees, to
individual column values or to entire rows. In an RDBMS, you create a compound
key that lets you know both the history and the current record. I guess the
newfangled NoSQL will make it easier to store..?
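
What that looks like in miniature, with a compound key of (employee_id,
effective_date) and the current record being simply the latest-dated row at or
before the date of interest (a toy model, not any particular schema):

    # Each row is keyed by (employee_id, effective_date); nothing is overwritten.
    salary_history = {
        (42, "2010-01-01"): 50000,
        (42, "2011-06-01"): 55000,
        (42, "2012-03-01"): 60000,
    }

    def salary_as_of(employee_id, date):
        # The effective row is the one with the latest effective_date <= date.
        rows = [(eff, amt) for (emp, eff), amt in salary_history.items()
                if emp == employee_id and eff <= date]
        return max(rows)[1] if rows else None

    print(salary_as_of(42, "2011-12-31"))  # 55000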

------
NDizzle
I don't have a critique for the content yet, and HN is usually the place to
post this kind of stuff so here goes.

My initial reaction is to close the window immediately due to the font on the
slides. I don't know if that is the fault of my browser (Chrome 21.x on this
PC), SlideShare, or the presenter who put together the slides, but the
staggered characters drive me bonkers.

~~~
Jare
My bet is on Chrome's sometimes clumsy font rendering. Go fullscreen on the
presentation and the characters will render fine.

Edit: scratch that - each letter shows up as a separate span in the generated
HTML. There's no sane way to typeset that.

------
ChuckMcM
It would be useful if there were a speaker narrative; it would be better still
if there were a white paper that this presentation was just summarizing.

What I got out of it was "Just store the raw data", always compute
results/queries, and BigTable and Map/Reduce are cool. I felt like I missed
something; perhaps someone here can help.

~~~
nathanmarz
The first chapter of my book discusses a lot of what's in the presentation
<http://manning.com/marz/BD_meap_ch01.pdf>

~~~
grandalf
Out of curiosity, why is it called the lambda architecture?

~~~
nathanmarz
Lambda is the standard symbol used to represent functions (coming from the
lambda calculus). Since the purpose of this architecture is to compute
arbitrary functions on arbitrary data, and since the internals of the
architecture make heavy use of pure functions (batch view is a pure function
of all data, and queries are pure functions of the precomputed views), "Lambda
Architecture" seemed like an appropriate name.

------
bcoates
I'm not sure how to square the normalized schema with the immutability;
normalization implies schema changes which imply data changes, no?

In particular, what do I do when there is erroneous historical data that
violates the new schema (the newly discovered constraint that was there all
along)?

~~~
siganakis
Essentially they end up rebuilding the "transaction log" that most RDBMSs use
to commit data safely.

Before any data is written to the "data" files of the database, it is written
to a "log" file sequentially. Once that has succeeded, it will write to the
data file, then write to the log file again to say that the write was
successful.

That way if the database breaks during the write, the system knows about it.
It's also useful because it allows you to restore the database to a point in
time, as you have a full log of all the transactions run (and you know how to
un-run them).

In this case, they just make the transaction log a little more accessible.

For your example, erroneous historical data would remain in the system, but
with a new record added indicating that at a certain date and time the old
record was replaced with the new one.

Basically, think of a store processing a purchase of $100 and a refund of $100
as two transactions ($100 and -$100), rather than simply deleting the first
transaction.

This is important because things may have happened between the deposit of
money and its refunding (e.g. interest payments).
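
The purchase-and-refund example in code, assuming a simple append-only ledger
where the balance (and any interest calculation in between) is derived by
folding over the history rather than editing it:

    ledger = []   # append-only; corrections are new entries, never deletions

    ledger.append({"time": 1, "amount": 100})    # purchase
    ledger.append({"time": 2, "amount": 3})      # interest accrued in between
    ledger.append({"time": 3, "amount": -100})   # refund: a new fact, not an edit

    balance = sum(entry["amount"] for entry in ledger)   # 3

    # The full history is still there to explain how the balance came about.
    history = [(e["time"], e["amount"])
               for e in sorted(ledger, key=lambda e: e["time"])]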

------
calibraxis
Does anyone know what his NewSQL critique was? (The '"NewSQL" is misguided'
slide. BTW, I've read the released chapters of his _Big Data_ book, so I know
a fair bit of the slides' context.)

~~~
nathanmarz
NewSQL inherits mutability and the conflation of data and queries. It's no
longer 1980, and we now have the capability of cheaply storing massive amounts
of data. So we should use that capability to build better systems that fix the
complexity problems of the past.

~~~
Evbn
... but it is no longer 1980, and we have massively more data.

------
DanWaterworth
> MapReduce is a framework for computing arbitrary functions on arbitrary
> data.

No, MapReduce is a framework for computing catamorphisms over trees.

~~~
Evbn
I think he meant catamorphisms of arbitrary functions over trees of arbitrary
data. As in, you can apply any function at the leaves, and the leaves can be
any type.

~~~
DanWaterworth
This is directly contradicted in the first chapter of his book:

> The Lambda Architecture, which we will be introducing later in this chapter,
> provides a general purpose approach to implementing an arbitrary function on
> an arbitrary dataset and having the function return its results with low
> latency.

------
danso
The OP is an engineer at Twitter... doesn't Twitter use NoSQL to store tweets?
I know your critique of NoSQL is that it is an overreaction to the pain of
schemas, so I'd like to see some discussion on how to implement lambda with a
data architecture that, at some point, uses document databases.

Thanks for the post... SlideShare is annoying to read (on an iPad), but the
content was digestible. Mostly, it helped remind me to download the update of
the OP's book.

