
Turning the database inside-out with Apache Samza - martinkl
http://blog.confluent.io/2015/03/04/turning-the-database-inside-out-with-apache-samza/
======
eloff
Immutability is hardly a cure-all; see the discussion here for why RethinkDB
moved away from it:
http://www.xaprb.com/blog/2013/12/28/immutability-mvcc-and-garbage-collection/

The reality is that shared, mutable state is the most efficient way of
working with memory-sized data. People can rant and rave all they want about
the benefits of immutability vs. mutability, but at the end of the day, if
performance is important to you, you'd do best to ignore them.

Actually, to be more honest, reality is more complicated still. The MVCC that
many databases use to get ACID semantics over a shared mutable dataset is
really a combination of mutable and immutable approaches.

~~~
coffeemug
Slava @ rethink here.

This is a really interesting subject -- I should do a talk/blog post about
this at some point. Here is a quick summary.

RethinkDB's storage engine heavily relies on the notion of
immutability/append-only. We never modify blocks of data in place on disk --
all changes are recorded in new blocks. We have a concurrent, incremental
compaction algorithm that goes through the old blocks, frees the ones that are
outdated, and moves things around when some blocks have mostly garbage.
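
To make the pattern concrete, here is a toy sketch of the general append-only
idea (a deliberately simplified Python model, not our actual engine -- real
blocks live on disk and the index is itself a tree):

    # Toy append-only store: writes never modify old segments in place;
    # a compactor re-appends live records and frees old segments.
    class AppendOnlyStore:
        def __init__(self, segment_size=4):
            self.segment_size = segment_size
            self.segments = [[]]   # each segment is a run of (key, value) records
            self.index = {}        # key -> (segment_no, offset) of the newest version

        def put(self, key, value):
            if len(self.segments[-1]) >= self.segment_size:
                self.segments.append([])   # segment full: start a new one
            self.segments[-1].append((key, value))
            self.index[key] = (len(self.segments) - 1, len(self.segments[-1]) - 1)

        def get(self, key):
            seg, off = self.index[key]
            return self.segments[seg][off][1]

        def compact(self, seg_no):
            if seg_no == len(self.segments) - 1:
                return                     # never compact the active tail
            for off, (key, value) in enumerate(self.segments[seg_no]):
                if self.index.get(key) == (seg_no, off):   # still the newest version?
                    self.put(key, value)   # re-append the live record
            self.segments[seg_no] = None   # the old segment is now free

All of the engineering pain lives in compact(): making it concurrent and
incremental, so it never stalls writers, is what takes the years.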

The system is very fast and rock solid. But...

Getting a storage engine like that to production state is an enormous amount
of work and takes a very long time. Rethink's storage engine is really a work
of art -- I consider it a marvel of engineering, and I don't mean that as a
compliment. If we were starting from scratch, I don't think we'd use this
design again. It's great now, but I'm not sure if all the work we put into it
was ultimately worth the effort.

~~~
boredandroid
I really think there are a couple of levels of immutability that are easy to
conflate.

Specifically immutability for

1. In-memory data structures...this is the contention of the functional
programming people.

2. Persistent data stores. This is the LSM style of data structure that
substitutes linear writes and compaction for buffered in-place mutation.

3. Distributed system internals -- this is a log-centric, "state machine
replication" style of data flow between nodes. This is a classic approach in
distributed databases, present in systems like PNUTS.

4. Company-wide data integration and processing around streams of immutable
records between systems. This is what I have argued for
(http://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying)
and what I think Martin is mostly talking about.

There are a lot of analogies between these but they aren't the same. Success
of one of these things doesn't really imply success for any of the others.
Functional programming could lose and log-structured data stores could win or
vice versa. Pat Helland has made an across-the-board call for immutability
(http://www.cidrdb.org/cidr2015/Papers/CIDR15_Paper16.pdf), but that remains
a pretty strong assertion. So it is worth being specific about which level
you are thinking about.

For my part I am pretty bullish about stream processing and data flow between
systems being built around a log or stream of immutable records as the
foundational abstraction. But whether those systems are internally built in
functional languages or use LSM-style data layout on disk is kind of an
implementation detail. From my point of view immutability is a lot more
helpful in the large than in the small -- I have never found small imperative
for loops particularly hard to read, but process-wide mutable state is a big
pain, and undisciplined dataflow between disparate systems, caches, and
applications at the company level can be a real disaster.

~~~
eloff
Excellent points -- yes, it's important to clarify what we're talking about
here. Samza sounds like an event-sourcing-style immutable event log. You
could think of it like the transaction or replication log of a traditional
database. Having that be immutable is very sensible! But you can't always
query it in "real time".

On the other hand, making the data structures you query in real time
immutable is problematic, because then you'll need a LevelDB-style compaction
step. That's not to say it can't be done well, just that it's hard to do
well.
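
Here's roughly what I mean, as a minimal Python sketch (the event shapes are
hypothetical): the immutable log is the source of truth, and the thing you
actually query in real time is a mutable view derived from it:

    # Event-sourcing sketch: the append-only log is authoritative;
    # the queryable view is derived and can always be rebuilt from it.
    log = []   # append-only list of immutable events

    def append(event):
        log.append(event)

    def build_view(upto=None):
        view = {}   # derived, throwaway
        for event in log[:upto]:
            if event["op"] == "set":
                view[event["key"]] = event["value"]
            elif event["op"] == "delete":
                view.pop(event["key"], None)
        return view

    append({"op": "set", "key": "user:1", "value": "alice"})
    append({"op": "set", "key": "user:1", "value": "alicia"})
    print(build_view())   # {'user:1': 'alicia'} -- the latest write wins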

~~~
hyc_symas
LMDB does ACID MVCC using copy-on-write, with no garbage collection or
compaction needed. It delivers consistent, deterministic write performance
with no pauses, and it is actually now in use in a number of soft real-time
systems.
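
The essence of copy-on-write MVCC, as a toy Python sketch (a persistent
binary tree, not LMDB's actual B+tree or page allocator): a writer copies
only the path from the root to the changed node, so readers holding an old
root keep a consistent snapshot with no locks and no undo log:

    # Copy-on-write tree: insert() returns a NEW root and copies only the
    # root-to-leaf path; everything else is shared with older versions.
    class Node:
        def __init__(self, key, value, left=None, right=None):
            self.key, self.value, self.left, self.right = key, value, left, right

    def insert(node, key, value):
        if node is None:
            return Node(key, value)
        if key < node.key:
            return Node(node.key, node.value, insert(node.left, key, value), node.right)
        if key > node.key:
            return Node(node.key, node.value, node.left, insert(node.right, key, value))
        return Node(key, value, node.left, node.right)   # overwrite = fresh node

    def get(node, key):
        while node is not None:
            if key == node.key:
                return node.value
            node = node.left if key < node.key else node.right

    v1 = insert(insert(None, "a", 1), "b", 2)
    v2 = insert(v1, "a", 99)            # new snapshot; v1 is untouched
    print(get(v1, "a"), get(v2, "a"))   # 1 99

LMDB additionally avoids compaction by recycling old pages through a free
list once no reader still needs them.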

~~~
eloff
I was specifically thinking of LMDB as a counter-example when I wrote that
it's not impossible, just hard to do well. A much more sensible set of
tradeoffs than LevelDB.

------
pavlov
_... most self-respecting developers have got rid of mutable global variables
in their code long ago._

I'm not convinced that's the case. Almost everyone has merely hidden their
mutable globals under layers of abstractions. Things like "singletons",
"factories", "controllers", "service objects", "dependency injection" are the
vernacular of the masked-globals game.

~~~
danellis
None of those things you said imply mutability. (Okay, maybe singletons,
depending on the implementation.)

~~~
pavlov
True, but in practice they tend to be used as containers or initializers for
mutable variables.

------
bmh100
As someone who works a great deal with analytics databases and ETL
(extract-transform-load) processes, I find immutability of data stores to be
an incredibly valuable property. Maybe append-only does not make sense in
operational databases all the time, but for non-real-time analytics it makes
a huge amount of sense. In my case, operational data is queried, optimized
for storage space and quick loading, and cached to disk. Because it is an
analytics database used for longer-term analysis and planning, daily queries
of operational data are sufficient in many cases. Operational workload is not
even a consideration. The ETL process also allows for "updating" records in
the "T" (transform) part. Updates to the operational data are not even
necessary, and often impossible, so correcting and enhancing the data for
decision making is a huge win for clients. Issues similar to "compaction
time" can still occur, but an ETL approach allows for many clean ways of
controlling the process and avoiding those failure scenarios.
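
To illustrate "updating" in the "T", a minimal Python sketch (the field names
and correction rule are hypothetical, not my actual pipeline): the extract
stays immutable, and corrections are a pure function applied on each load:

    # The extract is never modified; 'updates' happen as a pure transform.
    extract = [
        {"order_id": 1, "country": "US", "amount": 100.0},
        {"order_id": 2, "country": "Untied States", "amount": 25.0},  # dirty data
    ]

    CORRECTIONS = {"Untied States": "US"}   # hypothetical cleanup rule

    def transform(row):
        fixed = dict(row)                   # copy; leave the extract untouched
        fixed["country"] = CORRECTIONS.get(row["country"], row["country"])
        return fixed

    warehouse = [transform(r) for r in extract]   # the 'L' step loads this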

------
boredandroid
Anyone in the Bay Area interested in learning more about Apache Samza should
attend the meetup tonight in Mountain View:
http://www.meetup.com/Bay-Area-Samza-Meetup/events/220354853/

------
shanemhansen
I'm not sold on Samza, but I can tell you that building isolated services
that create their datastore from a stream of events is a really useful
pattern in some use cases (ad-tech).

I've made use of NSQ to stream user update events (products viewed, orders
placed) to servers sitting at the network edge, which cache the info in
LevelDB. Our request latency was something like 10 microseconds over Go's
json/rpc. We weren't even able to come close to that with the other NoSQL
database servers we tried, even with aggressive caching turned on.
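
The shape of the pattern, as a hedged Python sketch (a dict stands in for
LevelDB, a plain list for the NSQ topic, and the event fields are made up):
the edge server never makes a round trip to a central database at request
time, it just applies the stream to a local store:

    # Edge-cache pattern: consume update events, apply them locally,
    # serve reads from the local store with no central round trip.
    local_store = {}   # stand-in for LevelDB

    def on_event(event):   # consumer callback, fed by the stream
        user = local_store.setdefault(event["user_id"],
                                      {"viewed": [], "orders": []})
        if event["type"] == "product_viewed":
            user["viewed"].append(event["product_id"])
        elif event["type"] == "order_placed":
            user["orders"].append(event["order_id"])

    def handle_request(user_id):   # served at the network edge
        return local_store.get(user_id, {})

    for e in [{"type": "product_viewed", "user_id": 7, "product_id": "p1"},
              {"type": "order_placed", "user_id": 7, "order_id": "o9"}]:
        on_event(e)
    print(handle_request(7))   # {'viewed': ['p1'], 'orders': ['o9']}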

~~~
anonymousDan
What don't you like about Samza, out of interest? Something fundamental with
their model, or more implementation-related?

~~~
shanemhansen
I've seen organizations have lots of trouble operationally with Kafka (which
Samza uses). I've seen NSQ be extremely reliable operationally.

However, they offer very different guarantees, so it's an apples-to-oranges
comparison. NSQ isn't really designed to provide a replayable history,
although you can fake it by registering a consumer which does nothing but log
to a file (nsq_to_file), and that works pretty well.

(Disclaimer: the NSQ mailing list has lots of chatter these days; NSQ may be
growing features I'm not aware of.)

~~~
jonathanoliver
What troubles did you typically see with Kafka? Was it Zookeeper related?
(Also, good to talk to you at the Gopher Meetup on Tuesday)

------
sivers
Similar interesting talk by Rich Hickey:

http://www.infoq.com/presentations/Value-Values

~~~
luddypants
I was wondering how this relates to Datomic... I'm not really familiar enough
to say much about similarities and differences, but would be interested if
someone who is could comment.

~~~
ludwigvan
I asked the same question at the end of his talk; see the relevant section in
the video:

https://www.youtube.com/watch?v=fU9hR3kiOK0&t=2579

------
vkjv
You can do similar "magic" cache invalidation with Elasticsearch and the
percolate feature. Each time you do a query and cache some transformation of
the result, put that query in a percolate index. Then when you change a
document, run the document against the percolate index and, voila, you get the
queries that would have returned it and can then invalidate your cache.

This method of cache invalidation fails in one key scenario, though (just
like the approach in the article): what happens if you change a very core
thing that invalidates a large percentage of the cache?
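
A pure-Python sketch of the reverse-query idea (the concept only, not the
actual Elasticsearch percolate API): store a predicate alongside each cache
key, and match a changed document against the stored predicates to find which
entries to invalidate:

    # "Percolate" sketch: queries are indexed, and documents are run
    # against them -- the reverse of a normal search.
    cache = {}        # cache_key -> cached result
    percolator = {}   # cache_key -> predicate over a document

    def cache_query(key, predicate, result):
        cache[key] = result
        percolator[key] = predicate

    def on_document_change(doc):
        stale = [k for k, matches in percolator.items() if matches(doc)]
        for k in stale:
            cache.pop(k, None)   # invalidate only the affected entries
        return stale

    cache_query("active_users", lambda d: d.get("status") == "active",
                ["alice", "bob"])
    print(on_document_change({"id": 3, "status": "active"}))  # ['active_users']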

~~~
fizx
What you're hoping for is that some cacheable function of many documents is
also a monoid.

For example, you're hoping that when you invalidate the query "SELECT
COUNT(*) FROM foo WHERE x = 1" because a new document that matches it came
in, you can simply increment the existing cached value rather than rescanning
the database index.
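
As a hedged sketch (counts under addition are the monoid here: identity 0,
operation +):

    # A count is a monoid, so the cache can absorb a matching new
    # document by merging instead of rescanning the index.
    combine = lambda a, b: a + b   # the monoid operation (identity is 0)

    cached_count = 41   # cached result of: SELECT COUNT(*) FROM foo WHERE x = 1

    def on_new_document(doc):
        global cached_count
        if doc.get("x") == 1:                         # document matches the query
            cached_count = combine(cached_count, 1)   # merge, don't rescan

    on_new_document({"x": 1})
    print(cached_count)   # 42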

------
bonobo3000
This is a cool idea -- the holy grail scenario I'm envisioning is storing all
data in the log, i.e.:

1. The transaction log is a central repository for all data.

2. Much more detailed data is stored -- enough that analytics can run off
this same source of data.

The amount of data generated grows in proportion to the number of updates on
a row/piece of data, whereas with a mutable solution it is constant with
respect to the number of updates on the same data. That is a pretty big
scaling difference.

However, storing that much data translates to much higher costs for
HDDs/servers, or possibly lower write performance if the log is stored on
something like HDFS.

There would also be performance costs for building and updating a materialized
view. Imagine a scenario like this:

Events -> A B C D E F G H I J K

Materialized view M has been computed up to item J (but not K yet)

Read/Query M

Now either writing K incurs the cost of waiting for all dependent views to
materialize, or the read on M incurs the cost of updating M.

Some fusion of this would be pretty interesting, though. For example, what if
we just query M without applying any updates as long as there have been <X
updates? That translates to similar guarantees as an eventually consistent DB
-- the data could be stale. At least it gives us more control over this
tradeoff.
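
A Python sketch of that fusion (X and the integer events are toy choices):
reads apply the pending tail of the log only when the backlog exceeds X, so
staleness is bounded rather than zero:

    # Materialized view with bounded staleness: reads tolerate up to X
    # unapplied events; beyond that, the pending tail is applied first.
    X = 3
    log = []             # append-only event log
    view = {"count": 0}
    applied = 0          # log index up to which the view is current

    def write(event):
        log.append(event)   # writers never wait for the view

    def read():
        global applied
        if len(log) - applied > X:      # backlog too large: catch up first
            for event in log[applied:]:
                view["count"] += event  # toy apply step: events are ints
            applied = len(log)
        return view["count"]            # may be stale by up to X events

    for e in [1, 1, 1, 1, 1]:
        write(e)
    print(read())   # backlog of 5 > X, so the view catches up: 5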

------
swah
I really enjoyed reading about Storm too:
http://nathanmarz.com/blog/history-of-apache-storm-and-lessons-learned.html

This kind of "competition" leads to analysis paralysis, though. It's much
better when there is a single winner...

~~~
sitkack
You mean like Hadoop? I disagree: the popular bad solution starves out
innovation by sucking all the air out of the room. It's easier to decide on
the globally bad choice.

~~~
swah
I meant Samza and Storm. I thought both of them could run on Hadoop.

~~~
mrits
I believe they meant that Hadoop is an example of a clear winner.

------
bambax
_A more promising model, used in some systems, is to think of a database as an
always-growing collection of immutable facts._

That would already be huge progress over how databases are currently used; if
records were in fact immutable, many problems would be instantly solved.

~~~
hyc_symas
You would just be trading them for the intensely ugly problem of garbage
collection. Disk space is cheap, but it's not infinitely cheap. There are
plenty of append-only data stores out there now, and they all suffer from
compaction-related performance issues.

------
steve-rodrigue
Does anyone know which app was used to create the "handwritten" images? I
draw very badly, so I'm looking for such an app to explain data flows on a
corporate blog/wiki.

~~~
discardorama
It looks like the free app "Paper" by FiftyThree, available on iPads:
https://itunes.apple.com/us/app/paper-by-fiftythree/id506003812?mt=8

------
hyc_symas
Streams - another reinvention of LDAP Persistent Search.

Yes, there really _are_ protocols that handle single request/multiple response
interactions, and they've been around for decades. Unlike crap built on HTTP,
which was never intended for uses like this, these protocols work well with
multiple concurrent requests in flight simultaneously, etc.

------
hyperliner
Conceptually, one of the challenges of streams as first-class citizens is
that humans don't do well with them. For the purposes of analysis, humans
need a "snapshot" or fix on the data; this way they can derive insights from
it and act on them. The reality is that, for many real-world scenarios, a
real-time view of the data is not a luxury, it's actually a drawback, because
data changes are noisy. Many human problems deal with abstract
representations of the actual data, and so imprecision is part of the
problem.

I really like the talk from the point of view of simplifying the system-wide
problems caused by a gigantic mutable state. But I feel that at the boundary
between systems and humans there will be other issues to discuss.

------
fiatjaf
This is CouchDB, right?

