
Turning the database inside-out with Apache Samza (2015) - shawndumas
https://www.confluent.io/blog/turning-the-database-inside-out-with-apache-samza/
======
matthewrudy
If you're interested in this, I recommend Martin Kleppmann's book "Designing
Data-Intensive Applications" which is a longer form discussion of these
topics.

[http://dataintensive.net](http://dataintensive.net)

~~~
basetensucks
Another vote for this, it's an excellent book.

------
sbpayne
What he proposes here seems to be:

Service -> Messaging Broker (Kafka) -> Consumers (Samza) -> View, where writes
go to the service and reads come from a view.

However, I more often see a CDC (change data capture) approach used in
practice:

Service -> DB -> CDC -> Messaging Broker -> Consumers -> View

My understanding is that there are three reasons for this:

1) Many already had some conventional DB in place, so it was easier to slap a
CDC + Messaging Broker to get a lot of the benefits with less infra changes.

2) You require strong consistency.

3) Your messaging broker of choice might not be durable enough to entrust your
writes to.

My understanding is that Kafka is durable (as this article states), so aside
from (1) and (2) is there a strong reason to prefer a CDC approach?

~~~
weixiyen
One potential downside is disk space. If you actually want to retain all
records, that's going to take an order of magnitude more space if your workload
is update-heavy. If it's primarily inserts, it's not a big deal.

The other potential downside I can see is if you lost your views for any
reason. If your views are not tolerant of partition loss, rebuilding them takes
longer the further back the log goes, so you may have to snapshot your views at
certain intervals and record the offset of the last Kafka record applied at
snapshot time; otherwise you have the same problem people have when they clone
a huge, long-running git repo from the beginning. (edit: Kafka has log
compaction, which may help alleviate some of the rebuilding process)

~~~
fintler
> Kafka has log compaction so it may help alleviate some of the rebuilding
> process

I've seen jobs spend 6+ hours rebuilding from Kafka when the local store is
blown away for all partitions (using log compaction) -- if it's just a
partition or 10 (single node failure), it takes way less time.
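The reason compaction shortens rebuilds is that only the latest record per key survives; a toy sketch of that retention rule (names are illustrative, not Kafka's actual implementation):

```python
# Toy sketch of Kafka-style log compaction: keep only the most recent
# record per key, so rebuilding a key-value store from an update-heavy
# topic replays far fewer records.

def compact(log):
    latest = {}
    for offset, (key, value) in enumerate(log):
        latest[key] = (offset, value)  # later records win
    # emit the surviving records in their original offset order
    return [(k, v) for k, (off, v) in sorted(latest.items(),
                                             key=lambda kv: kv[1][0])]

log = [("user1", "a"), ("user2", "b"), ("user1", "c"), ("user1", "d")]
print(compact(log))  # [('user2', 'b'), ('user1', 'd')]
```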

When it comes to production, Samza's host affinity support is probably the
most important thing to have working for when a job fails.

[https://samza.apache.org/learn/documentation/latest/yarn/yar...](https://samza.apache.org/learn/documentation/latest/yarn/yarn-host-affinity.html)

------
marknadal
Another option (which turns the database even more inside out) is to use CRDTs
(conflict-free replicated data types). They allow you to get the end state
that you want, while still running your database across many different
concurrent machines.

Give them a google (they are also called Convergent Replicated Data Types);
you'll find them very enlightening. We use them as the base conflict-resolution
algorithm for our database, [http://gun.js.org/](http://gun.js.org/), and they
work great.
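A minimal example of the convergence property, using a grow-only counter (one of the simplest CRDTs; names here are illustrative):

```python
# G-Counter CRDT: each replica increments only its own slot, and merge
# takes the per-node maximum, so replicas converge to the same state no
# matter what order the merges happen in.

def increment(counter, node, n=1):
    counter = dict(counter)
    counter[node] = counter.get(node, 0) + n
    return counter

def merge(a, b):
    return {node: max(a.get(node, 0), b.get(node, 0))
            for node in a.keys() | b.keys()}

def value(counter):
    return sum(counter.values())

r1 = increment({}, "node-a", 3)  # replica 1's local increments
r2 = increment({}, "node-b", 2)  # replica 2's local increments
assert merge(r1, r2) == merge(r2, r1)  # merge is commutative
print(value(merge(r1, r2)))  # 5
```

Richer types (sets, maps, sequences) follow the same pattern: state plus a commutative, associative, idempotent merge.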

~~~
Gravityloss
(Just a note but your website makes the browser stutter a bit)

~~~
marknadal
Thanks... I really need to switch from Adobe Edge Animate to something better
(or just custom-build something). Any good suggestions?

------
SilasX
Wait, what does this do about the case where you have to purge the data
permanently, e.g. if someone's private medical records get added
accidentally[1]? Then you lose the benefit of independent streaming/views,
because you have to force an update to the later cached representations (just
as when you purge something from git's commit history, which is the same
model).

[1] plus, the "right to be forgotten"

------
jdreaver
If you enjoyed this talk then I highly recommend you also go down the rabbit
holes of Event Sourcing (see
[https://martinfowler.com/eaaDev/EventSourcing.html](https://martinfowler.com/eaaDev/EventSourcing.html))
and CQRS. The idea of separating your writes from your reads and doing so in a
consistent (well, maybe eventually consistent) way is a real eye-opener.

Warning: a lot of the literature and talks on ES/CQRS get bogged down in
Object Oriented and Enterprise minutiae, which detracts from the promising core
ideas. My advice is to try to see the forest for the trees when Googling these
topics.
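The write/read split the parent describes can be sketched minimally (illustrative names; a plain list stands in for the event store):

```python
# Minimal event-sourcing sketch: writes append immutable events to a log,
# and reads go to a view derived by replaying (folding over) that log.

log = []  # the append-only event log (the write side)

def deposit(account, amount):
    log.append({"type": "deposited", "account": account, "amount": amount})

def withdraw(account, amount):
    log.append({"type": "withdrawn", "account": account, "amount": amount})

def balances(events):
    # The read side: any number of views can be rebuilt from the same log.
    view = {}
    for e in events:
        sign = 1 if e["type"] == "deposited" else -1
        view[e["account"]] = view.get(e["account"], 0) + sign * e["amount"]
    return view

deposit("acc1", 100)
withdraw("acc1", 30)
print(balances(log))  # {'acc1': 70}
```

The "eventually consistent" part shows up when the view is maintained asynchronously by a consumer rather than recomputed on every read.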

------
niklasbuschmann
[https://medium.com/@dan_abramov/my-inspiration-ce454ab65f33](https://medium.com/@dan_abramov/my-inspiration-ce454ab65f33)

This was also an inspiration for Redux (React).

------
amelius
It seems he forgot about one aspect: security. How do you restrict the data
stream for agents that should not have access to the full data, while still
allowing them to read/write the parts of the data they should have access to?

~~~
nickpeterson
Wouldn't they just use a view for their security level?

~~~
amelius
Not sufficiently expressive.

For example, let's say I want to model a filesystem on top of the database.
And I want to give users access to folders when they have access to an
ancestor of that folder. How do I express that in SQL?

Also how would revoking access work in the proposed "data-stream" model?
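For what it's worth, the ancestor walk can be written as a recursive CTE; a sketch in SQLite, where the `folders` and `grants` tables (and the sample rows) are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE folders (id INTEGER PRIMARY KEY, parent_id INTEGER);
CREATE TABLE grants (user_name TEXT, folder_id INTEGER);
INSERT INTO folders VALUES (1, NULL), (2, 1), (3, 2);  -- 1 is root of 2 of 3
INSERT INTO grants VALUES ('alice', 1);                -- grant on the root
""")

def has_access(user, folder_id):
    # Walk from the folder up through its ancestors; access is granted
    # if the user has a grant on any folder along that path.
    row = conn.execute("""
        WITH RECURSIVE ancestors(id) AS (
            SELECT :fid
            UNION ALL
            SELECT f.parent_id FROM folders f
            JOIN ancestors a ON f.id = a.id
            WHERE f.parent_id IS NOT NULL
        )
        SELECT 1 FROM grants g
        JOIN ancestors a ON g.folder_id = a.id
        WHERE g.user_name = :user LIMIT 1
    """, {"fid": folder_id, "user": user}).fetchone()
    return row is not None

print(has_access("alice", 3))  # True: grant on ancestor folder 1
```

Whether this is expressive *enough* (and how revocation propagates to already-derived views) is a fair open question.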

------
bsg75
> However, most self-respecting developers have got rid of mutable global
> variables in their code long ago.

Is this a dig on those of us who have not gone to pure FP languages, or just
those who use globally scoped (unscoped) variables?

~~~
cowardlydragon
It's piggybacking a known issue in one domain onto another one, with dubious
intent...

Aka marketing

------
usgroup
I guess the pattern is pretty inevitable if you're building on a central
distributed log. You write events to a log, which consumers read to create
custom databases, which are the more convenient means to your ends.

Fine, but the semantics of your events will be exceptionally important, as will
versioning, if you'd like a consistent way of making your views re-buildable.
This is easy to mess up.

Kafka + YARN + Hadoop + DBs + code. To me this is neither elegant nor simple,
but perhaps it's necessary complexity if you're doing something in realtime and
at scale.

~~~
josephg
I'm currently in the process of rewriting a simple version of this stack with
some nice versioning properties.

It doesn't look simple from the outside, but that's how complex any modern
database is. The only difference with Samza is that you can decide how much of
that complexity to buy into. And because you have access to the "internals" of
how data flows through the system, you can extend that view through your
application. (E.g. by making a standalone access-control filter which only
allows some users access to some documents, or by wrapping your React static-
HTML rendering code as a database view that you can get a change feed to, for
free.)

For me the complex part that I can't quite stomach is having to buy into
Hadoop and ZooKeeper. But that's a tooling problem, not an architecture
problem. We need lighter implementations of all these tools.

------
aaron-santos
How is this (dis)similar to Datomic[1]?

1\. [http://www.datomic.com/](http://www.datomic.com/)

------
z3t4
Related problem: how do you scale collision detection across multiple servers?
Think of a simulation of balls in a vacuum that bounce off each other, with
more and more balls being added. Any suggestions?

~~~
Joeri
You can't avoid some level of checking across servers, but using binary space
partitioning it should be possible to exclude most balls from cross-server
collision checks.
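The same pruning idea can be sketched with a uniform grid (a simpler cousin of BSP; the cell size and coordinates below are made up for illustration):

```python
from collections import defaultdict
from itertools import combinations

CELL = 10.0  # cell size >= ball diameter, so collisions only span neighbours

def cell_of(pos):
    return (int(pos[0] // CELL), int(pos[1] // CELL))

def candidate_pairs(balls):
    # balls: {ball_id: (x, y)}. Bucket balls by grid cell, then only
    # compare balls in the same or neighbouring cells. If each server
    # owns a range of cells, only boundary cells need cross-server checks.
    grid = defaultdict(list)
    for bid, pos in balls.items():
        grid[cell_of(pos)].append(bid)
    pairs = set()
    for (cx, cy), members in grid.items():
        nearby = []
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                nearby.extend(grid.get((cx + dx, cy + dy), []))
        for a, b in combinations(sorted(set(nearby)), 2):
            pairs.add((a, b))
    return pairs

# Two balls near each other, one far away: only the close pair is checked.
print(candidate_pairs({"a": (1, 1), "b": (2, 2), "c": (50, 50)}))
```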

------
assafmo
Cool! Sounds like there are a lot of shared concepts with CouchDB.

------
kaeluka
The expression "old thinking" really makes me cringe.

------
jankotek
I like the drawings in this article

------
cowardlydragon
Immutable append-only database with constantly processed analytics...

Why does it have its own stream processing when there are 5 or so already in
Apache?

~~~
hoprocker
The docs provide a pretty good comparison against the other major players,
Storm and Spark (and MUPD8, which I'm not familiar with):
[http://samza.apache.org/learn/documentation/0.11/comparisons...](http://samza.apache.org/learn/documentation/0.11/comparisons/introduction.html)

The tl;dr is that Samza is closest to Storm+Trident in delivery guarantees,
but is more flexible about how data can be passed around (pluggable
serializer/deserializer (serde) architecture), and offers an out-of-the-box
solution for state management (RocksDB).

------
irickt
(2015)

~~~
bb01100100
I'm a bit late to the party on this post, but a lot has changed since 2015,
too:

\- Kafka writes its consumer offsets into, er, Kafka, rather than Zookeeper,
which means ZK load is much lighter. I'd be interested in knowing more about
why people find ZK heavyweight - I've found it to be light, unobtrusive and
very robust. I also use it for managing state (distributed locks) as brokers
boot up and obtain a consistent broker ID from a defined list of available
'slots' (e.g. if broker 2 goes down, I want the replacement node to also boot
up and be assigned broker ID 2; if two nodes go down, I want those free slots
to be allocated to the next two nodes that boot up).

\- Kafka v0.10 introduced the Streams API, which has many similarities to
Samza but removes the requirement to run another cluster for your streaming
applications (no YARN requirement - phew! Kafka will load-balance your
consumer-group streaming app; kill/restart/boot as you see fit). Consume data
as streams and build virtual tables (persisted to disk via RocksDB for get/put
operations, written into Kafka for availability/persistence); join
stream->stream, stream->table, filter, map, aggregate via time windows, write
your own processor/transform classes.. it's very flexible without always
needing to dive into the lower level Processor API.
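The stream/table duality underneath all of that can be sketched language-agnostically; a toy Python analogue, with a dict standing in for the RocksDB-backed state store (names are illustrative, not the Streams API):

```python
# Toy stream->table aggregation: consume a stream of keyed events and
# maintain a count per key per tumbling time window, the way a Streams
# app keeps windowed state in RocksDB backed by a changelog topic.

from collections import defaultdict

WINDOW_MS = 60_000  # tumbling one-minute windows

def window_start(ts_ms):
    return ts_ms - (ts_ms % WINDOW_MS)

table = defaultdict(int)  # stand-in for the local state store

def consume(key, ts_ms):
    table[(key, window_start(ts_ms))] += 1

for key, ts in [("clicks", 1_000), ("clicks", 30_000), ("clicks", 61_000)]:
    consume(key, ts)

print(table[("clicks", 0)])       # 2 events in the first window
print(table[("clicks", 60_000)])  # 1 event in the second window
```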

\- Kafka v0.10.[01] also included an Interactive Queries API - what you get
when you realise that your stream apps have created intermediate
aggregates/tables of information that would be so handy to query (by other
apps, or internally, from across the cluster). I haven't used this API yet,
but anticipate doing so in 2017.

Samza to me always felt like it was in tech preview - not comprehensively
documented, and the only real source of knowledge was reading the source and
asking questions on the mailing lists (where there were excellent, high-
quality answers).

What other architectures are people using for streaming data pipelines/apps
(things that do more than count clicks, e.g. transforms, enrichment, etc.) -
i.e. lowish latency (event to output in less than, say, five seconds) where
relational databases aren't a natural solution?

I'm leveraging Kafka at my day job as a distributed data backbone (company-
wide focus - think application logs, app usage logs, usage-derived billing,
subscriptions, master data, etc), which when coupled with Streams (new breed
of application at this company) gives us our first steps towards event-driven
processing and data delivery as a stream of new information, rather than
periodic batches.

This provides us with an alternative to the API+database model we've used for
more than a decade and which works very well, but there are times when polling
isn't the answer. Decoupling producers of data from consumers has subtle but
substantial benefits (data liberation, for one, removing/reducing point-to-
point integrations for another).

