
Eventual Consistency isn’t for Streaming - arjunnarayan
https://materialize.io/eventual-consistency-isnt-for-streaming/
======
kqr
I agree with the other commenter. Eventual consistency has always been roughly
a synonym for "tactical lack of consistency." The reason this works is that
inconsistency is, in many business domains, not as big a deal as we make it
out to be. Most businesses are used to data lagging behind, documents being
filed incorrectly, decisions being changed while half of the documents still
refer to the old decision, to mention just a few possibilities. As long as
everything is dated and there are corroborating versions of all facts, this
can be untangled by experts in the few cases where it really matters. Most of
the time, it doesn't matter that much.

Eventual consistency is embracing this philosophy of a lack of consistency for
computer systems too, on the basis that maintaining actual consistency would
be too expensive/complex/slow, which is frequently the case.

This can of course, in principle, lead to ever-degrading consistency, and
since you can't assume everything is consistent, you also cannot really verify
consistency in any way other than heuristically, as another commenter
suggested.

Eventual consistency is a design driven by practical needs. It is never a path
to reach complete data purity.

And this applies both to streaming and batch tasks alike.

~~~
dustingetz
enterprises are rapidly approaching a data quality crisis: they have all these
data warehouses, but the final analytic artifacts end up being garbage and
unusable for data science ... you will be hearing a lot more of this in the
2020s

~~~
majormajor
A lot of this isn't related to data processing tools at all, but is a sort of
downstream effect of the predominant "bugs are cheap" mentality of today.

The fewer guarantees of correctness on your daily/weekly/whatever releases, the
messier your downstream data is gonna be. Monday's data is partially missing
due to a bug in the client; Tuesday's data is weird/nonrepresentative because
of a server bug that caused 5% of sessions to get disconnected; Wednesday's
data is good; Thursday's data is good but was a release day and the feature
changed so it means different stuff...

------
asdfasgasdgasdg
This article isn't very convincing to me. I mean, I one hundred percent buy
that eventually consistent stream processing systems can theoretically be
subject to unbounded error. But eventual consistency isn't just a theoretical
model. It's also a practical engineering decision, and so in order to evaluate
its use for any given business purpose we have to see how it performs in
practice. That is, what is the average/99.9%/max error? And we have to
understand how business-critical the correct answer is. This article has some
great examples of theoretical issues with eventually consistent stream
processing computation, but it doesn't demonstrate that any real systems
evince these problems under any given workload.

~~~
dominotw
> Not all is lost! There are stream processing systems that provide strong
> consistency guarantees. Materialize and Differential Dataflow both avoid
> these classes of errors by providing always correct answers

yeah, I was expecting to see what tradeoffs Materialize made to get an 'always
correct' result. There is definitely something 'lost' for 'always correct'
too.

I can only attribute this one-sided take to deviousness. Personally, I would
avoid whatever this company is selling.

~~~
irfansharif
Click around through the rest of their blog if you're interested in what those
tradeoffs are. Let's not spread FUD unnecessarily.

~~~
bradhe
Do you, like, work for Materialize? You're awfully connected to many of the
folks there and this is a pretty plant-like statement to make in the comments
section for a piece by the company.

~~~
dang
" _Please don 't post insinuations about astroturfing, shilling, brigading,
foreign agents and the like. It degrades discussion and is usually mistaken.
If you're worried about abuse, email hn@ycombinator.com and we'll look at the
data._"

[https://news.ycombinator.com/newsguidelines.html](https://news.ycombinator.com/newsguidelines.html)

[https://hn.algolia.com/?query=by:dang%20astroturf&sort=byDat...](https://hn.algolia.com/?query=by:dang%20astroturf&sort=byDate&dateRange=all&type=comment&storyText=false&prefix=true&page=0)

------
cs702
For more concise and precise explanations of the rationale for these kinds of
tools, see this paper: [https://github.com/TimelyDataflow/differential-dataflow/raw/...](https://github.com/TimelyDataflow/differential-dataflow/raw/master/differentialdataflow.pdf) \-- here's the abstract:

 _> Existing computational models for processing continuously changing input
data are unable to efficiently support iterative queries except in limited
special cases. This makes it difficult to perform complex tasks, such as
social-graph analysis on changing data at interactive timescales, which would
greatly benefit those analyzing the behavior of services like Twitter. In this
paper we introduce a new model called differential computation, which extends
traditional incremental computation to allow arbitrarily nested iteration, and
explain—with reference to a publicly available prototype system called
Naiad—how differential computation can be efficiently implemented in the
context of a declarative data-parallel dataflow language. The resulting system
makes it easy to program previously intractable algorithms such as
incrementally updated strongly connected components, and integrate them with
data transformation operations to obtain practically relevant insights from
real data streams._

See also this friendlier (and lengthier) online book:
[https://timelydataflow.github.io/differential-dataflow/](https://timelydataflow.github.io/differential-dataflow/)

~~~
virgilp
materialize.io is literally timelydataflow/differential-dataflow... same
product, developed by Frank McSherry. It's not "the other tool", it's the very
same.

~~~
cs702
Corrected. Thanks!

------
alextheparrot
I'm actually just fundamentally confused about what is being argued.

I'm familiar with streaming, as a concept, from the likes of Beam, Spark,
Flink, Samza - they do computations over data, producing intermediate results
consistent with the data seen so far. These results are, of course, not
necessarily consistent with the larger world because there could be
unprocessed or late events in a stream, but they are consistent with the part
of the world seen so far.

The advantage of streaming is the ability to compute and expose intermediate
snapshots of the world that don't rely on the stream closing (As many streams
found in reality are not bounded, meaning intermediate results are the only
realizable result set). These intermediate results can have value, but that
depends on the problem statement.
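
To make that concrete, here's a toy Python sketch (mine, not from the article
or any of these systems; the names and the in-memory list standing in for the
stream are just illustrative) of an aggregate that emits an intermediate
snapshot after every event, each one consistent with the prefix seen so far:

    from typing import Iterable, Iterator, Tuple

    def running_max(stream: Iterable[Tuple[str, int]]) -> Iterator[Tuple[str, int]]:
        """After each event, emit the (key, value) with the largest value seen so far.

        Every snapshot is consistent with the prefix of the stream consumed up
        to that point, even if the stream itself never ends.
        """
        best = None
        for key, value in stream:
            if best is None or value > best[1]:
                best = (key, value)
            yield best

    # An unbounded stream in principle; a finite list stands in for it here.
    events = [("a", 3), ("b", 7), ("c", 5)]
    for snapshot in running_max(events):
        print(snapshot)  # ('a', 3), then ('b', 7), then ('b', 7)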

To examine one of the examples, let's use example 2; it aligns with the idea
that we actually don't have a traditional streaming problem. The question
being asked is "What is the key which contains the maximum value?" There is a
difference between asking "What is the maximum so far today?" and "What was
the maximum result today?" \-- the tense change is important because in the
former the user cares about the results as they exist in the present moment,
whereas the latter cares about a view of the world in a time frame that is
complete. It seems like the idea of "consistent" is being conflated with
"complete", wherein "complete" is not a guaranteed feature of an input stream.

Could anyone clarify why the examples here aren't just a case of expecting
bounded vs. unbounded streams?

~~~
arjunnarayan
The argument is that without stronger consistency guarantees you can't do
joins between two streams (or even something like argmax over a single stream,
since it splits the stream into two subcomputations, which then have to be
joined back together).
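
Here's a deliberately silly Python sketch of what I mean (not Materialize
code; the list and the two "views" of it are stand-ins): argmax becomes a max
plus a join back against the data, and if the two halves have seen different
prefixes of the stream you can get an answer that was never true at any point:

    events = [("a", 3), ("b", 7), ("b", 2), ("c", 9)]

    def argmax_join(prefix_seen_by_max, prefix_seen_by_join):
        # "Which key holds the maximum value?" as max + join-back.
        m = max(v for _, v in prefix_seen_by_max)
        return [k for k, v in prefix_seen_by_join if v == m]

    # Both halves have processed the same prefix: internally consistent.
    print(argmax_join(events[:2], events[:2]))  # ['b']

    # The max side has raced ahead of the join side (an eventually consistent
    # read of the two derived views): no key matches at all, an answer that no
    # prefix of the stream ever justified.
    print(argmax_join(events, events[:2]))      # []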

I think when folks say that eventual consistency is okay, they're thinking
about simple aggregates - where transient incorrectness in the result is
indistinguishable from noise.

But if you want to do joins, you really want to be able to reason about your
unbounded streams causally - Flink, Beam, (and as another commenter points
out, Firebase as well) provide stronger consistency guarantees on computations
over unbounded streams.

~~~
PeterCorless
This, of course, presumes that people want to do JOINs. Wouldn't eventual
consistency be mostly a NoSQL thing, where joins are not an issue?

------
nikhilsimha
In both examples 2 and 3, the author reads the same stream twice
_independently_ and assumes that a join is not synchronized between the
transformed streams. This seems like a fundamental flaw in their offering.

Pushing a timestamp along with the max/variance change stream [1], and then
using that timestamp to synchronize the join [2], would naturally produce a
consistent output stream.
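
Something like this toy Python version of the idea (not the Flink API; the
names are made up): attach the event time to the aggregate's change stream and
only join the two sides at matching times, so no prefix-mixing can happen:

    events = [(1, "a", 3), (2, "b", 7), (3, "b", 2), (4, "c", 9)]  # (time, key, value)

    def max_changelog(events):
        """Yield (time, max_so_far): the aggregate carries the time it is valid for."""
        m = None
        for t, _, v in events:
            m = v if m is None else max(m, v)
            yield (t, m)

    def argmax_at(events, changelog):
        """Join the two streams on time: for each t, report keys whose value
        equals the max as of that same t, never mixing different prefixes."""
        max_as_of = dict(changelog)
        seen = []
        for t, k, v in events:
            seen.append((k, v))
            yield (t, [key for key, val in seen if val == max_as_of[t]])

    for t, keys in argmax_at(events, max_changelog(events)):
        print(t, keys)  # 1 ['a'], 2 ['b'], 3 ['b'], 4 ['c']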

I cited Flink because they have the best docs around, but it should be
possible in most streaming systems. Disclaimer: I used to work for the FB
streaming group and have collaborated with the Flink team very briefly.

[1] [https://ci.apache.org/projects/flink/flink-docs-stable/dev/t...](https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/streaming/dynamic_tables.html#table-to-stream-conversion)

[2] [https://ci.apache.org/projects/flink/flink-docs-release-1.11...](https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/table/streaming/joins.html#event-time-temporal-joins)

~~~
rkhaitan
The aim of the examples is to show what goes wrong in eventually consistent
systems where it's possible that two reads of a stream may not be consistent
with respect to each other. The examples are not intended to say that such
anomalies can't be fixed by providing stronger consistency guarantees, for
example via timestamps.

------
dekimir
> you should be prepared for your results to be never-consistent

Isn't this a core feature of distributed systems? How can you be "consistent"
if there's a network failure between some writer and the stream? How can you
tell a network failure from a network delay? How can you tell a network delay
from any other delay?

And finally, how can you even talk about "up-to-date" data if the reader
doesn't provide their "date" (ie, a logical timestamp)?

~~~
jlokier
> Isn't this a core feature of distributed systems? How can you be
> "consistent" if there's a network failure between some writer and the
> stream? How can you tell a network failure from a network delay? How can you
> tell a network delay from any other delay?

This is covered by the CAP theorem.
[https://en.wikipedia.org/wiki/CAP_theorem](https://en.wikipedia.org/wiki/CAP_theorem)

The basic solution is: If you need consistency and there's _too much_ network
failure (or delay), you'll have to pause operations and wait until the network
is fixed.

If there's only a bit of network failure (or delay), consistency stays
possible using quorum protocols such as Paxos and Raft.
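
The quorum intuition in toy Python (just a model of the overlap property, not
Paxos or Raft themselves; names are invented): with N replicas, if writes need
W acks, reads need R replies, and W + R > N, then every read quorum overlaps
every write quorum, so the newest value is always seen, and too many failures
just mean waiting:

    N, W, R = 5, 3, 3  # W + R > N, so read and write quorums must overlap
    replicas = [{"version": 0, "value": None} for _ in range(N)]

    def write(value, version, reachable):
        if len(reachable) < W:
            raise RuntimeError("too much failure: must pause and wait")
        for i in reachable[:W]:
            replicas[i] = {"version": version, "value": value}

    def read(reachable):
        if len(reachable) < R:
            raise RuntimeError("too much failure: must pause and wait")
        replies = [replicas[i] for i in reachable[:R]]
        return max(replies, key=lambda r: r["version"])["value"]

    write("x = 1", version=1, reachable=[0, 1, 2])  # replicas 3 and 4 are partitioned away
    print(read(reachable=[2, 3, 4]))                # 'x = 1', the quorums overlap at replica 2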

> how can you even talk about "up-to-date" data if the reader doesn't provide
> their "date" (ie, a logical timestamp)?

Implicit causality helps.

You're right that there may be no definite logical time, but it often doesn't
matter.

When a program issues a read command, the logical timestamp is, implicitly,
greater than the timestamp of all results previously received from the network
that were inputs to produce the read command.

So the rest of the network "knows" something about the logical time of the
read command. It's not an exact logical time, and if the timestamps aren't
passed around, it might not even be an inequality. It's more like a logical
property that relates dependent values.

If done right, that's enough to ensure strict consistency in observable
results.

Unless the program issuing reads does wild things with value speculation. You
may have heard how much things can go wrong with speculative execution...
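
A very rough sketch of that causal-context idea in Python (invented names,
single process, so only the shape of it): every result carries a logical
timestamp, and a later read derived from those results refuses to be served by
state older than what the reader has already seen:

    class Replica:
        def __init__(self):
            self.logical_time = 0
            self.data = {}

        def write(self, key, value):
            self.logical_time += 1
            self.data[key] = (value, self.logical_time)
            return self.logical_time

        def read(self, key, min_time=0):
            # A real replica would wait or redirect until it catches up; here
            # the causal requirement is just made explicit.
            if self.logical_time < min_time:
                raise RuntimeError("replica is behind the reader's causal context")
            return self.data[key]

    r = Replica()
    t = r.write("balance", 100)
    value, seen_at = r.read("balance", min_time=t)  # must reflect the write we observed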

------
anonymousDan
There's been plenty of work in the past on weaker correctness guarantees for
stream processing systems (e.g. concepts like rollback and gap recovery from
Aurora). Not sure it's an either/or between eventual consistency and strong
consistency.

------
satyrnein
Side question - has anyone tried using Materialize beyond toy workloads? Can I
move billions of rows off of a batch workflow on Snowflake onto Materialize
and suddenly everything is near realtime?

------
DevKoala
I keep falling for these clickbait titles in the hopes I will find a fair
argument. However, the moment I realize the article is trying to sell me a
product based around an argument, I lose faith in the perspective of the
writer.

If the title were something more honest such as “How product X solves for Y”
I’d feel more compelled to trust that the analysis is objective.

------
tlarkworthy
Firebase provides causal consistency. By subscribing to streams (listen), the
client opts into which data sources it wants consistent snapshots of; then all
distinct client streams are bundled up and delivered in order over the wire.
It's a very elegant model which does not get in the way and has nice
ergonomics.

------
andrekandre
so, if i understand the article correctly, for purposes of realtime
reporting/monitoring (streaming, as stated), eventual consistency is not an
appropriate "store" to hook into, because you can't know when things have
become consistent, and reliable streaming of (near?) realtime data requires
some chance for that to occur

is that a correct interpretation?

------
erikerikson
TL;DR: accessing materializations is necessarily a snapshot.

This article reads as though the author hadn't shifted mindset from "the
database will solve it for me" to "I'm taking on the relevant subset of
problems in my use case". This seems off given that they're trying to sell a
streaming product. They claim their product avoids problems by offering
"always correct" answers, which requires a footnote at the very least, but
none was given.

Point of note: the consistency guarantee is that, upon processing to the same
offset in the log, and given that you have taken no other non-constant input,
you will have the same computational result as all other processes executing
semantically equivalent code.

I take this sort of comment as abusive of the reader:

> What does a naive application of eventual consistency have to say about
>
> \-- count the records in `data`
> select count(*) from data
>
> It’s not really clear, is it?

A naive application of eventual consistency declares that, along some
equivalent of a Lamport timestamp across the offsets of shards in the stream,
the system will calculate a count of the records in `data` as of that offset.
Given the ongoing transmission of events that can alter the set `data`, that
value will continue changing as appropriate and in a manner consistent with
the data it processes. The new answer will be given when the query is run
again, or the system may even issue an ongoing stream of updates to that
value.
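
In toy Python terms (my reading of it, not anyone's actual implementation):
the count is always paired with the frontier of per-shard offsets it is
consistent with, and each new event yields a new pair rather than one final
answer:

    from collections import defaultdict

    def counting(stream):
        """stream yields (shard, offset, record); emit each count together with
        the frontier of per-shard offsets it is consistent with."""
        frontier = defaultdict(int)
        count = 0
        for shard, offset, _record in stream:
            frontier[shard] = offset
            count += 1
            yield dict(frontier), count

    events = [("s0", 1, "a"), ("s1", 1, "b"), ("s0", 2, "c")]
    for frontier, count in counting(events):
        print(frontier, count)
    # {'s0': 1} 1
    # {'s0': 1, 's1': 1} 2
    # {'s0': 2, 's1': 1} 3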

Maybe it got better as the article went on...

~~~
erikerikson
I appreciate that the downvote mechanism is low friction. I wish it were
easier to learn and improve from it too.

~~~
quodlibetor
I didn't downvote you, but as an outside observer I can see that folks might
take issue with the fact that you start with a tl;dr, suggesting that you are
summarizing the whole article, followed by an in-depth analysis, and then end
by stating that you didn't read the whole thing.

~~~
erikerikson
Thank you very much. I was really excited to read the article and became very
disappointed but my take was too hot and insufficient. I definitely should
have done better. Thank you for expanding my perspective.

------
ecopoesis
Almost every distributed system (including "simple" client-server systems) is
eventually consistent. And all systems are distributed.

It's great that your DB is ACID and anyone who queries it gets the latest and
greatest, but in reality you also have out-of-date caches, ORM models that
haven't been persisted, apps where users are modifying data that hasn't been
pushed back to the server, and a million other examples.

I'm sure it's possible to create a consistent system but I'm also sure it's
not practical. No one does it.

Instead of constantly fighting eventual consistency just learn to embrace it
and its shortcomings. Design systems and write code that are resilient to
splits in HEAD and provide easy methods to merge back to a single truth.

~~~
bcrosby95
There is a huge difference between having an ACID store of Truth surrounded by
eventual consistency, vs. making even your store of Truth eventually
consistent. You're basically doubling or tripling your work for any given
constraint because you have to both monitor for after-the-fact violations and
build in a way to resolve those violations.

This is on top of regular "nope, can't do that" code that you would write in
both systems.

~~~
lmm
It's just the opposite, in my experience. If you have an ACID database that's
supposed to represent the current state of the world, you have to handle both
transaction rollbacks and logical inconsistencies. If you have a streaming
system where you record an event log and generate the current state of the
world from that, you already have the logic to recover from inconsistencies
and can reuse it.
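
For example, in sketch form (plain Python, hypothetical event shapes): if the
current state is only ever a fold over the event log, then "recovering from an
inconsistency" is just throwing the derived state away and re-running the same
fold:

    def apply(state, event):
        kind, account, amount = event
        balances = dict(state)
        if kind == "deposit":
            balances[account] = balances.get(account, 0) + amount
        elif kind == "withdraw":
            balances[account] = balances.get(account, 0) - amount
        return balances

    def current_state(event_log):
        state = {}
        for event in event_log:
            state = apply(state, event)
        return state

    log = [("deposit", "alice", 100), ("withdraw", "alice", 30)]
    print(current_state(log))      # {'alice': 70}
    # If a derived view is suspect, discard it and replay; any prefix of the
    # log is itself a valid snapshot.
    print(current_state(log[:1]))  # {'alice': 100}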

