
Streaming: a skill gap? - michalc
https://charemza.name/blog/posts/streaming/data/streaming-not-just-for-big-data/
======
falcolas
Any time streaming data comes up, I want to point people towards some of the
existing research and conceptual tooling under the name of "dataflow
processing", which is functionally equivalent to stream processing.

[https://en.wikipedia.org/wiki/Dataflow](https://en.wikipedia.org/wiki/Dataflow)

There are a lot of interesting ideas there we can still use to this day, albeit with
some changes to make them work with modern programming languages and data
encoding frameworks.

~~~
mrdoops
This area of research is fascinating - my favorite part of researching this
area over the last 6 months has been finding DAGs everywhere. More or less,
dataflow models work by lifting the computational dependencies into data and
breaking each compute step into pieces small enough that they're composable
and can be replicated across machines arbitrarily. The dependencies between
steps are usually modeled as some kind of DAG: in the simple case, the output
of one step feeds into another; in more complex cases you have logical (if
this then that) dependencies for error handling and branching.
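
A minimal sketch of that shape (the names and structure here are made up, not
taken from the papers below): each step is plain data declaring its
dependencies, and a runner walks the DAG in dependency order.

    // Each compute step is data: a name, its dependencies, and a function.
    type Step = { name: string; deps: string[]; run: (inputs: number[]) => number };

    // Listed in topological order, so each step's deps are already computed.
    const steps: Step[] = [
      { name: "source", deps: [], run: () => 10 },
      { name: "double", deps: ["source"], run: ([x]) => x * 2 },
      { name: "addOne", deps: ["source"], run: ([x]) => x + 1 },
      { name: "sum", deps: ["double", "addOne"], run: ([a, b]) => a + b },
    ];

    const results = new Map<string, number>();
    for (const step of steps) {
      const inputs = step.deps.map((d) => results.get(d)!);
      results.set(step.name, step.run(inputs));
    }
    console.log(results.get("sum")); // 31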

I think the Dataflow paper:
[https://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf](https://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf)

and the MillWheel paper in particular are good reads about the problem:
[https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/41378.pdf](https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/41378.pdf)

What's especially interesting to me is that the same approach is used for
rule/workflow engines. Although the use cases are somewhat different and the
graph structure isn't usually the same, they're still modeling compute steps
in a graph, just with more conditional logic.

~~~
dmux
Is DAG in this context "Directed acyclic graph"?

~~~
mpfundstein
Indeed

------
cultofmetatron
I've run into plenty of situations where a streaming approach would be faster,
but its complexity always necessitates first building a slower conventional
version (wait for all the data to load into memory, then operate on it). The
conventional approach is easier to debug and get working, and 90% of the time
the gains from streaming aren't worth the added effort.

Generally you only really get value from it when you're processing a huge data
set continuously, or modifying data as it's being sent to the user over a
websocket for a fairly lengthy stretch of time.

In terms of business value, a cron job running on a high-memory VPS will more
than satisfy and take much less time to develop.
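
To make the trade-off concrete, here's a minimal Node sketch of the two
approaches (the file name is made up):

    import { createReadStream, readFileSync } from "node:fs";
    import { createInterface } from "node:readline";

    // Conventional: slurp everything into memory, then operate. Easy to debug.
    const slurped = readFileSync("data.txt", "utf8").split("\n").length;

    // Streaming: constant memory, but more moving parts to get right.
    let streamed = 0;
    const rl = createInterface({ input: createReadStream("data.txt") });
    rl.on("line", () => streamed++);
    rl.on("close", () => console.log({ slurped, streamed }));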

~~~
Jach
> In terms of business value, a cron job running on a high-memory VPS will
> more than satisfy and take much less time to develop.

Yeah, but I hate it... I've worked with a team where we had a cron job do some
batch processing every night, but for some large customers it started taking
~12-15 hours to complete, and certain important user operations were locked
while it was running. The solution? Running it once per week starting on the
weekend, with a manual trigger for customers who really needed the results
ASAP. Tiny effort for that easy fix, the team could continue working on new
features, and customers of all sizes were still mostly happy, but Dijkstra
would not have liked this...

~~~
cultofmetatron
I would say this is the 5% of the time where it suddenly becomes worth it to
streamify. You'll already have code showing what transformations need to be
done. The point is to avoid premature optimization.

------
nwhitehead
My favorite is Welford's Algorithm [1] which lets you compute mean and
standard deviation in one pass. Every programmer should be aware of it.

[1]:
[https://en.wikipedia.org/wiki/Algorithms_for_calculating_var...](https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance)
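
In case it's useful, here's a minimal sketch of it (the standard formulation
from [1]):

    // Welford's algorithm: one-pass mean and variance, no second scan needed.
    function welford(xs: Iterable<number>): { mean: number; variance: number } {
      let n = 0, mean = 0, m2 = 0;
      for (const x of xs) {
        n += 1;
        const delta = x - mean;
        mean += delta / n;
        m2 += delta * (x - mean); // note: uses the already-updated mean
      }
      return { mean, variance: n > 1 ? m2 / (n - 1) : 0 }; // sample variance
    }

    console.log(welford([2, 4, 4, 4, 5, 5, 7, 9]));
    // { mean: 5, variance: 4.571... }  (sum of squared deviations 32, n-1 = 7)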

~~~
mc13
Exactly. I deal with a few data processing programs and wish my more
experienced colleagues had told me about these algorithms earlier. But doesn't
this algorithm only compute an approximation and not the exact value?

~~~
anonymoushn
To whatever extent a 64-bit float can ever be said to contain an exact value,
this algorithm will compute the exact value. There's no sampling.

------
commandlinefan
If you've been around long enough, you learned to program in a severely
memory-constrained environment: any solution _besides_ a streaming solution
was unworkable.

~~~
zwieback
Yeah, I'm glad I lived through those days but also glad they are mostly over.

------
gen220
We have built a few streaming pipelines at my current place of work, with the
standard tech (Confluent Kafka, a self-hosted schema registry, etc.) operated
for us by SREs.

For our use case (messages never expire, consuming logic is complex and
evolves every couple of months necessitating a full replay, we’re the only
consumer, a few hundred thousand messages per year), we’ve decided that it
just isn’t worth the trade-offs vs. storing temporal data in Postgres.

The place where streaming seems to be most useful for us is for piping data
from one team to any other team. That way, no one needs to quibble over
database permissions, we can just hand over the topic name and say “have fun”.

For internal workloads, the pain points are the well-known weaknesses of
streaming: quick and easy queries on the data in the queue, ergonomics around
schema evolution and topic offsets, exactly-once delivery with n>1 partitions
and shards, and ZooKeeper crapping out and dropping messages for god knows
what reason.

These are all problems that Postgres solves for free.

I think streaming makes sense between teams, or if your workload is
significantly different from ours, but that’s just my two cents.

Kafka and friends are for sure useful tools, but they are not the be-all and
end-all they’re sometimes made out to be by people who don’t have to deal with
them day in and day out.

------
allenu
As the author alludes to at the end, streaming is a skill, and I think that
really is a big issue. A lot of us, myself included, don't immediately reach
for streaming as a solution to our problems, even if it can be.

For instance, I'm working on a file-based, CRDT-style, distributed journaling
database (single-user but multi-device) where the "diffs" for a single device
are stored in a journal. In retrospect, it's obvious that I don't need to
store the entire journal of diffs in memory (that's the "streaming"
architecture), but on my first couple of passes, because it was easier, I
stored everything in memory.

After playing around with some toy projects using my database, I realized this
wouldn't scale (why load all of history into memory if you don't need it?),
and moreover that my system lends itself better to a "file stream", i.e. don't
store everything in memory, just append to existing files or read from them as
needed. It seems obvious now, but when building it out, my first inclination
wasn't to do it that way.
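
The general shape of that (purely illustrative, not the actual design) is
append-only writes plus a streamed replay:

    import { appendFileSync, createReadStream } from "node:fs";
    import { createInterface } from "node:readline";

    // Append one diff as a JSON line; no history needs to be in memory.
    function appendDiff(journalPath: string, diff: object): void {
      appendFileSync(journalPath, JSON.stringify(diff) + "\n");
    }

    // Replay the journal one diff at a time when state must be rebuilt.
    async function replay(journalPath: string, apply: (diff: object) => void) {
      const rl = createInterface({ input: createReadStream(journalPath) });
      for await (const line of rl) apply(JSON.parse(line));
    }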

------
JMTQp8lwXL
In addition to streams, buffers confused me for longer than they should have.
Unlike streams, which have an indefinite length, buffers are a fixed-size
amount of binary data.
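
In Node terms, for instance (the file name is made up):

    import { Buffer } from "node:buffer";
    import { createReadStream } from "node:fs";

    // A Buffer is a fixed-size chunk of bytes, allocated up front.
    const buf = Buffer.alloc(16); // exactly 16 bytes, now and forever
    buf.write("hello");

    // A stream has no fixed length; it just emits Buffers chunk by chunk.
    createReadStream("some-file.bin").on("data", (chunk) => {
      console.log(`got ${(chunk as Buffer).length} bytes`);
    });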

------
mmis1000
Stream buffer sizes are handled in Node.js using a mechanism called
backpressure.

[https://en.m.wikipedia.org/wiki/Backpressure_routing](https://en.m.wikipedia.org/wiki/Backpressure_routing)

And the Node.js site even has an article about it.

[https://nodejs.org/es/docs/guides/backpressuring-in-streams/](https://nodejs.org/es/docs/guides/backpressuring-in-streams/)

However, even though the Node.js runtime already provides mechanisms to let
you handle backpressure easily, most people still don't use them, and their
programs crash when the streaming target is stuck for whatever reason.
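
For example, a minimal sketch of those built-in mechanisms (the file names are
made up): pipeline() wires backpressure up for you, and the manual alternative
is to respect write()'s return value.

    import { createReadStream, createWriteStream } from "node:fs";
    import { Writable } from "node:stream";
    import { pipeline } from "node:stream/promises";

    // The easy path: pipeline() propagates backpressure (and errors) end to
    // end, pausing the read side whenever the write side's buffer is full.
    await pipeline(
      createReadStream("input.bin"),
      createWriteStream("output.bin"),
    );

    // The manual path: respect write()'s return value and wait for 'drain'.
    async function writeWithBackpressure(dest: Writable, chunk: Buffer) {
      if (!dest.write(chunk)) {
        await new Promise<void>((resolve) => dest.once("drain", () => resolve()));
      }
    }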

------
avi_das
I love articles like this on Hacker News that teach something specific.

~~~
lma21
Got any other similar links, by any chance? :)

------
Const-me
I'm not sure it's a skill gap; more likely it's language + runtime support.

When I program C#, I use streaming quite a lot, often combined with async-
await for I/O. The framework also helps, e.g. all compression/encryption
algorithms support asynchronous versions.

When I program C++, I tend to avoid streaming at all costs. The ergonomics are
just not there. It's technically doable, but it turns the code into callback
hell: hard to debug and expensive to support.
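
For comparison, the same ergonomics exist in Node, where streams are async
iterables that compose with async/await (the file name is made up, and this
illustrates with zlib rather than the C# APIs above):

    import { createReadStream } from "node:fs";
    import { createGzip } from "node:zlib";

    // Stream a file through gzip, consuming it with async iteration
    // instead of callbacks.
    async function* gzipped(path: string): AsyncGenerator<Buffer> {
      const source = createReadStream(path).pipe(createGzip());
      for await (const chunk of source) yield chunk as Buffer;
    }

    let total = 0;
    for await (const chunk of gzipped("input.txt")) total += chunk.length;
    console.log(`compressed size: ${total} bytes`);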

------
pornel
I'm jaded. It takes one non-streaming detail in the entire system to ruin the
whole thing.

Some data formats need a length written in the header, or reading them
requires metadata stored after the data. Some algorithms need two passes, or
can only be streamed over columns when you're receiving rows.

------
foreigner
I recently rewrote some streaming code to slurping. While there are
theoretical advantages to streaming, unless you really really need them the
drawbacks are not worth it. My comment in the code was "Streaming is hard,
memory is cheap."

~~~
JMTQp8lwXL
It really depends on the problem space. Mobile phones and browsers only have
so much memory available, so you have to be mindful of memory consumption
there.

------
marsdepinski
This is also called buffering, where the buffer size is in bytes or messages
and the size of the string/stream being processed through the buffer is
indefinite. Just new words to describe the way Unix has worked for decades.

------
ampdepolymerase
The issue is that the structures to handle streaming are often not first class
(in languages other than Go, Rust, and JavaScript). And when they are, the
third-party interfaces are not. Eventually your data has to reach the DB, and
that becomes the bottleneck. Nobody wants to use KV storage all the time; SQL
systems and ORMs need to step up their concurrency game.

~~~
falcolas
SQL systems on pretty boring hardware are capable of millions of transactions
per second. The schema must simply be designed with concurrent modification in
mind (fewer mutations and fewer long-lived transactions).

