Hacker News new | past | comments | ask | show | jobs | submit login
Streaming: a skill gap? (charemza.name)
137 points by michalc on Feb 3, 2020 | hide | past | favorite | 36 comments

Any time streaming data comes up, I want to point people towards some of the existing research and conceptual tooling under the name of "dataflow processing", which is functionally equivalent to stream processing.


There's a lot of interesting ideas which we can use to this day, albeit with some changes to work with modern programming languages and data encoding frameworks.

This area of research is fascinating - my favorite part of research in this area last 6 months is finding DAGs everywhere. More or less Dataflow models work by lifting the computational dependencies into data and breaking each compute step into a small enough piece so that they're composable and can replicate across machines arbitrarily. The dependencies between each step is usually modeled as some kind of DAG in a simple case the output of one step feeds into another. In more complex cases you have logical (if this then that) dependencies for error handling and branching cases.

I think the Dataflow paper: https://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf

and the MillWheel paper in particular are good reads about the problem: https://static.googleusercontent.com/media/research.google.c...

What's especially interesting to me is that the same approach is used for rule/workflow engines. Although the use-cases are somewhat different and the graph structure isn't usually the same - they're still modeling compute steps in a graph just with more conditional logic.

Is DAG in this context "Directed acyclic graph"?


I've run into plenty of situations where a streaming approach would be faster. The complexity of it always necessitates making a slower conventional version. (wait for all the data to load into memory and the operate on it) the conventional approach is easier to debug and get working. 90% of the time, the gains from streaming aren't worth the added effort.

Generally you only really get value for it when you're processing a huge data set continuously or modifying data as its being sent to the user over a websocket for a fairly lengthy bit of time.

In terms of business value, a cron job running in a high memory vps will more than satisfy and take much less time to develop.

> 90% of the time, the gains from streaming aren't worth the added effort.

I gotta disagree with that estimate. Virtually any time I have a backend service operating on (mostly) arbitrarily-sized user input, I use streaming so that I can make better guarantees about how much memory my service needs. This, in turn, lets you give your customers much higher service limits (unless you want to scale your fleet's memory just to handle 100th-percentile style use cases).

The number of times I've seen backend services fall over, with a heap graph that looks like a repeated sawtooth pattern to OOM, because a customer's objects were unusually sized (but within limits..)..

Yeah this is an important accidental DOS vector, and streaming APIs are a classic way to fix them.

But you do have to be careful that you're not just overloading some other system (like consuming disk space with files that don't need to be retained). Keep good stats on all of your exhaustible resources, kids.

"90% of the time, the gains from streaming aren't worth the added effort."

I... won't go so far as to say "I think", but "I have a pet theory" that part of the reason for this is actually effect rather than cause. That is, developers generally do not think in streaming, so they build libraries that are based on doing things non-streaming, which have libraries built on top of them that assume non-streaming, which have frameworks built on top of them that assume non-streaming, etc. etc. and so on, and the end result is that it's just way harder to get streaming working than it would be if more developers were comfortable with it.

The web world even more so, which for pretty much its entire run has been conceptualized by developers as returning chunks of content, even though the tech nominally had more streaming support than that, being (until recently) TCP sockets under the hood. Web developers even made it a virtue that once a chunk was emitted, all context was dropped on the floor. (I see this as less a virtue than an accidental way old CGI stuff worked that got raised into a requirement.)

Historically speaking, only the minimal things that needed to support streaming to work at all supported it. I am seeing a slow trend towards more streaming-thinking though, and it's getting easier to stream things.

This is an explanation of why I think the quoted text is true, not a disagreement. I think in a more perfect world it wouldn't be true, and I have hope that it won't be true in the medium-term future, but today it often is, depending on details of your local environment.

> In terms of business value, a cron job running in a high memory vps will more than satisfy and take much less time to develop.

Yeah, but I hate it... I've worked with a team where we had a cron job do some batch processing every night, but for some large customers it started taking ~12-15 hours to complete, and certain important user operations are locked while it's running. The solution? Running once per week starting on the weekend, with a manual trigger for customers who really need the results ASAP. Tiny effort for that easy fix and the team can continue working on new features, all-sized customers are still mostly happy, but Dijkstra would not have liked this...

I would say this is the 5% of the time where it suddenly becomes worth it to streamify. you'll already have code showing what transformations need to be done. The point is to avoid premature optimization.

How long until the cron start takes 70 hours to finish?

Hah! One of my favourite topics: The Gentle Tyranny of Call/Return.

I am still writing it up, but it looks like we are currently stuck in the call/return architectural style or even paradigm. Meaning all our languages offer what are in essence variations of call/return, be they subroutines, procedures, function or methods.

However, a lot of the problems we need to solve or systems we want to build do not conform to this pattern. Probably the majority by now. When we have such a system, we have choice to make, with two bad options on offer: either conform to the system/problem, therefore having something that constantly grates against the language/environment, or conform with call/return and grate against the problem you're trying to solve.

Streaming is an example of this. I presented Standard Object Out: Streaming Objects with Polymorphic Write Streams at DLS '19, which shows some of the nasty effects and the start of a solution.

I also addressed this problem more generally at last summer's ESUG '19, talking about Objective-Smalltalk:


Objective-Smalltalk makes it possible to express systems (I hesitate to even call them programs) in non-call/return styles (such as dataflow/streaming) as naturally as call/return systems and without giving up interoperability with the (large) call/return world.

> The complexity of it always necessitates making a slower conventional version.

I agree with this, but I feel that in most cases it's not nessecary complexity. It comes from poor APIs that don't make streaming easy, or mismatch between push-oriented ("pass me each new chunk as it arrives") and pull-oriented ("give me a queue/file/iterator that will yield chunks").

Processing time becomes a problem as well when the depth of the call tree across process boundaries starts to climb.

Each service in turn has to request, receive, parse, process, and emit the data. Bandwidth and CPU time turn into latency. Those can start to add up.

Assuming you can stream, doing so in this particular scenario will also improve the latency, not just throughput.

Most programming languages support generator semantics, letting you do development monolithically while still allowing for easy refactoring into streaming components. You simply have to include the goal of using generator semantics into the design up front.

My favorite is Welford's Algorithm [1] which lets you compute mean and standard deviation in one pass. Every programmer should be aware of it.

[1]: https://en.wikipedia.org/wiki/Algorithms_for_calculating_var...

Exactly, i deal with a few of data processing programs, wish my more experienced colleagues told me about these algorithms before, but doesn't this algorithm only compute an approximation and not the exact value ?

To about the extent that a 64-bit float can ever be said to contain an exact value, this algorithm will compute the exact value. There's no sampling.

If readers would like to read an implementation, one is here: https://github.com/sharpobject/runningstats

It includes skewness and kurtosis.

If you've been around long enough, you learned to program in a severely memory-constrained environment: any solution _besides_ a streaming solution was unworkable.

Yeah, I'm glad I lived through those days but also glad they are mostly over.

We have built a few streaming pipelines at my current place of work, with the standard tech (kafka-confluent, a self-hosted Schema registry, etc) operated for us by SREs.

For our use case (messages never expire, consuming logic is complex and evolves every couple of months necessitating a full replay, we’re the only consumer, few 100k messages per year), we’ve decided that it just isn’t worth the trade offs vs storing temporal data in Postgres.

The place where streaming seems to be most useful for us is for piping data from one team to any other team. That way, no one needs to quibble over database permissions, we can just hand over the topic name and say “have fun”.

For internal workloads, the pain points are the ones where streaming is known to have faults: quick and easy queries on the data in the queue, ergonomics around schema evolution and topic offsets, exactly once delivery with n>1 partitions and shards, Zookeeper crapping out and dropping messages for god knows what reason.

These are all problems that Postgres solves for free.

I think streaming makes sense between teams, or if your workload is significantly different from ours, but that’s just my two cents.

Kafka and friends are for sure useful tools, but they are not the be all end all they’re sometimes made out to be by people who don’t have to deal with them day in and out.

As the author alludes to at the end, streaming is a skill, and I think that really is a big issue. A lot of us, myself included, don't immediately reach for streaming as a solution to our problems, even if it can be.

For instance, I'm working on a file-based CRDT-style distributed, journaling database (for single-user but multi-device) and the "diffs" for a single device are stored in a journal. Now in retrospect, it's obvious that I don't need to store the entire journal of diffs in memory (this is a "streaming" architecture), but on my first couple of passes, because it was easier, I stored everything in memory.

After playing around with some toy projects using my database, I realized this wouldn't scale (why load all of history into memory if you don't need it?) and not only that my system lends itself better to a "file stream", i.e. don't store everything in mem, just append to existing files or read from them as needed. Seems obvious now, but when building it out, my first inclination wasn't to do it that way.

In addition to streams, Buffers confused me for longer than they should have. Unlike streams, which have an indefinite length, buffers are a fix-sized amount of binary data.

The buffer size handling of stream is is handling in nodejs using a mechanism called backpressuring.


And nodejs's site even has a article for it.


However, even the nodejs runtime itself already provide mechanism to let you handle backpressuring easily. Most people still don't do and make their program crash when streaming target is stuck for whatever reason.

I love articles like this in hacker news which teaches something specific

Got any other similar links by any chance :) ?

I'm not sure it's skill, more likely languages + runtime support.

When I program C#, I use streaming quite a lot, often combined with async-await for I/O. The framework also helps, e.g. all compression/encryption algorithms support asynchronous versions.

When I program C++, I tend to avoid streaming at all cost. The ergonomic is just not there. Technically doable, but will turn the code into callback hell, hard to debug and expensive to support.

I'm jaded. It takes one non-streaming detail in the entire system to ruin the whole thing.

Some data formats need writing a length in the header. Or reading requires metadata stored after the data. Some algorithms need two passes, or can be streamed over columns when you're getting rows.

I recently rewrote some streaming code to slurping. While there are theoretical advantages to streaming, unless you really really need them the drawbacks are not worth it. My comment in the code was "Streaming is hard, memory is cheap."

It really depends on the problem space. Mobile phones and browsers only have so much memory available so you have to be mindful of memory consumption there.

Memory can be relatively cheap (at least up to the 1TB point), but you still need to account for hard software development problems - such as populating that large blob of data into memory from across the network, in a fault tolerant fashion.

This sounds quite sad.

This is also called buffering where buffer size is in bytes or messages. The string/stream size being processed using the buffer is undefined. Just new words to describe the way Unix has worked for decades.

The issue is that the structures to handle streaming are not often first class (in languages other than Go, Rust, and JavaScript). And when they are, the the third party interfaces are not. Eventually your data will have to reach the DB and it would bottleneck. Nobody wants to use KV storage all the time, SQL systems and ORMs need to step up their concurrency game.

SQL systems on pretty boring hardware are capable of millions of transactions per second. The schema must simply be designed with concurrent modifications in mind (less mutation and long lived transactions).

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact