There are a lot of interesting ideas we can still use to this day, albeit with some changes to work with modern programming languages and data-encoding frameworks.
I think the Dataflow paper (https://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf) and the MillWheel paper in particular are good reads on the problem: https://static.googleusercontent.com/media/research.google.c...
What's especially interesting to me is that the same approach is used for rule/workflow engines. Although the use cases are somewhat different and the graph structure isn't usually the same, they're still modeling compute steps in a graph, just with more conditional logic.
Generally you only really get value from it when you're processing a huge data set continuously, or modifying data as it's being sent to the user over a websocket for a fairly lengthy stretch of time.
In terms of business value, a cron job running on a high-memory VPS will more than satisfy, and it takes much less time to develop.
I gotta disagree with that estimate. Virtually any time I have a backend service operating on (mostly) arbitrarily-sized user input, I use streaming so that I can make better guarantees about how much memory my service needs. This, in turn, lets you give your customers much higher service limits (unless you want to scale your fleet's memory just to handle 100th-percentile style use cases).
The number of times I've seen backend services fall over, with a heap graph that looks like a repeated sawtooth pattern climbing to OOM, because a customer's objects were unusually sized (but within limits)...
But you do have to be careful that you're not just overloading some other system (like consuming disk space with files that don't need to be retained). Keep good stats on all of your exhaustible resources, kids.
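As a minimal sketch of the bounded-memory point (Node/TypeScript; hashing just stands in for whatever per-chunk work your service does): fold each chunk into the result and drop it, and the heap high-water mark stays flat no matter how large the customer's input is.

    import { createHash } from "node:crypto";
    import { createReadStream } from "node:fs";

    // Hash an arbitrarily large file with bounded memory: each chunk
    // is folded into the digest and released, so peak usage is one
    // stream buffer, not the whole input.
    async function sha256(path: string): Promise<string> {
      const hash = createHash("sha256");
      for await (const chunk of createReadStream(path)) {
        hash.update(chunk);
      }
      return hash.digest("hex");
    }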
I... won't go so far as to say "I think", but "I have a pet theory" that part of the reason for this is actually effect rather than cause. That is, developers generally do not think in streaming, so they build libraries based on doing things non-streaming, which have libraries built on top of them that assume non-streaming, which have frameworks built on top of them that assume non-streaming, and so on, and the end result is that it's just way harder to get streaming working than it would be if more developers were comfortable with it.
The web world even more so, which for pretty much its entire run has been conceptualized by developers as returning chunks of content, even though the tech nominally had more streaming support than that, being (until recently) TCP sockets under the hood. Web developers even made it a virtue that once a chunk was emitted, all context was dropped on the floor. (I see this as less a virtue than an accidental way old CGI stuff worked that got raised into a requirement.)
Historically speaking, only the minimal things that needed to support streaming to work at all supported it. I am seeing a slow trend towards more streaming-thinking though, and it's getting easier to stream things.
This is an explanation of why I think the quoted text is true, not a disagreement. I think in a more perfect world it wouldn't be true, and I have hope that it won't be true in the medium-term future, but today it often is, depending on details of your local environment.
Yeah, but I hate it... I've worked with a team where we had a cron job do some batch processing every night, but for some large customers it started taking ~12-15 hours to complete, and certain important user operations are locked while it's running. The solution? Run it once per week starting on the weekend, with a manual trigger for customers who really need the results ASAP. Tiny effort for that easy fix, the team can continue working on new features, and customers of all sizes are still mostly happy, but Dijkstra would not have liked this...
I am still writing it up, but it looks like we are currently stuck in the call/return architectural style, or even paradigm. Meaning all our languages offer what are, in essence, variations of call/return, be they subroutines, procedures, functions, or methods.
However, a lot of the problems we need to solve or systems we want to build do not conform to this pattern. Probably the majority by now. When we have such a system, we have a choice to make, with two bad options on offer: either conform to the system/problem, and have something that constantly grates against the language/environment, or conform to call/return and grate against the problem you're trying to solve.
Streaming is an example of this. I presented Standard Object Out: Streaming Objects with Polymorphic Write Streams at DLS '19, which shows some of the nasty effects and the start of a solution.
I agree with this, but I feel that in most cases it's not necessary complexity. It comes from poor APIs that don't make streaming easy, or from a mismatch between push-oriented ("pass me each new chunk as it arrives") and pull-oriented ("give me a queue/file/iterator that will yield chunks").
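To make that mismatch concrete, here's a sketch with Node streams, which happen to expose both shapes; the friction shows up when a library hands you only one shape and the rest of your code wants the other.

    import { Readable } from "node:stream";

    // Push-oriented: the source invokes your callback per chunk.
    const pushSrc = Readable.from(["a", "b", "c"]);
    pushSrc.on("data", (chunk) => console.log("pushed:", chunk));

    // Pull-oriented: you ask for the next chunk when you're ready.
    async function drain(): Promise<void> {
      const pullSrc = Readable.from(["a", "b", "c"]);
      for await (const chunk of pullSrc) {
        console.log("pulled:", chunk);
      }
    }
    drain().catch(console.error);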
Each service in turn has to request, receive, parse, process, and emit the data. Bandwidth and CPU time turn into latency. Those can start to add up.
Assuming you can stream, doing so in this particular scenario will also improve the latency, not just throughput.
It includes skewness and kurtosis.
For our use case (messages never expire; consuming logic is complex and evolves every couple of months, necessitating a full replay; we're the only consumer; a few 100k messages per year), we've decided that it just isn't worth the trade-offs vs. storing temporal data in Postgres.
The place where streaming seems to be most useful for us is for piping data from one team to any other team. That way, no one needs to quibble over database permissions, we can just hand over the topic name and say “have fun”.
For internal workloads, the pain points are the ones where streaming is known to have faults: quick and easy queries on the data in the queue, ergonomics around schema evolution and topic offsets, exactly-once delivery with n>1 partitions and shards, ZooKeeper crapping out and dropping messages for god knows what reason.
These are all problems that Postgres solves for free.
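For what it's worth, the usual Postgres-as-queue sketch looks something like this (the messages table and its columns are hypothetical, and pg is the node-postgres client). SKIP LOCKED lets concurrent workers claim different rows without blocking each other, and since rows never expire, a full replay is just a plain SELECT over the same table.

    import { Pool } from "pg";

    const pool = new Pool(); // connection settings from PG* env vars

    // Claim the next unprocessed message atomically; concurrent
    // workers skip rows that another worker has already locked.
    async function claimNext() {
      const { rows } = await pool.query(
        `UPDATE messages
            SET processed_at = now()
          WHERE id = (
            SELECT id FROM messages
             WHERE processed_at IS NULL
             ORDER BY id
             FOR UPDATE SKIP LOCKED
             LIMIT 1)
          RETURNING id, payload`
      );
      return rows[0]; // undefined when nothing is pending
    }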
I think streaming makes sense between teams, or if your workload is significantly different from ours, but that’s just my two cents.
Kafka and friends are for sure useful tools, but they are not the be-all end-all they're sometimes made out to be by people who don't have to deal with them day in and day out.
For instance, I'm working on a file-based, CRDT-style, distributed journaling database (single-user but multi-device), where the "diffs" for a single device are stored in a journal. In retrospect it's obvious that I don't need to hold the entire journal of diffs in memory (that's the "streaming" architecture), but on my first couple of passes I stored everything in memory, because it was easier.
After playing around with some toy projects using my database, I realized this wouldn't scale (why load all of history into memory if you don't need it?), and moreover that my system lends itself better to a "file stream", i.e. don't store everything in memory, just append to existing files or read from them as needed. Seems obvious now, but when building it out, my first inclination wasn't to do it that way.
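Something like this shape, sketched in Node/TypeScript (the file name and newline-delimited JSON encoding are assumptions for illustration):

    import { createReadStream, createWriteStream } from "node:fs";
    import { createInterface } from "node:readline";

    // Append-only journal: one JSON-encoded diff per line.
    const journal = createWriteStream("device-a.journal", { flags: "a" });

    function appendDiff(diff: unknown): void {
      journal.write(JSON.stringify(diff) + "\n");
    }

    // Replay streams the file line by line; the full history never
    // has to sit in memory at once.
    async function replay(apply: (diff: unknown) => void): Promise<void> {
      const rl = createInterface({
        input: createReadStream("device-a.journal"),
        crlfDelay: Infinity,
      });
      for await (const line of rl) {
        apply(JSON.parse(line));
      }
    }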
And nodejs's site even has an article for it.
However, even the nodejs runtime itself already provides mechanisms to let you handle backpressure easily. Most people still don't, and their programs crash when the streaming target is stuck for whatever reason.
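The mechanism is small: write() returns false once the destination's buffer is full, and you wait for the 'drain' event before writing more. A sketch:

    import { once } from "node:events";
    import { createReadStream, createWriteStream } from "node:fs";

    // Copy a file while respecting backpressure: when write() returns
    // false we pause until 'drain', instead of queueing chunks in
    // memory until the process OOMs because the target is stuck.
    async function copy(src: string, dst: string): Promise<void> {
      const out = createWriteStream(dst);
      for await (const chunk of createReadStream(src)) {
        if (!out.write(chunk)) {
          await once(out, "drain");
        }
      }
      out.end();
      await once(out, "finish");
    }

(stream.pipeline() does this bookkeeping, plus error propagation, for you.)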
When I program in C#, I use streaming quite a lot, often combined with async/await for I/O. The framework also helps, e.g. all compression/encryption algorithms support asynchronous versions.
When I program in C++, I tend to avoid streaming at all costs. The ergonomics are just not there. It's technically doable, but it turns the code into callback hell, hard to debug and expensive to support.
Some data formats need a length written in the header. Or reading requires metadata stored after the data. Some algorithms need two passes, or stream over columns while the data arrives as rows.
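E.g. a length-prefixed frame can't be emitted naively as a stream: the payload has to be buffered (or the header patched in later, which needs a seekable sink) before the first header byte can be correct. A sketch, assuming a 4-byte big-endian length prefix:

    import { Buffer } from "node:buffer";

    // The header depends on the payload size, so the whole payload
    // must exist before the first byte of the frame can be written.
    function frame(payload: Buffer): Buffer {
      const header = Buffer.alloc(4);
      header.writeUInt32BE(payload.length, 0);
      return Buffer.concat([header, payload]);
    }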