
I've run into plenty of situations where a streaming approach would be faster. The complexity of it always necessitates making a slower conventional version. (Wait for all the data to load into memory and then operate on it.) The conventional approach is easier to debug and get working. 90% of the time, the gains from streaming aren't worth the added effort.

Generally you only really get value from it when you're processing a huge data set continuously, or modifying data as it's being sent to the user over a websocket for a fairly lengthy bit of time.

In terms of business value, a cron job running in a high memory vps will more than satisfy and take much less time to develop.




> 90% of the time, the gains from streaming aren't worth the added effort.

I gotta disagree with that estimate. Virtually any time I have a backend service operating on (mostly) arbitrarily-sized user input, I use streaming so that I can make better guarantees about how much memory my service needs. This, in turn, lets you give your customers much higher service limits (unless you want to scale your fleet's memory just to handle 100th-percentile style use cases).
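To make the memory guarantee concrete, here is a minimal sketch of the idea, assuming the input arrives as a file-like binary stream. The function name `digest_stream`, the chunk size, and the `max_bytes` limit are all made up for illustration. The point is that peak memory is roughly one chunk no matter how large the customer's object is, and an oversized input gets rejected as soon as it crosses the limit instead of after it has been buffered.

```python
import hashlib
import io

CHUNK_SIZE = 64 * 1024  # hypothetical chunk size; tune for your workload


def digest_stream(stream, max_bytes, chunk_size=CHUNK_SIZE):
    """Hash a binary stream while holding at most one chunk in memory.

    Raises ValueError as soon as the input exceeds max_bytes, without
    buffering the whole payload first.
    """
    hasher = hashlib.sha256()
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # empty bytes means end of stream
            break
        total += len(chunk)
        if total > max_bytes:
            raise ValueError("input exceeds service limit")
        hasher.update(chunk)
    return hasher.hexdigest(), total


if __name__ == "__main__":
    payload = io.BytesIO(b"x" * 200_000)
    print(digest_stream(payload, max_bytes=1_000_000))
```

The non-streaming equivalent would be `hashlib.sha256(stream.read())`, which is shorter but makes memory use proportional to whatever the customer uploads.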

The number of times I've seen backend services fall over, with a heap graph that looks like a repeated sawtooth pattern up to OOM, because a customer's objects were unusually sized (but within limits)...


Yeah, this is an important accidental DoS vector, and streaming APIs are a classic way to fix it.

But you do have to be careful that you're not just overloading some other system (like consuming disk space with files that don't need to be retained). Keep good stats on all of your exhaustible resources, kids.


"90% of the time, the gains from streaming aren't worth the added effort."

I... won't go so far as to say "I think", but "I have a pet theory" that part of the reason for this is actually effect rather than cause. That is, developers generally do not think in streaming, so they build libraries that are based on doing things non-streaming, which have libraries built on top of them that assume non-streaming, which have frameworks built on top of them that assume non-streaming, etc. etc. and so on, and the end result is that it's just way harder to get streaming working than it would be if more developers were comfortable with it.

The web world even more so, which for pretty much its entire run has been conceptualized by developers as returning chunks of content, even though the tech nominally had more streaming support than that, being (until recently) TCP sockets under the hood. Web developers even made it a virtue that once a chunk was emitted, all context was dropped on the floor. (I see this as less a virtue than an accidental way old CGI stuff worked that got raised into a requirement.)

Historically speaking, only the minimal things that needed to support streaming to work at all supported it. I am seeing a slow trend towards more streaming-thinking though, and it's getting easier to stream things.

This is an explanation of why I think the quoted text is true, not a disagreement. I think in a more perfect world it wouldn't be true, and I have hope that it won't be true in the medium-term future, but today it often is, depending on details of your local environment.


> In terms of business value, a cron job running in a high memory vps will more than satisfy and take much less time to develop.

Yeah, but I hate it... I've worked with a team where we had a cron job do some batch processing every night, but for some large customers it started taking ~12-15 hours to complete, and certain important user operations were locked while it was running. The solution? Running once per week starting on the weekend, with a manual trigger for customers who really need the results ASAP. Tiny effort for that easy fix and the team can continue working on new features, customers of all sizes are still mostly happy, but Dijkstra would not have liked this...


I would say this is the 5% of the time where it suddenly becomes worth it to streamify. You'll already have code showing what transformations need to be done. The point is to avoid premature optimization.


How long until the cron job takes 70 hours to finish?


Hah! One of my favourite topics: The Gentle Tyranny of Call/Return.

I am still writing it up, but it looks like we are currently stuck in the call/return architectural style, or even paradigm. Meaning all our languages offer what are in essence variations of call/return, be they subroutines, procedures, functions, or methods.

However, a lot of the problems we need to solve or systems we want to build do not conform to this pattern. Probably the majority by now. When we have such a system, we have a choice to make, with two bad options on offer: either conform to the system/problem, therefore having something that constantly grates against the language/environment, or conform to call/return and grate against the problem you're trying to solve.

Streaming is an example of this. I presented Standard Object Out: Streaming Objects with Polymorphic Write Streams at DLS '19, which shows some of the nasty effects and the start of a solution.

  https://conf.researchr.org/details/dls-2019/dls-2019/7/Standard-Object-Out-Streaming-Objects-with-Polymorphic-Write-Streams

I also addressed this problem more generally at last summer's ESUG '19, talking about Objective-Smalltalk:

  https://www.youtube.com/watch?v=vrD3TrVuiV0&list=PLJ5nSnWzQXi8DPNpy1jCkjE4yE0WUtDP2&index=53

Objective-Smalltalk makes it possible to express systems (I hesitate to even call them programs) in non-call/return styles (such as dataflow/streaming) as naturally as call/return systems and without giving up interoperability with the (large) call/return world.


> The complexity of it always necessitates making a slower conventional version.

I agree with this, but I feel that in most cases it's not necessary complexity. It comes from poor APIs that don't make streaming easy, or from a mismatch between push-oriented ("pass me each new chunk as it arrives") and pull-oriented ("give me a queue/file/iterator that will yield chunks") interfaces.
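For what it's worth, here is a rough sketch of bridging the two styles, with every name made up for illustration. Going from pull to push is a plain loop. Going from push to pull is the awkward direction: this version assumes the producer is a function that takes a callback and calls it once per chunk, runs it in a background thread, and uses a bounded queue so the producer can't run far ahead of the consumer.

```python
import queue
import threading

_DONE = object()  # sentinel marking end of stream


def pull_to_push(chunks, on_chunk):
    """Drive a pull-style iterable into a push-style callback."""
    for chunk in chunks:
        on_chunk(chunk)


def push_to_pull(producer, maxsize=8):
    """Turn a push-style producer into a pull-style iterator.

    `producer` is a hypothetical function that accepts a callback and
    calls it once per chunk. It runs in a background thread; the bounded
    queue blocks the producer when the consumer falls behind.
    """
    q = queue.Queue(maxsize=maxsize)

    def run():
        try:
            producer(q.put)
        finally:
            q.put(_DONE)  # always signal the end, even on error

    threading.Thread(target=run, daemon=True).start()
    while True:
        item = q.get()
        if item is _DONE:
            return
        yield item


if __name__ == "__main__":
    def numbers(emit):
        for i in range(5):
            emit(i)

    for value in push_to_pull(numbers):
        print(value)
```

This is deliberately incomplete: an exception inside the producer just ends the stream early, and a consumer that stops iterating leaves the producer thread blocked on a full queue. Handling those properly is exactly the kind of complexity a better API would hide.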


Processing time becomes a problem as well when the depth of the call tree across process boundaries starts to climb.

Each service in turn has to request, receive, parse, process, and emit the data. Bandwidth and CPU time turn into latency. Those can start to add up.

Assuming you can stream, doing so in this particular scenario will also improve the latency, not just throughput.
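A back-of-the-envelope model of that, under heavily simplifying assumptions: every service takes the same time per chunk, there is no network overhead, and the function names are made up. If each service waits for the whole payload, latencies add up across the chain; if each service forwards a chunk as soon as it has processed it, the chain behaves like a pipeline.

```python
def store_and_forward_latency(stages, chunks, per_chunk):
    """Each stage waits for the full payload before it starts working."""
    return stages * chunks * per_chunk


def streamed_latency(stages, chunks, per_chunk):
    """Each stage forwards a chunk as soon as it has processed it.

    The first chunk needs `stages` steps to get through the chain, and
    each remaining chunk finishes one step after the previous one.
    """
    return (stages + chunks - 1) * per_chunk


if __name__ == "__main__":
    # Hypothetical numbers: 4 services deep, 100 chunks, 10 ms per chunk.
    print(store_and_forward_latency(4, 100, 10))  # 4000 (ms)
    print(streamed_latency(4, 100, 10))           # 1030 (ms)
```

Real systems won't hit these numbers, but the shape holds: the deeper the call tree, the more the non-streaming version pays.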


Most programming languages support generator semantics, letting you do development monolithically while still allowing for easy refactoring into streaming components. You simply have to include the goal of using generator semantics in the design up front.
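A small sketch of what that looks like in Python, with made-up stage names. Each stage takes an iterable and yields results, so the same code works on an in-memory list today and on a file, socket, or queue later without changes to the stages themselves.

```python
def parse(lines):
    """Yield one integer per non-blank line."""
    for line in lines:
        line = line.strip()
        if line:
            yield int(line)


def only_even(numbers):
    """Pass through even numbers only."""
    for n in numbers:
        if n % 2 == 0:
            yield n


def running_total(numbers):
    """Yield the cumulative sum after each input."""
    total = 0
    for n in numbers:
        total += n
        yield total


def pipeline(lines):
    # Nothing runs until the caller starts iterating; only one item
    # is in flight per stage at any time.
    return running_total(only_even(parse(lines)))


if __name__ == "__main__":
    # A list stands in for a real source such as an open file.
    for value in pipeline(["1\n", "2\n", "\n", "4\n"]):
        print(value)
```

Calling `list(pipeline(...))` gives you the conventional everything-in-memory behavior for debugging, which is most of what makes this style cheap to adopt up front.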



