Hacker News new | comments | show | ask | jobs | submit login
Stateful Multi-Stream Processing in Python with Wallaroo (wallaroolabs.com)
59 points by jtfmumm 10 months ago | hide | past | web | favorite | 13 comments

Couple questions:

1. How does this compare to other distributed streaming systems in terms of latency, throughput, fault tolerance, and message processing guarantees?

2. Why Pony? I’m not super familiar with the language, does it give you any major gains?

3. How does this handle shared aggregations - i.e. I have one stream updating a table constantly and another that reads from the table. This is useful as it decouples the two event streams, meaning my reading job can have better latency guarantees with the tradeoff of possible having stale data. I didn’t see how I can link together two independently deployed streaming states, unless the idea is any stream using some state will be declared in one place.

4. Does this support transactional event processing, wherein my originals HTTP call is returned with a success or failure, or does it just ack the event was received, even if the event fails downstream of the source?

5. Are there any major, novel differences in this stream processing system versus the next that I’m not aware of? I didn’t see anything too unique in this post.


I work at Wallaroo Labs.

strmpnk had good content for answers for 1 and 2. thank you for that!

re: 3.

The one stream updates/one stream reads is detailed in the "MarketSpread" example in the post. I suspect the key to your question is "two independently deployed streaming states". Could you elaborate on what you mean by that? I think if I understood that better, I could give a good answer to #3.

4. Currently all pipelines have a source and a sink. In your HTTP call example, that would most likely be a combination source/sink where we want to have both the input and some output result flow back out. That currently isn't supported in Wallaroo although we have had discussions about what that might mean. You could do transactional processing where you receive an event and then reply but at the moment, the reply would need to go out over the sink which is a different channel than the source. I'd be really interested in working with folks who could benefit from a combination source/sink approach. We prefer to add additional functionality after working through use cases with interested parties. We find that leads to better results than our designing in a vacuum.

5. Everyone has a different definition of novel. In my view, most streaming systems have left the handling of application state in the hands of the programmer. You can keep it in memory for speed but then you have to manage the resilience of that in-memory state. Wallaroo makes that in-memory state a first class citizen. It's not detailed in this post, but Wallaroo can provide resilience for you state so that if there is a failure we can bring that state back. Currently that involves using a write ahead log that is stored on the local filesystem. We are working on replicating that log to other nodes in the cluster to provide additional resilience.

If i was to hightlight one feature to answer your question about novelty, that would be it. We have a lot more content on the blog that might answer your question better. The post in question is very much a "how do I do X" sort of post rather than a "Why Wallaroo?" (of which there are a couple on the blog).

Hey, thanks for the comment. A few clarifications you asked about:

3. One of the main use cases for using external databases and stores is that they can be multi tenant. Bob comes up with a really cool market aggregation, Susan and Ralph want to use that aggregation in their models. Currently, is there a way for Susan and Ralph to just hook into that aggregation in memory and allow them to use that as a feature?

4. The main use case is that your streaming framework makes a bunch of cool aggregates, but if you have a customer who needs the question “Should I let this person do this thing?”, you want to answer that synchronously. The alternative is to load your state into an external store and setup a web server that just does a lookup on that store, instead of doing it in your streaming system.

3. currently, there's no adhoc way to query in-memory data. wallaroo while having a couple database like features isn't a database. you'd need to export that aggregation or the results thereof via a sink into some other system for adhoc querying. we've had discussions about providing adhoc querying but it isn't on the immediate radar.

4. we've worked with folks who want to do what you are describing but the "answer channel" was different than the "question channel". so the question came in over a source (in their case TCP) and the answer would leave over a sink (also TCP). in that case it was something like:

client -> persistent web socket -> server -> streaming system -> server -> persistent web socket -> client

having a combined source/sink is something that we've discussed but we have higher priorities at the moment. if we were to start working with someone for whom it was a priority, we'd move it up in priority.

What needs to be enabled in ublock/umatrix to actually complete the survey? I got to the very bottom but clicking submit does nothing and I didn't feel like debugging anything deeper. I rarely do surveys but the shirt looks cool.

if you are running NoScript, I'd suggest temporarily allowing the page. its hosted by typeform that in turn loads most of its content from various cloudfront urls.

you'll definitely need javascript enabled.

While I'm not affiliated with Wallaroo Labs, I've been following their work. One of the best things to check out is their nice collection of blog articles.

1: https://blog.wallaroolabs.com/2017/03/hello-wallaroo/ has some figures around latencies their working with and https://blog.wallaroolabs.com/2017/10/measuring-correctness-... talks about how they test their system under faults

2: https://blog.wallaroolabs.com/2017/10/why-we-used-pony-to-wr...

For the rest, I better leave it to them to give the most up to date answers on those as the project is moving fast.

I'm the author of this post and I'm happy to answer questions here.

Thank you for the informative blog post, this is the type of well-written detailed technical post with solid use case (rather than vague sales-ey ones) that pushes me to give a chance to new software.

I had never heard of Wallaroo before your post, but quite impressed. If this works, this can solve a lot of problems. I like the price tiering and great to see a NYC company amidst a sea of Valley ones.

Thanks! That's great to hear.

As someone just getting into stream processing, does anyone have any resources for comparing frameworks/systems? When Apache has multiple projects that sound like they do the same thing, I don't even know where to start.

It's a very confusing space to start getting into. I'd be happy to step outside of my role as a principal at Wallaroo Labs and have an email conversation to discuss what your use cases are and what tools you should consider looking at.


I've been trying to learn about all of this stuff over the last couple of weeks, and agree that it isn't obvious.

Designing Data-Intensive Applications by Kleppmann has been helpful. It doesn't cover every framework, but I think it helps explain where a lot of the pieces fit together and when you might want to use some of them.

I've also found it useful to find podcasts that explain specific projects, such as Apache Kafka, and listen to them when I'm running.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact