
Stateful Multi-Stream Processing in Python with Wallaroo - jtfmumm
https://blog.wallaroolabs.com/2017/12/stateful-multi-stream-processing-in-python-with-wallaroo/
======
alextheparrot
Couple questions:

1\. How does this compare to other distributed streaming systems in terms of
latency, throughput, fault tolerance, and message processing guarantees?

2\. Why Pony? I’m not super familiar with the language, does it give you any
major gains?

3\. How does this handle shared aggregations - i.e. I have one stream updating
a table constantly and another that reads from the table. This is useful as it
decouples the two event streams, meaning my reading job can have better
latency guarantees with the tradeoff of possibly having stale data. I didn’t
see how I can link together two independently deployed streaming states,
unless the idea is any stream using some state will be declared in one place.

4\. Does this support transactional event processing, wherein my original
HTTP call is returned with a success or failure, or does it just ack that the
event was received, even if the event fails downstream of the source?

5\. Are there any major, novel differences in this stream processing system
versus the next that I’m not aware of? I didn’t see anything too unique in
this post.

~~~
spooneybarger
hi!

I work at Wallaroo Labs.

strmpnk had good content for answers for 1 and 2. thank you for that!

re: 3.

The one stream updates/one stream reads is detailed in the "MarketSpread"
example in the post. I suspect the key to your question is "two independently
deployed streaming states". Could you elaborate on what you mean by that? I
think if I understood that better, I could give a good answer to #3.

4\. Currently all pipelines have a source and a sink. In your HTTP call
example, that would most likely be a combination source/sink where we want to
have both the input and some output result flow back out. That currently isn't
supported in Wallaroo although we have had discussions about what that might
mean. You could do transactional processing where you receive an event and
then reply but at the moment, the reply would need to go out over the sink
which is a different channel than the source. I'd be really interested in
working with folks who could benefit from a combination source/sink approach.
We prefer to add additional functionality after working through use cases with
interested parties. We find that leads to better results than our designing in
a vacuum.

5\. Everyone has a different definition of novel. In my view, most streaming
systems have left the handling of application state in the hands of the
programmer. You can keep it in memory for speed but then you have to manage
the resilience of that in-memory state. Wallaroo makes that in-memory state a
first class citizen. It's not detailed in this post, but Wallaroo can provide
resilience for your state so that if there is a failure we can bring that state
back. Currently that involves using a write ahead log that is stored on the
local filesystem. We are working on replicating that log to other nodes in the
cluster to provide additional resilience.
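
To make the write-ahead-log idea above concrete, here is a minimal sketch in plain Python. It is not Wallaroo's implementation or API: the `WalBackedState` class, the JSON-lines log format, and the `demo` helper are all assumptions made up for illustration. It only shows the general technique of logging each update to the local filesystem before applying it in memory, then replaying the log to rebuild state after a failure.

```python
import json
import os
import tempfile


class WalBackedState:
    """Hypothetical in-memory key/value state backed by a local write-ahead log."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.state = {}
        self._replay()  # rebuild in-memory state from any existing log
        self._log = open(log_path, "a", encoding="utf-8")

    def _replay(self):
        # Re-apply every logged update in order; later entries win.
        if not os.path.exists(self.log_path):
            return
        with open(self.log_path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line:
                    entry = json.loads(line)
                    self.state[entry["key"]] = entry["value"]

    def update(self, key, value):
        # Write and sync the log entry first, then mutate memory, so an
        # update that was applied in memory is always recoverable from disk.
        self._log.write(json.dumps({"key": key, "value": value}) + "\n")
        self._log.flush()
        os.fsync(self._log.fileno())
        self.state[key] = value

    def close(self):
        self._log.close()


def demo():
    path = os.path.join(tempfile.mkdtemp(), "state.wal")

    first = WalBackedState(path)
    first.update("AAPL", {"bid": 170.1, "ask": 170.3})
    first.update("AAPL", {"bid": 170.2, "ask": 170.4})
    first.close()  # simulate a crash: the in-memory dict is gone

    recovered = WalBackedState(path)  # replays the log on startup
    result = dict(recovered.state)
    recovered.close()
    return result
```

Calling `demo()` returns the last value written for each key. A real system would also need log compaction and handling for a partially written final entry, which this sketch leaves out.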

If I were to highlight one feature to answer your question about novelty, that
would be it. We have a lot more content on the blog that might answer your
question better. The post in question is very much a "how do I do X" sort of
post rather than a "Why Wallaroo?" (of which there are a couple on the blog).

~~~
alextheparrot
Hey, thanks for the comment. A few clarifications you asked about:

3\. One of the main use cases for using external databases and stores is that
they can be multi tenant. Bob comes up with a really cool market aggregation,
Susan and Ralph want to use that aggregation in their models. Currently, is
there a way for Susan and Ralph to just hook into that aggregation in memory
and allow them to use that as a feature?

4\. The main use case is that your streaming framework makes a bunch of cool
aggregates, but if you have a customer who needs an answer to the question
“Should I let this person do this thing?”, you want to answer that
synchronously. The alternative is to load your state into an external store
and set up a web
server that just does a lookup on that store, instead of doing it in your
streaming system.

~~~
spooneybarger
3\. currently, there's no ad hoc way to query in-memory data. wallaroo, while
having a couple of database-like features, isn't a database. you'd need to
export that aggregation or the results thereof via a sink into some other
system for ad hoc querying. we've had discussions about providing ad hoc
querying but it isn't on the immediate radar.

4\. we've worked with folks who want to do what you are describing but the
"answer channel" was different than the "question channel". so the question
came in over a source (in their case TCP) and the answer would leave over a
sink (also TCP). in that case it was something like:

client -> persistent web socket -> server -> streaming system -> server ->
persistent web socket -> client

having a combined source/sink is something that we've discussed but we have
higher priorities at the moment. if we were to start working with someone for
whom it was a priority, we'd move it up in priority.
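
The separate "question channel" and "answer channel" described above can be sketched with a request id that ties each answer back to its question. This is only an illustration: `asyncio` queues stand in for the TCP source and sink, and `streaming_system`, `Server`, and the `amount <= 100` rule are hypothetical names and logic, not part of Wallaroo.

```python
import asyncio
import itertools


async def streaming_system(source, sink):
    """Stand-in for the stream processor: reads questions, emits answers."""
    while True:
        msg = await source.get()
        if msg is None:  # shutdown marker
            break
        request_id, amount = msg
        # Hypothetical decision rule; a real job would consult its state.
        await sink.put((request_id, amount <= 100))


class Server:
    """Sits between clients and the streaming system, matching answers to questions."""

    def __init__(self, source, sink):
        self.source = source
        self.sink = sink
        self.pending = {}  # request_id -> future awaiting the answer
        self.ids = itertools.count()

    async def ask(self, amount):
        request_id = next(self.ids)
        future = asyncio.get_running_loop().create_future()
        self.pending[request_id] = future
        await self.source.put((request_id, amount))  # question goes out over the source
        return await future  # resolved when the answer arrives over the sink

    async def read_answers(self):
        while True:
            msg = await self.sink.get()
            if msg is None:  # shutdown marker
                break
            request_id, allowed = msg
            self.pending.pop(request_id).set_result(allowed)


async def main():
    source, sink = asyncio.Queue(), asyncio.Queue()
    server = Server(source, sink)
    worker = asyncio.create_task(streaming_system(source, sink))
    reader = asyncio.create_task(server.read_answers())

    answers = await asyncio.gather(server.ask(50), server.ask(500))

    await source.put(None)
    await sink.put(None)
    await asyncio.gather(worker, reader)
    return answers
```

Running `asyncio.run(main())` gives the caller a synchronous-looking answer for each question even though the reply travels over a different channel than the request.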

------
jtfmumm
I'm the author of this post and I'm happy to answer questions here.

~~~
TuringNYC
Thank you for the informative blog post. This is the type of well-written,
detailed technical post with a solid use case (rather than vague sales-ey
ones) that pushes me to give new software a chance.

I had never heard of Wallaroo before your post, but I'm quite impressed. If
this works, it can solve a lot of problems. I like the price tiering, and it's
great to see an NYC company amidst a sea of Valley ones.

~~~
jtfmumm
Thanks! That's great to hear.

------
stocktech
As someone just getting into stream processing, does anyone have any resources
for comparing frameworks/systems? When Apache has multiple projects that sound
like they do the same thing, I don't even know where to start.

~~~
spooneybarger
It's a very confusing space to start getting into. I'd be happy to step
outside of my role as a principal at Wallaroo Labs and have an email
conversation to discuss what your use cases are and what tools you should
consider looking at.

sean@wallaroolabs.com

