1. How does this compare to other distributed streaming systems in terms of latency, throughput, fault tolerance, and message processing guarantees?
2. Why Pony? I’m not super familiar with the language, does it give you any major gains?
3. How does this handle shared aggregations - i.e. I have one stream updating a table constantly and another that reads from the table. This is useful as it decouples the two event streams, meaning my reading job can have better latency guarantees with the tradeoff of possibly having stale data. I didn't see how I can link together two independently deployed streaming states, unless the idea is that any stream using some state will be declared in one place.
4. Does this support transactional event processing, wherein my original HTTP call is returned with a success or failure, or does it just ack that the event was received, even if the event fails downstream of the source?
5. Are there any major, novel differences in this stream processing system versus the next that I’m not aware of? I didn’t see anything too unique in this post.
I work at Wallaroo Labs.
strmpnk gave good answers for 1 and 2. Thank you for that!
The one stream updates/one stream reads is detailed in the "MarketSpread" example in the post. I suspect the key to your question is "two independently deployed streaming states". Could you elaborate on what you mean by that? I think if I understood that better, I could give a good answer to #3.
4. Currently all pipelines have a source and a sink. In your HTTP call example, that would most likely be a combination source/sink, where we want both the input and some output result to flow back out. That currently isn't supported in Wallaroo, although we have had discussions about what that might mean. You could do transactional processing where you receive an event and then reply, but at the moment the reply would need to go out over the sink, which is a different channel from the source. I'd be really interested in working with folks who could benefit from a combination source/sink approach. We prefer to add additional functionality after working through use cases with interested parties. We find that leads to better results than designing in a vacuum.
5. Everyone has a different definition of novel. In my view, most streaming systems have left the handling of application state in the hands of the programmer. You can keep it in memory for speed, but then you have to manage the resilience of that in-memory state. Wallaroo makes that in-memory state a first class citizen. It's not detailed in this post, but Wallaroo can provide resilience for your state so that if there is a failure we can bring that state back. Currently that involves using a write-ahead log that is stored on the local filesystem. We are working on replicating that log to other nodes in the cluster to provide additional resilience.
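Wallaroo's actual log format isn't described here, but the general write-ahead-log idea — persist each state update before applying it in memory, then replay the log on restart to recover — can be sketched roughly like this. All names are illustrative, not Wallaroo's API:

```python
import json
import os

class ResilientState:
    """Toy in-memory state backed by a write-ahead log (illustrative only)."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.state = {}
        self._replay()  # recover any state persisted before a crash

    def _replay(self):
        """Rebuild in-memory state by replaying the log from disk."""
        if not os.path.exists(self.log_path):
            return
        with open(self.log_path) as f:
            for line in f:
                entry = json.loads(line)
                self.state[entry["key"]] = entry["value"]

    def update(self, key, value):
        # Persist the update first, then apply it in memory.
        with open(self.log_path, "a") as f:
            f.write(json.dumps({"key": key, "value": value}) + "\n")
            f.flush()
            os.fsync(f.fileno())
        self.state[key] = value

    def get(self, key):
        return self.state.get(key)
```

After a crash, constructing a new `ResilientState` against the same log file restores every update that made it to disk, which is the property the in-cluster replication work would extend across nodes.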
If I were to highlight one feature to answer your question about novelty, that would be it. We have a lot more content on the blog that might answer your question better. The post in question is very much a "how do I do X" sort of post rather than a "Why Wallaroo?" (of which there are a couple on the blog).
3. One of the main use cases for using external databases and stores is that they can be multi-tenant. Bob comes up with a really cool market aggregation; Susan and Ralph want to use that aggregation in their models. Currently, is there a way for Susan and Ralph to just hook into that aggregation in memory and use it as a feature?
4. The main use case is that your streaming framework makes a bunch of cool aggregates, but if you have a customer who needs the question "Should I let this person do this thing?" answered, you want to answer it synchronously. The alternative is to load your state into an external store and set up a web server that just does a lookup on that store, instead of doing it in your streaming system.
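The alternative described above — the streaming job folds events into aggregates in an external store, and a separate synchronous service answers questions from it — looks roughly like this. All names are hypothetical, and a plain dict stands in for a real store such as Redis:

```python
# A plain dict standing in for an external store (e.g. Redis).
external_store = {}

def on_event(event):
    """Streaming side: fold each event into a per-user aggregate."""
    user = event["user"]
    agg = external_store.get(user, {"count": 0, "total": 0.0})
    agg["count"] += 1
    agg["total"] += event["amount"]
    external_store[user] = agg

def should_allow(user, amount, limit=1000.0):
    """Lookup side: answer "should I let this person do this thing?"
    synchronously against the latest aggregate in the store."""
    agg = external_store.get(user, {"count": 0, "total": 0.0})
    return agg["total"] + amount <= limit
```

The tradeoff is the same one mentioned in question 3: the lookup side is decoupled and fast, but it may see slightly stale aggregates relative to the event stream.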
4. we've worked with folks who want to do what you are describing but the "answer channel" was different than the "question channel". so the question came in over a source (in their case TCP) and the answer would leave over a sink (also TCP). in that case it was something like:
client -> persistent web socket -> server -> streaming system -> server -> persistent web socket -> client
having a combined source/sink is something that we've discussed but we have higher priorities at the moment. if we were to start working with someone for whom it was a priority, we'd move it up the list.
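Because the answer leaves over a different channel than the question arrived on, the server sitting in front of the streaming system has to correlate the two itself — typically by tagging each question with an ID that flows through the pipeline and matching replies back to waiting clients. A minimal sketch of that matching, with all names hypothetical:

```python
import queue
import uuid

# Pending questions, keyed by correlation ID. In the web-socket setup
# sketched above, the server holds one entry per in-flight question.
pending = {}

def send_question(payload):
    """Question channel (source side): tag the payload with an ID and
    remember where the eventual answer should be delivered."""
    corr_id = str(uuid.uuid4())
    reply_box = queue.Queue(maxsize=1)
    pending[corr_id] = reply_box
    # ... here the tagged payload would be written to the TCP source ...
    return corr_id, reply_box

def on_answer(corr_id, answer):
    """Answer channel (sink side): route the answer back to whoever
    asked, using the correlation ID carried through the pipeline."""
    reply_box = pending.pop(corr_id, None)
    if reply_box is not None:
        reply_box.put(answer)
```

A combined source/sink would fold this bookkeeping into the framework; until then it lives in the server between the client and the streaming system.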
1: https://blog.wallaroolabs.com/2017/03/hello-wallaroo/ has some figures around latencies they're working with, and https://blog.wallaroolabs.com/2017/10/measuring-correctness-... talks about how they test their system under faults.
For the rest, I'd better leave it to them to give the most up-to-date answers, as the project is moving fast.
I had never heard of Wallaroo before your post, but I'm quite impressed. If this works, it can solve a lot of problems. I like the price tiering, and it's great to see a NYC company amidst a sea of Valley ones.
Designing Data-Intensive Applications by Kleppmann has been helpful. It doesn't cover every framework, but I think it helps explain where a lot of the pieces fit together and when you might want to use some of them.
I've also found it useful to find podcasts that explain specific projects, such as Apache Kafka, and listen to them when I'm running.