Go Python, Go: Stream Processing for Python (wallaroolabs.com)
251 points by spooneybarger 10 days ago | 66 comments

What about the use of Pony? I did not hear about it before but I do like some of the ideas of the language I read on their website. However, they also state "Pony is pre-1.0. We regularly have releases that involve breaking changes. This lack of stability is plenty of reason for many projects to avoid using Pony." How are you going to deal with that?

I work at Wallaroo Labs and am also on the Pony core team. A number of engineers at Wallaroo Labs are committers to Pony. We are active members of the Pony community and a large driver of many of those breaking changes to Pony. At least 3/4 of the breaking changes to Pony over the last 9 months came from us at Wallaroo Labs. From our side it's not particularly problematic to make the occasional update.

"At least 3/4 of the breaking changes to Pony over the last 9 months came from us at Wallaroo Labs." -> And you're proud of yourselves ?!?

(Just kidding of course).

I'm going to be writing a blog post on this, but you inspired me to touch on it a bit.

When we went looking at what we should use to implement Wallaroo, one of the things that appealed to us about Pony was that it was a high-performance, actor-based runtime that we could help mold. We gave consideration to writing our own implementation in C, but we'd still be working on that rather than talking here on HN now if we had.

Pony has gotten us quite a bit. The runtime was used for a number of production projects at one of the major banks, so we knew that while there might be some bugs (what code doesn't have them?), we could use it to jumpstart our process.

I gave a talk about Pony and its relation to Wallaroo where I put out the figure of 18 months. I think that as a wild ass guess, using the Pony runtime saved us about 18 months of work to build the foundation we want/need for Wallaroo.

Speaking as a member of the Wallaroo Labs team, it's really nice to have a community based project that you can help mold and grow with. It's been a boon to us as a small development shop in a way that either writing from scratch ourselves or using an existing more widely used runtime wouldn't be.

Speaking as a Pony core team member, I encourage other companies who think they could benefit from a high-performance actor based runtime to have a look at Pony. You could have a large hand in shaping it into the runtime that you need.

There are quite a few interesting ideas floating around on Quora about Pony[1]. Unfortunately, after the demise of Causality, where the original authors worked, the language didn't seem to go anywhere.

[1]: https://www.quora.com/What-is-your-experience-with-using-Pon...

Speaking as a member of the Pony community, the Pony community has done quite well since Causality ceased to exist. I addressed this on the Pony mailing list a while back:


I'd also like to note that the statement in that post that Sylvan is no longer involved in the Pony community is very off base. He's a founding member of the Pony core team and still actively involved.

Hi, I'm Sylvan Clebsch, the designer of Pony. I'm very much involved in Pony, as are most of the Causality founders in various ways. It's unfortunate that this kind of misinformation keeps being passed around.

Good to know, Sylvan! Honest suggestion: you should debunk that misinformation on Quora, which has a bigger audience than HN.

While the comments about being pythonic are valid, I think this looks fantastic and I'll take it for a spin around the block. Are you using this in your production environment already?

In regards to the 'name' being a function - a class attribute might indeed be more correct, but a function allows for dynamic computation names where it's applicable.

Thank you harel for your kind comment. We aren't using it in our production environment as we don't have one. We are the builders of Wallaroo. We've done a number of POCs with various clients and are moving towards production deployments with them.

Re the API: if you want to take part in helping shape the next version of the API, please reach out to me personally (via the email address in my HN profile) or via:

* #wallaroo on freenode

* our user mailing list: https://groups.io/g/wallaroo

We did the initial version of the API based on feedback from a couple clients we were working with. We're actively looking for feedback from a larger group of folks as we move forward.

If you did POCs, did they turn into prod? If not, why not? "Do people use it in prod?" is what I think is meant.

Not yet but it's looking like the first couple will happen in Q1 or Q2 next year.

The "why not" varies from client to client. None of them stalled because we failed the POC. Most so far have been that it works well, and now it's a matter of slotting it into the roadmap at those companies.

> In regards to the 'name' being a function - a class attribute might indeed be more correct, but a function allows for dynamic computation names where it's applicable.


Accessing a simple attribute on the instance means it can be defined as any of a class attribute (instance will delegate to class), an instance attribute (computed/set in init) or a property (computed every time it's accessed), which is convenient, especially when the primary use case will be setting the name on the class directly.

In fact, you could even fall back on the class name.
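To make that concrete, here is a minimal sketch. The `Computation` base class, the subclasses and the `describe` helper are hypothetical stand-ins, not Wallaroo's actual API. The only point is that a framework reading `obj.name` works with all three ways of defining the name, and that a base class can fall back on the class name.

```python
class Computation(object):
    """Hypothetical base class; not part of Wallaroo's real API."""

    @property
    def name(self):
        # Fallback: use the class name when a subclass defines nothing.
        return type(self).__name__


class SplitWords(Computation):
    # 1. Plain class attribute: the common, declarative case.
    name = "split into words"


class CountWords(Computation):
    # 2. Instance attribute set in __init__. The class-level placeholder
    #    shadows the read-only base property so that assignment is allowed.
    name = None

    def __init__(self, suffix):
        self.name = "count words (%s)" % suffix


class Windowed(Computation):
    # 3. Property: recomputed on every access.
    def __init__(self, seconds):
        self.seconds = seconds

    @property
    def name(self):
        return "window of %ds" % self.seconds


class Unnamed(Computation):
    # Defines no name at all, so the base-class fallback applies.
    pass


def describe(computation):
    # The caller only ever performs plain attribute access.
    return computation.name
```

With this shape, `describe(SplitWords())` reads the class attribute, while `describe(Unnamed())` falls through to the base property and returns the class name.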

So use @property then?

    __SOME_VAR = "foo"

    @property
    def some_var(self):
        return self.__SOME_VAR


Yes, just like that. Check out this fantastic talk on being pythonic: https://www.youtube.com/watch?v=wf-BqAjZb8M


I like the absence of Java-based tools. But IMO the word count example is really verbose: https://github.com/WallarooLabs/wallaroo/blob/0.1.2/examples...

With Apache Spark, the Word count example is a lot shorter: https://github.com/apache/spark/blob/master/examples/src/mai...

The word count example has... 8 classes. To count words.

Proof that you can write Java code in any language. https://steve-yegge.blogspot.de/2006/03/execution-in-kingdom... comes to mind when reading the snippets.

We are in the process of working with folks on a more "pythonic" API. Our first goal was to get something that would be easy to use out the door then get feedback on ideas we have for something more idiomatic to the language. It's an approach we plan to take when we add support for other languages such as Go and JavaScript over the next few months.

We are looking for folks to help drive the next version of the API. The first one was done with feedback from a couple of clients who were interested in using Python and in many ways reflects their tastes. Feedback from a wider range of Python users is something we are actively soliciting at this point.

disclosure: I work at Wallaroo Labs, creators of Wallaroo.

I'll be in touch. Best email please?


So python is the new ruby now.

Whatever that even means.

(Besides, Python has been popular longer than Ruby -- which only got big along with Rails' emergence).

I'm surprised you could even compare Python's current popularity (obviously due to machine learning libraries) with its past popularity. Especially with all the comments you've been churning out on HN.

I've used Python professionally since 1999.

While Python found an extra niche with scientific libraries (and not just/especially machine learning) post numpy and co, it was already very popular then, taking the throne from Perl.

Ruby at the time (and until Rails) was seen as an interesting contender, but nobody doubted Python's top popularity as far as scripting goes.

P.S. What about "the comments I've been churning"? If you mean regarding P3 adoption, then we're still quite a long ways away. Not even 50% -- and well after 10+ years. But that's beside the point. That doesn't mean I don't like Python (including 3).

It's funny because the article's only mention of java is "Wallaroo isn’t built on the JVM, which provides advantages that we will cover in a later blog post."

Quickly followed by pointless Factory classes and classes implementing an interface with one method.

I guess the reason is that classes are picklable and functions, by default, are not? This is supposed to be a distributed system after all.

Easy picklability was a strong driver for this initial API.

disclosure: I work at Wallaroo Labs, creators of Wallaroo.
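For context on the picklability point, here is a small sketch using only the standard library `pickle` module. The class and function names are made up for illustration and are not taken from Wallaroo. In CPython, `pickle` stores classes and functions by reference to their qualified name, so module-level functions generally do pickle; what an instance adds is that its state travels with it, and what fails is anything without an importable name, such as a lambda.

```python
import pickle


class SplitIntoWords(object):
    """Hypothetical computation class, defined at module level."""

    def __init__(self, separator=" "):
        self.separator = separator

    def compute(self, line):
        return line.split(self.separator)


def split_into_words(line):
    """Module-level function: pickled by reference to its qualified name."""
    return line.split(" ")


def can_pickle(obj):
    """Return True if pickle.dumps accepts obj, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


if __name__ == "__main__":
    # Instance: its attributes (here, the separator) are serialized too.
    print(can_pickle(SplitIntoWords(",")))
    # Named function: stored as a reference, no state of its own.
    print(can_pickle(split_into_words))
    # Lambda: has no importable name, so pickle rejects it.
    print(can_pickle(lambda line: line.split(" ")))
```

Run as a script, the first two checks are expected to succeed and the lambda to fail, which suggests the practical difference is carrying configuration and state rather than functions being unpicklable as such.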

Still doesn't seem very Pythonic.

    def name(self):
        return "split into words"

Why isn't name a class attribute? Does this really need to be a callable function?

In my opinion, I think as much of this stuff should be done in a declarative way as possible.

If you'd be interested in playing with Wallaroo and providing feedback on a 2nd generation of the API, feel free to reach out via our IRC channel, user mailing list or ping me via my personal email address that is in my HN profile.

We are looking for folks to help drive the next version of the API. The first one was done with feedback from a couple of clients who were interested in using Python and in many ways reflects their tastes. Feedback from a wider range of Python users is something we are actively soliciting at this point.

Or use the @property decorator

Having to create a class called xPartitionFunction doesn't seem to favour easy picklability. Why isn't it just a function/callable?

Likewise builder, just ask for a callable, and the smallest "builder" is the state class itself.
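A rough sketch of what "just ask for a callable" could look like. `PartitionedState`, `WordTotals` and `partition_by_first_letter` are all hypothetical names invented for this example; none of them come from the Wallaroo API.

```python
from collections import defaultdict


class WordTotals(object):
    """State object; the class itself is the smallest possible builder."""

    def __init__(self):
        self.counts = defaultdict(int)

    def update(self, word):
        self.counts[word] += 1
        return self.counts[word]


def partition_by_first_letter(word):
    """A plain function used as the partition function."""
    return word[0].lower() if word else ""


class PartitionedState(object):
    """Hypothetical framework-side helper that accepts any callables."""

    def __init__(self, partition, state_builder):
        self._partition = partition
        self._state_builder = state_builder
        self._states = {}

    def state_for(self, item):
        key = self._partition(item)
        if key not in self._states:
            # Calling the builder works for a class, a function or a partial.
            self._states[key] = self._state_builder()
        return self._states[key]


states = PartitionedState(partition_by_first_letter, WordTotals)
states.state_for("apple").update("apple")
```

Because a class is itself callable, passing `WordTotals` directly covers the builder role without a separate builder class.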

Computations seem similar: the API doc states they must provide a compute() method, but the example shows only a compute_multi. Is there a use case for having both on the same object? If there isn't, just ask for a reduction function and provide a decorator for one of the use cases. Or don't, and always ask for a list of results.

Or handle the "multi" case via a generator I guess (call reducer, if it returns a generator run it and add all items to the processing queue, otherwise add the one item).
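The generator idea can be sketched in a few lines. `run_computation` and the two sample computations are hypothetical, and treating `None` as "nothing to emit" is an assumption of this sketch rather than anything the thread specifies.

```python
import inspect
from collections import deque


def run_computation(compute, data, queue):
    """Call compute(data) and enqueue whatever it produces."""
    result = compute(data)
    if inspect.isgenerator(result):
        # "Multi" case: drain the generator, one queue entry per item.
        queue.extend(result)
    elif result is not None:
        # Single-result case. None is treated as "emit nothing" (assumption).
        queue.append(result)
    return queue


def to_upper(line):
    # Returns exactly one result.
    return line.upper()


def split_words(line):
    # Generator function: yields zero or more results.
    for word in line.split():
        yield word
```

The caller no longer needs separate `compute` and `compute_multi` entry points; whether a computation returns or yields decides how its output is queued.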

he should make a revised edition where they no longer consult the Sun God but rather the All Seeing Oracle...

Question for spooneybarger or anyone else at Wallaroo Labs.

How do you see this comparing to something like Dask? Would it compete with Dask, or be able to somehow work together with it?

Dask seems to let you write idiomatic Python code and not even think about splitting, joining, etc... and it builds the pipeline automagically by introspecting the AST.

We are actively looking at how we can better handle the use cases that Dask handles. A couple of folks we are working with use Dask heavily and are looking to switch. They are far more expert in Dask than we are, so I don't want to shoot my mouth off. The issues they've generally discussed with us are around performance and scaling problems they have had. We are still actively learning from them and hope to have a first version to start addressing those use cases around the end of this year. Ideally the middle of December.

Dask is very "batch" oriented, which, as I said above, is something we are in the process of adding to Wallaroo. Wallaroo is very stream-processing oriented. Wallaroo's strengths are in working with stateful, event-by-event applications.

Take word count as an example. Dask would be great if you had a body of text in files or whatnot that you needed to count. There's a beginning and an end to that task: count the words in this text. Wallaroo would shine if you had a never-ending stream of text, like Twitter's trending topics.
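A plain-Python sketch of that distinction, using neither Dask nor Wallaroo. The function names are made up for illustration. The batch version consumes a finite input and returns one final answer; the streaming version keeps running state and emits an updated count for every word as it arrives, so it also works on an input that never ends.

```python
import itertools
from collections import Counter


def batch_word_count(lines):
    """Batch: consume a finite input, return one final answer."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def streaming_word_count(lines):
    """Streaming: hold state and emit an updated count per incoming word."""
    counts = Counter()
    for line in lines:  # may be an endless iterator
        for word in line.split():
            counts[word] += 1
            yield word, counts[word]


if __name__ == "__main__":
    print(batch_word_count(["to be or not", "to be"]))

    # An endless source: the stream is only ever sampled, never finished.
    endless = itertools.cycle(["to be"])
    for word, count in itertools.islice(streaming_word_count(endless), 4):
        print(word, count)
```

Calling `batch_word_count` on the endless source would never return, which is the practical reason the two styles need different runtimes.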

That's a very coarse outline of a couple of differences. While we are working with clients to help them move off of Dask by adding that functionality to Wallaroo, I also think that, if you wanted to, you could use Wallaroo alongside a more batch-oriented system like Dask. Stream processing and batch processing are complementary. A number of technologies (us included) are looking to unify them. Why? Well, there's a lot of operational overhead to running both a batch system and a streaming system. A lot of folks would like to run a single system that works well for both.

I hope that answers your question.

Thanks, I think it does. I did a quick search to see if or how Dask handles streams. It seems trivial to add queues to dask.


I've never actually played with Dask at all for even batch processing let alone stream. If I ever get some free time I'll try to implement Word Count in Dask and see how the two codes compare.

> Stream processing and batch processing are complimentary

Must be complementary, I think.

Indeed. I regularly mix the two up. Thanks. Editing to fix.

I've used the similar Apache Beam project before. https://beam.apache.org

I mainly used their Java libraries, but the Python bindings have been coming along.

We recently checked in streaming support in Beam Python here: https://lists.apache.org/thread.html/82f5b5d2ab2ddd3849584f6...

You can take a look at and run the streaming wordcount example in Beam: https://github.com/apache/beam/blob/master/sdks/python/apach...

In addition to the available local execution, Google also offers running Beam pipelines as a managed service in Cloud Dataflow (https://cloud.google.com/dataflow/). Python streaming is in private alpha--contact us at dataflow-python-feedback@google.com if you'd like to try it out.

Note: I work for Google on Apache Beam and Cloud Dataflow.

I used Google Cloud Dataflow at a previous job. I really did like it and feel like Beam is set up pretty well. Thanks!

No support for Python 3

Python 3 support is on the roadmap for later this year.

In general, our roadmap is determined by what we think is important but also is heavily influenced by the needs of folks we are working closely with.

There's an almost infinite number of things we could work on, so we like to drive our direction based on the needs of folks we are working with. In the case of Python, the early users were all on Python 2.7 and thus we focused there. We've recently started working with folks who are looking for Python 3 support (in particular, 3.6), so we are going to be adding it.

If anyone is interested in adding features, language support etc to Wallaroo, we'd love to help. You can find us on freenode in the #wallaroo channel or stop by our user mailing list (https://groups.io/g/wallaroo) and we can help you out.

Enterprise user here.

Can confirm that we moved all the things to Python 3 (and it was easier than expected). Especially all the data processing pipelines.

No Python 3 is a deal breaker in 2017.

We'll have you covered by the end of the year if you are interested in checking us out then.

The heartbreaking thing is that if you're willing to only target 2.7 on the Python 2 front, it's relatively trivial to write in Python 3 and have it be immediately compatible in Python 2 using from __future__ imports.

For trivial programs sure... not when dealing with encoders, decoders, binary formats, etc, etc.

It all depends on how they wrote their Python 2.7 code. You can write it oblivious to Python 3 and screw things up, or you can write it in a way that you know you'll eventually support Python 3. In the latter case, you might as well just support both right away though.
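As an illustration of the "written with Python 3 in mind" style for binary formats, here is a small sketch that keeps the text/bytes boundary explicit. The framing format (a 4-byte big-endian length followed by a UTF-8 payload) and the function names are invented for this example. It uses only `struct` and `str.encode`/`bytes.decode`, and should behave the same on 2.7 and 3 when given unicode text; on 2.7 such a module would typically also start with the `from __future__` imports mentioned above.

```python
import struct


def encode_message(text):
    """Frame a text message as a 4-byte big-endian length plus UTF-8 payload."""
    payload = text.encode("utf-8")  # text -> bytes, done explicitly
    return struct.pack(">I", len(payload)) + payload


def decode_message(frame):
    """Inverse of encode_message; returns text."""
    (length,) = struct.unpack(">I", frame[:4])
    payload = frame[4:4 + length]
    return payload.decode("utf-8")  # bytes -> text, done explicitly


if __name__ == "__main__":
    frame = encode_message(u"h\u00e9llo")
    print(len(frame))
    print(decode_message(frame) == u"h\u00e9llo")
```

Code that instead mixes `str` and bytes implicitly is where the 2-to-3 move tends to hurt, which matches the encoder/decoder caveat raised above.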

Andy's next blog post is going to be in this general area. Check back next month!

Yea this is a bummer. Especially when new applications are being written in Python 2. At least write them so they are forwards compatible with Python 3 and save yourself some headaches later.

Could anyone recommend any resources comparing the various stream processing frameworks? Apache Storm (which I am familiar with), the below mentioned Apache Beam (of which I just heard for the first time), this new Wallaroo and any others? Beside their homepages, that is.

Stepping outside of my "Wallaroo Labs employee" role for a moment.

Comparisons can be really hard. What's right for one application or project isn't right for another. I'd be happy to chat over email with anyone interested in stream processing about the types of applications they are looking to build, the requirements they have, etc.

I get nice use cases and information we can use at Wallaroo Labs to help drive our product. In return, I will give unbiased feedback on what you should be looking for to solve a given problem.

My personal email is in my HN profile.


In case it isn't obvious, I work at Wallaroo Labs, the makers of Wallaroo. I'm also one of the authors of Storm Applied, Manning's book on Apache Storm.

I'd be interested in understanding which design is closest to yours, however. Flink? Akka Streams? Another?

That's complicated and nuanced. I'd be happy to have that conversation over email.


Here's one for just the Apache frameworks: https://databaseline.bitbucket.io/an-overview-of-apache-stre...

Not sure how Wallaroo compares though.

Hahaha.... that's so so so similar to something I've developed for my machine vision pipelines...

Would you be interested in chatting more about that? We are always looking for use cases we can learn from. My personal email address is in my profile if you are interested.

Looks good. I guess this is the first library I've heard of that deals with streaming in Python.


Very interesting library, haven't heard of it yet.

Thanks. We've been working hard on Wallaroo. Plenty more improvements to come but we felt that it was time to get it out there and get feedback to help us drive the work we do over the next few months.

Can I suggest having a Docker image of all the pre-compiled bits available? I was really interested in trying it out but don't know if I want to invest the time trying to compile everything for a tool that I don't know a lot about yet. If setup was just a `docker pull` away I would probably already be going through the tutorial.

If the 4 terminals required were instead just:

`docker run -d --name wallaroo -v ~/wallaroo-tutorial/celsius:/srv/application-module -p wallaroolabs/wallaroo-quickstart`

Or something similar I think a lot more people on this thread would be trying it out right now.

It looks really promising! If I get the spare time I'm definitely interested enough to give it a whirl.

First: thanks!

Re docker:

It's one of the options we are looking at to make it easier to get up and running.

Any particular reason that docker appeals to you?

We're not sure at this point in time what the best means is.

If there was a docker based QuickStart, what would you expect to be able to do with it?

Run the first example app? Something more?

+1 to this approach

Very cool, thanks for sharing!

Thanks! It's really nice to get positive feedback on something that you've spent months working on and are still pouring yourself into every day.
