
Stream Processing for Go - spooneybarger
https://blog.wallaroolabs.com/2018/01/go-go-go-stream-processing-for-go/
======
spooneybarger
Hi all,

I'm the VP of Engineering at Wallaroo Labs. Happy to answer questions. I
worked with Andy on this preview release of our Go API. Andy will be on in a
while and will also be answering any questions that might come up.

We are getting ready to start work on performance tuning soon. We have a lot
of work already planned there.

If you play around with the Go API, we'd love to hear from you. We need your
feedback to improve our Go offering.

You can reach us via our user mailing list:

[https://groups.io/g/wallaroo](https://groups.io/g/wallaroo)

Or our IRC channel:

[https://webchat.freenode.net/?channels=#wallaroo](https://webchat.freenode.net/?channels=#wallaroo)

We'd also love to do a short call with folks to discuss their Go stream
processing use cases. If you email hello@wallaroolabs.com, we can set that up.

Thanks in advance!

HN was a great help when we developed the new version of our Python API. We
hope to get the same results with the Go API.

~~~
pstuart
One small nit on the Go example code:

    
    
    func (wordTotals *WordTotals) Update(word string) {
        total, found := wordTotals.WordTotals[word]
        if !found {
            total = 0
        }
        wordTotals.WordTotals[word] = total + 1
    }
    

Can be done as:

    
    
    func (wordTotals *WordTotals) Update(word string) {
        wordTotals.WordTotals[word] += 1
    }
    

(thanks to Go default values)
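A minimal, runnable sketch of why the shorter form works: reading a missing key from a Go map yields the value type's zero value, so the increment starts from 0. The `uint64` value type and the `NewWordTotals` constructor are assumptions made for this sketch, not taken from the blog post. One caveat: the map must be allocated first, since assigning into a nil map panics.

```go
package main

import "fmt"

// WordTotals mirrors the shape of the type in the snippet above.
// The uint64 value type is an assumption for this sketch.
type WordTotals struct {
	WordTotals map[string]uint64
}

// NewWordTotals is a hypothetical constructor. It allocates the map,
// because writing to a nil map panics at run time.
func NewWordTotals() *WordTotals {
	return &WordTotals{WordTotals: make(map[string]uint64)}
}

// Update relies on the zero value: a missing key reads as 0,
// so the first increment for a word stores 1.
func (wordTotals *WordTotals) Update(word string) {
	wordTotals.WordTotals[word]++
}

func main() {
	totals := NewWordTotals()
	for _, w := range []string{"go", "pony", "go"} {
		totals.Update(w)
	}
	// A word that was never seen also reads as 0.
	fmt.Println(totals.WordTotals["go"], totals.WordTotals["pony"], totals.WordTotals["erlang"])
	// prints: 2 1 0
}
```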

------
Touche
Wow, Pony! Pony is one of my favorite young languages. It fits a similar niche
that Go fits (not saying that either language can't be used for other
purposes), in that they both are great for building cli tools with good
concurrency primitives.

Pony, however, uses the Actor model like Erlang (although the programming
model is a bit different). In my opinion this is a superior model for
concurrency than goroutines. I highly recommend checking it out if you have
any interest in cli tools or erlang/elixir.

~~~
samuell
In my view and experience, the actor model and the CSP model of Go and others
(which is also very similar to Flow-Based Programming (FBP) if using buffered
channels), can actually fit together very nicely, with each of them sitting at
different abstraction levels:

\- The actor model for high-level coordination and communication

\- CSP and/or FBP at the low-level, for high performance processing.

The reason for this is mainly that CSP/FBP inherently supports blocking
channels between processes when processes are busy (or channels are full for
FBP), thus providing implicit backpressure. This can simplify low-level
programming a lot, since you don't need to worry about over-filled mailboxes
or overbooked CPUs. OTOH, for communication e.g. across a cluster, this
becomes problematic since you can't rely on the same level of integrity of the
network connections ... meaning that the very communication-intensive nature
of CSP/FBP might be less of a good fit.

Thus in my view, the actor model would fit better for communicating between
services which are internally implemented e.g. with CSP/FBP.
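The implicit backpressure described above can be sketched in a few lines of Go. This is only an illustration with made-up function names (`produce`, `consume`): the producer's send blocks whenever the bounded channel is full, so it can never get more than `cap(ch)` items ahead of the consumer, and no explicit flow-control code is needed.

```go
package main

import "fmt"

// produce sends n items into out and then closes it. Each send blocks
// while the buffer is full, which is the implicit backpressure.
func produce(n int, out chan<- int) {
	defer close(out)
	for i := 0; i < n; i++ {
		out <- i
	}
}

// consume drains in and returns the sum of the items plus the largest
// backlog (number of queued items) it observed along the way.
func consume(in <-chan int) (sum, maxBacklog int) {
	for v := range in {
		if l := len(in); l > maxBacklog {
			maxBacklog = l
		}
		sum += v
	}
	return sum, maxBacklog
}

func main() {
	ch := make(chan int, 4) // bounded buffer, as in the FBP-style case
	go produce(1000, ch)
	sum, backlog := consume(ch)
	// The backlog can never exceed the channel capacity.
	fmt.Println(sum, backlog <= cap(ch))
	// prints: 499500 true
}
```

With an unbuffered channel (`make(chan int)`) the producer and consumer run in lockstep, which corresponds to the plain CSP case.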

I wrote a post about this, some time ago:

[http://bionics.it/posts/flowbased-vs-erlang-message-passing](http://bionics.it/posts/flowbased-vs-erlang-message-passing)

It will be interesting to see whether Pony manages to fill more than just the
coordination role. Given its much higher raw-crunching performance, perhaps
that is the case. I'm a little worried about the lack of implicit back-
pressure (due to the actor model), but let's see.

Perhaps because of its high-performance, back-pressure won't be too costly to
implement anyway? One can hope. Would be nice with a language that can
encompass both programming paradigms, and be suited for the full spectrum
between low-level crunching and high-level coordination and communication.

~~~
felixgallo
The actor model in Erlang and (recently!) in Pony implements something even
better: explicit backpressure by deprioritizing senders to busy mailboxes. The
model can do this because it includes the scheduler in its scope, which
doesn't happen in CSP.

~~~
scottlf
I, too, have a bit of experience with backpressure implementations in both
Erlang and Pony. Erlang's "penalize the sender" doesn't work in many cases, so
I'm not surprised that it's being removed. Erlang's remote distribution
implementation & messaging semantics are Mostly Great but are definitely not
Perfect.

1\. Head-of-line blocking caused by congestion on the single TCP connection
used for message transmission between any two Erlang nodes. This can cause
major problems for apps with a control vs. data plane design, such as Riak and
Lasp. Work on Partisan
([https://github.com/lasp-lang/partisan](https://github.com/lasp-lang/partisan))
appears to be a substantial improvement.

2\. If the single remote distribution TCP connection between two nodes is
broken, then the first Erlang process to send a message to the remote node is,
hrrm, well, borrowed/co-opted by the BEAM VM to connect the new TCP session.
IIRC that process is marked unrunnable by the scheduler until the connection
is set up or perhaps there's an error. If that process is really important to
the functioning of your app, for example, an important system manager in the
control plane of the app, then you have a very difficult-to-diagnose latency
Heisenbug to cope with.

-Scott

~~~
felixgallo
'penalize the sender' has the benefit of being very clean; it doesn't work in
>1 machine distributed cases, but then neither do Go or Pony yet. I'm
surprised to discover that it's being removed. I wonder what the thinking is
for the local case.

~~~
ramchip
> it doesn't work in >1 machine distributed cases

More generally, it doesn’t work when you have multiple schedulers, which
nowadays is true for nearly all Erlang systems.

------
Gepsens
How does the Wallaroo roadmap compare to other ambitious frameworks such as
Flink and Beam? I haven't had much time to look at the API. Are there tools to
reason about both processing and event time? Is it possible to save state and
restart processes with updated code?

~~~
spooneybarger
Hi Gepsens,

\- How does the Wallaroo roadmap compare to other ambitious frameworks such as
Flink and Beam?

I'm not familiar with the roadmaps for Flink and Beam so I can't comment on
that. I can, however, speak to our roadmap.

For Go in particular, I'm going to be starting on performance improvements
very soon. Beyond that, we are waiting for feedback from folks to determine
what the best course of action for the Go API is. When we released our Python
API, we got a lot of excellent feedback that led to us switching from a
class-based API to one that emphasized functions and decorators. We expect
that a similar evolution will happen with the Go API.

For Wallaroo in general, we're working on improvements to our documentation,
our installation experience, Python API improvements, state object migrations
(see below), as well as an expanded API to allow you to express more types of
dataflows with Wallaroo. There's more beyond that, but you get the idea.

Wallaroo is definitely a younger product than Flink and Beam. We open sourced
it in September of 2017 so we have a lot of work to do.

We'd love to hear from folks who are interested in being able to do stream
processing in Go or Python without needing to become experts in running the
JVM in production. Talking with folks helps us prioritize our work and deliver
a product that can be valuable to some folks now and still more in the future
as we add features.

\- Tools to reason about processing/event time:

Wallaroo has a metrics UI that allows you to see throughput and latency within
Wallaroo on a per-pipeline basis as well as per computation. It can be very
helpful in spotting bottlenecks.

I'm interested in hearing what other tools you think would be useful. We have
a few ideas but are always looking for more things we can build to make the
product better.

\- Is it possible to save state and restart processes with updated code?

Yes, so long as your data objects don't change. If your data objects change
(for example, you add or remove a field), then not currently. However, we
started initial discussion on our approach to that this week and will be
starting work on it in the near future.

The current solution if your data objects change is to get the state of
objects out of the system before shutdown and then stream in those values
when you start up with the code that involves a schema change to your data
objects. It's far from perfect but works for now, until the schema migration
feature is added.

------
shizcakes
What led you to choose Pony for this project?

~~~
bpicolo
They also have a blog post on just that:

[https://blog.wallaroolabs.com/2017/10/why-we-used-pony-to-
wr...](https://blog.wallaroolabs.com/2017/10/why-we-used-pony-to-write-
wallaroo/)

------
hooph00p
Does Wallaroo have Apache Kafka support?

~~~
spooneybarger
Yes.

TCP and Kafka are the currently supported Source and Sink types. We are in the
process of improving the Kafka support.

Additionally, we are in the design phase for providing APIs in Go and Python
to make it easier for Wallaroo users to provide their own sources and sinks.

------
lerchmod
Does all state have to fit in memory?

~~~
spooneybarger
Currently yes. If your state doesn't fit in memory on the number of machines
you have, you'd need to add more.

We plan on adding support for only keeping part of state in memory. That's in
the discussion phase. We want to talk to folks about their use cases and
ideally work with them before adding it.

There are many trade-offs involved in such a feature, and making those
trade-offs based on an informed understanding of our possible users' needs is
our preferred solution.

------
joshlemer
Does this support advanced windowing and watermarking strategies like those
found in Flink?

~~~
spooneybarger
You can do event-based windowing now, but you need to build it up from
primitives. Time-based windows aren't supported yet; they are, however, on our
roadmap. We plan on adding windowing APIs to make it very easy to do a variety
of windowing.

You can learn more about the existing event based windowing approach that is
available now here: [https://blog.wallaroolabs.com/2017/11/non-native-event-
drive...](https://blog.wallaroolabs.com/2017/11/non-native-event-driven-
windowing-in-wallaroo/)
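To give a rough idea of what "building it up from primitives" can look like, here is a generic sketch of a count-based (event-driven) window in plain Go. It does not use the Wallaroo Go API. `CountWindow`, `NewCountWindow`, and `Add` are hypothetical names invented for this illustration, and the linked post describes the actual approach.

```go
package main

import "fmt"

// CountWindow is a hypothetical state object that groups incoming events
// into fixed-size, non-overlapping windows by counting them.
type CountWindow struct {
	size   int
	events []int
}

// NewCountWindow returns a window that closes after size events.
func NewCountWindow(size int) *CountWindow {
	return &CountWindow{size: size, events: make([]int, 0, size)}
}

// Add records one event. When the window fills up, it returns the sum of
// the window's events and true, and resets the state for the next window.
// Otherwise it returns 0 and false.
func (w *CountWindow) Add(v int) (int, bool) {
	w.events = append(w.events, v)
	if len(w.events) < w.size {
		return 0, false
	}
	sum := 0
	for _, e := range w.events {
		sum += e
	}
	w.events = w.events[:0] // start a new window, reusing the backing array
	return sum, true
}

func main() {
	w := NewCountWindow(3)
	for _, v := range []int{1, 2, 3, 4, 5, 6, 7} {
		if sum, ok := w.Add(v); ok {
			fmt.Println("window sum:", sum)
		}
	}
	// prints:
	// window sum: 6
	// window sum: 15
}
```

The trailing event (7) stays in state until two more events arrive, which is the usual trade-off of a purely event-driven window.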

\---

Can you elaborate on what specifically you are referring to as "watermarking
strategies" in Flink? We are familiar with Flink's use of the Chandy-Lamport
algorithm, but I'm not sure if that is what you are referring to.

~~~
joshlemer
So in Flink you can use a few different bundled strategies for emitting
watermarks, or you can define your own. Here's the documentation on that:
[https://ci.apache.org/projects/flink/flink-docs-
release-1.4/...](https://ci.apache.org/projects/flink/flink-docs-
release-1.4/dev/event_timestamps_watermarks.html)

For example, you can use the bundled BoundedOutOfOrdernessTimestampExtractor
to emit watermarks that lag behind the maximum timestamp seen so far by some
constant amount (say, 50 ms). You could customize this behaviour by
implementing the required interface, to emit a watermark, say, whenever we
pass another calendar day. You can also assign strictly-ascending watermarks
which you could say is equivalent to maxOutOfOrderness = 0.

You can also choose to emit watermarks based on the content of the stream (see
Punctuated Watermarks [https://ci.apache.org/projects/flink/flink-docs-
release-1.4/...](https://ci.apache.org/projects/flink/flink-docs-
release-1.4/dev/event_timestamps_watermarks.html#with-punctuated-watermarks))

~~~
spooneybarger
Our time-based windowing hasn't been implemented yet, so the answer would be:
not at this time. It is, however, on our roadmap.

