
Introducing go-pipeline - tehjojo
https://whiskybadger.io/post/introducing-go-pipeline/
======
lobster_johnson
Speaking from painful experience, channels are not a good building block for
general-purpose pipelines. They're definitely a decent mechanism for
communicating between concurrent processes, which is what they should be used
for — but only that, generally speaking. (Sorry to be a wet blanket!)

For example, you'll run into things like channel buffering (I see that your
"operator" creates an unbuffered channel, so you already have a potential
performance issue there). Buffering is opaque to both sender and receiver
(unless they explicitly check the capacity) and can result in surprising
blocking. You'll also need to deal with channel ownership (the producer is,
in most situations, effectively the only one who can safely close a channel).
One channel is usually not enough; to propagate errors and to cancel an
arbitrary graph of pipes, you'll likely need 2-3. And regarding performance,
every channel is protected by an internal mutex to make it thread-safe, so
when you pipe data through channels, all your code pays for that locking,
even the parts that don't need to be thread-safe.
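
To make the "2-3 channels" point concrete, here is a minimal sketch of one
stage that needs a data channel, an error channel and a done channel. The
names stage and run are made up for illustration and are not part of
go-pipeline; the stage just doubles ints and treats a negative input as an
error.

```go
package main

import (
	"errors"
	"fmt"
)

// stage is a hypothetical pipeline step (not go-pipeline's API). It doubles
// each input and reports a negative input on a separate error channel.
func stage(done <-chan struct{}, in <-chan int) (<-chan int, <-chan error) {
	out := make(chan int)       // unbuffered: every send waits for a receiver
	errs := make(chan error, 1) // buffered so the single error send never blocks
	go func() {
		// The stage created out and errs, so it is the one that closes them.
		defer close(out)
		defer close(errs)
		for v := range in {
			if v < 0 {
				errs <- errors.New("negative input")
				return
			}
			select {
			case out <- v * 2:
			case <-done: // the consumer gave up; stop instead of blocking forever
				return
			}
		}
	}()
	return out, errs
}

// run feeds inputs through one stage and collects the results and any error.
func run(inputs []int) ([]int, error) {
	done := make(chan struct{})
	defer close(done)

	// Pre-filled, buffered input so the producer cannot block if the stage
	// stops early.
	in := make(chan int, len(inputs))
	for _, v := range inputs {
		in <- v
	}
	close(in)

	out, errs := stage(done, in)
	var results []int
	for v := range out {
		results = append(results, v)
	}
	// errs is closed by now: this yields the reported error, or nil.
	return results, <-errs
}

func main() {
	res, err := run([]int{1, 2, 3})
	fmt.Println(res, err)
	res, err = run([]int{1, -1, 3})
	fmt.Println(res, err)
}
```

Even in this tiny case, the caller has to know which channels are buffered
and who closes what, which is the bookkeeping I'm warning about.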

I recommend writing a dedicated iterator-style pipe interface type instead.
Channels can then be used carefully for parallelization, fan-out and such, but
direct data exchanges (e.g. transforming tuples) can be single-threaded and
fast.
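
A rough sketch of what I mean by an iterator-style pipe, assuming int values
for brevity. Iterator, sliceIter, mapIter, filterIter and collect are
hypothetical names, not taken from any library.

```go
package main

import "fmt"

// Iterator is a hypothetical pull-based pipe. Next returns the next value,
// or false once the source is exhausted.
type Iterator interface {
	Next() (int, bool)
}

// sliceIter yields the elements of a slice in order.
type sliceIter struct {
	items []int
	pos   int
}

func (s *sliceIter) Next() (int, bool) {
	if s.pos >= len(s.items) {
		return 0, false
	}
	v := s.items[s.pos]
	s.pos++
	return v, true
}

// mapIter applies fn to every value pulled from src.
type mapIter struct {
	src Iterator
	fn  func(int) int
}

func (m *mapIter) Next() (int, bool) {
	v, ok := m.src.Next()
	if !ok {
		return 0, false
	}
	return m.fn(v), true
}

// filterIter skips values for which keep returns false.
type filterIter struct {
	src  Iterator
	keep func(int) bool
}

func (f *filterIter) Next() (int, bool) {
	for {
		v, ok := f.src.Next()
		if !ok {
			return 0, false
		}
		if f.keep(v) {
			return v, true
		}
	}
}

// collect drains an iterator into a slice.
func collect(it Iterator) []int {
	var out []int
	for v, ok := it.Next(); ok; v, ok = it.Next() {
		out = append(out, v)
	}
	return out
}

func main() {
	src := &sliceIter{items: []int{1, 2, 3, 4}}
	doubled := &mapIter{src: src, fn: func(v int) int { return v * 2 }}
	big := &filterIter{src: doubled, keep: func(v int) bool { return v > 4 }}
	fmt.Println(collect(big)) // prints [6 8]
}
```

Everything here runs on the caller's goroutine with plain method calls, so
there are no goroutines, locks or close semantics involved until you decide
to parallelize a particular step.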

Lastly, I recommend annotating channels with arrows to make them receive-only
or send-only for the callee. You have "in chan interface{}", which should
probably be "in <-chan interface{}".

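A small sketch of the arrow annotations. produce and count are made-up
helpers for illustration, not functions from go-pipeline.

```go
package main

import "fmt"

// produce returns a receive-only channel, so callers cannot send on it or
// close it. Only the goroutine below, which owns the channel, can do that.
func produce(n int) <-chan interface{} {
	out := make(chan interface{})
	go func() {
		defer close(out)
		for i := 0; i < n; i++ {
			out <- i
		}
	}()
	return out
}

// count takes a receive-only channel. Writing "in <- x" or "close(in)" inside
// this function would be rejected by the compiler.
func count(in <-chan interface{}) int {
	n := 0
	for range in {
		n++
	}
	return n
}

func main() {
	fmt.Println(count(produce(5))) // prints 5
}
```

A bidirectional channel converts implicitly to a directional one, so the
change to the signature does not force changes at the call sites.
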
~~~
bogaczio
I think the performance issue is addressed in the second paragraph, and
you're quite right that it would create a slowdown. This library is quite
simple, and it's mostly an exploration of a potential design pattern; it is
neither intended nor recommended as a generic solution.

The channel ownership isn't so much of an issue, at least the way I understand
the question, since currently each Operator creates its own "output" channel,
and is effectively the producer for that channel. The way this works in
practice is that you instantiate an input channel, combine multiple operators
together, and when your input is done, you close that input channel, which
closes all of the subsequent channels in turn (because of the for ... range ch
being used). That said, in practice I have used a separate channel for error
reporting, which is lightly addressed near the end of the article.
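
Roughly, the close cascade looks like the sketch below. This is a simplified
stand-in written for this comment, not the library's actual Operator code;
operator and runChain are illustrative names.

```go
package main

import "fmt"

// operator is a simplified stand-in for an Operator. It creates and owns its
// output channel, and closes it once its input channel has been closed and
// drained.
func operator(in <-chan interface{}, fn func(interface{}) interface{}) <-chan interface{} {
	out := make(chan interface{})
	go func() {
		defer close(out) // runs when the range loop ends, i.e. when in is closed
		for v := range in {
			out <- fn(v)
		}
	}()
	return out
}

// runChain pushes values through two chained operators and collects results.
func runChain(values []int) []int {
	in := make(chan interface{})
	addOne := operator(in, func(v interface{}) interface{} { return v.(int) + 1 })
	doubled := operator(addOne, func(v interface{}) interface{} { return v.(int) * 2 })

	go func() {
		for _, v := range values {
			in <- v
		}
		close(in) // closing the input ends each range loop in turn
	}()

	var res []int
	for v := range doubled {
		res = append(res, v.(int))
	}
	return res
}

func main() {
	fmt.Println(runChain([]int{1, 2, 3})) // prints [4 6 8]
}
```

The caller only ever closes the channel it created; every other channel is
closed by the operator that made it.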

The point about annotating channels is very good, I'll probably be making a
change to incorporate that here soon.

Thanks for reading, and I appreciate the good feedback!

------
echlebek
I'm not sure why I should use this and give up typed channels when I could
easily implement the same logic on my own?

~~~
bogaczio
I would say that as long as you have a small enough number of types where
implementing this for each of them (and each combination of them) is
manageable... you should. The absolute worst thing about this is losing type
safety (and I helped write it). The time to use this is IF this kind of design
could simplify your code overall AND the number of types and their
combinations is simply too large to code by hand. Say you have four input
types, two intermediary types, and one output type, and more than one step in
between them; then you'd be writing about 15 times the code. We actually had a
situation like this, which is why it came in handy. If you're not in this
situation though, I would always advocate retaining type safety.
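
To illustrate what losing type safety costs in practice, here is a
hypothetical step over interface{} values (toLength is a made-up name, not
part of the library). A wrong type is only detected at runtime, so the step
has to check for it itself.

```go
package main

import "fmt"

// toLength is a hypothetical pipeline step that expects a string. With
// interface{} the compiler cannot verify that, so the step checks at runtime.
func toLength(v interface{}) (int, error) {
	s, ok := v.(string) // comma-ok form: no panic on a mismatched type
	if !ok {
		return 0, fmt.Errorf("toLength: expected string, got %T", v)
	}
	return len(s), nil
}

func main() {
	fmt.Println(toLength("abc")) // prints 3 <nil>
	fmt.Println(toLength(42))    // prints 0 toLength: expected string, got int
}
```

With typed channels the second call would simply not compile.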

