
Ask HN: A framework for pipeline-oriented processing *not* on the JVM? - lobster_johnson
I'm looking to design an application that is, conceptually, a pipeline: data streams in at one end and is increasingly refined, spitting out intermediate data that feeds into multiple pipelines with multiple outputs that join or loop back, eventually ending up in one or more outputs. All of this needs to run continuously, with maximum parallelism, on a cluster of hosts. It's not realtime, however.

Last I checked, all of the distributed processing world outside of Google is using Java these days. I've heard of, but not used, frameworks such as Hadoop, Spark and Storm that seem like good fits for this kind of processing. However, I have zero interest in being on the JVM, and I want to build this system in Go, Elixir/Erlang or C++. (OCaml/Reason is also on my list of interesting languages, but I don't think it has the needed concurrency support yet.) I'd be happy _using_ a JVM system as long as all the application code can be written in something that isn't Java, Scala or Groovy.

Is anyone else doing anything similar, but without a JVM language?
======
toast0
I haven't used it, but I've seen presentations about Disco [1], a map-reduce
thingie in Erlang with Python. I also wrote about half of a map-reduce thing
in Perl, although I hadn't really gotten the network part down before I
stopped using it.

If most of the workers can read input on standard in, and write output to
standard out, it's pretty easy for a manager process to set those up as pipes
or TCP sockets and then exec the worker. For multiple inputs -> one output, or
one output -> multiple inputs, it's a bit trickier, but you could probably
still open the sockets and just pass the worker some information about which
sockets are inputs and which are outputs -- the worker may need to do fancier
I/O handling (select and what not).
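A minimal sketch of that manager idea in Go (one of the languages the OP
mentioned), assuming stock Unix commands (`tr` and `sort` here) as stand-in
workers that read standard in and write standard out -- the `runStage` helper
is hypothetical, just to show the wiring:

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"os/exec"
	"strings"
)

// runStage execs a worker command, feeds it input lines on stdin, and
// collects its stdout lines -- one stage of the pipeline.
func runStage(name string, args []string, input []string) ([]string, error) {
	cmd := exec.Command(name, args...)
	stdin, err := cmd.StdinPipe()
	if err != nil {
		return nil, err
	}
	stdout, err := cmd.StdoutPipe()
	if err != nil {
		return nil, err
	}
	if err := cmd.Start(); err != nil {
		return nil, err
	}
	// Feed input concurrently so we don't deadlock on full pipe buffers.
	go func() {
		defer stdin.Close()
		for _, line := range input {
			io.WriteString(stdin, line+"\n")
		}
	}()
	var out []string
	sc := bufio.NewScanner(stdout)
	for sc.Scan() {
		out = append(out, sc.Text())
	}
	if err := sc.Err(); err != nil {
		return nil, err
	}
	if err := cmd.Wait(); err != nil {
		return nil, err
	}
	return out, nil
}

func main() {
	// Chain two stages: uppercase each line, then sort the lines.
	stage1, err := runStage("tr", []string{"a-z", "A-Z"}, []string{"banana", "apple"})
	if err != nil {
		panic(err)
	}
	stage2, err := runStage("sort", nil, stage1)
	if err != nil {
		panic(err)
	}
	fmt.Println(strings.Join(stage2, " "))
}
```

A real manager would keep the workers long-lived and stream between them with
sockets rather than buffering whole stages in memory, but the exec-and-pipe
mechanics are the same.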

If you need to handle transient failures of workers and restart processing, or
cache output, etc, then that's a bit more work.

Also -- do consider if you can do it all on a single box if you just get a
really big box. You can stuff a _lot_ of memory in a single box now, and save
a lot of hassle.

[1] [http://discoproject.org/](http://discoproject.org/)

~~~
lobster_johnson
Disco is interesting, though it seems like the multi-language support is a bit
lacking. Someone wrote a Go worker, but it's unfinished and abandoned.

A pipeline processing system needs durable queueing, though. Maybe you could
use the Disco distributed file system or maybe plug in something like Kafka,
though I'm not sure how.

~~~
toast0
> A pipeline processing system needs durable queueing, though.

Eh, it might be nice to have, and a lot of systems have it, but it's not
necessarily a need. Depends on your use case, and your environment -- most
days, none of my servers have any problems; on the days where there are
problems, I could manually restart my pipeline jobs. (I would need durable
storage of the inputs to the pipeline.)

~~~
lobster_johnson
I was referring to the inputs, mostly. My app also has a lot of intermediate
data that needs to stick around and be durable across batches -- things like
scraping results that should avoid incurring duplicate work when nothing has
changed. For that, some kind of data-locality awareness would also be nice.

------
eip
Storm is written in Clojure, which I am pretty sure is at least part of the
reason it was abandoned by Twitter.

Pretty much all the JVM frameworks should work with any JVM language.

Groovy is kind of a turd so I can understand wanting to avoid it but Scala is
very nice and Java is OK. I would much prefer Scala or Java to any of the
choices you listed. Apparently a lot of people agree with me considering that
most of the frameworks you listed are written in Java or Scala.

Why make things hard on yourself?

------
jon-wood
AWS has its Simple Workflow Service [1], an API providing a set of primitives
for building arbitrary pipelines; you then plug in whatever language you want
to consume messages from the workflow asking for work to be done. I've not
used it personally, but it looks like it could work for you.

[1] [https://aws.amazon.com/swf/](https://aws.amazon.com/swf/)

