
Onyx: fault tolerant data processing for Clojure - coding4all
https://github.com/MichaelDrogalis/onyx/blob/0.5.x/README.md
======
afandian
This looks very interesting. I'm doing some log file processing in Apache
Spark in Clojure. Spark is written in Scala, but has a Java API, which is
wrapped by Flambo. It looks and feels entirely Clojure.

The semantics look very similar indeed. Does anyone have a comparison between
Onyx and Spark?

~~~
lbradstreet
I've used Onyx, but I haven't used Spark, so take this with a grain of salt.

A few key differences:

Onyx aggressively uses data structures to define the structure of computation,
defining the data flow (Onyx workflow) and parameterization (Onyx catalog) of
the computation via Clojure maps and vectors. In comparison, Flambo and
Spark define the structure of computation via functions over collections. One
way in which Onyx's approach is powerful is that it becomes trivial to
manipulate workflows or catalogs before submitting jobs at runtime, allowing
you to add additional tasks, task options, etc.
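
As a rough illustration of that point (written in Python rather than Clojure, with made-up task names and catalog keys that are not Onyx's actual schema), a workflow and catalog held as plain data can be edited with ordinary collection operations before a job is submitted:

```python
# Sketch only: Python lists and dicts stand in for Clojure vectors and maps.
# Task names and catalog keys are hypothetical, not Onyx's real schema.

workflow = [["read-logs", "parse"], ["parse", "write-db"]]

catalog = [
    {"name": "read-logs", "type": "input", "batch_size": 100},
    {"name": "parse", "type": "function", "batch_size": 100},
    {"name": "write-db", "type": "output", "batch_size": 100},
]


def insert_task(workflow, catalog, upstream, downstream, entry):
    """Return a new workflow and catalog with `entry` spliced between two tasks."""
    new_workflow = []
    for src, dst in workflow:
        if (src, dst) == (upstream, downstream):
            # Replace the single edge with two edges that pass through the new task.
            new_workflow.append([src, entry["name"]])
            new_workflow.append([entry["name"], dst])
        else:
            new_workflow.append([src, dst])
    # Build a new catalog list so the original stays untouched.
    return new_workflow, catalog + [entry]


dedupe = {"name": "dedupe", "type": "function", "batch_size": 50}
workflow2, catalog2 = insert_task(workflow, catalog, "parse", "write-db", dedupe)
```

Because the job description is just data, the edit is a plain function from one description to another, and nothing about the job has to be redefined in code.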

Onyx also implements batching over streaming operations, whereas Spark appears
to be the opposite. There are likely to be trade-offs between these
approaches.

Spark is also a lot faster, though this isn't necessarily intrinsic to the
approaches.

~~~
jeletonskelly
I'm interested to know if you've used Storm at all and how it compares to
Onyx. I'm currently considering both for a project.

~~~
XPherior
Hello, Michael Drogalis - the author here.

I'm also not a Spark user, but I have used Storm:

\- Storm is significantly more mature and performant at the moment.

\- Storm has a better cross-language story in terms of bolt functions.

\- Pretty much everything in Onyx is much more open ended. This applies to
deployment, program structure, and workflow creation - and is mostly an
artifact of how aggressively Onyx uses data structures.

\- Onyx has a far better reach across languages in terms of its information
model.

\- Onyx will be adopting a tweaked version of Storm's message model next
release to get on the same level of performance and reliability. We're
dropping the HornetQ dependency.

\- Onyx is born out of years of frustration with direct usage of Storm and
Hadoop.

~~~
jwr
As someone who has been using Storm, this looks very interesting. What I
particularly like are the clean, well thought-out ideas. Also, easily
reconfigurable (at runtime) topologies are something we'd be interested in. I
will definitely take a very close look at Onyx.

Performance is important: in our case, decreasing it significantly below
Storm's level would not be acceptable.

Also, I watched the Strange Loop presentation and the tree model looks
limiting to me: I have topologies where I need to merge information from two
streams (but perhaps I haven't understood the Onyx model yet).

~~~
XPherior
Performance - wait until the 0.6.0 release. We'll be caught up with Storm by
then.

The tree model is being removed in 0.6.0 in favor of a vector of vectors
(DAG), which allows multiple inputs. See
[https://github.com/MichaelDrogalis/onyx/blob/0.5.x/doc/user-...](https://github.com/MichaelDrogalis/onyx/blob/0.5.x/doc/user-guide/concepts.md#workflow).
The tree model wasn't one of my better ideas.

Edit: to be clear, you can do stream joins right now in 0.5.3 with the DAG
model.
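
A small sketch of why a vector of vectors allows joins (Python lists standing in for Clojure vectors; the task names are hypothetical and this is not Onyx code): each inner pair is one edge, so a task can appear as the destination of more than one edge.

```python
from collections import defaultdict

# Sketch only: each pair is an edge [from_task, to_task]; names are hypothetical.
workflow = [
    ["clicks", "join-by-user"],
    ["purchases", "join-by-user"],
    ["join-by-user", "output"],
]


def inputs_of(workflow):
    """Map each task to the list of tasks that feed it."""
    parents = defaultdict(list)
    for src, dst in workflow:
        parents[dst].append(src)
    return dict(parents)


parents = inputs_of(workflow)
# "join-by-user" ends up with two upstream tasks, which a strict tree with
# one parent per task would not allow.
```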

------
XPherior
Hi folks! I'm Michael Drogalis - the primary author. I'm happy to answer any
questions.

~~~
bmh100
What were the main pain points that motivated you to develop Onyx? What
capabilities do you want to add or have already added that Storm doesn't
provide?

~~~
XPherior
See:
[https://github.com/MichaelDrogalis/onyx/blob/0.5.x/doc/user-...](https://github.com/MichaelDrogalis/onyx/blob/0.5.x/doc/user-guide/what-does-it-offer.md)

These are all the things I wrote down that I wanted before I wrote the first
line of code.

------
johnmurray_io
Check out the original video introducing Onyx:
[http://youtu.be/vG47Gui3hYE](http://youtu.be/vG47Gui3hYE)

~~~
maelito
Live open sourcing!

------
lkrubner
If this interests you, then you should also check out the post where Michael
Drogalis first introduced this:

[http://michaeldrogalis.tumblr.com/post/98143185776/onyx-dist...](http://michaeldrogalis.tumblr.com/post/98143185776/onyx-distributed-data-processing-for-clojure)

------
dj-wonk
Re: Onyx's architecture. I would wonder about performance when keeping a
shared log in ZooKeeper. Why not use something like Kafka? It is designed
for high-volume, immutable logging. ZK works best for less frequently changing
configuration, such as node connection information or snapshotting. I could be
wrong. I'd like to hear your thoughts and experience.

~~~
XPherior
\- Picking up Kafka means introducing another dependency.

\- Onyx's log doesn't grow particularly large because it's only used for
coordination, not for messaging.

\- Because the log isn't huge, and can be GC'ed, consumers don't experience
high volumes of messages.

\- ZooKeeper offers sequential node creation - making it a really good fit for
what the log needs to do.
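
To make the last two points concrete, here is a minimal in-memory sketch (Python, not a ZooKeeper client and not Onyx's implementation; the class name and entry contents are invented for illustration). ZooKeeper's sequential nodes get a monotonically increasing counter from the server, which is modeled here with a plain counter so every reader sees the same total order and old entries can be dropped:

```python
import itertools


class SequentialLog:
    """In-memory stand-in for a directory of sequential nodes (hypothetical)."""

    def __init__(self):
        self._counter = itertools.count()  # plays the role of the server-side counter
        self._entries = {}

    def append(self, entry):
        """Store an entry under the next sequence number and return that number."""
        seq = next(self._counter)
        self._entries[seq] = entry
        return seq

    def read_from(self, seq):
        """Return entries with sequence number >= seq, in log order."""
        return [self._entries[i] for i in sorted(self._entries) if i >= seq]

    def gc(self, up_to):
        """Drop entries with sequence number < up_to."""
        for i in [i for i in self._entries if i < up_to]:
            del self._entries[i]


log = SequentialLog()
first = log.append({"event": "peer-joined", "peer": "p1"})
second = log.append({"event": "peer-joined", "peer": "p2"})
third = log.append({"event": "job-submitted", "job": "j1"})

tail = log.read_from(second)   # a reader catching up from a known position
log.gc(up_to=third)            # entries every reader has already seen can go
remaining = log.read_from(0)
```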

------
boothead
Looks superficially similar to
[https://github.com/aphyr/tesser](https://github.com/aphyr/tesser). Does anyone
know both well enough to give a comparison?

From a brief examination, Tesser looks a lot simpler (probably because it
encodes most of the folding using various monoids). Does Onyx have a similar
abstraction model that I missed?
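
For anyone unfamiliar with the monoid idea, a minimal sketch in Python (this is not Tesser's API; the `fold` helper is invented for illustration): if the combining function is associative and has a neutral element, each chunk can be reduced on its own and the partial results merged afterwards.

```python
from functools import reduce


def fold(identity, combine, chunks):
    """Reduce each chunk independently, then combine the partial results.

    Gives the same answer as one sequential reduce when `combine` is
    associative and `identity()` returns its neutral element.
    """
    partials = [reduce(combine, chunk, identity()) for chunk in chunks]
    return reduce(combine, partials, identity())


chunks = [[1, 2, 3], [4, 5], [6]]

total = fold(lambda: 0, lambda a, b: a + b, chunks)   # sum monoid
largest = fold(lambda: float("-inf"), max, chunks)    # max monoid
```

In this sketch the per-chunk reductions run one after another, but since they do not depend on each other they could equally run on separate cores or machines.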

~~~
erichmond
Onyx is distributed and Tesser just uses all the available cores of a
particular machine AFAIK.

Both libraries are awesome.

~~~
boothead
Tesser also allows you to distribute it using Hadoop, I think. I haven't used
it; I only happened to hear about it when @aphyr gave a talk at the Clojure
eXchange in London.

