

Clojure on Hadoop: A New Hope - dirtyvagabond
http://blog.factual.com/clojure-on-hadoop-a-new-hope

======
wicknicks
My experience with Hadoop tells me that it's great for all counting tasks. It
makes a lot of sense that MapReduce was designed at Google to construct posting
lists for their search index. Beyond this sweet spot, it gets really tricky to
map your solution onto a map-reduce task. The programmer has to treat map and
reduce as Java black boxes and express everything he needs inside them.
Hadoop's big victory is the scale it operates on.
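
The counting sweet spot is easy to see in miniature. Here is a rough sketch of
the map-reduce counting pattern in plain Python (no Hadoop cluster involved;
the function names are illustrative, not any Hadoop API):

```python
# Simulate the two phases of a MapReduce word count in-process.
from itertools import groupby
from operator import itemgetter

def map_phase(docs):
    # Mapper: emit (word, 1) for every word, like building posting-list counts.
    for doc in docs:
        for word in doc.split():
            yield (word, 1)

def reduce_phase(pairs):
    # Shuffle/sort by key, then reducer: sum the values for each key.
    pairs = sorted(pairs, key=itemgetter(0))
    return {key: sum(v for _, v in grp)
            for key, grp in groupby(pairs, key=itemgetter(0))}

counts = reduce_phase(map_phase(["to be or not to be",
                                 "to map or to reduce"]))
# counts["to"] == 4, counts["be"] == 2
```

Anything that fits this emit-then-aggregate shape parallelizes trivially;
anything that doesn't is where the contortions begin.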

Are people here waiting for a "SQL-like-declarative-query-language + Hadoop"
hybrid, which lets one write declarative queries to run over large amounts of
data? Academia seems to be very motivated to produce something like this.

~~~
scott_s
I work on Streams at IBM Research. Our solution to handling large amounts of
data is called stream programming, and it's more general than MapReduce. For a
sample of what our language looks like, check out this example:
[http://publib.boulder.ibm.com/infocenter/streams/v2r0/index....](http://publib.boulder.ibm.com/infocenter/streams/v2r0/index.jsp?topic=%2Fcom.ibm.swg.im.infosphere.streams.spl-language-specification.doc%2Fdoc%2Flangoverview.html) And a PDF version of the same:
[http://publib.boulder.ibm.com/infocenter/streams/v2r0/topic/...](http://publib.boulder.ibm.com/infocenter/streams/v2r0/topic/com.ibm.swg.im.infosphere.streams.product.doc/doc/IBMInfoSphereStreams-SPLLanguageSpecification.pdf)

~~~
toisanji
Isn't stream programming what Storm is supposed to be?

~~~
scott_s
Yes, it is a similar programming model. Some differences (please put "to the
best of my knowledge" in front of all of these):

\- Storm does not allow arbitrary state in operators (what Nathan Marz calls
"bolts"). This makes the runtime easier to implement (for example, tuple sends
can be replayed for fault tolerance), but it limits what kinds of applications
one can build. Yes, I'm on board with the idea that we should avoid mutable
state as much as possible, but people who build real applications want it.
Fault tolerance in our system requires more work as a result, so it's a
trade-off.

\- Storm programs are implemented in Java. Streams applications are
implemented in our programming language, which has the rather pedestrian name
Streams Programming Language, but usually just SPL. This may seem minor, but
it's a big deal. Marz is working on a higher level language in Clojure.
Implementing programs in a higher level language enables developers to
abstract away many issues related to high-performance, distributed systems. I
compare it to the difference between writing assembly code and writing C code.
(Or the difference between writing Python code and writing C code.) The code
that we generate is similar in principle to how one writes a Storm
application. Which brings me to...

\- Storm runs on the JVM; we generate C++ code which gets compiled.

Neither Storm nor Streams is the first or only system in this area. Stream
programming is also popular for hardware, but that is usually synchronous, and
any state is shared-memory. Storm and Streams are distributed and
asynchronous. There are academic distributed streaming systems such as
Borealis. The research name for Streams is System S, and there are many
academic papers about it, or that use it as a platform for other research:
[http://dl.acm.org/results.cfm?h=1&cfid=66087472&cfto...](http://dl.acm.org/results.cfm?h=1&cfid=66087472&cftoken=23295126)

And for the record, I am impressed with Storm.

~~~
samstokes
_Storm does not allow arbitrary state in operators (what Nathan Marz calls
"bolts")._

Maybe I misunderstand your point, but Storm does allow state in Bolts - a Bolt
is just a Java object, so it can have member variables. That's how aggregation
(e.g. counting events per user) is done. Of course, if you want a Bolt that
scales horizontally, you need to account for the state being split across
several instances of the Bolt class; and if you need the state to survive a
restart, you need to keep it in an external database instead of in the
object's memory.
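
The "bolt is just an object with member variables" point can be sketched like
this; it's in Python rather than Java, and the `execute` interface is
illustrative, not Storm's actual API:

```python
# A bolt that aggregates per-user event counts in instance memory.
class CountingBolt:
    def __init__(self):
        # State lives in the object; a restart loses it unless it is
        # persisted to an external store, as noted above.
        self.counts = {}

    def execute(self, tup):
        user, _event = tup
        self.counts[user] = self.counts.get(user, 0) + 1

bolt = CountingBolt()
for t in [("u1", "click"), ("u2", "click"), ("u1", "view")]:
    bolt.execute(t)
# bolt.counts == {"u1": 2, "u2": 1}
```

With several parallel instances of the bolt, each instance holds only its
slice of the counts, which is why the grouping discussed below matters.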

~~~
scott_s
Then I stand corrected. Is there any means to declare partitioned state that
the runtime then handles for the user, or does the user always have to manage
it themselves? This is one of those things that is, I think, easier to do
declaratively one level up in abstraction. (You can see examples of this by
looking at invocations of our Aggregate operator in the above documents.)

I had assumed there was no arbitrary state because of the replay semantics.
Let's say bolt A sends tuples to bolt B. B has internal state. A sends tuples
t1, t2, t3 and t4. A receives acknowledgements that t1, t3 and t4 were
processed. So t2 needs to be replayed. But the semantics of what that means
are undefined - B has internal state that already incorporates, for certain, t3
and t4, and maybe t2. (While it's unlikely, you never know where a tuple got
lost.) So replaying t2 is problematic - do you just blindly replay it, and
allow potentially broken semantics? The alternative is to do rollback, which
is quite hairy.

~~~
samstokes
_partitioned state that the runtime then handles for the user_

Yes, when "wiring up" a bolt to a spout or another bolt, you can say that it
should receive tuples grouped by one or more of the tuple fields. (e.g. tuples
[user, num_events] grouped by "user".) Then Storm takes care of the consistent
hashing needed to ensure that each instance of the bolt receives all of the
tuples in a given group (e.g. instance 1 gets all events for user 42, instance
2 gets all events for user 53).
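
A rough sketch of what that field grouping boils down to (illustrative only;
Storm's actual routing code differs, and `route` is a made-up name): hash the
grouping field to pick an instance, so the same user always lands on the same
bolt instance.

```python
import zlib

NUM_INSTANCES = 4

def route(tup, field_index=0):
    # Stable hash of the grouping field -> bolt instance id.
    # zlib.crc32 is used because Python's built-in hash() is salted per run.
    key = str(tup[field_index]).encode()
    return zlib.crc32(key) % NUM_INSTANCES

# Same user, same instance, regardless of the rest of the tuple:
a = route(("user42", 7))
b = route(("user42", 99))
# a == b
```

Because the mapping depends only on the grouping field, each instance's local
state stays consistent for the keys it owns.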

 _I had assumed there was no arbitrary state because of the replay semantics._

Storm itself doesn't implement replay; it relies on the ability of an external
event source (e.g. a message queue) to replay tuples if needed. It just
provides the ability to notify the source of whether replay is needed (by
tracking the tuples as they flow through the topology). So whether or not you
replay tuples is an application choice.

But yes, if you have stateful bolts and replay then you do need to make sure
that processing a given tuple is idempotent.
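
One common way to get that idempotence is to track processed tuple ids in the
bolt's state, so a replayed tuple is a no-op. A minimal sketch (hypothetical
structure, not Storm's API):

```python
# A counter that tolerates replays by remembering which tuple ids it has seen.
class IdempotentCounter:
    def __init__(self):
        self.seen = set()
        self.count = 0

    def execute(self, tuple_id, value):
        if tuple_id in self.seen:
            return  # replayed tuple: already incorporated, skip it
        self.seen.add(tuple_id)
        self.count += value

c = IdempotentCounter()
for tid, v in [(1, 5), (2, 3), (2, 3), (3, 1)]:  # tuple 2 is replayed
    c.execute(tid, v)
# c.count == 9, not 12
```

Without the `seen` check, the replay of tuple 2 would double-count it, which
is exactly the broken-semantics case described upthread.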

------
alexatkeplar
This is an entertaining comparison of 8 different map-reduce languages for
Hadoop, including a flame-tastic take on Cascalog/Clojure:
[http://www.dataspora.com/2011/04/pigs-bees-and-elephants-a-c...](http://www.dataspora.com/2011/04/pigs-bees-and-elephants-a-comparison-of-eight-mapreduce-languages/)

~~~
swannodette
That's not flame-tastic. It's just inaccurate.

~~~
CodeMage
As someone who would like to start experimenting with Hadoop in the near
future, I would appreciate it a lot if you could elaborate on that or point me
in the direction of a better comparison.

~~~
swannodette
_Personally, I dislike LISP odd syntax, the widespread use of side effects in
a functional language and the poor abstraction that lists represent over RAM,
from a performance point of view — indeed LISP variants often add additional
data structures, somehow negating the “LIS” part of the language. In the
specific case of Clojure, the fact that a compiled language is compiled into
an interpreted one, JVM bytecode, combining a slow dev cycle with suboptimal
performance, makes me think Clojure users must be glutton for punishment._

Clojure has sophisticated state management. So much for widespread use of side
effects.

Clojure has high performance data structure implementations tuned for modern
hardware. So much for performance.

Clojure is compiled on the fly at the REPL, and the JVM is one of the fastest
runtimes out there. So much for the slow dev cycle.

You may not like the syntax, but boy is that Hadoop query short and sweet.

~~~
Raphael_Amiard
Yeah, it's almost hilarious how every criticism he aims at LISP is fixed in
Clojure, where immutability is the default and the vector is the data
structure of choice. He also clearly doesn't understand the subtleties of, and
differences between, compilation and interpretation, and how the two can
interleave.

