
Rich Hickey: Reducers - A Library And Model For Collection Processing - swannodette
http://clojure.com/blog/2012/05/08/reducers-a-library-and-model-for-collection-processing.html
======
jwr
I've been using Clojure for a production system for about two years now. It
feels so great to use a language that just keeps handing you better tools as
you go along.

What's nice is that the new tools are not just syntactic sugar, as in so many
other languages. They either address specific performance pain points (non-
rebindable functions and numerics) or introduce new abstractions and tools
(reduce-kv is a nice small example).

I love the fact that each time I read about a new thing coming to Clojure, I
immediately think "well this will fit right into what I'm building, great!".

Clojure strikes a good balance between nice ideas and practicality.

~~~
MatthewPhillips
Clojure is all substance and no glitter. It's all about getting the job done,
which is why it's the most important language development in 20 years.

~~~
jacobolus
The best way to get “all substance and no glitter” is apparently Rich Hickey’s
“hammock driven development”. <http://news.ycombinator.com/item?id=1962051>

“To arrive at the simplest truth, as Newton knew and practiced, requires years
of contemplation. Not activity. Not reasoning. Not calculating. Not busy
behaviour of any kind. Not reading. Not talking. Not making an effort. Not
thinking. Simply bearing in mind what it is one needs to know. And yet those
with the courage to tread this path to real discovery are not only offered
practically no guidance on how to do so, they are actively discouraged and
have to set about it in secret, pretending meanwhile to be diligently engaged
in the frantic diversions and to conform with the deadening personal opinions
which are continually being thrust upon them.” –George Spencer Brown in _The
Laws of Form_, 1969

------
krosaen
The implementation is a nice example of Clojure protocols kicking ass: the
core vector and map implementations needn't know anything about coll-fold,
since the protocol extensions can be added within the reducers library, as
sketched below.

------
snprbob86
Just wondering out loud... I'm interested in how this relates to the existing
implementations in core.

For example, if we went back in time, would all of the core functions have
been implemented this way? Would this be a possible drop-in replacement in the
future? Could future versions of Clojure integrate these ideas more deeply? If
so, what are the backwards compatibility concerns?

~~~
puredanger
> For example, if we went back in time, would all of the core functions have
> been implemented this way?

I don't think so - the existing implementations work on the higher-level
abstraction of _sequences_. Reducers are optimized parallel versions that work
on _collections_. While parallelism is extremely useful in some parts of your
code, there is overhead and I don't think you would want either the overhead
or the restriction of working below the sequence abstraction in the general
case.

I seem to see some of the same choices being made available in the new Java 8
collections and parallel operations work. That is, it is up to the developer
when to "go parallel".

For an entirely different approach, check out Guy Steele's Fortress language
which eagerly executes most things in parallel by default (like all iterations
of a loop, arguments to a function, etc) and you have to tell it not to do
that.

Guy's Strange Loop 2010 talk is an interesting complement to this work:
<http://www.infoq.com/presentations/Thinking-Parallel-Programming>

~~~
chrismcbride
Well besides the implicit parallelism of fold, wouldn't this be generally
useful to reduce intermediate list creation? Or do lazy-seqs already solve
that problem?

~~~
richhickey
Yes, reducers provide that benefit to sequential reduce, independent of the
parallelism of fold.

~~~
snprbob86
So then, what downsides, if any (I'm sure there are), would there be to moving
this reducers model to being the "default"?

~~~
richhickey
When you call a reducer like r/map or r/filter, the result is reducible but
_not_ seqable. So, if you have an incremental algorithm, are calling first
etc, the seq-based fns are a better fit. Also, lazy fns automatically cache
whereas reducers recalc every reduce, until you save the realized result. They
are genuinely complementary in many ways.
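
A small REPL illustration of the difference (a hedged sketch; assumes the
library is required as r):

    (require '[clojure.core.reducers :as r])

    (def xs (r/map inc [1 2 3]))  ; a reducible, not a seq
    (reduce + xs)                 ; => 9, inc is applied here
    (reduce + xs)                 ; => 9, inc is applied again -- no caching
    ;; (first xs) would throw: the result of r/map is not seqable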

~~~
chrismcbride
1) What is the benefit of recalculating on every reduce? Is this so you can
use side-effects?

2) If seq or first is called on a reducible, wouldn't it be easy to just
implicitly realize the reducible into a sequence first?

~~~
jules
1) The benefit is that you don't have to cache the results in a data
structure, which really slows it down. Suppose you map the function (fn [x] (+
x 1)) over a reducible, and then you sum it by reducing it with +. With
reducibles, there is no intermediate allocation, and it will run really fast
especially if the functions can be inlined. Compare this with mapping and then
folding a lazy seq: map builds an intermediate data structure, and reduce
immediately consumes it. (A sketch of this contrast follows after this
comment.)

2) That's possible, but it makes it too easy to write code with abysmal
performance because of (1). The common case is that you call both first and
rest on the reducible. If both turn the reducible into a seq first, then both
will take O(n) time in the best case (might be much worse depending on how the
reducible was built up). Combine that with the fact that most times, you're
going to recurse on the rest, and you've got an O(n^2) algorithm where you
expected O(n), if everything is based on reducibles. Additionally, it's
impossible to take the first or rest of an infinite reducible (well, perhaps
you could do it with exceptions -- in general you can turn a reducible into a
seq with continuations).
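
A minimal sketch of the contrast in (1), assuming clojure.core.reducers is
required as r:

    (require '[clojure.core.reducers :as r])

    ;; lazy-seq version: map allocates an intermediate seq,
    ;; which reduce then walks
    (reduce + (map (fn [x] (+ x 1)) (range 1000000)))

    ;; reducer version: r/map composes (fn [x] (+ x 1)) into the
    ;; reducing step itself, so no intermediate seq is allocated
    (reduce + (r/map (fn [x] (+ x 1)) (range 1000000)))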

------
leif
Rich Hickey proves again that he is among the leaders, if not _the_ leader in
careful, industrial-strength application of excellent theory. Bravo.

------
picardo
Rich will be talking about this library at the next Clojure NYC meetup. If
anyone is in the neighborhood, feel free to drop by.

<http://www.meetup.com/Clojure-NYC/events/56212552/>

~~~
nickik
Will there be a video of this? That would be very awesome.

~~~
tsdh
Oh yes, please! Or at least slides & example code.

------
rickmode
Is there no need to tune the number of threads with an approach like this? Or
is there a general notion of the appropriate number of threads given the
number of CPU cores?

~~~
weavejester
If you're not doing any I/O, the number of threads can be limited to the
number of cores. I believe this is the default for Fork/Join.

~~~
puredanger
Correct. ForkJoinPool defaults to Runtime.getRuntime().availableProcessors()
threads (but can be adjusted). The reducers library
(<https://github.com/clojure/clojure/commit/89e5dce0fdfec4bc09fa956512af08d8b14004f6>)
seems to initialize the pool with the default constructor.
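
For reference, the parallelism a default ForkJoinPool constructor picks can be
checked from the REPL (standard JVM API, nothing reducers-specific):

    ;; the thread count a default ForkJoinPool will target
    (.availableProcessors (Runtime/getRuntime))
    ;; => 8, for example, on an eight-core machine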

------
sanxiyn
Recommended reading: Organizing Functional Code for Parallel Execution; or,
foldl and foldr Considered Slightly Harmful (2009) by Guy L. Steele, Jr.

<http://dl.acm.org/citation.cfm?id=1596551>

------
vorg
> ...producing the same result as the Clojure's seq-based fns. The difference
> is that, reduce being eager, and these reducers fns being out of the seq
> game, there's no per-step allocation overhead, so it's faster. _Laziness is
> great when you need it, but when you don't you shouldn't have to pay for
> it._

Looking forward to trying this out. I've been implementing another language in
Clojure (tho just experimenting for the moment). It's a non-lazy language
(Groovy) so I have _reduce_ all over the place, e.g.

    
    
      (defn multiply [a b]
        (cond (and (number? a) (number? b)) (* a b)
              (and (vector? a) (integer? b)) (reduce conj [] (flatten (repeat b a)))
              (and (string? a) (integer? b)) (reduce str (repeat b a))))
    
      (defn findAll [coll clos]
        (reduce conj [] (filter clos coll)))
    

Hope reducers give a speed boost by removing the laziness I'm not using.
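
For instance, findAll above could become something like this (a hedged sketch;
r/filter folds the predicate into the reduction instead of building a lazy seq
first):

    (require '[clojure.core.reducers :as r])

    (defn findAll [coll clos]
      (into [] (r/filter clos coll)))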

------
minikomi
Forgive my simplified response... Is this a way to look at it: rather than
inside out, the functions compose outside in, and then the reducible is
evaluated as a last step? And that gives the benefit, since there's only one
step for the laziness to "devolve" into?

------
ashish01
Awesome for the linked talk alone

<http://www.infoq.com/presentations/Simple-Made-Easy>

If you haven't seen this just take time to watch the first 15 minutes. Really
worth it.

------
jules
If you use a mapped reducible twice, does that evaluate the function on each
element twice?

~~~
mquander
If you call reduce on it twice, then sure, it would do all the work twice.

~~~
loumf
Jules's point is that the map is done twice, which it is. If you don't want
that, you can reduce into a collection and then reduce the collection twice.
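
Concretely, something like this (an illustrative sketch; expensive-fn and coll
are placeholders):

    (require '[clojure.core.reducers :as r])

    ;; realize the mapped reducible once...
    (def realized (into [] (r/map expensive-fn coll)))

    ;; ...then reduce it as often as you like without re-running expensive-fn
    (reduce + realized)
    (reduce max realized)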

------
surrealize
The blog post compares this to Haskell enumerators/iteratees, but I think a
more direct comparison to something haskelly is to monads.

He says: "The only thing that knows how to apply a function to a collection is
the collection itself." Which is like a monad in the sense that the insides of
a monad are opaque; you can only interact with a monad through the functions
it gives you.

The "map" function from his "reducers" library has type:

fn * reducible -> reducible

(i.e., it takes a function and a reducible and gives you back a reducible)

while monadic "fmap" is a little higher-order and has type parameters, but it
does something analogous:

(t -> u) -> (M t -> M u)

(i.e., take a function from type "t" to type "u", and return a function from
"monad of t" to "monad of u"). It's a little different in that Hickey's
"reducers/map" applies the argument function itself, while monadic fmap gives
you a function that will do that.

Of course, his "reducers" library addresses a bunch of other stuff like
parallelism, which isn't something that monads themselves are concerned with.
I'm just saying that part of the interface to his new collection abstraction
is monad-like.

~~~
Drbble
Everywhere you wrote "monad" you should have written "functor". A monad is a
functor with extra structure relating to how elements are inserted into the
"collection" in the first place. Functors only discuss how functions are
applied to elements already collected.

------
peppertree
This will go nicely with Storm.

~~~
chubot
I thought storm worked on infinite streams? Does this support that?

~~~
loumf
It should. It depends on which reducing function you choose -- the map step
doesn't consume the stream; it just sets up the mapped function to be called
as you reduce. If your reducing function is lazy, then it works on infinite
streams.

~~~
puredanger
This works on collections, which are not infinite. If I read correctly,
sequences (a higher-level abstraction) fall back to their current
implementation.

~~~
puredanger
I believe the existing pmap and preduce functions work in parallel over lazy
streams by chunking work and parcelling it out. Depending on your use case,
this is not necessarily ideal.

------
marshallp
Why not just do relational programming - prolog or sql. The lisp weenies still
don't get it.

~~~
jamii
<https://github.com/clojure/core.logic>

~~~
marshallp
So, basically, a thin sliver of what an SQL RDBMS provides.

~~~
stuarthalloway
Your "thin sliver" idea is intriguing. Maybe we can call slivers "simple
components" and build bigger things out of them.

I wonder if there would be any benefit to that?

~~~
marshallp
If 'components' were that useful for the things RDBMSs are used for, the
thousands of highly trained PhDs at Oracle, Microsoft, and IBM would have
added them to their SQL products.

Of course, this is a RoR crowd, software invented by a business school
grad/game review writer, so it can be an uphill battle explaining this stuff.

~~~
stuarthalloway
For those keeping score at home:

* argumentum ad verecundiam
* ad hominem

------
bjaress
It seems like everything I read about Clojure gives me a reason to use it, but
a slightly stronger reason not to.

I'm happy to see a language putting this approach to collections into its core
libraries and even combining it with ideas about parallel processing of data
structures.

On the other hand, the whole thing is written as if Rich Hickey had an awesome
idea, wrote some awesome code, and is now sharing his awesomeness with us.
It's kind of a lost opportunity to give credit to the people who gave him the
ideas (and maybe the people who helped him write the code, if there were any)
and it's kind of a turn-off.

One good, prior write-up about reducing as a collections interface is:

<http://okmij.org/ftp/papers/LL3-collections-enumerators.txt>

~~~
richhickey
I make no claims to novelty, and the blog post does link to
<http://www.haskell.org/haskellwiki/Enumerator_and_iteratee>, the most similar
work and a clear influence. If more people knew about Iteratee, it would be
worth spending more time talking about the connections and contrasts, but they
don't, and knowledge of it is not prerequisite to understanding reducers. No
one helped me write the code.

~~~
skew
Isn't foldr/build fusion much closer? A collection is represented by a "build"
function that takes the reducer, and list transformers become reducer
transformers. The main differences are that it's applied automatically by the
list library using rewrite rules, so it's not as obvious; the reducer is
supplied as a pair of a "cons" function and an "init" value rather than a
variadic function; and there's no parallel fold.

<http://research.microsoft.com/~simonpj/papers/deforestation-short-cut.ps.Z>
The first paper (yes, that's a .ps.Z - check the date)

<http://www.scs.stanford.edu/11au-cs240h/notes/omgwtfbbq.html> some recent
slides, which also include a bit about a "stream fusion" which isn't yet in
the standard library

<http://darcs.haskell.org/cgi-bin/gitweb.cgi?p=packages/base.git;a=blob_plain;f=GHC/List.lhs;hb=HEAD>
The details from GHC's libraries - it's all in the RULES.

~~~
richhickey
> and there's no parallel fold.

When working up from the bottom it might seem that this is just manual
stream/list fusion. But the parallelism is the key prize. We need to stop
defining map/filter etc in terms of lists/sequences, and, once we do, there is
nothing to fuse, no streams/thunks to avoid allocating etc, because they were
never essential in the first place, just an artifact of the history of FP and
its early emphasis on lists and recursion.

~~~
skew
Yes, foldr/build is almost exactly reducibles, but not foldables.

Iterators do nothing for parallelism either.

