
Anatomy of a Reducer - fogus
http://clojure.com/blog/2012/05/15/anatomy-of-reducer.html
======
drcode
Someone should really write a blog post on what this all means for a novice.
My impression is (and I might be wrong, since I haven't had the time to study
the new reducer library yet) that the point of this library is the following:

Many algorithms are tree-like in nature. Suppose, for instance, you have a
path-finding algorithm that looks at a map to find the best path for
something. In that case, you're going to need to search different path options
iteratively, leading to a nested algorithm.

Now, this type of algorithm is tricky to parallelize: Sometimes the root of
your "search tree" might be small (let's say 3 items) other times it might be
large (1000 items). If you know your computer has 8 cores, it's very hard to
mete out an equal amount of work to each core: sometimes several cores need
to split the work of a single branch. Other times, a single core might need to
handle multiple branches. Also, it may be hard to know ahead of time how much
work each branch will require, making the "parcelling out" of work even more
difficult.

However, the work for each branch can be expressed as a collection of smaller
"jobs", with a function applied to each. And each item in this collection
would then also be a collection with functions applied to it (representing
each sub-branch). If you express a nested algorithm in this way, it can be
viewed as a nested collection with different functions mapped, filtered,
etc. across different parts of it.

If you use this new reducer library to implement this type of nested
operation, Clojure can automagically use the Java Fork/Join framework to
parcel out equal amounts of work to each core. In effect, you can think of
the reducer library as letting you "flatten" these nested collections
(without the memory & CPU cost this would normally entail) to figure out how
to split the work evenly and in the most efficient manner possible, hiding
all the ugliness required for this from the programmer.
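
For the simple flat case (a plain vector rather than a search tree), my
understanding is that the switch looks roughly like this, using the
clojure.core.reducers namespace from the article (the vector v is just a
throwaway example):

    (require '[clojure.core.reducers :as r])

    (def v (vec (range 1000000)))

    ;; Sequential: one thread walks the collection element by element.
    (reduce + (map inc (filter even? v)))

    ;; Parallel: r/fold splits the vector into chunks, reduces each chunk
    ;; on a fork/join worker, then combines the partial sums with +.
    (r/fold + (r/map inc (r/filter even? v)))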

Please, if you know this library better than I, let me know where I'm going
wrong.

~~~
lmkg
I'm not an expert, but here's my take on it: The point of this library is to
decouple things that are usually coupled but don't need to be, and provide a
uniform interface for combining them back together. The hardest part of this
is looking at things from a slightly different perspective and getting used to
the new division of responsibilities, but once you do, the overall picture is
actually simpler.

The idea is that you have four types of things: you have items, you have
collections that hold items, you have the machinery that knows how to walk
(reduce) a collection, and you have the operations you want to perform on its
contents. The library deals with separating the latter two, since separating
elements from a container is basically a solved problem.

A "reducer" is a view into a collection, which is responsible for knowing how
to apply collection operations (map, filter, reduce) to its elements.
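
Concretely, something like this sketch (the names are arbitrary, and r is the
clojure.core.reducers namespace):

    (require '[clojure.core.reducers :as r])

    ;; No work happens here -- this is just a view that remembers how to
    ;; apply inc to each element when something eventually reduces it.
    (def view (r/map inc [1 2 3]))

    ;; The work happens only when the view is reduced:
    (r/reduce + view)   ;=> 9
    (into [] view)      ;=> [2 3 4]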

The collection operations (usually map, filter, and reduce) are separate,
called mappings and filterings in this article. These are real entities, and
they are composable. I think of them as similar to defining a database query
but not executing it.
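
The curried forms in the library make the query analogy pretty literal; a
sketch (with r required as above, names arbitrary):

    ;; Compose a filtering and a mapping without executing anything.
    (def query (comp (r/filter even?) (r/map inc)))

    ;; Execution happens only when the query is applied to a source and
    ;; the result is reduced.
    (r/reduce + (query (vec (range 10))))   ;=> 30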

Because reducers and operations are decoupled, you can always take an
arbitrary "query" and throw it at an arbitrary reducer, and expect the
collection to be able to handle the query and return the same result
regardless of collection type.

Fork/join is an example use-case: you can create a collection type which is
just a thin wrapper over fork/join[1]. You can then wrap this collection type
as a view into your data, and run collection queries against it. This allows
you to use fork/join as a drop-in replacement for naive data structures. It
also keeps tinkering with fork/join separate from both the underlying data
structure and the work being processed.
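
Roughly, that separation might look like this (a sketch with arbitrary names;
the chunk size and combining function are the fork/join knobs, kept apart from
both the data and the per-element work):

    (require '[clojure.core.reducers :as r])

    (def data (vec (range 1000000)))   ; the underlying data structure
    (def work (r/map #(* % %) data))   ; the work to be done, still unexecuted

    ;; Only this line knows about fork/join: chunks of 8192 elements,
    ;; + to combine chunk results, + to reduce within a chunk.
    (r/fold 8192 + + work)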

The other primary use case seems to be defining, storing, and composing
queries without running them. This ties in heavily with lazy evaluation. The
composability is a key feature because it allows you to pass around partial
queries rather than partial results, and saves greatly on time and space by
not creating intermediate data structures (cf. stream fusion in Haskell).
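
For example (a sketch, arbitrary names, r as above):

    ;; Partial queries are plain values: they can be passed around,
    ;; composed, and extended before anything runs.
    (def only-evens (r/filter even?))
    (def squared    (r/map #(* % %)))
    (def query      (comp squared only-evens))   ; filter first, then square

    ;; One pass over the data, no intermediate collections.
    (r/fold + (query (vec (range 1000))))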

[1] Fork/join would have to be a "foldable" rather than "reducible"
collection, which as far as I can tell just means it's a little more lax with
how you combine results back from a reduce. These relaxations allow for
parallelization.

~~~
pron
Fork-join is not a collection (or any data structure for that matter). It is a
parallel computation process meant to make good use of all available
processors. You can use fork-join to process a collection like an array or a
tree in parallel, and you can use it to factor a single integer in parallel.

------
pron
If I understand correctly, does this mean that a map operation (that is not
reduced further; i.e. we want a collection as the output) could be carried out
in parallel efficiently using reducers only if there is an efficient (O(1))
way to concatenate the collections produced at each combining step (in the
"internal nodes" of the fork-join computation tree)?

This would constrain how collections are implemented to make good use of this
technique.
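
Looking at the library, r/foldcat and r/cat seem aimed at exactly this:
instead of concatenating the source collection type at each combining step,
fold pours each chunk into a buffer and stitches the buffers together with a
cheap "cat" node, converting to a concrete collection only at the end if you
need one. A sketch:

    (require '[clojure.core.reducers :as r])

    (def v (vec (range 1000000)))

    ;; foldcat is (r/fold r/cat r/append! coll): chunks are built up with
    ;; append!, and combined with cat, whose concatenation is cheap.
    (def result (r/foldcat (r/map inc v)))

    (count result)     ; counted
    (into [] result)   ; realize as a vector only if you actually need one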

~~~
weavejester
As I understand it, all of Clojure's built-in data structures (save for the
list, which can be substituted by a lazy seq) can already be combined
efficiently.

~~~
pron
True. I'm just wondering if this is indeed a requirement.

