

Babbage: A Clojure library for accumulation and graph computation - ithayer
https://github.com/ReadyForZero/babbage

======
kenko
here's the announcement email (very boiled down version of the readme,
essentially):

babbage is a library for easily gathering data and computing summary measures
in a declarative way.

The summary measure functionality allows you to compute multiple measures over
arbitrary partitions of your input data simultaneously and in a single pass.
You just say what you want to compute:

    
    
        > (def my-fields {:y (stats :y count)
                          :x (stats :x count)
                          :both (stats #(+ (or (:x %) 0) (or (:y %) 0)) count sum mean)})
    

and the sets that are of interest:

    
    
        > (def my-sets (-> (sets {:has-y #(contains? % :y)})
                           (complement :has-y))) ;; could also take intersections, unions
    

And then run it with some data:

    
    
        > (calculate my-sets my-fields [{:x 1 :y 2} {:x 10} {:x 4 :y 3} {:x 5}])
        {:not-has-y
         {:y {:count 0}, :x {:count 2}, :both {:mean 7.5, :sum 15, :count 2}},
         :has-y
         {:y {:count 2}, :x {:count 2}, :both {:mean 5.0, :sum 10, :count 2}},
         :all
         {:y {:count 2}, :x {:count 4}, :both {:mean 6.25, :sum 25, :count 4}}}
    

The functions :x, :y, and #(+ (or (:x %) 0) (or (:y %) 0)) defined in the
fields map are called once per input element no matter how many sets the
element contributes to. The function #(contains? % :y) is also called once per
input element, no matter how many unions, intersections, complements, etc. the
set :has-y contributes to.
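The single-pass bookkeeping can be sketched in plain Python (hypothetical names, not babbage's API, and without complements/unions): evaluate each field function and each set predicate once per element, then fold the values into every matching partition.

```python
def calculate(set_preds, field_fns, data):
    """One pass over data; each field fn / set predicate runs once per element."""
    names = ["all"] + list(set_preds)
    acc = {n: {f: {"count": 0, "sum": 0} for f in field_fns} for n in names}
    for x in data:
        vals = {f: fn(x) for f, fn in field_fns.items()}        # once per field
        member = {s: pred(x) for s, pred in set_preds.items()}  # once per set
        for n in names:
            if n != "all" and not member[n]:
                continue
            for f, v in vals.items():
                if v is not None:          # skip missing values, like nil
                    acc[n][f]["count"] += 1
                    acc[n][f]["sum"] += v
    return acc

acc = calculate({"has-y": lambda d: "y" in d},
                {"x": lambda d: d.get("x"), "y": lambda d: d.get("y")},
                [{"x": 1, "y": 2}, {"x": 10}, {"x": 4, "y": 3}, {"x": 5}])
# acc["all"]["x"] -> {"count": 4, "sum": 20}
```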

A variety of measure functions, and structured means of combining them, are
supplied; it's also easy to define additional measures.

babbage also supplies a method for running computations structured as
dependency graphs; this can make gathering the initial data for summarizing
simpler to express. To give an example that's probably familiar from another
context:

    
    
        > (defgraphfn sum [xs]
            (apply + xs))
        > (defgraphfn sum-squared [xs]
            (sum (map #(* % %) xs)))
        > (defgraphfn count-input :count [xs]
            (count xs))
        > (defgraphfn mean [count sum]
            (double (/ sum count)))
        > (defgraphfn mean2 [count sum-squared]
            (double (/ sum-squared count)))
        > (defgraphfn variance [mean mean2]
            (- mean2 (* mean mean)))
        > (run-graph {:xs [1 2 3 4]} sum variance sum-squared count-input mean mean2)
        {:sum 10
         :count 4
         :sum-squared 30
         :mean 2.5
         :variance 1.25
         :mean2 7.5
         :xs [1 2 3 4]}
    

Options are provided for parallel, sequential, and lazy computation of the
elements of the result map, and for resolving the dependency graph in advance
of running the computation for a given input, either at runtime or at compile
time.
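The dependency-resolution idea can be sketched generically (a toy in Python, not babbage's implementation, which also handles the parallel/lazy/compile-time options above): treat each function's parameter names as its dependencies and its own name as the key it produces, then run whatever becomes ready.

```python
import inspect

def run_graph(seed, *fns):
    """Sketch: a function's parameter names are its dependencies; its own
    name is the result key it produces. Run ready functions until done."""
    results = dict(seed)
    pending = {fn.__name__: fn for fn in fns}
    while pending:
        ready = [k for k, fn in pending.items()
                 if all(p in results for p in inspect.signature(fn).parameters)]
        if not ready:
            raise ValueError("cyclic or unsatisfiable dependencies")
        for k in ready:
            fn = pending.pop(k)
            results[k] = fn(*(results[p] for p in inspect.signature(fn).parameters))
    return results

# hypothetical graph functions, in the spirit of the defgraphfn example
def total(xs): return sum(xs)
def n(xs): return len(xs)
def mean(total, n): return total / n

out = run_graph({"xs": [1, 2, 3, 4]}, mean, total, n)
# out -> {"xs": [1, 2, 3, 4], "total": 10, "n": 4, "mean": 2.5}
```

Note that defgraphfn also lets the produced key differ from the function name (count-input :count above); this sketch skips that.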

~~~
defrost
Cutting to the chase: does this make the summary results available in the
midst of the sequence? E.g., if it takes two hours to gather pressure data (or
any other time-series data), does this expose the running variance 10 minutes
in, an hour in, etc.?

~~~
kenko
Not currently, but it would certainly be possible to add something _like_
that: exposing the running stats for partial subsequences of the input would
mostly be a matter of replacing the "reduce" in the definition of calculate
with "reductions" (plus at least one other change of similar complexity). That
wouldn't give you answers at ten minutes, sixty minutes, etc. into the data
gathering, because it isn't tied to how long the actual computation of the
elements of the input seq takes (that's outside calculate's purview at the
moment), but it would start delivering running answers right away.
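The reduce-vs-reductions distinction has a direct Python analogue in itertools.accumulate: instead of only the final fold result, you get every intermediate accumulator. A sketch with a made-up running summary of the :x field:

```python
from itertools import accumulate

data = [{"x": 1, "y": 2}, {"x": 10}, {"x": 4, "y": 3}, {"x": 5}]

def step(acc, d):
    # fold one element into a running count/sum of the "x" field
    return {"count": acc["count"] + 1, "sum": acc["sum"] + d.get("x", 0)}

# like (reductions step init data): every intermediate accumulator, lazily
running = list(accumulate(data, step, initial={"count": 0, "sum": 0}))
# running[i] is the summary after the first i elements
```

Because accumulate is lazy, consuming it while the input sequence is still being produced yields the "running answers right away" behavior described here.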

~~~
defrost
Running answers right away are fairly useful. A challenge in that problem
domain is multichannel sensors ("cameras" with multiple frequency bands,
satellites like MODIS, radiometric spectrometers, etc.), where the sharpest
"image" is produced by using an SVD (singular value decomposition) type
transform to reduce (say) 256 input channels to (say) 6 major dimensions and
using those to recreate an enhanced image. Producing branchless code to
generate basic running stats (min, mean, max, variance, trends) on multiple
input channels is a bit of a puzzle; generating an efficient rolling SVD
enhancement (best image based on the most recent observations) is trickier
still.
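For the running mean/variance part specifically, the standard single-pass tool is Welford's online algorithm (nothing babbage-specific): constant memory, no stored history, and no data-dependent branches in the update itself, so one such accumulator per channel covers the multichannel case. A sketch:

```python
def welford():
    """Welford's online algorithm: running mean and population variance."""
    n, mean, m2 = 0, 0.0, 0.0
    def update(x):
        nonlocal n, mean, m2
        n += 1
        delta = x - mean
        mean += delta / n            # incremental mean update
        m2 += delta * (x - mean)     # sum of squared deviations so far
        return mean, m2 / n          # (running mean, population variance)
    return update

upd = welford()
for x in [1, 2, 3, 4]:
    stats = upd(x)
# stats -> (2.5, 1.25), matching the run-graph example above
```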

The application areas are continuous processing of continuously arriving data,
infinite unbounded sequences.

------
eschulte
I actually wrote something similar in bash, which I use frequently when I need
to munge a table of numbers on the command line [1]. The whole time I was
thinking I should really be doing this in Common Lisp.

[1] <http://eschulte.github.com/data-wrapper/>

------
yayitswei
This will be great for building our stats dashboard. Thanks!

~~~
innovate
We use this actively internally @ReadyForZero for a variety of different
analyses; hopefully it's helpful for you and others.

------
furqanrydhan
This is great!

