

Saddle: Scala Data Library - aklein
http://saddle.github.com/
Saddle is a high-performance data manipulation library for Scala.
======
wheaties
As one of the colleagues of the author of this library, I can give my
semi-biased opinion. To be honest, having worked in a previous life with
Numpy+SciPy, the appearance of Saddle in our tech stack made reasoning about
complex numerical code easier. I'd suggest using it not just for its
performance (quite impressive for a JVM-based library) but more for its clear
API. Expressive code (clean code) is debugged faster and maintained with less
overhead. This library will let your code be as expressive as a numeric
library can be without sacrificing some of the nicer language features you've
come to rely upon (map, flatMap, etc.).

------
saintx
I was sort of sad to learn earlier this year that the Scalala project had
become inactive, and when a friend pointed me at Breeze, the first thing that
concerned me was that it seemed to "do ALL the things!", rolling in a bunch of
other functionality along with a Scalala revamp. What I really wanted was an
elegant, fast, well-written numerical computing library in Scala, and this
seems to be it. This is great. Now all we need is to be able to tell this to
use GPU hardware acceleration under the hood for things like FFTs and we're
set!

~~~
aklein
I'd love to explore GPU accelerated solutions. I need more hours in the day...

~~~
saintx
The more I look into this, the better it looks. In particular, I appreciate
that you used good existing solutions on the backend (EJML, Apache Commons
Math, etc.) where appropriate.

------
jfim
It took me a while to realize there were implicit conversions in the companion
objects that are necessary in order to get useful functionality out of the
data structures.

It might be worth adding an example to make it a bit more explicit in the
documentation, such as:

    
    
      import org.saddle.Vec._
      Vec(1,2,3).median // Returns 2
    

Other than that, it looks pretty cool. I'll go use it right now. :)

Edit: Formatting.

~~~
aklein
Edit: you want to

    
    
      import org.saddle._
    

to get all the implicit goodness. I'll add a note.

------
JPKab
I love pandas, and I think this is going to be great.

Is there something like this for Clojure? I guess I'll have to pick up Scala
too. Coursera, here I come.

~~~
draegtun
> _Is there something like this for Clojure?_

Probably Incanter which uses the Parallel Colt Java library -
<http://incanter.org/> |
[https://sites.google.com/site/piotrwendykier/software/parall...](https://sites.google.com/site/piotrwendykier/software/parallelcolt)

------
pathdependent
Thank you!

Most of my colleagues do data analysis in Python given Numpy+SciPy. I like
Python, but if possible, I'd rather do as much of my development in a single
language, and I prefer Scala.

This library certainly does not replicate the extensive functionality offered
in Python for data analysis, but _it does have the potential to seed Scala
development_. I for one will be perusing the code this weekend, and picking an
avenue for subsequent exploration.

~~~
aklein
Cool, I welcome the feedback!

------
wiradikusuma
How does it differ from <https://github.com/scalanlp/breeze>?

~~~
aklein
Breeze is more targeted at NLP and machine learning. Saddle draws heavily on
the design of pandas (a Python library) to provide data structures enabling
"alignment-free programming". Saddle outsources nearly all its linear algebra
and numerics capabilities.

------
joshklein
Congrats on the release. I can think of at least one big organization I've
talked to that was chomping at the bit to try pandas but had too much of an
existing commitment to Scala to take the Python plunge. [Disclaimer: brother
of OP]

------
achompas
Congrats, Adam!

Do we have performance information yet, even on some basic, common use cases?

Also, the docs mention EJML as the backend for Saddle's data structures--do
you have any thoughts on using EJML?

~~~
aklein
Thanks! I will do some follow-up posts on performance, but know that it has
been a MAJOR design consideration.

Consider the following in Saddle:

    
    
      val s1 = Series(vec.rand(10000), Index(Vec(array.randIntPos(10000)) % 100))
      val s2 = Series(vec.rand(10000), Index(Vec(array.randIntPos(10000)) % 100))
      clock { s1.join(s2, how=index.OuterJoin) }
    

This clocks in at 19ms on my machine after Hotspot kicks in.

The equivalent pandas:

    
    
      In [10]: ix1 = np.random.random_integers(0, 100, 10000)
      In [11]: ix2 = np.random.random_integers(0, 100, 10000)
      In [12]: df1 = DataFrame({'x' : np.random.rand(10000)}, ix1)
      In [13]: df2 = DataFrame({'y' : np.random.rand(10000)}, ix2)
      In [14]: %timeit df1.join(df2, how='outer')
      10 loops, best of 3: 37.7 ms per loop
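
For anyone wondering why this join is a reasonable stress test: both indexes
hold only about 100 distinct keys spread over 10,000 rows, so an outer join
has to emit every left/right pairing for each key, which works out to roughly
a million output rows. Here is a minimal pure-Python sketch of that semantics.
It is only an illustration, not how Saddle or pandas implement joins; the name
`outer_join` is made up, and the row order is not meant to match either
library.

```python
from collections import defaultdict


def outer_join(left, right):
    """Outer-join two lists of (key, value) pairs on key.

    Hypothetical helper for illustration only. Duplicate keys yield every
    left/right pairing; keys found on one side only are padded with None.
    """
    # Group the right-hand values by key so each left row can find its matches.
    right_by_key = defaultdict(list)
    for key, value in right:
        right_by_key[key].append(value)

    rows = []
    left_keys = set()
    for key, lval in left:
        left_keys.add(key)
        matches = right_by_key.get(key)
        if matches:
            # One output row per matching right-hand value.
            rows.extend((key, lval, rval) for rval in matches)
        else:
            rows.append((key, lval, None))

    # Keys that appear only on the right still show up in an outer join.
    for key, rval in right:
        if key not in left_keys:
            rows.append((key, None, rval))
    return rows
```

With two rows for a key on each side, the result has four rows for that key,
which is where the blow-up in the benchmark comes from.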

~~~
aklein
PS: Regarding EJML, after extensive research, I found it hands down the
fastest pure-Java implementation for doing linear algebra.

While it's maybe 2x-4x slower than JNI wrapped ATLAS or MKL, for the cases I
deal with, it just doesn't matter vs ease of use.

That said, it's LGPL, so I made it easy to swap out for other matrix libraries
if you need.

------
endofunctor
Great news! Are you planning to do any integration with Erik's Spire
(<https://github.com/non/spire>)? I believe some libraries have already started
collaborating with it (<https://github.com/twitter/algebird/issues/99> and
[https://github.com/typelevel/scalaz-contrib/tree/master/spir...](https://github.com/typelevel/scalaz-contrib/tree/master/spire)).

~~~
aklein
I'm definitely interested in exploring Spire. I believe the functionality is
almost entirely orthogonal.

~~~
endofunctor
Awesome, looking forward to using Saddle in my next project!

------
wandermatt
I didn't see sparse vector support. Assuming I didn't just overlook it, is it
on the roadmap?

~~~
aklein
Depends what you mean by sparse vector support. Maybe what you're interested
in is best served by Series:

    
    
      val s = Series(Vec(1,2,3), Index(0,5,10))
    

This gives you

    
    
      s: org.saddle.Series[Int, Int] = 
      [3 x 1]
      0  -> 1
      5  -> 2
      10 -> 3
    

Then, for instance,

    
    
      s(5,10)
      res0: org.saddle.Series[Int,Int] =
      [2 x 1]
      5  -> 2
      10 -> 3

~~~
MLnick
For me, at least, sparse vector support means you can do elementwise
operations (on the non-sparse elements) and, in particular, linear algebra such
as vector dot products and matrix-vector multiplication.
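
To make that concrete, here is a rough sketch of the operations I mean, in
plain Python with sparse vectors stored as `{index: value}` dicts. The
function names are made up for illustration and say nothing about what Saddle
offers or how a real sparse library lays out its data.

```python
def sparse_dot(u, v):
    """Dot product of two sparse vectors stored as {index: value} dicts."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the vector with fewer stored entries
    return sum(val * v[i] for i, val in u.items() if i in v)


def sparse_add(u, v):
    """Elementwise sum; missing entries are treated as zero."""
    out = dict(u)
    for i, val in v.items():
        out[i] = out.get(i, 0) + val
    return out


def sparse_matvec(rows, v):
    """Multiply a sparse matrix, given as {row_index: sparse_row}, by v.

    Rows whose product with v is zero are left out of the result.
    """
    out = {}
    for r, row in rows.items():
        d = sparse_dot(row, v)
        if d != 0:
            out[r] = d
    return out
```

Using the same indices as the Series example above, `u = {0: 1, 5: 2, 10: 3}`
and `v = {5: 4, 7: 1}` overlap only at index 5, so only that entry contributes
to the dot product.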

------
bennylak
great work! BIG like!

------
shawnalaken
lightning fast!

