
Ask HN: What is the best functional programming language for data science? - ptwobrussell
I'm currently exploring Haskell, OCaml, and Clojure with respect to not only the core language features themselves, but also with respect to their communities and third-party frameworks for math and machine learning.

I'm mostly interested in the best "general purpose functional programming language for data science" but would also be curious which functional languages have a particularly strong hand within specific domains (e.g. medical, finance, etc.)
======
houshuang
This is a question that also interests me. I've been thrown into a lot of data
sciencey work this year with learning analytics data from MOOCs - so far
mostly cleaning and organizing, graphing etc, but we're moving into more
machine learning, inference etc. I have a poor math background, but am eager to
learn, so while I've done most of the practical work in R and now Python with
Pandas, I am playing with things like Learn Math, Logic and Computer Science
with Haskell (also Think Stats and Think Bayes with Python).

I would love to see a more mature package with IHaskell, easy access to
graphing, and a nice Pandas/data.table like library, and a set of statistics
tutorials written around them, where you are basically learning
statistics/probability at the same time as using the language. (I found this
paper on functional probabilistic programming in Haskell very fascinating -
the idea of using a Monad to wrap distributions, uncertainty levels etc around
numbers seems very powerful
[http://web.engr.oregonstate.edu/~erwig/papers/PFP_JFP06.pdf](http://web.engr.oregonstate.edu/~erwig/papers/PFP_JFP06.pdf))
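
A minimal sketch of that idea, written in Python rather than Haskell so it stays close to the Pandas work mentioned above: a finite distribution with a monadic `bind`. The names `Dist`, `unit`, `uniform`, `bind`, and `prob` are illustrative assumptions, not the API from the paper.

```python
from collections import defaultdict
from fractions import Fraction


class Dist:
    """A finite probability distribution: a mapping from outcome to probability."""

    def __init__(self, probs):
        self.probs = dict(probs)

    @staticmethod
    def unit(value):
        # A distribution that yields `value` with certainty (monadic return).
        return Dist({value: Fraction(1)})

    @staticmethod
    def uniform(values):
        # Every listed value is equally likely; duplicates accumulate weight.
        values = list(values)
        p = Fraction(1, len(values))
        merged = defaultdict(Fraction)
        for v in values:
            merged[v] += p
        return Dist(merged)

    def bind(self, f):
        # Monadic bind: feed each outcome to `f`, which returns a new Dist,
        # and weight its outcomes by the probability of having reached them.
        merged = defaultdict(Fraction)
        for value, p in self.probs.items():
            for value2, p2 in f(value).probs.items():
                merged[value2] += p * p2
        return Dist(merged)

    def prob(self, predicate):
        # Total probability of the outcomes that satisfy `predicate`.
        return sum((p for v, p in self.probs.items() if predicate(v)), Fraction(0))


die = Dist.uniform(range(1, 7))
# Sum of two independent dice, built by chaining bind.
two_dice = die.bind(lambda a: die.bind(lambda b: Dist.unit(a + b)))
print(two_dice.prob(lambda total: total == 7))  # 1/6
```

The point of the wrapper is that ordinary arithmetic on outcomes (`a + b`) stays separate from the bookkeeping of probabilities, which `bind` handles once.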

~~~
ptwobrussell
I've done most of my practical work in Python, and recently picked up Pandas
(and am growing increasingly fond of the ggplot port to Python and some of the
R packages that make ridiculously complicated things as simple as they should
be). Thanks also for the link to the paper.

------
physicsyogi
I don't have any experience with Haskell, for data science or otherwise, but
have been using Clojure a bit for such purposes.

The code I write is mainly for machine learning and natural language
processing with "big data". Some of the libraries that I've found useful are:

1) clojure.core.reducers
2) core.match
3) incanter (lots of statistics-related stuff; comparable to R, perhaps)

Clojure handles XML, JSON, and YAML very nicely as well. Then you have
Cascalog to run map reduce jobs without writing mappers and reducers
explicitly. There's also Marceline, which is to Storm/Trident as Cascalog is
to Hadoop/Cascading.

There are also libraries for serialization, like data.fressian, and for matrix
math, like core.matrix. Prismatic has also open-sourced three really excellent
libraries that are useful for data processing.

------
ss6012
I am currently in the learning phase of data science. I have used R (in which
I am more proficient) and Python extensively for my current data science
projects. I can say that R is more of a functional language than one would
think, but it's certainly not a great general-purpose language. So far I have
used R in a functional style for descriptive analysis and, on top of it,
Python for building data science and analytics applications. Then again, there
is a library named "Shiny" that makes building a web app in R as easy as
writing code for descriptive analysis. I am also interested in getting
hands-on with Haskell and Julia.

------
karstenw
There is a package lambda.r
([http://cran.r-project.org/web/packages/lambda.r/index.html](http://cran.r-project.org/web/packages/lambda.r/index.html))
for the statistical computing environment R that implements functional
programming paradigms. R is not a strict functional language, though. The
package author plans to publish a book on this topic, see
[http://cartesianfaith.com/2013/09/21/preview-of-my-book-mode...](http://cartesianfaith.com/2013/09/21/preview-of-my-book-modeling-data-with-functional-programming-in-r/)

------
tel
I'm a proficient Haskell user with some experience writing Haskell data
science code. I also have experience doing the same in Clojure, though it's
about 2 years out of date. I'll begin with Haskell then compare.

\---

I find that purity is a valuable feature of Haskell but, more so than with
other code, I feel a big divide between current practice data science and pure
functional code.

Haskell has a strong base of financial code which is usually unavailable
publicly, but it does lead to a lot of blog posts and commentary describing
how you can build highly efficient, powerful streaming systems in Haskell
which interact with Excel. This is largely true as laziness tends to put
people in a streaming mindset quite easily. Finally, there's a big push in the
pipes/conduits camps to reify streaming as a first class action which can be
manipulated easily. I'm a big fan of pipes—I think it's completely
unreplicated anywhere else.

Haskell tends to be a memory hog and can produce space leaks if you're not
careful. This will decimate your ability to use it for large data sets, but
it's easy to avoid after you get a little bit of practice in. In particular,
it's worth learning where new laziness is generated (whenever you produce a
lifted type) and making the decision as to whether that's correct or not.
Strict data types and UNPACKing cut down space usage and leakage quite
nicely.

Haskell has an incredibly powerful and fast vector library (called vector,
unsurprisingly) and I encourage you to use it constantly. There are also a number
of other very nice data science foundation libraries like ad, linear, vector-
space, statistics, compensated, and log-domain.
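
To give a flavour of the problems that compensated and log-domain arithmetic address, here is a small Python sketch. It does not use or mirror the Haskell packages named above; `naive_sum`, `kahan_sum`, and `log_sum_exp` are hypothetical helpers showing the standard Kahan summation and log-sum-exp tricks.

```python
import math


def naive_sum(xs):
    # Plain left-to-right floating-point accumulation.
    total = 0.0
    for x in xs:
        total += x
    return total


def kahan_sum(xs):
    # Compensated (Kahan) summation: track the rounding error of each
    # addition and feed it back into the next one.
    total = 0.0
    comp = 0.0
    for x in xs:
        y = x - comp
        t = total + y
        comp = (t - total) - y
        total = t
    return total


def log_sum_exp(log_ps):
    # log(sum(exp(lp))) computed without leaving the log domain, so tiny
    # probabilities do not underflow to zero.
    m = max(log_ps)
    return m + math.log(sum(math.exp(lp - m) for lp in log_ps))


# One large value followed by many values too small to register individually.
values = [1.0] + [1e-16] * 1000
print(naive_sum(values))   # the small terms are lost
print(kahan_sum(values))   # close to 1 + 1e-13

# exp(-1000) underflows to 0.0, but the log-domain sum is still finite.
log_ps = [-1000.0, -1000.0]
print(math.exp(log_ps[0]))
print(log_sum_exp(log_ps))
```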

Haskell's best dense matrix library, hmatrix, is nice but GPL. It also doesn't
interact as nicely as I'd hope with vector. There's also Repa, though that's
more optimized for images and parallel matrix operations like DFT.

Haskell's interactive runtime has a HUGE deficiency in that it erases all
local variables on each code reload. I've been assured that there are
proprietary (financial) REPLs which don't have this deficiency, so perhaps it
could be eliminated if someone wanted to take it as a project.

If you have a GPU to spare then it's really easy (and fun) to push algorithms
on to it using Haskell's Accelerate library.

Generally, static typing is a huge boon, but there's too little broad usage of
Haskell as a platform for data science yet to see how best to use it. HLearn
is a great test bed for a lot of this. I find it really exciting, but probably
a bit too dense to be practical. There's a big hole in the ecosystem where a
data.frame/pandas and ggplot/lattice duo could fit.

\---

Clojure's primary benefits derive from, unsurprisingly, using functional
algorithms atop Java's runtime and library support. I made a Clojure binding
to JBlas a few years ago for my research (clatrix) which wasn't too difficult
to build, but filled a hole in the ecosystem. I also reimplemented a
bunch of basic machine learning algorithms in Clojure for a class and found
that it was difficult (3 years ago) to get good performance out of raw
Clojure, even when using type annotations. I found that dropping down to Java
types wasn't so painful, but felt incredibly non-native. I'd suffer massive
performance problems to just not have to do it. Clatrix helped to solve that a
bit and it's been developed much further due to core.matrix, though I've not
used core.matrix in anger.

Generally when coding in Clojure I miss static types (though I've not yet used
core.typed) which is entirely personal, so YMMV. I find them to be very, very
key in statistical code, though, since so many error conditions just lead to
difficult-to-interpret, yet totally false, results. I want my errors to come
from bad tuning, not uncaught type mangling.

I also did a fairly large amount of parallel processing in Clojure using map-
reduce implemented atop the actor model. It worked pretty well and distributed
over a few hundred machines, yet was never convenient enough to replace
manually launching the jobs and collecting the results by hand at the end.
After getting some experience with Erlang/OTP I think I could have done
better, but it still showed how nice this kind of thing is to do in Clojure.

\---

Generally, I find static types to be a huge boon for statistical programming
as noted above. It's a tragic thing when you lose sensitivity to surprising
results due to general mistrust of your own code. Static types make me rarely
mistrust the correctness of my code (and libraries like quickcheck and simple-
check help to cover the remaining uncertainty!).
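
The kind of check that QuickCheck automates can be imitated by hand. Below is a rough Python sketch, not QuickCheck's API: `check_property`, `random_floats`, and the two properties are hypothetical helpers, and there is no shrinking of counterexamples.

```python
import random


def mean(xs):
    return sum(xs) / len(xs)


def check_property(prop, gen, trials=200, seed=0):
    """Run `prop` on `trials` random inputs; return a counterexample or None."""
    rng = random.Random(seed)
    for _ in range(trials):
        case = gen(rng)
        if not prop(case):
            return case
    return None


def random_floats(rng):
    # Generator: a non-empty list of floats of modest magnitude.
    n = rng.randint(1, 50)
    return [rng.uniform(-1000.0, 1000.0) for _ in range(n)]


def mean_within_bounds(xs):
    # Property: the mean lies between min and max (with a small tolerance).
    eps = 1e-6
    return min(xs) - eps <= mean(xs) <= max(xs) + eps


def mean_shift(xs):
    # Property: shifting every value by c shifts the mean by c.
    c = 10.0
    return abs(mean([x + c for x in xs]) - (mean(xs) + c)) < 1e-6


# Each call prints None if no counterexample was found.
print(check_property(mean_within_bounds, random_floats))
print(check_property(mean_shift, random_floats))
```

Properties like these catch the statistical bugs that produce plausible but wrong numbers, which is exactly the class of error that is hard to spot by eye.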

Haskell I feel is faster broadly... except when it's not. Space leaks and
excess laziness will destroy performance, but I still vastly prefer
programming in a lazy-by-default environment because it leads to better
composability and reuse. It also provided a focus on streaming algorithms that
I use frequently. Clojure's reducers are nice but don't even approach the
power and sophistication of Haskell's pipes.
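
As a loose illustration of what a streaming algorithm buys you (in Python, and not a reproduction of pipes or reducers): Welford's online method computes mean and variance as a fold over a stream while holding only three numbers. The choice of Welford's method and the `readings` generator are assumptions made for the example.

```python
from functools import reduce


def welford_step(state, x):
    # One step of Welford's online algorithm: update (count, mean, M2).
    count, mean, m2 = state
    count += 1
    delta = x - mean
    mean += delta / count
    m2 += delta * (x - mean)
    return (count, mean, m2)


def running_stats(stream):
    # Fold the step function over any iterable; memory use stays constant.
    count, mean, m2 = reduce(welford_step, stream, (0, 0.0, 0.0))
    variance = m2 / (count - 1) if count > 1 else float("nan")
    return count, mean, variance


def readings():
    # Hypothetical data source: a generator yields values one at a time,
    # so the full data set never has to sit in memory.
    for i in range(1, 100_001):
        yield float(i % 10)


count, mean, variance = running_stats(readings())
print(count, mean, variance)
```

Because `running_stats` only needs an iterable, the same fold works on a file, a socket, or a generator without change.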

Clojure has better "obvious IO" library support in that its dynamic code
requires less ceremony to drag down an online corpus. I've written a parallel
website scraper in Haskell, though, and I feel that the concurrent programming
would have been significantly more difficult in Clojure. Both have STM, but
Haskell's STM is better.

Haskell has better general libraries, though, due to the library reuse made
available by laziness and static typing. They can take a little effort to
learn, but then become massively powerful with ease. Haskell also has the
wonderful Diagrams library for building some kinds of charts, but it's more a
substrate than an answer.

Haskell's vector data type is wonderful, and in both languages you can drop
down to impure chunks of memory if your algorithm or performance needs are a
fit for that. All you pay is expressiveness. Here again static types are a win
as they can enforce impure regions and make sure that those regions don't mix
and don't take over your program.

If I were to make my home in one of those languages for some serious data
science, I'd do it in Haskell. It's still rough around the edges, but I feel
there's a better substrate for building more sophisticated things atop it.
Clojure may be able to solve your particular problem more quickly, but my
experience is that quick things written in Clojure don't pay out over as long
a period as quick things written in Haskell. Further, I think the comparative
effort needed to build long-lasting libraries and tools is lower in Haskell.

If I were to just do a quick data science problem, I'd probably use R.

I'd also use Haskell for data science much more if it had a better REPL.
IHaskell (a Haskell core in an IPython notebook) might become that needed REPL
at some point.

~~~
ptwobrussell
Wow, thanks for the incredibly thoughtful answer. That's a lot of useful
experience to digest. I too am optimistic about IHaskell and where that all
heads, especially in an IPython Notebook style UX with inline charts and such.

------
manholo
If you are curious, check also Faust, used for DSP algorithms, and Pure, for
anything.

[http://en.wikipedia.org/wiki/FAUST_(programming_language)](http://en.wikipedia.org/wiki/FAUST_\(programming_language\))
[http://en.wikipedia.org/wiki/Pure_(programming_language)](http://en.wikipedia.org/wiki/Pure_\(programming_language\))

------
tonyabell
F#. It's a functional-first, open-source language that works on Mac/Windows.

Interops with R, Python, MATLAB, Mathematica, Java

Give it a try here: [http://www.tryfsharp.org/Learn/data-science](http://www.tryfsharp.org/Learn/data-science)

Here are a bunch of resources: [http://fsharp.org/data-science/](http://fsharp.org/data-science/)

------
svenkatesh
Python isn't a "pure" functional language, but it is general purpose, and it
does accommodate functional programming (although it isn't strict).

You get the additional benefit of having well-developed numerical and graphing
libraries (scipy, numpy, matplotlib, etc.)

~~~
ptwobrussell
At the moment, Python is the language I am most proficient with and my go-to
language. I definitely appreciate the functional aspects that it has
incorporated (list/dictionary comprehensions, functions like map, zip, reduce,
and anonymous functions with lambda.) In 2014, however, I'm planning to gain
some proficiency with something close to a Haskell, OCaml, or Lisp dialect.
Hoping to hear back from someone who has done some "heavy lifting" with one of
those...
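
For concreteness, the Python features listed above look roughly like this. The data and variable names are made up for illustration; note that in Python 3 `reduce` lives in `functools`.

```python
from functools import reduce

names = ["ada", "grace", "alan"]
scores = [91, 78, 85]

# zip pairs the two lists; a dict comprehension builds a lookup table.
by_name = {name: score for name, score in zip(names, scores)}

# map with an anonymous function (lambda) transforms each element.
curved = list(map(lambda s: min(100, s + 5), scores))

# A list comprehension filters and transforms in one expression.
passed = [name for name, score in by_name.items() if score >= 80]

# reduce folds a sequence into a single value; here, the total.
total = reduce(lambda acc, s: acc + s, scores, 0)

print(by_name)  # {'ada': 91, 'grace': 78, 'alan': 85}
print(curved)   # [96, 83, 90]
print(passed)   # ['ada', 'alan']
print(total)    # 254
```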

~~~
carreau
Hopefully you can try Haskell without leaving your favorite environment
(IPython notebook), since there is now a Haskell kernel
([https://github.com/gibiansky/IHaskell](https://github.com/gibiansky/IHaskell))

~~~
ptwobrussell
Nice! Didn't know about that yet!

------
pseut
SQL :)

More seriously, R is more of a functional language than one would think, but
it's certainly not a great general purpose language.

~~~
lottin
I agree, R is more functional than Python. In fact R looks a lot like Lisp,
except for the fact that syntactically it is not based on parenthesized lists.

~~~
charlescearl
In terms of functional languages for data science, surprised that Julia
([http://julialang.org/](http://julialang.org/)) was not mentioned.

In terms of the R discussion, "Evaluating the Design of the R Language" seems
to put the functional aspects in relief
([http://r.cs.purdue.edu/pub/ecoop12.pdf](http://r.cs.purdue.edu/pub/ecoop12.pdf))

In terms of where things are headed, I just came across Spivak and Wisnesky's
work on Functional Query Language (FQL),
[http://wisnesky.net/fql.html](http://wisnesky.net/fql.html). The introductory
slides
[http://www.categoricaldata.net/doc/introSlides.pdf](http://www.categoricaldata.net/doc/introSlides.pdf)
deserve to be read seriously.

