Ask HN: What is the best functional programming language for data science?
15 points by ptwobrussell 423 days ago | comments
I'm currently exploring Haskell, OCaml, and Clojure with respect to not only the core language features themselves, but also with respect to their communities and third-party frameworks for math and machine learning.

I'm mostly interested in the best "general purpose functional programming language for data science" but would also be curious which functional languages have a particularly strong hand within specific domains (e.g. medical, finance, etc.)




This is a question that also interests me. I've been thrown into a lot of data-sciencey work this year with learning analytics data from MOOCs. So far it has been mostly cleaning, organizing, and graphing, but we're moving into more machine learning, inference, etc. I have a poor math background but am eager to learn, so while I've done most of the practical work in R and now Python with Pandas, I am also playing with things like Learn Math, Logic and Computer Science with Haskell (as well as Think Stats and Think Bayes with Python).

I would love to see a more mature package built around IHaskell, with easy access to graphing, a nice Pandas/data.table-like library, and a set of statistics tutorials written around them, so that you are basically learning statistics/probability at the same time as you learn the language. (I found this paper on functional probabilistic programming in Haskell fascinating. The idea of using a monad to wrap distributions, uncertainty levels, etc. around numbers seems very powerful: http://web.engr.oregonstate.edu/~erwig/papers/PFP_JFP06.pdf)
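To give a feel for the "monad wrapping a distribution" idea, here is a minimal sketch in Python rather than Haskell. The `Dist` class and its method names (`unit`, `uniform`, `bind`, `prob`) are made up for illustration and are not the paper's API; it only handles finite discrete distributions and uses exact fractions to keep the arithmetic readable.

```python
from collections import defaultdict
from fractions import Fraction


class Dist:
    """A finite discrete distribution: a mapping from outcome to probability."""

    def __init__(self, weights):
        total = sum(weights.values())
        # Normalise so the stored probabilities always sum to 1.
        self.probs = {k: Fraction(v) / total for k, v in weights.items()}

    @classmethod
    def unit(cls, value):
        """Wrap a certain value (the monad's 'return')."""
        return cls({value: 1})

    @classmethod
    def uniform(cls, values):
        """Equal weight on every distinct value."""
        return cls({v: 1 for v in values})

    def bind(self, f):
        """Chain a dependent random step: f maps an outcome to a new Dist."""
        out = defaultdict(Fraction)
        for x, p in self.probs.items():
            for y, q in f(x).probs.items():
                out[y] += p * q  # total probability of reaching y via x
        return Dist(out)

    def prob(self, predicate):
        """Probability that an outcome satisfies the predicate."""
        return sum(p for x, p in self.probs.items() if predicate(x))


die = Dist.uniform(range(1, 7))
# Roll two dice and add them; the plumbing of probabilities stays inside bind.
two_dice = die.bind(lambda a: die.bind(lambda b: Dist.unit(a + b)))
print(two_dice.prob(lambda total: total == 7))  # 1/6
```

The point is that the code describing the experiment (roll, roll, add) never touches a probability directly; all of that bookkeeping lives in `bind`.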

-----


I've done most of my practical work in Python and recently picked up Pandas. I'm also growing increasingly fond of the ggplot port to Python and of some R packages that make ridiculously complicated things as simple as they should be. Thanks also for the link to the paper.

-----


I don't have any experience with Haskell, for data science or otherwise, but have been using Clojure a bit for such purposes.

The code I write is mainly for machine learning and natural language processing with "big data". Some of the libraries that I've found useful are:

1) clojure.core.reducers

2) core.match

3) incanter (lots of statistics-related stuff. Comparable to R, perhaps.)

Clojure handles XML, JSON, and YAML very nicely as well. Then you have Cascalog to run map reduce jobs without writing mappers and reducers explicitly. There's also Marceline, which is to Storm/Trident as Cascalog is to Hadoop/Cascading.

There are also libraries for serialization (data.fressian) and matrix math (core.matrix). Prismatic has also open-sourced three really excellent libraries that are useful for data processing.

-----


I am currently in the learning phase of data science. I have used R (which I'm more proficient in) and Python extensively for my current data science projects. I can say that R is more of a functional language than one would think, but it's certainly not a great general-purpose language. So far I have used R in a functional style for descriptive analysis, and Python on top of it for building data science/analytics applications. Then again, there is a library named "Shiny" that makes building a web app as easy for an R programmer as writing code for descriptive analysis in R. I am also interested in getting hands-on with Haskell and Julia.

-----


There is a package lambda.r (http://cran.r-project.org/web/packages/lambda.r/index.html) for the statistical computing environment R that implements functional programming paradigms. R is not a strict functional language, though. The package author plans to publish a book on this topic, see http://cartesianfaith.com/2013/09/21/preview-of-my-book-mode...

-----


I'm a proficient Haskell user with some experience writing Haskell data science code. I also have experience doing the same in Clojure, though it's about 2 years out of date. I'll begin with Haskell then compare.

---

I find that purity is a valuable feature of Haskell but, more so than with other code, I feel a big divide between current practice data science and pure functional code.

Haskell has a strong base of financial code which is usually unavailable publicly, but it does lead to a lot of blog posts and commentary describing how you can build highly efficient, powerful streaming systems in Haskell which interact with Excel. This is largely true as laziness tends to put people in a streaming mindset quite easily. Finally, there's a big push in the pipes/conduits camps to reify streaming as a first class action which can be manipulated easily. I'm a big fan of pipes—I think it's completely unreplicated anywhere else.

Haskell tends to be a memory hog and can produce space leaks if you're not careful. This will decimate your ability to use it for large data sets, but it's easy to avoid once you get a little bit of practice in. In particular, it's worth learning where new laziness is generated (whenever you produce a lifted type) and deciding whether that's what you want. Strict data types and UNPACKing eliminate space usage and leakage quite nicely.

Haskell has an incredibly powerful and fast vector library (called vector, unsurprisingly) and I encourage you to use it constantly. There are also a number of other very nice data science foundation libraries like ad, linear, vector-space, statistics, compensated, and log-domain.

Haskell's best dense matrix library, hmatrix, is nice but GPL. It also doesn't interact as nicely as I'd hope with vector. There's also Repa, though that's more optimized for images and parallel matrix operations like DFT.

Haskell's interactive runtime has a HUGE deficiency in that it erases all local variables on each code reload. I've been assured that there are proprietary (financial) REPLs which don't have this deficiency, so perhaps it could be eliminated if someone wanted to take it as a project.

If you have a GPU to spare then it's really easy (and fun) to push algorithms on to it using Haskell's Accelerate library.

Generally, static typing is a huge boon, but there's too little broad usage of Haskell as a platform for data science yet to see how best to use it. HLearn is a great test bed for a lot of this. I find it really exciting, but probably a bit too dense to be practical. There's a big hole in the ecosystem where a data.frame/pandas and ggplot/lattice duo could fit.

---

Clojure's primary benefits derive from, unsurprisingly, using functional algorithms atop Java's runtime and library support. I made a Clojure binding to JBlas a few years ago for my research (clatrix), which wasn't too difficult to build but filled a real hole in the ecosystem. I also reimplemented a bunch of basic machine learning algorithms in Clojure for a class and found that it was difficult (3 years ago) to get good performance out of raw Clojure, even when using type annotations. I found that dropping down to Java types wasn't so painful, but felt incredibly non-native. I'd suffer massive performance problems just to not have to do it. Clatrix helped to solve that a bit, and it's been developed much further thanks to core.matrix, though I've not used core.matrix in anger.

Generally when coding in Clojure I miss static types (though I've not yet used core.typed) which is entirely personal, so YMMV. I find them to be very, very key in statistical code, though, since so many error conditions just lead to difficult to interpret, yet totally false results. I want my errors to come from bad tuning, not uncaught type mangling.

I also did a fairly large amount of parallel processing in Clojure using map-reduce implemented atop the actor model. It worked pretty well and distributed over a few hundred machines, yet was never convenient enough to replace manually launching the jobs and collecting the results by hand at the end. After getting some experience with Erlang/OTP I think I could have done better, but it was still a boon as to how nice it was to do in Clojure.
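For readers who haven't seen the map-reduce shape written out, here is a toy sketch in Python. It is not the actor-based Clojure system described above: a thread pool stands in for the cluster, and the function names (`map_chunk`, `merge`, `word_count`) and the sample documents are invented for the example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def map_chunk(chunk):
    """Map step: count words in one chunk of documents, independently of the rest."""
    counts = Counter()
    for doc in chunk:
        counts.update(doc.lower().split())
    return counts


def merge(a, b):
    """Reduce step: combine two partial counts. It is associative, so grouping is free."""
    return a + b


def word_count(docs, chunk_size=2, workers=4):
    # Split the input into chunks that can be processed in isolation.
    chunks = [docs[i:i + chunk_size] for i in range(0, len(docs), chunk_size)]
    # Threads stand in for worker machines here (an assumption of this sketch).
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_chunk, chunks))
    return reduce(merge, partials, Counter())


docs = ["the cat sat", "The dog sat", "a cat ran"]  # made-up corpus
counts = word_count(docs)
print(counts["cat"], counts["dog"])  # 2 1
```

Because the map step is pure and the reduce step is associative, the same two functions would work unchanged whether the chunks are handled by threads, processes, or separate machines; only the scheduling layer changes.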

---

Generally, I find static types to be a huge boon for statistical programming as noted above. It's a tragic thing when you lose sensitivity to surprising results due to general mistrust of your own code. Static types make me rarely mistrust the correctness of my code (and libraries like quickcheck and simple-check help to cover the remaining uncertainty!).
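As a rough illustration of what property-style checking buys you for statistical code, here is a hand-rolled sketch in Python using only the standard library. It imitates the idea behind quickcheck/simple-check but none of their APIs; `check_property`, `random_sample`, and the three properties are hypothetical names chosen for the example, and there is no shrinking of counterexamples.

```python
import math
import random
import statistics


def check_property(prop, gen, trials=200, seed=0):
    """Run prop on `trials` random inputs; return the first counterexample, or None."""
    rng = random.Random(seed)
    for _ in range(trials):
        case = gen(rng)
        if not prop(case):
            return case
    return None


def random_sample(rng):
    """Generate a list of 2 to 20 floats in [-100, 100]."""
    size = rng.randint(2, 20)
    return [rng.uniform(-100, 100) for _ in range(size)]


def mean_within_range(xs):
    # The mean can never fall outside the data (small slack for rounding).
    return min(xs) - 1e-9 <= statistics.mean(xs) <= max(xs) + 1e-9


def variance_shift_invariant(xs):
    # Adding a constant to every point must not change the variance.
    shifted = [x + 10.0 for x in xs]
    return math.isclose(statistics.variance(xs), statistics.variance(shifted),
                        rel_tol=1e-9, abs_tol=1e-9)


def mean_equals_median(xs):
    # A deliberately false assumption, to show a counterexample being found.
    return statistics.mean(xs) == statistics.median(xs)


print(check_property(mean_within_range, random_sample))
print(check_property(variance_shift_invariant, random_sample))
print(check_property(mean_equals_median, random_sample) is not None)
```

The first two checks should report no counterexample, while the third should surface a sample that breaks the false assumption, which is exactly the kind of silent wrong-result bug that is otherwise hard to notice.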

Haskell I feel is faster broadly... except when it's not. Space leaks and excess laziness will destroy performance, but I still vastly prefer programming in a lazy-by-default environment because it leads to better composability and reuse. It also provided a focus on streaming algorithms that I use frequently. Clojure's reducers are nice but don't even approach the power and sophistication of Haskell's pipes.
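To make "streaming algorithm" concrete for anyone following along in Python, here is a minimal sketch using generators. It is only an analogy: Python generators give none of the guarantees that pipes does, and the helper names (`parse`, `running_stats`, `last`) and the input lines are invented for the example. The statistics are computed with Welford's one-pass method, so the data never has to fit in memory.

```python
import itertools


def parse(lines):
    """Lazily turn text lines into floats, skipping blank lines."""
    for line in lines:
        line = line.strip()
        if line:
            yield float(line)


def running_stats(xs):
    """Welford's one-pass algorithm: yields (count, mean, sample variance) per item."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in xs:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)  # uses the updated mean, as Welford's method requires
        yield n, mean, (m2 / (n - 1) if n > 1 else 0.0)


def last(iterable):
    """Consume an iterable and return its final item (None if it is empty)."""
    item = None
    for item in iterable:
        pass
    return item


lines = ["2", "4", "", "4", "4", "5", "5", "7", "9"]  # stand-in for a large file
n, mean, var = last(running_stats(parse(lines)))
print(n, mean, var)

# Stages compose lazily, so an infinite source is fine if only a prefix is demanded.
first_three = list(itertools.islice(running_stats(itertools.count(1)), 3))
```

Each stage pulls one item at a time from the stage before it, which is the composability benefit being described: `parse` and `running_stats` know nothing about each other or about where the data comes from.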

Clojure has better "obvious IO" library support in that its dynamic code requires less ceremony to drag down an online corpus. I've written a parallel website scraper in Haskell, though, and I feel that the concurrent programming would have been significantly more difficult in Clojure. Both have STM, but Haskell's STM is better.

Haskell has better general libraries, though, due to the library reuse made available by laziness and static typing. They can take a little effort to learn, but then become massively powerful with ease. Haskell also has the wonderful Diagrams library for building some kinds of charts, but it's more a substrate than an answer.

Haskell's vector data type is wonderful, and in both languages you can drop down to impure chunks of memory if your algorithm or performance needs are a fit for that. All you pay is expressiveness. Here again static types are a win as they can enforce impure regions and make sure that those regions don't mix and don't take over your program.

If I were to make my home in one of those languages for some serious data science, I'd do it in Haskell. It's still rough around the edges, but I feel there's a better substrate for building more sophisticated things atop it. Clojure may be able to solve your particular problem more quickly, but my experience is that quick things written in Clojure don't pay out over as long a period as quick things written in Haskell. Further, I think the comparative effort needed to build long-lasting libraries and tools is lower in Haskell.

If I were to just do a quick data science problem, I'd probably use R.

I'd also use Haskell for data science much more if it had a better REPL. IHaskell (a Haskell core in an IPython notebook) might become that needed REPL at some point.

-----


Wow, thanks for the incredibly thoughtful answer. That's a lot of useful experience to digest. I too am optimistic about IHaskell and where that all heads, especially in an IPython Notebook style UX with inline charts and such.

-----


If you are curious, check also Faust, used for DSP algorithms, and Pure, for anything.

http://en.wikipedia.org/wiki/FAUST_(programming_language) http://en.wikipedia.org/wiki/Pure_(programming_language)

-----


F#. It's a functional-first, open-source language that works on Mac/Windows.

Interops with R, Python, MATLAB, Mathematica, Java

Give it a try here. http://www.tryfsharp.org/Learn/data-science

Here are a bunch of resources http://fsharp.org/data-science/

-----


Python isn't a "pure" functional language, but it is general purpose, and it does accommodate functional programming (although it isn't strict).

You get the additional benefit of having well-developed numerical and graphing libraries (scipy, numpy, matplotlib, etc.)

-----


At the moment, Python is the language I am most proficient with and my go-to language. I definitely appreciate the functional aspects that it has incorporated (list/dictionary comprehensions, functions like map, zip, and reduce, and anonymous functions with lambda). In 2014, however, I'm planning to gain some proficiency with something close to a Haskell, OCaml, or Lisp dialect. Hoping to hear back from someone who has done some "heavy lifting" with one of those...
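For anyone less familiar with those Python features, a quick sketch of each in one place. The `scores` data is made up, and `reduce` is imported from `functools` because it is no longer a builtin in Python 3.

```python
from functools import reduce

scores = {"ann": [3, 5, 4], "bob": [2, 2], "cy": []}  # made-up data

# Dictionary comprehension: mean score per person, skipping empty lists.
means = {name: sum(xs) / len(xs) for name, xs in scores.items() if xs}

# map with an anonymous function (lambda): transform every element.
squares = list(map(lambda x: x * x, [1, 2, 3]))

# zip: pair up two sequences element by element.
pairs = list(zip(["a", "b"], [1, 2]))

# reduce: fold a sequence into one value, here the total number of scores.
total = reduce(lambda acc, xs: acc + len(xs), scores.values(), 0)

print(means)    # {'ann': 4.0, 'bob': 2.0}
print(squares)  # [1, 4, 9]
print(pairs)    # [('a', 1), ('b', 2)]
print(total)    # 5
```

None of these mutate their inputs, which is most of what "functional style" means in day-to-day Python, though nothing in the language enforces it.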

-----


Hopefully you can try Haskell without leaving your favorite environment (IPython notebook), since there is now a Haskell kernel (https://github.com/gibiansky/IHaskell).

-----


Nice! Didn't know about that yet!

-----


That's awesome! I have another summer goal to add to the list.

-----


SQL :)

More seriously, R is more of a functional language than one would think, but it's certainly not a great general purpose language.

-----


I agree, R is more functional than Python. In fact, R looks a lot like Lisp, except that syntactically it is not based on parenthesized lists.

-----


In terms of functional languages for data science, I'm surprised that Julia (http://julialang.org/) was not mentioned.

In terms of the R discussion, "Evaluating the Design of the R Language" seems to put the functional aspects in relief (http://r.cs.purdue.edu/pub/ecoop12.pdf)

In terms of where things are headed, I just came across Spivak and Wisnesky's work on Functional Query Language (FQL), http://wisnesky.net/fql.html. The introductory slides http://www.categoricaldata.net/doc/introSlides.pdf deserve a serious read.

-----



