
When Is Haskell More Useful Than R or Python in Data Science? - mathattack
https://www.forbes.com/sites/quora/2018/01/24/when-is-haskell-more-useful-than-r-or-python-in-data-science/#1daa23e69e47
======
stared
It is good that it starts with ecosystem. It is IMHO the biggest _practical_
argument for or against any language (along with the related question of
community).

R, language-wise, is awful, but thanks to its community it is very popular in
statistics and some parts of machine learning (although it has been losing
ground to Python in recent years; see
[https://www.kdnuggets.com/2018/05/poll-tools-analytics-data-...](https://www.kdnuggets.com/2018/05/poll-tools-analytics-data-science-machine-learning-results.html)).

Python has pros and cons, but it is the de facto standard for data science due
to its ecosystem.

Haskell... well, show me a working Haskell data science project, and I may
change my mind. So far I hear about Haskell from people wanting to make things
clean and abstract (and as a self-development tool), rather than done.

Julia may be more interesting (and it offers types!), plus a much stronger
focus on performance for iterative operations (which Haskell lacks). Though
each time I tried it, I had to go back to Python - again, due to the maturity
of its data science ecosystem.

~~~
perturbation
I am a data scientist (and use both R and Python regularly). I would take the
plot in your link with a grain of salt - RapidMiner is not used frequently (or
at all) by any of my friends/coworkers, though maybe I'm just in a bubble.

I like R a __lot__ the more I use it. Tidyverse libraries (especially purrr +
dplyr + ggplot2) make R a joy to use. I would argue that R is ahead of Python
in terms of libraries for everything _but_ deep learning (which is, after all,
only part of ML).

~~~
claytonjy
couldn't agree more; I think working in R is a vastly different experience now
than it was 5+ years ago, and has changed (for the better) much more quickly
than Python has. I'm primarily thinking of the tidyverse here; the three you
mention are so much more intuitive than loops + pandas + matplotlib.

With Max Kuhn's new tidymodels stuff, I think R has a real shot at providing a
nice alternative to scikit-learn, though there's a lot of catching up to do.
For the deep learning stuff, Keras in R is about as nice as Keras in python,
but I'm not holding my breath for pytorch-like workflows in R.

~~~
perturbation
MLR and caret do a lot in the 'unified API for models' approach, though (IMO)
they're not as streamlined / mature as sklearn, especially if trying to do
modeling using sparse matrices (they mostly expect dense matrices as input).
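
For readers who haven't hit this: the sparse/dense distinction matters because
text-style feature matrices are overwhelmingly zeros, so forcing them dense
blows up memory. A minimal sketch of the idea in Python (a toy dict-of-keys
representation, not any of the libraries mentioned above):

```python
# Toy illustration (not MLR/caret/sklearn APIs): a document-term matrix is
# mostly zeros, so a sparse representation stores only the non-zero entries.
def make_sparse(dense_rows):
    """Dict-of-keys sparse form: {(row, col): value} for non-zero cells."""
    return {
        (i, j): v
        for i, row in enumerate(dense_rows)
        for j, v in enumerate(row)
        if v != 0
    }

dense = [
    [0, 0, 3],
    [0, 0, 0],
    [1, 0, 0],
]
sparse = make_sparse(dense)  # 9 dense cells, but only 2 entries stored
```

Frameworks that "mostly expect dense matrices" force you to materialize all
nine cells above, which is exactly what becomes infeasible at tf-idf scale.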

~~~
claytonjy
I haven't played with MLR; how do you like it, esp. compared to caret?

I never really caught on to caret as it felt rigid and clunky and non-
idiomatic, but I've been using Max's newer rsample (setup CV) + recipes (like
sklearn pipeline) + yardstick (metrics) packages to good effect lately.
parsnip, which will handle the core model-fitting, seems promising but is too
early to use yet.

I don't expect sparse matrix support to get any better, as the core model
functions would have to be rewritten entirely to avoid rehydrating them, which
AFAIK nobody is seriously working on :(

~~~
perturbation
I've only used it a little in side projects (xgboost + mlr); I think it works
better than caret, though (not as brittle). Most of my day-to-day is text data
(which mlr isn't really well suited for).

What originally drew me to it was a blog post [1] about using their model-
based optimization framework for tuning hyperparameters. It's a lot more
sophisticated than anything I've seen for Python, including hyperopt /
hyperband.

[1]: [http://mlr-org.github.io/How-to-win-a-drone-in-20-lines-of-R...](http://mlr-org.github.io/How-to-win-a-drone-in-20-lines-of-R-code/)

~~~
claytonjy
> Most of my day-to-day is text data

Ah the concern about sparse matrices makes even more sense now! I love that
tidytext can produce sparse tf-idf matrices...but hardly any models can use
them :(

------
patagonia
The author seems to argue that Haskell's mental model is a better fit for the
domain of data science, and that really it's the lack of libraries that is
holding Haskell back. So, obvious question: why did the libraries get created
in R and Python?

Many blog posts about Haskell seem to include something like...

"This is not a language for beginners..."

"This language will make you smarter but there is a steep learning curve..."

"I don't feel smart enough to want to learn Haskell..."

And I haven't seen many other posts attempting to dispel that notion.
Libraries are written by people. How big is Haskell's userbase? What is the
Haskell community doing to increase it? It's at a natural disadvantage when
you've got schools (like MIT and Harvard) using Python as the learning
language of choice now, and Python has a reputation for being easy to pick up
and really pleasant to write in.

~~~
DonaldPShimoda
I think the problem isn't Haskell but rather the way programming is usually
taught.

Imperative programming is the most widespread paradigm, by far. When people
find an introduction to programming, it's going to be in a language like C,
C++, Java, Python, C#, JavaScript, etc. Many of these languages support
functional style in some way or another, but none of them truly embrace it —
and anyway, you wouldn't find any mention of it in introductory material.

In an imperative language, you write down a list of instructions for the
computer. You can think of it as a step-by-step recipe, and the computer will
dumbly follow your instructions to a T. This is easy to explain, and since
it's what most people use it's also what most people teach.

Haskell (and pure functional languages in general) is fundamentally
different. In a functional language, you're writing _data transformations_.
That's literally all it is. Instead of writing a procedure, you write what
essentially amounts to a mathematical function describing how to transform the
input data into the desired output. It's not a list of steps, but rather a
single modification. You compose multiple functions together to get the
desired result. This definitely seems more aligned with the goals of data
science, for the most part. (Though I'm not a data scientist, so I could be
wrong!)
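
A minimal Python sketch of the contrast (a hypothetical example; a Haskell
version would express the functional half as a one-line pipeline of the same
three transformations):

```python
from functools import reduce

# Imperative style: a step-by-step recipe that mutates state.
def sum_of_even_squares_imperative(xs):
    total = 0
    for x in xs:
        if x % 2 == 0:
            total += x * x
    return total

# Functional style: compose data transformations (filter -> map -> fold);
# no variable is ever reassigned.
def sum_of_even_squares_functional(xs):
    evens = filter(lambda x: x % 2 == 0, xs)
    squares = map(lambda x: x * x, evens)
    return reduce(lambda acc, x: acc + x, squares, 0)
```

Both return the same answer; the functional version describes _what_ the
output is in terms of the input, which is the framing meant above by "a
single modification".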

I don't think functional programming is significantly more difficult to learn.
It's just hard to learn _after you've already learned the imperative style_.
It's hard to go back and figure out a different way to explain things to the
computer once you're used to doing it the one way. I think this is the part
that those quotes you have are really talking about. I don't think Haskell is
inherently more difficult to learn than Java for somebody whose programming
knowledge is a blank slate; it's just hard to learn after you've started down
the path of imperative thinking.

~~~
patagonia
I don't disagree with what you've said, but I'd like to redirect. I don't
think there is a general expectation that programmers or software engineers
should approach writing software primarily from a mathematical perspective. If
for a given input one generates the expected output, then you've written good
code (an over-generalization). So, when I sit down to extend a library with a bit
of new functionality, I'm generally not going to also go through the process
of learning a new language or making changes to the language I already know.
I'm going to express this new functionality using my existing _tools_. The
library is being extended, not the language.

Which is why I find the other comment's mention of Racket to be interesting.
I've always found it weird that the grammar, syntax, notation, etc of various
different maths are not better represented in programming languages. Software
is after all, theoretically, infinitely flexible. But most new code is written
using the same tricks as old code. Compare this with mathematics. When you
switch from linear algebra to calculus you're expressing it all on the chalk
board in completely different terms. You don't write calculus in the language
of linear algebra. (Yeah, I know many ideas can be expressed / optimized using
it, I'm speaking of how does one achieve the greatest level of mental
ergonomics when thinking about or expressing an idea. Maybe call it speaking
native Calculus.) Different symbols. Different layouts. It all facilitates a
different thinking about different ideas. This is clearly where LaTeX,
Mathematica, SageMath, and others come in. But if we're talking about Python
and R libraries, well, that is nowhere near as natural.

Programming is taught just fine for those wanting to program. Maybe
mathematicians need to put more effort into leveraging the tool we call
software into building out tools, languages, libraries, and human/computer
interfaces for the activity we call mathematics. Or not. We're probably where
we are because, to some extent, the current state gets the job done. With maths
and software being so important to our future, though, I personally feel that
improving the ergonomics of tools used for computational sciences would pay
hefty dividends.

------
peatmoss
The bulk of people trying to get stuff done in a data / statistical / ML
context are heavily biased toward not having to write their own tools. I fancy
myself a reasonably good programmer, and reasonably competent at applied
statistics, but if I had to, say, implement a way to estimate my own GLMs, I’d
quickly go crazy.

A lot of research and data science (at least as far as the computing is
concerned) is plumbing just like a lot of web development is writing CRUD
apps.

Part of the success of R is that statistical researchers (i.e. the tool
makers) implement clever new methods in R. Someone else has done the lit
review comparing that fancy new method to other related methods. Someone else
has validated the math. Someone else uploaded the code to CRAN.

Python has historically lagged R in the stats space because statisticians
weren’t building tooling for it. Even fairly common models that have long been
in R are still missing from the Python ecosystem.

But something started to change a few years back when the buzz around
Bayesian methods took off. Most MCMC-based estimations of Bayesian models
became more of a computational problem than a stats problem. Python started
to not be
laughably deficient in Bayesian methods.

And then we started having an explosion of deep learning whatnot, and nobody
knew what the hell they were doing, and so everyone just used the frameworks
gifted from big companies like Google—and they released stuff with Python
APIs.

So I think that what language you use for data science is still driven by
ecosystem, and also your ecosystem’s comparative advantage in implementing the
methods you care about.

All else being equal, I’d love it if we had better languages for data science.
Julia may get there—it’s a reasonably nice language that both Pythonistas and
R...barians(?) find inoffensive. Given the strength of R’s tidyverse in
implementing useful DSLs, I’d love to see the Racket community blow our minds
with the language-oriented programming that they’re about, but ultimately
mind-blowing will take a back seat to ecosystem.

~~~
werphesti
I've been using R for 20 years, and I think the history does matter.

R is an open source clone of S, which had some traction with rigorous
statisticians via Bell Labs. So it had stats at its core. R was just a version
of S that had less overhead and was easier to access.

The explosive growth of R really is more about the explosive growth of stats
and data analysis than anything else. R and S were sort of growing within the
stats community anyway, and then when stats took off, so did R. I think its
open-source quality and the fact that it is more similar to other programming
platforms than SAS or Stata helped too, as stats branched out into computer
science.

Lisp is an interesting comparison with lessons for Haskell. XLispStat was a
competitor to R very early on but died out relatively quickly. I always was
sad about that, because I loved lisp, and it was great having stats embedded
in a broader language, but the reaction was uniformly the same: that lisp was
just too weird, too hard to read, and too hard to program in. Lisp has
diminished in importance in computer science more broadly, but didn't die out
in the same way XLispStat did in stats.

Haskell is suffering similar issues. The lack of ecosystem is partly because
it's coming from the outside in rather than from the inside out, but part of it is
because it is just perceived as odd. I love functional programming but am
increasingly becoming convinced that any language that pushes too hard on one
paradigm is going to lose out to one that is less pure. As great as functional
approaches are, sometimes it's just easier to think and organize procedurally,
and this is increasingly true as you get closer to the metal.

The real elephant in the room is the poor performance of the languages
currently dominating data science, whether that be R or Python. LLVM
basically made it possible to write a conceptually clean language that also
exhibits good performance, so we don't have to choose between expressive and
performant languages so much, unless you're talking about embedded systems or
low-level systems programming. Although many people don't want to program a
GLM (and maybe shouldn't, for integrity's sake), there are many times when
going down to the likelihood-function-and-optimization level for an unusual
case shouldn't result in a huge performance hit. You shouldn't have to change
to C or even Rust for something that conceptually isn't that much lower level.
Things like Julia and Nim really make this possible, and that is where things
will probably eventually head (even if I'm not sure it will be either of those). I
also wonder if we'll see things like Rust taking off via higher-level
extensions such as Lia or Gluon.

My guess is that if "functional" languages will take off with stats, it will
be through something like OCaml (especially if it gets its
distributed/parallelism story worked out quickly enough).

~~~
peatmoss
Thank you for the thoughtful reply. The impedance mismatch when needing to
write performant code from the R/Python universe is real. And that Julia has
managed Pretty Good performance while simultaneously being a pretty decent
language is huge.

XLispStat was before my (stats) time, sadly. Being part of a more general
purpose language ecosystem is one of the selling points of Python, though from
a performance and expressiveness standpoint Python is strictly inferior to
Lisp in just about every way. Except weird.

To be honest, I’m more fussed about good support for functional style than I
am about enforced functional purity. So in that sense, maybe you’re right that
the next big data / stats language will be multi-paradigm.

------
ggm
R is for people who learn maths and stats from mathematicians and
statisticians. Its scripting is quite well aligned with the way it's taught in
those disciplines. Computer scientists dislike its huge inefficiency, but for
its problem space it's a good fit.

Python/Numpy/Pandas is where I think R should be, but I get there are two
models here. I use both. Python+plotly gets you to print ready SVG output
really quickly. I like Jupyter. I also like Shiny/R.

I tend to think Haskell is a quant thing in data science. You think a lot
about the model. You type the royal french bejesus out of the inputs. You
adopt strong formalisms about moments of exchange between types, and
compositions over types because when you stand up and say "this $32b dollar
play has about a 15% upside, if the model holds" you really want to know the
_model_ holds, not _a pile of bugs I didn't consider_ holds.

~~~
droidist2
Would Haskell's type system really prevent statistical biases like data
snooping, etc?

~~~
ggm
It would prevent mashing distinct data types together without conscious
effort. So, if you type the financial data's plus and minus sides as distinct
types, you cannot cross-combine them without an explicit moment to do it in a
way you understand (that is, if they are what in C you'd call both integers,
plus and minus, but logically could not simply be combined without some
context, the typing moment here would make it far less likely you did that
except in known ways).
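
A rough sketch of the idea, in Python notation for accessibility (hypothetical
names; Haskell's `newtype` enforces this at compile time, whereas Python needs
an external checker like mypy):

```python
from typing import NewType

# Credits and debits are both "just numbers" at runtime, but get distinct
# types so they can't be cross-combined by accident.
Credit = NewType("Credit", float)
Debit = NewType("Debit", float)

def net_position(credit: Credit, debit: Debit) -> float:
    # The only sanctioned "moment" where the two sides may be combined.
    return float(credit) - float(debit)

pos = net_position(Credit(150.0), Debit(40.0))
# A type checker rejects net_position(Debit(40.0), Credit(150.0)):
# the arguments are swapped, even though both are floats underneath.
```

This is exactly the "explicit moment" above: the combination is legal in one
audited place and a type error everywhere else.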

And, given the flow of change over compositions of functions in types, you
really have to have done the hard yards: dealt with 'maybe this doesn't
exist' situations, continuations.

I don't think it would stop you mis-applying a construct as a method on
something, so misusing a statistical model, no. But nothing can prevent acts
of ignorance (as I know to my own shame: I tried to apply the K-S test to
disparate samples, and the moment of 'you cannot compare area under a curve
for a nail and a Gaussian distribution' completely passed me by. Five minutes
with a competent statistician cleared it up).

------
metta2uall
IMHO F# is another contender in this space. It's a functional language with a
lot of nice features, tools and support for the whole .NET ecosystem (though
compared to Haskell it does lack certain 'advanced' features like higher
kinded types). Some packages are listed at [https://fsharp.org/guides/data-science/](https://fsharp.org/guides/data-science/)

I find FSharp.Data especially to be amazing for productivity.

~~~
totalperspectiv
Thanks for the link! I think the page hits the nail on the head with this: 'As
data science employs techniques from many problem domains, numerous base
technologies are required.'

Any language that can successfully and seamlessly pull packages from other
languages will do very well. Since most of the packages for data science
start off in academia, written by people who are experts in their own field
rather than computer science, 'cool' languages are going to struggle to gain
mind-share in package development.

------
hobls
I recently spent a small amount of time diving down the functional programming
rabbit hole (following SICP, poking at Haskell, and such). My general
impression is that the coolness of the languages is overshadowed by the poor
ecosystem for a lot of use cases, exactly as the article suggests for data
science.

It all seems possible to overcome, but you’d end up needing to deal with C/C++
bindings for many situations. Cryptography libraries, for example, appear to
be mostly hobby projects, which are clearly not trustworthy for serious uses,
so now you're writing/finding bindings.

Is that a fair assertion? I’m not at all an expert; I got kind of excited
about the area and then a little disappointed when I considered doing some of
my pet projects in a functional language and saw the immature ecosystem.

~~~
jimbokun
I think a wrinkle on what you're saying, though, is that many "mainstream"
languages are adapting (or have adapted) functional programming features.

Swift and Rust pull in a lot of functional programming features.

Programming in Java 8 feels surprisingly functional, in a lot of ways.

Clojure allows you to experience even more of the functional programming
experience (albeit with dynamic typing) on the JVM.

"JavaScript: The Good Parts" showed that JavaScript best practices generally
follow a functional programming mindset.

Having said that, if you want the full functional programming experience
(first class functions, immutable data structures, isolated side effects,
sum|union types, etc.), then yes, what you say is true. The number of
libraries and the size of the community for Haskell and the ML family just
don't compare to the more mainstream languages.
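
Python itself shows the half-way adoption well: the functional pieces exist,
but they're opt-in rather than the default (a small sketch, hypothetical
names):

```python
from dataclasses import dataclass

# First-class functions: behavior passed around as ordinary data.
def twice(f, x):
    return f(f(x))

# Immutable data: frozen dataclasses approximate functional value types,
# but immutability is something you ask for, not something you get.
@dataclass(frozen=True)
class Point:
    x: float
    y: float

def shifted(p: Point, dx: float) -> Point:
    # "Update" by building a new value instead of mutating the old one.
    return Point(p.x + dx, p.y)

p = Point(1.0, 2.0)
q = shifted(p, 3.0)  # p is untouched; q is a new Point
```

Nothing stops a colleague from skipping `frozen=True` and mutating away,
which is the gap between "supports functional style" and "embraces it".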

~~~
nightski
The most distinguishing factor of Haskell is not its functional programming
features, but rather its very flexible and powerful type system. You won't
see that pulled in by the mentioned languages any time soon.

~~~
cwyers
Yeah. If Clojure is functional, then so is R.

~~~
cultus
Functional-ness is completely orthogonal to the type system. Clojure is
probably the second most-functional prominent language after Haskell.
Immutability is very strongly encouraged, with mutation used only in
controlled ways.

------
zimbatm
any mod want to fix the URL? the original content is at
[https://www.quora.com/What-are-some-use-cases-for-which-it-w...](https://www.quora.com/What-are-some-use-cases-for-which-it-would-be-beneficial-to-use-Haskell-rather-than-R-or-Python-in-data-science)
and Forbes is pretty much impossible to read for me because of all the ads

------
minimaxir
If your data project is bottlenecked by speed, and the current libraries
_which already leverage a low-level language such as C/C++_ like pandas/numpy
in Python and dplyr in R don't suffice, then you probably should be switching
tooling altogether (e.g. to "big data" tools like Spark, which have a
Python/R API anyway).

~~~
ryanmonroe
If speed is a concern and you're working in R, you should use data.table
instead of dplyr. In Hadley Wickham's words: "We optimise dplyr for
expressiveness on medium data; feel free to use data.table for raw speed on
bigger data." Although personally I don't think data.table is any less
expressive.

~~~
claytonjy
As a former data.table user and late-comer to dplyr and pipes, I think
data.table is just as expressive when _writing_ (and I do miss indices and in-
group references), but is really awful to read. I don't miss the days of
reviewing data.table commands, esp. given that many teammates have been
relatively inexperienced programmers.

------
msaharia
At this moment, I would love to have something in Python remotely comparable
to ggplot2 in power and intuitiveness. I know there is a port, but it's not
as powerful. Matplotlib is about as unintuitive as a plotting library could
be.

------
dtjohnnyb
Does Scala fit as a functional language that is useful for data science?

~~~
jvican
Yes, it’s heavily used in this area.

------
projectramo
tl;dr never.

Jokes aside: tl;dr the author finds Haskell better suited to expressing
mathematical abstractions (no examples given); however, they believe the lack
of libraries holds Haskell back. They end by talking about how RHaskell lets
you integrate R and Haskell to get the best of both worlds.

------
Keyframe
What about that big kid no one talks about - SAS?

~~~
JanisL
Probably because a lot of people are in the situation I'm in where they can't
use it on their projects because their clients don't want to have to buy
licenses just to verify the results.

~~~
Keyframe
Of course. I tend to use Python and I dabble with R - all for personal use.
However, I've seen SAS in a lot of places. It's huge, yet barely anyone
mentions it!

