
Haskell Data Analysis and Machine Learning Cookbook - e19293001
http://haskelldata.com/
======
IanCal
I like seeing more things come out helping people with data analysis in
various languages, but this put me off:

> isWhitespace x = elem x " \t\r\n"

This is the kind of thing that makes me concerned about using this resource
for real-world data. In real-world data you're going to get all kinds of crazy
things coming in, and if you're assuming nobody will ever have something like
a zero width non-breaking space, or a form feed, you're going to have a
problem.

It's the kind of thing I see with people starting out when dealing with data,
similarly the punctuation detection here: [https://github.com/BinRoot/Haskell-
Data-Analysis-Cookbook/bl...](https://github.com/BinRoot/Haskell-Data-
Analysis-Cookbook/blob/master/Ch02/Code02_punctuation/Main.hs)

If you rely on these things, you will have problems. Text is hard and weird
and terribly more complicated than people usually expect.

Does Haskell have good libraries for dealing with the more awkward parts? Can
I easily remove all characters marked as whitespace in unicode, for example?
Detecting and managing mangled encodings?

~~~
zallarak
That line of code was demonstrative. The actual book uses 'Data.Char.isSpace',
which handles this properly.

"Returns True for any Unicode space character, and the control characters \t,
\n, \r, \f, \v."

Before you chastise them for not handling something, verify it. I'm not
affiliated with the book but you probably deterred people from buying it.

[https://github.com/BinRoot/Haskell-Data-Analysis-
Cookbook/bl...](https://github.com/BinRoot/Haskell-Data-Analysis-
Cookbook/blob/master/Ch02/Code01_whitespace/Main.hs)

~~~
IanCal
Well, for one, that line shouldn't really be one of the first bits of
advertising for the book if the author also knows it's wrong. The second
example is taken from the GitHub repo for the book, though, and is exactly the
same type of error.

> you probably deterred people from buying it.

Quite possibly, but I think with good reason. I don't know what's in the book
but I'm concerned it won't contain things like a discussion of what whitespace
is and is not, how to decide what you should do for your data and when isSpace
might not do what you really need. I can't review it properly but at least one
bit of code in the repo looks dangerous and one bit on the website looks
dangerous.
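
To make the distinction in this exchange concrete, here is a minimal sketch comparing the hand-rolled check quoted from the website with 'Data.Char.isSpace'. The names isWhitespaceNaive, isSpaceOrZeroWidth and stripWith are hypothetical helpers, not code from the book. The point it illustrates: isSpace covers form feed and Unicode space separators such as the no-break space, but zero-width characters like U+FEFF and U+200B are classified as format characters, so they still need an explicit decision.

```haskell
import Data.Char (isSpace)

-- The hand-rolled check quoted above: only space, tab, CR and LF.
isWhitespaceNaive :: Char -> Bool
isWhitespaceNaive x = x `elem` " \t\r\n"

-- Hypothetical helper (an assumption, not from the book): also treat two
-- zero-width format characters as whitespace. Which characters belong here
-- depends on your data.
isSpaceOrZeroWidth :: Char -> Bool
isSpaceOrZeroWidth c = isSpace c || c `elem` "\x200B\xFEFF"

-- Drop every character matching the given predicate.
stripWith :: (Char -> Bool) -> String -> String
stripWith p = filter (not . p)

-- Sample input: form feed, no-break space, zero-width no-break space.
sample :: String
sample = ['a', '\f', 'b', '\x00A0', 'c', '\xFEFF', 'd']
```

On this sample, stripWith isWhitespaceNaive removes nothing, stripWith isSpace removes the form feed and the no-break space but keeps U+FEFF, and stripWith isSpaceOrZeroWidth leaves only the letters.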

------
edgordon
This is a repost of a book we published in 2014. It's well reviewed, and the
author kept the GitHub repo for the code up to date with feedback
([https://github.com/BinRoot/Haskell-Data-Analysis-
Cookbook](https://github.com/BinRoot/Haskell-Data-Analysis-Cookbook)). If
anyone wants to try it, you can pick it up in the Packt sale currently for $10
@ packtpub.com

------
sdx23
I bought that book some years ago and wouldn't do so again. It was very
disappointing to see a focus on neither Haskell nor data analysis. It
touches on both topics but covers only very elementary things. The content is
mostly short recipes that were of no value to me at the time.

For people interested in the topics I'd recommend buying other good books on
Haskell and on data analysis separately.

If, however, recipes are to your liking and you're only starting out with
Haskell / data science maybe this is something for you (or maybe not).

------
blubb-fish
Serious question:

Is there any reason why somebody would use Haskell for data analysis when
there are also R and Python - which are perfect for that job - other than
that the person happens to be a Haskell expert anyway?

~~~
gh02t
I use all three, depending on the task. Haskell is compiled and relatively
speedy, as well as being great for writing custom parsers (plus in a lot of
situations it parallelizes very easily). One thing I find Haskell to be really
useful for is coercing large unstructured datasets into a format that is
easier to feed into Python.

For instance, I once had to write a parser for the data coming off of a
digitizer being fed by a rather complicated radiation detector array. It's in
an obscure and somewhat bizarre binary format that is pretty tedious to work
with because it involves a lot of state in the parser, plus the files are
pretty enormous (they describe every radioactive particle hitting every
detector in the array over several hours of measurement). My colleague wrote a
horribly tedious script in Python to parse it that was complicated and
agonizingly slow, but I was able to write a very natural 80-line Haskell
program in a few hours that was several orders of magnitude faster as well as
much more robust. I was just massaging the data to feed into Python, but it
was far far easier to do in Haskell.

So for some tasks I want to reach for Haskell, because it's natural to
express the solution in Haskell. For other tasks I wouldn't even consider it;
for things like data processing and plotting in particular, Haskell is not as
elegant. Right tool for the job and all.

~~~
hellofunk
This 80-line solution sounds very interesting. I'd love to see more examples
of this particular strength of Haskell. Are there similar parsing projects
that you can point to that are open source and worth looking at to better
understand this use case of Haskell?

~~~
PinkyThePiggy
Here is an example from Real World Haskell (pretty good intermediate book,
although it is a bit dated): [http://book.realworldhaskell.org/read/using-
parsec.html#csv](http://book.realworldhaskell.org/read/using-parsec.html#csv)

A fully featured CSV parser in 20-30 [depending on how you count] lines of
non-golf code.
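
For readers who don't want to follow the link, here is a rough sketch in the spirit of that chapter, written against the Text.Parsec interface of the parsec package. It is not the book's code verbatim; names such as record and parseCSV are my own choices, and it assumes every record ends with a line ending.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- A file is zero or more records, each terminated by a line ending.
csvFile :: Parser [[String]]
csvFile = record `endBy` eol <* eof

-- A record is one or more comma-separated cells.
record :: Parser [String]
record = cell `sepBy` char ','

-- A cell is either quoted or a run of characters up to a comma or newline.
cell :: Parser String
cell = quotedCell <|> many (noneOf ",\r\n")

-- Quoted cells may contain commas, newlines and doubled quotes.
quotedCell :: Parser String
quotedCell =
  between (char '"') (char '"' <?> "closing quote") (many quotedChar)

-- Inside quotes, "" stands for a single literal quote character.
quotedChar :: Parser Char
quotedChar = noneOf "\"" <|> try ('"' <$ string "\"\"")

-- Accept Windows, Unix and old Mac line endings.
eol :: Parser String
eol = try (string "\r\n") <|> string "\n" <|> string "\r" <?> "end of line"

-- Parse a whole CSV document held in memory.
parseCSV :: String -> Either ParseError [[String]]
parseCSV = parse csvFile "(input)"
```

Calling parseCSV on a String returns either a ParseError carrying position information or the rows as lists of cells.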

~~~
hellofunk
I'm curious what aspects of Haskell have changed to cause this book to be
dated, i.e. are there examples in this book which would be done differently
now?

~~~
luisfoliv
This SO thread is very informative in this regard:
[http://stackoverflow.com/questions/23727768/which-parts-
of-r...](http://stackoverflow.com/questions/23727768/which-parts-of-real-
world-haskell-are-now-obsolete-or-considered-bad-practice)

------
zelos
It definitely seems like there's a lack of intermediate-to-advanced Haskell
books. This looks like it contains a lot of canonical Haskell coding examples
and might be useful: any Haskell experts that can weigh in?

The other Packt Haskell book is apparently terrible, so I'm a bit cautious.

~~~
15155
This is my favorite resource:
[http://haskellbook.com/](http://haskellbook.com/)

It covers beginner topics through arguably advanced topics (monad
transformers, etc.)

"Advanced" is subjective, but I don't believe the current version covers
GADTs, type families, or other more esoteric extensions.

(No affiliation to the authors)

~~~
cm3
I've seen GADTs as a feature that Haskell devs complained about, similar to
Template Haskell. I know the issues with TH, but what's the limitation of
GADTs as implemented in Haskell, and are there languages with less problematic
implementations?

~~~
tel
Maybe some people think they're a little complex, but there's nothing
particularly wrong with them in Haskell. They can be extended further in a
dependently-typed context, but that's really another thing completely.

------
mark_l_watson
I bought this book a couple of years ago and never read through it. It does
provide convenient recipes that you can look up in the table of contents. I
like recipe books like this, but be warned that there is not much depth. I
recently bought another book by the same author that is also useful.

------
dschiptsov
BTW, one might notice how a language with type-tagged data (a value has a
type, not a variable; there are no box-like variables, only bindings) is much
more suitable for data exploration and analysis (python + pandas is a good
example).

Also, homogeneous lists and especially conditionals are kind of awkward -
tuples aren't as flexible as lists. Defining a type for each possible row and
then pattern-matching on it will result in a lot of useless boilerplate,
almost as bad as Java.

Common Lisp, it seems, is a better choice for such problems.

~~~
reuben364
The dependently-typed-ish features of Haskell can provide a sweet spot of both
untagged data along with the flexibility of heterogeneous lists.

~~~
nine_k
An example / reference would be nice.
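
One possible reading of the suggestion above is a heterogeneous list whose element types are tracked in a type-level list, using the GADTs and DataKinds extensions. This is only a sketch of that idea, not necessarily what the commenter had in mind; HList, Row, exampleRow, rowName and hLength are hypothetical names chosen for illustration.

```haskell
{-# LANGUAGE DataKinds #-}
{-# LANGUAGE GADTs #-}
{-# LANGUAGE KindSignatures #-}
{-# LANGUAGE TypeOperators #-}

import Data.Kind (Type)

-- A list whose type records the type of every element, in order.
data HList (ts :: [Type]) where
  HNil :: HList '[]
  (:&) :: t -> HList ts -> HList (t ': ts)

infixr 5 :&

-- A "row" with mixed column types and no runtime type tags.
type Row = HList '[String, Int, Double]

exampleRow :: Row
exampleRow = "widget" :& 3 :& 9.99 :& HNil

-- Works on any row whose first column is a String; the empty case is
-- ruled out by the type, so no HNil clause is needed.
rowName :: HList (String ': ts) -> String
rowName (name :& _) = name

-- Length works for every HList, whatever its element types.
hLength :: HList ts -> Int
hLength HNil = 0
hLength (_ :& rest) = 1 + hLength rest
```

The trade-off is that column types are checked at compile time without defining a separate record type per row shape, at the cost of heavier type signatures.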

------
eggy
I might have a look, but the page seems to be a big ad with animated buy
buttons at top?

I posted a link for a free online kdb training class 3 months ago for students
that normally goes for $1300 with no affiliation to the company, and it was
flagged. What's the difference?

