
Best Data Science Books According to the Experts - mantshitla
https://builtin.com/data-science/data-science-books
======
scottlocklin
This is a completely, utterly worthless list: the only things that belong on
it are Grus (good for python), maybe Bishop (maybe; it's woefully out of date
and light on details) and Hastie (his other book is vastly better). The
"General interest" books are all horse shit. If you're a pythonista, you
should buy Wes McKinney's book. If you're not, you should buy John Mount and
Nina Zumel's "Practical Data Science with R", which is comparable to Grus.

If you don't know linear algebra, the "done right" book is absolutely not done
right for people that work with data in the real world: Strang is infinitely
better, with Trefethen and Bau, and Golub, if you need to go deep.

As someone pointed out below: Kevin Murphy's ML book is actually very good for
teaching you how the things work. What's more, you can download Python or
Matlab code for every algorithm in his book (as you can for Bishop at this point).

Nobody needs the Deep Learning books listed. In spite of the hype, it's just
not that important compared to naive bayes, LDA and GBM; most people don't
have access to the hardware or data sets that make DL useful, and those who do
probably studied DL in grad school.

Gads; what trash! Those experts: only one of them appears to actually be an
expert; pedants and post-docs don't count.

~~~
ims
Seconding this comment. Based on experience in hiring data scientists and
comparing notes with many others that hire data scientists, the most frequent
gaps in knowledge are (1) statistics specifically and scientific computing in
general and (2) disciplined software engineering.

People good at (1) and bad at (2) write "PhD code" that may or may not be
right but you can't tell because it's too disorganized. People good at (2) but
bad at (1) get fine-ish looking numbers out of their good looking code but you
can't tell whether it's right because they may have ignored or misunderstood
fundamental assumptions and correctness of the underlying methods.

There are also seemingly tens of thousands of people on the market who have
little experience in either but have adapted projects from examples online
into their GitHub portfolio and put all of the relevant terms into their resume
anyway.

I think most aspiring data scientists would be better served going with more
introductory texts and _really understanding them_. Maybe Blitzstein and
Hwang's "Introduction to Probability" and then McElreath's "Statistical
Rethinking" or Wasserman's "All of Statistics" for people who need more stats.

I'm not even sure what to recommend for developing good software judgment and
habits. There doesn't seem to be a shortcut for that. Maybe "Fluent Python" or
"Effective Python" for Python people? No idea for the R ecosystem.

~~~
skwb
Perspective for (2): it's because no one in graduate training really cares
about code quality. Your PI focuses more of your attention on scientific
writing, and so there's little to no time to polish your work. The incentives
just don't support this work at the graduate training level.

~~~
LetThereBeNick
I agree. The only incentive to write neat code is to save yourself the pain
and suffering of having to go back through it yourself to fix or add things. I
am currently doing data analysis in MATLAB for my PhD, and I know nobody will
ever use my code besides me.

I’d _like_ to learn to do my due diligence, but without someone training me,
it just takes so much time to learn things like git. I’d rather be recording
more data and submitting my paper so I can get the hell out of here

------
jointpdf
More free books:

* _Vectors, Matrices, and Least Squares_ — IMO the best beginner-friendly and applications-focused intro to (or review of) linear algebra. Covers a ton of fundamental ground while keeping things consistent and concise. Lots of exercises and a Julia supplement book. ([http://vmls-book.stanford.edu/](http://vmls-book.stanford.edu/))

* _Mathematics for Machine Learning_ — good coverage of the most important math concepts relevant to ML ([https://mml-book.github.io/](https://mml-book.github.io/))

* _Forecasting: Principles and Practice_ — best overall resource on forecasting that I know of; R focus. One of a zillion great R/data science books, virtually all of which are open and well-written. ([https://otexts.com/fpp2](https://otexts.com/fpp2))

* _Dive Into Deep Learning_ — can’t personally vouch for this one but it looks comprehensive; numpy/PyTorch/TF focus ([https://d2l.ai/](https://d2l.ai/))

* _Speech and Language Processing_ — clear introduction to all things NLP, nice flow ([https://web.stanford.edu/~jurafsky/slp3/](https://web.stanford.edu/~jurafsky/slp3/))

~~~
scared2
Have you already read them all?

~~~
jointpdf
I’ve read all of VMLS and Forecasting (and ISLR from the original list), and
maybe half of SLP. MML I have skimmed through for review / used as a
reference, and the deep learning book is high on my queue.

I tend to not be a cover-to-cover reader, so I usually deep dive into a single
topic for a while (e.g. forecasting, information retrieval) and read
papers/tutorials/chapters related to that topic and the math concepts related
to it.

PS I feel like impostor syndrome is so common among data scientists because
there is so much material that feels like “must know”. Don’t feel like you
need to memorize thousands of textbook pages to be effective, and you could
spend a lifetime mastering any of these individual subjects. JIT learning is a
great skill to have.

------
screye
The frequency with which Kevin Murphy's ML book gets left out of these lists
is almost bewildering.

If I had to choose 1 book as the ML bible, then it would be Murphy's
(contrasted against Bishop and ESL) for the following reasons:

1\. It uses CS jargon. (Bishop's book, while great, uses math/physics
notation/jargon, which adds a barrier to entry.)

2\. It is more up to date and comprehensive (It covers everything from
probabilistic models, traditional models, to neural networks all the way up to
2015 or so, unlike say ESL which is more introductory)

3\. Everything in deep learning past 2015, is better learnt through
papers/video lectures than any book. (A lot of it is intuition and not truths.
There is a certain authority to books that belies our lack of understanding of
NNs. Opinions on popular operations such as Dropout, Batch Norm, saliency maps
have changed drastically over the last few years)

4\. My ML professor used it for our upper grad level ML course and I came out
very satisfied (nothing quite like personal validation). I have read ESL and
found it to be better as reading for an intro to ML course. I tried reading
Bishop, and didn't like it :|

~~~
kanwisher
Looks like the Murphy book is $80 in hardback, and there is a similarly priced
Kindle version that doesn’t even have a cover image. Feels like this book is
really for the college textbook market. Likely why it hasn’t gotten wider
coverage.

------
jll29
You want to read either Elements of Statistical Learning (2nd ed.) OR Kevin
Murphy's ML book for the theory.

Then you will want to consult a text book in your work domain (e.g.
introduction to speech & language OR statistical natural language processing
for the domain of natural language processing).

And finally, you will want either a book, or free online Web
resources/tutorial videos that show you how to do things in practice, given a
particular programming language and tool-set (e.g. Python + TensorFlow, Java +
Deeplearning4j).

This recipe of Theory + Application + Practice/Tools should get you there.

~~~
jll29
BTW, it is okay to read stuff in parallel, and to take a non-linear approach
to learning. Also people have different preferences for textbook styles, and
they vary regarding pre-existing background knowledge.

------
auraham
I have Deep Learning with Python (Chollet, first edition) and Hands-On Machine
Learning (Géron, first edition). Both books are highly recommended.

Introduction to Statistical Learning is also available for free online:

[http://faculty.marshall.usc.edu/gareth-james/ISL/](http://faculty.marshall.usc.edu/gareth-james/ISL/)

Although I only read a few chapters from that book, I really like it (but I
would have preferred a python version of the book).

Personally, if you have to pick three books from the list, you can start with
these three options.

~~~
alanfranz
You can find a couple of repos (google them) that show the exercises in
Python, I had written a post on my own blog some time ago:
[https://www.franzoni.eu/machine-learning-a-sound-primer/](https://www.franzoni.eu/machine-learning-a-sound-primer/)

~~~
auraham
Good to know, thanks for the link!

------
mrslave
I appreciate that on HN this article may provoke intelligent conversation, but
this article smells a lot like a "top N things you need right now" blog
post where everything is an Amazon affiliate link.

(Edit: I was wrong about affiliate links. Not deleting, to publicly
self-shame.)

The real data science would be producing articles like this automatically, and
with good SEO, to drive revenue.

The list is OK. I've studied ISL (and some of ESL). A friend really enjoyed
Think Stats. Charles Wheelan's book is in the same vein as How To Lie With
Statistics (Darrell Huff?) but in greater depth.

I started with R because that's what my team was mostly using when I got into
this area. Hadley Wickham's free books are good too.

------
blackbear_
Pattern Recognition and Machine Learning (PRML) is absolutely recommended.
Available for free as well! [1]

The Elements of Statistical Learning (ESL) [2] is also a very reputable
source.

They also make a good pair: PRML has a Bayesian approach while ESL is
traditionally frequentist.

[1]
[http://users.isr.ist.utl.pt/~wurmd/Livros/school/Bishop%20-%...](http://users.isr.ist.utl.pt/~wurmd/Livros/school/Bishop%20-%20Pattern%20Recognition%20And%20Machine%20Learning%20-%20Springer%20%202006.pdf)

[2]
[https://web.stanford.edu/~hastie/ElemStatLearn/](https://web.stanford.edu/~hastie/ElemStatLearn/)

~~~
sgillen
ESL is an excellent reference, if you are looking for an introduction instead
I can recommend “An introduction to statistical learning” by (most of?) the
same authors.

------
melling
I like the Data Science from Scratch book. I’m working through the book and
rewriting the examples in Swift:

[https://github.com/melling/data-science-from-scratch-swift/b...](https://github.com/melling/data-science-from-scratch-swift/blob/master/README.md)

I haven’t read it yet but the Think Stats book is available for free online:

[http://greenteapress.com/thinkstats2/html/index.html](http://greenteapress.com/thinkstats2/html/index.html)

PDF:
[https://greenteapress.com/thinkstats2/thinkstats2.pdf](https://greenteapress.com/thinkstats2/thinkstats2.pdf)

------
bart_spoon
Deep Learning with Python (Chollet) and Hands-On Machine Learning (Géron) may
be a bit redundant currently. With the 2.x versions of TensorFlow, Keras (which
is what Deep Learning with Python covers) has been completely integrated into
the TensorFlow API (which is covered in Hands-On Machine Learning). Both
books are good, but the newest edition of Hands-On Machine Learning is updated
for TensorFlow 2.0, and so it is probably the more relevant of the two.

------
henrik_w
The article notes that "Designing Data-Intensive Applications" is perhaps not
a typical data science book, but it is still very useful. I agree. It is a
fantastic book - one of the best technical books I have read. I wrote up why,
along with a summary of it, here:

[https://henrikwarne.com/2019/07/27/book-review-designing-dat...](https://henrikwarne.com/2019/07/27/book-review-designing-data-intensive-applications/)

~~~
s1t5
There's a very real chance that it's mentioned in the article purely because
it has the word "data" in the title.

------
woko
I find it weird that this (free) book was omitted:

\- Goodfellow et al. Deep learning. MIT press, 2016.
[https://www.deeplearningbook.org/](https://www.deeplearningbook.org/)

And there is the book about DL with Python published by Manning in 2017, but
not the book about DL with PyTorch, published by Manning in 2020?

\- Stevens et al. Deep Learning with PyTorch. Manning Publications, 2020.
[https://www.manning.com/books/deep-learning-with-pytorch](https://www.manning.com/books/deep-learning-with-pytorch)

~~~
s1t5
Goodfellow's book is a poor learning experience. It only makes sense if you
already know the material in the book.

------
Anirudh_Singh
Data Science From Scratch by Joel Grus is my favorite book, and it is the best
one for people who don't know about data science.

------
muktabh
If someone finds it helpful, here is a list of 25 free Machine Learning books
available online: [https://blog.paralleldots.com/data-science/24-best-and-free-...](https://blog.paralleldots.com/data-science/24-best-and-free-books-to-understand-machine-learning/)
and another list of over 50 free Data Science books: [https://blog.paralleldots.com/data-science/50-must-read-free...](https://blog.paralleldots.com/data-science/50-must-read-free-books-for-every-data-science-enthusiast/)
we compiled. Pick your poison.

------
stared
While there are some good picks (especially "An Introduction to Statistical
Learning with Applications in R" and "Deep Learning with Python" by Chollet,
the Keras author), I am surprised it is missing "Information Theory,
Inference, and Learning Algorithms" by David MacKay
([http://www.inference.org.uk/itila/book.html](http://www.inference.org.uk/itila/book.html)).

The "Bayesian Inference and Machine Learning" track gives a nice foundation to
anything with "log loss". After that even k-means won't be an ad-hoc
algorithm.
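To make the "log loss" remark concrete, here is a minimal sketch (not taken from MacKay's book; the function names and toy numbers are made up for illustration) showing that binary log loss is just the average negative log-likelihood of the labels under a Bernoulli model:

```python
import math


def log_loss(labels, probs):
    """Mean binary cross-entropy ("log loss") of predicted probabilities."""
    total = 0.0
    for y, p in zip(labels, probs):
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)


def bernoulli_likelihood(labels, probs):
    """Probability of observing the labels if each is Bernoulli(p)."""
    likelihood = 1.0
    for y, p in zip(labels, probs):
        likelihood *= p if y == 1 else 1 - p
    return likelihood


# Hypothetical toy data: true labels and a model's predicted P(y = 1).
labels = [1, 0, 1, 1]
probs = [0.9, 0.2, 0.7, 0.6]

nll = log_loss(labels, probs)
lik = bernoulli_likelihood(labels, probs)

# Minimizing log loss is the same as maximizing the likelihood.
print(nll, -math.log(lik) / len(labels))
```

The two printed numbers agree up to floating-point error, which is the sense in which a "log loss" objective is maximum likelihood under a probabilistic model.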

------
Areibman
It's definitely one of the least technical ones in the list, but "Weapons of
Math Destruction" is a must read for anyone deploying models that directly
affect users. Its general point is that opaque judgement systems, such as how
social media platforms these days seem to arbitrarily remove content, are only
going to get worse with black box models. AI ethics are seldom taught, and
this one provides a good framework for thinking about how to build effective,
yet fair systems.

------
the_decider
I personally learn better through coherent code examples rather than math
notation / jargon. Manning's excellent "Data Science Bookcamp: Five Python
Projects" teaches data science basics using code, not greek symbols. Also, the
projects themselves are pretty cool and challenging, but doable.
[https://www.manning.com/books/data-science-bookcamp](https://www.manning.com/books/data-science-bookcamp)

------
mumblemumble
I would love to see a book like Alex Reinhart's _Statistics Done Wrong_ make
these lists more often.

It's not focused on data science specifically. But it's short, sweet, and just
might leave you better prepared to tell whether the emperor is wearing actual
clothes, or if it's just an artfully arranged assemblage of TensorFlow
operators.

------
iamacyborg
As a dumb marketer, I very much enjoyed John Foreman's Data Smart.

[https://www.goodreads.com/book/show/17682206-data-smart](https://www.goodreads.com/book/show/17682206-data-smart)

~~~
reactspa
I skimmed the book a few days ago and thought it looked interesting, and put
it in my reading list.

Got any other companion book recos?

I'm particularly interested in data sets that I could play with in Tableau,
Google Data Studio, and Excel.

------
olcor
Link to previous HN post which has a list of all ebooks Springer made
available for free:

[https://news.ycombinator.com/item?id=23520545](https://news.ycombinator.com/item?id=23520545)

Has quite a few of the books mentioned in the post and the comments here.

~~~
euph0ria
Thanks! Didn't know about this

------
danielchavez
In my experience my professors always cited Hastie, Tibshirani, and Friedman's
The Elements of Statistical Learning as the reference for most of the tasks
you would need to perform as a data scientist. For an AI/ML researcher Murphy
is probably more comprehensive.

------
yters
My favorite is "Learning from Data". Builds from first principles and simple
algorithms, so I really understand what makes machine learning work, and how
to debug as well as build my own techniques vs use off the shelf black boxes.

------
ju-st
My favorite book is "Python for Data Science For Dummies". It won't help you
in a FAANG job interview but in my opinion is super helpful when you just want
to get some insights out of data you have on hand.

------
axegon_
A worthy mention imo: Machine Learning Pocket Reference by Matt Harrison, a
tiny book packed with a ton of fundamental concepts and great examples.

------
input_sh
Was expecting more like the first few.

I have plenty of textbooks in my reading list already.

------
cullinap
Grokking Deep Learning is an awesome book!

