
A Junior Data Scientist Bookshelf (including Free Versions and HN Discussions) - gghyslain
http://ghyslain.me/bookshelf
======
gghyslain
Thanks everyone for the positive feedback. I have not had much time yet to
write full reviews of all the books, but I'll work on it - so far this
page is more of a personal "bookmark". But to reply to @Nekopa and
@carlsednaoui, here is a short review of the first few books.

I have taken a really pragmatic approach to reading them - focusing first
only on the parts relevant to my projects.

# An Introduction to Statistical Learning (ISL) / The Elements of Statistical
Learning (ESL)

I focused on chapters 8-9 of ISL, on Tree-Based Methods and SVMs, two
algorithms I used for my dissertation project. I found ISL to provide very
clear explanations of the algorithms with just enough mathematical formalism.

I have a good math background so ESL was interesting to go through. But I am
more of a practical person, and I found ISL to be more suited for me when it
came down to working on my project and supporting my choices.

# Python Machine Learning

Really great hands-on book! Sebastian Raschka does a good job of guiding you
through all the steps of an ML project: data pre-processing, feature
engineering, model selection... - all the steps are defined and covered with
practical examples.

I strongly recommend this book if you are just starting out with ML and feel
"lost" about how to start your own project.

# Taming Text

I decided to use text data I had available for my dissertation project.
However, half-way through the book I realized my dataset was too small to apply
any of the techniques described there. I still like the practical approach and
in the end the book gave me a good idea of what can be done with text.

# Advanced Analytics with Spark

I picked up this book once I started working on putting my project into
production - we use Apache Spark (Scala) at work.

It provided me with a good introduction to Spark BUT it's based on the RDD API
and, as stated on the Spark website: "As of Spark 2.0, the RDD-based APIs in the
spark.mllib package have entered maintenance mode. The primary Machine
Learning API for Spark is now the DataFrame-based API in the spark.ml
package."

I'm now mostly relying on the Spark docs / API; I'm not aware of any
up-to-date books yet :)

------
clumsysmurf
"R for Data Science" by Garrett Grolemund & Hadley Wickham was recently
completed.

[http://r4ds.had.co.nz/](http://r4ds.had.co.nz/)

The ebook is free online; you can also buy it from Amazon & O'Reilly.

------
nekopa
Nice list! I especially like that you added Resonate in there.

Could you add your personal reason for keeping these books on your shelf? That
would make the page more interesting, and maybe help you out with your job
search, as it will give an insight into your thought processes.

~~~
tomasz_bekas
I don't get why Resonate is there. From the Amazon description I gather that
this book helps with creating meaningful presentations (?), yet it landed on
the list among Data Science books and now I read a positive comment about it.
Can anyone elaborate on why this book is worth any aspiring Data Scientist's
attention?

------
baldeagle
As a senior practitioner in the field, I feel a few years removed from my
initial learning chunk. I really like this list as a throwback to see how I
would have done it today.

~~~
lukejduncan
I'm a software engineer, about a year into a career shift into data science /
machine learning. I'd second the majority of the books. Some were new to me. I
did a lot of on-the-job learning, but Intro To Statistical Learning was my
first formal treatment of much of the material outside 1:1 mentorship. I wrote
my thoughts here [https://www.linkedin.com/pulse/introduction-statistical-lear...](https://www.linkedin.com/pulse/introduction-statistical-learning-book-luke-duncan)

I'm also halfway through the Deep Learning Book now. I'm really enjoying it. I
got turned on to the book because there were a large number of people and
reading groups at work (LinkedIn) that had organized around the book when it
first came out in HTML format a few months ago.

------
bssrdf
Still debating whether I should start with An Introduction to Statistical
Learning (ISL) or Bishop's Pattern Recognition and Machine Learning (PRML). I
really don't like using R (always been a Python person). Both have rave reviews on
Amazon. Any thoughts?

------
fixxer
Solid start. I'd strongly suggest adding some Bayesian modeling books; start
with Gelman.

If you look at the academic lineage of many of these authors, it will also
help you understand how they get stuck in their own little biases.

------
carlsednaoui
This is awesome, thanks for sharing. Ditto what Nekopa said, curious to hear
why you like each of these resources.

