
Data Science from Scratch: First Principles with Python - joelgrus
http://joelgrus.com/2015/04/26/data-science-from-scratch-first-principles-with-python/
======
philliproso
The book should add a subtitle: "Includes a 116-line implementation of an
in-memory database"
[https://github.com/joelgrus/data-science-from-scratch/blob/m...](https://github.com/joelgrus/data-science-from-scratch/blob/master/code/databases.py)
plus a 90-line implementation of neural networks. Wow, that is some beautiful
code.

~~~
joelgrus
Thank you! I worked very hard to make the book's code clear and beautiful.

~~~
tux
Please add some summary comments in each file, like what it does.

~~~
spot
he wrote a whole book of comments...

------
danso
I was going to ask, "Why Python 2.x"? But then I just bought the book. Hope
you don't mind if I post this excerpt:

> _As I write this, the latest version of Python is 3.4. At DataSciencester,
> however, we use old, reliable Python 2.7. Python 3 is not backward-
> compatible with Python 2, and many important libraries only work well with
> 2.7. The data science community is still firmly stuck on 2.7, which means we
> will be, too. Make sure to get that version._

I use the more popular scientific libraries (e.g. numpy, scikit, nltk), and
the bigger ones seem to have been ported over to 3.x. A few libs that come to
mind that haven't: mechanize and opencv. Has anyone here had success with
using 3.x as a data science professional, or is there some massive gaping hole
that I'm missing? (I agree that "Well, this is what the company has been
using" is a decent enough excuse to stay on 2.x in most situations.)

~~~
rdtsc
Even some projects that claim to have been ported will often have bugs in
them, because it is new code. Then the question is whether I have the time, or
the desire, to test the port on my production system. I just kind of look at
the issue or commit stream and see when the issues related to Python 3 start
to slow down a bit.

------
Omnipresent
I just finished the ML class from Georgia Tech as part of the OMSCS program. I
used SciKit for most of the assignments as they involved NNs, DT, KNN,
K-means, EM. This might be a naive question as I'm not a Python guy, but is
there a reason this book is Python-based but doesn't cover scikit-learn? For
example, what need did you see to write code for k-means [1] rather than use
an implementation that is already available [2]?

[1] [https://github.com/joelgrus/data-science-from-scratch/blob/m...](https://github.com/joelgrus/data-science-from-scratch/blob/master/code/clustering.py)

[2] [http://scikit-learn.org/stable/modules/generated/sklearn.clu...](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html)

~~~
treycausey
The title of the book includes "from scratch" for a reason -- it's from "first
principles" where you learn about something by building it up from scratch
rather than using an implementation. At the end of each chapter, Joel points
out the existing resources you can use after learning about the topic.
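
To make the "from scratch" idea concrete, here is a minimal k-means sketch in
plain Python. It is not the book's code: the function names, the fixed
iteration cap, and the seed are assumptions chosen for illustration, and it is
written for Python 3 rather than the 2.7 the book targets.

```python
import random


def squared_distance(v, w):
    """Sum of squared component-wise differences between two points."""
    return sum((vi - wi) ** 2 for vi, wi in zip(v, w))


def vector_mean(vectors):
    """Component-wise mean of a non-empty list of equal-length points."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def k_means(points, k, iterations=20, seed=0):
    """Return (means, assignments) after at most `iterations` passes."""
    rng = random.Random(seed)
    # Start from k distinct input points (an assumed initialization choice).
    means = rng.sample(points, k)
    assignments = None
    for _ in range(iterations):
        # Assign every point to the index of its closest mean.
        new_assignments = [
            min(range(k), key=lambda i: squared_distance(p, means[i]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # nothing moved, so the clustering has converged
        assignments = new_assignments
        # Recompute each mean from its members; keep the old mean if empty.
        for i in range(k):
            cluster = [p for p, a in zip(points, assignments) if a == i]
            if cluster:
                means[i] = vector_mean(cluster)
    return means, assignments
```

Calling `k_means(points, 2)` on two well-separated blobs should put each blob
in its own cluster. A library implementation such as the scikit-learn one
linked above adds things this sketch leaves out, such as smarter
initialization and multiple restarts.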

~~~
Omnipresent
Makes sense. I took "from scratch" to be about understanding rather than
implementation. Thanks for clarifying it. Looks like it'll be a great
resource.

------
jplahn
Looks great, Joel! Definitely going to check this out and start working
through it. I've noticed a huge bifurcation between extremely applied data
science and the almost entirely math-based kind. I was always wary of
'learning' data science through applications only, but as you alluded to, it's
significantly more exciting. Likewise, most introductory statistics classes
are so poorly delivered that many people have a deeply ingrained fear of the
underlying concepts.

As a side note, do you attend any data events in Seattle? I'm moving there in
June after graduation and would love to talk with somebody doing my dream job.

~~~
joelgrus
I attend a lot of data events in Seattle. Especially Data Science Happy Hours.

------
sputknick
Any chance there is a discount code to encourage early readers?

~~~
joelgrus
AUTHD

(And I didn't know that until you asked, I'm going to edit the blog post.)

~~~
arthurcolle
Thanks so much! Just got the ebook. 16 definitely beats 33, pushed me over the
edge on my student budget. :)

------
barely_stubbell
Does anyone have any recommendations of books that might pair well with this
one in the math/data/statistics space? Thought I might pick up a few books and
score some free shipping.

~~~
jkldotio
Programming Collective Intelligence: Building Smart Web 2.0 Applications by
Toby Segaran is a bit old now but is excellent, 4.5 stars on Amazon from 100+
reviews.[1] A bit of overlap with this one, but there are some great
explanations.

[1] [http://www.amazon.com/Programming-Collective-Intelligence-Bu...](http://www.amazon.com/Programming-Collective-Intelligence-Building-Applications/dp/0596529325/)

------
thehoff
Looks interesting, I'll probably pick this up.

Have you posted to DataTau?

------
jsnk
This book looks very close to what my girlfriend is looking for. She's
interested in learning bioinformatics and it's been difficult to find a good
book that introduces topics in data science in a digestible manner.

If anyone knows the book, can you give a quick overview of how much math,
stats, programming, and comp sci you'd need to read this book? Thank you.

~~~
joelgrus
I know the book, I wrote it!

Most of the math is vector space arithmetic. There are a few sections that use
matrix multiplication. The probability and stats are things like understanding
probability distributions and Bayes's Theorem. (It's all covered in the book,
but you'd need to be comfortable picking it up and using it.)

In terms of programming, not much. Someone who's _never_ programmed before
would probably have a tough time, but the goal is that someone who is bright
and hardworking and who can write fairly simple Python programs should not
have a problem. Very little CS background required. Maybe basic data
structures like list vs dict and so on.
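
For a rough sense of that level, the sketch below shows vector arithmetic on
plain Python lists and a direct application of Bayes's Theorem. The function
names and the example numbers are assumptions made up for illustration, not
excerpts from the book, and the code is written for Python 3.

```python
def vector_add(v, w):
    """Add two equal-length vectors component by component."""
    return [vi + wi for vi, wi in zip(v, w)]


def scalar_multiply(c, v):
    """Multiply every component of v by the number c."""
    return [c * vi for vi in v]


def dot(v, w):
    """Dot product: the sum of the component-wise products."""
    return sum(vi * wi for vi, wi in zip(v, w))


def bayes_posterior(prior, likelihood, false_positive_rate):
    """P(H | E) given P(H), P(E | H), and P(E | not H)."""
    # Total probability of seeing the evidence at all.
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence


# Hypothetical numbers: 1% prevalence, 99% sensitivity, 1% false positives.
posterior = bayes_posterior(prior=0.01, likelihood=0.99,
                            false_positive_rate=0.01)
```

With those made-up numbers the posterior works out to about one half, the
classic illustration of why a positive result for a rare condition is less
conclusive than it sounds.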

~~~
Goladus
Reading this comment for some reason makes me curious how much time the book
spends addressing computer science fundamentals like cpu and memory. My guess
is that it's included in bits and pieces along the way but I didn't see
anything explicit in the table of contents.

I'm thinking about it in terms of running computation in production
environments where you may be constrained by available compute resources or
budget. Some people have an intuitive grasp of cpu/memory/bandwidth and can do
performance tuning as necessary, but those who don't can find themselves in
situations where they waste a lot of resources, such as running a million
parallel jobs that each have less than 1 second of CPU time, getting stuck
after failing to request or provision nodes with sufficient memory, or
performing unnecessary reads and writes.
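
One common way around the "million tiny jobs" case is to group small work
items into larger batches, so per-job overhead is paid once per batch rather
than once per item. A minimal sketch, assuming a hypothetical per-item task
(squaring a number) and an arbitrary chunk size of 100:

```python
def chunked(items, chunk_size):
    """Yield consecutive slices of `items`, each at most `chunk_size` long."""
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]


def process_batch(batch):
    """Stand-in for real work: handle a whole batch inside one job."""
    return [x * x for x in batch]


tasks = list(range(1000))            # 1,000 tiny work items
batches = list(chunked(tasks, 100))  # submitted as 10 jobs instead of 1,000
results = [y for batch in batches for y in process_batch(batch)]
```

The right chunk size depends on the scheduler's overhead and on how much
memory each batch needs, so it is something to measure rather than guess.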

------
sonabinu
There is a sampler for a taste of what's in there
[http://cdn.oreillystatic.com/oreilly/booksamplers/9781491901...](http://cdn.oreillystatic.com/oreilly/booksamplers/9781491901427_sampler.pdf)

------
mdesq
Thanks Joel. I just purchased both the print and ebook copies from O'Reilly.
This is exactly what I've been looking for.

------
kunjaan2
Could you please post this on /r/machinelearning as well with an AMA? Thanks.

------
blumkvist
Good that it discusses overfitting.

I can't help but wonder about those recommender systems. With so little
material on statistics I have to assume it's only about observational data,
which is the best way to make the millionth+1 useless recommendation engine.

And why is it that the "data science" books never discuss DoE?

~~~
x0x0
Most data scientists, particularly those coming from the cs department, lack
most probability and statistics fundamentals. I doubt many of them have even
heard of an anova, or sampling distributions, F tests, chi2, etc.

To be fair, ml tends to focus very heavily on prediction, not inference /
interpretation of betas. In many tree models how to even understand coefs is
an open question.

~~~
ced
_anova, or sampling distributions, F tests, chi2_

Most of the big ML books are heavily Bayesian, and these subjects are less
discussed (though IIRC Gelman's book has a "Bayesian ANOVA"). Even _Elements
of Statistical Learning_, which is very frequentist in its approach, only
references ANOVA in passing. Do you have any book to recommend about these
fundamentals?

~~~
x0x0
Basic level: very approachable, filled with case studies (with R code to run
them that is easily found), but stupidly expensive: _Statistical Sleuth_ by
Ramsey (but, you know, pdfs can be found on the internets)

intermediate level, covers some blocking IIRC: _Statistics for Experimenters_
by Box et al

Advanced: _Experiments: Planning, Analysis, and Optimization_ by Wu and
Hamada. I thought it was quite good, but my classmates did not universally
love it. Unfortunately it does not come with case studies or R code to run
them; I have a bunch, but (very unfortunately) printed rather than
computerized and, in any case, probably copyrighted by my professors. The math
is not complex but can be involved for various types of blocking designs.

~~~
ced
Thanks a lot, _statistical sleuth_ looks very readable and interesting!

How do you feel about the Bayesian approach to these questions? (cf. Gelman's
_Bayesian Data Analysis_ )

