
Foundations of Data Science [pdf] - Anon84
https://www.cs.cornell.edu/jeh/book2016June9.pdf
======
yummyfajitas
There's a great quote by some bodybuilder. "Everybody wants to get big, but no
one wants to lift heavy weights."

Paraphrasing this to data science: "Everybody wants to have software provide
them insights from data, but no one wants to learn any math."

The top two comments here illustrate this perfectly. Anyone who is serious
about learning data science will read this book and will not shy away from
learning math. You can also learn about data pipelines, but that's not a
substitute for what's in this book.

There are also a variety of other algorithmically focused machine learning
books. They are also not a substitute for this book.

~~~
rocky1138
This is one of the reasons that I really dislike the term "data scientist."
Essentially the job is Analyst, which is something we've had for many years.
It's not a bad thing, but there's no reason to embellish it into something
it's not.

The primary reason I dislike this term, though, is because all scientists are
data scientists. Data is how the whole thing works.

~~~
yummyfajitas
Originally "data scientist" meant a statistician who could write distributed
systems to implement their algorithm in production.

Now that the term is diluted, I propose we adopt the banking industry term
"quant developer" to refer to that.

~~~
pkaye
I have no experience in Data Science. How much need is there for a distributed
system? How many data points would one need to necessitate it? What if we had
a billion data points. Would it be sufficient to run on a fast system
overnight to crunch data?

~~~
yummyfajitas
According to me (note zintinio's reply - he disagrees), the etymology of the
term is the following.

Back in the day, if you had small data you didn't need a data scientist. You
needed a statistician. He'd do some shit in SAS/Python/etc, reading one CSV
and writing another. Then the developers could run that on a beefy server with
cron and push the CSV output someplace else.

At some point during the 2000s this stopped working - things became
complicated and tangled enough that you couldn't just munge CSVs like this.
You needed folks who understood the math well enough to come up with
algorithms, and who also understood the computer science well enough to scale
out. These folks were termed "data scientists". Banks call them "quant
developers".

Very often you don't need a distributed system, or anything that isn't
trivially parallelizable. Most of the time when I make money it's based on a
CSV < 100GB - often it fits in RAM. Nowadays I don't think it's misleading to
call yourself a "data scientist" if you can only handle such data sets.
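
As a rough illustration of the single-machine case, here is a minimal sketch of a one-pass aggregate over a CSV that never holds the file in memory. The column name `amount`, the function name `streaming_mean_std`, and the tiny in-memory sample are made up for the example; it uses Welford's running-variance update and only the Python standard library.

```python
import csv
import io
import math


def streaming_mean_std(rows, column):
    """One pass over an iterable of dict rows; memory use stays constant."""
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the mean (Welford)
    for row in rows:
        x = float(row[column])
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    std = math.sqrt(m2 / (n - 1)) if n > 1 else 0.0
    return n, mean, std


# Tiny in-memory stand-in for a large file (hypothetical data). With a real
# file you would pass csv.DictReader(open(path, newline="")) instead.
sample = io.StringIO("user,amount\na,2\nb,4\nc,4\nd,4\ne,5\nf,5\ng,7\nh,9\n")
n, mean, std = streaming_mean_std(csv.DictReader(sample), "amount")
print(n, mean, round(std, 3))
```

Because each row is read once and then discarded, the same loop should work on a file far larger than RAM; whether an overnight run is fast enough depends on the disk and on how much work is done per row.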

I recently learned that Facebook calls their business analysts data
scientists. I guess it's just sexier.

------
daemonk
The high theory stuff is great, but a significant portion of the job is being
a data janitor. Being experienced and fast at manipulating data structures,
recognizing patterns in text datasets, understanding common formats used in
the field and just having domain knowledge in what you are analyzing should be
more emphasized in my opinion.

~~~
crispyambulance
I sort-of agree but the "data janitor" knowledge can be learned "on the job"
ad-hoc and as-needed.

Mastering basic theory, on the other hand, needs a coherent and structured
study-plan which requires extended focus and single-minded emphasis (at least
for most folks).

~~~
coldtea
> _I sort-of agree but the "data janitor" knowledge can be learned "on the
> job" ad-hoc and as-needed._

Logically yes. But it's amusing how many scienty types (PhDs et al) who know
all the fundamentals in theory can't do such practical tasks if their life
depended on it.

It's like they thought the theorems and abstract objects they've learned would
never be encountered in the wild.

~~~
Ar-Curunir
Yet PhDs are disproportionately more likely to get the interesting jobs that
require stats: quants, experimental positions at Google, etc.

------
onetwo12
You can read material similar to chapter two (high dimensional spaces) in
[https://jeremykun.com/2016/02/08/big-dimensions-and-what-
you...](https://jeremykun.com/2016/02/08/big-dimensions-and-what-you-can-do-
about-it/)

Also, material similar to chapter three, on SVD, is in
[https://jeremykun.com/2016/04/18/singular-value-
decompositio...](https://jeremykun.com/2016/04/18/singular-value-
decomposition-part-1-perspectives-on-linear-algebra/)

and [https://jeremykun.com/2016/05/16/singular-value-
decompositio...](https://jeremykun.com/2016/05/16/singular-value-
decomposition-part-2-theorem-proof-algorithm/)

The advantage is that you have the Python code available.

[https://jeremykun.com/2015/04/06/markov-chain-monte-carlo-
wi...](https://jeremykun.com/2015/04/06/markov-chain-monte-carlo-without-all-
the-bullshit/)

The book seems to be interesting.

~~~
j2kun
Indeed, the Monte Carlo post was inspired by some discussions after reading
that chapter. I also borrowed and adapted a proof or two from the SVD chapter
for that second post.

~~~
onetwo12
Your Monte Carlo post is interesting, but it doesn't consider the context of
Markov networks; Norvig's _Artificial Intelligence: A Modern Approach_ has good
examples. There is a subtle point: given probabilities locally, you must prove
that the global structure is a probability, and this requires some order in the
vertices to propagate the information from the root to the leaves. Also, Markov
models can have higher degree, and then they are not like random walks. Anyway,
the context of matrices and eigenvalues is interesting.

------
itissid
I was surprised to see so little attention paid to regression. Regression is
a powerful tool in a DS's toolkit.

For example:

\- What is the distribution of the residuals, and how does it change over time
as data comes in? How Gaussian are they (or not)? Analyze oddities, especially
around the tails.

\- Which features offer the most significant signal to the model, and which
ones do not?
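
The residual questions above can be sketched roughly as follows. This is only an illustration on synthetic data (a made-up true line `1 + 3x` with Gaussian noise), using a closed-form least-squares fit and the Python standard library; the helper names `fit_line` and `residual_summary` are invented for the example. It checks how Gaussian the residuals look through skewness, excess kurtosis, and the share of residuals more than two standard deviations out. It does not cover drift over time or feature importance.

```python
import math
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x, closed form."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    return a, b


def residual_summary(xs, ys, a, b):
    """Skewness, excess kurtosis, and two-sigma tail share of the residuals."""
    res = [y - (a + b * x) for x, y in zip(xs, ys)]
    n = len(res)
    mean = sum(res) / n
    sd = math.sqrt(sum((r - mean) ** 2 for r in res) / n)
    skew = sum(((r - mean) / sd) ** 3 for r in res) / n
    kurt = sum(((r - mean) / sd) ** 4 for r in res) / n - 3.0
    tail = sum(abs(r - mean) > 2 * sd for r in res) / n
    return {"skew": skew, "excess_kurtosis": kurt, "tail_share_2sd": tail}


random.seed(0)
xs = [i / 100 for i in range(2000)]
# Synthetic data, assumed for the demo: true line 1 + 3x plus Gaussian noise.
ys = [1.0 + 3.0 * x + random.gauss(0.0, 0.5) for x in xs]
a, b = fit_line(xs, ys)
summary = residual_summary(xs, ys, a, b)
print(round(a, 2), round(b, 2), summary)
```

For Gaussian residuals, skewness and excess kurtosis should sit near zero and the two-sigma tail share near 4.6%; large departures are a hint to look harder at the tails.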

These skills are even applicable to SVM and other classification analysis.

~~~
arahuja
Agree with this, simple concepts taught in-depth with lots of applications and
practice are very useful. Knowing 10 ML algorithms superficially will never be
as good as knowing 1 with all of its caveats.

------
mmcclellan
"Background material needed for an undergraduate course has been put in the
appendix."

So just being honest, the Appendix is still rather terse and advanced for me.
Does anyone have suggestions for prerequisite readings that would help prepare
someone for this text?

~~~
iaw
Have you taken the standard undergraduate engineering math coursework?

\- Calculus covering Derivation, Integration, Multi-Variate

\- Linear Algebra

\- Differential Equations (May not be relevant here)

'Discrete math' is also useful.

Some of the derivations in the book you can take on faith and not fully prove
out to save time, but you should feel 100% comfortable/confident with the
notation used.

Let me know what's confusing you and we can try to figure out what you are
missing.

~~~
laichzeit0
No, remember, this is Hacker News. Most people here learned to code "in a
weekend" and they can pick up a new framework "in a weekend". They also see no
use wasting money on a 4 year computer science curriculum if you can just
learn to program "in a weekend".

Ok I'm being extremely facetious here but it still shocks me when people
comment in machine learning or data science threads asking about how they'd go
about picking this up but you can clearly tell they have no formal background
in the sciences. I guess the only reason it angers me is because if they had
just done a CS degree instead of trying to "hack it" none of this would seem
like magic.

~~~
1457389
I'm in this category. I guess my excuse is that I did Bio instead. What's the
best way to get up to speed without having to go back to school? I stopped at
linear algebra in undergrad but I probably need to refresh that as well. Would
appreciate any thoughts you have.

~~~
laichzeit0
Without getting bewildered I would suggest going through these 3:

1\. Book of Proof by Hammack
([http://www.people.vcu.edu/~rhammack/BookOfProof/](http://www.people.vcu.edu/~rhammack/BookOfProof/))

2\. Calculus by Spivak

3\. Linear Algebra Done Right by Axler

Be prepared to work through all (or at the very least the odd-numbered)
exercises. If you can't stomach that or find that life gets in the way of you
completing even these very basic books, you do not have the time or discipline
required to advance in mathematics.

~~~
1457389
Thank you!

~~~
sn9
If you make it through all of those, pick up Hubbard and Hubbard's _Vector
Calculus_ for a unified treatment of multivariable calculus and linear
algebra.

------
Jcol1
What sort of foundational mathematics is required to fully consume this book?

I assume multivariate calculus and linear algebra?

~~~
iaw
Skimming, that's the impression I got. Discrete math is always super helpful,
some rigorous stats pops up here and there, and then there are things here and
there that you just need to look up on Wikipedia.

------
mizzao
Despite being familiar with most of the material here and agreeing that it is
generally useful to know, I still don't know if I'd call this book
"Foundations of Data Science". It feels more like "Assorted topics in
algorithms, machine learning, and optimization": data science from the
perspective of a computer scientist.

Notably missing are causal inference, experiment design, and many topics in
statistics--causal inference being one of the primary things we'd want to do
with data.

------
qqzz6633
Wow, a book from Avrim Blum. Definitely worth reading.

------
Luftschiff
Would working through this book be sufficient to be able to start doing data
science/analysis work?

Some context: I did my undergraduate degree in Economics (in a pretty math
intensive university), have been working in marketing for the last 2 years and
want to go back to do work in something more analysis centered.

~~~
hadley
No - while it covers the theoretical foundations, it doesn't give you any of
the practical skills you actually need to work with data.

------
swordsmith
I was planning on going through Pattern Recognition and Machine Learning by
Bishop. For those who have gone through this PDF, which do you think is more
useful for learning data science?

------
graycat
It's a book, a real book, quite long.

It has a lot of topics on applied math.

Some of its main topics are linear algebra, probability theory, and Markov
processes.

Really, the book just touches on such topics. Usually in college each of those
topics is worth a course of a semester or more. So, what the book has on such
topics is much less than such a course. E.g., for linear algebra, the book
gets quickly to the singular value decomposition but leaves its treatment of
eigenvalues for an appendix and otherwise leaves out about 80% of a one
semester course on linear algebra. Similarly for probability and Markov
processes.

Some of the topics the book has or touches on are unusual with, likely, few
other sources in book form. E.g., early on the book has Gaussian distributions
on finite dimensional vector spaces where the dimension is larger than is
common.
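
To give a feel for why high dimension changes the picture, here is a minimal sketch of one standard fact about such distributions: the length of a standard Gaussian vector in d dimensions concentrates near sqrt(d). The function name `gaussian_norms` and the sample sizes are made up for the illustration, and it is not taken from the book's text.

```python
import math
import random


def gaussian_norms(dim, samples, rng):
    """Norms of `samples` draws from a standard Gaussian in `dim` dimensions."""
    return [
        math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(dim)))
        for _ in range(samples)
    ]


rng = random.Random(0)
for dim in (2, 100, 1000):
    norms = gaussian_norms(dim, 200, rng)
    mean = sum(norms) / len(norms)
    spread = max(norms) - min(norms)
    # Columns: dimension, mean norm relative to sqrt(dim), relative spread.
    print(dim, round(mean / math.sqrt(dim), 3), round(spread / mean, 3))
```

As the dimension grows, the relative spread of the norms should shrink, so nearly all the samples sit in a thin shell rather than near the origin, which is not what low-dimensional intuition suggests.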

So, for the topics rarely covered in book form, the book could be a good
reference.

For topics such as those from linear algebra, a reader might get misled without an
actual course in linear algebra from any of the long popular books, e.g.,
Halmos, Strang, Hoffman and Kunze, Nering or more advanced books by Horn,
Bellman, or others.

Usually in universities, probability and Markov processes quickly get into
graduate material with a prerequisite in measure theory and, hopefully, some
on functional analysis, e.g., to discuss some important cases of convergence.

So, the book seems to have some good points and some less good ones. A good
point is that the book is a source of a start on some topics rarely in book
form. A less good point is that the book gives very brief coverage of topics
otherwise usually covered in full courses from popular texts.

A student with a good math background could use the book as a reference and
maybe at times get some value from the coverage of some of the topics rarely
covered elsewhere. But I would suspect that students without courses in linear
algebra, probability, etc. would need more background in math to find the book
very useful.

E.g., early in my career, I jumped into various applied math topics using very
brief treatments. Later when I did careful study of good texts with relatively
full coverage, I discovered that the brief treatments had been misleading.
E.g., no one would try to learn heart surgery in a weekend and then try to
apply it to a real person. Well, for applied math, maybe learning singular
value decomposition, etc. in a weekend might not be enough to make a serious
application.

It is good to see a book on applied math try to be a little closer to real,
recent applications than has been traditional in applied math texts. I'm not
sure that being closer is crucial or even very useful for making real
applications, but maybe it will help.

------
dandermotj
Does anyone know if this is available as a hard copy?

~~~
FraaJad
you can use a service like printme1.com

~~~
jimlawruk
439 pages is $22.29 at printme1.com

~~~
infinite8s
That's not too bad considering the content is equivalent to a first year
graduate course textbook.

~~~
ska
Which would likely cost $200 or so new, in current textbook market. So $22 is
a bit better than "not too bad".

------
blahi
That seems more about computer science and graphs than what the average
analyst would be doing.

~~~
dandermotj
Table of Contents:

* High Dimensional Space

* Best Fit Subspace & SVD

* Random Walks & Markov Chains

* Machine Learning

* Massive Data: Streaming, Sketching, Sampling

* Clustering

* Topic Models, Hidden Markov Process, Graphical Models, and Belief Propagation

~~~
tempodox
I'm new to the field (which is why the book interests me), so I have to ask:
Is your TOC a confirmation or a contradiction of the parent comment?

~~~
FraaJad
The table of contents is pretty much the core topics of serious data science
(as opposed to "learn to use data science libraries in $lang in 21 days")

So yes, this book that is available for download from a TURING Award winner's
web page IS a Computer Science book for Data Science.

~~~
Ar-Curunir
Avrim Blum is not a Turing award winner; his father Manuel is.

~~~
FraaJad
This book was posted on John E. Hopcroft's webpage,
[https://www.cs.cornell.edu/jeh/](https://www.cs.cornell.edu/jeh/); he has also
won a Turing award.

