
Foundations of Data Science by John Hopcroft [pdf] - fangwang
http://www.cs.cornell.edu/jeh/book11April2014.pdf
======
luisjgomez
"Please do not put solutions to exercises online as it is important for
students to work out solutions for themselves rather than copy them from the
internet"

Why exactly?

I've found that being able to walk through a solution can be quite
illuminating. There's also an exploration/efficiency tradeoff. At some point
you can't spend any more time thinking through a problem (because life) and
being able to work through a solution still brings many (if not maximal)
benefits.

~~~
Homunculiheaded
Robert Ash, imho one of the best writers for mathematical self-study, lists
"Include Solutions to Exercises" as the #2 piece of advice in 'Remarks on
Expository Writing in Mathematics'[0]. His quote is better than any summary I
could come up with:

"There is an enormous increase in content when solutions are included. I trust
my readers to decide which barriers they will attempt to leap over and which
obstacles they will walk around. This often invites the objection that I am
spoon-feeding my readers. My reply is that I would love to be spoon-fed class
field theory, if only it were possible. Abstract mathematics is difficult
enough without introducing gratuitous roadblocks."

[0][http://www.math.uiuc.edu/~r-ash/Remarks.pdf](http://www.math.uiuc.edu/~r-ash/Remarks.pdf)

------
mrcactu5
I know this book is not for everybody, but I really like the math approach
here. Software engineers don't like it when I say "data science has been
around for 500 years, since Kepler and Newton".

Here they illustrate a nice connection between large deviations and convex
geometry, and have beautiful pictures.

So what are examples of high-dimensional vector spaces? Take the set of all
your customers (hopefully in the thousands!), where each customer's vector
records all of their transactions. There are many other examples.
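A minimal sketch of that picture, on hypothetical data (the customer matrix and its dimensions are made up for illustration): each customer is a point in a high-dimensional space, and one geometric effect the book discusses shows up immediately, namely that pairwise distances concentrate around their mean.

```python
import numpy as np

# Hypothetical toy data: each row is a customer, each column a product
# category, entries drawn at random. In practice the number of columns
# can be in the thousands, which is where high-dimensional geometry
# starts to matter.
rng = np.random.default_rng(0)
customers = rng.random((1000, 500))  # 1000 customers as points in R^500

# Distances from the first customer to all the others. In high dimension
# these concentrate: the nearest and farthest neighbors are almost
# equally far away, relative to the mean distance.
d = np.linalg.norm(customers[0] - customers[1:], axis=1)
print(d.min(), d.mean(), d.max())
```

Concentration of distance is one of the "large deviations meets convex geometry" phenomena the book illustrates.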

------
formulaT
Skimming the chapters, the book seems to take various fields of mathematics
that are used in data science and present the foundations of those fields as
the foundations of data science.

I'm biased, but I think that data science is statistics, and therefore the
foundations of data science are statistics and probability theory. If you want
to understand data science at a fundamental level, I would suggest taking
courses in these areas.

~~~
cbgb
Just for the record, from the first page: "Background material needed for an
undergraduate course has been put in the appendix. For this reason, the
appendix has homework problems."

The appendix covers Probability and Linear Algebra.

~~~
formulaT
My point was that a book with the title "Foundations of Data Science" should
be mostly probability and statistics. Undergrad probability and linear algebra
is not a solid foundation in statistics.

Statistics is a powerful lens through which to view all data science. E.g.
supervised learning is building a model of the _conditional_ probability
P(y|x). Again, I am biased, but I think that methods that do not have some
statistical interpretation are unlikely to be useful. E.g. if we take the
graph of Facebook users and apply some matrix decomposition algorithm, who
cares? What can we do with this decomposition? What does it predict?
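The P(y|x) view above can be sketched concretely. This is a minimal, assumed illustration (toy data, plain NumPy, gradient descent with made-up hyperparameters), not the book's method: logistic regression is explicitly a model of the conditional probability P(y=1|x), so its output is a probability you can act on, not just a label.

```python
import numpy as np

# Hypothetical toy data: 200 points in R^2 with labels generated from a
# known logistic model, so we can check the fit recovers the signs.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
true_w = np.array([2.0, -1.0])
y = (rng.random(200) < 1 / (1 + np.exp(-X @ true_w))).astype(float)

# Fit by gradient descent on the negative log-likelihood.
w = np.zeros(2)
for _ in range(500):
    p = 1 / (1 + np.exp(-X @ w))       # the model of P(y=1 | x)
    w -= 0.1 * X.T @ (p - y) / len(y)  # gradient step

# The fitted model outputs a conditional probability for a new x.
print(1 / (1 + np.exp(-X[0] @ w)))
```

The point of the statistical interpretation: because the output is P(y=1|x), it composes with decision theory (thresholds, expected costs) in a way a bare label does not.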

~~~
jules
While I would like to agree with this based on aesthetics, it didn't work that
way in practice. A lot of the most successful machine learning algorithms do
not have a statistical grounding or did not when they were invented. E.g.
neural networks, SVM, low rank matrix approximation, k-means, decision
trees/forests.

~~~
formulaT
Yes, the core algorithms do have this tendency (which is the exciting thing
about machine learning _per se_ ), but statistics provides the context to
understand them.

E.g. people used to say "neural networks are a simple, flexible functional
form for y = f(X, theta)". This turned out to be wrong: SGD training of neural
networks confers advantages beyond the flexibility of the functional form. But
it was a good hypothesis and starting point.

SVMs and decision trees have no statistical justification I know of. Low rank
matrix approximation and k-means are justified by latent variables and
non-parametric kernel methods respectively. I agree these justifications came
after the fact, but they _do_ give a way to understand how these models work.

Most importantly, all of the small tasks surrounding training a model are
purely statistical, e.g. cross validation, different measures of accuracy,
handling endogenous variables, etc.
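Cross validation, the first of those "small tasks", fits in a few lines. A minimal sketch on hypothetical data, using a trivial mean-predictor as the "model" just to show the fold bookkeeping:

```python
import numpy as np

# Hypothetical toy data: noisy observations around a constant.
rng = np.random.default_rng(2)
y = 3.0 + rng.normal(scale=0.1, size=100)

# k-fold cross validation: shuffle, split into k folds, and for each
# fold fit on the rest and score on the held-out part.
k = 5
folds = np.array_split(rng.permutation(len(y)), k)
errors = []
for i in range(k):
    test = folds[i]
    train = np.concatenate([folds[j] for j in range(k) if j != i])
    model = y[train].mean()                          # "fit" the model
    errors.append(np.mean((y[test] - model) ** 2))   # held-out MSE

print(np.mean(errors))  # cross-validated estimate of test error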

------
tristanz
Can somebody explain to me the underlying theory of this type of book?

The only books that have ever felt coherent to me start with p(data, unknown)
as being an approximate model of some domain. Everything then follows smoothly
as inference, modeling, and computational methods or shortcuts.

~~~
jey
I agree, but it's also nice to have a toolbox of actually tractable
algorithms. Principled practical data science should approach the theoretical
ideal (i.e. p(unknown|theta)), but often has to use some approximations that
we actually know how to implement efficiently.

~~~
jey
Heh, I meant s/theta/data/.

------
hyperbovine
*And Ravi Kannan

------
SpaceManNabs
This seems like a nice book. It covers a lot of math that is bound to be
useful in some context in machine learning. I find that books like Machine
Learning: A Probabilistic Perspective by Murphy serve beginners who have some
math background better, as the opening chapters introduce a lot of the useful
math. Murphy is not as deep as this book in the math details, but those
chapters show which mathematical tools are most used, and students can always
find literature for what they don't understand.

