
Foundations of Data Science [pdf] - necrodome
https://research.microsoft.com/en-US/people/kannan/book-no-solutions-aug-21-2014.pdf
======
snippyhollow
The second chapter, "High-Dimensional Space", talks about the problem of spikey
spheres[0] (how most of the mass is near the surface). I made an IPython
notebook to illustrate it[1].

[0] [http://www.penzba.co.uk/cgi-bin/PvsNP.py?SpikeySpheres](http://www.penzba.co.uk/cgi-bin/PvsNP.py?SpikeySpheres)

[1] [http://nbviewer.ipython.org/urls/gist.github.com/SnippyHollo...](http://nbviewer.ipython.org/urls/gist.github.com/SnippyHolloW/9025964/raw/b2d266e7e19d64e0343fd899dfbc3e8ddc889269/SpikeySpheres?create=1)
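
The parenthetical claim above (most of the mass is near the surface) can be
checked with a closed form, independently of the linked notebook. This is a
minimal sketch, assuming a solid ball of radius 1 with uniform density: volume
scales as r^d, so the fraction lying within eps of the surface is
1 - (1 - eps)^d. The name `shell_fraction` is made up for this illustration.

```python
def shell_fraction(dim: int, eps: float) -> float:
    """Fraction of a unit ball's volume within distance eps of its surface.

    Assumes uniform density, so volume scales as r**dim and the inner ball
    of radius (1 - eps) holds (1 - eps)**dim of the total.
    """
    return 1.0 - (1.0 - eps) ** dim


if __name__ == "__main__":
    # Thin shell: the outer 1% of the radius.
    for dim in (2, 3, 10, 100, 1000):
        print(f"d={dim:>4}: {shell_fraction(dim, 0.01):.4f}")
```

In low dimensions the outer 1% shell holds only a few percent of the volume;
as the dimension grows, the same shell holds nearly all of it.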

------
princehonest
"Please do not put solutions to exercises online as it is important for
students to work out solutions for themselves rather than copy them from the
internet."

I find crowdsourced solutions for honest autodidacts very valuable.

~~~
niels_olson
Thanks, so true. If you're not immersed, as in a traditional program, you're
stuck with catch-as-catch-can, which can be very inefficient. Things like
Learn Python the Hard Way are wonderful.

------
thomaskcr
Anybody who doesn't read that first chapter to the end is going to be very
confused.

> To make it easier to read we use E^2(1-x) for (E(1-x))^2 and E(1-x)^2 for
> E((1-x)^2).

Why change that notation? That seems to purposefully be introducing confusion.

On page 14 they don't use that notation (om^2(x+y) = om^2(x) + om^2(y) --
according to their notation note, that should really be
om^2(x+y) = (om(x+y))^2).

Not trying to knock what seems like a really neat introduction; I just don't
understand the need for defining ridiculously unconventional notation and then
not using it consistently, which introduces a lot of confusion.
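
To see concretely that the two quoted forms name different quantities, here is
a small numeric sketch. The sample values in `xs` are invented for
illustration, and E is taken to be a plain average over them; `fmean` comes
from Python's standard `statistics` module.

```python
from statistics import fmean

# Hypothetical sample values, chosen only to make the arithmetic easy to follow.
xs = [0.0, 0.5, 1.0]
ys = [1.0 - x for x in xs]

# E^2(1-x) in the quoted notation: take the mean first, then square it.
square_of_mean = fmean(ys) ** 2

# E(1-x)^2 in the quoted notation: square each value first, then take the mean.
mean_of_square = fmean([y ** 2 for y in ys])

print("E^2(1-x) =", square_of_mean)
print("E(1-x)^2 =", mean_of_square)
print("difference =", mean_of_square - square_of_mean)
```

The difference between the two is the variance of 1-x over the sample, which
is why it matters which one a formula means.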

~~~
PurplePanda
I've seen this notation quite commonly.

I haven't looked at the link, but based on your quote, your comment about page
14 doesn't look right. The different notation doesn't change the number of
times you need to write the operator.

Your new equation is just writing the same thing on each side, but using a
different notation. It's like a=a. Whereas their equation is apparently giving
an identity.

~~~
thomaskcr
I've never seen it but I am trying to think back to all of those times I
worked both directions on a proof and just shrugged in the middle =p.

Ah -- you're totally right on my last sentence. Thank you.

------
asquidy
This is cool, but how can you write a book about data science without
mentioning causal inference or experimental design? Most people that do data
science are not applying black box algorithms to clean data. They are actively
manipulating and shaping the data, coming up with theories, and testing those
theories. Inference is more important in theory and in practice for data
scientists than theoretical models of graph formation and some of the other
topics covered in this book.

------
kiyoto
I find the title to be linkbaity and misleading (quite disappointing for
decorated computer scientists like Hopcroft and Kannan).

Based on the table of contents, a more accurate title would be "Modern
Foundations of Theoretical Computer Science with an Eye Towards Machine
Learning", and even that gives disproportionately large weight to machine
learning.

~~~
coherentpony
> I find the title to be linkbaity and misleading (quite disappointing for
> decorated computer scientists like Hopcroft and Kannan).
>
> Based on the table of contents, a more accurate title would be "Modern
> Foundations of Theoretical Computer Science with an Eye Towards Machine
> Learning", and even that is given a disproportionately large weight on machine
> learning.

What? That's the title of the book. And it's not linkbaity at all. Linkbaity
would be something like, "Two decorated computer scientists just wrote a book
about data science, and you'll never guess what's in it!". Or, "419 things you
didn't know about data science."

Changing the title of the post doesn't change the title of the book.

~~~
cgio
I believe that GP meant that the title of the book is linkbaity, not that of
the post.

------
devilsdounut
Looks pretty academic. I see no mention of data cleaning or more practical
considerations in the table of contents.

~~~
ehurrell
"Foundations of" tends to be a common academic title start, so I would expect
an academic approach. To not talk of cleaning at all seems like a serious
error. From academic experience, it does not take long before your work
requires data not covered by pre-cleaned datasets, e.g. MovieLens or TREC.

