Foundations of Data Science [pdf] (cornell.edu)
165 points by Anon84 61 days ago | 20 comments

Looks like a mess. Use this instead:


this is a weird collection of ML/computer vision/data processing topics

wavelets as its own chapter, deep learning only has GANs as a single subsection, graphical models several chapters after the ML chapter...just weird arrangement choices

it's a theory-heavy intro to a wide variety of analytical techniques. it's a common refrain that deep learning has very little in the way of theory, and in fact GANs, with their optimization scheme framed as a game played until equilibrium, are one of the few pieces of theory there is, and an interesting one.

If I read and master most of the stuff in this book, am I qualified to do data science?

I'd recommend Elements of Statistical Learning or ISLR instead, if you want to start with a theory-heavy introduction. I think you'd better learn most of what you need for DS through projects or on the job.

Also, as others have mentioned, some of the most important skills for DS are data munging, data "presentation", and soft skills like managing expectations / relationships / etc.

I would not recommend this book if you want to get into DS with the idea that, "I'll read this and then I'll know everything I need to." It's too dense and academically-focused, and it would probably be discouraging if you try to read this all without getting your feet wet.

It depends what you mean by "do data science". If you've read and mastered what's in the book, you'll have a good chance at writing a nice PhD in machine learning. But there's a lot of academic fluff in there that isn't too useful outside of university.

Most data scientists are consumers of algorithms, not producers of algorithms. The rules are a bit different if you're at a bigco, but most data scientists don't do active research. It's nice to have a solid theoretical understanding of machine learning, but most data scientists' day-to-day consists of chaining together libraries and building nice dashboards.

> but most data scientists' day-to-day consists of chaining together libraries and building nice dashboards.

This - though I'd add that data science is also about statistical intuition, knowing what questions to ask of the data and how to get sensible results. It's not essential to understand that PCA involves finding the eigenvectors of the covariance matrix, but it is essential to know when PCA would be useful, and common gotchas that might make it irrelevant.
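For anyone curious what "the eigenvectors of the covariance matrix" looks like in practice, here is a minimal sketch with NumPy. The function name `pca_eig` and the synthetic two-feature data are made up for illustration; in real work you'd more likely reach for a library implementation.

```python
import numpy as np

def pca_eig(X, n_components):
    """Project X (n_samples x n_features) onto its top principal components."""
    Xc = X - X.mean(axis=0)                  # center each feature
    cov = np.cov(Xc, rowvar=False)           # feature-by-feature covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    components = eigvecs[:, order]           # columns are the principal directions
    return Xc @ components, eigvals[order]

# Illustrative data (an assumption): the second feature is almost 2x the first.
rng = np.random.default_rng(0)
x = rng.normal(size=500)
X = np.column_stack([x, 2 * x + 0.1 * rng.normal(size=500)])

scores, variances = pca_eig(X, 1)
explained = variances[0] / np.trace(np.cov(X, rowvar=False))
```

Here `explained` is the share of total variance captured by the first component. One of the common gotchas: PCA is sensitive to feature scale, so a feature measured in large units can dominate the components unless you standardize first.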

By do data science, I mean the day to day stuff. So actually it's not that hard then if all they are doing is chaining together libraries.

Is there anything in the day-to-day stuff that this book would be missing?

I think what's missing is how to assess whether what you've built actually does what it's meant to do. Too many textbooks (actually almost all of academia) boil evaluation down to the calculation of a handful of narrow metrics.

But really understanding the failure modes of what you've made is much harder than that. Unfortunately I don't have a magic bullet or even a reference for that side of things. It's just something I've learned on the job. To paraphrase the Anna Karenina principle [0]:

All good models are alike; each bad model is bad in its own way.

[0] https://en.wikipedia.org/wiki/Anna_Karenina_principle
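A toy illustration of how a single aggregate metric can hide a failure mode: break accuracy down by slice of the data. This is only a sketch; the slice names, the records, and the helper `accuracy_by_slice` are all made up for the example.

```python
from collections import defaultdict

def accuracy_by_slice(records):
    """records: iterable of (slice_name, y_true, y_pred).

    Returns (overall accuracy, {slice_name: accuracy}).
    """
    hits = defaultdict(int)
    counts = defaultdict(int)
    for slice_name, y_true, y_pred in records:
        counts[slice_name] += 1
        hits[slice_name] += int(y_true == y_pred)
    overall = sum(hits.values()) / sum(counts.values())
    per_slice = {name: hits[name] / counts[name] for name in counts}
    return overall, per_slice

# Made-up predictions: 90 "common" cases all correct, 10 "rare" cases mostly wrong.
records = (
    [("common", 1, 1)] * 90
    + [("rare", 1, 0)] * 8
    + [("rare", 1, 1)] * 2
)
overall, per_slice = accuracy_by_slice(records)
```

With these numbers the overall accuracy is 92%, while the "rare" slice is right only 20% of the time. Which slices are worth checking is exactly the part that depends on the problem.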

> By do data science, I mean the day to day stuff. So actually it's not that hard then if all they are doing is chaining together libraries.

Anyone who believes this is dangerous and shouldn't be allowed anywhere near a data science project. The programming is easy, but working with data and making good inferences is very hard.

Sounds a bit like gatekeeping to me. I mean it isn't magic, it's still 0s and 1s which yield results that are of some use to someone. Sure, it would be nice if everyone were a master of DS but in reality, people make mistakes and sadly, or luckily depending on how you look at it, you can still get useful results even when some of your assumptions about the data are wrong.

But I understand what you mean, and completely agree that poseur amateurs who treat DS as just easy "number fluff" and expect fancy ML frameworks to solve all their problems are cancerous and ruin the reputation of DS as a field.

Yet the more I have worked with regular people at work, the more I have moved from the camp of always doing the "right thing" to hoping that people would just "do something". I can't fix every problem, so maybe I'll just deal with reality as it is and try to make the best of it. I won't start quizzing people at work about DS know-how, but maybe I'll silently guide them towards understanding what they are doing instead of driving them out of the room and keeping them somewhere far away from the data science team.

Communication skills, business insights, politics ... those for starters ...

I mean on the technical side. The stuff you describe is required for any job.

The book looks fine, if a little random in its selection of topics, and it's probably as good a start as any, but it is neither necessary nor sufficient. Much of data science in 'the field' is hunting down, gathering and cleaning data, knowing the right 'black boxes' to chain together in the right order to get the result you want, and knowing how to present the results in an easy-to-understand way so that the people who have to make the decisions can make them.

Knowing the theory is important, but most of the time what you actually need is to quickly knock together a script that pulls in, cleans up and merges some dirty data from 3 different sources, selects the right out-of-the-box algorithm for the situation and presents the results in a clear and pedagogical way.
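To make the "pull in, clean up and merge" part concrete, here is a rough sketch using only the Python standard library. Everything in it is hypothetical: the three inlined sources, their column names, and the cleaning rules are invented for illustration, and a real script would read files or query APIs instead.

```python
import csv
import io

# Three hypothetical "dirty" sources, inlined so the sketch runs on its own.
CRM = "id,name\n 1 ,Alice\n2,bob \n3,\n"                    # stray spaces, a missing name
ORDERS = "customer_id,total\n1,10.50\n2,n/a\n1,4.50\n"      # an unparseable amount
REGIONS = "ID;Region\n1;north\n2;SOUTH\n"                   # different delimiter and casing

def read_rows(text, delimiter=","):
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))

def clean_id(raw):
    return int(raw.strip())

def merge_sources():
    customers = {}
    for row in read_rows(CRM):
        name = row["name"].strip().title()
        if not name:
            continue  # drop records with no usable name
        customers[clean_id(row["id"])] = {"name": name, "total": 0.0, "region": None}
    for row in read_rows(ORDERS):
        try:
            amount = float(row["total"])
        except ValueError:
            continue  # skip amounts such as "n/a"
        cid = clean_id(row["customer_id"])
        if cid in customers:
            customers[cid]["total"] += amount
    for row in read_rows(REGIONS, delimiter=";"):
        cid = clean_id(row["ID"])
        if cid in customers:
            customers[cid]["region"] = row["Region"].strip().lower()
    return customers

merged = merge_sources()
```

Silently skipping bad rows keeps the sketch short; in practice you'd usually want to count or log what was dropped, since that is where a lot of the surprises hide.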

Looks like you need a _lot_ of foundational stuff first. Page 17 has already jumped into multiple integrals, which at least when I was studying undergraduate calculus 30 years ago wasn’t covered until Calc IV. The appendix rushes through a lot of stuff you’re apparently expected to already have had some exposure to like Taylor series, probability density functions, eigenvalues and eigenvectors… it looks like you’d need at least an undergraduate (if not a graduate) level math degree before you could make much sense of the contents of this book.

Oddly enough that’s stuff that would have been optional in my CS program but was all mandatory for EE.

That sounds about right - EE is a lot of differential equations (or so I’m told), so you’d need a rock-solid calc framework for it. When I took CS in the early 90’s, calc I & II were required, but that was it.

I have taken a course based on this book (before it was retitled "foundations of data science"). It was more or less a math class where all homeworks are about proving stuff. You won't write a single line of code and usually don't even talk too deeply about the applications.

So no, you'd be completely unqualified. You would however gain a deep understanding that might help you come up with novel techniques for solving large-scale problems.

I was in one of Hopcroft's courses (an author of this book) for a while and if this book is on brand with his style and fields of study, it's super theory heavy. You won't write a line of code; you would not be useful to a company hoping to apply data science in the world.

I was hoping there would be a section/mention on data privacy, but a CTRL+F revealed no mention of the word.
