Hacker News
Free Data Science Books (learndatasci.com)
331 points by LearnDataSci on Sept 9, 2015 | 57 comments

There is no denying that this is a valuable resource, but I have started to get turned off by lists of n books to learn something. They can be valuable, but they can also be overwhelming and leave someone perplexed about how to get started. I believe technical books should be used to complement your knowledge of a field, not to get started in it. For example, "Secrets of the JavaScript Ninja" will be very valuable to me because I already have experience in JS, and it will help me understand some of the caveats that I might have overlooked. The best way has always been to start implementing something in the subject and dive into everything you uncover.

A blog post submitted here mentioned the same sentiment [1] -

> I can’t fully explain how immensely unmotivating it is to be given a huge list of resources without any context. It’s akin to a teacher handing you a stack of textbooks and saying “read all of these”. I struggled with this approach when I was in school. If I had started learning data science this way, I never would have kept going.

[1]: https://www.dataquest.io/blog/how-to-actually-learn-data-sci...

Second the dataquest post. Information without structure can be overwhelming, and it's important to know what the optimal ways to learn something are. Arguably this is why formal schooling was created - to provide a framework for learning...

Thank you - this is a wonderful resource that I had lost in my list of bookmarks about data science. That's another good example of information overload.

Sure, a bunch of books is no use. But for self-learning there's nothing more systematic than following one or two well-written books through. Just trying to gain everything via "practical" knowledge without any systematic guidance is definitely dangerous.

At least "Python for Data Analysis" is a pirate copy. Wonder how many others are too. But as long as you make money from affiliate links you don't care, right?

Lists of "curated" free books/resources etc. are a very active spam format these days. It's a simple and effective way of publishing without having any original content of your own. People love clicking on these things because they love the idea of learning.

What makes it seem like Python for Data Analysis is a pirated copy? I figured since it was hosted from Canisius College it would be legally distributed.

I don't want to host pirated content, so if it is I will remove it.

The book is not listed at http://www.oreilly.com/openbook/

Also the PDF has a link to a notorious ebook pirate platform on every page. If you really believe content on college pages is legal, you must be very naive. I've never seen a naive webmaster that uses domain privacy though.

Personally I wasn't surprised to see (possibly) pirated content on an .edu site with a ~username URL, as the ~ suggested a student's page, where unauthorised content might pop up to share with classmates and stay up undetected by the college.

What surprised me is that the owner of the Canisius page appears to be teaching staff rather than a student. The other books hosted there seem to be legitimately freely available, however, so I'm guessing that was also a naive mistake.

Thanks for that link, I actually didn't know O'Reilly had such a page.

I'm not very familiar with ebook pirating platforms. So the link didn't seem suspicious to me.

Anyway, the book was removed. Thanks again for pointing it out.

If you're a beginner, you're probably going to be too overwhelmed by the options. I often find emailing/asking a few different professors/researchers/students in the field you want to learn for suggestions more productive.

That's not to say this isn't helpful. This is from my own personal experience.

Also get plugged into a local meetup/user group. They are popping up everywhere. Here are some examples of R user groups. http://blog.revolutionanalytics.com/local-r-groups.html

I would also add http://mmds.org/ to the list. The link to the book is http://infolab.stanford.edu/~ullman/mmds/book.pdf

It's there. "Mining of Massive Datasets"

Is anybody aware of good books/resources on machine learning/data science in Matlab?

My SO has been trying to learn ML to further her work for a couple months now, and has had a hard time with it. She's quite intelligent, but isn't a terribly experienced programmer (she's been writing Matlab for a couple years now, but mostly in a scientific setting)... Either way, I suspect part of the problem is that most of the explanations are in a language unfamiliar to her, and expect her to learn or translate it in addition to the concepts.

Andrew Ng, the man behind the excellent ML course on Coursera, has an introduction to Deep Learning using Matlab.

[1]: Wiki with code, exercises and explanation

[2]: Video lecture one with a recap on back-propagation

[3]: Video lecture two on Sparse Auto Encoders

[4]: Handouts

In terms of books, Bayesian Reasoning and Machine Learning [5] is Matlab based. So is the Handbook of Monte Carlo methods [6].

[1]: http://ufldl.stanford.edu/wiki/index.php/UFLDL_Tutorial

[2]: http://www.stanford.edu/class/cs294a/video1.html

[3]: http://www.stanford.edu/class/cs294a/video2.html

[4]: http://www.stanford.edu/class/cs294a/handouts.html

[5]: http://web4.cs.ucl.ac.uk/staff/D.Barber/pmwiki/pmwiki.php?n=...

[6]: http://www.maths.uq.edu.au/~kroese/montecarlohandbook/

This course is great: https://www.coursera.org/learn/machine-learning It's all done in GNU Octave, which is mostly compatible with MATLAB.

I would recommend this fantastic in-depth intro to the principles and practice of Amazon Machine Learning:


(Hand-crafted by data and code guru James Counts)

I noticed something last night while watching the Djokovic US Open quarter-final. It featured an "IBM Insights" segment which claimed to have mined 8 years' worth of Majors competitions to generate stats. One interesting result went something like this: if Djokovic is able to return only 25% of his opponent's serves, then in 85% of past matches it has resulted in victory for him. The implication being that such is the strength of his defensive game.

While this is no doubt really interesting, I find I am getting diminishing returns from outputting stats like this from big dumps of past historical data. What I would like to be able to show is a live heat graph style stats tracker, where each point in the match updates my belief net about who is winning, or playing better. Of course, the final outcome may be upended by some fluke occurrence such as a Hail Mary pass in the final seconds which is what makes sports interesting, but nonetheless I think a live tracker would say a lot more than the actual score of the match.

So, I am wondering if anyone has specific resources for real time online data mining? At web scale for high throughput data streams. And I agree with shubmajain above, libraries and repos are preferable to books and academic journals ;)
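A rough sketch of the per-point belief update described above, in Python. It assumes the simplest possible model (each point is an independent coin flip with an unknown bias, tracked with a Beta prior), which is only an illustration and not what IBM or any real tracker does. `PointBelief` and `track` are made-up names.

```python
from dataclasses import dataclass


@dataclass
class PointBelief:
    """Beta(alpha, beta) belief about the chance player A wins any given point."""
    alpha: float = 1.0  # prior pseudo-count of points won by A (uniform prior, an assumption)
    beta: float = 1.0   # prior pseudo-count of points won by B

    def update(self, a_won_point: bool) -> None:
        # Conjugate Beta-Bernoulli update: add one observed success or failure.
        if a_won_point:
            self.alpha += 1.0
        else:
            self.beta += 1.0

    @property
    def mean(self) -> float:
        # Posterior mean estimate of A's point-win probability.
        return self.alpha / (self.alpha + self.beta)


def track(points):
    """Yield the updated estimate after each point in a stream of booleans."""
    belief = PointBelief()
    for a_won in points:
        belief.update(a_won)
        yield belief.mean


if __name__ == "__main__":
    # Hypothetical point-by-point feed: True means player A won the point.
    for estimate in track([True, True, False, True]):
        print(round(estimate, 3))
```

Because `track` is a generator that keeps only two counters, it handles an unbounded stream in constant memory. A real tracker would also need to account for who is serving and for the score, which this sketch ignores.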

The insight was related to winning first-service points when returning serve. This tweet has a screenshot of the association: https://twitter.com/lapsu/status/620223838895407104

This isn't too far from the logic: "How can we win this game? Score more points than the other team". I suppose the more interesting thing would be to compare the same correlation across players.

I agree that the stats don't provide insight regarding game play and strategy. IBM has been providing the same weak stats for years now. I would like to see tennis incorporate the hawk-eye system tracking player movement and shot placement as well. Perhaps that could produce a heat map. On that note they can also eliminate the line judges while we're at it. The whole challenge system is idiotic. They have the tech, they should incorporate it throughout the sport.

I don't understand that IBM Insights note about Djokovic. Can you explain more?

Without doing the math - Djokovic is such a strong player that even if he's only returning a quarter of your serves, meaning you're 3/4 of the way to winning your set (I don't tennis, sorry if I'm getting the terms wrong), he's still probably going to beat you.

Well, that's a close explanation, except I think you're confusing set and match. For men's tennis, it takes 3 sets to win the match, with the potential of playing 5 sets.

I'm actually not sure that the math is true, though. (Or I really don't understand what the stat is saying.) Let's say that it actually is for every 4 serves, you win 3, Djokovic wins 1. That number gives you every game (winning the game game-point-15), to give you every set. I don't see how Djokovic ever wins a game, let alone the set or match.
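For what it's worth, the 3-out-of-4 reading can be checked with a short calculation. A minimal sketch, assuming points are independent and the server wins each one with probability p (a textbook simplification, not the model behind the IBM stat). `hold_probability` is a hypothetical helper name.

```python
def hold_probability(p: float) -> float:
    """Chance the server wins a game if they win each point with probability p.

    Assumes independent points; p is the server's per-point win probability.
    """
    q = 1.0 - p
    # Win before deuce: 4-0, 4-1, or 4-2 in points (the last point goes to the server).
    before_deuce = p**4 * (1 + 4 * q + 10 * q**2)
    # Reach deuce (3-3 in points), then win two points in a row before the returner does.
    reach_deuce = 20 * p**3 * q**3
    win_from_deuce = p**2 / (p**2 + q**2)
    return before_deuce + reach_deuce * win_from_deuce


if __name__ == "__main__":
    print(hold_probability(0.75))
```

Under that assumption `hold_probability(0.75)` comes out to about 0.95, so the returner still breaks serve roughly one game in twenty rather than never. Winning 3 of every 4 points on average is not the same as winning every game.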

It's hard to take any action based on that fact without further information. Even a gambler couldn't use that tidbit without conditioning on things like the current score. Or am I missing something?

Great resources.

I would add these great ebooks on Cloud Computing and AWS Certifications:

The Cloud Computing Job Market

With this eBook you will learn how Cloud Computing is changing the IT industry and creating a complete set of new roles for companies and businesses worldwide. Information and data to start your cloud computing career.

Link [0] https://cloudacademy.com/ebooks/cloud-computing-job-market-3...

A Guide to AWS Certification Exams

Introduction to the full range of Amazon Web Services certification exams: learn what, why, and how to pass just the right exam for you.

Link [1] https://cloudacademy.com/ebooks/guide-aws-certification-exam...

AWS Solutions Architect Certification

Study guide to Amazon Web Service's Solutions Architect certification exam: tips and suggestions on how, what, and where to learn.

Link [2] https://cloudacademy.com/ebooks/aws-solutions-architect-cert...

Honest question: is ML/DS something you can just pick up and be hired[0]? Maybe I'm ignorant, but I'd think employers would look for a degree in some related field to actually consider you for a position doing it.

[0] As in how you can pick up web hacking, do a few websites and create a reputation and get hired that way without a formal degree.

There was a thread on here a month or two ago about this. In general, it was noted that it's best (for both employment as well as just getting stuff done) to have a deep understanding of a particular area of ML rather than a general understanding of many areas. Usually those with a deep understanding have focused on it in school. But the latter group of generalists is a much larger group in the software industry, since most of us did not go to school for this specifically.

I went from being a US diplomat with no coding background to getting a job at edX as a machine learning engineer, so it's very possible. The keys are to find projects and build a portfolio so that you can prove your capabilities, and to start a blog/go to meetups so that you can build an audience and find opportunities.

The market seems to want a lot of them, with different profiles and CVs for different domains and responsibilities: data wranglers, data analysts, statisticians, machine learning, business analysts, communicators, infrastructure operators, big data architects. The best shot is coupling your academic / self-matured strength with a domain you really like and building your own portfolio from real-world case studies in the field you choose.

I think you kind of posed a question and a partial answer. If you have a degree in a related field (math, statistics), then yes, you can pick these things up. With a CS degree or no degree, it will be much harder to pass resume filters.

Any specific recommendations from anyone?

The Elements of Statistical Learning together with the online course (http://www.r-bloggers.com/in-depth-introduction-to-machine-l...) makes for a great introduction.

EDIT: Oops I should have said "An Introduction to Statistical Learning with Applications in R" rather than The Elements of Statistical Learning. The Elements book goes into way too much depth to be a good introduction to the subject.

Similarly, An Introduction to Statistical Learning With Applications in R is like a practical version of (or companion to) Elements. I very much enjoyed it.

And the Stanford version of the same class linked above for ISLR is, in my opinion, better:


Yes good point, my bad - I meant to link to "An Introduction" rather than "Elements". Elements is not a good starting point - your head will explode.

Depends on what you want to learn.

"Mining of Massive Datasets" by Leskovec, Rajaraman and Ullman is very good.

Although the post gives a link to the Amazon page of the book, PDFs of the chapters are free to download at the official book web site[1].

[1] http://www.mmds.org/

I really like this kind of stuff.

It's my opinion that our educational process is a bit too heavy on algorithms and languages while being a bit too light on data structures.

I like to brush up on this subject matter from time to time just to keep myself sharp.

Anyone recommend any of the R books listed or know of any great R books for purchase?

My favorite intro to R book is The Art of R Programming by Norman Matloff http://www.amazon.com/The-Art-Programming-Statistical-Softwa...

This is a new book by Danny Kaplan that I was able to provide some feedback on prior to publishing:


I really enjoyed the book, it took a modern approach to R using many of the newer packages (dplyr for instance) and ggplot and combined them into a very nice introduction to R with labs, etc. Well worth checking out.

Discovering Statistics Using R by Andy Field. AINEC.

Nice books collection. Thanks :)


Why are you hijacking my scroll speed...

Your "smooth-scroll" library is completely breaking my touchpad scroll with an Acer c720 Chromebook. One slight movement (which should be a few pixels scroll) is moving me over half-way down the screen. Makes your site unusable with this touchpad as accidental scrolling sometimes happens and moves the screen a whole page away, especially when trying to right click open links because the gestures are similar.

Sorry to all affected by the smooth scroll. It's been removed.

Hmm. Interesting. I just implemented the smooth scroll yesterday so I will definitely check that out. Thanks for the input.

Smooth scrolling is already implemented correctly in the browser. Your implementation is just a hack that hijacks the normal behaviour a user is accustomed to and just gives back a version that just feels wrong to interact with, even without performance issues.

Do the entire internet a favour and un-implement it.

It isn't broken on touchpad for me.

That said, if I'm being honest, it's fairly unpleasant to use on a desktop with a mouse. It scrolls you to the top after it loads (which is after the rest of the page), and behaves differently than the computer normally does...

I would recommend doing away with it.

I made a change to the code, but since I don't have a touchpad, I won't be able to tell if it's fixed. Let me know what happens if you happen to go back to the page.

Please don't use this. There's nothing wrong with browsers and how they scroll. We all know how to use it and it works well everywhere.

You're loading more code just to mess with something that already works without any new benefit (and actually degrading the experience).

It's still not working well on my touchpad. It stutters badly. I honestly would recommend removing it. I checked it on my desktop. It works there, but the different scroll speed is unhelpful and actually a little bothersome.

Just remove it entirely already, it's nearly unusable on my desktop now, and it was bad before.
