How to learn data science (dataquest.io)
248 points by spYa on July 16, 2015 | 81 comments

The actual problem with learning "data science" is making inferences and conclusions which do not violate the laws of statistics.

I've seen many submissions to Hacker News and Reddit's /r/dataisbeautiful subreddit where the author goes "look, the analysis supports my conclusion and the R^2 is high, therefore this is a good analysis!" without addressing the assumptions required for those results.
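As a rough illustration of that point (a toy sketch with made-up data, not anything taken from an actual submission): fit a straight line to data that is exactly quadratic. The R^2 looks impressive, but the residuals show the linearity assumption is plainly violated.

```python
# Hypothetical toy data: y is exactly quadratic in x, yet we fit a straight line.
xs = list(range(11))
ys = [x ** 2 for x in xs]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Sums of squares and cross-products for ordinary least squares.
sxx = sum((x - mean_x) ** 2 for x in xs)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
syy = sum((y - mean_y) ** 2 for y in ys)

slope = sxy / sxx
intercept = mean_y - slope * mean_x
r_squared = sxy ** 2 / (sxx * syy)  # squared correlation = R^2 for simple regression

# Residuals should look like noise if a straight line were the right model.
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

print(f"R^2 = {r_squared:.3f}")
print("residuals:", [round(r, 1) for r in residuals])
```

The R^2 comes out around 0.93, yet the residuals are positive at both ends and negative in the middle, which is exactly the kind of pattern a high R^2 alone will never reveal.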

Of course, not everyone has a strong statistical background. Except I've seen YC-funded big data startups and venture capitalists, who should really, really know better, commit the same mistakes.

"Data science" is a buzzword that is successful only due to obscurity and to no one actually caring if the statistics are valid. That's why I've been attempting to open source all my statistical analyses/visualizations, with detailed steps on how to reproduce. (see my recent /r/dataisbeautiful submissions on reddit: https://www.reddit.com/user/minimaxir/submitted/ )

Roughly 80% of data scientists I know have a PhD in something very math heavy. The rest have master's degrees. There are programmers who can assist them by doing the grunt work, but it's just basic programming to help analysts crunch data.

If you want to do data science for real:

1. Get a Master's or PhD in statistics, computer science, economics, physics or some other heavy field and specialize in data analysis in that field. You must learn lots of statistics when doing so.

2. Learn programming, statistical machine learning and tools of the trade.

Good data science is not based on collecting large amounts of data passively and then mining it mindlessly. You need to ask the right questions and design the data collection and modeling process based on those questions.

I've seen a huge range in the people calling themselves data scientists. Some have very analytically intensive academic degrees, others just finished a data science boot camp, and there are a lot of people that used to be called 'business analysts' who are basically doing the same job with a fancier title. In every group, I've had people tell me that what they're doing is really data science, because data science needs the (academic|integrated|business) perspective that they have, and what the other people are doing isn't really data science.

> Good data science is not based on collecting large amounts of data passively and then mining it mindlessly. You need to ask the right questions and design the data collection and modeling process based on those questions.

This resonates. That is, picking and designing features. Also understanding dependent variables and knowing how to test for them; getting that wrong is the biggest mistake leading to flawed conclusions that I see from the 'general public'.

What do you mean by testing for dependent variables?

Maybe something to do with instrumental variables? https://en.wikipedia.org/wiki/Instrumental_variable

Academic credentials aren't enough; good data-driven decisionmaking is as much an art as an academic discipline. A p-value of .01 is a Nobel Prize in medicine and unpublishable in physics -- domain knowledge is important to have a feel for the difference.

Assuming that smart autodidacts can't obtain sound statistics knowledge is selling many people short.

I think you are right in that it sells many people short, but then again having no good academic credentials is selling yourself short.

Data science is not like security. There it is more accepted that good engineers/researchers do not necessarily have the best accreditation. It seems that data science/engineering is turning around to this though.

It's not that autodidacts cannot build bridges, it is that the people with the data and money do not want their bridges built by autodidacts.

Anyway... back to studying http://statweb.stanford.edu/~tibs/ElemStatLearn/ for me :).

No. A phd in statistics or economics means almost nothing at this point. Even if it did, truly, signal mastery of the content, which it doesn't anymore, it would signal to most people who do this kind of work that you're way overqualified while simultaneously being totally ignorant of the day-to-day work of actual data scientists.

If you want to be a useful data scientist, do a lot of work with data. If you have strong programming skills and are flexible and a quick learner then you will do well.

Spending the better part of your young adulthood getting a phd in statistics, unless you want to go into academia, just makes you look like a fool.

There absolutely are problems that require a more rigorous mathematical training than you get from undergraduate courses or day-to-day experience. Most data scientists and companies may not be tackling these problems, but they certainly exist.

Just having a PhD will open doors for you that would otherwise be shut. But before pursuing that degree, you should be confident that you enjoy working in the field and want to devote your career to it. Also, you have to be prepared to work hard, not just to get the degree, but then to land a job where you'll put that experience to use. Otherwise, you'll be sharing a cubicle with DataWorker and feeling like a fool.

That said, if you don't know whether you need a PhD, that means you probably don't know what kinds of problem you want to work on. And in that case, there's a good chance you'll end up working on a problem that only interests your advisor and nobody else (most PhD advisors have more students than they have good problems to work on). In that case, I wouldn't recommend it.

I've had the complete opposite experience. Do the people who hire for this kind of work often bet on non-PhD candidates? Do they trust themselves to separate the wheat from the chaff?

Don't you want a colleague who is able to mention seminal papers for specific problems? Who is able to read and understand these papers and can distill useful features and optimizations from them?

People with PhDs who go into business usually end up in the better positions. They hire other PhDs for the good positions to keep the signal (mastery of the content) stronger.

As someone who did a lot of work with data I have little problem with my usefulness, but a lot of problems opening doors to the really interesting data companies (lacking a proper academic network). I wish I had gotten that PhD, because right now applying to Google, Microsoft, Facebook, Yahoo or eBay for data science positions makes me look like a fool.

I've met a lot of fools who've quoted all the right works, in both Computer Science and Data Science. Computer Science fools usually get fired. Data Science fools seem to get promoted to Yes-man status. It's a lot harder to lie about your code than it is with statistics; as the old adage goes, it's right behind Lies and Damned Lies.

You've said "No", but you haven't countered the poster's claims. Are in fact most data scientists PhD/Masters people? I hear the same information at a mid-sized tech company. I also hear similar things about Intel.

>Spending the better part of your young adulthood getting a phd in statistics, unless you want to go into academia, just makes you look like a fool.

This is why you are a DataWorker, and not a dataScientist.

Anyone can push bits around. It takes a trained mind to corral them using careful experimentation and observation.

It's dangerous to make big generalizations like "no one actually caring if the statistics are valid." This simply is not true. Sure, a lot of what you see on /r/dataisbeautiful is garbage, but that's because it's an open forum where anyone can show what they think they have found. Usually, whenever someone makes an egregious statistical error, they are called out for it. Of course, the same happens on larger scales and even in published research.

"Data science" at its core is just statistical analysis, but it has been slowly morphing over the past few decades thanks to the budding field of machine learning and the commoditization of computing power. This has drastically changed the field of statistical research, and although the underlying math is the same, the tools and the amount of data are constantly in flux. Someone along the way must have felt that this evolution of statistical analysis needed a new name. In all honesty, it's just a name, and it doesn't matter. What matters is if you understand how to use it.

The classic venn diagram of data science is still helpful: http://drewconway.com/zia/2013/3/26/the-data-science-venn-di...

This article reads like a way to find yourself in the danger zone.

I've never seen this venn diagram before--thanks for bringing it up. As an academic (pursuing a Ph.D. in astrophysics), I find that plenty of traditional researchers are able to hack together code (many haven't ever taken a formal programming course; http://arxiv.org/abs/1507.03989) but many also misuse or can't interpret statistics (from personal experience). That puts us in the danger zone!

I think the key is to find the mathematics and statistics interesting because you want the [data] science to be meaningful. If that's a driving force, then you can learn math and statistics on your own (like the author did). Otherwise, yes--you will find yourself in the danger zone.

What if you just want to get paid well to play with interesting tools?

I assume data science is often used in place of astrology. People just want to have something to cling to, to get over their fears and insecurity. So if you can generate some reassuring graphs, who cares if they are based on solid statistics or not?

>People just want to have something to cling to, to get over their fears and insecurity.

The other side of this is that some businesses (especially SMBs) are so horrible at utilizing their data that very basic analyses can reap big gains (80/20 rule!). For the vast majority of businesses there is no need for elaborate models or machine learning techniques.

>see my recent /r/dataisbeautiful submissions on reddit: https://www.reddit.com/user/minimaxir/submitted/

They all seem very well-presented[0], but I can't help but ask - what do you do with this new information?

[0] https://i.imgur.com/PvWYB2n.png

I am planning a blog post on the relationship between YouTube video duration and other statistics. So I did a little exploratory analysis to validate the data.

As shown, the distribution of durations in Music videos is much, much different than all other categories. As a result, it skews nearly every other analysis and I may have to exclude videos from the Music category entirely.

What you described is prevalent in the social sciences: tons of biases, causation/correlation errors, and blind trust in some arcane statistical analysis without knowing what it really means. Conclusion: "statistically significant" is the magic phrase peppered throughout academic literature.

Agreed. Fortunately, excellent social scientists really care about this -- see Andy Gelman's blog for many rants on this topic.

I'll echo for biology and sports sciences (think moneyball).



Good article for beginners. A couple thoughts, just to build on what the author said:

First off, data science == fancy name for data mining/analysis. Wanted to clear that up due to buzzwordy nature of "data science."

Learn SQL - this is the big one. You must be proficient with SQL to be effective at data science. Whether it's running on an RDBMS or translating to map/reduce (Hive) or DAG (Spark), SQL is invaluable. If you don't know what those acronyms mean yet, don't worry. Just learn SQL.

Learn to communicate insights - I would add here to try some UI techniques. Highcharts, d3.js, these are good libraries for telling your data story. You can also do a ton just with Excel and not need to write any code beyond what you wrote for the mining portion (usually SQL).

I would also go back to basics with regards to statistical techniques. Start with your simple Z Score, this is such an important tool in your data science toolbox. If you're just looking at raw numbers, try to Z-normalize the data and see what happens. You'd be surprised what you can achieve with a high school statistics textbook, Postgres/MySQL (or even Excel!), and a moderate-sized data set. These are powerful enough to answer the majority of your questions, and when they fail then move on to more sexy algorithms.
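For anyone who hasn't met the Z score since high school, here's a minimal sketch using only the Python standard library. The `daily_orders` numbers and the |z| > 2 cutoff are made up for illustration, and `z_normalize` is just a helper name I picked, not something from a library.

```python
import statistics

def z_normalize(values):
    """Return (value - mean) / standard deviation for each value."""
    mean = statistics.mean(values)
    # Population standard deviation; swap in statistics.stdev for the sample version.
    stdev = statistics.pstdev(values)
    if stdev == 0:
        raise ValueError("all values are identical; z-scores are undefined")
    return [(v - mean) / stdev for v in values]

# Hypothetical daily order counts with one unusual day at the end.
daily_orders = [102, 98, 105, 99, 101, 97, 100, 180]
z_scores = z_normalize(daily_orders)

for count, z in zip(daily_orders, z_scores):
    flag = "  <-- unusual" if abs(z) > 2 else ""  # the cutoff of 2 is an arbitrary choice
    print(f"{count:4d}  z = {z:+.2f}{flag}")
```

Only the last day gets flagged here. The same calculation is easy to reproduce in SQL or Excel with an average and a standard deviation, which is rather the point.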

Edit: one more thing I forgot to mention. After SQL, learn Python. There are a ton of libraries in the python ecosystem that are perfect for data science (numpy, scipy, scikit-learn, etc). It's also one of the top languages used in academic settings. My preferred data science workspace involves Python, IPython Notebook, and Pandas (This book is quite good: http://www.amazon.com/Python-Data-Analysis-Wrangling-IPython...)

Great comment.

BTW, you can make interactive visualizations in pure python with bokeh: http://bokeh.pydata.org/en/latest/

Also with Blaze, you can use Pandas (or even Dplyr) syntax in python to query Hive, Spark and other large stores. http://blaze.pydata.org/en/latest/

This is the truth. People can't do simple statistics, even with advanced degrees. In many cases advanced degrees make things worse. The ability to reason about data and have strong fundamentals in math statistics is what's needed.

Someone else mentioned Gelman's blog. That's a great place to find evidence that phd's do not lead to an increased ability to ferret out "truth" or insight from data. In many cases they just hide the mistakes so that others without that background don't know they're being misled.

I would shift priorities to Python (from SQL).

Unless one has the "data scientist" title just to make "database engineer" look fancier, data comes in various shapes and forms. And most questions cannot be answered with a simple aggregation.

For example, data I work on (I am a data scientist freelancer) is flat csv files, xls files, JSON files, some text files I need to parse, various SQL, MongoDB, things I am getting from various APIs, etc...

While understanding joins is crucial (and normal forms, etc.), SQL itself takes a negligible amount of my time (and effort).

I would disagree with that advice. If you work as a data scientist in a company, you will likely have the logs of something stored in an SQL table (be it a pure SQL database or something like Hadoop Hive) and you will have to answer (and ask) questions like: "Do people convert more when they come from X or Y?", so you will have to do a couple of queries to get the conversion rates from people coming from X and Y.
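To make the X-versus-Y question concrete, here is a small sketch against an in-memory SQLite database. The `visits` table, its columns, and the rows are all invented for the example; a real log table will look different.

```python
import sqlite3

# In-memory database standing in for the company's log store (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (source TEXT, converted INTEGER)")

# Made-up rows: converted is 1 if the visitor converted, 0 otherwise.
rows = [
    ("X", 1), ("X", 0), ("X", 0), ("X", 1),
    ("Y", 1), ("Y", 0), ("Y", 0), ("Y", 0), ("Y", 0),
]
conn.executemany("INSERT INTO visits VALUES (?, ?)", rows)

# One grouped query gives the visit count, conversions, and rate per source.
query = """
    SELECT source,
           COUNT(*)       AS visits,
           SUM(converted) AS conversions,
           AVG(converted) AS conversion_rate
    FROM visits
    GROUP BY source
    ORDER BY source
"""
results = conn.execute(query).fetchall()
conn.close()

for source, visits, conversions, rate in results:
    print(f"{source}: {conversions}/{visits} converted ({rate:.0%})")
```

With a sample this tiny the difference between X and Y means nothing, of course; the query only gets you the rates, and deciding whether the gap is real is where the statistics comes in.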

This was my experience when I worked as a data scientist about a year ago. Now, YMMV, especially if you're a freelancer; I guess your clients are more comfortable with giving you raw dumps of data as files instead of giving you access to their database servers.

I work as a freelancer. And actually, I never ever processed logs.

Of course, sometimes I am given SQL access to a server; but I never learnt SQL except in action (i.e. things which I need right now).

And most of the time I work with flat files. Even if they come from SQL they typically need serious preprocessing before I can do a more advanced analysis.

BTW: I have no problems with composing rather advanced queries. It's just that if SQL is a problem for someone (and, in case of doubt, can't be Googled in no time), then I am curious how they can get machine learning.

But how do you become a good data scientist, instead of a technical person who knows how to apply an algorithm in Python/R?

What I am trying to ask is how do you become good at setting your starting point (formulating your hypotheses), communicating your insights and selecting which tools apply where, because if you are good at coding and have experience in things related to computer science you have the abilities to handle a dataset (SQL knowledge) and the data tools (Python, Pandas, etc.), but that doesn't earn you the title of data scientist.

Practice. And education, but mostly practice. This is the kind of thing that is typically taught in formal educational settings (at least in engineering, which is my experience). As an example, I learned more about probability & statistics in 1) AP biology in high school, and 2) a "simulation systems" class in my industrial engineering master's curriculum. We spent much of the former class learning basic statistical analysis techniques (ANOVA, chi-square, etc) to apply to our lab data, and the latter class was all about statistical analysis of process flows (aimed at the real life problem of factory production planning & scheduling and manufacturing process optimization).

So, do I consider myself a data scientist? Absolutely not. But I do understand basic statistical concepts and know how to apply them to several categories of real life data analysis problems.

I'm a terrible coder, btw.

Would you recommend any approach, or should I go dust off my high school and college books in search of study material? Or is that material too basic?

Regarding SQL, have you noticed any increase in the usage of "window functions" (how important do you find them for your work?)

You're basically describing stuff I was doing like 20 freaking years ago. Minus the Hive & Spark & Highcharts & d3.js - naturally.

But back then I couldn't get any of my managers to understand or appreciate what I was doing. Fickle finger of fate.

Hell, even William Gosset was doing "data science" when he popularized Student's t-distribution while working for the Guinness Brewery back in 1908.

I lived through Statistics, Business Analysis, Decision Analytics, Data Analytics, Data Mining, and now Data Science. Same thing renamed over and over again.

Regarding the post above, it's right. A data scientist is someone better at statistics (classical stats, Bayesian, machine learning) than a computer scientist, and better at programming (SQL, R/Python for building models) than an academic statistician. Plus a teaspoon of visualization (ggplot or d3).

AI has gone through the same sort of buzzword treadmill and even programming in general. Only after living through a few cycles does it really become obvious how cyclic these sorts of trends are.

I'm trying to work on being less jaded about it, and not letting my annoyance with the-new-trendy-thing-that-i-remember-doing-years-ago-under-a-different-name get in the way of learning new technology and new lessons.

But it's a struggle.

Definitely agree that those long lists that tell you to first become awesome at combinatorics, linear algebra, then learn all about statistical inference (that is, not actual statistical procedures but the mathematical underpinnings of statistics that would enable you to construct and evaluate methods you invent yourself), then move on to stochastic optimization... those are really more about machismo than about actually helping people to learn data science. Sure, linear algebra is helpful, but whether it's fundamental really depends on the kind of data science you're keen to do.

I also generally dislike /r/machinelearning and /r/statistics because they seem to have been taken over by people who will tell you to either get a PhD or get out. But, for me, just learning whatever I thought I needed to help me solve the problem at hand got me stuck really fast. There's so much statistics where you really just have to learn it first before you can start to see when and why you'd like to use it.

It never occurred to me to use hierarchical modeling and partial pooling for a certain set of problems until after I'd read Gelman & Hill. I never thought that inference on a changing process might require different techniques from the techniques for stationary processes until I had to study Hidden Markov Models for an exam. Heck, when I got started with data analysis I didn't even realize that the accuracy of most statistics improves proportional to sqrt(n) and so the next logical step in my mind was always "get more data!" instead of "learn more about statistics!" (If you look at the industry's obsession with unsampled data, data warehouses that store absolutely everything ever and map/reduce, my hunch is I'm not the only one who lacks or at some point lacked elementary statistical knowledge because it just never came up on their self-motivated, self-directed learning path.)
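The sqrt(n) point is easy to see in a quick simulation. This is only a sketch under simple assumptions (independent draws from a standard normal, the sample mean as the statistic); `spread_of_sample_mean` is a name made up for the example.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

def spread_of_sample_mean(n, trials=400):
    """Empirical standard deviation of the mean of n draws from N(0, 1)."""
    means = []
    for _ in range(trials):
        sample_mean = sum(random.gauss(0, 1) for _ in range(n)) / n
        means.append(sample_mean)
    return statistics.stdev(means)

# Quadrupling the sample size each time should roughly halve the spread.
spreads = {n: spread_of_sample_mean(n) for n in (100, 400, 1600)}

for n, spread in spreads.items():
    print(f"n = {n:5d}  spread of mean = {spread:.4f}  1/sqrt(n) = {n ** -0.5:.4f}")
```

Each fourfold increase in data buys only about a twofold improvement in precision, which is why "get more data!" runs into diminishing returns long before the data warehouse fills up.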

So I think the ideal learning path incorporates a bit of both: learn more about what excites you and about what's immediately useful right now, but also put aside some time to fill out gaps in your knowledge – even things that don't immediately look useful – and make some time for fundamental/theoretical study.

(x-posted from DataTau)

+1 for DataTau, I didn't know about that.


Check out this too: http://www.pyquantnews.com/

Data science is a stupid buzzword. The ideal candidate knows enough about IT to massage data, the more they know about the domain under investigation the better, and for sure some statistics. Most of all, always do sanity checks: does it make sense? Can it be? Is the data correct?

It is an art. Like writing awesome code, etc. Practice, practice, and working with experienced people is key.

At least it's not something engineer like every other job in the tech field.

"You need something that will motivate you to keep learning." This is so true and often forgotten. I am always learning new things, but the concepts that stick, beyond just the basics, are tied to specific projects or solutions to real problems. I'm typically ok with being a "jack-of-all-trades" for most technologies, just to stay aware of new things. However, when it comes to applying new concepts, skills, or tech to solve problems, a deeper understanding is required; usually obtained through motivation.

>"You need something that will motivate you to keep learning." This is so true and often forgotten.

I'm surprised that both you and OP seem to think this advice is rare, as I've personally seen it mentioned more than a few times. Professors always brainstorm ways to motivate students, employers always seek methods to motivate employees, etc. The answer always seems to be something along the lines of 'do what you love' which is so overused it loses its impact.

Anyway, as someone new to data science, I did not feel like I gained any new information after reading the article, and all the advice seems either intuitive or rehashed. Looking forward to reading the HN discussion though.

> Professors always brainstorm ways to motivate students, employers always seek methods to motivate employees

Uhh.. You must have quite a fortunate streak of having great teachers and great employers. Speaking for myself, I had some great teachers, but finding a boss who seeks to motivate employees with the right challenges is rare. Bosses and companies generally assume your salary is the primary motivator.

If you want to learn about data science, read this book: http://www-bcf.usc.edu/~gareth/ISL/

I have been going through it and I cannot think of a better resource.

As someone who uses "Data Science" to teach "Computational Thinking", I think this blog post hits on a lot of really valuable pedagogical notes: getting motivated, learning things through doing, and having a strong context for your learning.

For those wondering why I put my buzzwords in quotes, it's because I don't want to sound like I'm a huge proponent of either of them. CT is the term I use to describe how I teach my students about abstractions, algorithms, and some programming. DS is the term I use to describe how students learn all of that in the context of working with data related to their own majors. I'm not trying to claim some crazy paradigm shift, just that it's a great way to convince students that CS is useful to them.

Anyone interested in data science should first study cognitive psychology. The CIA has a manual on the psychology of intelligence analysis that is a must read for anyone pursuing any analytical job.

If you don't understand how your mind sees, processes, retains and recalls data... how can you possibly analyze it accurately?

You have a link to where to obtain said manual?

Thank you.

These principles are useful when learning anything really: human language (immersion), programming (build something), sports (practice), etc.

That said, as someone who worked in software engineering for 5 years without a degree, and recently returned to school, I would say be careful not to discount studying theory at the same time you're practicing your craft. I really think a combined approach of structured university courses and MOOCs, including reading textbooks, along with applying the knowledge has been the best approach for me.

I was arrogant about "not needing" a degree for years, feeling justified by the fact that I was making very valuable contributions as an engineer, until I finally went back to school and realized how valuable theoretical knowledge can be.

I've been working as an analyst for 7 years; it's only in the last couple of years that I've heard statistical analysis referred to as data science.

Am I missing something or is it just a new word?

In many ways it's just a new word for the same thing, but there are a few key differences. The main difference between traditional statistics and data science is strength in programming. Data scientists are also expected to be more well versed in statistical modeling than your average programmer or data analyst.

With that said it's not really a new thing, people have been doing data science for decades. The demand for people who can program and also do more complex statistical modeling has skyrocketed so I think that's why there's a new name for it now.

Part of the problem is that even with this definition there's a wide range of abilities present in data scientists. A long time computer programmer who has dabbled in statistics and a long time statistician who has dabbled in computer programming would both be data scientists even though they bring very different strengths to the table.

No, no, they're totally disrupting the field of statistical analysis. That's why they need a new name.

I am doing this course and find it really good : https://www.edx.org/course/scalable-machine-learning-uc-berk...

It is about building linear and logistic regression plus PCA using Spark (Python API).

Here are some topics. Are they considered relevant to data science?

Matrix row rank and column rank are equal.

In matrix theory, the polar decomposition.

Each Hermitian matrix has an orthogonal basis of eigenvectors.

Weak law of large numbers.

Strong law of large numbers.

The Radon-Nikodym theorem and conditional expectation.

Sample mean and variance are sufficient statistics for independent, identically distributed samples from a univariate Gaussian distribution.

The Neyman-Pearson lemma.

The Cramer-Rao lower bound.

The martingale convergence theorem.

Convergence results of Markov chains.

Markov processes in continuous time.

The law of the iterated logarithm.

The Lindeberg-Feller version of the central limit theorem.

The normal equations of linear regression analysis.

Non-parametric statistical hypothesis tests.

Power spectral estimation of second order, stationary stochastic processes.

Resampling plans.

Unbiased estimation.

Minimum variance estimation.

Maximum likelihood estimation.

Uniform minimum variance unbiased estimation.

Wiener filtering.

Kalman filtering.

Autoregressive moving average (ARMA) processes.

Rank statistics are always sufficient.

Farkas lemma.

Minimum spanning trees on directed graphs.

The simplex algorithm of linear programming.

Column generation in linear programming (Gilmore-Gomory).

The simplex algorithm for min cost capacitated network flows.

Conjugate gradients.

The Kuhn-Tucker conditions.

Constraint qualifications for the Kuhn-Tucker conditions.

Fourier series.

The Fourier transform.

Hilbert space.

Banach space.

Quasi-Newton iteration and updates, e.g., Broyden-Fletcher-Goldfarb-Shanno.

Orthogonal polynomials for numerically stable polynomial curve fitting.

Lagrange multipliers.

The Pontryagin maximum principle.

Quadratic programming.

Convex programming.

Multi-objective programming.

Integer linear programming.

Deterministic dynamic programming.

Stochastic dynamic programming.

The linear-quadratic-Gaussian case of dynamic programming.

I've been doing data science for a while now, and for me personally:

Not really. The SVD is much more important. No. Yes. Yes. No (R-N) yes (CE). Yes. Yes. Yes. Personally, no. Only in the usage of MCMC. Yes. Yes. No. Of course. All the time. Yes. Yes. The most I'll do is remember to use the sample standard deviation. No. Yes. No. Yes. Yes. Yes. No. No. Yes. I just use a solver. See above. See above. Of course. Yes. Yes. Not privileged w/r/t/ other bases. Of course. I've never needed it. Ditto. As another tool in the toolbox. They would not be my first or second choice. Yes. No. No. Yes. No. Yes. Yes. Yes. No.

The topics you mention are maths or applied maths topics. "Data Science" is a bubbly term that roughly means "take that big dump of data and give me some advice on how to make more money", so your list, very sadly, has little relevance to it.

Most of those topics I listed are supposed to be good at taking data and saying how to "make more money"!

I've seen in other threads you recommended Neveu's book to cover some probability theory topics. Care to explain whether Halmos & Rudin would be sufficient pre-requisites?

Halmos, Measure Theory, is a good prerequisite to Neveu. Rudin, Principles, is a bit too little. Instead, the first half, the real half, of Rudin's Real and Complex Analysis is a good prerequisite. So is Royden's Real Analysis.

Neveu is elegant beyond belief, but Breiman, Probability, the SIAM book, available in paperback, is darned good, usually easier than Neveu, less elegant, closer to applications, and without some of the special Tulcea material in the back of Neveu. K. L. Chung also has a good, comparable book. Even if you want Neveu to be your main probability book, which is fine, likely you should have alternative treatments.

Of course, there is Loeve, Probability -- written in English but somehow sounding like French. It has a lot, a little too much, but I liked the topics I studied in it. It turns out, Neveu and Breiman were both Loeve students.

Halmos, Measure Theory, is darned fun to read: It has the three series theorem and a famous exercise on regular conditional probabilities.

I learned the stuff from a course by A. Karr, a star student of E. Cinlar. Karr's course was the best course of any kind I ever took in school. Powerful material, beautifully presented, each day it was a shame to erase the board.

The exercises in Neveu are usually harder than the ones in Halmos, Breiman, and Chung.

Neveu makes probability a crown jewel of civilization.

The summer after Karr's course, I sat in the library for six weeks and walked out with a 50 page manuscript that was all the research and the first draft of my dissertation. Net, probability at the level of Neveu is darned powerful stuff, makes a lot in research, and research for applications, really easy -- that is, you really know just what the heck you are doing and can knock off new results having fun sitting in bed next to your wife while she watches TV (warning -- not gender neutral!).

What I've outlined is sometimes just called graduate probability. The biggest difference is that the whole subject makes daily use of measure theory.

I don't know how much you need in probability before starting on graduate probability. In my case, graduate probability was my first serious study of probability, and I never felt that I was not prepared.

But in my career I'd done a lot of practical work in both probability and statistics -- e.g., multivariate statistics, hypothesis testing, stochastic processes, digital filtering, the fast Fourier transform, beam forming (a case of antenna theory), power spectral estimation (US Navy sonar type stuff), how to get the central limit theorem out of digital filtering, and more, random number generation, etc. That work was plenty of intuitive background for graduate probability.

But in much of that work I was struggling because, at that level, basic knowledge of probability is commonly weak. So, after those struggles, seeing graduate probability be all clean and powerful was great.

I can't advise on just how much elementary probability you might need to have enough intuition to be comfortable with graduate probability. I will say, you do need both the intuitive experience and also the solid math.

I feel sorry for people who work in prob/stat without a background in grad prob: The elementary stuff is too often just confused from poor understanding from a poor background.

The sources I mentioned above were really the first sources from which I did any real study. Net, the elementary material of prob/stat is really too simple to be taken very seriously. So, for your first serious effort, just go for graduate probability from the sources above.

The Neveu, etc., material is much of the foundation for the secret sauce of my startup.

Thanks for the insights. Chung seems quite doable at my current level. I skimmed through it sometime ago. I borrowed a copy of Neveu and it seemed a bit harder.

Care to share other references you like? Real & complex analysis and algebra, in particular, are most welcome.

> Real & complex analysis and algebra,

I've mentioned books I've spent at least some significant time with.

There are lots more books on my shelves that look good, have good recommendations, etc., but that I haven't paid much attention to.

My interest in algebra is a bit meager -- I'm not seriously interested in number theory, algebraic geometry, algebraic topology, etc.

For real analysis, the books I mentioned seem to me to provide really good sources. Of course there is much more to analysis, e.g., functional analysis. And there's a lot to stochastic processes. And much more to math.

If you want to dip your toes in algebraic geometry and functional analysis, you could do a lot worse than Lang's book on SL(2,R) for the former and Bollobas' for the latter.

cf. http://maths-magic.ac.uk/course.php?id=339


Also, learning the R language and using RStudio is a great way to get into it. R has so many packages to help you do any data analysis. The learning curve is quite steep though.

Read this (free) book: http://mmds.org/

Moving data around is just grunt work.

Real science requires a creative and critical mind, which takes years to mold.

Sounds like you've also spent years molding professional disdain for everyone who's not a Real Scientist.

No, I've just seen too many people spin their wheels on "analysis" that is not hypothesis driven.

You've got to start with questions to get answers, and the hard part of science isn't crunching data, it is asking the right question!

And how does the Right Question appear if not through exploration and manipulation of the data?

Theory can obviously be very useful, but much of this stress on advanced statistics and PhDs is just a smokescreen for academics who suck at programming.

If you can't program and manipulate data, statistics won't save you because you won't have the ability to dig deep enough to find valuable insights. On the other side, if you know how to slice and dice data quickly and reliably, you can learn a huge amount by applying only the simplest statistical techniques. Generally the simple techniques are better anyway because they make mistakes less likely and your findings are easier to communicate.
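To make the "simplest statistical techniques" point concrete, here is a minimal sketch of slicing data by group and comparing mean against median, using only the Python standard library. The `orders` records, the region names, and the `summarize` helper are all made up for illustration; nothing here comes from a real dataset.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical records: (region, order_value). Invented numbers, illustration only.
orders = [
    ("north", 120.0), ("north", 80.0), ("north", 100.0),
    ("south", 40.0), ("south", 60.0), ("south", 500.0),
]

def summarize(rows):
    """Group values by key and report count, mean, and median per group."""
    groups = defaultdict(list)
    for key, value in rows:
        groups[key].append(value)
    return {
        key: {"n": len(vals), "mean": mean(vals), "median": median(vals)}
        for key, vals in groups.items()
    }

summary = summarize(orders)
for region, stats in sorted(summary.items()):
    print(region, stats)
```

In this toy data the "south" mean sits far above its median, which is the kind of cheap signal (one large order dominating the average) that tells you where to dig next.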

>And how does the Right Question appear if not through exploration and manipulation of the data?

Questions don't magically come out of a data set. Doing so is called a fishing expedition and usually results in boring, descriptive results which have no impact.
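One way to see the risk in a fishing expedition is a small simulation: correlate many pure-noise columns against a pure-noise outcome and count how many look "significant". This is only a sketch under stated assumptions: the data is synthetic Gaussian noise, `pearson` and `fishing_expedition` are hypothetical helper names, and 0.361 is the approximate two-sided 5% cutoff for a correlation with 30 rows.

```python
import math
import random

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def fishing_expedition(n_rows=30, n_features=200, threshold=0.361, seed=0):
    """Count noise features whose |r| with a noise outcome exceeds the cutoff.

    threshold=0.361 is roughly the two-sided p < 0.05 cutoff for n_rows=30;
    it is an assumption tied to that row count, not a general constant.
    """
    rng = random.Random(seed)
    outcome = [rng.gauss(0, 1) for _ in range(n_rows)]
    hits, best = 0, 0.0
    for _ in range(n_features):
        feature = [rng.gauss(0, 1) for _ in range(n_rows)]
        r = abs(pearson(feature, outcome))
        if r > threshold:
            hits += 1
        best = max(best, r)
    return hits, best

hits, best = fishing_expedition()
print(f"'significant' noise features: {hits} of 200, strongest |r| = {best:.2f}")
```

Every feature here is unrelated to the outcome by construction, yet about 5% of them are expected to clear the cutoff. Without a question chosen in advance, those are the "findings" that get reported.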

To answer impactful questions, you must go into your data collection with the questions in mind. To understand what questions to ask, you need a trained, critical, and creative mind. That is something you don't get from pushing bits.

>If you can't program and manipulate data

Programming and manipulating data is easy. Almost every new statistician these days can, and does, do this routinely.

What's hard is the years of intuition about what is meaningful and what is noise.

I know. It's hard to hear, and career programmers most of all hate to hear it, but it's the truth.

Anyone I've ever heard say "programming is easy" is without fail a terrible programmer.

I'm not really sure how to respond to the idea that exploring a dataset isn't a useful way to help develop questions about it. It's only a "fishing expedition" if you have no idea what you're doing.

>Anyone I've ever heard say "programming is easy" is without fail a terrible programmer.

Development of a world-class application is difficult because of the complexity built into a program of large scope.

Knowing enough programming to competently move a data set around is easy. Hell, you could do most of it with just bash.

>I'm not really sure how to respond to the idea that exploring a dataset isn't a useful way to help develop questions about it. It's only a "fishing expedition" if you have no idea what you're doing.

Well, I've seen a lot of it, in both science and business. People who spend a lot of time and money to generate a large data set simply because they lack a question to ask. They expect meaningful answers to just tumble out of it like manna from heaven, and end up confused and dismayed when the answers aren't impactful.

Fishing expeditions are looked down upon because they can only describe the data you generated. That is minimally useful, and can be done without grabbing a huge sample.

Good science starts with a question, then puts data to work to create new insight by removing confounding factors through careful design.
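As a rough illustration of what "removing confounding factors through careful design" buys you, the sketch below simulates a treatment with zero true effect. In the observational version a hidden trait drives both who takes the treatment and the outcome; in the designed version assignment is a coin flip. Everything here is an assumption made for the example: the `simulate` function, the "engaged" trait, and all the probabilities and effect sizes.

```python
import random
from statistics import mean

def simulate(randomized, n=20000, seed=1):
    """Return mean(treated) - mean(control) when the true effect is zero."""
    rng = random.Random(seed)
    treated, control = [], []
    for _ in range(n):
        engaged = rng.random() < 0.5  # hidden confounder (hypothetical)
        if randomized:
            # Designed experiment: assignment ignores the confounder.
            gets_treatment = rng.random() < 0.5
        else:
            # Observational data: engaged subjects self-select into treatment.
            gets_treatment = rng.random() < (0.8 if engaged else 0.2)
        # Outcome depends only on the confounder, never on the treatment.
        outcome = (2.0 if engaged else 0.0) + rng.gauss(0, 1)
        (treated if gets_treatment else control).append(outcome)
    return mean(treated) - mean(control)

print("observational difference:", round(simulate(randomized=False), 2))
print("randomized difference:   ", round(simulate(randomized=True), 2))
```

With these made-up parameters the observational comparison shows a large apparent effect that comes entirely from self-selection, while the randomized comparison stays near zero. More data does not fix the first number; only the design does.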

Learn biology, chemistry and physics, question yourself often and use your instinct.

If I had some type of practical application that I knew could benefit from data science (like learning RoR to make a marketplace app, for example), it would help a lot, as I'd have a clear goal and a route to achieve it. However, data science and machine learning are so broad and seemingly complicated (I fear complicated math formulas and statistics), and worse, I don't know what I want to achieve with them, nor what I want to make, which really hinders the learning process for me. I need some incentive or reward at the end.

Yes, the step 1 that nobody seems to mention is that you need to have a question that you're curious about, which data analysis may be able to help you answer. The reward is having some answer to that question, with an argument for its validity. Instead of links to a bunch of datasets, I'd love to see a site that collects questions with the potential for data-driven answers. This perhaps exists somewhere.

Absolutely. I remember seeing an "Epic NHL goal celebration" post on here a little while ago. That was a fun read and seemed like a good project to get some exposure to ML.

