Ask HN: What journals and blogs should I be reading to become a data scientist?
82 points by dewang on Jan 31, 2014 | 44 comments

Seen on Twitter today:


"A data scientist is a statistician who lives in San Francisco."

Loved this. Another variation that HN will fight to the death over: A growth hacker is a marketer who lives in San Francisco.

BigDataBorat's classic: "Data Science is statistics on a Mac."

I like that. Even worse would be "an actuary who lives in SF."

What would that be?

Am I being dense?

I'm guessing it's saying that the new 'data scientist' is more an actuary than a statistician. Just a bunch of predictive models trying to enhance a bottom line more than anything else.

This comment thread reminds me of the HN Parody http://bradconte.com/files/misc/HackerNewsParodyThread/ :)

a doctor is just a biologist who works in hospitals

You know, I absolutely see where the poster is coming from, and the suggestions look helpful so far, but the question might as well read: What journals and blogs should I be reading to become a Cardiothoracic Surgeon?

(though hopefully nobody bleeds out on a table when someone misconstrues statistical data)

We've lived through an amazing time where one could learn by doing, and talented people have been able to compete without the benefit of formal education (myself included), but in my opinion those days are numbered.

I've personally observed respected PhD statisticians stumble on the type of problems a data scientist is expected to address. The combination of complex software and often counterintuitive mathematics makes this an imposing field for all but perhaps the top one percent of practitioners. Most everybody else needs to really hit the books for a few years, in a formal setting.

With that pre-coffee rant out of the way, I'm looking forward to finding some new sources here myself. So, in that spirit, thanks for the question.

Just a bit of a counterpoint (taken from a comment on the Data Tau site):

"Data kiddies like me are coming. I just ran multiple passes of the Broyden–Fletcher–Goldfarb–Shanno algorithm with a 100-layer neural network on a tfidf-vectorized dataset. I have no clue what that all exactly means; all I know is that it took under an hour and it gives a higher (top 10%) AUC score. Kaggler amateurs are beating the academics by brute force or smarter use of the many tools that are currently freely available. Show a regular Python dev some examples and library docs and she can compete in ML competitions. I was getting good results with LibSVM before I even understood how SVMs work on the surface. Feed the correct input format and some parameters and you are good to go. Random Forests can be applied to nearly anything and get you 75%+ accuracy. Maybe I am just an engineer looking for pragmatic and practical use of techniques from ML and data science. Hard data scientists will be the statisticians, the algorithmic theory experts, the experimental physicists. It takes me 7 years to understand a complex mathematical paper. It takes me 7 minutes to train a model and predict a 1 million test set with Vowpal Wabbit."
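The workflow the quoted commenter describes really is that short. Here is a hypothetical sketch using scikit-learn; the toy corpus, labels, and test sentence are all invented for illustration:

```python
# Minimal "data kiddie" workflow: tf-idf features piped into an
# off-the-shelf linear classifier. The tiny spam/ham corpus below
# is fabricated purely to show how little code this takes.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

docs = ["cheap pills buy now", "meeting moved to friday",
        "win money fast", "lunch at noon tomorrow"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = ham

model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(docs, labels)

print(model.predict(["buy cheap money now"]))
```

You can get a prediction out without ever looking at what the vectorizer or the SVM actually do, which is exactly the commenter's point.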

The point is that a Data Scientist is really a blend of statistician and software engineer. Sure, there are brilliant people who will invent new ML algorithms, but you don't need to invent that stuff to be of tremendous value to a business that has data it isn't currently getting much value from. Just as a software engineer at a small business doesn't need to write a database, she just needs to be able to implement one somebody else wrote to add tremendous value.

Sure, most anyone can throw some data into an SVM and get a result out, maybe even a good one. The problem comes when someone like this has to answer questions beyond a simple 90% accuracy rate. What does the computed separation direction tell me? Could I improve accuracy by using some a priori information like how often one class occurs in relation to the other? What 10% of the population am I failing on? Is it an important part? Is there some easy way I could do better? Is my data so high dimensional that I'm getting some trivial separation and not anything driven by the data itself?
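The "what am I failing on?" question can be made concrete with a tiny fabricated example: on an imbalanced dataset, a model that has learned nothing still posts an impressive accuracy number.

```python
# Why "90% accuracy" can be meaningless: if 90% of examples belong to
# one class, predicting that class every single time already scores 90%.
# The 90/10 split below is made up for illustration.
import numpy as np

y_true = np.array([0] * 90 + [1] * 10)   # 90/10 class imbalance
y_pred = np.zeros_like(y_true)           # degenerate model: always predict 0

accuracy = (y_true == y_pred).mean()                 # 0.90, looks great
recall_minority = (y_pred[y_true == 1] == 1).mean()  # 0.0, catches nothing

print(f"accuracy: {accuracy:.0%}, minority-class recall: {recall_minority:.0%}")
```

The headline accuracy hides that the minority class, which may well be the important one, is missed entirely.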

And what happens when this person gets a new data set and they are suddenly getting garbage out of some standard SVM? Is it just a matter of the data not being well-separated using a linear model but throwing some simple kernel at the SVM will do the trick?
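A sketch of that "throw a simple kernel at it" fix, using scikit-learn's synthetic concentric-circles dataset (chosen here as a stock example of data a linear model cannot separate):

```python
# Two concentric circles are hopeless for a linear SVM but trivial
# for one with an RBF kernel, which implicitly lifts the data into
# a space where the classes become separable.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear = SVC(kernel="linear").fit(X, y)
rbf = SVC(kernel="rbf").fit(X, y)

print("linear kernel accuracy:", linear.score(X, y))  # near chance level
print("RBF kernel accuracy:", rbf.score(X, y))        # near perfect
```

Knowing that this is the right knob to turn, rather than, say, collecting more data or switching models entirely, is part of what separates understanding from tool use.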

Even something as simple as taking a mean can fall apart when you are dealing with data which doesn't live in a Euclidean space, let alone something like PCA or SVM which also make assumptions of linearity.

The point is, it isn't just about being able to invent new methods. Things like SVM make assumptions about your data and applying them in cases when these assumptions don't hold can give completely worthless information, even if it looks good on the surface. Using something you don't understand, even if it is at a (much) more basic level than someone with a PhD in statistics, is just asking for trouble.

I absolutely agree, and in that sense I'm sort of living the dream already. But even skilled people don't always know what they don't know, and that can show through somewhat more easily in this field.

Great list. I'd just like to give a shout out to Hilary Mason and her blog, as her very approachable style was what initially got me interested in this space.

Don't bother with journals - in pretty much any subject - unless you have a degree and/or you understand what to look for, or are directed to notable articles in bibliographies or by peers. There is a lot of crap in all journals; much of it is needlessly technical for practical purposes or too bleeding-edge to actually be useful yet.

I'm not trying to be snarky, but honestly unless you know what you're looking for it's a fool's game. Once you've got the feel for a subject, you tend to find several authors that crop up time and time again, or landmark papers that really shifted the field. But that takes a long time, it takes most PhD students a year to fully understand and simply collate the background of a topic they may think they know a lot about.

That and no one actually reads journals. You do a search on Web of Knowledge or ADS or arXiv or whatever your poison and you see what comes up. Point is, you need to know what you're looking for.

This is akin to saying that if you read Phys Rev enough, you'll become a physicist. Sure, sure, keep up with the trends, but big important results get press which is enough to rely on to start off with.

To become a data scientist? Read the recommended textbooks and take a proper degree in statistics, computer or data science. Look at the courses on EdX and Coursera for a starting point, they'll help you decide whether this is something you seriously want to pursue.

Even if this is just a hobby, e.g. you're a coder who wants to branch out, you should still take the time to invest properly in education. Data science, like statistics in general, is very easy to mess up. When people draw bad conclusions from data (and a good data scientist can make up any conclusion from any data set), bad things inevitably happen. Entire threads of science have been destroyed because somewhere, someone messed up their stats and apparently important results turned out to be meaningless.

While a little off topic I think my typical research process goes like this:

- Hear or read about something that sounds neat

- See if there's a Wikipedia article (I always cringe when I hear some colleagues of mine say never, ever use it)

- Get a high-level understanding of the topic from the Wikipedia article...that usually leads to some other Wikipedia articles + plain old Google searches...just fishing for whatever comes up [I also search for TED talks, YouTube videos and MOOCs related to the topic]

- Scribble stuff down on a piece of paper and structure it in a way that makes sense to me (sometimes it's just a list, sometimes a full-blown mindmap) ...at this point I have a decent high-level understanding...which basically means I could describe the topic to someone without stumbling (which I usually try at this point)

- From the high-level understanding I usually also get: key terms for searches, intro-level books/articles that are linked, etc.

- At this point working at a university comes in handy because it lets me get behind the annoying paywalls at will...search Google Scholar or similar databases for the mined key words. Everything that looks remotely interesting...oh wait, BEST TOOL EVER

- Zotero is sick good, comes as a FF plugin...great. If you search in scientific databases and the like, a little icon pops up in the browser's address bar indicating it identified the sources...click, mark everything -> it goes into your collection (with full-text access) [I order it by topic, so for AI I might have Expert Systems, Rule-Based, Fuzzy, etc.]

- So basically I just wade through the databases and get everything that sounds interesting from the title into Zotero. Always a good idea to get some "history of XYZ" or "XYZ since author Y" sources

- Once done, I read the abstracts and the conclusions and put down a rough note on what the articles are about. I also scan the sources to grow my collection of relevant articles (I mark what I don't think is relevant or move it into a special subcollection)

- I usually try to establish a history of the field with the major stepping stones, this is usually easy (sometimes not, worth a paper to make it easier for future researchers :P)

- If it's related to programming in any way I also search google or github directly for anything related. Code is good :)

[often there are tomes that are the de facto standards in their fields that serve as a massive source collection as well. Perfect example would be AI: A Modern Approach]

Becoming a data scientist isn't a matter of reading journals and blogs. You can get a sense of the field and what is required by reading those sites but becoming a data scientist is years of hard work.

You need to develop serious skills in at least four of the following disciplines:

- Statistical analysis

- RDBMS query development

- NoSQL databases

- Machine learning

- Natural language processing

- Web crawling and data-harvesting techniques

- Programming to access data APIs

- Web development

- Data visualization

- Business systems that generate data, including CRM, ERP, and more

- Geospatial data systems

Each of these areas would have its own set of resources both formal and informal.

Well that’s just, like, your opinion, man.

I’m not a “data scientist” (or statistician, for that matter), but of the (excellent) data scientists that I know, the only specific skill they really have in common is statistical analysis. I’d say the truth is probably closer to “statistical analysis + ability to do independent research + computational chops using whatever their tools of choice may be”.

As a data scientist, I have to agree with his opinion.

Usually you have a team where each person is "specialized" in a few of those categories.

You can call a data scientist a statistician, but I don't think you can necessarily call a statistician a data scientist.

The truth is, you need only a shallow understanding of machine learning and stats to be a data scientist. But you also need the know-how to collect data - this ends up being the much bigger issue to tackle in my environment. (For what it's worth, you need to have a strong understanding of how data points relate to one another, how accurate they might be, why they might not be accurate, and you also need to be constantly thinking about the long term vision for your data.)

Agreed, it is just my opinion. And rarely, if ever, will you find all of these skills in the same person. More often it will be a small team of people, each with one or two specialties plus some other areas they are reasonably competent at.

Most of what I have been reading on the topic seems to define data science as the intersection of the kinds of things I have listed. I guess my larger point was that each of these areas has its own learning curve, and some, like statistics or machine learning, benefit from formal training. A person does not become a data scientist by reading blogs and journals.

A recently launched HN-style community [0] is pretty good as well.

[0] http://www.datatau.com/news

I'd second the OkCupid blog. An overlooked part of an analyst's role is being able to tell a story effectively with the data. OkCupid is particularly good at that aspect, so you can learn a lot by keeping up with them.

I wish OkCupid's blog were more up to date; it hasn't been updated since 2011.

Unless you are part of a vanishingly small group of autodidacts who can train themselves up to graduate-school levels of expertise in multiple overlapping subjects (statistics; computer science, though you might get away with just being an okay programmer; and the interdisciplinary combination of the two called "machine learning"), you should disappear into a statistics degree program and amend the traditional stats program's deficiencies with the modern-day leavening agents that create "machine learning."

Downloading scikit-learn and R and such is not going to work. At that level you are only qualified to be bossed around by a real scientist or statistician. You are an "analyst".

Follow the link below; there are about 24 hours of lectures, including materials, code, etc. They cover reading data, saving data, cleaning and reshaping, visualization, stats, eight hours of machine learning in scikit-learn, version control and unit testing, and geospatial analyses. It's all in Python, using NumPy, SciPy, IPython, pandas, and scikit-learn as the base tools. You will love the IPython notebook! https://conference.scipy.org/scipy2013/tutorials_schedule.ph...

Try our upcoming book: "Practical Data Science with R" http://www.manning.com/zumel/

I don't have a PhD, and I'd love to be called a "scientist." But I think it's pretentious to use the label "data scientist" for anyone with solid stats experience and a gift for exploring data. To my mind, scientists have gone through formal training and earned a PhD, which, in a given context, may or may not be necessary for what these guys are doing.

Not a journal or blog, but you should start reading the application guidelines for your local university's math, econometrics or similar degrees.

Have you joined/visited http://datatau.com? Fun HN-style community site.

You can also take MOOC courses for example: https://www.coursera.org/specialization/jhudatascience/1?utm...

I'd recommend Hadley Wickham's papers: http://vita.had.co.nz/

He is the prolific author of many R packages, which are more like little languages than libraries. His papers are both philosophical and practical, and informed by writing a huge amount of code.

The first one on that page is really good, and along with another paper of his it got me explicitly thinking about organizing my data in R using the relational model (a thing people with computer science backgrounds will know well).

It made me realize that R is actually a better SQL. It's a language for tables, or an algebra of tables.
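For readers coming from Python rather than R, the same "algebra of tables" view shows up in pandas; the two toy tables below are invented for illustration:

```python
# pandas expresses the same relational operations (join, filter,
# aggregate) that SQL does, as composable methods on tables.
import pandas as pd

orders = pd.DataFrame({"user_id": [1, 1, 2], "amount": [10, 20, 5]})
users = pd.DataFrame({"user_id": [1, 2], "name": ["ann", "bob"]})

# Roughly: SELECT name, SUM(amount) FROM orders
#          JOIN users USING (user_id) GROUP BY name;
totals = (orders.merge(users, on="user_id")
                .groupby("name", as_index=False)["amount"].sum())
print(totals)
```

The same mental model (relations in, relations out) carries over whether the "better SQL" you reach for is R, pandas, or SQL itself.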

Grab this set: http://shop.oreilly.com/category/get/data-science-kit.do for Data Science, and maybe this set as well: http://shop.oreilly.com/category/get/machine-learning-kit.do if you're into Machine Learning.

Both from O'Reilly (with some Packt mixed in). Excellent content.

This isn't a periodical (although you used to be able to view the top questions for the given week--if anyone knows how to get that out of StackExchange again, please let me know) but it is a good source of bite-sized info-trickle:


Udacity has a data science track of courses (https://www.udacity.com/courses#!/Data%20Science) and the blog has recently had data science related posts (http://blog.udacity.com/).

I use a Twitter list to collect some cool data people; here it is: https://twitter.com/lc0d3r/data-nerds

You can learn a lot about machine learning from this course https://www.coursera.org/course/ml

Not a journal or blog, but I highly recommend Andrew Ng's Machine Learning course on Coursera.

What journals and blogs should I be reading to become a data scientist?


and the HN for Data Sci - datatau.com

Dipshit Buzzwords Quarterly: Data Mining, Machine Learning, Artificial Intelligence, and other euphemisms for being pretentiously lazy.

Amazon Principal Engineer Tenets

