Ask HN: What journals and blogs should I be reading to become a data scientist? - dewang
======
davidw
Seen on twitter today:

[https://twitter.com/jeremyjarvis/status/428848527226437632](https://twitter.com/jeremyjarvis/status/428848527226437632)

"A data scientist is a statistician who lives in San Francisco."

~~~
jmount
I like that. Even worse would be "an actuary who lives in SF."

~~~
Bootvis
What would that be?

Am I being dense?

~~~
pessimizer
I'm guessing it's saying that the new 'data scientist' is more an actuary than
a statistician. Just a bunch of predictive models trying to enhance a bottom
line more than anything else.

------
rch
You know, I absolutely see where the poster is coming from, and the
suggestions look helpful so far, but the question might as well read: What
journals and blogs should I be reading to become a Cardiothoracic Surgeon?

(though hopefully nobody bleeds out on a table when someone misconstrues
statistical data)

We've lived through an amazing time where one could learn by doing, and
talented people have been able to compete without the benefit of formal
education (myself included), but in my opinion those days are numbered.

I've personally observed respected PhD statisticians stumble on the type of
problems a data scientist is expected to address. The combination of complex
software and often counterintuitive mathematics makes this an imposing field
for all but perhaps the top one percent of practitioners. Most everybody else
needs to really hit the books for a few years, in a formal setting.

With that pre-coffee rant out of the way, I'm looking forward to finding some
new sources here myself. So, in that spirit, thanks for the question.

~~~
JPKab
Just a bit of a counterpoint (taken from a comment on the Data Tau site):

"Data kiddies like me are coming. I just ran multiple passes of the
Broyden–Fletcher–Goldfarb–Shanno algorithm with a 100-layer neural network on
a tfidf-vectorized dataset. I have no clue what that all exactly means, all I
know is that it took under an hour and it gives a higher (top 10%) AUC score.
Kaggler amateurs are beating the academics by brute force or smarter use of
the many tools that are currently freely available. Show a regular Python dev
some examples and library docs and she can compete in ML competitions. I was
getting good results with LibSVM before I even understood how SVMs work on
the surface. Feed the correct input format and some parameters and you are
good to go. Random Forests can be applied to nearly anything and get you 75%+
accuracy. Maybe I am just an engineer looking for pragmatic and practical use
of techniques from ML and data science. Hard data scientists will be the
statisticians, the algorithmic theory experts, the experimental physicists. It
takes me 7 years to understand a complex mathematical paper. It takes me 7
minutes to train a model and predict on a 1-million-row test set with Vowpal Wabbit."

The point is that a Data Scientist is really a person who is a blend of
statistician and software engineer. Sure, there are brilliant people who will
invent new ML algorithms, but you don't need to invent that stuff to be of
tremendous value to a business that has data it isn't currently getting
much value out of. A software engineer at a small business doesn't need to
write a database; she just needs to be able to deploy one somebody else
wrote to add tremendous value.
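The quoted workflow (tf-idf features plus an off-the-shelf classifier) really is only a few lines. Here is a minimal sketch with scikit-learn; the toy documents and labels are invented purely for illustration:

```python
# Vectorize text with tf-idf, fit an off-the-shelf Random Forest, score it.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier

docs = ["cheap pills buy now", "meeting moved to friday",
        "win money fast", "lunch at noon?",
        "free offer click here", "quarterly report attached"]
labels = [1, 0, 1, 0, 1, 0]  # 1 = spam, 0 = ham (made-up toy labels)

X = TfidfVectorizer().fit_transform(docs)  # sparse tf-idf feature matrix
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, labels)

print(clf.score(X, labels))  # training accuracy on the toy set
```

Swapping `RandomForestClassifier` for another estimator is a one-line change, which is exactly the brute-force point the quote is making.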

~~~
jarvic
Sure, most anyone can throw some data into an SVM and get a result out, maybe
even a good one. The problem comes when someone like this has to answer
questions beyond a simple 90% accuracy rate. What does the computed separation
direction tell me? Could I improve accuracy by using some a priori information
like how often one class occurs in relation to the other? What 10% of the
population am I failing on? Is it an important part? Is there some easy way I
could do better? Is my data so high dimensional that I'm getting some trivial
separation and not anything driven by the data itself?

And what happens when this person gets a new data set and they are suddenly
getting garbage out of some standard SVM? Is it just a matter of the data not
being well-separated using a linear model but throwing some simple kernel at
the SVM will do the trick?
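The kernel question can be made concrete: on synthetic data that is not linearly separable (scikit-learn's `make_circles`), a linear SVM stalls near chance while the same SVM with an RBF kernel fits cleanly. A sketch, using entirely synthetic data:

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings: no straight line separates the classes.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear = SVC(kernel="linear").fit(X, y)
rbf = SVC(kernel="rbf").fit(X, y)

print("linear:", linear.score(X, y))  # near chance level
print("rbf:", rbf.score(X, y))        # near perfect
```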

Even something as simple as taking a mean can fall apart when you are dealing
with data which doesn't live in a Euclidean space, let alone something like
PCA or SVM which also make assumptions of linearity.
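The point about means is easy to demonstrate with directional data: for compass headings of 359° and 1°, the arithmetic mean points due south, while the circular mean (averaging unit vectors) recovers the right answer. A pure-Python sketch:

```python
import math

angles = [359.0, 1.0]  # two headings in degrees, both nearly due north

naive = sum(angles) / len(angles)  # 180.0 -- due south, nonsense

# Circular mean: average the unit vectors, then take the angle back.
s = sum(math.sin(math.radians(a)) for a in angles)
c = sum(math.cos(math.radians(a)) for a in angles)
circular = math.degrees(math.atan2(s, c)) % 360  # ~0 degrees, as expected

print(naive)
print(circular)
```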

The point is, it isn't just about being able to invent new methods. Things
like SVM make assumptions about your data and applying them in cases when
these assumptions don't hold can give completely worthless information, even
if it looks good on the surface. Using something you don't understand, even if
it is at a (much) more basic level than someone with a PhD in statistics, is
just asking for trouble.

------
daniyaln
Subscribe to Data Science Weekly:
[http://www.datascienceweekly.org/](http://www.datascienceweekly.org/)

[http://blog.zipfianacademy.com/](http://blog.zipfianacademy.com/)

[http://blog.cloudera.com/blog/category/data-science/](http://blog.cloudera.com/blog/category/data-science/)

[http://www.hilarymason.com/](http://www.hilarymason.com/)

[http://mathbabe.org/](http://mathbabe.org/)

[http://fivethirtyeight.blogs.nytimes.com/](http://fivethirtyeight.blogs.nytimes.com/)

[http://blog.kaggle.com/](http://blog.kaggle.com/)

[http://grepalex.com/](http://grepalex.com/)

[http://pulseblog.emc.com/category/big-data/](http://pulseblog.emc.com/category/big-data/)

[http://radar.oreilly.com/](http://radar.oreilly.com/)

[http://flowingdata.com/](http://flowingdata.com/)

[http://oreilly.com/data/newsletter.html](http://oreilly.com/data/newsletter.html)

[http://www.gapminder.org/](http://www.gapminder.org/)

[http://mlcomp.org/](http://mlcomp.org/)

~~~
sutterbomb
Great list. I'd just like to give a shout out to Hilary Mason and her blog, as
her very approachable style was what initially got me interested in this
space.

------
joshvm
Don't bother with journals - in pretty much any subject - unless you have a
degree and/or you understand what to look for, or are directed to notable
articles in bibliographies or by peers. There is a lot of crap in all
journals; much of it is needlessly technical for practical purposes or too
bleeding-edge to actually be useful yet.

I'm not trying to be snarky, but honestly unless you know what you're looking
for it's a fool's game. Once you've got the feel for a subject, you tend to
find several authors that crop up time and time again, or landmark papers that
really shifted the field. But that takes a long time: it takes most PhD
students a year to fully understand and collate the background of a topic
they may think they already know well.

That and no one _actually_ reads journals. You do a search on Web of Knowledge
or ADS or arXiv or whatever your poison is, and you see what comes up. Point is,
you need to know what you're looking for.

This is akin to saying that if you read Phys Rev enough, you'll become a
physicist. Sure, sure, keep up with the trends, but big important results get
press, which is enough to rely on to start off with.

To become a data scientist? Read the recommended textbooks and take a proper
degree in statistics, computer or data science. Look at the courses on EdX and
Coursera for a starting point, they'll help you decide whether this is
something you seriously want to pursue.

Even if this is just a hobby, e.g. you're a coder who wants to branch out,
you should still take the time to invest properly in education. Data science,
like statistics in general, is very easy to mess up. When people draw bad
conclusions from data (and a good data scientist can wring almost any
conclusion out of any data set), bad things inevitably happen. Entire lines
of research have been destroyed because somewhere, someone messed up their
stats and apparently important results turned out to be meaningless.

~~~
kriro
While a little off topic, my typical research process goes like this:

\- Hear or read about something that sounds neat

\- See if there's a wikipedia article (I always cringe when I hear some
colleagues of mine say never, ever use it)

\- Get a high level understanding of the topic from the wikipedia
article...that usually leads to some other wikipedia articles + plain old
Google searches...just fishing for whatever comes up [I also search for TED
talks, youtube videos and MOOCs related to the topic]

\- Scribble stuff down on a piece of paper and structure it in a way that
makes sense to me (sometimes it's just a list, sometimes a full blown mindmap)
...at this point I have a decent high level understanding...which basically
means I could describe the topic to someone without stumbling (which I usually
try at this point)

\- From the high level understanding I usually also get: key terms for
searches, intro-level books/articles that are linked, etc.

\- At this point working at a university comes in handy because it lets me
get behind the annoying paywalls at will...search Google Scholar or similar
databases for the mined key words. Everything that looks remotely
interesting...oh wait BEST TOOL EVER

\- Zotero is sick good, comes as a FF plugin...great. If you search in
scientific databases and the like a little icon pops up in the address bar of
the browser indicating it identified the sources...click, mark everything ->
it goes into your collection (with full text access) [I order it by topic so
for AI I might have Expert Systems and Rule Based, Fuzzy etc.]

\- So basically I just wade through the databases and get everything that
sounds interesting from the title into Zotero. Always a good idea to get some
"history of XYZ" or "XYZ since author Y" sources

\- Once done I read the abstracts and the conclusions and put down a rough
note on what the articles are about. I also scan the sources to grow my collection of
relevant articles (I mark what I don't think is relevant or move it into a
special subcollection)

\- I usually try to establish a history of the field with the major stepping
stones, this is usually easy (sometimes not, worth a paper to make it easier
for future researchers :P)

\- If it's related to programming in any way I also search google or github
directly for anything related. Code is good :)

[often there are tomes that are the de facto standards in their fields that
serve as a massive source collection as well. Perfect example would be AI - A
Modern Approach]

------
mswen
Becoming a data scientist isn't a matter of reading journals and blogs. You
can get a sense of the field and what is required by reading those sites but
becoming a data scientist is years of hard work.

You need to develop serious skills in at least 4 of the following disciplines:

Statistical analysis

RDBMS query development

NoSQL databases

Machine learning

Natural Language Processing

Web crawling and data harvesting techniques

Programming to access data APIs

Web development

Data visualization

Business systems that generate data, including CRM, ERP, and more

Geospatial data systems

Each of these areas would have its own set of resources both formal and
informal.

~~~
stephencanon
Well that’s just, like, your opinion, man.

I’m not a “data scientist” (or statistician, for that matter), but of the
(excellent) data scientists that I know, the only specific skill they really
have in common is statistical analysis. I’d say the truth is probably closer
to “statistical analysis + ability to do independent research + computational
chops using whatever their tools of choice may be.”

~~~
squigs25
As a data scientist, I have to agree with his opinion.

Usually you have a team where each person is "specialized" in a few of those
categories.

You can call a data scientist a statistician, but I don't think you can
necessarily call a statistician a data scientist.

The truth is, you need only a shallow understanding of machine learning and
stats to be a data scientist. But you also need the know-how to collect data -
this ends up being the much bigger issue to tackle in my environment. (For
what it's worth, you need to have a strong understanding of how data points
relate to one another, how accurate they might be, why they might not be
accurate, and you also need to be constantly thinking about the long term
vision for your data.)

------
visakanv
[http://www.informationisbeautiful.net/](http://www.informationisbeautiful.net/)

[http://blog.okcupid.com/](http://blog.okcupid.com/)

[https://www.facebook.com/data/](https://www.facebook.com/data/)

[http://www.pornhub.com/insights/](http://www.pornhub.com/insights/)

~~~
codegeek
A recently launched HN-style community [0] is pretty good as well

[0] [http://www.datatau.com/news](http://www.datatau.com/news)

------
MrMan
Unless you are part of a vanishingly small group of autodidacts who can train
themselves up to graduate school levels of expertise in multiple overlapping
subjects - statistics, computer science (might be able to get away with just
being an ok programmer), and the interdisciplinary combination of those called
"machine learning," you should disappear into a statistics degree program, and
amend the traditional stats program deficiencies with the modern-day leavening
agents that create "machine learning."

Downloading scikit-learn and R and such is not going to work. At that level you
are only qualified to be bossed around by a real scientist or statistician.
You are an "analyst".

------
nashequilibrium
Follow the link below; there are about 24 hours of lectures, including
materials, code, etc. These lectures cover reading data, saving data,
cleaning & reshaping, visualization, stats, 8 hours of machine learning in
scikit-learn, version control & unit testing, and geospatial analyses. This
is all in Python, using NumPy, SciPy, IPython, pandas, and scikit-learn as
the base tools. You will love the IPython notebook!
[https://conference.scipy.org/scipy2013/tutorials_schedule.ph...](https://conference.scipy.org/scipy2013/tutorials_schedule.php)

------
jmount
Try our upcoming book: "Practical Data Science with R"
[http://www.manning.com/zumel/](http://www.manning.com/zumel/)

------
allochthon
I don't have a PhD, and I'd love to be called a "scientist." But I think it's
pretentious to use the label "data scientist" for anyone with solid stats
experience and a gift for exploring data. To my mind, scientists have gone
through formal training and earned a PhD, which, in a given context, may or
may not be necessary for what these guys are doing.

------
roel_v
Not a journal or blog, but you should start reading the application guidelines
for your local university's math, econometrics or similar degrees.

------
ScottWhigham
Have you joined/visited [http://datatau.com](http://datatau.com)? Fun HN-style
community site.

------
justinkestelyn
Good list of initial resources:

[http://www.cloudera.com/content/dev-center/en/home/developer...](http://www.cloudera.com/content/dev-center/en/home/developer-admin-resources/new-to-data-science.html)

------
amerkhalid
You can also take MOOC courses for example:
[https://www.coursera.org/specialization/jhudatascience/1?utm...](https://www.coursera.org/specialization/jhudatascience/1?utm_medium=listingPage)

------
chubot
I'd recommend Hadley Wickham's papers:
[http://vita.had.co.nz/](http://vita.had.co.nz/)

He is the prolific author of many R packages, which are more like little
languages than libraries. His papers are both philosophical and practical, and
informed by writing a huge amount of code.

The first one on that page is really good, and along with another paper of his
got me thinking explicitly about organizing my data in R using the relational
model (a thing people with computer science backgrounds will know well).

It made me realize that R is actually a better SQL. It's a language for
tables, or an algebra of tables.
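The "algebra of tables" idea is easy to illustrate. The comment is about R, but the relational operations compose the same way in pandas; the tiny tables below are invented for illustration:

```python
import pandas as pd

orders = pd.DataFrame({"order_id": [1, 2, 3],
                       "customer": ["ann", "bob", "ann"],
                       "total": [20.0, 35.0, 15.0]})
customers = pd.DataFrame({"customer": ["ann", "bob"],
                          "city": ["SF", "NYC"]})

# SELECT customer, SUM(total) FROM orders JOIN customers USING (customer)
# WHERE city = 'SF' GROUP BY customer
joined = orders.merge(customers, on="customer")  # join
sf = joined[joined["city"] == "SF"]              # selection
out = sf.groupby("customer")["total"].sum()      # aggregation

print(out)  # ann: 35.0
```

Each step takes tables in and gives tables (or keyed series) out, which is what makes the whole thing an algebra rather than a bag of one-off commands.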

------
mindcrash
Grab this set: [http://shop.oreilly.com/category/get/data-science-kit.do](http://shop.oreilly.com/category/get/data-science-kit.do) for Data
Science, and maybe this set as well:
[http://shop.oreilly.com/category/get/machine-learning-kit.do](http://shop.oreilly.com/category/get/machine-learning-kit.do) if
you're into Machine Learning.

Both from O'Reilly (with some Packt mixed in). Excellent content.

------
steamer25
This isn't a periodical (although you used to be able to view the top
questions for the given week--if anyone knows how to get that out of
StackExchange again, please let me know), but it is a good source of bite-sized
info-trickle:

[http://stats.stackexchange.com/questions?sort=votes](http://stats.stackexchange.com/questions?sort=votes)

~~~
steamer25
Aha! Run this with weeks == 2:
[http://data.stackexchange.com/stats/query/76694/top-question...](http://data.stackexchange.com/stats/query/76694/top-questions-asked-less-than-n-weeks-ago)

------
ih
Udacity has a data science track of courses
([https://www.udacity.com/courses#!/Data%20Science](https://www.udacity.com/courses#!/Data%20Science))
and the blog has recently had data science related posts
([http://blog.udacity.com/](http://blog.udacity.com/)).

------
ZygmuntZ
Try these:

[http://fastml.com/links/](http://fastml.com/links/)

------
x-sam
I use a Twitter list to collect some cool data people; here it is:
[https://twitter.com/lc0d3r/data-nerds](https://twitter.com/lc0d3r/data-nerds)

------
phatak-dev
You can learn a lot about machine learning from this course
[https://www.coursera.org/course/ml](https://www.coursera.org/course/ml)

------
dbecker
Not a journal or blog, but I highly recommend Andrew Ng's Machine Learning
course on Coursera.

------
0800899g
What journals and blogs should I be reading to become a data scientist?

------
skadamat
datasciencemasters.org

and the HN for Data Sci - datatau.com

------
slashdotaccount
Dipshit Buzzwords Quarterly Data Mining, Machine Learning, Artificial
Intelligence and other euphemisms for being pretentiously lazy Amazon
Principal Engineer Tenets

