
A Practical Intro to Data Science - swohns
http://blog.zipfianacademy.com/post/46864003608/a-practical-intro-to-data-science
======
rm999
>While R is the de facto standard for performing statistical analysis, it has
quite a high learning curve

What? R has a ridiculously low learning curve. I remember literally the first
time I used R I loaded up a dataset and had a histogram and qqplot within 5
minutes and 3 or 4 lines of code. Just figuring out what libraries I would
need to do that in Python (and installing them) would probably take me at
least 30 minutes.

I think it's still highly debatable whether Python is the way to go for
general data science, especially if you're spending a lot of time analyzing
data. R is more mature, but the tide is steadily moving in Python's direction.
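
For what it's worth, here is a rough sketch of the numbers behind a histogram
and a normal Q-Q plot using only Python's standard library, so nothing needs
installing. The sample data and the helper names (`histogram`, `qq_points`)
are made up for illustration, and drawing actual plots would still need a
library such as matplotlib.

```python
import random
from statistics import NormalDist, mean, stdev


def histogram(values, bins=10):
    """Return a list of (bin_start, bin_end, count) tuples over equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # avoid zero width when all values are equal
    counts = [0] * bins
    for v in values:
        # Clamp so the maximum value falls in the last bin.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return [(lo + i * width, lo + (i + 1) * width, c) for i, c in enumerate(counts)]


def qq_points(values):
    """Pair each sorted sample value with the matching quantile of a fitted normal."""
    ordered = sorted(values)
    n = len(ordered)
    fitted = NormalDist(mean(ordered), stdev(ordered))
    # (i + 0.5) / n is one common choice of plotting position.
    return [(fitted.inv_cdf((i + 0.5) / n), x) for i, x in enumerate(ordered)]


random.seed(0)
sample = [random.gauss(50, 10) for _ in range(500)]  # made-up data

# Crude text histogram: one '#' per five observations.
for start, end, count in histogram(sample, bins=8):
    print(f"{start:6.1f} to {end:6.1f} | {'#' * (count // 5)}")

theoretical, observed = qq_points(sample)[0]
print(f"smallest point: theoretical={theoretical:.1f}, observed={observed:.1f}")
```

If the sample is roughly normal, the (theoretical, observed) pairs should fall
close to a straight line, which is what a Q-Q plot shows visually.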

~~~
ahi
Some things, maybe most things that a beginner wants to do are easy, but
beyond the basics, R has a cliff for a learning curve. It's a complex language
with some questionable design choices.

~~~
rm999
I agree with you about the questionable design choices; I really don't like
developing in R. But if you have some data and you want to know more about
that data (data analysis), R + a small handful of packages is the best free
environment that I know of. It's not just the basics; this can extend into
fairly complex operations on your data (not to mention producing very pretty
visualizations).

Data analysis is one of the most important steps in data science, so I think
it's worth keeping R around.

------
darksaints
Pretty decent list of links. However, I feel the importance of SQL has been
completely downplayed...there are more Hadoop-oriented links than SQL ones.
Data retrieval and manipulation is where a data scientist will spend 95% of
her time, and SQL is still more ubiquitous by far.

~~~
danso
Dang, I wish I could find the link to this...an HP data scientist wrote a
short essay (something like "Intro to Data Science") and said that the proper
collection and cleaning of data is often seen as dirty grudge work that has to
be done (by someone else, hopefully) before the real groundbreaking work can
be done. However, the author said, this dirty grudge work _is_ the _real_
work.

When I think about it, in my data-programming-related work, I'd say about 5%
is doing analysis or executing statistical routines, and 95% of my time is
spent on finding, cleaning, and properly normalizing data. This applies
whether you're a solo researcher or Facebook...think about it: Facebook is a
pretty good website, but what it does better than just about anyone is being a
platform to _collect_ personal data in a way that...well, causes you to quite
willingly give it your personal data.

There was a presentation where Peter Norvig pointed out a data routine that
someone had implemented with a naive Bayesian classifier, with a comment
saying that they'd think of something better...and years later, no one had
realized it was still a TODO. Norvig said something like "You don't have to be
very smart when you have a lot of data."
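
To show how little code a naive Bayes classifier needs, here is a minimal
sketch in Python. It is not the routine Norvig described; the training
sentences, labels, and the `NaiveBayes` class are invented for illustration,
and it assumes whitespace-separated words with add-one (Laplace) smoothing.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial naive Bayes over whitespace-separated words (illustrative only)."""

    def __init__(self):
        self.label_counts = Counter()            # documents seen per label
        self.word_counts = defaultdict(Counter)  # word frequencies per label
        self.vocab = set()

    def train(self, text, label):
        words = text.lower().split()
        self.label_counts[label] += 1
        self.word_counts[label].update(words)
        self.vocab.update(words)

    def predict(self, text):
        words = text.lower().split()
        total_docs = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.label_counts.items():
            # Log prior plus the sum of log likelihoods, with add-one smoothing.
            score = math.log(doc_count / total_docs)
            total_words = sum(self.word_counts[label].values())
            for w in words:
                count = self.word_counts[label][w]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up training data.
clf = NaiveBayes()
clf.train("cheap pills buy now", "spam")
clf.train("win money now", "spam")
clf.train("meeting notes attached", "ham")
clf.train("lunch meeting tomorrow", "ham")

print(clf.predict("buy cheap money"))
print(clf.predict("notes from the meeting"))
```

With only four training sentences this is a toy; the point of the anecdote is
that the same simple model tends to improve as the amount of data grows.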

~~~
SkyMarshal
_> Dang, I wish I could find the link to this...an HP data scientist wrote a
short essay (something like "Intro to Data Science") and said that the proper
collection and cleaning of data is often seen as dirty grudge work that has to
be done (by someone else, hopefully) before the real groundbreaking work can
be done. However, the author said, this dirty grudge work is the real work._

It's called data munging. There was a good short article on Dataspora about it
a while back:

<http://www.dataspora.com/2009/05/sexy-data-geeks/>

------
mav3r1ck
I'm surprised no one commented on the cost: $14,400!

Assuming they don't do job placement (I didn't see anything about that), then
this is a total rip-off and is just some people trying to cash in on the data
science fad.

Besides, they just listed a bunch of free resources that invalidate the need
to go through them, so unless they offer job placement in an actual
data-science-like position, why waste your money on this?

Also, 12 weeks reminds me of Peter Norvig's "Teach Yourself Programming in
Ten Years":

<http://norvig.com/21-days.html>

~~~
flanger
We apologize that it was not clearer on the site, but we do partner with many
companies throughout the program who give guest lectures and provide
perspectives from the industry. There is also a hiring day where we match
candidates with prospective employers to assist with job placement, and we
prepare our students extensively for interviews.

While we have shown there are many online resources available for
understanding data science, we’ve found that structured, in-person programs
provide the best environment for a collaborative learning experience.

Scholarships are also offered for particularly promising applicants and for
students in need of financial aid. Please reach out to us directly if you
would like to know more about our financial assistance options:
hello@zipfianacademy.com

------
zissou
Happy to finally see an intro to data science article that puts statistics as
the #1 skill. While software engineering is also important, all the
engineering in the world won't help you extract any meaningful insight from
data without a solid foundation in probability and statistical theory.

------
touristtam
Too many links to click.

It is nice to present the wealth of resources that is available to anyone
looking to improve their skill set in this field.

However, as far as course intros go, it doesn't get me very excited, partly
because of the way it is worded: Python as the language choice, using
libraries that others have built. Would it not be better to teach the basics
from scratch, and then acknowledge that there is a library that can handle
that?

Another gripe I have is the page formatting: the single-column layout you have
on that blog does not suit the length of the text. I am reading
<http://www.moserware.com/2010/03/computing-your-skill.html> at the moment,
and apart from the nice picture to look at (who doesn't like shiny?), the
layout is much cleaner and the information is more readable.

Finally, as has been mentioned in other posts, you ought to have a short
summary of what 'data science' is (in relation to other terms used to describe
statistical analysis of datasets) and where it comes from.

------
lightblade
I'd like to know more about the scholarship that was mentioned. Will it be
able to cover the whole tuition fee? And who's sponsoring the scholarship?

------
bonsai
It is an interesting idea, but the course is expensive.

I hope they will release the course in online form (similar to Coursera) and
offer it for a reasonable price (max 100 USD per person).

There are plenty of people outside the USA willing to take this kind of
course.

------
maonuon
As a software engineer working in the biotech industry, I find your posting
very interesting and insightful. I am currently pursuing an MS in
Bioinformatics and am also very interested in Big Data. I think $14,400 for a
12-week program is a great investment. However, the location is not ideal. Do
you plan to expand your bootcamp to other cities besides SF?

~~~
clearspandex
Thanks for the kind words! Right now we are focusing only on our SF class but
we may expand in the future. I would encourage you to sign up for our email
list to stay up to date on any news about the program, and feel free to reach
out with any questions or concerns (jonathan@zipfianacademy.com). Best of luck
with your master's program and I hope you will keep in touch!

------
Irishsteve
Useful link... but I think "intro" is a little ambitious given all the
content there.

~~~
recuter
Well we might as well split it into Big Data and "Little Data".

Little Data being:

Have a basic grasp of Python and JavaScript/D3.js for the pretty
visualizations. That and basic statistics. The latter is probably the one
developers (at least here) would spend most of their effort on.

"Little Data" in itself can take you a long way.

~~~
tomrod
What is D3.js?

~~~
theaceae
A JavaScript data visualization library (d3js.org is linked to in the post)

A really friendly place to start understanding it is Scott Murray's tutorial:
<http://alignedleft.com/tutorials/d3/>

------
hyperbovine
It's time for this term to die.

If it doesn't involve data, it's not science.

~~~
yen223
Back in the good old days, Data Science used to be called Statistics.

~~~
stdbrouw
I used to feel the same way, but (1) calling it statistics downplays the
importance of data munging and grabbing/normalizing/managing data and (2) the
advent of computational methods and off-the-shelf machine learning and natural
language analysis has turned statistics into "one tool in our toolbox" rather
than The Tool.

And frankly, even if data science is just statistics rebranded, anything that
can get more people to take an interest in statistics is a good thing. If a
hype-fueled new sticker is what it takes, then why not.

