A Practical Intro to Data Science (zipfianacademy.com)
153 points by swohns on April 7, 2013 | 36 comments

>While R is the de facto standard for performing statistical analysis, it has quite a high learning curve

What? R has a ridiculously low learning curve. The very first time I used R, I loaded up a dataset and had a histogram and a Q-Q plot within 5 minutes and 3 or 4 lines of code. Just figuring out which libraries I would need to do that in Python (and installing them) would probably take me at least 30 minutes.
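For comparison, here is a minimal sketch of the same two checks in Python using only the standard library (so nothing to install). It prints a text histogram and the first few points of a normal Q-Q plot rather than drawing them. The sample data is made up, and the helper names `histogram` and `qq_points` are hypothetical, not from any library.

```python
import random
from statistics import NormalDist

def histogram(values, bins=8):
    """Count how many values fall into each of `bins` equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 when all values are equal
    counts = [0] * bins
    for v in values:
        # Clamp so the maximum value lands in the last bin.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return counts

def qq_points(values):
    """Pair each sorted sample value with the matching standard-normal quantile."""
    n = len(values)
    std_normal = NormalDist()
    theoretical = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, sorted(values)))

random.seed(0)  # made-up data, seeded so reruns give the same sample
sample = [random.gauss(50, 10) for _ in range(200)]

for count in histogram(sample):
    print("#" * count)

for theo, obs in qq_points(sample)[:3]:
    print(f"{theo:+.2f} -> {obs:.1f}")
```

If the Q-Q pairs fall roughly on a straight line, the sample looks approximately normal. Actual plots would still need a plotting library, which is the point being argued above.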

I think it's still highly debatable whether Python is the way to go for general data science, especially if you're spending a lot of time analyzing data. R is more mature, but the tide is steadily moving in Python's direction.

Jonathan here, Co-Founder of Zipfian Academy. While it is easy to get up and running in R for simple analysis, it is a complex language that takes years to master. I recommend learning Python for the aspiring data scientist because of its breadth of applicability. While R is probably better for statistical analysis than Python (every language has its specific domain where it shines), across the entire range of tasks a data scientist must perform, I feel that Python provides the best aggregate utility.

Also, as the comments below highlight, actual statistical analysis is but a small part of the data pipeline. Python has great facilities for interacting with data stores/sources in addition to being a powerful tool to clean and munge data.

>When I think about it, in my data programming related work, I'd say about 5% is doing analysis or executing statistical routines. And 95% of my time is spent on finding, cleaning, and properly normalizing data.

I hope the post doesn't downplay the importance of R to statistical analysis; it is a mature language with a great community surrounding it. The toolset of a data scientist is probably one of the most heterogeneous out there and necessitates learning and using many different abstractions.

For such a new (and hard to define) subject, I think dialogue is crucial to constructively advance the field. I would love to hear suggestions from the HN community about how to train the next generation of data scientists, what aspiring data scientists want to learn (or find difficult to learn), and how we can build a great data community.

Some things, maybe most things, that a beginner wants to do are easy, but beyond the basics, R has a cliff for a learning curve. It's a complex language with some questionable design choices.

I agree with you about questionable design choices, I really don't like developing in R. But if you have some data and you want to know more about that data (data analysis), R + a small handful of packages is the best free environment that I know of. It's not just the basics, this can extend into fairly complex operations on your data (not to mention producing very pretty visualizations).

Data analysis is one of the most important steps in data science, so I think it's worth keeping R around.

Depends on the analysis lifecycle, I think. I hate using Matlab for writing any sort of large experimental framework, especially if it has to interact with other systems, but for quick and dirty experiments, plotting, or parallelizing totally independent for-loops, e.g. for a grid search over some model hyper-parameters, it's great (the latter is literally changing "for" to "parfor", after you fork out for the Parallel Computing Toolbox of course). But then a lot of code in research is written in the style of use-code-for-a-week -> get a graph -> curve is higher than their curve? -> publication.

When I need to write something more substantive and reusable though, Python forever.
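For what it's worth, the parfor-style grid search mentioned above has a rough analog in Python's standard library. This is only a sketch under assumptions: `score` is a hypothetical stand-in for training and evaluating a model, and the grid values are made up.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

def score(params):
    """Hypothetical objective: stands in for training and evaluating a model."""
    lr, depth = params
    return -((lr - 0.1) ** 2) - ((depth - 4) ** 2) * 0.01

def grid_search(grid, workers=4):
    """Evaluate every combination independently and return the best one."""
    combos = list(product(*grid.values()))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each call is independent of the others, like iterations of a parfor loop.
        scores = list(pool.map(score, combos))
    best_score, best_combo = max(zip(scores, combos))
    return dict(zip(grid.keys(), best_combo)), best_score

grid = {"lr": [0.01, 0.1, 1.0], "depth": [2, 4, 8]}
best_params, best_score = grid_search(grid)
print(best_params, best_score)
```

Threads are used here to keep the sketch simple. For CPU-bound pure-Python work, `concurrent.futures.ProcessPoolExecutor` is likely the closer match to parfor, since threads share the interpreter lock.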

Pretty decent list of links. However, I feel the importance of SQL has been completely downplayed...there are more Hadoop-oriented links than SQL ones. Data retrieval and manipulation is where a data scientist will spend 95% of her time, and SQL is still more ubiquitous by far.

Dang, I wish I could find the link to this...an HP data scientist wrote a short essay (something like "Intro to Data Science") and said that the proper collection and cleaning of data is often seen as dirty grudge work that has to be done (by someone else, hopefully) before the real groundbreaking work can be done. However, the author said, this dirty grudge work is the real work.

When I think about it, in my data programming related work, I'd say about 5% is doing analysis or executing statistical routines. And 95% of my time is spent on finding, cleaning, and properly normalizing data. This applies whether you're a solo researcher or Facebook...think about it: Facebook is a pretty good website, but what it does better than just about anyone is being a platform to collect personal data in a way that...well, causes you to quite willingly give it your personal data.

There was a presentation where Peter Norvig pointed out a data routine that someone had implemented with a naive Bayes classifier, with a comment saying that they'd think of something better...and years later, no one had noticed it was still a TODO. Norvig said something like "You don't have to be very smart when you have a lot of data."

>Dang, I wish I could find the link to this...an HP data scientist wrote a short essay (something like "Intro to Data Science") and said that the proper collection and cleaning of data is often seen as dirty grudge work that has to be done (by someone else, hopefully) before the real groundbreaking work can be done. However, the author said, this dirty grudge work is the real work.

It's called data munging. Good short article on dataspora about it a while back:


I recently saw this slide deck [1] from guys on Twitter's data team where they say that most of the time the data mining process is basically:

1. Your boss says something vague

2. You think very hard on how to move the needle

3. Where’s the data?

4. What’s in this dataset?

5. What’s all the f#$#$ crap in the data?

6. Clean the data

7. Run some off-the-shelf data mining algorithm

8. ...

9. Productionize, act on the insight

10. Rinse, repeat

[1] http://www.slideshare.net/Hadoop_Summit/scaling-big-data-min...
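Steps 4 through 7 of that list can be sketched in a few lines of Python. This is a toy illustration, not anything from the slides: the records are made up, the helper names are hypothetical, and a per-city mean stands in for the "off-the-shelf data mining algorithm".

```python
from statistics import mean

# Made-up raw records with the usual problems: stray whitespace,
# inconsistent case, missing values, and numbers stored as text.
raw = [
    {"city": " San Francisco ", "visits": "12"},
    {"city": "san francisco", "visits": "8"},
    {"city": "NEW YORK", "visits": ""},
    {"city": "New York", "visits": "n/a"},
    {"city": "new york ", "visits": "20"},
]

def clean(records):
    """Normalize city names and drop rows whose visit count is not a number."""
    cleaned = []
    for row in records:
        city = row["city"].strip().title()
        visits = row["visits"].strip()
        if not visits.isdigit():
            continue  # steps 5 and 6: find the junk and throw it out
        cleaned.append({"city": city, "visits": int(visits)})
    return cleaned

def mean_visits_by_city(records):
    """Step 7 stand-in: a trivial aggregate instead of a real mining algorithm."""
    groups = {}
    for row in records:
        groups.setdefault(row["city"], []).append(row["visits"])
    return {city: mean(vals) for city, vals in groups.items()}

rows = clean(raw)
print(mean_visits_by_city(rows))
```

Even in this toy case, the cleaning function is longer than the analysis, which matches the proportions people are describing in this thread.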

Ryan, co-founder of Zipfian Academy here. Completely agree: data scientists can spend up to 90% of their time cleaning and getting their data in the proper format for analysis. This, plus the emergence of Hive and Pig as dominant higher-level abstractions on top of Hadoop, has made robust SQL skills more important than ever. We have an upcoming blog post specifically focusing on learning SQL and the differences between SQL/HiveQL.
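As a small illustration of the kind of SQL this refers to, here is a sketch that runs a grouped aggregate against an in-memory SQLite database from Python. The `events` table and its rows are made up for the example, and this is plain SQLite SQL; it does not attempt to show any SQL/HiveQL differences.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, action TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(1, "click"), (1, "view"), (2, "click"), (1, "click"), (3, "view")],
)

# Count events and distinct users per action, most frequent action first.
query = """
    SELECT action, COUNT(*) AS n, COUNT(DISTINCT user_id) AS users
    FROM events
    GROUP BY action
    ORDER BY n DESC
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
conn.close()
```

The `sqlite3` module ships with Python, so this is an easy way to practice retrieval and aggregation queries without setting up a database server.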

I'm surprised no one commented on the cost: $14,400!

Assuming they don't do job placement (I didn't see anything about that), this is a total rip-off and is just some people trying to cash in on the data science fad.

Besides, they just listed a bunch of free resources that invalidate the need to go through them, so unless they offer job placement in an actual data-science-like position, why waste your money on this?

Besides, 12 weeks reminds me of the "teach yourself programming in 21 days" books that Peter Norvig criticizes in "Teach Yourself Programming in Ten Years".


We apologize that it was not clearer on the site, but we do partner with many companies throughout the program who give guest lectures and provide perspectives from the industry. There is also a hiring day where we match candidates with prospective employers to assist with job placement, and we prepare our students extensively for interviews.

While we have shown there are many online resources available for understanding data science, we’ve found that structured, in-person programs provide the best environment for a collaborative learning experience.

Scholarships are also offered for particularly promising applicants and for students in need of financial aid. Please reach out to us directly if you would like to know more about our financial assistance options: hello@zipfianacademy.com

Happy to finally see an intro to data science article that puts statistics as the #1 skill. While software engineering is also important, all the engineering in the world won't help you extract any meaningful insight from data without a solid foundation in probability and statistical theory.

Too many links to click.

It is nice to present the wealth of resources that is available to anyone looking to build their skill set in this field.

However, as far as course intros go, it doesn't get me very excited, partly because of the way it is worded: Python as a language choice, using libraries that others have built. Would it not be better to teach the basics from the ground up, and then acknowledge there is a library that can handle that?

Another gripe I have is the page formatting: the single-column layout you have on that blog does not suit the length of the text. I am reading http://www.moserware.com/2010/03/computing-your-skill.html at the moment and, apart from the nice pictures to look at (who doesn't like shiny), the layout is much cleaner and the information is more readable.

Finally, as has been mentioned in other posts, you ought to have a short summary of what 'data science' is (in relation to other terms used to describe statistical analysis of datasets) and where it comes from.

I'd like to know more about the scholarship mentioned. Will it be able to cover the whole tuition fee? And who's sponsoring the scholarship?

It is an interesting idea, but the course is expensive.

I hope they will release the course in online form (similar to Coursera) and offer it for a reasonable price (max 100 USD per person).

There are plenty of people outside the USA willing to take this kind of course.

As a software engineer working in the biotech industry, I find your post very interesting and insightful. I am currently pursuing an MS in Bioinformatics and am also very interested in Big Data. I think $14,400 for a 12-week program is a great investment. However, the location is not ideal. Do you plan to expand your bootcamp to other cities besides SF?

Thanks for the kind words! Right now we are focusing only on our SF class but we may expand in the future. I would encourage you to sign up for our email list to stay up to date on any news about the program, and feel free to reach out with any questions or concerns (jonathan@zipfianacademy.com). Best of luck with your master's program and I hope you will keep in touch!

Useful link... but I think intro is a little ambitious based on all the content there.

Well we might as well split it into Big Data and "Little Data".

Little Data being:

Have a basic grasp of Python and JavaScript/D3.js for the pretty visualizations. That and basic statistics. The latter is probably the one developers (at least here) would spend most of their effort on.

"Little Data" in itself can take you a long way.

What is D3.js?

A JavaScript data visualization library (d3js.org is linked to in the post)

A really friendly place to start understanding it is Scott Murray's tutorial: http://alignedleft.com/tutorials/d3/

It's meant to get someone to see the value of their $15,000 6-week course. But this is a great collection. I had no idea how many courses were out there.

It's time for this term to die.

If it doesn't involve data, it's not science.

Yes, and this is the Science of analyzing Data. The term isn't trying to say there is an alternative science that is "non-data", or that people in other kinds of science are not involved with data - it's just specifying what kind of thing is receiving the scientific method: the data itself.

I agree. I absolutely love almost everything on this list but I only despise the term "data science" slightly less than "big data". Sadly what is often called "data science" is the rather narrow analysis of log files generated by websites.

Back in the good old days, Data Science used to be called Statistics.

I used to feel the same way, but (1) calling it statistics downplays the importance of data munging and grabbing/normalizing/managing data and (2) the advent of computational methods and off-the-shelf machine learning and natural language analysis has turned statistics into "one tool in our toolbox" rather than The Tool.

And frankly, even if data science is just statistics rebranded, anything that can get more people to take an interest in statistics is a good thing. If a hype-fueled new sticker is what it takes, then why not.

Statistics (noun) - The practice or science of collecting and analyzing numerical data in large quantities.


And I would argue that the qualification that the data be "numerical" in this definition is wrong. Plenty of statistical analysis involves categorical text data (e.g. sex, race, occupation...).

But this way, my $employer's Business School can charge 19 thousand for an MSc in Data Science and Management!

Quantitative business analyst just doesn't sound as wallet-opening.

Yeah, exactly - who does science without data? We've all been doing data science for years!

Yes and no - the component pieces of data science have been used for years, but as the OP describes, the field of data science assembles skills and tools that really haven't been universally understood and effectively applied to problem solving - it is a relatively fundamental shift to a new way of interacting with data.

Please. Statisticians have been doing this for over 300 years. Cooking up some infographic on your Hadoop cluster is not the paradigm shift you are making it out to be.

It sounds like we're not talking about the same thing.

No, we are. You are talking about gathering, modeling and summarizing data. To claim that those skills "really haven't been universally understood and effectively applied to problem solving" is ludicrous. What do you think scientists did prior to the advent of "data science"?

Every field of science is about gathering, modeling and summarizing data - that is not exclusive to statistics or any other field, which means we need to be more specific than that to determine whether Data Science is doing anything different than statistics.

Unfortunately data science is not well defined (in part because it is a young field), but it draws extensively from computer science and is targeted toward data sets that are different from those that have typically been used in 'traditional' statistics - huge, dirty data sets.

This debate has a similar flavor to physicists saying that chemistry is not a science because it's all just physics, and chemists saying that biology is not a science because it's all just chemistry. They are extensions, just like data science, and pretending that an expert statistician will be an expert data scientist is just not tenable.

There is room for something more.
