Hacker News
Introduction to Data Analysis in R (briatte.org)
139 points by danso on Dec 12, 2014 | 11 comments

As someone who's teaching data-analysis-oriented classes, I peruse a lot of open syllabi...and this was one of the most well-organized and easy to read that I've found. I'm strongly hesitant to teach R...preferring Python because it's closer to languages that I'm used to, and for its general-purpose utility...but reading through these lessons, it's hard not to be awed by what R can do, particularly in visualization.

This demonstration of using R to geocode (via the ggmap extension of ggplot2) was particularly cool (and also, as an example of the OP's organized notes, includes a copy of the data since the original link went dead): http://f.briatte.org/teaching/ida/101_geocoding.html

I would strongly recommend looking at R. I started out with Python and Pandas, but when I ran into issues with work requiring M$ Office documents, R just amazed me.

Also, R has seen amazing growth in just the last few years: http://www.tiobe.com/index.php/content/paperinfo/tpci/R.html (I know that a ranking is not the greatest argument for a language, BUT it does show, somewhat, its growth). Specifically, the flexibility of R (12 ways to do one thing) has allowed it to evolve quickly, and the libraries are just amazing. RStudio has changed R with Hadley Wickham's ggplot2, dplyr, reshape2, tidyr, etc. It just makes the language do so much and change so quickly.

I used to be in love with all things Python. I still respect Python and Pandas, but I have kind of moved on to more domain-specific tools.

I also highly recommend R.

Dplyr and ggplot2 (noted by baldfat) are exceptional.

I recently wrote a tutorial on dplyr here: http://www.sharpsightlabs.com/dplyr-intro-data-manipulation-...

To put this simply, dplyr's syntax is set up to create streamlined workflows. All of the major data management tasks (sort, subset, group, summarize) are easy to do. And they can be "chained" together (much like using pipes in Unix).
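For readers coming from Python, the chaining idea can be sketched without any libraries. This is only an analogy, not dplyr's API: the `pipe`, `subset`, `group_mean` and `sort_by` helpers and the toy sales data are hypothetical names made up for illustration.

```python
from collections import defaultdict
from functools import reduce
from statistics import mean

# Hypothetical toy data, invented for this sketch.
rows = [
    {"city": "Paris", "year": 2013, "sales": 10},
    {"city": "Paris", "year": 2014, "sales": 14},
    {"city": "Lyon", "year": 2013, "sales": 4},
    {"city": "Lyon", "year": 2014, "sales": 6},
    {"city": "Nice", "year": 2012, "sales": 9},
]


def pipe(data, *steps):
    """Feed data through each step in order, like `a | b | c` in a shell."""
    return reduce(lambda acc, step: step(acc), steps, data)


def subset(pred):
    """Keep only the rows for which pred(row) is true."""
    return lambda rs: [r for r in rs if pred(r)]


def group_mean(key, value):
    """Group rows by `key` and summarize `value` with its mean."""
    def step(rs):
        groups = defaultdict(list)
        for r in rs:
            groups[r[key]].append(r[value])
        return [{key: k, "mean_" + value: mean(v)} for k, v in groups.items()]
    return step


def sort_by(key, descending=False):
    """Sort rows by the given column."""
    return lambda rs: sorted(rs, key=lambda r: r[key], reverse=descending)


# Subset, then group and summarize, then sort: one readable chain.
result = pipe(
    rows,
    subset(lambda r: r["year"] >= 2013),
    group_mean("city", "sales"),
    sort_by("mean_sales", descending=True),
)
print(result)
```

Each step takes a table and returns a table, which is what lets the steps be reordered or extended without touching the others.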

ggplot2 (another R package) is an amazing data visualization tool. The syntax has a deep underlying structure, based on the Grammar of Graphics theoretical framework. I won’t go into that too much, but suffice it to say, when you learn the ggplot2 syntax, you’re actually learning how to think about data visualization in a very deep way. You’ll eventually understand how to create complex visualizations without much effort.

ggplot2 and dplyr are the reason I settled on R (instead of Python). When you use them together (again, using "chaining") you can explore your data rapidly and also create really high-quality analyses.

Do you have similar resources or references for Python that you could share, please?

I'm relatively new to Python so I don't have anything that I've closely studied...mostly things I've bookmarked that have been submitted to HN. The attractiveness of Python to me is the huge scientific programming community behind it, plus its human-friendly syntax...if R were more readable to me, I'd prefer to work with just one language.

So in terms of Python resources:

- The classic NLTK book: http://www.nltk.org/

- A Programmer's Guide to Data Mining: http://guidetodatamining.com/

- Hitchhiker's Guide to Python: http://docs.python-guide.org/en/latest/index.html

- Statistical Inference for Everyone: http://web.bryant.edu/~bblais/statistical-inference-for-ever...

- Frequentism and Bayesianism: A Python-driven Primer: http://arxiv.org/abs/1411.5018

- Probabilistic Programming and Bayesian Methods for Hackers: http://nbviewer.ipython.org/github/CamDavidsonPilon/Probabil...

- Software Carpentry's primer: http://software-carpentry.org/v5/novice/python/index.html

- And of course, LPTHW: http://learnpythonthehardway.org/

I also came across the Practical Python and OpenCV book and have found it really useful for implementing computer-vision exercises...it's not free, but the author has a blog where he regularly posts insightful examples: http://www.pyimagesearch.com/

(I haven't created any class-specific lessons but will definitely post them when I have them ready)

What do you think about the Data Science course on Coursera? I don't know which one to pursue. They also teach the use of the R language.

Honestly, half of what they teach is more like computer science and statistics than data analysis.

What I mean is that they have you spend weeks and months learning data types and 'for loops' when in reality, you don't need those to get insights from data. There are other toolsets (namely: dplyr and ggplot) that don't require you to know control flow, etc. If you're a developer, I'm sure that sounds strange, but believe me, you don't need software development style knowledge. What you need is to be able to do data manipulation and data exploration. You need to be able to turn data into insight.

Said differently, these courses are teaching statistics and CS in R. What they should be teaching is data manipulation and data visualization right from the start.

Thanks for the reply.

I'm not a developer, just a mechanical engineer having extreme difficulty landing a job. Do you believe that doing the course on data analysis will allow me to get into the financial industry? I really enjoy the concept of FOREX, stock trading and setting up algorithms for it, like what 'investment companies' or the 'hedge fund' guys do.

Can't see the future, but probably yes. You probably want to take a look at the several financial-statistics-oriented packages available in R.

This environment is... how to describe it? "often wild horse, sometimes donkey"

It can be very (very!) defiant sometimes and has some well-known issues with very big objects on small-memory machines (it links easily with databases, so this is a minor problem)... but it is also very rewarding.

I think that any hour you spend on R will be worth it. Just don't be a masochist: buy or borrow a couple of good introductory books, and read some manuals and internet links like the one above. When you have some practice and wish to improve, one of the best ways is to remember the repositories and install some packages of interest with

R> install.packages("packagename")

And take a look at as much of their R code as you can.

Highly recommended. Try it.

This link can give you a better picture of what type of things people are doing with R

Using R to simulate the finances of public sector pension funds:


Curiously, a job opening was announced (four months ago, now closed) at the end of the note at that same link. They were looking for an R expert.

I'm a little confused about what exactly this is. Take the time series section, for example. The entire section seems to only have examples. The only reading listed is the "R Cookbook," and it's also mainly examples.

From the syllabus:

> The aim of the course is to show how to perform elementary data analysis in the social sciences.

I feel like the time series section doesn't teach basic time series analysis at all. For example, they show plots of the ACF and PACF without going into detail about what those are and how they're different. I don't think that's helpful.
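To make the distinction concrete, here is a minimal Python sketch. It is not taken from the course: the function names `acf` and `pacf` and the simulated AR(1) series are assumptions for illustration. The ACF at lag k is the plain correlation between the series and a copy of itself shifted by k steps. The PACF at lag k is the correlation that remains once the shorter lags have been accounted for, computed here with the Durbin-Levinson recursion.

```python
import random


def acf(x, max_lag):
    """Sample autocorrelations r_0..r_max_lag (r_0 is always 1)."""
    n = len(x)
    m = sum(x) / n
    dev = [v - m for v in x]
    denom = sum(d * d for d in dev)
    return [
        sum(dev[t] * dev[t - k] for t in range(k, n)) / denom
        for k in range(max_lag + 1)
    ]


def pacf(x, max_lag):
    """Sample partial autocorrelations via the Durbin-Levinson recursion."""
    r = acf(x, max_lag)
    out = [1.0]
    phi_prev = []  # phi_prev[j] holds the order-(k-1) coefficient for lag j+1
    for k in range(1, max_lag + 1):
        num = r[k] - sum(phi_prev[j] * r[k - 1 - j] for j in range(k - 1))
        den = 1.0 - sum(phi_prev[j] * r[j + 1] for j in range(k - 1))
        phi_kk = num / den
        # Update the lower-lag coefficients, then append the new last one.
        phi_prev = [
            phi_prev[j] - phi_kk * phi_prev[k - 2 - j] for j in range(k - 1)
        ] + [phi_kk]
        out.append(phi_kk)
    return out


# Simulated AR(1) series x_t = 0.7 * x_{t-1} + noise (illustrative only).
random.seed(0)
series = [0.0]
for _ in range(4999):
    series.append(0.7 * series[-1] + random.gauss(0, 1))

for lag, (a, p) in enumerate(zip(acf(series, 4), pacf(series, 4))):
    print(f"lag {lag}: acf={a:+.2f}  pacf={p:+.2f}")
```

For an AR(1) process like this one, the ACF should decay gradually across lags, while the PACF should drop to roughly zero after lag 1. That contrast is what the two plots are usually read for.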

This looks like a nice set of examples, but far from an actual course!
